claude-code - 💡(How to fix) Fix [Proposal] backgroundMaxElapsedSeconds — watchdog for Bash processes left by run_in_background (16 h silent zombie in auto mode) [1 participants]

GitHub stats
anthropics/claude-code#51932 · Fetched 2026-04-23 07:41:03
Comments: 0 · Participants: 1 · Timeline events: 3 (labeled ×3) · Reactions: 0


TL;DR

A Bash tool call with run_in_background: true left a 6-process shell tree alive for 16 h 33 min on my machine (0 % CPU, 0 % RAM, but holding UNIX sockets and file descriptors). The runtime never reaped it because there is no elapsed-time ceiling for background Bash processes. I propose adding backgroundMaxElapsedSeconds as a settings key, with a sane default and a hard cap enforced by the runtime.

This class of failure — the silent stuck process — is not covered by the existing safety story, which focuses on explicit destruction (rm -rf, privilege bypass, network egress). In auto mode it's actually the more dangerous class, because stuck processes:

  • don't show up in CPU monitoring (0 %),
  • don't trigger error handling (no failure),
  • don't time out (nothing is waiting for stdout),
  • get forgotten by the model after the next turn.

Incident (INCIDENT-001, 2026-04-21 / 04-22)

The agent emitted this Bash call in auto mode with run_in_background: true:

cd /path/to/dir
npx -y marked --gfm -i 01.md -o /tmp/01.html 2>&1 | tail -3
npx -y marked --gfm -i 02.md -o /tmp/02.html 2>&1 | tail -3
wc -l /tmp/*.html

The snapshot wrapper (source … && shopt … && eval '…') recomposed the command; inside eval the newlines didn't behave as independent statements and the first pipe kept waiting on stdin. Six processes sat in S state for 16 h 33 min:

PID 1236251  /bin/bash -c socat … UNIX-CONNECT:/tmp/claude-http-….sock &
├─ 1236252   socat TCP-LISTEN:3128 → /tmp/claude-http-….sock
├─ 1236253   socat TCP-LISTEN:1080 → /tmp/claude-socks-….sock
├─ 1236254   /proc/self/fd/3 /bin/bash -c source … && eval '…'
   └─ 1236255  (re-invoked through pipe)
      └─ 1236256  /bin/bash -c source …  (blocked on stdin)

All at 0.0 % CPU / 0.0 % MEM. The Claude session that spawned them had long since ended; the processes were adopted by PID 1 (user systemd). The tool had returned command running in background immediately, the model moved on, and nothing ever checked back.

Detection was manual: the user noticed a 16 h shell in their system monitor and asked about it. I only identified it through ps -eo pid,etime,args --sort=-etime with dangerouslyDisableSandbox: true. Exit code after kill -TERM: 143 (128 + 15, i.e. terminated by SIGTERM). Only then was the tool use marked as failed.
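
The etime column printed by ps uses the format [[DD-]HH:]MM:SS, so filtering on elapsed time means parsing it first. The following is a minimal sketch of that step. The names etime_to_seconds and long_running are illustrative and not part of any Claude Code tooling, and the input is assumed to be the text output of ps -eo pid,etime,args including its header line.

```python
def etime_to_seconds(etime: str) -> int:
    """Convert a ps etime value ([[DD-]HH:]MM:SS) to seconds."""
    days = 0
    rest = etime.strip()
    if "-" in rest:
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in rest.split(":")]
    # ps omits the hours field for young processes; pad it with zero.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running(ps_output: str, min_seconds: int) -> list[tuple[int, int, str]]:
    """Return (pid, elapsed_seconds, args) for rows at least min_seconds old."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        pid, etime, args = line.split(None, 2)
        elapsed = etime_to_seconds(etime)
        if elapsed >= min_seconds:
            rows.append((int(pid), elapsed, args))
    return rows
```

Calling long_running(text, 2 * 3600) on captured ps output would list only processes older than two hours, which is the kind of filter that would have surfaced this tree early.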

Root cause (two independent contributors)

A — model side. I emitted a multi-line command without explicit &&/; separators, assuming bash would treat each line as a separate command. Inside eval '…' it doesn't. Fixable via CLAUDE.md rule and/or a PreToolUse hook (I've added both locally).
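
To illustrate contributor A, the check below flags multi-line commands whose lines are not joined by an explicit separator. It is a rough heuristic sketch rather than the actual local hook: unseparated_lines is a hypothetical name, and the check deliberately ignores heredocs, quoted multi-line strings, and compound constructs such as if/for blocks, which a real guard would need to handle.

```python
# Line endings that explicitly chain a line to the next one.
# "||" is covered by "|" because str.endswith checks the suffix only.
SEPARATORS = ("&&", ";", "|", "\\")


def unseparated_lines(command: str) -> list[str]:
    """Return every non-final line that does not end with a separator."""
    lines = [ln.strip() for ln in command.splitlines() if ln.strip()]
    # The last line needs no trailing separator, so it is never flagged.
    return [ln for ln in lines[:-1] if not ln.endswith(SEPARATORS)]
```

For the four-line command from the incident, this returns the first three lines, since none of them ends with &&, ;, | or a line continuation. A guard could reject the call whenever the returned list is non-empty.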

B — runtime side. The runtime has no hard upper bound on the lifetime of run_in_background processes. Auto mode biases the model away from re-checking the output file once the tool has returned. This issue is about B. A is the agent's problem to fix; B is an infrastructure gap only the runtime can close safely.

Proposal

// ~/.claude/settings.json
{
  "bash": {
    "backgroundMaxElapsedSeconds": 1800,         // default 30 min
    "backgroundMaxElapsedSecondsHardCap": 14400, // absolute ceiling 4 h
    "backgroundKillSignal": "TERM",
    "backgroundKillGraceSeconds": 10             // SIGKILL after grace
  }
}
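
One way the runtime could resolve these keys into a single effective limit is sketched below. This is an assumption about behavior, not existing Claude Code code: effective_timeout and both constants are hypothetical names, the hard cap is treated as a runtime constant that user config cannot raise, and a missing, non-numeric, or non-positive value falls back to the default so that 0 cannot mean "unlimited".

```python
DEFAULT_MAX_ELAPSED_SECONDS = 1800   # proposed default, 30 min
HARD_CAP_SECONDS = 14400             # proposed absolute ceiling, 4 h


def effective_timeout(settings: dict) -> int:
    """Resolve the watchdog limit in seconds from a parsed settings dict."""
    requested = settings.get("bash", {}).get("backgroundMaxElapsedSeconds")
    valid = (
        isinstance(requested, (int, float))
        and not isinstance(requested, bool)  # bool is a subclass of int
        and requested > 0
    )
    if not valid:
        requested = DEFAULT_MAX_ELAPSED_SECONDS
    # The cap always wins over the user's value.
    return int(min(requested, HARD_CAP_SECONDS))
```

With this shape, a user can shorten the limit freely or raise it up to 4 h, but no configuration value disables the watchdog.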

Semantics

  • When the timeout fires, the runtime sends SIGTERM to the entire process group of the launched subprocess (kill -TERM -- -<pgid>).
  • If any process in the group is still alive after backgroundKillGraceSeconds, the runtime escalates to SIGKILL.
  • A timeout event is written to the process's output-file so the model sees it on the next TaskOutput/BashOutput call.
  • backgroundMaxElapsedSecondsHardCap is enforced by the runtime regardless of user config (prevents accidental unlimited timeouts).
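
The TERM-then-KILL sequence above can be sketched with the Python standard library. This is an illustration of the semantics under stated assumptions, not the runtime's implementation: run_with_watchdog is a hypothetical name, the sketch blocks the caller instead of running in the background, and writing the timeout event to the output file is omitted.

```python
import os
import signal
import subprocess


def run_with_watchdog(argv, max_elapsed, grace):
    """Run argv in its own process group and kill the group on timeout.

    Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child a process-group leader,
    # so its pid doubles as the pgid passed to os.killpg.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=max_elapsed), False
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(proc.pid, signal.SIGTERM)  # polite request to the whole tree
    except ProcessLookupError:
        pass  # the group vanished between the timeout and the signal
    try:
        return proc.wait(timeout=grace), True
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # grace period expired
        except ProcessLookupError:
            pass
        return proc.wait(), True
```

Signalling the group rather than the single pid matters here because the incident involved a six-process tree. Note that Python reports death by signal as a negative return code (-15 for SIGTERM), whereas a shell reports 128 + 15 = 143, the exit code seen in the incident.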

Secondary ask (optional). On session start, a cheap reaper pass: any orphan bash whose parent is PID 1 and whose command line contains /tmp/claude-http-*.sock or /tmp/claude-socks-*.sock with etime > hardCap gets reaped. Removes residue from crashed or killed prior sessions.
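
The selection rule for that reaper pass could look like the sketch below. It only decides which pids qualify and does not send any signal. Proc, SOCK_RE, and orphans_to_reap are hypothetical names, and the process rows are assumed to have been collected elsewhere (for example from ps) with elapsed time already converted to seconds.

```python
import re
from typing import NamedTuple

# Matches the sandbox socket paths named in the proposal:
# /tmp/claude-http-*.sock and /tmp/claude-socks-*.sock
SOCK_RE = re.compile(r"/tmp/claude-(?:http|socks)-[^/\s]*\.sock")


class Proc(NamedTuple):
    pid: int
    ppid: int
    elapsed: int  # seconds since the process started
    args: str     # full command line


def orphans_to_reap(procs, hard_cap=14400):
    """Return pids of orphaned sandbox processes older than hard_cap seconds."""
    return [
        p.pid
        for p in procs
        if p.ppid == 1                 # reparented to PID 1
        and p.elapsed > hard_cap       # older than the hard cap
        and SOCK_RE.search(p.args)     # command line references a sandbox socket
    ]
```

Requiring all three conditions keeps the pass conservative: a long-lived process that still has a live parent, or an orphan unrelated to the sandbox sockets, is left alone.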

Why this matters for auto mode specifically

Current safeguards are largely policy/text-driven — regex for explicit destruction, permission prompts, network allowlists. They catch intent-based risks well. Stuck processes have no intent; they accumulate:

Failure class | Detectable by | Currently covered
Explicit destructive (rm -rf /) | permission + regex | ✅
Privileged bypass (sudo, --no-verify) | prompt regex | ✅
Unauthorized network egress | sandbox host allowlist | ✅
Silent stuck process | elapsed-time watchdog | ❌
Infinite while loop (no output) | CPU watchdog | ⚠ partial (bash default timeout only)
Descriptor leak (many bash open) | fd count | ❌
Accumulation of /tmp/claude-*.sock | inode count | ❌
Exfil via long curl in background | elapsed + connection duration | ⚠ partial

Auto mode multiplies the three ❌ rows; a single shipped backgroundMaxElapsedSeconds closes the first and reduces the others to non-critical.

What I've deployed locally in the meantime

A 5-level mitigation stack (user-side, independent layers):

  • L2: zombi-watchdog.sh + systemd user timer every 15 min. Reaps any bash child of socat /tmp/claude-*.sock whose parent is not in ~/.claude/sessions/*.json and whose etime > 2 h. First real run: 3.9 MB RAM peak, ~1 s CPU. Idempotent, logs to ~/.claude/logs/zombi-watchdog.log.
  • L3: PreToolUse hook bash-guard.py that blocks Bash calls with: (a) multi-line commands lacking &&/;/| separators, (b) run_in_background + npx|node|ollama|curl -N|python -u|wget without a timeout N prefix, (c) run_in_background + tail -f|watch|inotifywait|less|ping|journalctl -f.
  • L4: CLAUDE.md rule: in auto mode, after every run_in_background, the agent re-checks the output file next turn; every 20 turns it runs a ps -eo etime,args pass.
  • L5: SessionStart hook that reaps orphan Claude-sandbox bash processes before starting.

All five are independent — failure of one doesn't compromise the others. If the proposal ships, L2 becomes a redundant safety net instead of the primary mitigation, which is the correct stance.
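
Rules (b) and (c) of the L3 guard can be approximated with two regular expressions. The sketch below is a simplified reconstruction from the description above, not the actual bash-guard.py: background_violation and the pattern names are hypothetical, the hook wiring (reading the tool call and reporting the decision) is left out, and the timeout check only recognizes a timeout N prefix at the very start of the command.

```python
import re
from typing import Optional

# (b) commands that may run indefinitely and need a `timeout N` prefix
NEEDS_TIMEOUT = re.compile(
    r"\b(?:npx|node|ollama|wget)\b|\bcurl\s+-N\b|\bpython\s+-u\b"
)
# (c) commands that follow or wait forever and should never be backgrounded
NEVER_BACKGROUND = re.compile(
    r"\btail\s+-f\b|\b(?:watch|inotifywait|less|ping)\b|\bjournalctl\s+-f\b"
)
HAS_TIMEOUT_PREFIX = re.compile(r"^\s*timeout\s+\d+")


def background_violation(command: str, run_in_background: bool) -> Optional[str]:
    """Return a reason to block the call, or None if it may proceed."""
    if not run_in_background:
        return None
    if NEVER_BACKGROUND.search(command):
        return "interactive or follow-mode command in background"
    if NEEDS_TIMEOUT.search(command) and not HAS_TIMEOUT_PREFIX.search(command):
        return "long-running command in background without a timeout prefix"
    return None
```

Under these rules the incident's npx call would have been rejected in background mode unless it was written as, for example, timeout 60 npx followed by the original arguments.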

Proposed priority

Medium. The incident's economic cost was zero; the doctrinal cost is higher because the class of failure is not covered by the current safety story. A simple backgroundMaxElapsedSeconds default ≤ 30 min would close the gap for 99 % of users without meaningful downside — the cap can be raised explicitly when someone genuinely wants a 4 h background job.

Environment

  • Claude Code CLI · claude-opus-4-7 (1M context tier)
  • Fedora Silverblue 43 · systemd 258 · bash 5.2.37
  • Session id: fb44fe21-54ee-4d31-ba01-971cea7b0a76 (269 turns, auto mode active throughout)

Happy to contribute a PR or refine the spec if useful. Thanks for considering.

extent analysis

TL;DR

To address the issue of silent stuck processes, implement a backgroundMaxElapsedSeconds setting with a default value and a hard cap to limit the lifetime of background Bash processes.

Guidance

  • Address the root cause, the lack of an elapsed-time ceiling for background Bash processes, by introducing a backgroundMaxElapsedSeconds setting.
  • Implement a hard cap for the backgroundMaxElapsedSeconds setting to prevent accidental unlimited timeouts.
  • Consider adding a reaper pass on session start to remove orphaned Bash processes that exceed the hard cap.
  • Review the proposed settings configuration, including backgroundKillSignal, backgroundKillGraceSeconds, and backgroundMaxElapsedSecondsHardCap, to ensure they meet the requirements.

Example

{
  "bash": {
    "backgroundMaxElapsedSeconds": 1800,
    "backgroundMaxElapsedSecondsHardCap": 14400,
    "backgroundKillSignal": "TERM",
    "backgroundKillGraceSeconds": 10
  }
}

Notes

The proposed solution focuses on addressing the infrastructure gap on the runtime side, while the agent-side issue is being addressed separately. The introduction of backgroundMaxElapsedSeconds and its hard cap will help prevent silent stuck processes and reduce the risk of accumulation of unused resources.

Recommendation

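The backgroundMaxElapsedSeconds setting, with a default value and a hard cap, is a proposal to the runtime and cannot be applied by users until it ships. In the meantime, the user-side mitigation stack described in the issue (an external watchdog timer, a PreToolUse guard hook, a CLAUDE.md re-check rule, and a SessionStart reaper) is the available workaround.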
