claude-code - 💡(How to fix) Fix [Proposal] backgroundMaxElapsedSeconds — watchdog for Bash processes left by run_in_background (16 h silent zombie in auto mode) [1 participants]

claude-code · 2026-04-22 11:19:06


GitHub stats

anthropics/claude-code#51932•Fetched 2026-04-23 07:41:03


Author

Alexendros

Participants

Alexendros

Timeline (top)

labeled ×3

Root Cause

A Bash tool call with run_in_background: true left a 6-process shell tree alive for 16 h 33 min on my machine (0 % CPU, 0 % RAM, but holding UNIX sockets and file descriptors). The runtime never reaped it because there is no elapsed-time ceiling for background Bash processes. I propose adding backgroundMaxElapsedSeconds as a settings key, with a sane default and a hard cap enforced by the runtime.



TL;DR

This class of failure — the silent stuck process — is not covered by the existing safety story, which focuses on explicit destruction (rm -rf, privilege bypass, network egress). In auto mode it's actually the more dangerous class, because stuck processes:

don't show up in CPU monitoring (0 %),
don't trigger error handling (no failure),
don't time out (nothing is waiting for stdout),
get forgotten by the model after the next turn.

Incident (INCIDENT-001, 2026-04-21 / 04-22)

The agent emitted this Bash call in auto mode with run_in_background: true:

cd /path/to/dir
npx -y marked --gfm -i 01.md -o /tmp/01.html 2>&1 | tail -3
npx -y marked --gfm -i 02.md -o /tmp/02.html 2>&1 | tail -3
wc -l /tmp/*.html

The snapshot wrapper (source … && shopt … && eval '…') recomposed the command; inside eval the newlines didn't behave as independent statements and the first pipe kept waiting on stdin. Six processes sat in S state for 16 h 33 min:

PID 1236251  /bin/bash -c socat … UNIX-CONNECT:/tmp/claude-http-….sock &
├─ 1236252   socat TCP-LISTEN:3128 → /tmp/claude-http-….sock
├─ 1236253   socat TCP-LISTEN:1080 → /tmp/claude-socks-….sock
├─ 1236254   /proc/self/fd/3 /bin/bash -c source … && eval '…'
   └─ 1236255  (re-invoked through pipe)
      └─ 1236256  /bin/bash -c source …  (blocked on stdin)

All at 0.0 % CPU / 0.0 % MEM. The Claude session that spawned them had long since ended; the processes were adopted by PID 1 (user systemd). The tool had returned "command running in background" immediately, the model moved on, and nothing ever checked back.

Detection was manual: the user noticed a 16 h shell in their system monitor and asked about it. I only identified it through ps -eo pid,etime,args --sort=-etime with dangerouslyDisableSandbox: true. Exit code after kill -TERM: 143. Only then was the tool use finally marked as failed.
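The manual detection step can be sketched in a few lines. The snippet below parses `ps -eo pid,etime,args` style output and flags long-lived processes; it relies on the documented `etime` format `[[DD-]hh:]mm:ss`. The function names, the sample text, and the 2 h threshold are illustrative assumptions, not part of Claude Code.

```python
import re

# ps prints elapsed time as [[DD-]hh:]mm:ss
_ETIME = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def etime_to_seconds(etime):
    """Convert a ps 'etime' field to whole seconds."""
    m = _ETIME.match(etime.strip())
    if not m:
        raise ValueError("unrecognised etime: %r" % etime)
    days, hours, minutes, seconds = (int(g) if g else 0 for g in m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running(ps_output, min_seconds):
    """Return (pid, elapsed_seconds, args) for rows at or above min_seconds."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, etime, args = parts
        elapsed = etime_to_seconds(etime)
        if elapsed >= min_seconds:
            hits.append((int(pid), elapsed, args))
    return hits


# Hypothetical sample, shaped like `ps -eo pid,etime,args` output.
SAMPLE = """\
    PID     ELAPSED COMMAND
1236251    16:33:07 /bin/bash -c sleep 100000
   4242       02:15 vim notes.md
"""

if __name__ == "__main__":
    for pid, elapsed, args in long_running(SAMPLE, min_seconds=2 * 3600):
        print(pid, elapsed, args)
```

In practice the text would come from running `ps` itself, which is the step that needed the sandbox disabled in this incident.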

Root cause (two independent contributors)

A — model side. I emitted a multi-line command without explicit &&/; separators, assuming bash would treat each line as a separate command. Inside eval '…' it doesn't. Fixable via CLAUDE.md rule and/or a PreToolUse hook (I've added both locally).

B — runtime side. The runtime has no hard upper bound on the lifetime of run_in_background processes. Auto mode biases the model away from re-checking the output file once the tool has returned. This issue is about B. A is the agent's problem to fix; B is an infrastructure gap only the runtime can close safely.

Proposal

// ~/.claude/settings.json
{
  "bash": {
    "backgroundMaxElapsedSeconds": 1800,         // default 30 min
    "backgroundMaxElapsedSecondsHardCap": 14400, // absolute ceiling 4 h
    "backgroundKillSignal": "TERM",
    "backgroundKillGraceSeconds": 10             // SIGKILL after grace
  }
}
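One way the two timeout keys might interact is sketched below. It assumes the hard cap is a runtime constant that user configuration cannot raise, and that a missing or nonsensical value falls back to the default; both are interpretations of the proposal, and `effective_timeout` is a hypothetical helper, not an existing API.

```python
DEFAULT_MAX_ELAPSED = 1800   # proposed default: 30 min
RUNTIME_HARD_CAP = 14400     # proposed absolute ceiling: 4 h (assumed non-overridable)


def effective_timeout(settings):
    """Resolve the timeout the runtime would apply to a background Bash process."""
    bash = settings.get("bash", {})
    requested = bash.get("backgroundMaxElapsedSeconds", DEFAULT_MAX_ELAPSED)
    valid = (
        isinstance(requested, (int, float))
        and not isinstance(requested, bool)
        and requested > 0
    )
    if not valid:
        # Treat 0, negatives and non-numbers as "unset" so nobody can
        # configure an unlimited timeout by accident.
        requested = DEFAULT_MAX_ELAPSED
    return int(min(requested, RUNTIME_HARD_CAP))


if __name__ == "__main__":
    print(effective_timeout({"bash": {"backgroundMaxElapsedSeconds": 999999}}))
```

Clamping rather than rejecting an oversized value keeps a misconfigured settings file from disabling the watchdog entirely.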

Semantics

When the timeout fires, the runtime sends SIGTERM to the entire process group of the launched subprocess (kill -TERM -- -<pgid>).
If the group is still alive after backgroundKillGraceSeconds, the runtime escalates to SIGKILL.
A timeout event is written to the process's output-file so the model sees it on the next TaskOutput/BashOutput call.
backgroundMaxElapsedSecondsHardCap is enforced by the runtime regardless of user config (prevents accidental unlimited timeouts).
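The TERM-then-KILL escalation described above could look roughly like this on a POSIX system. It is a blocking sketch for illustration only: a real runtime would track the deadline asynchronously and also write the timeout event to the output file. `run_with_watchdog` and its parameters are hypothetical names.

```python
import os
import signal
import subprocess


def run_with_watchdog(cmd, max_elapsed, grace):
    """Run cmd in its own process group; return (returncode, timed_out)."""
    # start_new_session=True makes the child a session and group leader,
    # so signalling its group reaches the whole subtree it spawns.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=max_elapsed), False
    except subprocess.TimeoutExpired:
        pass

    try:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)      # polite request first
    except ProcessLookupError:
        return proc.wait(), True             # group vanished in the meantime

    try:
        return proc.wait(timeout=grace), True
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)      # grace period expired
        return proc.wait(), True


if __name__ == "__main__":
    print(run_with_watchdog(["sleep", "30"], max_elapsed=0.5, grace=2))
```

Python reports death by signal as a negative return code, so a process ended by SIGTERM shows up as -15 here, which corresponds to the exit code 143 (128 + 15) that a shell reports.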

Secondary ask (optional). On session start, a cheap reaper pass: any orphan bash whose parent is PID 1 and whose command line contains /tmp/claude-http-*.sock or /tmp/claude-socks-*.sock with etime > hardCap gets reaped. Removes residue from crashed or killed prior sessions.
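The reaper's selection rule can be expressed as a pure predicate, which keeps it easy to test before anything is killed. This is a sketch under the conditions stated in the secondary ask; `is_reapable` is a hypothetical name, and the socket file name in the usage example is made up.

```python
import fnmatch
import os

SOCKET_GLOBS = ("/tmp/claude-http-*.sock", "/tmp/claude-socks-*.sock")


def is_reapable(ppid, args, elapsed_seconds, hard_cap=14400):
    """True for an orphaned bash tied to a Claude socket that outlived hard_cap."""
    tokens = args.split()
    if not tokens or os.path.basename(tokens[0]) != "bash":
        return False                      # only bash processes are candidates
    if ppid != 1:
        return False                      # still has a live parent, not an orphan
    if elapsed_seconds <= hard_cap:
        return False                      # young enough to be legitimate
    # The command line must mention one of the sandbox proxy sockets.
    return any(fnmatch.fnmatchcase(args, "*" + g + "*") for g in SOCKET_GLOBS)


if __name__ == "__main__":
    # Hypothetical command line; the real socket names are elided in the report.
    cmdline = "/bin/bash -c socat TCP-LISTEN:3128 UNIX-CONNECT:/tmp/claude-http-abc123.sock &"
    print(is_reapable(1, cmdline, 59580))
```

Requiring all three conditions together (orphaned, socket-bound, older than the hard cap) keeps the pass from touching unrelated long-running shells.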

Why this matters for auto mode specifically

Current safeguards are largely policy/text-driven — regex for explicit destruction, permission prompts, network allowlists. They catch intent-based risks well. Stuck processes have no intent; they accumulate:

| Failure class | Detectable by | Currently covered |
| --- | --- | --- |
| Explicit destructive (`rm -rf /`) | permission + regex | ✅ |
| Privileged bypass (`sudo`, `--no-verify`) | prompt regex | ✅ |
| Unauthorized network egress | sandbox host allowlist | ✅ |
| Silent stuck process | elapsed-time watchdog | ❌ |
| Infinite `while` loop (no output) | CPU watchdog | ⚠ partial (bash default timeout only) |
| Descriptor leak (many `bash` open) | fd count | ❌ |
| Accumulation of `/tmp/claude-*.sock` | inode count | ❌ |
| Exfil via long `curl` in background | elapsed + connection duration | ⚠ partial |

Auto mode multiplies the three ❌ rows; a single shipped backgroundMaxElapsedSeconds closes the first and reduces the others to non-critical.

What I've deployed locally in the meantime

A 5-level mitigation stack (user-side, independent layers):

L2 — zombi-watchdog.sh + systemd user timer every 15 min. Reaps any bash child of socat /tmp/claude-*.sock whose parent is not in ~/.claude/sessions/*.json and whose etime > 2 h. First real run: 3.9 MB RAM peak, ~1 s CPU. Idempotent, logs to ~/.claude/logs/zombi-watchdog.log.
L3 — PreToolUse hook bash-guard.py that blocks Bash calls with: (a) multi-line commands lacking &&/;/| separators, (b) run_in_background + npx|node|ollama|curl -N|python -u|wget without a timeout N prefix, (c) run_in_background + tail -f|watch|inotifywait|less|ping|journalctl -f.
L4 — CLAUDE.md rule: in auto mode, after every run_in_background, the agent re-checks the output file next turn; every 20 turns runs a ps -eo etime,args pass.
L5 — SessionStart hook that reaps orphan Claude-sandbox bash processes before starting.

All five are independent — failure of one doesn't compromise the others. If the proposal ships, L2 becomes a redundant safety net instead of the primary mitigation, which is the correct stance.
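The three checks attributed to bash-guard.py could be approximated as below. The actual script is not shown in this issue, so this is a naive reconstruction from the L3 description: the regexes ignore quoting and heredocs, `violations` and its labels are hypothetical, and wiring the function into a PreToolUse hook is left out.

```python
import re

# A line "ends with a separator" if it finishes with &&, ||, ;, | or a backslash.
SEPARATOR_END = re.compile(r"(&&|\|\||;|\||\\)\s*$")
# Commands that should not run in the background without a timeout prefix.
RISKY = re.compile(r"(?<![\w-])(npx|node|ollama|wget|curl\s+-N|python\s+-u)\b")
TIMEOUT_PREFIX = re.compile(r"\btimeout\s+\S+\s+$")
# Commands that never exit on their own.
NEVER_EXITS = re.compile(r"\b(tail\s+-f|watch|inotifywait|less|ping|journalctl\s+-f)\b")


def violations(command, run_in_background):
    """Return a list of rule labels the command breaks (empty list = allowed)."""
    problems = []
    lines = [ln for ln in command.splitlines() if ln.strip()]
    # (a) multi-line command whose lines are not explicitly chained
    if any(not SEPARATOR_END.search(ln) for ln in lines[:-1]):
        problems.append("unseparated-lines")
    if run_in_background:
        # (b) risky command not immediately preceded by `timeout N`
        for m in RISKY.finditer(command):
            if not TIMEOUT_PREFIX.search(command[: m.start()]):
                problems.append("background-without-timeout")
                break
        # (c) command that blocks forever by design
        if NEVER_EXITS.search(command):
            problems.append("background-never-exits")
    return problems


INCIDENT = (
    "cd /path/to/dir\n"
    "npx -y marked --gfm -i 01.md -o /tmp/01.html 2>&1 | tail -3\n"
    "npx -y marked --gfm -i 02.md -o /tmp/02.html 2>&1 | tail -3\n"
    "wc -l /tmp/*.html"
)

if __name__ == "__main__":
    print(violations(INCIDENT, run_in_background=True))
```

Run against the command from the incident, the sketch flags both the missing separators and the unbounded background `npx`, which are the two conditions that combined to produce the stuck process tree.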

Proposed priority

Medium. The incident's economic cost was zero; the doctrinal cost is higher because the class of failure is not covered by the current safety story. A simple backgroundMaxElapsedSeconds default ≤ 30 min would close the gap for 99 % of users without meaningful downside — the cap can be raised explicitly when someone genuinely wants a 4 h background job.

Environment

Claude Code CLI · claude-opus-4-7 (1M context tier)
Fedora Silverblue 43 · systemd 258 · bash 5.2.37
Session id: fb44fe21-54ee-4d31-ba01-971cea7b0a76 (269 turns, auto mode active throughout)

Happy to contribute a PR or refine the spec if useful. Thanks for considering.

extent analysis

TL;DR

To address the issue of silent stuck processes, implement a backgroundMaxElapsedSeconds setting with a default value and a hard cap to limit the lifetime of background Bash processes.

Guidance

Identify the root cause of the issue, which is the lack of an elapsed-time ceiling for background Bash processes, and address it by introducing a backgroundMaxElapsedSeconds setting.
Implement a hard cap for the backgroundMaxElapsedSeconds setting to prevent accidental unlimited timeouts.
Consider adding a reaper pass on session start to remove orphaned Bash processes that exceed the hard cap.
Review the proposed settings configuration, including backgroundKillSignal, backgroundKillGraceSeconds, and backgroundMaxElapsedSecondsHardCap, to ensure they meet the requirements.

Example

{
  "bash": {
    "backgroundMaxElapsedSeconds": 1800,
    "backgroundMaxElapsedSecondsHardCap": 14400,
    "backgroundKillSignal": "TERM",
    "backgroundKillGraceSeconds": 10
  }
}

Notes

The proposed solution focuses on addressing the infrastructure gap on the runtime side, while the agent-side issue is being addressed separately. The introduction of backgroundMaxElapsedSeconds and its hard cap will help prevent silent stuck processes and reduce the risk of accumulation of unused resources.

Recommendation

Adopt the proposed backgroundMaxElapsedSeconds setting, with a default value and a runtime-enforced hard cap, as the primary fix. It is a runtime change rather than a user-side workaround, so until it ships the user-side layers listed under "What I've deployed locally in the meantime" remain the only available mitigation.
