claude-code - Fix claude -p silent-freeze when spawned from a long-running orchestrator (no stdout, deterministic 100% from direct spawn, probabilistic with bash wrap)

anthropics/claude-code#56268 (fetched 2026-05-06 06:32:40)

Filed: 2026-05-05 (UTC)
Operator: @bearblocks, Claude Max plan
Affected workflow: autonomous code-implementation routines (claude -p headless, GitHub-event triggered, Docker containerised on a self-hosted VPS)
Severity: blocking. The implementer tier of the autonomy stack is unusable in production; the diagnosis tier (scheduler) and the audit/reaper tiers work fine.

This report consolidates eight hours of empirical investigation. All artefacts (logs, run IDs, code paths) are real; the system is live and reproducible.


TL;DR

claude -p invoked headlessly from a long-running Node.js webhook server never emits a first byte of stdout within any reasonable timeout (tested up to 600 seconds), even though the same prompt + same env + same allowedTools list completes in 50–240s when invoked from a short-lived shell with a process tree of bash → node → claude.

Wrapping the spawn in bash -c "exec claude ..." (so the immediate parent of claude is a transient bash) makes claude emit output sometimes — but the long-tail of "no first output" persists in a non-trivial fraction of runs. The freeze is bimodal: either claude completes within ~200s, or it never produces a byte before the watchdog fires.

The infrastructure is clean (host idle, 0 cgroup throttling, RTT to api.anthropic.com 13ms). The bug is somewhere in the claude CLI, the Anthropic API client used by it, or the API itself.


Environment

  • Host: self-hosted VPS, 4 vCPU, 7.8 GiB RAM, Ubuntu (kernel ≥ 4.20, PSI available), uptime 70+ days.
  • Container: node:20-bookworm-slim, claude --version reports 2.1.126 (Claude Code). Credentials present in ~/.claude/.credentials.json. Linked to a Claude Max plan via the standard claude /login flow.
  • Spawn command for claude: claude -p --permission-mode acceptEdits --allowedTools <comma-list> with stdin piped (the routine prompt is written via child.stdin.write(prompt); child.stdin.end()).
  • Routine prompt body: ~28 KB markdown describing a multi-step autonomous task (read GitHub issue, fetch a strategy ledger, set up a git worktree, write tests, implement, push, open PR).
  • MCP servers configured:
    • A custom HTTP MCP server (internal to the operator's Docker network), bearer-auth.
    • The github stdio MCP via github-mcp-server stdio (declared in repo-level .mcp.json).

The orchestrator code is open and the spawn callsite is visible at the link in the "Pointers" section at the bottom.


Five diagnosis-grade observations

1. Process-tree dependency

When claude -p is spawned directly from the long-running Node webhook server, it never emits a first byte before the watchdog kills it (at any of the timeouts we tried: 60s, 180s, 300s, 600s).

Failure samples (full per-run logs are kept on the operator's VPS, accessible via the orchestrator's internal /api/runs/:id endpoint):

| run_id | started_at (UTC) | timeout | exit | log size |
|---|---|---|---|---|
| dffd15e0-47c9-11f1-8069-44054a047020 | 2026-05-04T14:59:40Z | 30 min budget (no probe) | 143 (recreate kill) | 593 B |
| eb19f180-47fd-11f1-9174-3d4d2412d603 | 2026-05-04T21:13:32Z | 180s | 143 | 707 B |
| 7509a960-485a-11f1-8604-ffaeed3f8e8f | 2026-05-05T08:14:38Z | 180s | 143 | 707 B |
| c9ec6560-485c-11f1-84b3-d4b3f9caab77 | 2026-05-05T08:31:19Z | 180s | 143 | 707 B |
| b16cc510-485d-11f1-8cd7-d9bd39b3edfb | 2026-05-05T08:37:48Z | 300s | 143 | 707 B |
| 6752b590-4860-11f1-9d2a-e30d66f18045 | 2026-05-05T08:57:12Z | 600s | 143 | 530 B |

In every one of these the per-run log file contains only the orchestrator's own banner — event=… repo=… cwd=… allowedTools=… --- — and then immediately the watchdog sentinel. The claude process emitted zero bytes in any of these runs, despite being given up to 10 full minutes.
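A first-byte watchdog of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the orchestrator's actual code: the helper name watchFirstByte and the shape of the result object are hypothetical, and any byte on stdout or stderr is treated as "claude woke up".

```javascript
// Hypothetical helper: reports once, with either the time to first byte,
// a silent exit, or a timeout.
function watchFirstByte(child, timeoutMs, onResult) {
  const startedAt = Date.now();
  let settled = false;
  let timer = null;

  const finish = (result) => {
    if (settled) return; // report only the first outcome
    settled = true;
    clearTimeout(timer);
    onResult(result);
  };

  // Any byte on stdout or stderr counts as first output.
  const onFirstData = () =>
    finish({ outcome: 'first-byte', ttftMs: Date.now() - startedAt });
  child.stdout.once('data', onFirstData);
  child.stderr.once('data', onFirstData);

  // The child exiting before emitting anything is reported separately.
  child.once('exit', (code, signal) =>
    finish({ outcome: 'exited-silent', code, signal }));

  // No output within the budget: send SIGTERM.
  // A shell would report a SIGTERM death as exit status 143 (128 + 15).
  timer = setTimeout(() => {
    child.kill('SIGTERM');
    finish({ outcome: 'timeout', ttftMs: null });
  }, timeoutMs);
}
```

With a real child this would be called as watchFirstByte(spawn('claude', args, opts), 600000, callback). Reporting the timeout and the silent exit as distinct outcomes keeps "never emitted a byte" separate from "crashed early" in the run log.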

A manually invoked probe with byte-for-byte identical spawn args (same prompt body, same --allowedTools, same cwd, same env including FLUTTER_HOME) but with the process tree bash (interactive shell) → node → claude consistently completes in 50–240s with output. The full reproduction recipe is at the bottom of this document.

The only differences between the failing path and the succeeding path:

  • Parent of claude is a long-running Node process (PID 1 of the container) vs a transient bash.
  • Parent has open file descriptors for an HTTPS server (port 8443 listening), several outbound TCP connections to GitHub and to the internal MCP server, log file descriptors, etc.

We have not been able to inspect further with strace because Docker's default seccomp policy blocks PTRACE_SEIZE, and we don't want to lower it on a production host.

What we did observe via /proc/<pid>/{net/tcp,task/*/wchan,status} while the freeze was active:

  • claude is in state S (sleeping)
  • 12 threads, all wchan = 0 (i.e. user-mode, not kernel-blocked)
  • 5–6 ESTABLISHED TCP connections, multiple to Anthropic API IP 160.79.104.10:443
  • No subprocesses spawned (no MCP stdio child, no Bash tool calls)
  • TX/RX queues empty
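The /proc observations above can be collected without ptrace. The sketch below shows one way to parse the relevant files. The function names are hypothetical, the address decoding assumes a little-endian host (where /proc/<pid>/net/tcp prints IPv4 bytes reversed), and IPv6 sockets live in the separate net/tcp6 file, which is not handled here.

```javascript
// Parse the "Key:\tvalue" lines of /proc/<pid>/status into an object.
function parseProcStatus(text) {
  const fields = {};
  for (const line of text.split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) fields[line.slice(0, idx)] = line.slice(idx + 1).trim();
  }
  return fields;
}

// Decode an IPv4 "ADDR:PORT" column from /proc/<pid>/net/tcp.
// Assumption: little-endian host, so the address bytes appear reversed.
function decodeTcpEndpoint(hex) {
  const [addr, port] = hex.split(':');
  const octets = addr.match(/../g).reverse().map((b) => parseInt(b, 16));
  return `${octets.join('.')}:${parseInt(port, 16)}`;
}

// List remote endpoints of ESTABLISHED sockets (state column "01").
function establishedRemotes(netTcpText) {
  return netTcpText
    .trim()
    .split('\n')
    .slice(1) // skip the header row
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols[3] === '01')
    .map((cols) => decodeTcpEndpoint(cols[2]));
}
```

In practice the inputs would come from fs.readFileSync(`/proc/${pid}/status`, 'utf8') and fs.readFileSync(`/proc/${pid}/net/tcp`, 'utf8'), read while the freeze is in progress.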

So claude is actively running JavaScript and waiting on remote responses, but emitting no output to stdout. A hypothesis we cannot prove without strace: the first request to the Anthropic API stalls before the first SSE token arrives, and stays stalled for at least 600 seconds.

2. Tool count threshold (deterministic freeze at exactly 7)

Independently of the process-tree issue above, we found that the --allowedTools list length has a sharp threshold. With a fixed prompt body (28 KB), fixed cwd (a Flutter project), /opt/flutter/bin in PATH, and the manual probe path:

| --allowedTools count (custom MCP tools) | TTFT |
|---|---|
| 0 | 79s ✓ |
| 1 (any single one) | 53–87s ✓ |
| 4 (any 4 of 7) | 80s ✓ |
| 6 (any 6 of 7) | 71–193s ✓ (variable) |
| 7 (all 7 we had originally) | freeze >240s (probe killed) ✗ |

The transition from 6 → 7 is a 100% reproducible cliff. We bisected by half-set, then by exclude-one: every subset of size 6 succeeds, and the full set of 7 freezes. So the cause is not a specific tool but a cumulative cost (likely the size of the system prompt that ships the tool descriptors to the model).
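The half-set and exclude-one passes can be generated mechanically. The sketch below only builds the candidate subsets; the helper names are hypothetical, and the comma-joined form follows the --allowedTools <comma-list> usage shown in the Environment section.

```javascript
// All subsets obtained by dropping exactly one tool (size n - 1).
function leaveOneOut(tools) {
  return tools.map((_, i) => tools.filter((_, j) => j !== i));
}

// Split the list into two halves for the first bisection pass.
function halves(tools) {
  const mid = Math.ceil(tools.length / 2);
  return [tools.slice(0, mid), tools.slice(mid)];
}

// Render a subset the way it is passed on the command line.
function toAllowedToolsArg(subset) {
  return subset.join(',');
}
```

Each subset would then be run through the manual probe and its TTFT recorded. If every leave-one-out subset succeeds while the full list fails, no single tool is responsible, which is the pattern reported here.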

We mitigated by reducing the list to only the 3 tools the prompt body actually invokes. With 3 tools, TTFT in the manual probe path averages ~80s.

This is independent of the issue in (1) but compounds with it: the orchestrator-spawned claude with 3 tools still freezes via the path-dependency above.

3. Host PSI is essentially zero, cgroup throttling is zero

To rule out infrastructure-side queueing, we measured at the kernel level:

/proc/pressure/cpu     some avg10=0.15 avg60=0.16 avg300=0.10 total=31466982793
/proc/pressure/memory  some avg10=0.00 avg60=0.00 avg300=0.00 total=6533572
/proc/pressure/io      some avg10=0.00 avg60=0.00 avg300=0.00 total=1149875194

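The cumulative CPU stall counter (PSI reports total in microseconds) is about 31,467 seconds over 70 days of uptime, i.e. roughly 450 s/day, or about 0.5% of wall-clock time during which at least one task waited for CPU. The host has never been meaningfully busy in the entire window during which freezes have been observed.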
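PSI total counters are cumulative stall time in microseconds, so the arithmetic can be checked mechanically. The sketch below parses one PSI line and converts the counter into a share of wall-clock time. The function names are hypothetical, and the 70-day uptime figure is taken from the Environment section.

```javascript
// Parse one PSI line, e.g.
// "some avg10=0.15 avg60=0.16 avg300=0.10 total=31466982793".
function parsePsiLine(line) {
  const [kind, ...pairs] = line.trim().split(/\s+/);
  const out = { kind };
  for (const pair of pairs) {
    const [key, value] = pair.split('=');
    out[key] = Number(value);
  }
  return out;
}

// Convert the cumulative stall counter (microseconds) into seconds,
// seconds per day, and a fraction of total uptime.
function stallShare(totalUsec, uptimeDays) {
  const stallSeconds = totalUsec / 1e6;
  const uptimeSeconds = uptimeDays * 86400;
  return {
    stallSeconds,
    perDaySeconds: stallSeconds / uptimeDays,
    fraction: stallSeconds / uptimeSeconds,
  };
}
```

For the CPU line above, stallShare(31466982793, 70) gives about 31,467 stall seconds, about 450 seconds per day, and a fraction of roughly 0.005.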

The container's cgroup CPU stats:

nr_throttled = 0
throttled_usec = 0

The container has never been throttled by cgroup limits. It always had CPU and memory available.

vmstat 1 5 while a freeze was actively in progress: CPU 94–100% idle, run queue 1–2, no swap, no IO wait, no blocked processes. The host was effectively asleep while claude was hung.

4. Bimodal TTFT distribution post bash-wrap

After implementing the workaround spawn('bash', ['-c', 'exec claude -p ...']) to insert a transient parent between the webhook server and claude, behaviour changed but did not normalise:

| run_id | timestamp (UTC) | outcome | TTFT |
|---|---|---|---|
| c9b59690-485f-11f1-9bc2-69965450ce59 | 2026-05-05T08:52:47Z | success exit 0, output 1578 B | 193 s |
| 6752b590-4860-11f1-9d2a-e30d66f18045 | 2026-05-05T08:57:12Z | killed by 600s probe, 0 bytes | NEVER |

These two ran 4 minutes apart, same code, same configuration, same prompt, same target issue. The difference: the first was able to short-circuit at an early step of the routine (it found an orphan git branch from earlier testing and exited cleanly without doing real work). The second had no shortcut to take and had to do real work — and never emitted a byte.

We don't have enough samples yet to compute a P50/P95/P99 (the orchestrator now logs TTFT into a 200-sample ring buffer, exposed via an internal /api/health endpoint), but the eyeball pattern is bimodal: either claude wakes up and works in 60–250s, or it never wakes at all and gets killed at the watchdog.
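A TTFT ring buffer of this kind might look like the sketch below. It is an illustration only: the class name, the use of null for runs that never produced output, and the nearest-rank percentile method are assumptions, not the orchestrator's actual implementation.

```javascript
class TtftRing {
  constructor(capacity = 200) {
    this.capacity = capacity;
    this.samples = [];
  }

  // Record one run; ttftMs is null when the watchdog fired before any output.
  push(ttftMs) {
    this.samples.push(ttftMs);
    if (this.samples.length > this.capacity) this.samples.shift();
  }

  // Nearest-rank percentile over the runs that did produce output.
  percentile(p) {
    const ok = this.samples.filter((s) => s !== null).sort((a, b) => a - b);
    if (ok.length === 0) return null;
    const rank = Math.max(1, Math.ceil((p / 100) * ok.length));
    return ok[rank - 1];
  }

  // Share of runs that never emitted a byte. Kept separate because folding
  // "never" into the percentiles would hide the bimodal shape.
  freezeRate() {
    if (this.samples.length === 0) return null;
    return this.samples.filter((s) => s === null).length / this.samples.length;
  }
}
```

Reporting the freeze rate next to P50/P95 makes the two modes visible: the percentiles describe the runs that woke up, and the freeze rate counts the ones that did not.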

5. Network RTT to upstream is fine

ping -c 3 api.anthropic.com
64 bytes from 2607:6bc0::10 (2607:6bc0::10): icmp_seq=3 ttl=56 time=13.2 ms
3 packets transmitted, 3 received, 0% packet loss
rtt min/avg/max/mdev = 13.107/13.140/13.192/0.037 ms

13 ms with 0.04 ms jitter, IPv6 direct. Not a network bottleneck on our side.


Reproduction recipe

To reproduce the freeze on a similar setup (Linux container, Node 20, claude CLI 2.1.126):

  1. Have a long-running parent process, e.g. an Express server that handles webhooks. Keep it running while reproducing.

  2. From inside that parent, do the equivalent of:

    const { spawn } = require('child_process');
    const child = spawn('claude', [
      '-p',
      '--permission-mode', 'acceptEdits',
      '--allowedTools', '<7 MCP tools mixing one custom HTTP MCP and others>',
    ], {
      cwd: '/path/to/flutter/project',
      env: { ...process.env, FLUTTER_HOME: '/opt/flutter', PATH: '/opt/flutter/bin:' + process.env.PATH },
      stdio: ['pipe', 'pipe', 'pipe'],
    });
    child.stdin.write(promptBody28KB);
    child.stdin.end();

  3. Wait. Observe that no stdout/stderr arrives. Eventually the parent watchdog kills it.

  4. Compare against the manual-probe path: from a fresh interactive bash, run a Node script that does the same spawn(...) — that one completes.

The operator has a working probe script that demonstrates the divergence side-by-side. Available on request.


Mitigations applied (so the system survives while waiting)

  1. allowed_tools reduced from 7 to 3 in the orchestrator's routine config for the implementer routines. This eliminates the deterministic freeze cliff in (2).
  2. Watchdog raised from 60 s to 600 s with a documented intent: kill genuinely-hung claude processes within ~10 min so the queue moves.
  3. Bash-wrapped spawn: spawn('bash', ['-c', 'exec claude ...']) to mirror the working manual-probe process tree. Removes the deterministic 100% no-output failure but does not eliminate the bimodal long-tail (see point 4 above).
  4. Trial reaper in the reconciler: every 30 min, scans for stuck issues and removes the trial label so the scheduler re-picks. Closes the gap when claude does freeze.
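For the bash-wrapped spawn in mitigation 3, one way to avoid quoting problems is to pass claude's arguments as positional parameters rather than interpolating them into the script string. The helper below is a hypothetical sketch of that idea, not the code at the spawn site.

```javascript
// Build the argv for spawn('bash', ...) so that claude's arguments reach it
// via "$@" and are never re-parsed by the shell.
function bashWrapArgs(claudeArgs) {
  return [
    '-c',
    // exec replaces the bash process with claude, which keeps bash's PID.
    'exec claude "$@"',
    'bash', // becomes $0 inside the -c script
    ...claudeArgs, // become $1, $2, ...
  ];
}
```

It would be used as spawn('bash', bashWrapArgs(['-p', '--permission-mode', 'acceptEdits', '--allowedTools', toolList]), opts), where toolList and opts stand for the values already used at the spawn site. Tool names containing spaces or shell metacharacters then need no extra escaping.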

The system is currently functional for the scheduler tier (TTFT P95 ~50–90s, never freezes) and the reaper / observability tier. The implementer tier is the broken one — production throughput is roughly 1 issue/day instead of the design target of multiple per hour.


What we'd like from Anthropic

  1. Insight into why claude -p invoked from a long-lived Node parent does not emit stdout while invoked from a transient parent does — even with byte-for-byte identical spawn args. We suspect file-descriptor inheritance, signal disposition, or scheduling interaction inside the claude internal API client, but cannot prove it without strace.
  2. Insight into the bimodal TTFT distribution post-bash-wrap. Same prompt, same env, runs minutes apart: one completes in 200s, one never emits a byte. If this is an Anthropic-side queueing or token-streaming issue, we have no way to see it from the client.
  3. Confirmation of whether claude -p headless mode is intended to support being spawned from long-running daemons at all (we may simply be using it outside its design envelope, in which case knowing that is itself the answer).

We are happy to run additional diagnostics on request — strace if you can suggest a way to make Docker seccomp permit ptrace without exposing the host, claude CLI debug flags if any exist, captured Anthropic API request/response pairs (we control the client and can intercept), etc.

A reproducible test bed is live; we can lend access on request.


Pointers for the engineer who picks this up

  • Source-of-truth for the orchestrator: https://github.com/bearblocks/qsine-workspace, under .claude/routines-vps/webhook-server/
  • Spawn site: src/claude-invoke.js (commit 0a2422a is the post-bash-wrap version)
  • Routine prompt the implementer reads: .claude/routines/auto-implement-on-label.md
  • Internal observability API: GET /api/health returns ttft_summary + reaper stats; GET /api/runs?limit=N lists recent runs; GET /api/runs/:id?tail=N returns a single run's tail. All bearer-auth, internal.
  • Post-mortem with full timeline of the investigation: https://github.com/bearblocks/qsine-workspace/issues/10

— @bearblocks

extent analysis

TL;DR

The most practical workaround for claude -p not emitting stdout when spawned from a long-running Node process is to wrap the spawn in bash, as in spawn('bash', ['-c', 'exec claude ...']), and to reduce the number of allowed tools to avoid the deterministic freeze cliff. Neither change addresses the root cause, which remains unidentified.

Guidance

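  1. Use a bash wrapper: Wrap the claude -p spawn in bash -c "exec claude ..." as the report does. Note that exec replaces the bash process image, so claude ends up running under bash's PID with the Node server as its parent; the wrapper does not leave a separate bash process in the tree.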
  2. Reduce allowed tools: Reduce the number of allowed tools from 7 to 3 to eliminate the deterministic freeze cliff.
  3. Verify TTFT distribution: Monitor the TTFT distribution after applying the bash wrapper and reducing allowed tools to measure how often the bimodal long tail still occurs.
  4. Investigate file descriptor inheritance: Investigate how file descriptor inheritance might be affecting the behavior of claude -p when spawned from a long-running Node process.

Example

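const { spawn } = require('child_process');

// Placeholders: substitute the real comma-separated tool list and prompt text.
const allowedTools = '<3 MCP tools>';
const child = spawn('bash', [
  '-c',
  // "$1" keeps the tool list out of shell parsing; a literal <...> inside the
  // script string would be read by bash as a redirection.
  'exec claude -p --permission-mode acceptEdits --allowedTools "$1"',
  'bash',        // becomes $0 inside the -c script
  allowedTools,  // becomes $1
], {
  cwd: '/path/to/flutter/project',
  env: { ...process.env, FLUTTER_HOME: '/opt/flutter', PATH: '/opt/flutter/bin:' + process.env.PATH },
  stdio: ['pipe', 'pipe', 'pipe'],
});
child.stdin.write(promptBody28KB);
child.stdin.end();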

Notes

The issue may be related to file descriptor inheritance or scheduling interactions inside the claude internal API client. Further investigation is needed to determine the root cause.

Recommendation

Apply the workaround by using a bash wrapper and reducing the number of allowed tools. This should mitigate the issue and allow the system to function, although it may not completely eliminate the bimodal long tail of runs that never emit output.
