claude-code - 💡(How to fix) Fix Subagent liveness/heartbeat: Task-spawned agents can die silently with no signal [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
anthropics/claude-code#54820Fetched 2026-04-30 06:35:02
View on GitHub
Comments
1
Participants
2
Timeline
3
Reactions
0
Timeline (top)
labeled ×2commented ×1

Error Message

  • A Codex review subagent (pid 13224) died mid-review before the result harvest stage. The orchestrating session showed it as in-flight for ~1.5 hours before the user noticed. No error, no exit signal surfaced.

Fix Action

Fix / Workaround

  • A Codex review subagent (pid 13224) died mid-review before the result harvest stage. The orchestrating session showed it as in-flight for ~1.5 hours before the user noticed. No error, no exit signal surfaced.
  • This forced the creation of a user-land subagent-heartbeat convention: every dispatched subagent appends a timestamp every ~5 min to .claude/agent-status/<id>.log; the orchestrator polls every 10 min and treats >30 min idle as stalled.

Long-running parallel agent dispatches (waves, backlog burndowns, codex reviews) where one silent death currently blocks the whole pipeline.

RAW_BUFFERClick to expand / collapse

Problem

Task-spawned subagents (including Codex-style detached workers) can die mid-execution and the orchestrator has no native way to know. The harness reports them as "running" indefinitely; the user only finds out by manually inspecting the process tree or noticing missing output hours later.

Concrete incident

  • A Codex review subagent (pid 13224) died mid-review before the result harvest stage. The orchestrating session showed it as in-flight for ~1.5 hours before the user noticed. No error, no exit signal surfaced.
  • This forced the creation of a user-land subagent-heartbeat convention: every dispatched subagent appends a timestamp every ~5 min to .claude/agent-status/<id>.log; the orchestrator polls every 10 min and treats >30 min idle as stalled.

That convention works but lives entirely in user-land — it requires every subagent author to opt in and every orchestrator to remember to poll.

Proposed

Surface subagent liveness in the harness:

  1. Track last-output timestamp per Task agent.
  2. Emit a SubagentStalled event (or set a status flag) when a subagent has produced no output for a configurable window (default ~30 min).
  3. Optionally expose this in TaskList / TaskGet output so orchestrators can detect it without polling logs.

Use case

Long-running parallel agent dispatches (waves, backlog burndowns, codex reviews) where one silent death currently blocks the whole pipeline.

extent analysis

TL;DR

Implementing a subagent liveness tracking mechanism in the harness, such as tracking last-output timestamps and emitting a SubagentStalled event, can help detect and handle silent subagent deaths.

Guidance

  • Consider adding a last_output_timestamp field to the Task agent data structure to track the last time each subagent produced output.
  • Introduce a configurable timeout window (e.g., 30 minutes) to determine when a subagent is considered stalled.
  • Design an event or status flag (e.g., SubagentStalled) to notify the orchestrator when a subagent has not produced output within the specified window.
  • Evaluate exposing the subagent liveness information in TaskList and TaskGet output to simplify detection for orchestrators.

Example

No code snippet is provided due to the high-level nature of the issue, but an example implementation might involve modifying the Task agent data structure and adding event emission logic.

Notes

The proposed solution requires changes to the harness and may involve additional complexity, such as handling edge cases (e.g., subagents with variable output frequencies) and ensuring backwards compatibility.

Recommendation

Apply a workaround, such as the existing subagent-heartbeat convention, until a native harness solution is implemented, as it provides an immediate, albeit user-land, solution to the problem.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Subagent liveness/heartbeat: Task-spawned agents can die silently with no signal [1 comments, 2 participants]