claude-code - 💡(How to fix) Fix [BUG] Workflow subagent: one stall is retried 6× at full token cost (no across-retry progress check); the no-progress signal can't tell slow-but-alive from hung

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

Error Messages/Logs

Per-agent at failure (from the run record): state=error, attempt=6.

Root Cause

2. A "no progress" signal too coarse to distinguish slow/blocked-but-alive from hung. The behavior is consistent with "no progress" being measured from committed transcript events (a finished tool call, a tool result, or a completed assistant content block) rather than from the live token stream or a running subprocess — I can't see the implementation, but the kill timing relative to emitted events points that way. The consequence I observed: a long single inference turn (the model generating over a large context) registers as "no progress" while it is in fact generating, because it commits no transcript event until the block closes — even though the Messages API does stream content_block_delta/thinking_delta during generation.

Fix Action

Fix / Workaround

A Workflow-orchestrated run dispatched three subagents (one unit of work each). All three did substantial productive work, then each went silent mid-turn for ~15–17 minutes, was killed by the no-progress watchdog, and was retried five more times — each retry stalling the same way. Net result: 0 of 3 units completed, ~5.2h wall-clock, ~1.79M tokens, zero output.

[stall] agent "impl:A" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:B" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:A" stalled (no progress) after 945s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 409s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1634s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 1244s — retrying (4/5)
[stall] agent "impl:A" stalled (no progress) after 300s — retrying (5/5)
[stall] agent "impl:B" stalled (no progress) after 2509s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1003s — retrying (4/5)
[stall] agent "impl:B" stalled (no progress) after 3785s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
parallel[1] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
-- batch 2 --
[stall] agent "impl:C" stalled (no progress) after 2749s — retrying (1/5)
[stall] agent "impl:C" stalled (no progress) after 949s — retrying (2/5)
[stall] agent "impl:C" stalled (no progress) after 1016s — retrying (3/5)
[stall] agent "impl:C" stalled (no progress) after 1042s — retrying (4/5)
[stall] agent "impl:C" stalled (no progress) after 278s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
dispatch complete: 0/3 units completed.

Repro B — inference-layer (the field manifestation; non-deterministic). Dispatch a subagent on a task large enough that a single inference turn over a ~400k–700k-token context exceeds the silence window during generation (or instruct one long uninterrupted output, e.g. "emit a single ~1500-line file in one Write," or "output the numbers 1..50000 as one message with no tool calls"). The turn streams but commits no transcript event until it finishes, so it registers as no-progress and is killed mid-generation, then retried. Here the re-stall is structural — the same oversized context forces an equally long turn on every retry — which is why the field incident stalled identically across all six attempts. Non-deterministic (depends on generation time); the run-record timestamps above are the deterministic evidence that it occurred.

Claude Model

Code Example

[stall] agent "impl:A" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:B" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:A" stalled (no progress) after 945s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 409s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1634s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 1244s — retrying (4/5)
[stall] agent "impl:A" stalled (no progress) after 300s — retrying (5/5)
[stall] agent "impl:B" stalled (no progress) after 2509s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1003s — retrying (4/5)
[stall] agent "impl:B" stalled (no progress) after 3785s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
parallel[1] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
-- batch 2 --
[stall] agent "impl:C" stalled (no progress) after 2749s — retrying (1/5)
[stall] agent "impl:C" stalled (no progress) after 949s — retrying (2/5)
[stall] agent "impl:C" stalled (no progress) after 1016s — retrying (3/5)
[stall] agent "impl:C" stalled (no progress) after 1042s — retrying (4/5)
[stall] agent "impl:C" stalled (no progress) after 278s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
dispatch complete: 0/3 units completed.

---

export const meta = {
  name: 'watchdog-repro',
  description: 'A single alive-but-silent step trips the no-progress watchdog and is retried',
  phases: [{ title: 'Repro' }],
}
const r = await agent(
  'Run exactly this one shell command and report its output, nothing else: ' +
  'node -e "const t=Date.now();while(Date.now()-t<240000){};console.log(\'alive\')"',
  { label: 'silent-but-alive', phase: 'Repro' }
)
return r
RAW_BUFFERClick to expand / collapse

Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

A Workflow-orchestrated run dispatched three subagents (one unit of work each). All three did substantial productive work, then each went silent mid-turn for ~15–17 minutes, was killed by the no-progress watchdog, and was retried five more times — each retry stalling the same way. Net result: 0 of 3 units completed, ~5.2h wall-clock, ~1.79M tokens, zero output.

There are two distinct defects. The first does not depend on any claim about Claude Code's internals; the second is an inference from observed behavior.

1. Retry amplification (the core defect). One agent() call became six attempts (1 + 5 retries). Every retry re-ran the same task and terminated on the same no-progress timer, producing no net-new progress over the prior attempt — so the runtime paid ~6× the tokens to reproduce an identical failure. Retry can be valid recovery: in a comparison run (below) two agents stalled once and recovered on retry. The defect is retrying without an across-attempt progress check — when attempt N+1 reaches the same stall with nothing new versus attempt N, repeating it is pure waste. This is visible directly in the run log; it requires no assumption about how the watchdog works internally.

2. A "no progress" signal too coarse to distinguish slow/blocked-but-alive from hung. The behavior is consistent with "no progress" being measured from committed transcript events (a finished tool call, a tool result, or a completed assistant content block) rather than from the live token stream or a running subprocess — I can't see the implementation, but the kill timing relative to emitted events points that way. The consequence I observed: a long single inference turn (the model generating over a large context) registers as "no progress" while it is in fact generating, because it commits no transcript event until the block closes — even though the Messages API does stream content_block_delta/thinking_delta during generation.

Honest caveat on defect #2: a ~15-minute silence is genuinely ambiguous — it could be a slow-but-alive turn or a real hang (the runaway in #61877). That ambiguity is the problem: the current signal can neither keep a slow-but-alive agent alive nor catch a true hang quickly. Either way, defect #1 turns one stall into six full-cost failures.

Plausibly the same underlying root cause as #61877 (runaway / fails-to-stop) and #58604 (no kill primitive) — see those threads — but this report stands on its own run record below and does not depend on that linkage.

This is not workflow configuration. The script issues a single await agent(prompt, …); the runtime expanded it into six attempts on its own. The documented agent() surface I have access to ({label, phase, schema, model, isolation, agentType}) exposes no retry-count / max-attempts / timeout option, so the caller appears to have no way to bound this.

What Should Happen?

  1. Bound retries by progress, not a fixed count (cheapest, highest-leverage; no liveness change needed). Keep retry as a recovery mechanism, but if attempt N+1 reaches the same stall with no net-new progress versus attempt N (no new tool calls / file writes / output), stop and surface the failure instead of repeating it ~6× at full token cost. This bounds the blast radius of every stall — this one and #61877's runaway — at ~1×.

  2. Judge liveness where the agent is blocked, not by committed-event silence alone. While an inference request is outstanding, a streaming content_block_delta/ping within a short window should count as alive; while a tool is running, a subprocess still emitting output or consuming CPU should count as alive; only genuine idle (none of the above) should be killed by the fast timer. Keep a generous absolute per-turn wall-clock ceiling — suggest ~600–900s, tunable — so a true hang/livelock (#61877) is still caught: idle keeps the existing fast timer, in-flight work is judged by the request-appropriate ceiling. (Note: the Bash tool already inspects foreground commands — it prompts on a bare sleep — so subprocess-level signal exists at the tool boundary; the no-progress watchdog just doesn't appear to consume it.)

Not requested: removing the watchdog, globally raising the threshold, or changing the intended aggressive-kill default. The guard should stay; its retry policy and its progress signal are the issues.

Error Messages/Logs

Runtime-emitted stall/retry log (these lines come from the Workflow runtime, not the workflow script; agent labels genericized; two batches — A and B ran in parallel, then C):

[stall] agent "impl:A" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:B" stalled (no progress) after 1609s — retrying (1/5)
[stall] agent "impl:A" stalled (no progress) after 945s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 409s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1634s — retrying (2/5)
[stall] agent "impl:A" stalled (no progress) after 1244s — retrying (4/5)
[stall] agent "impl:A" stalled (no progress) after 300s — retrying (5/5)
[stall] agent "impl:B" stalled (no progress) after 2509s — retrying (3/5)
[stall] agent "impl:B" stalled (no progress) after 1003s — retrying (4/5)
[stall] agent "impl:B" stalled (no progress) after 3785s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
parallel[1] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
-- batch 2 --
[stall] agent "impl:C" stalled (no progress) after 2749s — retrying (1/5)
[stall] agent "impl:C" stalled (no progress) after 949s — retrying (2/5)
[stall] agent "impl:C" stalled (no progress) after 1016s — retrying (3/5)
[stall] agent "impl:C" stalled (no progress) after 1042s — retrying (4/5)
[stall] agent "impl:C" stalled (no progress) after 278s — retrying (5/5)
parallel[0] failed: agent stalled on all 6 attempts (no progress for 180000ms each)
dispatch complete: 0/3 units completed.

Per-agent at failure (from the run record): state=error, attempt=6.

agenttokenslongest run
impl:A621,581~1.6h across attempts
impl:B737,627~3.2h (durationMs 11,651,330)
impl:C430,544~2.0h

Reconciling the totals so they can't be read as inconsistent: tokens ~1.79M = the sum across the three agents (621k + 738k + 431k = 1,789,752), i.e. real billed tokens. Wall clock ~5.2h = batch-1 span (A + B in parallel, ~3.2h) + batch-2 span (C, ~2.0h) — total elapsed, not a sum of overlapping per-agent times.

On the threshold (stated precisely, because the numbers don't obviously agree): the terminal failure lines report no progress for 180000ms (180s), but the per-line "after Ns" values (1609s, 2509s, 3785s, …) are far larger, and the actual last-event→kill silence gaps measured in the subagent transcripts were 1031s (impl:A), 925s (impl:B), 997s (impl:C). My reading is that "after Ns" is the attempt's cumulative wall-clock at the moment the kill was logged (the attempt did productive work, then went silent), but the exact relationship between "after Ns", the 180000ms window, and the ~1000s transcript gaps is not determinable from the logs alone. What is unambiguous: each attempt did substantial work, then went silent for many minutes, then was killed and retried. Within each attempt, every non-terminal inter-event gap was under 11 seconds — including around Bash/test subprocesses — strongly indicating the terminal silence fell within a single long inference turn rather than a subprocess stall.

Steps to Reproduce

The no-progress timer appears to be shared across both blocking modes (an outstanding inference request and a running tool subprocess) — the logs show a single 180000ms threshold for both. Repro A isolates the shared timer + retry amplification deterministically (no model variance). Repro B is the inference-layer manifestation that caused the field incident (non-deterministic).

Repro A — deterministic at the watchdog/tool layer. A one-agent workflow whose single step is alive but emits no output past the window:

export const meta = {
  name: 'watchdog-repro',
  description: 'A single alive-but-silent step trips the no-progress watchdog and is retried',
  phases: [{ title: 'Repro' }],
}
const r = await agent(
  'Run exactly this one shell command and report its output, nothing else: ' +
  'node -e "const t=Date.now();while(Date.now()-t<240000){};console.log(\'alive\')"',
  { label: 'silent-but-alive', phase: 'Repro' }
)
return r
  1. Save as watchdog-repro.workflow.mjs and run it via the Workflow tool. node is used because it ships with Claude Code (zero extra dependency). The busy-loop pins one core for 240s, so while the agent is stalled you can confirm out-of-band with top/ps that the child process is alive and consuming CPU — the watchdog kills it anyway. (A long foreground sleep is avoided because, in my environment on 2.1.156, the Bash tool prompts on it rather than running it unattended; any alive-but-silent foreground step >180s reproduces the kill identically.)
  2. Observe: the runtime logs … stalled (no progress) … — retrying (1/5), kills the agent, and the strict single-command prompt induces it to re-issue the same silent step, which stalls again — through (5/5), then stalled on all 6 attempts.

Expected: the step is alive for 240s and the agent returns alive once. Actual: killed after a multi-minute silence and retried to a hard failure.

Determinism note: Repro A is deterministic at the watchdog/tool layer (the step is provably alive and silent regardless of model output). The retry re-stall depends on the agent re-issuing the same command each attempt — the strict prompt makes this likely but, since the subagent is an LLM, not guaranteed.

Repro B — inference-layer (the field manifestation; non-deterministic). Dispatch a subagent on a task large enough that a single inference turn over a ~400k–700k-token context exceeds the silence window during generation (or instruct one long uninterrupted output, e.g. "emit a single ~1500-line file in one Write," or "output the numbers 1..50000 as one message with no tool calls"). The turn streams but commits no transcript event until it finishes, so it registers as no-progress and is killed mid-generation, then retried. Here the re-stall is structural — the same oversized context forces an equally long turn on every retry — which is why the field incident stalled identically across all six attempts. Non-deterministic (depends on generation time); the run-record timestamps above are the deterministic evidence that it occurred.

Claude Model

Opus

Is this a regression?

I don't know

Last Working Version

No response

Claude Code Version

2.1.156

Platform

Anthropic API

Operating System

macOS

Terminal/Shell

Terminal.app (macOS)

Additional Information

Observed on 2.1.156 (the installed release). If a later version changed the watchdog/retry behavior I haven't found it noted — happy to re-test against a version you point to.

Additional Information

  • Model: claude-opus-4-8 (Opus 4.8). #61877 (runaway direction) is reported on Opus 4.7, so the underlying behavior does not appear model-version-specific. Platform: macOS (Darwin 24.6).
  • Configurability: no documented agent() option controls the retry count or the timeout, and I'm not aware of an env var or settings key for the no-progress window or retry count. Whether these are literally hardcoded or just non-configurable from the caller's surface, I can't tell — but a blind 6× retry of an identical failure is the defect either way.
  • Contrast run (illustrative single A/B, not a controlled experiment): a comparable run on the same harness and model, with smaller units of work, completed 4/4 in 1.3h — and individual agents there ran as long as 57 minutes without being killed (long total duration is fine; the watchdog reacts only to silence). Two agents stalled once and recovered on retry, because a fresh attempt fit more work into the window before the next silence. The variable that most plausibly differed between the two runs was per-agent context size (hence per-turn inference latency), not total duration or tool behavior. Important: context size is the trigger, not the defect — a progress signal that forces users to artificially cap context/turn size to stay under a silence window is converting a latency characteristic into a correctness failure.
  • Why "working as intended" doesn't fully cover it: even granting an intentional aggressive watchdog plus a ceiling, (a) blind retry amplification is waste no policy defends — retry is recovery only when the next attempt can make progress, and a same-stall/no-delta retry can't; and (b) per reports in #61877's thread the watchdog also appears to fire on a clean stand-down (an agent that finished its work and emitted a final summary) — if accurate, the false positive is not confined to heavy load.
  • The workaround tax (field evidence): to keep agents alive I had to hand-write rules into the workflow prompts — "don't run the ~20-minute test (it's silent; you'll be killed)," and "don't Write more than ~300 lines in one call; chunk into small Edits; emit a tool call at least every ~2 minutes." Having to chunk work so the agent emits committed events faster than the silence window is the symptom.

Related files or code

  • watchdog-repro.workflow.mjs (Repro A, above).

Links / related issues

  • #61877 — subagents enter generation loop on finalization / don't exit (open, labeled bug; runaway — plausibly the same root cause, opposite symptom).
  • #58604 — runaway Agent task burns tokens after completion; agentIdtask_id makes it unkillable (missing kill primitive).

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix [BUG] Workflow subagent: one stall is retried 6× at full token cost (no across-retry progress check); the no-progress signal can't tell slow-but-alive from hung