claude-code: Workflow counts API-errored subagents as "completed"; 429/500/529 storm under multi-agent load

Two compounding problems when running Workflow (multi-agent) jobs:

  1. Client-side accounting bug (main issue): Subagents that terminate on an API error are counted as successful. A 299-agent workflow returned {"grandTotal":299,"agentsCompleted":299} while only 13 agents actually produced output. The failed agents returned the API error text as their result, and the orchestrator's success filter (a truthy-string check) treated that as success. There is no error propagation, so the reported completion count is unreliable.

  2. API reliability under load: With a few concurrent workflows fanning out subagents (tens of concurrent requests from one account), the Messages API returned a burst of HTTP 429 ("Server is temporarily limiting requests (not your usage limit) · Rate limited"), plus HTTP 500 and HTTP 529 — while the status page showed all-green. Agents did not appear to honor Retry-After / backoff before failing.
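The accounting bug in item 1 can be illustrated with a minimal sketch. The orchestrator's real code is not shown in this report, so `AgentResult`, its `is_api_error` field, and both counting functions are hypothetical names; the only details taken from the report are the truthy-string check, the error text, and the 299 vs. 13 figures.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical shape of one subagent's return value."""
    text: str
    # Assumed to mirror the "isApiErrorMessage" flag seen in the transcripts.
    is_api_error: bool = False


def count_completed_truthy(results):
    # Flawed check: any non-empty string counts, including API error text.
    return sum(1 for r in results if r.text)


def count_completed_strict(results):
    # Stricter check: non-empty text that is not flagged as an API error.
    return sum(1 for r in results if r.text and not r.is_api_error)


ERROR_TEXT = (
    "API Error: Server is temporarily limiting requests "
    "(not your usage limit) · Rate limited"
)

# Illustrative split: 13 agents with real output, the remaining 286 with error text.
results = [AgentResult("usable output")] * 13 + [
    AgentResult(ERROR_TEXT, is_api_error=True)
] * 286

print(count_completed_truthy(results))  # 299
print(count_completed_strict(results))  # 13
```

The point of the sketch is that success has to be decided from an explicit error signal, not from whether the result string happens to be non-empty.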

Error Message

API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
"error":"rate_limit","isApiErrorMessage":true,"apiErrorStatus":429

Environment

  • Claude Code CLI (interactive), Workflow multi-agent orchestration
  • Model: claude-opus-4-8
  • Platform: Linux aarch64 (Jetson/Tegra), bash
  • During the incident, https://status.claude.com showed "All Systems Operational" (no active incident)

Evidence

Error-signature frequency across one workflow's 299 agent transcripts:

Signature                   Count
HTTP 429                    366
HTTP 500                    323
HTTP 529                    50
"API Error" / rate_limit    296

Verbatim from a subagent transcript:

API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited
"error":"rate_limit","isApiErrorMessage":true,"apiErrorStatus":429
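A result carrying this marker can be detected mechanically. The following is a rough sketch, assuming the marker always appears in the form shown in the excerpt above; the regular expression and the function name `api_error_status` are illustrative and not part of Claude Code.

```python
import re

# Assumed marker format, copied from the transcript excerpt in this report.
STATUS_RE = re.compile(
    r'"isApiErrorMessage"\s*:\s*true\s*,\s*"apiErrorStatus"\s*:\s*(\d{3})'
)


def api_error_status(transcript_text):
    """Return the HTTP status of the first API-error marker, or None if absent."""
    match = STATUS_RE.search(transcript_text)
    return int(match.group(1)) if match else None


sample = (
    "API Error: Server is temporarily limiting requests "
    "(not your usage limit) · Rate limited\n"
    '"error":"rate_limit","isApiErrorMessage":true,"apiErrorStatus":429'
)

print(api_error_status(sample))          # 429
print(api_error_status("wrote report"))  # None
```

A check like this only works if the transcript format stays stable, so a structured error field would be more robust than text matching.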

Sample requestIds: req_011CbaRzW4z2iAHc3eP2GRhU, req_011CbaRq84u4CU8PyoXPR6Sb, req_011CbaRzhtP9QyKv74LRCHCK

Outcome: the 299-agent run produced 13 usable files; a concurrent 192-agent run produced ~11 before it was stopped. All failures fell inside a ~4-minute window.

Reproduction

  1. Launch ~3 concurrent Claude Code Workflow runs from one account, each fanning out subagents.
  2. Within minutes, a large fraction of Messages API calls return 429/500/529.
  3. The workflow still reports agentsCompleted == grandTotal, masking the failures.

Impact

Billed for hundreds of subagent calls (input tokens + retries) with zero usable output, and a "success" metric that cannot be trusted — every run must be re-verified against the filesystem. The failed calls also consumed subscription usage quota.
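Until the count is fixed, re-verifying against the filesystem can be scripted. This is a minimal sketch, assuming each agent writes one file into a known output directory; the directory layout, file names, and the helper names are hypothetical, and the only field taken from the report is `agentsCompleted`.

```python
import tempfile
from pathlib import Path


def count_usable_outputs(output_dir, pattern="*"):
    """Count non-empty regular files in output_dir that match pattern."""
    return sum(
        1
        for p in Path(output_dir).glob(pattern)
        if p.is_file() and p.stat().st_size > 0
    )


def verify_run(summary, output_dir, pattern="*"):
    """Compare the reported completion count with the files actually on disk."""
    actual = count_usable_outputs(output_dir, pattern)
    reported = summary["agentsCompleted"]
    return {"reported": reported, "actual": actual, "matches": actual == reported}


# Demo with a throwaway directory: one real output, one empty file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "agent_001.md").write_text("real output")
    (Path(tmp) / "agent_002.md").write_text("")
    report = verify_run({"grandTotal": 2, "agentsCompleted": 2}, tmp)

print(report)  # {'reported': 2, 'actual': 1, 'matches': False}
```

A non-empty check will not catch a file that contains only API error text, so it is a lower bound on the number of failures rather than an exact count.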

Requested fixes

  1. Don't count an agent that terminated on an API error as completed — propagate the failure so agentsCompleted is truthful.
  2. Auto-retry 429/529 with exponential backoff (honor Retry-After) before declaring an agent failed.
  3. Optional: an account-wide concurrency governor so stacked workflows don't self-inflict a rate-limit storm.
  4. Reflect elevated 500/529 / capacity-shedding on status.claude.com.
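Fixes 2 and 3 could look roughly like the sketch below. It is not Claude Code's implementation: `ApiError`, `backoff_delay`, and `call_with_retry` are hypothetical names, the retry limits are arbitrary, and the `asyncio.Semaphore` only limits calls inside one process, whereas a real account-wide governor would need state shared across workflows.

```python
import asyncio
import random

RETRYABLE = {429, 529}  # statuses the report asks to retry


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status and optional Retry-After seconds."""

    def __init__(self, status, retry_after=None):
        super().__init__(f"API error {status}")
        self.status = status
        self.retry_after = retry_after


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next try; attempt is 0-based."""
    if retry_after is not None:
        return retry_after  # the server's hint takes precedence
    # Exponential backoff with full jitter, capped at `cap` seconds.
    return random.uniform(0, min(cap, base * 2 ** attempt))


async def call_with_retry(make_call, governor, max_attempts=5, sleep=asyncio.sleep):
    """Run make_call under the governor, retrying 429/529 with backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            async with governor:  # bound the number of in-flight requests
                return await make_call()
        except ApiError as err:
            if err.status not in RETRYABLE or attempt == max_attempts - 1:
                raise  # propagate so the orchestrator can count a failure
            await sleep(backoff_delay(attempt, err.retry_after))


async def demo():
    governor = asyncio.Semaphore(4)  # at most 4 concurrent calls in this process
    calls = {"n": 0}

    async def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ApiError(429, retry_after=0)
        return "ok"

    return await call_with_retry(flaky, governor), calls["n"]


print(asyncio.run(demo()))  # ('ok', 3)
```

Re-raising after the last attempt matters for fix 1 as well: the orchestrator receives an exception instead of error text, so a failed agent cannot be mistaken for a completed one.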

Account remediation requested

The failed requests above (server-side 429 "not your usage limit" + 500 + 529) consumed subscription usage with no usable result. Please credit / reset the usage counter for these failed calls — it is not acceptable for server-side capacity errors to burn customer quota. Happy to provide account details privately through support.
