openclaw - 💡(How to fix) Fix Claude Code SDK: subagent SSE stream errors should trigger retry, not silent text-write (root cause of #84053) [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#84064Fetched 2026-05-20 03:44:33
View on GitHub
Comments
1
Participants
2
Timeline
9
Reactions
1
Timeline (top)
labeled ×8commented ×1

When the SSE stream between claude-cli and Anthropic's API drops mid-response, the Claude Code SDK catches the error and writes it as text content into the transcript — but does NOT retry, does NOT raise to the parent, and does NOT mark the run as failed. The parent Agent tool call stays open waiting for a tool_result that never comes. From the parent agent's view, the subagent is "still running" indefinitely.

Error Message

When the SSE stream between claude-cli and Anthropic's API drops mid-response, the Claude Code SDK catches the error and writes it as text content into the transcript — but does NOT retry, does NOT raise to the parent, and does NOT mark the run as failed. The parent Agent tool call stays open waiting for a tool_result that never comes. From the parent agent's view, the subagent is "still running" indefinitely. | a9996aacae777a507 (Stage 2 first try) | stop_sequence | 0 | SDK caught the SSE error and wrote "API Error: The socket connection was closed unexpectedly." as text — but did not retry. This is the diagnostic smoking gun. | 2. SDK should raise on unrecoverable SSE error, not silently write to transcript. The transcript-write pattern is dangerous — it makes the parent layer believe the subagent is operating normally when it isn't.

Root Cause

Hypothesized root cause

Fix Action

Fix / Workaround

Why parent-layer watchdog patches don't help

Mitigation in our project

RAW_BUFFERClick to expand / collapse

Subagent SSE stream errors should trigger retry, not silent text-write

OpenClaw version: 2026.5.18 (50a2481) Component: Claude Code SDK subagent layer (the Agent tool's underlying claude-cli subprocess pipeline) Severity: High — silent failures cause indefinite stalls when subagent SSE streams drop

Summary

When the SSE stream between claude-cli and Anthropic's API drops mid-response, the Claude Code SDK catches the error and writes it as text content into the transcript — but does NOT retry, does NOT raise to the parent, and does NOT mark the run as failed. The parent Agent tool call stays open waiting for a tool_result that never comes. From the parent agent's view, the subagent is "still running" indefinitely.

Evidence — 5 reproductions in single session (2026-05-18)

Three death signatures, same root pipeline issue:

SubagentLast stop_reasonOutput tokensShape
a9996aacae777a507 (Stage 2 first try)stop_sequence0SDK caught the SSE error and wrote "API Error: The socket connection was closed unexpectedly." as text — but did not retry. This is the diagnostic smoking gun.
aeccf7ad1e79eb129 (A2a real_estate)tool_use158Clean tool_use turn — process never came back to read the tool_result. SSE stream died waiting for tool result handoff.
a7002afd82c455a72 (A2b verticals)None5Text cuts mid-word ("...micro-scenarios (gat"). SSE stream died mid-response.
af1e8c8200960d1a9 (Stage 3)NoneSame shape as A2b.
a0b81240b1e438763 (nexus rename)NoneSame shape.

Hypothesized root cause

The Claude Code SDK's SSE stream consumer catches EOF/socket close errors, writes them to the transcript buffer, and returns normally — but the subprocess doesn't re-establish the stream or retry the request. The parent layer (Agent tool call) sees a transcript-with-text-but-no-completion-event and waits forever.

Why parent-layer watchdog patches don't help

OpenClaw's own watchdog operates at the session/JARVIS layer. From OpenClaw's view, JARVIS has an open tool_call and is patiently waiting for it — OpenClaw can't tell the subprocess is dead because it can't see inside the subagent's own SSE stream. Forcing recoveryEligible=true on JARVIS-layer model_call stalls would help if JARVIS himself hung, but JARVIS doesn't hang; the SUBAGENT does, and JARVIS waits.

Suggested fixes (in priority order)

  1. SDK should retry on SSE socket close with exponential backoff (e.g., 3 attempts: immediate, 1s, 5s). The streaming connection is the most failure-prone layer; retry should be table stakes.
  2. SDK should raise on unrecoverable SSE error, not silently write to transcript. The transcript-write pattern is dangerous — it makes the parent layer believe the subagent is operating normally when it isn't.
  3. Subagent process should fire a completion event on abnormal exit with whatever transcript exists, so parent agents (and OpenClaw's watchdog) can detect death and recover.
  4. Until 1-3 ship: document the failure mode so users can build their own subagent reapers (mtime polling, periodic health checks) on the subagent JSONL layer.

Mitigation in our project

Until upstream ships retry/recovery:

  • Poll subagent JSONL file mtime every 90s
  • If stale + last event has stop_reason: None or unanswered tool_use → kill subagent + respawn with checkpoint context
  • Shrink per-spawn workload to reduce at-risk window (smaller batches = shorter SSE streams)
  • Pin OpenClaw version (no doctor runs)

Reproducibility

Trivial — 5 of 5 long-running Opus subagent spawns in our 2026-05-18 session exhibited the failure. Not narrowed to a specific task type; happens across template-generation, refactoring, file-rename, and verification workloads. ~3-15 min into the spawn.

Related issues

  • #84053 — Background subagent completion notifications dropped silently (this issue is a more precise diagnosis of the same root cause)
  • #84054 — doctor --fix strips agentRuntime mapping (worsens the situation by routing subagents to the less-stable runtime path)

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - 💡(How to fix) Fix Claude Code SDK: subagent SSE stream errors should trigger retry, not silent text-write (root cause of #84053) [1 comments, 2 participants]