openclaw - 💡(How to fix) Fix Claude Code SDK: subagent SSE stream errors should trigger retry, not silent text-write (root cause of #84053) [1 comments, 2 participants]

openclaw • 2026-05-19 09:04:32

openclaw/openclaw#84064•Fetched 2026-05-20 03:44:33

Author

Virexa-Labs

Participants

clawsweeper[bot]

Virexa-Labs

Timeline (top)

labeled ×8, commented ×1

Error Message

When the SSE stream between claude-cli and Anthropic's API drops mid-response, the Claude Code SDK catches the error and writes it into the transcript as text content. It does NOT retry, does NOT raise to the parent, and does NOT mark the run as failed. The parent Agent tool call stays open, waiting for a tool_result that never comes, so from the parent agent's view the subagent is "still running" indefinitely.

Subagent SSE stream errors should trigger retry, not silent text-write

OpenClaw version: 2026.5.18 (50a2481)
Component: Claude Code SDK subagent layer (the Agent tool's underlying claude-cli subprocess pipeline)
Severity: High. Silent failures cause indefinite stalls when subagent SSE streams drop.

Summary

Evidence — 5 reproductions in single session (2026-05-18)

Three death signatures, same root pipeline issue:

| Subagent | Last `stop_reason` | Output tokens | Shape |
| --- | --- | --- | --- |
| a9996aacae777a507 (Stage 2 first try) | `stop_sequence` | 0 | SDK caught the SSE error and wrote "API Error: The socket connection was closed unexpectedly." as text, but did not retry. This is the diagnostic smoking gun. |
| aeccf7ad1e79eb129 (A2a real_estate) | `tool_use` | 158 | Clean tool_use turn; the process never came back to read the tool_result. SSE stream died waiting for tool result handoff. |
| a7002afd82c455a72 (A2b verticals) | `None` | 5 | Text cuts mid-word ("...micro-scenarios (gat"). SSE stream died mid-response. |
| af1e8c8200960d1a9 (Stage 3) | `None` | - | Same shape as A2b. |
| a0b81240b1e438763 (nexus rename) | `None` | - | Same shape. |

Hypothesized root cause

The Claude Code SDK's SSE stream consumer catches EOF/socket close errors, writes them to the transcript buffer, and returns normally — but the subprocess doesn't re-establish the stream or retry the request. The parent layer (Agent tool call) sees a transcript-with-text-but-no-completion-event and waits forever.
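
To make the hypothesis concrete, here is a minimal Python sketch of the suspected pattern. It is not SDK code: `Transcript`, `flaky_stream`, and `consume_swallowing` are hypothetical names invented for illustration, and the stream is simulated. It only shows how a consumer that converts a socket error into transcript text and returns normally leaves the caller with no completion signal and no exception.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical stand-in for the subagent transcript buffer."""
    events: list = field(default_factory=list)
    completed: bool = False  # set only when the stream ends cleanly


def flaky_stream():
    """Simulated SSE stream that drops mid-response."""
    yield "partial text"
    raise ConnectionError("The socket connection was closed unexpectedly.")


def consume_swallowing(stream, transcript):
    """Suspected behaviour: the error becomes text and the call returns normally."""
    try:
        for chunk in stream:
            transcript.events.append({"type": "text", "text": chunk})
        transcript.completed = True
    except ConnectionError as exc:
        # The failure is recorded as ordinary content; nothing is raised
        # and no completion or failure event is emitted.
        transcript.events.append({"type": "text", "text": f"API Error: {exc}"})


t = Transcript()
consume_swallowing(flaky_stream(), t)
print(t.completed, t.events[-1]["text"])
```

After the call, `t.completed` is still `False` and the last event is plain text, which is the state a waiting parent cannot distinguish from a subagent that is still working.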

Why parent-layer watchdog patches don't help

OpenClaw's own watchdog operates at the session/JARVIS layer. From OpenClaw's view, JARVIS has an open tool_call and is patiently waiting for it — OpenClaw can't tell the subprocess is dead because it can't see inside the subagent's own SSE stream. Forcing recoveryEligible=true on JARVIS-layer model_call stalls would help if JARVIS himself hung, but JARVIS doesn't hang; the SUBAGENT does, and JARVIS waits.

Suggested fixes (in priority order)

1. SDK should retry on SSE socket close with exponential backoff (e.g., 3 attempts: immediate, 1s, 5s). The streaming connection is the most failure-prone layer; retry should be table stakes.
2. SDK should raise on unrecoverable SSE error, not silently write to transcript. The transcript-write pattern is dangerous: it makes the parent layer believe the subagent is operating normally when it isn't.
3. Subagent process should fire a completion event on abnormal exit with whatever transcript exists, so parent agents (and OpenClaw's watchdog) can detect death and recover.
4. Until 1-3 ship: document the failure mode so users can build their own subagent reapers (mtime polling, periodic health checks) on the subagent JSONL layer.
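
The first two suggestions (retry with backoff, then raise) could look roughly like the following Python sketch. Everything here is an assumption for illustration: `run_with_retry`, `StreamFailed`, and the `open_stream` callable are hypothetical names, not part of the Claude Code SDK, and the sketch simply restarts the request from scratch on each attempt rather than resuming a partial response.

```python
import time


class StreamFailed(RuntimeError):
    """Raised once every retry attempt has failed (hypothetical error type)."""


def run_with_retry(open_stream, delays=(0.0, 1.0, 5.0), sleep=time.sleep):
    """Consume a stream, retrying on socket close or EOF.

    open_stream: zero-argument callable returning a fresh iterable of text chunks.
    delays: wait before each attempt; (0, 1, 5) mirrors "immediate, 1s, 5s".
    """
    last_error = None
    for delay in delays:
        if delay:
            sleep(delay)
        try:
            # Errors raised while iterating the stream surface here.
            return "".join(open_stream())
        except (ConnectionError, EOFError) as exc:
            last_error = exc  # partial output from this attempt is discarded
    # Unrecoverable: surface the failure instead of writing it as text.
    raise StreamFailed(f"stream failed after {len(delays)} attempts") from last_error
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. Raising `StreamFailed` at the end is the point of suggestion 2: the caller gets an exception it can act on rather than a transcript that looks healthy.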

Mitigation in our project

Until upstream ships retry/recovery:

- Poll subagent JSONL file mtime every 90s
- If stale + last event has stop_reason: None or unanswered tool_use → kill subagent + respawn with checkpoint context
- Shrink per-spawn workload to reduce at-risk window (smaller batches = shorter SSE streams)
- Pin OpenClaw version (no doctor runs)
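
The detection half of this mitigation can be sketched in Python as below. This is an illustrative sketch, not our production reaper: `last_event` and `looks_dead` are hypothetical helper names, and the code assumes each JSONL line is an object with a top-level `stop_reason` key, which may not match the real subagent transcript schema. Killing and respawning the subagent is deliberately left out.

```python
import json
import os
import time

STALE_AFTER_S = 90  # matches the 90s polling interval above


def last_event(path):
    """Return the last parseable JSON object in a JSONL file, or None."""
    last = None
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a half-written trailing line
    return last


def looks_dead(path, now=None, stale_after=STALE_AFTER_S):
    """True if the file is stale and its last event suggests a dropped stream."""
    now = time.time() if now is None else now
    if now - os.path.getmtime(path) < stale_after:
        return False  # recently written, assume still alive
    event = last_event(path)
    if event is None:
        return False  # nothing to judge from; stay conservative
    # Assumed schema: a flat "stop_reason" field on the event.
    stop_reason = event.get("stop_reason")
    # None: text cut off mid-response.
    # "tool_use" as the very last event: the tool call was never answered.
    return stop_reason is None or stop_reason == "tool_use"
```

A polling loop would call `looks_dead(path)` for each subagent transcript every 90 seconds and hand any hit to whatever kill-and-respawn logic the project uses.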

Reproducibility

Trivial: 5 of 5 long-running Opus subagent spawns in our 2026-05-18 session exhibited the failure. It is not narrowed to a specific task type; it happens across template-generation, refactoring, file-rename, and verification workloads. Failures occurred roughly 3-15 min into the spawn.

Related issues

#84053 — Background subagent completion notifications dropped silently (this issue is a more precise diagnosis of the same root cause)
#84054 — doctor --fix strips agentRuntime mapping (worsens the situation by routing subagents to the less-stable runtime path)
