claude-code: Recurring fetch() socket disconnects in Node SDK (n=81 corpus; cross-client cascade correlated with status.claude.com "Elevated errors" entry)

Hi Anthropic team,

Reporting a recurring transport-layer failure pattern in Claude Code on Windows. The pattern has four observable shapes that share one transport-layer cause; recent evidence (2026-05-24 spike day) includes a cross-client cascade (Claude Code + Claude Desktop simultaneously) that correlates with a public status.claude.com entry ("Elevated errors on Claude Opus 4.7" resolved 19:39Z), confirming this is upstream, not per-client noise.

Shape 1 — socket_closed (in-jsonl synthetic error). The model is mid-stream, the harness catches a fetch()-level disconnect, and injects a synthetic assistant record with isApiErrorMessage: true, model: "<synthetic>", stop_reason: "stop_sequence". Literal error text:

API Error: The socket connection was closed unexpectedly. For more information, pass verbose: true in the second argument to fetch()

That verbose: true hint points at the Node @anthropic-ai/sdk fetch transport layer.

Shape 2 — process_eject_inferred (CC process dies mid-stream, no record). The same transport failure at a moment that kills the entire CC process. The session jsonl simply ends mid-turn; the WezTerm tab is returned to the shell prompt; no isApiErrorMessage record is written. Detected via post-hoc reconstruction (abrupt jsonl end + process not running).

Shape 3 — user_reported_unspecified (no jsonl trace whatsoever). Users report API errors visually in the TUI that produce no isApiErrorMessage record AND no observable jsonl anomaly the watcher can detect. Possibly a separate code path in the harness; possibly errors during initial connection before the session writes anything. Captured only via Joe-side notification.

Shape 4 — Sub-agent overload returning as tool_result error. When Iji spawns a sub-agent via the harness's Agent tool and the sub-agent's API call hits 529 Overloaded / 503 / 500 / transport-class, the error returns ONLY to the parent's tool_result and never writes an isApiErrorMessage record. Sub-agents share the parent's API context but the error visibility doesn't propagate.

Corpus

I scanned every JSONL session transcript in ~/.claude/projects/<project>/ for isApiErrorMessage: true records plus inferred process-deaths from jsonl tail state, then manually appended Shapes 3 + 4 as they were observed. n = 81 total error events across 23 days (2026-05-02 → 2026-05-25). Breakdown by class:

| Class | Count | Shape |
| --- | --- | --- |
| socket_closed | 41 | 1 |
| process_eject_inferred | 19 | 2 |
| 500_server (API 500) | 15 | |
| other (unclassified) | 14 | |
| 429_throttle | 4 | |
| user_reported_unspecified | 2 | 3 |
| Sub-agent overload routing (manual capture) | 1+ | 4 |
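
The scan described above can be sketched roughly as follows. This is a minimal illustration, not the actual watcher: the function name `countApiErrors`, the crude substring and regex classification, and the assumption that `isApiErrorMessage` sits at the top level of each record are all hypothetical. Only the flag name and the literal socket error text come from the report.

```javascript
// Hypothetical helper: count synthetic API-error records in JSONL transcript text.
// Assumption: each line is one JSON object and error records carry a
// top-level `isApiErrorMessage: true` flag.
function countApiErrors(jsonlText) {
  const counts = {};
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // a truncated final line is plausible after a process death
    }
    if (record.isApiErrorMessage !== true) continue;
    // Classify on the serialized record so the sketch does not depend on
    // the exact nesting of the message text (unknown here).
    const text = JSON.stringify(record);
    let cls = "other";
    if (text.includes("socket connection was closed unexpectedly")) cls = "socket_closed";
    else if (/\b429\b/.test(text)) cls = "429_throttle";
    else if (/\b500\b/.test(text)) cls = "500_server";
    counts[cls] = (counts[cls] || 0) + 1;
  }
  return counts;
}

// Made-up sample transcript: one error record, one normal turn, one truncated tail.
const sample = [
  JSON.stringify({ isApiErrorMessage: true, message: { content: "API Error: The socket connection was closed unexpectedly." } }),
  JSON.stringify({ type: "assistant", message: { content: "normal turn" } }),
  '{"isApiErrorMessage": true, "message": {"content": "API Error: 500 Int',
].join("\n");

console.log(countApiErrors(sample));
```

Process deaths (Shape 2) leave no record by definition, so a scan like this only ever finds Shape 1 and the HTTP-status classes; the truncated last line in the sample is skipped rather than counted.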

socket_closed per day (showing the late-May spike):

| Date | Count |
| --- | --- |
| 2026-05-02 | 1 |
| 2026-05-09 | 1 |
| 2026-05-13 | 2 |
| 2026-05-17 | 1 |
| 2026-05-18 | 3 |
| 2026-05-19 | 1 |
| 2026-05-20 | 1 |
| 2026-05-21 | 1 |
| 2026-05-22 | 6 |
| 2026-05-23 | 2 |
| 2026-05-24 (incl. early 5-25 UTC) | 15 |

2026-05-24 PT (Akron, UTC-7) was a substantial spike day: 15 socket_closed + 12 process_eject_inferred + 2 user_reported_unspecified = 29 events in ~12 hours. Five sessions accumulated multiple hits today: 1b8cfaab (5 events), c1f7fbc5 (4), 1076b661 (3), 3d422756 (3), be33846a (2).

What it isn't

I enriched each event with the session's context% (used_pct from the telemetry CSV), rate-window%, and concurrent-Claude-Code-sessions count at the moment of failure. The distributions rule out three common causes:

  • Not context-position: median context-at-error = 35-46%; max observed = 75%. Never crossed the documented 80% wrap line. The model wasn't running out of room.
  • Not rate-limit throttle: median rate% = 4-7; max = 43. Far below any throttling threshold.
  • Not concurrent-load: 8 of 26 baseline socket_closed events occurred with the session alone (concurrent_sessions=1). The full distribution at baseline: {1:8, 2:5, 3:8, 4:1, 5:3, 6:1}. Solo sessions tie with three-session concurrency (8 events each) as the most common case, so the failure does not require concurrent load.
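
As a quick check on the concurrent-load bullet, the distribution quoted there can be summed and its most frequent values extracted. The variable names are illustrative; the numbers are copied from the bullet.

```javascript
// Concurrent-session distribution for the baseline socket_closed events,
// copied from the report: { concurrent_sessions: event_count }.
const distribution = { 1: 8, 2: 5, 3: 8, 4: 1, 5: 3, 6: 1 };

const total = Object.values(distribution).reduce((a, b) => a + b, 0);
const soloShare = distribution[1] / total;
const maxCount = Math.max(...Object.values(distribution));
const modes = Object.keys(distribution).filter((k) => distribution[k] === maxCount);

console.log(total);                        // 26
console.log((soloShare * 100).toFixed(1)); // "30.8"
console.log(modes);                        // [ '1', '3' ]
```

The total matches the stated 26 baseline events, and solo sessions account for roughly 31% of them, sharing the top count with the three-session case.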

What it is — upstream-correlated, now publicly confirmed

The error message names the source: Node's fetch() underneath the SDK lost the socket mid-stream. The model wasn't deciding to stop — it was streaming normally, and the connection died between two emitted tokens. The SDK does not retry mid-stream socket failures (it retries pre-stream errors like 5xx and connection-refused, but once bytes are flowing, a transport disconnect terminates the in-flight request).

Public-status correlation (the strongest single evidence point):

We set up an outside-driver poller (anthropic-status-pinger.js, runs as a Windows scheduled task) precisely because the 2026-05-24 cascade had no public Anthropic incident at the time it happened. The pinger captured:

2026-05-24T19:39:14Z — "Elevated errors on Claude Opus 4.7" — RESOLVED
2026-05-24T19:39:14Z — "Elevated errors on Sonnet 4.6" — RESOLVED

The status entry landed ~1 hour after our 18:42-18:52Z cross-client cascade (described as Worked Example 2 below). This is the same upstream event seen by Anthropic infrastructure and by our client — strong confirmation that "transport-layer failure on the client" maps to "elevated errors on the model serving" on Anthropic's side. The cascade is not per-client noise.
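
The lag between the cascade and the status entry can be checked directly from timestamps quoted in this report. The helper name `minutesBetween` is hypothetical; the three instants are the first and last deaths of the 18:42-18:52Z cascade and the time the pinger recorded the entry.

```javascript
// Timestamps taken from the report.
const firstCascadeDeath = new Date("2026-05-24T18:42:52Z");
const lastCascadeDeath = new Date("2026-05-24T18:52:10Z");
const statusEntry = new Date("2026-05-24T19:39:14Z");

// Hypothetical helper: whole minutes elapsed from instant a to instant b.
function minutesBetween(a, b) {
  return Math.floor((b.getTime() - a.getTime()) / 60000);
}

console.log(minutesBetween(firstCascadeDeath, statusEntry)); // 56
console.log(minutesBetween(lastCascadeDeath, statusEntry));  // 47
```

So "~1 hour" describes the lag measured from the start of the cascade, and 47 minutes is the lag from its last death.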

I have not yet enabled ANTHROPIC_LOG=debug or --debug api for capturing lower-level fetch traces; doing so on the next event would distinguish HTTP/2 RST_STREAM from idle-timeout from connection-reset. Happy to send a trace if it would help triage.

Shape 1 produces a synthetic error record because the CC process survives the disconnect; Shape 2 produces nothing in the jsonl because the disconnect propagates to a CC-process exit.

Worked example 1 — Two socket_closed events on 2026-05-24

  1. 08:26:42 PT — session 1076b661, mid-thinking-block emission, output_tokens=49 emitted before disconnect. Context 23%, rate 15%, 2 concurrent sessions. Cache state at error: 217k cache_read + 12k cache_create.
  2. 10:45:39 PT — session 3d422756, mid-Edit-tool-call planning text. Context 35%, rate 18%, 2 concurrent sessions.

Same literal error string for both, and for every other socket_closed event in the corpus.

Worked example 2 — Cross-client cascade 2026-05-24 11:42-11:52 PT (strongest evidence)

This is the single strongest case in the corpus because it controls for everything except the connection itself, and observably extends past one client.

11:42-11:52 PT (18:42-18:52 UTC) — 4 Claude Code sessions died over 10 minutes:

| Session | Time | Class | State at death |
| --- | --- | --- | --- |
| 1076b661 (chain-7, mid-spec-author) | 18:42:52Z | process_eject_inferred | ctx 40%, rate 24% |
| b04ff754 (sibling investigation) | 18:44:07Z | process_eject_inferred | ctx 24%, rate 24% |
| 575c62e4 (brand-new session, 43s old) | 18:44:07Z | process_eject_inferred | n/a (too young for telemetry) |
| 3d422756 (interactive diagram walk) | 18:52:10Z | process_eject_inferred | ctx 44%, rate 25% |

~11:55 PT — Claude Desktop also restarted (observed by Joe; client-side notification). This is the cross-client signal — the same upstream surface visible to both Claude Code (Node SDK) AND Claude Desktop. Not a per-client SDK bug.

Forensic ground-truth at the time of the cascade:

  • Windows Event Log: silent (no crash dumps, no OS-level signal)
  • WezTerm: alive throughout (multi-day uptime)
  • System memory: 48% (not exhausted)
  • Tailscale: healthy pre/post-event
  • status.claude.com: clean at the time of the cascade, but the "Elevated errors on Claude Opus 4.7" entry resolved 19:39Z — i.e., ~47 minutes after the last cascade death. This is the late-landing status entry the pinger was built to catch.
  • All 4 sessions died at "request-boundary" moments (between API calls or mid-stream), not during local work

A second mini-cascade at 12:56-12:58 PT (19:56-19:58 UTC) hit 2 more sessions (85a8b4e4 + 138c9162) within 2 minutes — same shape, ~1 hour after the first cascade. A solo eject at 13:27 PT (20:27 UTC) of session 7bc47e3f followed.

This pattern (multiple sessions ejecting within minutes of each other, two of them in the same second, plus a Claude Desktop restart a few minutes later and a public status entry resolving about 47 minutes after the last death) is incompatible with per-client SDK noise. It points at edge infrastructure (Cloudflare, load balancer, backend rebalancing, or similar) producing a transient reset visible to multiple connected clients simultaneously.
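
One way to make "cascade" operational is to group events whose timestamps fall close together. The sketch below is only an illustration: the helper name `findCascades` and the 10-minute gap threshold are assumptions, not part of the watcher described in this report. The event times come from the cascade table, plus the 10:45:39 PT socket_closed event from worked example 1 (17:45:39Z at UTC-7) as an isolated control.

```javascript
// Hypothetical helper: group events into clusters where each event occurs
// within `maxGapMs` of the previous one. Input need not be sorted.
function findCascades(events, maxGapMs) {
  const sorted = [...events].sort((a, b) => Date.parse(a.time) - Date.parse(b.time));
  const clusters = [];
  for (const ev of sorted) {
    const current = clusters[clusters.length - 1];
    const prev = current && current[current.length - 1];
    if (prev && Date.parse(ev.time) - Date.parse(prev.time) <= maxGapMs) {
      current.push(ev);
    } else {
      clusters.push([ev]);
    }
  }
  // A lone event is not a cascade, so keep only multi-event clusters.
  return clusters.filter((c) => c.length > 1);
}

const events = [
  { session: "3d422756", time: "2026-05-24T18:52:10Z" },
  { session: "1076b661", time: "2026-05-24T18:42:52Z" },
  { session: "b04ff754", time: "2026-05-24T18:44:07Z" },
  { session: "575c62e4", time: "2026-05-24T18:44:07Z" },
  { session: "3d422756", time: "2026-05-24T17:45:39Z" }, // isolated control event
];

const cascades = findCascades(events, 10 * 60 * 1000);
console.log(cascades.length, cascades[0].map((e) => e.session));
```

With this threshold the four deaths between 18:42:52Z and 18:52:10Z form a single cluster and the earlier event is excluded. A tighter threshold would split off the 18:52:10Z death, so the choice of gap is a judgment call.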

Worked example 3 — Single-session multi-relaunch persistence

Session 3d422756 hit transport failures on 2026-05-24 at 17:45, 18:52, and 19:52 UTC: three events in about 2 hours on the same logical session_id across relaunches. Same conversation, same ~/.claude/projects/.../3d422756...jsonl, three consecutive transport failures. If transient network were the dominant cause, retries on the same session shouldn't immediately hit the same failure mode. Something about this conversation's edge routing or session identity correlates with the failure, which would be a useful signal for the SDK / edge team if it's reproducible on their side.

Worked example 4 — 2026-05-24 sustained spike day (the report's trigger)

The full day produced 29 transport-class events on a baseline of ~3.5/day:

  • 15 socket_closed (vs baseline 1-3/day)
  • 12 process_eject_inferred (the entire dataset for this shape; concentrated on this single day)
  • 2 user_reported_unspecified (errors with no jsonl trace; Joe-reported)
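
The ~3.5/day figure is consistent with dividing the whole corpus by the observation window. That derivation is an assumption, since the report does not state how the baseline was computed, and an average over all 23 days includes the spike day itself.

```javascript
const totalEvents = 81;     // corpus size stated in the report
const days = 23;            // 2026-05-02 → 2026-05-25, as stated
const spikeDayEvents = 29;  // events on 2026-05-24

const perDay = totalEvents / days;
console.log(perDay.toFixed(1));                    // "3.5"
console.log((spikeDayEvents / perDay).toFixed(1)); // "8.2"
```

On that reading, the spike day ran at roughly eight times the corpus-wide daily average.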

Multi-session multi-relaunch was the dominant shape — 5 sessions hit ≥2 events each:

| Session | Hit count | Class mix |
| --- | --- | --- |
| 1b8cfaab (SEED_FPS Sprint 1 orchestrator) | 5 | 1 process_eject + 4 socket_closed across 22:18 → 03:22Z |
| c1f7fbc5 (walkable-forge-spatial research) | 4 | 2 process_eject + 1 socket_closed + 1 user_reported across 22:02 → 04:52Z |
| 1076b661 (chain-7 spec author) | 3 | 1 socket_closed + 2 process_eject |
| 3d422756 (interactive diagram walk) | 3 | All socket_closed (see Worked Example 3) |
| be33846a (sprint-1-r2b chain-2 author) | 2 | 1 process_eject + 1 socket_closed at 01:42 + 01:52Z |

The 1b8cfaab + c1f7fbc5 patterns mirror 3d422756's shape — repeated transport failures on the same session across relaunches, hours apart. We saw this same shape reproduce on multiple independent conversations on the same day.

Mitigation that empirically worked: Joe-orchestrator paste of a 5-point tactical-awareness brief (smaller atomic tool calls, continuous disk checkpointing, drop planning prose before tool_use, don't retry into the same hole, note errors inline). Applied first to 3d422756; the session continued successfully and completed its work. Codified as a durable team memory; cited by 1b8cfaab and c1f7fbc5 as they hit subsequent failures and recovered.

Ask

  1. Is this a known pattern? Particularly: are mid-stream socket disconnects retried by the SDK, or do they terminate the in-flight request without recovery? Looking at the symptom, it appears to be the latter.
  2. The 2026-05-24 spike day + the status.claude.com correlation: does a Cloudflare / edge load-balancer log of 18:42-18:52Z (the cross-client cascade window) and ~19:30Z (when the public status entry was being raised) correlate with what your infra saw?
  3. HTTP/2 RST_STREAM or connection-level close? — happy to enable ANTHROPIC_LOG=debug for the next event if you'd like a fetch-layer trace. (Patch is staged; can apply within minutes if useful.)
  4. Mid-stream retry policy — is there an SDK config knob to retry on stream-disconnect, or is that fundamentally incompatible with streaming response semantics? If incompatible, the harness-layer mitigation we've built (tactical-brief playbook + outside-driver watcher + status-pinger) is probably the right shape on our side.
  5. Sub-agent overload returning as tool_result: when an Agent-tool sub-agent hits 529/503/500/transport-class, the error returns ONLY to the parent's tool_result and never writes an isApiErrorMessage record. Is that intentional? Could the sub-agent error surface as a synthetic record too, so the standing observability layer catches it?
  6. What environment details would help triage? — happy to send Claude Code version, Node version, OS (Windows 11 26200), network (Tailscale + ISP), @anthropic-ai/sdk version, and the full enriched 81-event corpus (NDJSON) on request.
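
To make question 4 concrete, this is the kind of harness-level behaviour being asked about: restart the whole request when a stream dies partway through. It is a hypothetical sketch, not an SDK feature or the harness's real code. `collectWithRestart` and `flakyStream` are invented names, and `startStream` stands for any caller-supplied function that returns an async iterable of text chunks.

```javascript
// Hypothetical harness-level wrapper, NOT an SDK API.
async function collectWithRestart(startStream, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const chunks = [];
    try {
      for await (const chunk of startStream(attempt)) chunks.push(chunk);
      return { text: chunks.join(""), attempts: attempt };
    } catch (err) {
      // Partial output is discarded: a restarted generation is a new sample,
      // so it cannot be spliced onto the interrupted one.
      lastError = err;
    }
  }
  throw lastError;
}

// Simulated stream that loses its connection mid-stream on the first attempt.
async function* flakyStream(attempt) {
  yield "Hello, ";
  if (attempt === 1) throw new Error("The socket connection was closed unexpectedly.");
  yield "world";
}

collectWithRestart(flakyStream).then((result) => console.log(result));
```

The sketch omits backoff and treats every error as retryable. It also shows why mid-stream retry is awkward: anything already emitted, including a tool call that may have been acted on, has to be thrown away or reconciled by the caller.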

Workaround in place

I've shipped a standing observability + mitigation layer on my end:

  • All historical events backfilled into a single NDJSON corpus (81 events, 23 days)
  • Continuous outside-driver watcher (Windows scheduled task) capturing new events live; zero impact on Claude Code's API budget
  • Outside-driver Anthropic-status pinger (5-min poll) capturing late-landing public-status entries for correlation
  • Tactical-awareness brief encoded as a durable team memory; observably keeps sessions making progress through the failure pattern
  • Investigation session log + reference doc + watcher + pinger scripts available if there's a public-share venue that helps

Thanks for any pointers. Happy to be the canary for any debug toggles you'd like activated.

Sincerely, Joe
