openclaw - [Bug]: Anthropic provider: UND_ERR_SOCKET keep-alive failures trigger silent mid-turn fallback to OpenAI/Codex





Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

A long-running OpenClaw gateway intermittently fails Anthropic provider requests with UND_ERR_SOCKET after 250-400 ms. Each failure triggers the configured fallback chain, which can silently swap the chat from claude-opus-4-7 to gpt-5.1-codex mid-conversation. Users perceive the agent suddenly changing personality and tool-call style mid-turn.

Steps to reproduce

  1. Start OpenClaw 2026.5.20 with agents.list[].model.primary = anthropic/claude-opus-4-7 and a fallback chain that includes openai/gpt-5.1-codex.
  2. Use the agent normally for several hours.
  3. After ~1-3 hours of uptime, observe [model-fetch] error provider=anthropic api=anthropic-messages model=claude-opus-4-7 elapsedMs=254 ... causeName=SocketError causeCode=UND_ERR_SOCKET message=fetch failed in the gateway log.
  4. Observe the next model_fallback_decision log line: requestedProvider=anthropic requestedModel=claude-opus-4-7 decision=candidate_failed attempt=1 fallbackStepFromFailureReason=timeout fallbackStepFromFailureDetail="fetch failed".
  5. Eventually the fallback succeeds on openai/gpt-5.1-codex; the chat thread continues on the new model for that turn.
  6. Confirm Anthropic itself is healthy by running curl https://api.anthropic.com/v1/messages with the same key from the same machine — returns HTTP 200 in ~1.4s.

Expected behavior

Anthropic provider requests should complete (HTTP 200 from api.anthropic.com) using the configured anthropic:default API key, as they do when the same key is used via curl from the same host at the same time. The agent should stay on its configured primary model (anthropic/claude-opus-4-7) without silently swapping to a fallback OpenAI model mid-turn.

Actual behavior

In a single ~4 hour window (2026-05-27 08:55 to 12:58 PDT) the local gateway log contains:

  • 14 UND_ERR_SOCKET failures against api.anthropic.com (anthropic-messages API), all with elapsedMs between 254 and 371 ms. That is far too short to be a server-side timeout.
  • 29 model_fallback_decision events in the same window. The fallback ladder traverses anthropic/claude-opus-4-7 → anthropic/claude-opus-4-6 → openai/gpt-5.1-codex until a candidate succeeds.
  • Failures cluster into ~4 short bursts separated by 30-90 min of clean operation (a pattern typical of keep-alive pool aging).
  • A direct probe of the same endpoint with the same Anthropic API key via curl returns HTTP 200 in ~1.4s, so the network path, key, and Anthropic service are all healthy.

The user observes the assistant suddenly switching style/voice/tool-call patterns mid-turn (Claude → Codex behavior). The configured fallback list is doing exactly what it's configured to do; the bug is that the primary provider's transport is dropping sockets and failing the first attempt.

OpenClaw version

2026.5.20 (e510042)

Operating system

macOS 26.5 (Apple Silicon Mac mini)

Install method

npm global (Node v25.6.1)

Model

anthropic/claude-opus-4-7 (primary). Fallback chain configured: anthropic/claude-opus-4-7 → anthropic/claude-opus-4-6 → openai/gpt-5.1-codex → anthropic/claude-sonnet-4-6 → anthropic/claude-haiku-4-5 → openai/gpt-5.4-pro

Provider / routing chain

openclaw (embedded agent runtime / pi harness) -> anthropic provider (https://api.anthropic.com/v1/messages) directly via undici fetch. No intermediate gateway/proxy. Auth profile anthropic:default (api_key mode). On failure, the request fails over to openai/gpt-5.1-codex via the Codex app-server runtime.

Additional provider/model setup details

Effective config (chief agent, redacted):

{
  agents: {
    defaults: {
      model: {
        primary: "anthropic/claude-opus-4-7",
        fallbacks: ["anthropic/claude-opus-4-7", "anthropic/claude-opus-4-6", "openai/gpt-4o", "openai/gpt-5.1-codex", "anthropic/claude-sonnet-4-6", "anthropic/claude-haiku-4-5", "openai/gpt-5.4-pro"]
      },
      models: {
        "anthropic/claude-opus-4-7": { alias: "opus", params: { cacheRetention: "short", context1m: true } },
        "anthropic/claude-opus-4-6": { params: { cacheRetention: "short", context1m: true } },
        "anthropic/claude-sonnet-4-6": { params: { cacheRetention: "short", context1m: true } },
        "anthropic/claude-haiku-4-5": { params: { cacheRetention: "short" } },
        "openai/gpt-5.1-codex": { alias: "GPT" }
      }
    },
    list: [
      {
        id: "chief",
        model: { primary: "anthropic/claude-opus-4-7", fallbacks: [/* same chain */] }
      }
    ]
  },
  auth: {
    profiles: {
      "anthropic:default": { provider: "anthropic", mode: "api_key" }
    }
  }
}

No agentRuntime.id override on the anthropic models, so they should be routed through the embedded Anthropic provider (which uses undici.fetch under the hood).

Logs, screenshots, and evidence

All evidence comes from the local NDJSON log `/tmp/openclaw/openclaw-2026-05-27.log` on the affected machine. Excerpts are quoted verbatim; full lines have been redacted for log shipping but are available on request.

### Representative socket failure


```
2026-05-27T12:58:42.845-07:00
[model-fetch] error provider=anthropic api=anthropic-messages
  model=claude-opus-4-7 elapsedMs=254
  name=TypeError code=undefined
  causeName=SocketError causeCode=UND_ERR_SOCKET
  message=fetch failed
```


### Representative fallback decision chain


```
2026-05-27T13:01:16 model_fallback_decision
  requestedProvider=anthropic requestedModel=claude-opus-4-7
  candidateProvider=anthropic candidateModel=claude-opus-4-7
  attempt=1 total=6
  decision=candidate_failed
  fallbackStepFromFailureReason=timeout
  fallbackStepFromFailureDetail="fetch failed"
  previousAttempts=[{provider:"anthropic", model:"claude-opus-4-7", reason:"timeout", status:408, errorHash:"sha256:e2c73a8fd237"}]

2026-05-27T13:01:16 model_fallback_decision
  attempt=2 candidateProvider=anthropic candidateModel=claude-opus-4-6
  decision=candidate_succeeded
```


### Timeline of UND_ERR_SOCKET failures today (Pacific)


```
08:55:20  09:06:34  09:35:42  09:36:09  09:39:28  09:48:00
10:31:43  10:31:46  10:47:37
11:18:00  11:20:57
12:53:39  12:56:01  12:58:42
```


14 total in ~4 hours, clustered into 4 bursts. Clean periods between bursts last 30-90 min.
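
The burst count can be re-derived from the timestamps above. The following is a minimal TypeScript sketch, assuming a 30-minute gap marks the boundary between bursts (the report does not state which threshold it used); `toSeconds` and `groupIntoBursts` are illustrative names, not part of OpenClaw.

```typescript
// Failure times copied from the timeline above (same day, Pacific).
const failureTimes: string[] = [
  "08:55:20", "09:06:34", "09:35:42", "09:36:09", "09:39:28", "09:48:00",
  "10:31:43", "10:31:46", "10:47:37",
  "11:18:00", "11:20:57",
  "12:53:39", "12:56:01", "12:58:42",
];

// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Start a new burst whenever the gap to the previous failure
// is at least gapSeconds (assumed boundary, see lead-in).
function groupIntoBursts(times: string[], gapSeconds: number): string[][] {
  const bursts: string[][] = [];
  let previous: number | undefined;
  for (const t of times) {
    const now = toSeconds(t);
    if (previous === undefined || now - previous >= gapSeconds) {
      bursts.push([]);
    }
    bursts[bursts.length - 1].push(t);
    previous = now;
  }
  return bursts;
}

const bursts = groupIntoBursts(failureTimes, 30 * 60);
const spanSeconds = toSeconds("12:58:42") - toSeconds("08:55:20");

console.log(bursts.map((b) => b.length)); // [ 6, 3, 2, 3 ]
console.log((spanSeconds / 3600).toFixed(2)); // 4.06
```

With this threshold the 14 failures fall into four bursts over roughly four hours. The grouping is sensitive to the threshold: the gap between 10:47:37 and 11:18:00 is only 30 min 23 s, so a larger boundary would merge the second and third bursts.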

### Confirmation Anthropic is healthy (same machine, same key, same minute)


```
curl -s -o /dev/null -w "HTTP %{http_code} elapsed %{time_total}s\n" \
  --max-time 10 https://api.anthropic.com/v1/messages \
  -X POST -H "x-api-key: $ANTHROPIC_KEY" \
  -H "anthropic-version: 2023-06-01" -H "content-type: application/json" \
  -d '{"model":"claude-opus-4-7","max_tokens":5,"messages":[{"role":"user","content":"hi"}]}'

HTTP 200 elapsed 1.414930s
```

Impact and severity

Affected: any OpenClaw user with a fallback chain that includes a different-provider model (e.g., Anthropic primary + OpenAI fallback). With pure-Anthropic fallback chains the user would just see a slightly slower turn; with cross-provider chains the model swaps mid-turn.

Severity: High for trust/UX. The agent visibly changes voice, persona, and tool-call style mid-conversation; in our case the operator believed the model was being intentionally switched and asked us to investigate. Also burns extra spend on the secondary provider when the primary should have succeeded.

Frequency: Intermittent but persistent. 14 socket failures in 4 hours on a single workstation. Clusters of 1-3 failures spaced 30-90 min apart. Recurs every time the gateway runs for >1 hour.

Consequence: Mid-conversation model swap that is invisible to the user (the fallback is silent by design). For premium-tier operators paying for Anthropic specifically, this is also a billing/quality regression.
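
Whether a given configuration is exposed to the cross-provider swap described under "Affected" can be checked mechanically. This is a sketch that assumes model ids of the form `provider/model`, as in the config above; `providerOf` and `crossProviderFallbacks` are hypothetical helpers, not OpenClaw APIs.

```typescript
// Extract the provider prefix from an id such as "anthropic/claude-opus-4-7".
function providerOf(modelId: string): string {
  const slash = modelId.indexOf("/");
  return slash === -1 ? modelId : modelId.slice(0, slash);
}

// Return the fallback entries whose provider differs from the primary's provider.
function crossProviderFallbacks(primary: string, fallbacks: string[]): string[] {
  const primaryProvider = providerOf(primary);
  return fallbacks.filter((id) => providerOf(id) !== primaryProvider);
}

// Fallback list copied from the effective config in this report.
const exposed = crossProviderFallbacks("anthropic/claude-opus-4-7", [
  "anthropic/claude-opus-4-7",
  "anthropic/claude-opus-4-6",
  "openai/gpt-4o",
  "openai/gpt-5.1-codex",
  "anthropic/claude-sonnet-4-6",
  "anthropic/claude-haiku-4-5",
  "openai/gpt-5.4-pro",
]);

console.log(exposed); // [ 'openai/gpt-4o', 'openai/gpt-5.1-codex', 'openai/gpt-5.4-pro' ]
```

An empty result would mean a transport failure can only slow the turn down; a non-empty result means it can change the model family mid-turn.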

Additional information

Likely root cause

This looks like Node.js undici reusing a stale keep-alive socket from its connection pool. Anthropic (or Cloudflare in front of it) closes idle connections after some interval (typically 30-60s). On the next request, undici tries to reuse the closed socket and surfaces UND_ERR_SOCKET instead of transparently establishing a new one.

The "bursts of 1-3 failures separated by long quiet periods" pattern is consistent with: pool warm → first reused socket fails → next call opens a fresh socket → that one survives the burst → quiet period → idle close → repeat. The sub-second elapsedMs (254-371) is too short to be a server-side or DNS timeout.
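
The idle-close race described above can be modelled in a few lines. This is a deliberately simplified illustration, not undici's actual pool logic: the function name is hypothetical, and the 30 s and 60 s values are assumptions taken from the "typically 30-60s" estimate.

```typescript
type ReuseOutcome = "fresh-connection" | "reused-ok" | "reused-stale";

// Model what happens to the next request after a pooled socket has sat idle.
//   idleMs            - how long the socket has been idle
//   clientKeepAliveMs - how long the client pool keeps an idle socket before evicting it
//   serverIdleCloseMs - how long the server (or an edge proxy) tolerates an idle connection
function nextRequestOutcome(
  idleMs: number,
  clientKeepAliveMs: number,
  serverIdleCloseMs: number,
): ReuseOutcome {
  // The pool already dropped the socket, so a new connection is opened.
  if (idleMs >= clientKeepAliveMs) return "fresh-connection";
  // The peer closed the socket but the pool still hands it out.
  if (idleMs >= serverIdleCloseMs) return "reused-stale";
  return "reused-ok";
}

// Client keeps sockets for 60 s, server closes after 30 s, request arrives after 45 s idle.
console.log(nextRequestOutcome(45_000, 60_000, 30_000)); // reused-stale
// Same traffic, but the client evicts after 20 s.
console.log(nextRequestOutcome(45_000, 20_000, 30_000)); // fresh-connection
```

In this model the stale case can only occur when the client-side idle timeout exceeds the server-side one, which is why lowering the client timeout or retrying on a fresh socket both remove the failure.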

Suggested fixes (in order of preference)

  1. Add automatic retry-once on UND_ERR_SOCKET in provider-transport-fetch (or wherever the Anthropic provider's undici.fetch lives). Simplest and most localized; matches what most undici-using libraries do.
  2. Disable keep-alive on the anthropic-messages fetch agent (force fresh connection per request). Slight latency hit, no more failures.
  3. Lower keepAliveTimeout below ~30s on the undici dispatcher used by the anthropic provider so the pool always evicts before Anthropic / Cloudflare does.
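
Fix 1 can be sketched as a small wrapper. This is a minimal sketch under assumptions: the helper names are hypothetical, the error shape (a `TypeError` whose `cause` carries `code: "UND_ERR_SOCKET"`) is inferred from the log excerpt in this report, and where such a wrapper would live inside OpenClaw is not known.

```typescript
// True when an error matches the failure seen in the log: an error whose
// `cause` object carries code "UND_ERR_SOCKET".
function isUndiciSocketError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const cause = (err as { cause?: unknown }).cause;
  if (typeof cause !== "object" || cause === null) return false;
  return (cause as { code?: unknown }).code === "UND_ERR_SOCKET";
}

// Run attempt(); if it rejects with a socket error, run it exactly one more time.
// Any other error, and a second socket error, propagate to the caller.
async function retryOnceOnSocketError<T>(attempt: () => Promise<T>): Promise<T> {
  try {
    return await attempt();
  } catch (err) {
    if (!isUndiciSocketError(err)) throw err;
    return await attempt();
  }
}

// Intended usage (the callback must build a fresh request each time):
//   const res = await retryOnceOnSocketError(() => fetch(url, init));
```

The callback form lets the second attempt rebuild the request, which matters when the body is a stream that cannot be replayed. Because `/v1/messages` is a POST, a retry is only safe if the first request never reached the server; the sub-second failures reported here suggest that, but it is an assumption worth confirming before enabling retries.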

Good references for the same class of bug in other Node.js fetch-based libs:

  • nodejs/undici #2348 "keep-alive socket reuse hits SocketError on idle close"
  • vercel/fetch keep-alive disable for long-running serverless workers

Last known good version

Unknown — the affected operator can’t pin a specific upgrade as the trigger, but the issue surfaced after the gateway started running for multi-hour stretches on the new mac-mini box. Probably present in any recent version since the Anthropic provider switched to native undici.fetch with default keep-alive.

