openclaw - ✅(Solved) Fix [Bug]: rawError="terminated" does not trigger model fallback (classifyFailoverReason returns null) [2 pull requests, 6 comments, 4 participants]

GitHub stats
openclaw/openclaw#56875 · Fetched 2026-04-08 01:46:38
Comments: 6 · Participants: 4 · Timeline events: 7 · Reactions: 0
Timeline (top): commented ×4, cross-referenced ×2, referenced ×1

When an upstream provider terminates the connection (TCP reset / stream abort), OpenClaw receives rawError=terminated. This error string is not matched by any entry in ERROR_PATTERNS, so classifyFailoverReason("terminated") returns null and the model fallback chain is never invoked.

Error Message

error=terminated rawError=terminated

The same runId retries 4 times on the same model (Opus). The fallback chain (Gemini → GPT → Sonnet) is never attempted. After exhausting retries, the user sees the error.

Root Cause

In src/agents/pi-embedded-helpers/failover-matches.ts (compiled to reply-payloads-dedupe-*.js in v2026.3.28):

// ERROR_PATTERNS.timeout includes:
"timeout", "timed out", "connection error", "network error",
"fetch failed", "socket hang up", /\beconn(?:refused|reset|aborted)\b/i, ...

// But does NOT include:
"terminated"   // ← missing

When classifyFailoverReason("terminated") is called:

  1. isRateLimitErrorMessage("terminated") → false
  2. isOverloadedErrorMessage("terminated") → false
  3. isServerErrorMessage("terminated") → false (v2026.3.28 added "connection reset" here, but not "terminated")
  4. isTimeoutErrorMessage("terminated") → false ← the gap
  5. Returns null → no fallback triggered
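
The lookup order above can be sketched as follows. This is a simplified model, not the project's source: only the function names, the timeout entries, and "connection reset" come from the issue. The reason labels, the rate-limit and overloaded entries, and the substring matching rule are assumptions made for illustration.

```typescript
type Pattern = string | RegExp;
type FailoverReason = "rate_limit" | "overloaded" | "server_error" | "timeout";

// Only the timeout entries and "connection reset" are quoted from the issue;
// the other entries are placeholders, not the project's real pattern tables.
const ERROR_PATTERNS: Record<FailoverReason, Pattern[]> = {
  rate_limit: ["rate limit"], // placeholder
  overloaded: ["overloaded"], // placeholder
  server_error: ["connection reset"],
  timeout: [
    "timeout", "timed out", "connection error", "network error",
    "fetch failed", "socket hang up", /\beconn(?:refused|reset|aborted)\b/i,
    // "terminated" is absent here, which is the gap described above
  ],
};

// Assumed matching rule: case-insensitive substring for strings, test() for regexes.
function matchesAny(message: string, patterns: Pattern[]): boolean {
  const lower = message.toLowerCase();
  return patterns.some((p) =>
    typeof p === "string" ? lower.indexOf(p) !== -1 : p.test(message),
  );
}

const isRateLimitErrorMessage = (m: string): boolean => matchesAny(m, ERROR_PATTERNS.rate_limit);
const isOverloadedErrorMessage = (m: string): boolean => matchesAny(m, ERROR_PATTERNS.overloaded);
const isServerErrorMessage = (m: string): boolean => matchesAny(m, ERROR_PATTERNS.server_error);
const isTimeoutErrorMessage = (m: string): boolean => matchesAny(m, ERROR_PATTERNS.timeout);

// Checks each category in the order listed in the issue; null means
// "no failover reason", so the caller never moves to the next model.
function classifyFailoverReason(message: string): FailoverReason | null {
  if (isRateLimitErrorMessage(message)) return "rate_limit";
  if (isOverloadedErrorMessage(message)) return "overloaded";
  if (isServerErrorMessage(message)) return "server_error";
  if (isTimeoutErrorMessage(message)) return "timeout";
  return null;
}
```

In this sketch, classifyFailoverReason("terminated") falls through all four checks and returns null, while classifyFailoverReason("socket hang up") is classified as a timeout.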

Fix Action

Workaround

Manually patch reply-payloads-dedupe-*.js to add "terminated" to ERROR_PATTERNS.timeout. The patch must be reapplied after every upgrade.

PR fix notes

PR #56886: fix(agents): add "terminated" to timeout ERROR_PATTERNS for model fallback

Description (problem / solution / changelog)

Summary

  • rawError="terminated" was not matched by any entry in ERROR_PATTERNS.timeout
  • classifyFailoverReason("terminated") returned null, so the fallback chain was never triggered
  • Added "terminated" to the timeout pattern list, consistent with existing "socket hang up" and ECONNRESET handling

Root Cause

In src/agents/pi-embedded-helpers/failover-matches.ts, ERROR_PATTERNS.timeout covers connection-closure errors like "socket hang up", ECONNREFUSED, and ECONNRESET, but omits "terminated", the raw error string emitted when an upstream proxy (NVIDIA NIM, Cloudflare AI Gateway, Azure API Management) aborts a TCP stream mid-response.

When the primary model returns error=terminated, the gateway called classifyFailoverReason("terminated"), which returned null. The fallback chain was skipped entirely; the primary model retried 4× on itself before surfacing the error to the user.

Change

src/agents/pi-embedded-helpers/failover-matches.ts:

  • Before: "terminated" not in ERROR_PATTERNS.timeout
  • After: "terminated" added as the first entry in ERROR_PATTERNS.timeout

This is a one-line addition consistent with PR #19077 (which added ENETUNREACH/ECONNREFUSED) and the existing treatment of "socket hang up".
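
As a rough illustration of the list after this change, the sketch below places "terminated" first and checks a message against it. The list contains only the entries quoted in the issue (the real one is longer). The constant name TIMEOUT_PATTERNS and the matching rule are assumptions, not the project's implementation.

```typescript
type Pattern = string | RegExp;

// Timeout patterns after the change. "terminated" is the new first entry; the
// remaining entries are the ones quoted in the issue, and the real list is longer.
const TIMEOUT_PATTERNS: Pattern[] = [
  "terminated",
  "timeout",
  "timed out",
  "service unavailable",
  "connection error",
  "network error",
  "fetch failed",
  "socket hang up",
  /\beconn(?:refused|reset|aborted)\b/i,
];

// Assumed matcher: case-insensitive substring for strings, test() for regexes.
function isTimeoutErrorMessage(message: string): boolean {
  const lower = message.toLowerCase();
  return TIMEOUT_PATTERNS.some((p) =>
    typeof p === "string" ? lower.indexOf(p) !== -1 : p.test(message),
  );
}
```

Under this assumed matcher the position of the new entry does not change the result, since any single match is enough. A substring rule would also match longer messages that merely contain the word, such as "worker terminated by operator". The PR does not say whether the real matcher behaves this way.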

Test plan

  • Configure a provider behind a proxy that can terminate connections (e.g. NVIDIA inference API)
  • Trigger a terminated error from the primary model
  • Verify the fallback chain (Gemini → GPT → Sonnet) is now invoked
  • Verify normal requests (no error) are unaffected

Fixes #56875

Changed files

  • src/agents/pi-embedded-helpers/failover-matches.ts (modified, +1/-0)

PR #56895: fix: trigger failover on terminated transport errors

Description (problem / solution / changelog)

Summary

  • classify plain terminated transport failures as retryable timeout errors
  • add regression coverage for terminated and rawError=terminated samples

Why

Proxy-backed providers can surface abrupt upstream disconnects as terminated, but this currently bypasses the fallback chain entirely. Mapping it to the existing timeout bucket restores expected model failover behavior for a common transient failure mode.
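
The regression coverage described in the summary might look roughly like the table-driven check below. The PR does not show its test code or framework, so this uses a plain function. The local isTimeoutErrorMessage is a stand-in with a deliberately reduced pattern, where the real test would presumably exercise the project's own helper.

```typescript
// Stand-in for the project's helper, reduced to a few patterns for the sketch.
function isTimeoutErrorMessage(message: string): boolean {
  return /\bterminated\b|timed? ?out|socket hang up/i.test(message);
}

interface Sample {
  input: string;
  expected: boolean;
}

// The first two inputs are the samples named in the PR summary; the other two
// are added here as a positive and a negative control.
const samples: Sample[] = [
  { input: "terminated", expected: true },
  { input: "rawError=terminated", expected: true },
  { input: "socket hang up", expected: true },
  { input: "invalid api key", expected: false },
];

// Returns a description of every sample whose classification is wrong.
function runRegression(): string[] {
  const failures: string[] = [];
  for (const { input, expected } of samples) {
    const actual = isTimeoutErrorMessage(input);
    if (actual !== expected) {
      failures.push(`${input}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```

An empty array from runRegression() means every sample was classified as expected.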

Closes #56875

Changed files

  • src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts (modified, +4/-0)
  • src/agents/pi-embedded-helpers/failover-matches.ts (modified, +1/-0)


Bug type

Behaviour bug

Summary

When an upstream provider terminates the connection (TCP reset / stream abort), OpenClaw receives rawError=terminated. This error string is not matched by any entry in ERROR_PATTERNS, so classifyFailoverReason("terminated") returns null and the model fallback chain is never invoked.

Steps to reproduce

  1. Configure a provider behind a proxy that can terminate connections (tested with NVIDIA inference API at inference-api.nvidia.com/v1)
  2. Configure model fallback chain:
    • primary: claude-opus-4-6
    • fallbacks: [gemini-3.1-pro-preview, gpt-5.4, claude-sonnet-4-5]
  3. Send a message when the upstream provider is under load
  4. Observe gateway logs

Expected behaviour

When the primary model returns error=terminated, the gateway should classify this as a transient timeout-class error and attempt the next model in the fallback chain — the same way "connection error", "socket hang up", and "fetch failed" are handled today.

Actual behaviour

[agent/embedded] embedded run agent end: runId=5cac9994-... isError=true
  model=azure/anthropic/claude-opus-4-6 provider=nvidia-proxy
  error=terminated rawError=terminated

The same runId retries 4 times on the same model (Opus). The fallback chain (Gemini → GPT → Sonnet) is never attempted. After exhausting retries, the user sees the error.
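
The behaviour above can be modelled with a small simulation. This is a hypothetical model of the retry loop, not the gateway's code. It assumes an unclassified error is retried on the same model until a per-model limit is reached, and a classified error moves to the next model. The limit of 4 comes from the report, which does not say whether it includes the first attempt.

```typescript
type Classifier = (rawError: string) => string | null;
type ErrorsByModel = { [model: string]: string | undefined };

// errors maps a model name to the rawError it always returns; models that are
// absent from the map are treated as succeeding. Returns the models attempted.
function simulateRun(
  chain: string[],
  errors: ErrorsByModel,
  classify: Classifier,
  maxAttemptsPerModel: number = 4,
): string[] {
  const attempted: string[] = [];
  let index = 0;
  let attemptsOnModel = 0;
  while (index < chain.length) {
    const model = chain[index];
    attempted.push(model);
    attemptsOnModel += 1;
    const rawError = errors[model];
    if (rawError === undefined) break; // request succeeded
    if (classify(rawError) !== null) {
      index += 1; // classified error: move to the next model
      attemptsOnModel = 0;
    } else if (attemptsOnModel >= maxAttemptsPerModel) {
      break; // unclassified error and retries exhausted: error surfaces
    }
  }
  return attempted;
}

const chain = ["claude-opus-4-6", "gemini-3.1-pro-preview", "gpt-5.4", "claude-sonnet-4-5"];
const errors: ErrorsByModel = { "claude-opus-4-6": "terminated" };

// Classifier that never recognises "terminated" (the reported behaviour).
const beforeFix = simulateRun(chain, errors, () => null);
// Classifier that maps "terminated" to a timeout reason (the proposed behaviour).
const afterFix = simulateRun(chain, errors, (e) => (e === "terminated" ? "timeout" : null));
```

In this model, beforeFix holds four attempts on claude-opus-4-6 and nothing else, while afterFix holds one attempt on claude-opus-4-6 followed by one on gemini-3.1-pro-preview.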

Root cause

In src/agents/pi-embedded-helpers/failover-matches.ts (compiled to reply-payloads-dedupe-*.js in v2026.3.28):

// ERROR_PATTERNS.timeout includes:
"timeout", "timed out", "connection error", "network error",
"fetch failed", "socket hang up", /\beconn(?:refused|reset|aborted)\b/i, ...

// But does NOT include:
"terminated"   // ← missing

When classifyFailoverReason("terminated") is called:

  1. isRateLimitErrorMessage("terminated") → false
  2. isOverloadedErrorMessage("terminated") → false
  3. isServerErrorMessage("terminated") → false (v2026.3.28 added "connection reset" here, but not "terminated")
  4. isTimeoutErrorMessage("terminated") → false ← the gap
  5. Returns null → no fallback triggered

Suggested fix

Add "terminated" to ERROR_PATTERNS.timeout:

  timeout: [
+   "terminated",
    "timeout",
    "timed out",
    "service unavailable",

This is consistent with the existing pattern — "terminated" is functionally equivalent to "connection error" and "socket hang up" (all represent unexpected connection closure by the upstream).

Version

2026.3.28 (also reproducible on 2026.3.23-2)

OS

Ubuntu 24.04.3 LTS / Linux 6.17.0

Model

azure/anthropic/claude-opus-4-6 (via NVIDIA inference API)

Provider / routing chain

openclaw gateway → nvidia inference-api.nvidia.com/v1 (openai-completions) → anthropic claude

Impact

This affects any deployment where the LLM provider sits behind a proxy that can terminate connections — including NVIDIA NIM, Cloudflare AI Gateway, Azure API Management, and corporate inference endpoints.

The terminated error is the most common transient failure in proxy environments, yet it's the one error that doesn't trigger fallback. Users configured a fallback chain specifically for resilience, but it never activates for the most frequent failure mode.

Workaround

Manually patch reply-payloads-dedupe-*.js to add "terminated" to ERROR_PATTERNS.timeout. The patch must be reapplied after every upgrade.

Related issues

  • #29429 — network_error stop reason not triggering failover (closed, similar root cause)
  • #48213 — error=terminated with openai-codex (open, documents symptom but not fallback gap)
  • #45834 — generic provider errors not triggering fallback (open)
  • #18868 → PR #19077 — added ENETUNREACH/ECONNREFUSED to failover (merged, precedent for this fix)
  • #28051 — terminated leaking into user-visible replies (open)

extent analysis

Fix Plan

To resolve the issue, add the "terminated" error string to the ERROR_PATTERNS.timeout array in src/agents/pi-embedded-helpers/failover-matches.ts.

  • Update the ERROR_PATTERNS.timeout array:
timeout: [
  "terminated",
  "timeout",
  "timed out",
  "service unavailable",
  "connection error",
  "network error",
  "fetch failed",
  "socket hang up",
  /\beconn(?:refused|reset|aborted)\b/i,
  // ... other error patterns
]
  • Alternatively, patch the compiled reply-payloads-dedupe-*.js file directly; this patch must be reapplied after every upgrade.

Verification

To verify that the fix worked, follow these steps:

  1. Update the ERROR_PATTERNS.timeout array with the "terminated" error string.
  2. Restart the OpenClaw service.
  3. Simulate a connection termination by the upstream provider (e.g., using the NVIDIA inference API).
  4. Send a message to the OpenClaw gateway.
  5. Observe the gateway logs to ensure that the fallback chain is triggered when the primary model returns a "terminated" error.
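
When checking the logs in the last step, a small helper can pull the key=value fields out of each entry and list the models that were tried. The sample entry is the one quoted in this issue. The helper names are hypothetical, and the sketch assumes later attempts are logged with the same fields, which the issue does not confirm.

```typescript
// Extracts key=value fields from one gateway log entry.
function parseLogFields(entry: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(entry)) !== null) {
    fields[match[1]] = match[2];
  }
  return fields;
}

// Lists the distinct model values across entries, in first-seen order.
function distinctModels(entries: string[]): string[] {
  const models: string[] = [];
  for (const entry of entries) {
    const model = parseLogFields(entry).model;
    if (model !== undefined && models.indexOf(model) === -1) {
      models.push(model);
    }
  }
  return models;
}

// Log entry as quoted in the issue (runId is truncated in the source).
const sampleEntry = [
  "[agent/embedded] embedded run agent end: runId=5cac9994-... isError=true",
  "  model=azure/anthropic/claude-opus-4-6 provider=nvidia-proxy",
  "  error=terminated rawError=terminated",
].join("\n");

const sampleFields = parseLogFields(sampleEntry);
```

Here sampleFields.rawError is "terminated" and sampleFields.model is "azure/anthropic/claude-opus-4-6". If the fix works and the logging assumption holds, distinctModels over the entries for one runId should return more than one model after a terminated error.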

Extra Tips

  • Make sure to test the fix in a non-production environment before deploying it to production.
  • Consider adding further patterns to the ERROR_PATTERNS.timeout array for other transient errors that classifyFailoverReason currently does not classify.
  • Review related issues (e.g., #29429, #48213, #45834) to ensure that similar errors are handled correctly.
