openclaw: Model fallback chain inconsistent when session has model override; CLI exits code 1 with no useful error

openclaw/openclaw#48955 (fetched 2026-04-08 00:50:33)


Summary

When a session has a per-session model override (e.g. claude-opus-4-6 instead of the default gpt-5.4), the fallback chain behaves inconsistently and the openclaw agent CLI can exit with code 1 while only printing the LCM plugin load line — no error message, no context about what went wrong.

Environment

  • OpenClaw: v2026.3.13 (61d171a)
  • OS: macOS (arm64)
  • Node: v25.8.1

Config

{
  "agents.defaults.model": {
    "primary": "openai-codex/gpt-5.4",
    "fallbacks": ["anthropic/claude-sonnet-4-6", "openrouter/auto"]
  }
}

Session was overridden to anthropic/claude-opus-4-6.
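The candidate order reported in the scenarios below is consistent with the session override taking the primary's slot in front of the configured fallbacks. A minimal sketch of that assumed resolution follows; `buildCandidateChain` and the shape of the `modelConfig` object are hypothetical and are not taken from OpenClaw's source.

```javascript
// Hypothetical helper, not part of OpenClaw's API.
// Assumption: a session override replaces the primary; fallbacks are unchanged.
function buildCandidateChain(modelConfig, sessionOverride) {
  const first = sessionOverride ?? modelConfig.primary;
  const chain = [first, ...(modelConfig.fallbacks ?? [])];
  // Drop repeats so an override that is also listed as a fallback is tried once.
  return [...new Set(chain)];
}

const modelConfig = {
  primary: "openai-codex/gpt-5.4",
  fallbacks: ["anthropic/claude-sonnet-4-6", "openrouter/auto"],
};

const overrideChain = buildCandidateChain(modelConfig, "anthropic/claude-opus-4-6");
const defaultChain = buildCandidateChain(modelConfig);
console.log(overrideChain);
console.log(defaultChain);
```

Under this assumption both chains have three candidates, which is why the report expects three attempts in every scenario.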

What happened

Scenario 1: Session override (opus) with full fallback chain working (12:03 GMT+3)

12:03:06 opus → failed (timeout)
12:03:06 fallback decision: opus → sonnet
12:03:37 sonnet → failed (overloaded)
12:03:37 fallback decision: sonnet → openrouter/auto
12:04:13 openrouter/auto → succeeded ✅

All 3 candidates tried. Worked as expected.

Scenario 2: Session override (opus) with incomplete fallback chain (13:58 GMT+3)

13:58:18 opus → "service temporarily overloaded"
13:58:26 opus → "Internal server error"
13:58:37 opus → "service temporarily overloaded"
13:58:50 opus → "Internal server error"
13:58:50 failover decision: opus → sonnet (reason: timeout)
13:58:57 sonnet → "service temporarily overloaded"
13:59:10 sonnet → "Internal server error"
13:59:27 sonnet → "Internal server error"
13:59:57 candidate_succeeded sonnet (next=none)

Key issue: openrouter/auto was never attempted as a fallback, even though the two Anthropic models failed a combined 7 times (4 for opus, 3 for sonnet). Instead, sonnet kept being retried and succeeded at 13:59:57, after ~30s of errors. During that window, the openclaw agent CLI had already exited with code 1.
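One way to spot this pattern is to count failures per candidate and list the configured candidates that never appear in the attempt log. The sketch below is a hypothetical diagnostic: `summarizeAttempts` is an invented name, and the line format mirrors the excerpt above, which may be the reporter's summary rather than the raw gateway log format.

```javascript
// Hypothetical diagnostic; the line format is copied from the excerpt in this
// report and may differ from what the gateway actually writes.
function summarizeAttempts(logLines, chain) {
  const failures = Object.fromEntries(chain.map((name) => [name, 0]));
  const seen = new Set();
  for (const line of logLines) {
    // Failure lines look like: 13:58:18 opus → "service temporarily overloaded"
    const failed = /^\d{2}:\d{2}:\d{2} (\S+) → "/.exec(line);
    // Success lines look like: 13:59:57 candidate_succeeded sonnet (next=none)
    const succeeded = /candidate_succeeded (\S+)/.exec(line);
    const name = failed?.[1] ?? succeeded?.[1];
    if (name === undefined || !Object.hasOwn(failures, name)) continue;
    seen.add(name);
    if (failed) failures[name] += 1;
  }
  return { failures, neverAttempted: chain.filter((name) => !seen.has(name)) };
}

const scenario2 = [
  '13:58:18 opus → "service temporarily overloaded"',
  '13:58:26 opus → "Internal server error"',
  '13:58:37 opus → "service temporarily overloaded"',
  '13:58:50 opus → "Internal server error"',
  "13:58:50 failover decision: opus → sonnet (reason: timeout)",
  '13:58:57 sonnet → "service temporarily overloaded"',
  '13:59:10 sonnet → "Internal server error"',
  '13:59:27 sonnet → "Internal server error"',
  "13:59:57 candidate_succeeded sonnet (next=none)",
];

const summary = summarizeAttempts(scenario2, ["opus", "sonnet", "openrouter/auto"]);
console.log(summary);
```

For the Scenario 2 excerpt this reports 4 failures for opus, 3 for sonnet, and openrouter/auto as never attempted.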

Scenario 3: Default model (gpt-5.4) fallback (03:02 and 05:02 GMT+3)

gpt-5.4 → failed → sonnet → failed → openrouter/auto → failed → none (all exhausted)

All 3 candidates tried correctly when using the default model.

Two bugs

Bug 1: CLI exits code 1 with no useful error

When openclaw agent --agent spark --message "..." is run and the underlying model call fails during retries, the CLI outputs only:

[plugins] [lcm] Plugin loaded (enabled=true, db=/.../lcm.db, threshold=0.75)

It then exits with code 1: no error message, no indication that the model provider is down, no suggestion to retry. The --json flag does surface status: ok in some cases, but the plain CLI gives the operator nothing to work with.

Expected: The CLI should print a human-readable error like:

Error: model anthropic/claude-opus-4-6 failed after 4 retries (overloaded). 
Fallback anthropic/claude-sonnet-4-6 also failed (3 retries). 
Attempting openrouter/auto...

Or at minimum, the exit should include the provider error text.
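As an illustration of how such a message could be assembled from a list of failed attempts, here is a minimal sketch. `formatFailureChain`, the attempt record shape, and the "All candidates exhausted." wording are assumptions made for this example; only the three-line message format comes from the expected output above.

```javascript
// Hypothetical formatter; the attempt record shape is an assumption.
function formatFailureChain(attempts, nextCandidate) {
  const lines = attempts.map((attempt, index) =>
    index === 0
      ? `Error: model ${attempt.model} failed after ${attempt.retries} retries (${attempt.reason}).`
      : `Fallback ${attempt.model} also failed (${attempt.retries} retries).`
  );
  // Tell the operator what happens next, or that nothing is left to try.
  lines.push(nextCandidate ? `Attempting ${nextCandidate}...` : "All candidates exhausted.");
  return lines.join("\n");
}

const message = formatFailureChain(
  [
    { model: "anthropic/claude-opus-4-6", retries: 4, reason: "overloaded" },
    { model: "anthropic/claude-sonnet-4-6", retries: 3, reason: "overloaded" },
  ],
  "openrouter/auto"
);
// Diagnostics go to stderr so stdout stays clean for --json consumers.
console.error(message);
```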

Bug 2: Fallback chain may skip candidates under session override

When the session has a model override, the fallback chain appears to sometimes skip the final openrouter/auto candidate and instead retry the second candidate (sonnet) until it succeeds or times out. This is inconsistent with the behavior when the default model is primary, where all 3 candidates are always attempted in order.

In Scenario 2, if sonnet had not eventually recovered, the request would have failed entirely — even though openrouter/auto (a completely different provider) was available and would likely have worked.
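The traversal the report expects can be sketched as a loop that gives each candidate a fixed retry budget and then always advances, no matter how the first candidate was chosen. This is not OpenClaw's implementation: `runWithFallback` and `callModel` are hypothetical names, and backoff between retries is omitted for brevity.

```javascript
// Hypothetical traversal; `callModel` stands in for the real provider call.
async function runWithFallback(chain, callModel, maxRetries = 3) {
  const failures = [];
  for (const model of chain) {
    let lastError;
    for (let attempt = 1; attempt <= maxRetries; attempt += 1) {
      try {
        const result = await callModel(model);
        return { model, result, failures };
      } catch (error) {
        lastError = error;
      }
    }
    // Retry budget spent: record the failure and always move to the next candidate.
    failures.push({ model, retries: maxRetries, reason: String(lastError?.message ?? lastError) });
  }
  const exhausted = new Error("all candidates exhausted");
  exhausted.failures = failures;
  throw exhausted;
}

// Usage with a stub in which both Anthropic models are down.
const tried = [];
const stubCall = async (model) => {
  tried.push(model);
  if (model.startsWith("anthropic/")) throw new Error("overloaded");
  return "ok";
};

const outcomePromise = runWithFallback(
  ["anthropic/claude-opus-4-6", "anthropic/claude-sonnet-4-6", "openrouter/auto"],
  stubCall,
  2
);
outcomePromise.then((outcome) => console.log(outcome.model, tried));
```

With this shape, the Scenario 2 outage would reach openrouter/auto once sonnet's retries were spent instead of waiting for sonnet to recover.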

Impact

  • Operators running multi-agent setups see silent failures with no diagnostic info
  • Agent-to-agent communication appears broken when it is actually a transient provider outage
  • The fallback chain — which the operator carefully configured with 3 providers for resilience — is not being fully utilized
  • Time wasted debugging "why is openclaw agent failing" when the answer is "Anthropic was down for 2 minutes and the fallback chain did not complete"

Suggested fixes

  1. CLI error output: Always surface the provider error chain to stderr before exiting non-zero
  2. Consistent fallback traversal: When model N fails after retries, always advance to model N+1 in the chain, regardless of whether the primary was a session override or the default
  3. Gateway log correlation: Include the runId in CLI output so operators can cross-reference gateway.err.log

extent analysis

Fix Plan

To address the issues, we will implement the following fixes:

  1. CLI Error Output:

    • Modify the openclaw agent CLI to print a human-readable error message when a model fails after retries.
    • Include the provider error text in the exit message.
  2. Consistent Fallback Traversal:

    • Update the fallback chain logic to always advance to the next model in the chain when the current model fails after retries, regardless of whether the primary model is a session override or the default.
  3. Gateway Log Correlation:

    • Include the runId in the CLI output to enable operators to cross-reference the gateway.err.log.

Code Changes

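// Illustrative sketch only: these function names are not OpenClaw's actual API.

// Print a human-readable error, including the provider error text, to stderr.
// Setting process.exitCode (rather than calling process.exit) lets pending
// output flush before the process ends with a non-zero status.
function handleError(model, error, retries) {
  const reason = error instanceof Error ? error.message : String(error);
  console.error(`Error: model ${model} failed after ${retries} retries (${reason}).`);
  process.exitCode = 1;
}

// Return the candidate to try after currentModel, or null when the chain is exhausted.
// A session override (or the configured primary) is normally not listed in
// `fallbacks`; indexOf() then returns -1, so traversal starts at fallbacks[0]
// instead of ending the chain early.
function getNextModel(currentModel, fallbacks) {
  const nextIndex = fallbacks.indexOf(currentModel) + 1;
  return nextIndex < fallbacks.length ? fallbacks[nextIndex] : null;
}

// Include the runId in the CLI output. It is written to stderr so that stdout
// stays parseable when the caller expects JSON.
function printOutput(runId, output) {
  console.error(`Run ID: ${runId}`);
  console.log(output);
}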

Configuration Changes

No configuration changes are required for these fixes.

Temporary Workarounds

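The issue does not confirm a workaround. Until the fixes ship, the details it mentions suggest two places to look: gateway.err.log for the underlying provider errors, and the --json flag, which surfaces a status in some cases.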

Verification

To verify that the fixes work:

  1. Test the openclaw agent CLI with a model that fails after retries and verify that a human-readable error message is printed.
  2. Test the fallback chain with a session override and verify that all models in the chain are attempted in order.
  3. Verify that the runId is included in the CLI output and can be used to cross-reference the gateway.err.log.
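The first verification step can be automated with a small check on captured CLI output. The sketch below is hypothetical: `hasUsefulError` is an invented helper, and the only detail taken from this report is that the silent failure prints nothing but a line starting with `[plugins]`.

```javascript
// Hypothetical check; only the "[plugins]" prefix comes from the output quoted
// in this report.
function hasUsefulError(exitCode, outputText) {
  if (exitCode === 0) return true; // nothing to explain on success
  const meaningful = outputText
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("[plugins]"));
  return meaningful.length > 0;
}

const pluginLine = "[plugins] [lcm] Plugin loaded (enabled=true, db=/.../lcm.db, threshold=0.75)";
const silentFailure = hasUsefulError(1, `${pluginLine}\n`);
const explainedFailure = hasUsefulError(
  1,
  `${pluginLine}\nError: model anthropic/claude-opus-4-6 failed after 4 retries (overloaded).`
);
console.log({ silentFailure, explainedFailure });
```

A regression test could feed the combined stdout and stderr of a failing run into this check and require it to return true.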

Extra Tips

  • Regularly review and update the fallback chain configuration to ensure it remains effective.
  • Monitor the gateway.err.log for errors and adjust the fallback chain as needed.
  • Consider implementing additional logging and monitoring to detect and respond to model failures and fallback chain issues.
