openclaw: Model fallback chain inconsistent when session has model override; CLI exits code 1 with no useful error [1 comment, 2 participants]

openclaw • 2026-03-17 11:25:11

openclaw/openclaw#48955 • Fetched 2026-04-08 00:50:33

Author

erkcet

Participants

erkcet

Ryce

When a session has a per-session model override (e.g. claude-opus-4-6 instead of the default gpt-5.4), the fallback chain behaves inconsistently and the openclaw agent CLI can exit with code 1 while only printing the LCM plugin load line — no error message, no context about what went wrong.

Summary

Environment

OpenClaw: v2026.3.13 (61d171a)
OS: macOS (arm64)
Node: v25.8.1

Config

{
  "agents.defaults.model": {
    "primary": "openai-codex/gpt-5.4",
    "fallbacks": ["anthropic/claude-sonnet-4-6", "openrouter/auto"]
  }
}

Session was overridden to anthropic/claude-opus-4-6.
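
To make the expected candidate order concrete, here is a minimal sketch of how a session override could combine with the configured defaults. The helper name buildCandidateChain and the rule that an override replaces only the primary are assumptions inferred from the scenarios in this report, not OpenClaw's actual resolution code.

```javascript
// Hypothetical helper, not part of OpenClaw's API.
// Assumption: a session override replaces the primary and keeps the
// configured fallbacks in order, skipping duplicates.
function buildCandidateChain(modelConfig, sessionOverride) {
  const head = sessionOverride ?? modelConfig.primary;
  const chain = [head];
  for (const fallback of modelConfig.fallbacks ?? []) {
    if (!chain.includes(fallback)) chain.push(fallback);
  }
  return chain;
}

// Values copied from the config above.
const modelConfig = {
  primary: "openai-codex/gpt-5.4",
  fallbacks: ["anthropic/claude-sonnet-4-6", "openrouter/auto"],
};

// With the override: opus, then sonnet, then openrouter/auto.
console.log(buildCandidateChain(modelConfig, "anthropic/claude-opus-4-6"));
// Without an override: gpt-5.4, then sonnet, then openrouter/auto.
console.log(buildCandidateChain(modelConfig, undefined));
```

Under this assumption both chains have three candidates, which is the behavior the report expects in every scenario.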

What happened

Scenario 1: Session override (opus) with full fallback chain working (12:03 GMT+3)

12:03:06 opus → failed (timeout)
12:03:06 fallback decision: opus → sonnet
12:03:37 sonnet → failed (overloaded)
12:03:37 fallback decision: sonnet → openrouter/auto
12:04:13 openrouter/auto → succeeded ✅

All 3 candidates tried. Worked as expected.

Scenario 2: Session override (opus) with incomplete fallback chain (13:58 GMT+3)

13:58:18 opus → "service temporarily overloaded"
13:58:26 opus → "Internal server error"
13:58:37 opus → "service temporarily overloaded"
13:58:50 opus → "Internal server error"
13:58:50 failover decision: opus → sonnet (reason: timeout)
13:58:57 sonnet → "service temporarily overloaded"
13:59:10 sonnet → "Internal server error"
13:59:27 sonnet → "Internal server error"
13:59:57 candidate_succeeded sonnet (next=none)

Key issue: openrouter/auto was never attempted as a fallback, even though the two Anthropic models failed a combined 7 times (4 for opus, 3 for sonnet). Instead, sonnet was retried and eventually succeeded after ~30s of errors. By then, the openclaw agent CLI had already exited with code 1.
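
The gap can be checked mechanically from the log. The sketch below parses the Scenario 2 lines as they are written in this report and lists configured candidates that never appear as an attempt. The function names and the short model names (opus, sonnet) are illustrative assumptions; real gateway logs may use a different format.

```javascript
// Log lines copied from Scenario 2, in the simplified form used in this report.
const scenario2Log = [
  '13:58:18 opus → "service temporarily overloaded"',
  '13:58:26 opus → "Internal server error"',
  '13:58:37 opus → "service temporarily overloaded"',
  '13:58:50 opus → "Internal server error"',
  "13:58:50 failover decision: opus → sonnet (reason: timeout)",
  '13:58:57 sonnet → "service temporarily overloaded"',
  '13:59:10 sonnet → "Internal server error"',
  '13:59:27 sonnet → "Internal server error"',
  "13:59:57 candidate_succeeded sonnet (next=none)",
];

// Expected order under the session override (short names are an assumption).
const expectedChain = ["opus", "sonnet", "openrouter/auto"];

// Count error lines of the form "<time> <model> → <message>".
function countFailures(lines) {
  const counts = {};
  for (const line of lines) {
    const match = line.match(/^\S+ (\S+) → /);
    if (match) counts[match[1]] = (counts[match[1]] ?? 0) + 1;
  }
  return counts;
}

// A candidate counts as attempted if it has an error line or a success line.
function neverAttempted(lines, chain) {
  const failures = countFailures(lines);
  return chain.filter(
    (model) =>
      !(model in failures) &&
      !lines.some((line) => line.includes(`candidate_succeeded ${model}`))
  );
}

console.log(countFailures(scenario2Log)); // { opus: 4, sonnet: 3 }
console.log(neverAttempted(scenario2Log, expectedChain)); // [ 'openrouter/auto' ]
```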

Scenario 3: Default model (gpt-5.4) fallback (03:02 and 05:02 GMT+3)

gpt-5.4 → failed → sonnet → failed → openrouter/auto → failed → none (all exhausted)

All 3 candidates tried correctly when using the default model.

Two bugs

Bug 1: CLI exits code 1 with no useful error

When openclaw agent --agent spark --message "..." is run and the underlying model call fails during retries, the CLI outputs only:

[plugins] [lcm] Plugin loaded (enabled=true, db=/.../lcm.db, threshold=0.75)

Then exits code 1. No error message, no indication that the model provider is down, no suggestion to retry. The --json flag does surface status: ok in some cases but the plain CLI gives the operator nothing to work with.

Expected: The CLI should print a human-readable error like:

Error: model anthropic/claude-opus-4-6 failed after 4 retries (overloaded). 
Fallback anthropic/claude-sonnet-4-6 also failed (3 retries). 
Attempting openrouter/auto...

Or at minimum, the exit should include the provider error text.
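
As a rough illustration of the expected output, the message could be built from a list of per-candidate failures. The function name formatFailureChain and the record fields (model, retries, reason) are hypothetical; the report does not show how OpenClaw tracks failures internally.

```javascript
// Hypothetical record for one failed candidate (field names are assumptions):
// { model: string, retries: number, reason: string }
function formatFailureChain(failures, nextCandidate) {
  if (failures.length === 0) return "";
  const [first, ...rest] = failures;
  const lines = [
    `Error: model ${first.model} failed after ${first.retries} retries (${first.reason}).`,
  ];
  for (const failure of rest) {
    lines.push(`Fallback ${failure.model} also failed (${failure.retries} retries).`);
  }
  // Say what happens next so the operator is never left with silence.
  lines.push(
    nextCandidate ? `Attempting ${nextCandidate}...` : "All candidates exhausted."
  );
  return lines.join("\n");
}

const message = formatFailureChain(
  [
    { model: "anthropic/claude-opus-4-6", retries: 4, reason: "overloaded" },
    { model: "anthropic/claude-sonnet-4-6", retries: 3, reason: "overloaded" },
  ],
  "openrouter/auto"
);

// Diagnostics go to stderr so stdout stays clean for regular output.
console.error(message);
```

A CLI using something like this would print the message before setting a non-zero exit code, so the exit is never silent.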

Bug 2: Fallback chain may skip candidates under session override

When the session has a model override, the fallback chain appears to sometimes skip the final openrouter/auto candidate and instead retry the second candidate (sonnet) until it succeeds or times out. This is inconsistent with the behavior when the default model is primary, where all 3 candidates are always attempted in order.

In Scenario 2, if sonnet had not eventually recovered, the request would have failed entirely — even though openrouter/auto (a completely different provider) was available and would likely have worked.
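
The traversal the report asks for can be sketched as a loop with a bounded retry budget per candidate. The names runWithFallbacks and callModel, and the default of 3 attempts, are assumptions for illustration; this is not OpenClaw's implementation.

```javascript
// Hypothetical sketch. `callModel` stands in for the real provider call and
// is expected to throw (or reject) on failure.
async function runWithFallbacks(chain, callModel, maxAttempts = 3) {
  const failures = [];
  for (const model of chain) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        const result = await callModel(model);
        return { ok: true, model, result, failures };
      } catch (error) {
        lastError = error;
      }
    }
    // Retry budget spent: record the failure and always move on to the next
    // candidate, whether the chain started from an override or the default.
    failures.push({
      model,
      attempts: maxAttempts,
      reason: String(lastError?.message ?? lastError),
    });
  }
  return { ok: false, failures };
}

// Simulated Anthropic outage: both Anthropic models fail, openrouter/auto works.
const attempted = [];
async function fakeCallModel(model) {
  attempted.push(model);
  if (model.startsWith("anthropic/")) {
    throw new Error("service temporarily overloaded");
  }
  return `reply from ${model}`;
}

runWithFallbacks(
  ["anthropic/claude-opus-4-6", "anthropic/claude-sonnet-4-6", "openrouter/auto"],
  fakeCallModel
).then((outcome) => {
  console.log(outcome.model); // openrouter/auto
  console.log(attempted.length); // 7 (3 + 3 failed attempts, then 1 success)
});
```

Capping attempts per candidate is the design choice that matters here: a candidate that keeps failing cannot hold the request indefinitely while a different provider is still untried.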

Impact

- Operators running multi-agent setups see silent failures with no diagnostic info
- Agent-to-agent communication appears broken when it is actually a transient provider outage
- The fallback chain, which the operator carefully configured with 3 providers for resilience, is not being fully utilized
- Time wasted debugging "why is openclaw agent failing" when the answer is "Anthropic was down for 2 minutes and the fallback chain did not complete"

Suggested fixes

- CLI error output: Always surface the provider error chain to stderr before exiting non-zero
- Consistent fallback traversal: When model N fails after retries, always advance to model N+1 in the chain, regardless of whether the primary was a session override or the default
- Gateway log correlation: Include the runId in CLI output so operators can cross-reference gateway.err.log

extent analysis

Fix Plan

To address the issues, we will implement the following fixes:

CLI Error Output:
- Modify the openclaw agent CLI to print a human-readable error message when a model fails after retries.
- Include the provider error text in the exit message.
Consistent Fallback Traversal:
- Update the fallback chain logic to always advance to the next model in the chain when the current model fails after retries, regardless of whether the primary model is a session override or the default.
Gateway Log Correlation:
- Include the runId in the CLI output to enable operators to cross-reference the gateway.err.log.

Code Changes

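// Modify the error handling in the openclaw agent CLI.
// `error` is the provider error (an Error object or a plain string).
function handleError(model, error, retries) {
  const reason = error instanceof Error ? error.message : String(error);
  // Include the provider error text so the operator can see what went wrong
  console.error(`Error: model ${model} failed after ${retries} retries (${reason}).`);
  process.exitCode = 1;
}

// Update the fallback chain logic.
// Returns the next model to try, or null when the chain is exhausted.
// A model that is not in `fallbacks` (the primary or a session override)
// has index -1, so it advances to the first fallback instead of ending the chain.
function getNextModel(currentModel, fallbacks) {
  const currentIndex = fallbacks.indexOf(currentModel);
  if (currentIndex === fallbacks.length - 1) {
    return null; // No more models in the chain (also covers an empty list)
  }
  return fallbacks[currentIndex + 1];
}

// Include the runId in the CLI output.
// The run ID is written to stderr so that stdout carries only the result.
function printOutput(runId, output) {
  console.error(`Run ID: ${runId}`);
  console.log(output);
}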

Configuration Changes

No configuration changes are required for these fixes.

Temporary Workarounds

None needed, as the fixes address the root causes of the issues.

Verification

To verify that the fixes work:

- Test the openclaw agent CLI with a model that fails after retries and verify that a human-readable error message is printed.
- Test the fallback chain with a session override and verify that all models in the chain are attempted in order.
- Verify that the runId is included in the CLI output and can be used to cross-reference the gateway.err.log.

Extra Tips

- Regularly review and update the fallback chain configuration to ensure it remains effective.
- Monitor the gateway.err.log for errors and adjust the fallback chain as needed.
- Consider implementing additional logging and monitoring to detect and respond to model failures and fallback chain issues.
