openclaw - Embedded runtime: model fallback chain breaks at intermediate candidates instead of walking to the last entry



Summary

In embedded runtime mode, when a model in the middle of the fallbacks array fails, the fallback chain does not continue to the next candidates. Instead, it reports chain_exhausted at the point of failure, even though there are more models remaining in the configured fallback list.

Environment

  • OpenClaw Version: 2026.5.22 (a374c3a)
  • Runtime: embedded (all agents configured with runtime.type: "embedded")
  • Primary model: bailian/qwen3.6-plus
  • Configured fallbacks (11 models):
    nvidia/z-ai/glm-5.1
    nvidia/moonshotai/kimi-k2.6
    bailian/kimi-k2.5
    nvidia/deepseek-ai/deepseek-v4-flash
    nvidia/minimaxai/minimax-m2.7
    bailian/glm-5
    nvidia/deepseek-ai/deepseek-v4-pro
    bailian/qwen3.7-max
    bailian/MiniMax-M2.5
    github-copilot/gpt-5.5
    github-copilot/claude-sonnet-4.6

Observed Behavior (from gateway logs)

Event 1 — 2026-05-26 15:04:56 (bailian/qwen3-coder-plus)

{
  "event": "embedded_run_failover_decision",
  "provider": "bailian",
  "model": "qwen3-coder-plus",
  "decision": "surface_error",
  "failoverReason": "timeout",
  "fallbackConfigured": false
}

qwen3-coder-plus (used by subagents) is not covered by the default model fallback chain, so when it timed out the error was surfaced directly.

Event 2 — 2026-05-26 15:07:41 (nvidia/moonshotai/kimi-k2.6)

// embedded layer decides to fallback
{
  "event": "embedded_run_failover_decision",
  "provider": "nvidia",
  "model": "moonshotai/kimi-k2.6",
  "decision": "fallback_model",
  "failoverReason": "timeout",
  "fallbackConfigured": true,
  "status": 408,
  "errorPreview": "500 Internal server error: unhashable type: 'dict'"
}

// model-fallback layer shows chain exhausted
{
  "event": "model_fallback_decision",
  "decision": "candidate_failed",
  "candidateModel": "moonshotai/kimi-k2.6",
  "attempt": 1,
  "total": 1,
  "reason": "timeout",
  "fallbackStepFinalOutcome": "chain_exhausted",
  "fallbackConfigured": false
}

Key observation: kimi-k2.6 is candidate #2 in the fallbacks array, yet the log shows "attempt": 1 with "total": 1 and chain_exhausted. The fallback chain did not continue to bailian/kimi-k2.5 (candidate #3) or to any of the 8 models after it, so 9 configured candidates were never tried.
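The mismatch can be checked mechanically against exported gateway logs. The following is a minimal sketch, assuming the events are available as parsed JSON objects with the field names shown in the excerpts above; `find_premature_exhaustion` is a hypothetical helper written for this report, not part of OpenClaw.

```python
import json

# The 11 fallbacks listed under "Environment", in configured order.
CONFIGURED_FALLBACKS = [
    "nvidia/z-ai/glm-5.1",
    "nvidia/moonshotai/kimi-k2.6",
    "bailian/kimi-k2.5",
    "nvidia/deepseek-ai/deepseek-v4-flash",
    "nvidia/minimaxai/minimax-m2.7",
    "bailian/glm-5",
    "nvidia/deepseek-ai/deepseek-v4-pro",
    "bailian/qwen3.7-max",
    "bailian/MiniMax-M2.5",
    "github-copilot/gpt-5.5",
    "github-copilot/claude-sonnet-4.6",
]


def find_premature_exhaustion(events, configured_fallbacks):
    """Return chain_exhausted events whose candidate still has successors.

    Hypothetical helper: it relies only on field names visible in the logs.
    """
    suspicious = []
    for event in events:
        if event.get("event") != "model_fallback_decision":
            continue
        if event.get("fallbackStepFinalOutcome") != "chain_exhausted":
            continue
        candidate = event.get("candidateModel", "")
        # The log omits the provider prefix, so match on the path suffix.
        positions = [
            index
            for index, ref in enumerate(configured_fallbacks)
            if ref.endswith("/" + candidate)
        ]
        if not positions:
            continue
        remaining = len(configured_fallbacks) - (positions[0] + 1)
        if remaining > 0:
            suspicious.append(
                {
                    "candidate": candidate,
                    "position": positions[0] + 1,
                    "remaining": remaining,
                    "reported_total": event.get("total"),
                }
            )
    return suspicious


# The model-fallback layer event from Event 2, copied from the log.
EVENT_2 = json.loads(
    """
    {
      "event": "model_fallback_decision",
      "decision": "candidate_failed",
      "candidateModel": "moonshotai/kimi-k2.6",
      "attempt": 1,
      "total": 1,
      "reason": "timeout",
      "fallbackStepFinalOutcome": "chain_exhausted",
      "fallbackConfigured": false
    }
    """
)

report = find_premature_exhaustion([EVENT_2], CONFIGURED_FALLBACKS)
print(report)
```

For Event 2 the helper flags kimi-k2.6 at position 2 with 9 configured candidates still untried, while the event itself reports a total of 1.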

Expected Behavior

Per model-failover.md, OpenClaw should walk the entire configured fallback chain:

"If that provider is exhausted with a failover-worthy error, move to the next model candidate."

"If every candidate fails, throw a FallbackSummaryError"

The chain should only stop when all candidates have been tried or a non-failover-worthy error (abort, context overflow, user cancel) occurs.
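The expected walk can be sketched as follows. This is an illustration of the documented behavior under stated assumptions, not OpenClaw's implementation: `run_with_fallbacks`, `CandidateError`, the `invoke` callback, and the reason strings in `NON_FAILOVER_REASONS` are hypothetical names, and `FallbackSummaryError` here is a local stand-in for the error named in the docs.

```python
class FallbackSummaryError(Exception):
    """Local stand-in for the error named in model-failover.md."""

    def __init__(self, attempts):
        super().__init__(f"all {len(attempts)} candidates failed")
        self.attempts = attempts


class CandidateError(Exception):
    """Hypothetical per-candidate failure carrying a reason string."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


# Assumed labels for the non-failover-worthy cases listed above.
NON_FAILOVER_REASONS = {"abort", "context_overflow", "user_cancel"}


def run_with_fallbacks(primary, fallbacks, invoke):
    """Try the primary, then every fallback in order, until one succeeds."""
    attempts = []
    for model in [primary, *fallbacks]:
        try:
            return invoke(model)
        except CandidateError as err:
            attempts.append((model, err.reason))
            if err.reason in NON_FAILOVER_REASONS:
                raise  # stop immediately, do not try later candidates
            # Failover-worthy (e.g. timeout): continue with the next candidate.
    raise FallbackSummaryError(attempts)


tried = []


def flaky_invoke(model):
    """Simulated provider call: only bailian/kimi-k2.5 answers."""
    tried.append(model)
    if model == "bailian/kimi-k2.5":
        return f"response from {model}"
    raise CandidateError("timeout")


result = run_with_fallbacks(
    "bailian/qwen3.6-plus",
    [
        "nvidia/z-ai/glm-5.1",
        "nvidia/moonshotai/kimi-k2.6",
        "bailian/kimi-k2.5",
        "nvidia/deepseek-ai/deepseek-v4-flash",
    ],
    flaky_invoke,
)
print(result)
print(tried)
```

In this simulation the timeout on kimi-k2.6 does not end the walk: the loop proceeds to bailian/kimi-k2.5, which is the step missing from the observed logs.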

Analysis

The logs suggest that when a fallback candidate itself fails (as opposed to the primary), the embedded run treats that candidate as a new standalone invocation with its own modelOverrideSource, which then has no fallback chain configured (fallbackConfigured: false). This breaks the chain walk described in the docs.

This may be specific to runtime.type: "embedded" — all agents in my config use embedded runtime, so I cannot verify whether the ACP runtime behaves differently.
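To make the hypothesis concrete, here is a toy model of how such a symptom could arise. It is a guess at the shape of the problem, not code taken from OpenClaw: `resolve_candidates`, `describe`, and the override value `"fallback"` are hypothetical, and only the field names `total` and `fallbackConfigured` come from the logs.

```python
def resolve_candidates(model, fallbacks, override_source=None):
    """Hypothetical resolver: an overridden model is treated as standalone."""
    if override_source is not None:
        return [model]
    return [model, *fallbacks]


def describe(candidates):
    """Summarize a chain using the field names seen in the gateway logs."""
    return {
        "total": len(candidates),
        "fallbackConfigured": len(candidates) > 1,
    }


fallbacks = [
    "nvidia/z-ai/glm-5.1",
    "nvidia/moonshotai/kimi-k2.6",
    "bailian/kimi-k2.5",
]

# Normal entry: the primary carries the whole configured chain.
normal = describe(resolve_candidates("bailian/qwen3.6-plus", fallbacks))

# Suspected faulty entry: the failing candidate is re-resolved as an override.
reentered = describe(
    resolve_candidates(
        "nvidia/moonshotai/kimi-k2.6", fallbacks, override_source="fallback"
    )
)

print(normal)
print(reentered)
```

Under this assumption the re-entered run sees a chain of length 1 with no fallbacks configured, which matches the "total": 1 and "fallbackConfigured": false values in Event 2.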

Reproduction

  1. Configure agents.defaults.model.primary to a provider that is currently unreachable.
  2. Configure agents.defaults.model.fallbacks with multiple candidates.
  3. Trigger an agent turn (embedded runtime).
  4. Observe: the first or second fallback candidate fails, but the chain does not continue to the remaining candidates.
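A configuration fragment for steps 1 and 2 might look like the sketch below, built as a Python dictionary and printed as JSON. Only the key path agents.defaults.model.primary / fallbacks comes from this report; the primary value is a placeholder, the config file's name and exact format are not shown here, and the location of runtime.type is omitted because the report does not say where it nests.

```python
import json

# Key path taken from the reproduction steps; values are illustrative.
config = {
    "agents": {
        "defaults": {
            "model": {
                # Placeholder for a provider that is currently unreachable.
                "primary": "example-provider/unreachable-model",
                # First three fallbacks from the "Environment" section.
                "fallbacks": [
                    "nvidia/z-ai/glm-5.1",
                    "nvidia/moonshotai/kimi-k2.6",
                    "bailian/kimi-k2.5",
                ],
            }
        }
    }
}

print(json.dumps(config, indent=2))
```

With a chain like this, a timeout on the second fallback should be followed by an attempt on bailian/kimi-k2.5 if the chain walk works as documented.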

Impact

  • When the primary model is down (e.g., rate-limited provider), users get a failure instead of a working response from a later fallback.
  • The issue is especially impactful for embedded agents with long fallback chains, where intermediate providers (like NVIDIA) may return 500/408 errors.
