openclaw - Embedded runtime: model fallback chain breaks at intermediate candidates instead of walking to the last entry

Error Message

// embedded layer decides to fallback
{
  "event": "embedded_run_failover_decision",
  "provider": "nvidia",
  "model": "moonshotai/kimi-k2.6",
  "decision": "fallback_model",
  "failoverReason": "timeout",
  "fallbackConfigured": true,
  "status": 408,
  "errorPreview": "500 Internal server error: unhashable type: 'dict'"
}

// model-fallback layer shows chain exhausted
{
  "event": "model_fallback_decision",
  "decision": "candidate_failed",
  "candidateModel": "moonshotai/kimi-k2.6",
  "attempt": 1,
  "total": 1,
  "reason": "timeout",
  "fallbackStepFinalOutcome": "chain_exhausted",
  "fallbackConfigured": false
}

Root Cause

The logs suggest that when a fallback candidate itself fails (as opposed to the primary), the embedded run treats that candidate as a new standalone invocation with its own modelOverrideSource, which then has no fallback chain configured (fallbackConfigured: false). This breaks the chain walk described in the docs.

This may be specific to runtime.type: "embedded" — all agents in my config use embedded runtime, so I cannot verify whether the ACP runtime behaves differently.

Summary

In embedded runtime mode, when a model in the middle of the fallbacks array fails, the fallback chain does not continue to the next candidates. Instead, it reports chain_exhausted at the point of failure, even though there are more models remaining in the configured fallback list.

Environment

OpenClaw Version: 2026.5.22 (a374c3a)
Runtime: embedded (all agents configured with runtime.type: "embedded")
Primary model: bailian/qwen3.6-plus

Configured fallbacks (11 models):

nvidia/z-ai/glm-5.1
nvidia/moonshotai/kimi-k2.6
bailian/kimi-k2.5
nvidia/deepseek-ai/deepseek-v4-flash
nvidia/minimaxai/minimax-m2.7
bailian/glm-5
nvidia/deepseek-ai/deepseek-v4-pro
bailian/qwen3.7-max
bailian/MiniMax-M2.5
github-copilot/gpt-5.5
github-copilot/claude-sonnet-4.6

Observed Behavior (from gateway logs)

Event 1 — 2026-05-26 15:04:56 (bailian/qwen3-coder-plus)

{
  "event": "embedded_run_failover_decision",
  "provider": "bailian",
  "model": "qwen3-coder-plus",
  "decision": "surface_error",
  "failoverReason": "timeout",
  "fallbackConfigured": false
}

→ qwen3-coder-plus (used by subagents) is not covered by the default model fallback chain, so when it timed out the error was surfaced directly with no fallback attempt.

Event 2 — 2026-05-26 15:07:41 (nvidia/moonshotai/kimi-k2.6)

// embedded layer decides to fallback
{
  "event": "embedded_run_failover_decision",
  "provider": "nvidia",
  "model": "moonshotai/kimi-k2.6",
  "decision": "fallback_model",
  "failoverReason": "timeout",
  "fallbackConfigured": true,
  "status": 408,
  "errorPreview": "500 Internal server error: unhashable type: 'dict'"
}

// model-fallback layer shows chain exhausted
{
  "event": "model_fallback_decision",
  "decision": "candidate_failed",
  "candidateModel": "moonshotai/kimi-k2.6",
  "attempt": 1,
  "total": 1,
  "reason": "timeout",
  "fallbackStepFinalOutcome": "chain_exhausted",
  "fallbackConfigured": false
}

Key observation: kimi-k2.6 is candidate #2 in the fallbacks array, yet the log shows attempt=1/1 and chain_exhausted. The fallback chain did not continue to bailian/kimi-k2.5 (candidate #3) or to any of the 8 models after it.
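
To spot the same symptom in other gateway logs, a check like the following may help. It is a minimal sketch: the field names are copied from the model_fallback_decision entry above, but findTruncatedChains is a hypothetical helper, not part of OpenClaw, and whether "total" is meant to count the primary model is an assumption I could not confirm.

```typescript
// Field names mirror the model_fallback_decision log entry in this report.
interface ModelFallbackDecision {
  event: string;
  decision: string;
  candidateModel: string;
  attempt: number;
  total: number;
  reason: string;
  fallbackStepFinalOutcome?: string;
  fallbackConfigured: boolean;
}

// Hypothetical helper: returns entries that report chain_exhausted even though
// `total` is smaller than the number of fallback candidates in the config.
function findTruncatedChains(
  entries: ModelFallbackDecision[],
  configuredCandidates: number,
): ModelFallbackDecision[] {
  return entries.filter(
    (e) =>
      e.event === "model_fallback_decision" &&
      e.fallbackStepFinalOutcome === "chain_exhausted" &&
      e.total < configuredCandidates,
  );
}

// The entry logged for Event 2.
const observed: ModelFallbackDecision = {
  event: "model_fallback_decision",
  decision: "candidate_failed",
  candidateModel: "moonshotai/kimi-k2.6",
  attempt: 1,
  total: 1,
  reason: "timeout",
  fallbackStepFinalOutcome: "chain_exhausted",
  fallbackConfigured: false,
};

// 11 fallbacks are configured in this report.
const suspicious = findTruncatedChains([observed], 11);
console.log(suspicious.map((e) => `${e.candidateModel}: ${e.attempt}/${e.total}`));
```

Any entry this returns claims exhaustion after fewer attempts than the configured chain allows, which is the mismatch described here.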

Expected Behavior

Per model-failover.md, OpenClaw should walk the entire configured fallback chain:

"If that provider is exhausted with a failover-worthy error, move to the next model candidate."
"If every candidate fails, throw a FallbackSummaryError"

The chain should only stop when all candidates have been tried or a non-failover-worthy error (abort, context overflow, user cancel) occurs.
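
The behavior I expect can be sketched as below. This is only an illustration of the documented rule, not OpenClaw's implementation: runWithFallbacks, AttemptResult, and the failoverWorthy flag are hypothetical names, only FallbackSummaryError is taken from the quoted docs, and the attempt callback is synchronous purely to keep the sketch short.

```typescript
type Failure = { model: string; reason: string };

// Only the name comes from the docs; the shape of this class is an assumption.
class FallbackSummaryError extends Error {
  failures: Failure[];
  constructor(failures: Failure[]) {
    super(`all ${failures.length} candidates failed`);
    this.name = "FallbackSummaryError";
    this.failures = failures;
  }
}

// Hypothetical result of one model attempt.
type AttemptResult =
  | { ok: true; output: string }
  | { ok: false; reason: string; failoverWorthy: boolean };

// Walks the candidates in order and stops at the first success.
function runWithFallbacks(
  candidates: string[],
  attempt: (model: string) => AttemptResult,
): { model: string; output: string } {
  const failures: Failure[] = [];
  for (const model of candidates) {
    const result = attempt(model);
    if (result.ok) return { model, output: result.output };
    // Abort, context overflow, and user cancel end the walk immediately.
    if (!result.failoverWorthy) throw new Error(`${model}: ${result.reason}`);
    failures.push({ model, reason: result.reason });
  }
  // Reached only after every candidate failed with a failover-worthy error.
  throw new FallbackSummaryError(failures);
}

// Simulate the primary and the first two fallbacks timing out.
const chain = [
  "bailian/qwen3.6-plus",
  "nvidia/z-ai/glm-5.1",
  "nvidia/moonshotai/kimi-k2.6",
  "bailian/kimi-k2.5",
];
const down = new Set(chain.slice(0, 3));
const result = runWithFallbacks(chain, (model) =>
  down.has(model)
    ? { ok: false, reason: "timeout", failoverWorthy: true }
    : { ok: true, output: "ok" },
);
console.log(result.model);
```

In this simulation the walk passes the failed kimi-k2.6 and is answered by bailian/kimi-k2.5, which is the step that does not happen in the logs above.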

Reproduction

1. Configure agents.defaults.model.primary to a provider that is currently unreachable.
2. Configure agents.defaults.model.fallbacks with multiple candidates.
3. Trigger an agent turn (embedded runtime).
4. Observe: the first or second fallback candidate fails, but the chain does not continue to the remaining candidates.
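
For reference, the two settings used in the steps above have roughly this shape, written here as a TypeScript object. Only the key paths agents.defaults.model.primary and agents.defaults.model.fallbacks come from my setup; the primary value is a placeholder, and the actual file format and any surrounding keys (including where runtime.type: "embedded" lives) are not shown and should not be inferred from this sketch.

```typescript
// Shape only; the real config file format and any sibling keys are assumptions.
const reproConfig = {
  agents: {
    defaults: {
      model: {
        // Placeholder for a provider that is currently unreachable.
        primary: "example-provider/unreachable-model",
        // First three fallbacks from this report; any multi-entry list should do.
        fallbacks: [
          "nvidia/z-ai/glm-5.1",
          "nvidia/moonshotai/kimi-k2.6",
          "bailian/kimi-k2.5",
        ],
      },
    },
  },
};

console.log(JSON.stringify(reproConfig, null, 2));
```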

Impact

When the primary model is down (e.g., rate-limited provider), users get a failure instead of a working response from a later fallback.
The issue is especially impactful for embedded agents with long fallback chains, where intermediate providers (like NVIDIA) may return 500/408 errors.
