openclaw - 💡(How to fix) Fix user-switched model has no fallback chain, causing session deadlock on provider outage [1 pull requests]

Fix Action

Fixed


Summary

When a user manually switches the active model (e.g., /model deepseek/deepseek-v4-pro), OpenClaw disables the fallback chain entirely. If that model becomes unavailable (provider 503, timeout, 5xx), the session deadlocks — the Gateway retries the same model in a loop with no recovery path.

Root Cause

resolveEffectiveModelFallbacks() in src/agents/agent-scope.ts returns [] when modelOverrideSource === "user". The function treats any non-auto override as a signal to disable fallbacks, on the assumption that the user explicitly chose their model and no substitution should be made.

The downstream effect: persistFallbackCandidateSelection() in the agent runner also returns early for user-switched models, preventing fallback persistence. Between empty fallbacks and blocked persistence, the session has zero recovery options.
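
The two early returns described above can be illustrated with a minimal sketch. The types, field names, function names, and the `example/backup-model` id are hypothetical stand-ins, not the actual OpenClaw source; only the `modelOverrideSource === "user"` condition comes from the report.

```typescript
// Hypothetical types for illustration; the real OpenClaw definitions may differ.
type ModelOverrideSource = "auto" | "user";

interface SessionModelState {
  activeModel: string;
  modelOverrideSource: ModelOverrideSource;
  configuredFallbacks: string[]; // assumed to come from the agent's config
}

// Sketch of the reported behavior: a user override empties the fallback chain.
function resolveFallbacksAsReported(state: SessionModelState): string[] {
  if (state.modelOverrideSource === "user") {
    return [];
  }
  return [...state.configuredFallbacks];
}

// Sketch of the reported persistence guard: returns false when it bails out early.
function persistFallbackAsReported(state: SessionModelState, candidate: string): boolean {
  if (state.modelOverrideSource === "user") {
    return false;
  }
  state.activeModel = candidate;
  return true;
}

const userSwitched: SessionModelState = {
  activeModel: "deepseek/deepseek-v4-pro",
  modelOverrideSource: "user",
  configuredFallbacks: ["example/backup-model"], // hypothetical model id
};

// Only the user-selected model remains as a candidate for the call.
const candidates = [userSwitched.activeModel, ...resolveFallbacksAsReported(userSwitched)];
console.log(candidates);
```

With both guards active, a failing call has no other candidate to try, and nothing would be recorded even if one existed.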

Real Incident Data (2026-05-21, 14:00–14:14 CST)

Timeline

| Time (CST) | Event |
| --- | --- |
| 13:43 | User switches to deepseek/deepseek-v4-pro |
| 13:46–14:00 | Normal conversation with 42 tool calls |
| 14:00:23 | DeepSeek v4-pro model call hangs (120s idle timeout) |
| 14:02:28 | Gateway logs: stalled session, LLM idle timeout (120s): no response from model |
| 14:07 | User sends follow-up message; system cannot process it (v4-pro still hung) |
| 14:08:06 | Context compaction fails: v4-pro timeout (61s), v4-flash file_lock_stale |
| 14:10:15 | Another deepseek-v4-pro call times out (120s) |
| 14:10:36 | User force-restarts Gateway (SIGTERM) |
| 14:12:09 | After restart, compaction fails on v4-pro (timeout) |
| 14:12:12 | Compaction on v4-flash fails: provider_error_5xx |
| 14:13:12 | Another compaction fails on v4-flash: provider_error_5xx |
| 14:13:58 | Session finally aborted after ~14 minutes of stall |

Provider error evidence

Confirmed DeepSeek outage at the time:

503 Service is too busy. We advise users to temporarily switch to alternative LLM API service providers.

Gateway diagnostic logs

13:59:25 [diagnostic] long-running session: age=577s queueDepth=1 reason=queued_behind_active_work
14:02:25 [diagnostic] stalled session: age=757s reason=active_work_without_progress
14:02:28 [agent/embedded] embedded run failover decision: decision=surface_error reason=timeout
  rawError="LLM idle timeout (120s): no response from model" fallbackConfigured=false

The fallbackConfigured=false flag confirms the session had no fallback path for the user-switched model.
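
For anyone triaging similar incidents, a small script along these lines could pull the `key=value` fields out of a failover decision line. This is a sketch that assumes the log format matches the excerpt above; the helper names are hypothetical and not part of OpenClaw.

```typescript
// Parse key=value pairs (values may be double-quoted) from a single log line.
function parseLogFields(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const pattern = /(\w+)=(?:"([^"]*)"|(\S+))/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    // Group 2 holds a quoted value, group 3 an unquoted one.
    fields[match[1]] = match[2] ?? match[3];
  }
  return fields;
}

// True when an error was surfaced and no fallback was configured.
function surfacedWithoutFallback(line: string): boolean {
  const fields = parseLogFields(line);
  return fields["decision"] === "surface_error" && fields["fallbackConfigured"] === "false";
}

// The two log lines from the excerpt, joined into one string.
const failoverLine =
  "14:02:28 [agent/embedded] embedded run failover decision: decision=surface_error reason=timeout " +
  'rawError="LLM idle timeout (120s): no response from model" fallbackConfigured=false';

const failoverFields = parseLogFields(failoverLine);
console.log(failoverFields["reason"], surfacedWithoutFallback(failoverLine));
```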

Proposed Fix

In resolveEffectiveModelFallbacks(): when modelOverrideSource === "user", return the agent's configured fallbacks instead of an empty array. This gives user-switched models a per-call recovery path. The existing persistFallbackCandidateSelection guard ensures the fallback selection is NOT persisted to session state, so the user's model preference is preserved for the next message.
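
A minimal sketch of the proposed behavior follows. It is not the actual patch: the types, the synchronous `callModel` callback (real model calls would be asynchronous), the `runOnce` helper, and the `example/backup-model` id are all assumptions made to keep the example self-contained.

```typescript
// Hypothetical types for illustration; the real OpenClaw definitions may differ.
type OverrideSource = "auto" | "user";

interface SessionState {
  activeModel: string;
  modelOverrideSource: OverrideSource;
  configuredFallbacks: string[]; // assumed to come from the agent's config
}

// Proposed: a user override no longer empties the chain.
function resolveFallbacksProposed(state: SessionState): string[] {
  return [...state.configuredFallbacks];
}

// Existing guard, kept as-is: a fallback is never persisted over a user choice.
function persistSelection(state: SessionState, usedModel: string): void {
  if (state.modelOverrideSource === "user") {
    return;
  }
  state.activeModel = usedModel;
}

// Try the active model first, then each fallback, for this one call only.
function runOnce(
  state: SessionState,
  callModel: (model: string) => string,
): { model: string; reply: string } {
  const candidates = [state.activeModel, ...resolveFallbacksProposed(state)];
  let lastError: unknown = new Error("no candidate models");
  for (const model of candidates) {
    try {
      const reply = callModel(model);
      persistSelection(state, model);
      return { model, reply };
    } catch (err) {
      lastError = err; // remember the failure and move to the next candidate
    }
  }
  throw lastError;
}

const session: SessionState = {
  activeModel: "deepseek/deepseek-v4-pro",
  modelOverrideSource: "user",
  configuredFallbacks: ["example/backup-model"], // hypothetical model id
};

// Simulate the outage: the user-selected model fails, the fallback answers.
const result = runOnce(session, (model) => {
  if (model === "deepseek/deepseek-v4-pro") {
    throw new Error("503 Service is too busy");
  }
  return "ok";
});
console.log(result.model, session.activeModel);
```

In this sketch the call is served by the fallback while `session.activeModel` stays on the user's choice, so the next message tries the user-selected model again.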

Related

  • PR: #TBD (to be linked)
