openclaw - 💡(How to fix) Fix user-switched model has no fallback chain, causing session deadlock on provider outage [1 pull requests]

StepCodex · 2026-05-21T07:11:08Z

## Fixed

- Fixed by PR: fix: allow user-switched model to use agent fallback chain (https://github.com/openclaw/openclaw/pull/84867)

## Summary

When a user manually switches the active model (e.g., `/model deepseek/deepseek-v4-pro`), OpenClaw disables the fallback chain entirely. If that model becomes unavailable (provider 503, timeout, 5xx), the session deadlocks: the Gateway retries the same model in a loop with no recovery path.

## Root Cause

`resolveEffectiveModelFallbacks()` in `src/agents/agent-scope.ts` returns `[]` when `modelOverrideSource === "user"`. The function treats any non-auto override as a signal to disable fallbacks, on the assumption that the user explicitly chose their model and no substitution should be made.

The downstream effect: `persistFallbackCandidateSelection()` in the agent runner also returns early for user-switched models, preventing fallback persistence. Between empty fallbacks and blocked persistence, the session has zero recovery options.
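The root cause can be illustrated with a minimal sketch. It is not the actual OpenClaw source: the input shape, the `Before` suffix on the function name, and the `provider-b/model-b` placeholder are assumptions made for illustration. Only the function name `resolveEffectiveModelFallbacks` and the `modelOverrideSource === "user"` check come from the issue.

```typescript
// Hypothetical, simplified types; the real OpenClaw signatures are not shown in this issue.
type ModelOverrideSource = "auto" | "user";

interface FallbackInput {
  // Fallback model ids configured on the agent (assumed shape).
  agentFallbacks: string[];
  // How the active model was chosen; undefined means no override.
  modelOverrideSource?: ModelOverrideSource;
}

// Pre-fix behaviour as described in the root cause: a user override empties the chain.
function resolveEffectiveModelFallbacksBefore(input: FallbackInput): string[] {
  if (input.modelOverrideSource === "user") {
    return []; // no candidates left, so a provider outage cannot be routed around
  }
  return [...input.agentFallbacks];
}

const userSwitched = resolveEffectiveModelFallbacksBefore({
  agentFallbacks: ["provider-b/model-b"], // placeholder id, not a real model
  modelOverrideSource: "user",
});

// With an empty chain, the only model left to try is the one that is failing.
const fallbackConfigured = userSwitched.length > 0;
console.log({ userSwitched, fallbackConfigured });
```

In this model the agent has a fallback configured, but the user-switched session never sees it, so every retry goes back to the same unavailable model.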
## Real Incident Data (2026-05-21, 14:00–14:14 CST)

### Timeline

| Time (CST) | Event |
|------------|-------|
| 13:43 | User switches to `deepseek/deepseek-v4-pro` |
| 13:46–14:00 | Normal conversation with 42 tool calls |
| 14:00:23 | DeepSeek v4-pro model call hangs (120s idle timeout) |
| 14:02:28 | Gateway logs: `stalled session`, `LLM idle timeout (120s): no response from model` |
| 14:07 | User sends follow-up message; system cannot process it (v4-pro still hung) |
| 14:08:06 | Context compaction fails: v4-pro timeout (61s), v4-flash `file_lock_stale` |
| 14:10:15 | Another deepseek-v4-pro call times out (120s) |
| 14:10:36 | User force-restarts Gateway (SIGTERM) |
| 14:12:09 | After restart, compaction fails on v4-pro (timeout) |
| 14:12:12 | Compaction on v4-flash fails: `provider_error_5xx` |
| 14:13:12 | Another compaction fails on v4-flash: `provider_error_5xx` |
| 14:13:58 | Session finally aborted after ~14 minutes of stall |

### Provider error evidence

Confirmed DeepSeek outage at the time:

```
503 Service is too busy. We advise users to temporarily switch to alternative LLM API service providers.
```

### Gateway diagnostic logs

```
13:59:25 [diagnostic] long-running session: age=577s queueDepth=1 reason=queued_behind_active_work
14:02:25 [diagnostic] stalled session: age=757s reason=active_work_without_progress
14:02:28 [agent/embedded] embedded run failover decision: decision=surface_error reason=timeout rawError="LLM idle timeout (120s): no response from model" fallbackConfigured=false
```

The `fallbackConfigured=false` flag confirms the session had no fallback path for the user-switched model.

## Proposed Fix

In `resolveEffectiveModelFallbacks()`: when `modelOverrideSource === "user"`, return the agent's configured fallbacks instead of an empty array. This gives user-switched models a per-call recovery path.
The existing `persistFallbackCandidateSelection` guard ensures the fallback selection is NOT persisted to session state, so the user's model preference is preserved for the next message.

## Related

- PR: #TBD (to be linked)
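The proposed fix in the section above can be sketched as follows. This is a simplified, synchronous model under stated assumptions, not the code from the linked PR. The `SessionState` shape, the `runWithFallbacks` helper, the `CallModel` callback, and the `provider-b/model-b` placeholder are hypothetical. Only the two function names and the `modelOverrideSource === "user"` check come from the issue.

```typescript
// Hypothetical, simplified model of the proposed behaviour; not the actual OpenClaw code.
type ModelOverrideSource = "auto" | "user";

interface SessionState {
  activeModel: string; // model the session uses for the next message
  modelOverrideSource?: ModelOverrideSource;
}

// Post-fix: a user override no longer empties the chain.
function resolveEffectiveModelFallbacks(
  agentFallbacks: string[],
  _source?: ModelOverrideSource,
): string[] {
  return [...agentFallbacks];
}

// Stand-in for the existing guard: user-switched sessions never persist a fallback.
function persistFallbackCandidateSelection(session: SessionState, model: string): void {
  if (session.modelOverrideSource === "user") return; // keep the user's choice
  session.activeModel = model;
}

// Assumed model call: throws when the provider is unavailable.
type CallModel = (model: string) => string;

// Try the active model first, then each fallback, for this call only.
function runWithFallbacks(
  session: SessionState,
  agentFallbacks: string[],
  callModel: CallModel,
): { model: string; reply: string } {
  const candidates = [
    session.activeModel,
    ...resolveEffectiveModelFallbacks(agentFallbacks, session.modelOverrideSource),
  ];
  let lastError: unknown;
  for (const model of candidates) {
    try {
      const reply = callModel(model);
      if (model !== session.activeModel) {
        persistFallbackCandidateSelection(session, model);
      }
      return { model, reply };
    } catch (err) {
      lastError = err; // remember the failure and move to the next candidate
    }
  }
  throw lastError; // every candidate failed; surface the last error
}

const session: SessionState = {
  activeModel: "deepseek/deepseek-v4-pro",
  modelOverrideSource: "user",
};
const result = runWithFallbacks(session, ["provider-b/model-b"], (model) => {
  if (model.startsWith("deepseek/")) throw new Error("503 Service is too busy");
  return `ok from ${model}`;
});
console.log(result.model, session.activeModel);
```

In this sketch the failing call is answered by the fallback, while `session.activeModel` still points at the user's choice, so the next message tries the user-selected model again.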



