openclaw - 💡(How to fix) Fix Model fallback chain interrupted by race condition — 6 fallback models configured but task terminates before all are tried [1 comments, 2 participants]

openclaw 2026-05-05 03:18:22

GitHub stats

openclaw/openclaw#77655 • Fetched 2026-05-06 06:23:21

Author

shawenbin

Participants

clawsweeper[bot]

shawenbin

Timeline (top)

mentioned ×3 · subscribed ×3 · commented ×1

Error Message

When the primary model times out, the fallback mechanism correctly logs model-fallback/decision: next=<model>, but the next model is never actually called: the lane task error propagates upward at the same millisecond and kills the entire request.

18:37:24 [lane task error] lane=main durationMs=133963
22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."

Root Cause

The model-fallback/decision log shows that the system correctly decides which model to fail over to (next=...). However, FailoverError: LLM request timed out propagates through [diagnostic] lane task error → session task error before the new model request can actually start. This appears to be a race condition: error propagation completes before the fallback sub-task is spawned.
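The suspected ordering can be modeled in a few lines. This is a minimal TypeScript sketch, not OpenClaw code: `LaneState`, `onModelTimeout`, and the `deferPropagation` flag are hypothetical names invented for illustration, and the log strings only imitate the lines quoted in the report. It assumes the fallback attempt is spawned asynchronously while the error is propagated synchronously, which is one ordering consistent with the logs.

```typescript
// Hypothetical model of the suspected race; none of these names are OpenClaw APIs.
interface LaneState {
  terminated: boolean;
  calledModels: string[];
  log: string[];
}

function newLaneState(): LaneState {
  return { terminated: false, calledModels: [], log: [] };
}

// Simulates what happens when `failed` times out and `next` is chosen as fallback.
function onModelTimeout(
  state: LaneState,
  failed: string,
  next: string,
  deferPropagation: boolean,
): Promise<void> {
  return new Promise<void>((resolve) => {
    // The decision is always logged, as in the report.
    state.log.push(`[model-fallback/decision] candidate_failed from=${failed} next=${next}`);

    // The fallback attempt is spawned asynchronously (on a later timer tick).
    setTimeout(() => {
      if (!state.terminated) {
        state.calledModels.push(next); // the fallback model is actually invoked
      }
      resolve();
    }, 0);

    // Without deferral, the error reaches the lane synchronously and
    // terminates the task before the timer above has fired.
    if (!deferPropagation) {
      state.log.push('[lane task error] error="FailoverError: LLM request timed out."');
      state.terminated = true;
    }
  });
}
```

With `deferPropagation` set to `false` the decision is logged but `calledModels` stays empty, which matches the reported symptom. With `true` the fallback model is invoked.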

OpenClaw version: v2026.4.26 (npm @latest)
Platform: Windows Server, NVIDIA API provider
Config: agents.defaults.timeoutSeconds = 30 (also observed at 120)
Fallback chain: 6 models configured: deepseek-v4-flash → glm-5.1 → minimax-m2.7 → qwen-397b → kimi-k2.6 → OR.intelligence → openrouter/free

Observed Behavior

The core issue: 6 fallback models are configured, but the task is terminated before any model returns a successful response and before all fallback models are tried in order.

Log Evidence

Occurrence 1 (runId=212953a6, timeoutSeconds=120 at time)

18:37:24 [agent/embedded] embedded run timeout: timeoutMs=120000 from=nVidia/minimaxai/minimax-m2.7
18:37:24 [embedded run failover decision] decision=fallback_model from=nVidia/minimaxai/minimax-m2.7
18:37:24 [lane task error] lane=main durationMs=133963
18:37:24 [model-fallback/decision] candidate_failed ... next=nvidia/z-ai/glm-5.1

Two models had already failed. The system decided to try the 3rd model (glm-5.1), but the task was terminated at the same moment. Remaining fallback models were never called.

Occurrence 2 (runId=551bbcd9, timeoutSeconds=30 at time)

22:16:18 [agent/embedded] embedded run timeout: timeoutMs=30000 from=nVidia/deepseek-ai/deepseek-v4-flash
22:16:18 [embedded run failover decision] decision=fallback_model from=nVidia/deepseek-ai/deepseek-v4-flash
22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."
22:16:18 [model-fallback/decision] next=nvidia/z-ai/glm-5.1

Only the primary model failed. The system decided to try the 2nd model (glm-5.1), but the task was terminated. The remaining 5 fallback models were never called.
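To check other logs for the same pattern, a small scanner can flag every fallback decision that shares a timestamp with a lane task error. This is a rough sketch that assumes the line layout quoted above (`HH:MM:SS [tag] rest`); `findInterruptedFallbacks` is a hypothetical helper, not part of OpenClaw. Because the timestamps have one-second resolution, a match means "same second", not proof of a same-millisecond race.

```typescript
interface InterruptedFallback {
  time: string; // HH:MM:SS as printed in the log
  next: string; // model named in next=...
}

// Returns fallback decisions logged in the same second as a lane task error.
function findInterruptedFallbacks(logText: string): InterruptedFallback[] {
  const laneErrorTimes = new Set<string>();
  const decisions: InterruptedFallback[] = [];

  for (const raw of logText.split("\n")) {
    // Expected shape: "22:16:18 [some tag] key=value ..."
    const m = /^(\d{2}:\d{2}:\d{2})\s+\[([^\]]+)\]\s*(.*)$/.exec(raw.trim());
    if (!m) continue; // skip lines that do not match the assumed layout
    const [, time, tag, rest] = m;

    if (tag === "lane task error") {
      laneErrorTimes.add(time);
    } else if (tag === "model-fallback/decision") {
      const next = /\bnext=(\S+)/.exec(rest);
      if (next) decisions.push({ time, next: next[1] });
    }
  }

  return decisions.filter((d) => laneErrorTimes.has(d.time));
}

// Occurrence 2 from the report, used as sample input.
const sampleLog = [
  "22:16:18 [agent/embedded] embedded run timeout: timeoutMs=30000 from=nVidia/deepseek-ai/deepseek-v4-flash",
  "22:16:18 [embedded run failover decision] decision=fallback_model from=nVidia/deepseek-ai/deepseek-v4-flash",
  '22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."',
  "22:16:18 [model-fallback/decision] next=nvidia/z-ai/glm-5.1",
].join("\n");

const interrupted = findInterruptedFallbacks(sampleLog);
console.log(interrupted);
```

For the sample input the scanner reports a single entry, the decision to move to nvidia/z-ai/glm-5.1 at 22:16:18.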

Expected Behavior

When a model times out, the system should try each fallback model in configured order until:

A model returns a successful response, OR
ALL fallback models have been exhausted (all failed)

The task should not be terminated early when no model has succeeded and not all fallback models have been tried.
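The expected behavior amounts to a sequential loop that reports failure only after the last candidate has failed. The following TypeScript sketch illustrates that contract under stated assumptions: `runWithFallbacks`, `withTimeout`, `CallModel`, and `FallbackExhaustedError` are hypothetical names, and the `FailoverError` class here is a local stand-in that only borrows the name and message seen in the logs. It is not OpenClaw's implementation.

```typescript
// Local stand-in for the error named in the logs (not OpenClaw's class).
class FailoverError extends Error {
  constructor(message: string, readonly model: string) {
    super(message);
    this.name = "FailoverError";
  }
}

// Thrown only after every candidate has been tried and has failed.
class FallbackExhaustedError extends Error {
  constructor(readonly failures: unknown[]) {
    super(`all ${failures.length} models failed`);
    this.name = "FallbackExhaustedError";
  }
}

type CallModel = (model: string) => Promise<string>;

// Rejects with FailoverError if the call has not settled within timeoutMs.
// Note: this sketch does not cancel the underlying request on timeout.
function withTimeout(call: CallModel, model: string, timeoutMs: number): Promise<string> {
  return new Promise<string>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new FailoverError("LLM request timed out.", model)),
      timeoutMs,
    );
    call(model).then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Tries each model in configured order; the error surfaces only on exhaustion.
async function runWithFallbacks(
  models: string[],
  call: CallModel,
  timeoutMs: number,
): Promise<{ model: string; output: string }> {
  const failures: unknown[] = [];
  for (const model of models) {
    try {
      const output = await withTimeout(call, model, timeoutMs);
      return { model, output }; // first success ends the chain
    } catch (err) {
      failures.push(err); // record the failure and continue with the next model
    }
  }
  throw new FallbackExhaustedError(failures);
}
```

The key design point is that the `catch` inside the loop swallows each per-model failure, so nothing reaches the caller (the lane, in the report's terms) until the loop has finished.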

Analysis

Steps to Reproduce

Configure multiple fallback models (6 in this case)
Set agents.defaults.timeoutSeconds to a low value (30-120)
Wait for the primary model to time out
Observe: model-fallback/decision: next=xxx appears but the model is never executed, and remaining fallback models are never attempted

Reported from OpenClaw WebChat

extent analysis

TL;DR

The system should be modified to prevent the lane task error from propagating upward and terminating the request before all fallback models have been tried.

Guidance

Review the error handling mechanism to identify why the lane task error is propagating upward and terminating the request prematurely.
Consider implementing a mechanism to delay or prevent the lane task error from propagating until all fallback models have been attempted.
Investigate the FailoverError: LLM request timed out exception to determine if it can be caught and handled in a way that allows the fallback mechanism to continue.
Verify that the fallback mechanism is correctly configured and that the model-fallback/decision log is accurately reflecting the intended fallback model.
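For the last point, a quick sanity check of the configured chain can rule out configuration mistakes before digging into error propagation. The sketch below is illustrative only: the `FallbackConfig` shape and `validateFallbackConfig` are assumptions made for this example and do not reflect OpenClaw's actual config schema. Only the model names and the timeout value come from the report.

```typescript
// Hypothetical config shape, invented for this example.
interface FallbackConfig {
  timeoutSeconds: number;
  primary: string;
  fallbacks: string[];
}

// Returns a list of human-readable problems; an empty list means no issue found.
function validateFallbackConfig(cfg: FallbackConfig): string[] {
  const problems: string[] = [];

  if (!Number.isFinite(cfg.timeoutSeconds) || cfg.timeoutSeconds <= 0) {
    problems.push("timeoutSeconds must be a positive number");
  }
  if (cfg.fallbacks.length === 0) {
    problems.push("no fallback models configured");
  }

  // A model listed twice (or equal to the primary) wastes an attempt.
  const seen = new Set<string>([cfg.primary]);
  for (const model of cfg.fallbacks) {
    if (model.trim() === "") {
      problems.push("empty model id in fallbacks");
    } else if (seen.has(model)) {
      problems.push(`duplicate model id: ${model}`);
    }
    seen.add(model);
  }
  return problems;
}

// Chain as listed in the report, with the first entry treated as the primary.
const reportedConfig: FallbackConfig = {
  timeoutSeconds: 30,
  primary: "deepseek-v4-flash",
  fallbacks: ["glm-5.1", "minimax-m2.7", "qwen-397b", "kimi-k2.6", "OR.intelligence", "openrouter/free"],
};

const configProblems = validateFallbackConfig(reportedConfig);
console.log(configProblems.length === 0 ? "config looks sane" : configProblems);
```

The reported chain passes these checks, which is consistent with the report's reading that the problem lies in error propagation and not in configuration.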

Example

No code snippet is provided as the issue does not contain sufficient information about the specific code implementation.

Notes

The root cause of the issue appears to be a race condition between the error propagation and the fallback mechanism. Resolving this issue may require modifications to the error handling and fallback mechanisms.

Recommendation

Apply a workaround to delay or prevent the lane task error from propagating upward until all fallback models have been attempted, allowing the system to try each fallback model in configured order until a successful response is received or all models have been exhausted.

#api #memory management #API rate limit #retriever error #indexing error
