openclaw - 💡(How to fix) Fix Model fallback chain interrupted by race condition — 6 fallback models configured but task terminates before all are tried [1 comments, 2 participants]

GitHub stats
openclaw/openclaw#77655 · Fetched 2026-05-06 06:23:21
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 2
Timeline (top): mentioned ×3, subscribed ×3, commented ×1

Error Message

When the primary model times out, the fallback mechanism logs model-fallback/decision: next=<model>, but the next model is never called: the lane task error propagates upward at effectively the same moment and terminates the entire request.

18:37:24 [lane task error] lane=main durationMs=133963
22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."

Root Cause

The model-fallback/decision log shows that the system correctly decides which model to fail over to (next=...). However, FailoverError: LLM request timed out propagates through [diagnostic] lane task error → session task error before the new model request can start. This appears to be a race condition: error propagation completes before the fallback sub-task is spawned.



OpenClaw version: v2026.4.26 (npm @latest)
Platform: Windows Server, NVIDIA API provider
Config: agents.defaults.timeoutSeconds = 30 (also observed at 120)
Fallback chain: 6 models configured: deepseek-v4-flash → glm-5.1 → minimax-m2.7 → qwen-397b → kimi-k2.6 → OR.intelligence → openrouter/free


Observed Behavior

When the primary model times out, the fallback mechanism correctly logs model-fallback/decision: next=<model>, but the next model is never actually called because the lane task error propagates upward at the same millisecond, killing the entire request.

The core issue: 6 fallback models are configured, but the task is terminated before any model returns a successful response and before all fallback models are tried in order.


Log Evidence

Occurrence 1 (runId=212953a6, timeoutSeconds=120 at the time)

18:37:24 [agent/embedded] embedded run timeout: timeoutMs=120000 from=nVidia/minimaxai/minimax-m2.7
18:37:24 [embedded run failover decision] decision=fallback_model from=nVidia/minimaxai/minimax-m2.7
18:37:24 [lane task error] lane=main durationMs=133963
18:37:24 [model-fallback/decision] candidate_failed ... next=nvidia/z-ai/glm-5.1

Two models had already failed. The system decided to try the 3rd model (glm-5.1), but the task was terminated at the same moment. Remaining fallback models were never called.

Occurrence 2 (runId=551bbcd9, timeoutSeconds=30 at the time)

22:16:18 [agent/embedded] embedded run timeout: timeoutMs=30000 from=nVidia/deepseek-ai/deepseek-v4-flash
22:16:18 [embedded run failover decision] decision=fallback_model from=nVidia/deepseek-ai/deepseek-v4-flash
22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."
22:16:18 [model-fallback/decision] next=nvidia/z-ai/glm-5.1

Only the primary model failed. The system decided to try the 2nd model (glm-5.1), but the task was terminated. The remaining 5 fallback models were never called.
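
To check a longer log for the same pattern, a small scanner can flag every fallback decision whose chosen model is never mentioned again. This is a minimal sketch: the function name and regular expression are hypothetical and inferred only from the log excerpts above, and real OpenClaw logs may record a fallback attempt differently.

```python
import re

# Hypothetical pattern, inferred only from the excerpts in this report.
NEXT_RE = re.compile(r"\[model-fallback/decision\].*?\bnext=(\S+)")


def find_unattempted_fallbacks(lines):
    """Return models named by a fallback decision that no later line mentions.

    Comparison is case-insensitive because the excerpts mix
    'nVidia/...' and 'nvidia/...' spellings.
    """
    unattempted = []
    for i, line in enumerate(lines):
        match = NEXT_RE.search(line)
        if not match:
            continue
        chosen = match.group(1)
        mentioned_later = any(chosen.lower() in later.lower() for later in lines[i + 1:])
        if not mentioned_later:
            unattempted.append(chosen)
    return unattempted


log = [
    "22:16:18 [agent/embedded] embedded run timeout: timeoutMs=30000 from=nVidia/deepseek-ai/deepseek-v4-flash",
    "22:16:18 [embedded run failover decision] decision=fallback_model from=nVidia/deepseek-ai/deepseek-v4-flash",
    '22:16:18 [lane task error] lane=main durationMs=85570 error="FailoverError: LLM request timed out."',
    "22:16:18 [model-fallback/decision] next=nvidia/z-ai/glm-5.1",
]
print(find_unattempted_fallbacks(log))  # ['nvidia/z-ai/glm-5.1']
```

A non-empty result only suggests that the chosen model was never invoked; it is a heuristic, not proof, since a successful call might be logged without the model name.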


Expected Behavior

When a model times out, the system should try each fallback model in configured order until:

  1. A model returns a successful response, OR
  2. ALL fallback models have been exhausted (all failed)

The task should not be terminated early when no model has succeeded and not all fallback models have been tried.
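
The expected behavior can be sketched as a sequential loop that records each failure and raises only after the whole chain is exhausted. This is an illustration in Python, not OpenClaw's implementation: `run_with_fallbacks`, `AllModelsFailed`, `call_model`, and the model names are all hypothetical.

```python
import asyncio


class AllModelsFailed(Exception):
    """Raised only after every model in the chain has been tried."""

    def __init__(self, failures):
        super().__init__(f"all {len(failures)} models failed")
        self.failures = failures  # list of (model_name, exception) pairs


async def run_with_fallbacks(models, call_model, timeout_seconds):
    """Try each model in order; return (model, result) for the first success."""
    failures = []
    for model in models:
        try:
            # Per-attempt timeout: a slow model fails this attempt only.
            result = await asyncio.wait_for(call_model(model), timeout_seconds)
            return model, result
        except Exception as exc:  # includes the timeout raised by wait_for
            failures.append((model, exc))  # record it and move to the next model
    raise AllModelsFailed(failures)


async def fake_call(model):
    # Stand-in provider: only the last model answers in time.
    if model != "model-c":
        await asyncio.sleep(1)
    return f"reply from {model}"


print(asyncio.run(run_with_fallbacks(["model-a", "model-b", "model-c"], fake_call, 0.05)))
# ('model-c', 'reply from model-c')
```

The key design point is that a per-model timeout is handled inside the loop, so the caller sees an error only when `AllModelsFailed` is raised.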


Analysis

The model-fallback/decision log shows that the system correctly decides which model to fail over to (next=...). However, FailoverError: LLM request timed out propagates through [diagnostic] lane task error → session task error before the new model request can start. This appears to be a race condition: error propagation completes before the fallback sub-task is spawned.
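
One way such an ordering problem could arise is shown below. This is a hypothetical Python illustration of the suspected pattern, not OpenClaw code: every name here (`racy_lane`, `sequential_lane`, `fallback_call`) is invented, and only the `FailoverError` message is taken from the logs. In the racy variant the fallback is scheduled but the error is handled first, so teardown cancels the fallback before it runs.

```python
import asyncio


class FailoverError(Exception):
    pass


async def fallback_call(calls):
    calls.append("fallback")  # stands in for the next model's request
    return "fallback reply"


async def racy_lane(calls):
    children = []
    try:
        # The fallback is only scheduled here; it has not started running.
        children.append(asyncio.create_task(fallback_call(calls)))
        # The original timeout error propagates immediately.
        raise FailoverError("LLM request timed out.")
    except FailoverError:
        # Lane teardown cancels outstanding work, including the unstarted fallback.
        for child in children:
            child.cancel()
        await asyncio.gather(*children, return_exceptions=True)
        return "lane task error"


async def sequential_lane(calls):
    try:
        raise FailoverError("LLM request timed out.")
    except FailoverError:
        # Await the fallback before reporting any error for the lane.
        return await fallback_call(calls)


racy_calls, ok_calls = [], []
print(asyncio.run(racy_lane(racy_calls)), racy_calls)
print(asyncio.run(sequential_lane(ok_calls)), ok_calls)
```

In the racy variant no fallback call is recorded even though a fallback was "decided", which matches the symptom in the logs. Whether OpenClaw's lane handling follows this pattern is an assumption that would need to be confirmed against the source.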


Steps to Reproduce

  1. Configure multiple fallback models (6 in this case)
  2. Set agents.defaults.timeoutSeconds to a low value (30-120)
  3. Wait for the primary model to time out
  4. Observe: model-fallback/decision: next=xxx appears but the model is never executed, and remaining fallback models are never attempted

Reported from OpenClaw WebChat

extent analysis

TL;DR

The system should be modified to prevent the lane task error from propagating upward and terminating the request before all fallback models have been tried.

Guidance

  • Review the error handling mechanism to identify why the lane task error is propagating upward and terminating the request prematurely.
  • Consider implementing a mechanism to delay or prevent the lane task error from propagating until all fallback models have been attempted.
  • Investigate the FailoverError: LLM request timed out exception to determine if it can be caught and handled in a way that allows the fallback mechanism to continue.
  • Verify that the fallback mechanism is correctly configured and that the model-fallback/decision log is accurately reflecting the intended fallback model.

Example

The issue does not include enough detail about the OpenClaw implementation to give a project-specific code snippet.

Notes

The root cause of the issue appears to be a race condition between the error propagation and the fallback mechanism. Resolving this issue may require modifications to the error handling and fallback mechanisms.

Recommendation

Apply a workaround to delay or prevent the lane task error from propagating upward until all fallback models have been attempted, allowing the system to try each fallback model in configured order until a successful response is received or all models have been exhausted.
