openclaw: Model fallback cascade takes 30s per invalid candidate instead of failing fast (v2026.4.24–.25) [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#76165 · Fetched 2026-05-03 04:41:32
Comments: 1 · Participants: 2 · Timeline events: 3 · Reactions: 2
Timeline (top): commented ×1, cross-referenced ×1, unsubscribed ×1

Split from #75687 (closed prematurely — only Bugs 1–2 were addressed on main). This is Bug 4 from that report.

Non-retryable 400 errors (e.g. invalid model ID, unsupported parameter) take ~30 seconds each before the fallback fires. With a 3-entry fallback chain, a single user message can stall for 90+ seconds before reaching a working model.

In v2026.4.23, the same fallback chain failed over in seconds. The 30s delay in .24/.25 suggests an internal retry or timeout is wrapping deterministic, non-retryable errors.

Example scenario

Chain: groq/model-a (400 invalid) → openrouter/model-b (400 invalid) → anthropic/claude-sonnet
  • v2026.4.23: ~2s total to reach claude-sonnet
  • v2026.4.24+: ~60-90s total to reach claude-sonnet (30s × 2 invalid candidates)
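
The arithmetic behind the regressed figure can be sketched as follows. This is only an illustrative model of the scenario above: the `Candidate` type and `stallBeforeSuccess` function are hypothetical names, not part of openclaw, and the 30-second cost per failure is the approximate figure reported in this issue.

```typescript
// Hypothetical model of the stall described above; names are illustrative only.
interface Candidate {
  id: string;
  valid: boolean; // false = the provider answers with a non-retryable 400
}

// Seconds spent on failing candidates before the first valid one is reached.
// Returns null when no candidate in the chain is valid.
function stallBeforeSuccess(
  chain: Candidate[],
  perFailureSeconds: number,
): number | null {
  const firstValid = chain.findIndex((c) => c.valid);
  return firstValid === -1 ? null : firstValid * perFailureSeconds;
}

const chain: Candidate[] = [
  { id: "groq/model-a", valid: false },
  { id: "openrouter/model-b", valid: false },
  { id: "anthropic/claude-sonnet", valid: true },
];

// Two invalid candidates at ~30s each: 2 * 30 = 60 seconds of stall.
const regressedStall = stallBeforeSuccess(chain, 30);
```

The model counts only time lost on failing candidates; the latency of the final successful call is not included.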

This is especially painful combined with event loop pressure — the 30s blocking compounds with other CPU-bound work.

Environment

openclaw: v2026.4.24, v2026.4.25 (regressed) — v2026.4.23 (fast failover)
node: 22.21.1
os: Windows 11 Pro N (10.0.26200)
platform: win32
bots: 8 Telegram accounts

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Configure a model fallback chain where the first 1-2 entries return 400 (e.g. invalid model ID)
  2. Send a message to any agent
  3. Observe ~30s delay per invalid model before fallback fires
  4. Compare with v2026.4.23 where the same chain fails over in seconds
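
To put numbers on step 3, one could time each candidate attempt and flag deterministic failures that took suspiciously long. The harness below is a minimal sketch under stated assumptions: `Attempt` stands in for whatever call actually reaches a provider, it is assumed to resolve with an HTTP status rather than throw, and none of these names come from openclaw.

```typescript
// Hypothetical timing harness; none of these names are openclaw APIs.
interface AttemptResult {
  status: number; // HTTP status returned by the provider
}
type Attempt = () => Promise<AttemptResult>;

interface Timing {
  id: string;
  status: number;
  elapsedMs: number;
}

// Time a single candidate attempt using wall-clock time.
async function timeAttempt(id: string, attempt: Attempt): Promise<Timing> {
  const start = Date.now();
  const { status } = await attempt();
  return { id, status, elapsedMs: Date.now() - start };
}

// Keep only candidates that failed with 400 or 404 yet took longer than
// the threshold; these are the ones that should have failed fast.
function slowDeterministicFailures(
  timings: Timing[],
  thresholdMs: number,
): Timing[] {
  return timings.filter(
    (t) => (t.status === 400 || t.status === 404) && t.elapsedMs > thresholdMs,
  );
}
```

Running the same chain on v2026.4.23 and v2026.4.24 and comparing the flagged entries would show whether the delay sits on the failing candidates themselves.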

Expected behavior

Non-retryable errors (HTTP 400, 404, invalid model) should fail over to the next candidate immediately — no retry, no timeout. These are deterministic failures.
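
A minimal sketch of this expected behavior, assuming a fallback loop that tries each candidate once and moves on as soon as it sees a deterministic failure. The result shape, the status list, and the function names are all assumptions made for illustration; this is not how openclaw is known to implement its fallback.

```typescript
// Illustrative only: types and names are assumptions, not openclaw internals.
type CallResult =
  | { ok: true; text: string }
  | { ok: false; status: number };

type ModelCall = () => Promise<CallResult>;

interface FallbackCandidate {
  id: string;
  call: ModelCall;
}

// Statuses the issue names as deterministic: retrying cannot change the outcome.
function isDeterministicFailure(status: number): boolean {
  return status === 400 || status === 404;
}

// Try each candidate in order and return the first success, or null if
// every candidate fails.
async function firstWorking(
  chain: FallbackCandidate[],
): Promise<{ id: string; text: string } | null> {
  for (const candidate of chain) {
    const result = await candidate.call();
    if (result.ok) {
      return { id: candidate.id, text: result.text };
    }
    if (isDeterministicFailure(result.status)) {
      continue; // fail over immediately: no retry, no wait
    }
    // Other statuses (for example 429 or 5xx) might deserve a bounded retry;
    // this sketch simply moves on to the next candidate as well.
  }
  return null;
}
```

The point of the design is that the only time spent on an invalid candidate is the single round trip that returns the 400 or 404.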

Actual behavior

Each invalid candidate blocks for ~30 seconds before the fallback fires. A 3-entry chain with 2 invalid models stalls for 60-90 seconds.

OpenClaw version

v2026.4.24, v2026.4.25

Operating system

Windows

Related issues

  • Parent: #75687 (closed — only startup fanout and Bonjour crash were fixed)
  • Related: #75707 (broader event loop saturation — still open)

extent analysis

TL;DR

The most likely fix is to make non-retryable errors bypass whatever retry or timeout currently wraps the model fallback chain, so that failover to the next candidate happens immediately.

Guidance

  • Investigate the changes made between v2026.4.23 and v2026.4.24 to identify the introduction of the 30-second delay for non-retryable errors.
  • Review the configuration of the model fallback chain to ensure that it is set up to fail over immediately for deterministic failures like HTTP 400 or invalid model IDs.
  • Consider implementing a custom error handling mechanism to bypass the default retry or timeout behavior for non-retryable errors.
  • Test the fallback chain with different error scenarios to verify that the fix works as expected.

Example

No code snippet is provided as the issue does not contain sufficient information about the specific code changes or configurations.
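
That said, the custom error handling suggested in the guidance can be illustrated with a hypothetical retry policy. The sketch below assumes that 429 and 5xx responses are transient and that every other status is deterministic; that split, the function names, and the default values are illustrative assumptions and do not describe openclaw's actual retry logic.

```typescript
// Hypothetical retry policy; not derived from openclaw's source.
interface RetryDecision {
  retry: boolean;
  delayMs: number;
}

// Assumed split: 429 and 5xx are transient, everything else is deterministic.
function isTransient(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Decide whether to retry the same candidate. `attempt` is 1-based.
function decideRetry(
  status: number,
  attempt: number,
  maxAttempts = 3,
  baseDelayMs = 500,
): RetryDecision {
  if (!isTransient(status) || attempt >= maxAttempts) {
    // Deterministic failure or retry budget exhausted: hand over to the
    // next fallback candidate without waiting.
    return { retry: false, delayMs: 0 };
  }
  // Exponential backoff: base, 2x base, 4x base, and so on.
  return { retry: true, delayMs: baseDelayMs * 2 ** (attempt - 1) };
}
```

With this policy a 400 on the first attempt yields `{ retry: false, delayMs: 0 }`, so an invalid model ID never enters the backoff path.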

Notes

The regression is reported on OpenClaw v2026.4.24 and v2026.4.25 and may be related to changes in error handling or the retry mechanism. The root cause has not been confirmed, so further investigation is needed before a fix can be implemented.

Recommendation

As a workaround, try adjusting any available timeout or retry settings that apply to non-retryable errors in the model fallback chain. The root cause is not yet clear and may require further investigation.
