openclaw: Failover loop: timeout-heavy candidates starve later fallbacks in run time budget

openclaw/openclaw#58049 (fetched 2026-04-08 01:54:32)
When a session's modelOverride points to an unavailable model (e.g. a Vertex AI preview model that returns 404), the failover mechanism enters an infinite retry loop instead of converging on a working fallback candidate.

Error Message

The run itself times out and is marked as an error:

[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.

Root Cause

The fallback chain is evaluated per embedded run, and each run has a fixed time budget (~20s). The problem:

  1. Primary candidate (e.g. google-vertex/gemini-3.1-pro-preview) times out after ~15s
  2. Second candidate (xs/gpt-5.4) fails quickly (~2s)
  3. Third candidate (anthropic/claude-sonnet-4-6) gets the remaining ~3s — not enough to complete a full LLM request
  4. The run itself times out and is marked as error
  5. Lane scheduler starts a new run, which resets the fallback chain → back to step 1

This creates an infinite loop where the working candidate (Sonnet) is perpetually starved of time.
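
The arithmetic in the steps above can be sketched as a small simulation. The durations are the approximate figures from the report; the 8 s that Sonnet needs is an assumed number for illustration only, and `simulateRun` is a hypothetical helper, not OpenClaw code.

```javascript
// Approximate figures from the report. `neededMs` is the time until a
// candidate fails, or (for Sonnet) an ASSUMED time to finish a request.
const RUN_BUDGET_MS = 20000;
const chain = [
  { id: 'google-vertex/gemini-3.1-pro-preview', neededMs: 15000 },
  { id: 'xs/gpt-5.4', neededMs: 2000 },
  { id: 'anthropic/claude-sonnet-4-6', neededMs: 8000 }, // assumption
];

// Walk the chain once, charging every attempt against one shared budget.
function simulateRun(candidates, budgetMs) {
  let remainingMs = budgetMs;
  return candidates.map(({ id, neededMs }) => {
    const availableMs = remainingMs;
    remainingMs -= Math.min(neededMs, remainingMs);
    // A candidate is starved when the run cannot give it the time it needs.
    return { id, availableMs, starved: availableMs < neededMs };
  });
}

const result = simulateRun(chain, RUN_BUDGET_MS);
console.log(result);
```

Under these numbers the third candidate is offered only 3000 ms, so it is cut off on every run regardless of how healthy it is.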

Fix Action

Workaround

Manually clear modelOverride and providerOverride in sessions.json, then restart gateway.
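
For anyone who prefers to script the workaround, here is a minimal sketch. The layout of sessions.json is not described in the report, so the helper below assumes nothing about it and simply strips the two keys wherever they appear in the parsed data. `clearOverrides` is a hypothetical name.

```javascript
// Returns a copy of the parsed sessions data with the override keys removed
// at every nesting level. The file layout is an ASSUMPTION-free walk: any
// object that carries these keys loses them, everything else is preserved.
function clearOverrides(value, keys = ['modelOverride', 'providerOverride']) {
  if (Array.isArray(value)) {
    return value.map((item) => clearOverrides(item, keys));
  }
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value)
        .filter(([key]) => !keys.includes(key))
        .map(([key, inner]) => [key, clearOverrides(inner, keys)]),
    );
  }
  return value;
}

// Example with made-up session data:
const cleaned = clearOverrides({
  'session-1': { modelOverride: 'google-vertex/gemini-3.1-pro-preview', providerOverride: 'google-vertex', title: 'demo' },
});
console.log(JSON.stringify(cleaned));
```

To apply it, read the file with `fs.readFileSync`, pass the parsed JSON through the helper, and write the result back after taking a backup; the location of sessions.json depends on your install. The gateway still needs a restart afterwards, as noted above.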

Observed Behavior

From gateway.err.log, every ~20 seconds:

[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=google-vertex/gemini-3.1-pro-preview reason=timeout next=xs/gpt-5.4
[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=xs/gpt-5.4 reason=unknown next=anthropic/claude-sonnet-4-6
[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.

The candidate_succeeded for Sonnet only appears after many cycles (~3+ minutes), likely when the scheduler happens to give it enough headroom.

Expected Behavior

  • After a candidate fails with timeout, it should be circuit-broken (marked unavailable for a cooldown period) so subsequent runs skip it immediately
  • Or: the run time budget should be per-candidate, not shared across the entire fallback chain
  • Or: timeout-failed candidates should get a much shorter retry timeout (e.g. 2s probe) on subsequent attempts within the same session
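
The second and third options can be combined in a small planning step. The sketch below is one possible way to split a run budget, not how OpenClaw allocates time: `planTimeouts`, its parameters, and the equal-share policy are all assumptions made for illustration.

```javascript
// Hypothetical helper: decide how long each candidate may run in one run.
// - Candidates that timed out recently only get a short probe.
// - The rest share what is left equally, so the last candidate in the
//   chain is no longer limited to whatever earlier candidates left over.
function planTimeouts(candidates, runBudgetMs, timedOutRecently, probeMs = 2000) {
  const probeCount = candidates.filter((c) => timedOutRecently.has(c)).length;
  const healthyCount = candidates.length - probeCount;
  const remainingMs = Math.max(0, runBudgetMs - probeCount * probeMs);
  const shareMs = healthyCount > 0 ? Math.floor(remainingMs / healthyCount) : 0;
  return new Map(
    candidates.map((c) => [c, timedOutRecently.has(c) ? probeMs : shareMs]),
  );
}

const plan = planTimeouts(
  ['google-vertex/gemini-3.1-pro-preview', 'xs/gpt-5.4', 'anthropic/claude-sonnet-4-6'],
  20000,
  new Set(['google-vertex/gemini-3.1-pro-preview']),
);
console.log(Object.fromEntries(plan));
```

With the report's 20 s budget and the Vertex model flagged as recently timed out, the probe costs 2 s and the other two candidates get 9 s each instead of 2 s and 3 s.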

Environment

  • OpenClaw gateway (npm global install)
  • macOS, modelOverride set to google-vertex/gemini-3.1-pro-preview (model 404 on Vertex AI)
  • Fallback config: gemini-3.1-pro-preview → xs/gpt-5.4 → anthropic/claude-sonnet-4-6

Fix Plan

To address the infinite retry loop issue, we will implement a circuit-breaker mechanism for candidates that fail with a timeout. This will prevent the fallback chain from repeatedly attempting to use a candidate that is not responding within the allotted time.

Step-by-Step Solution:

  1. Implement Circuit Breaker:

    • Introduce a circuitBreaker object to track the status of each candidate.
    • When a candidate times out, mark it as unavailable in the circuitBreaker for a cooldown period.
  2. Modify Fallback Logic:

    • Before attempting a candidate, check its status in the circuitBreaker.
    • If a candidate is marked as unavailable, skip it and proceed to the next candidate in the fallback chain.
  3. Example Code Snippet:

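    // Candidate id -> timestamp (ms) until which the candidate is skipped.
    // Kept at module level so the state survives across runs; a breaker that
    // lived inside a single run would be reset together with the fallback chain.
    const circuitBreaker = new Map();
    const cooldownPeriod = 30000; // 30 seconds
    
    function isCandidateAvailable(candidate, now = Date.now()) {
      const blockedUntil = circuitBreaker.get(candidate);
      return blockedUntil === undefined || blockedUntil <= now;
    }
    
    function markCandidateUnavailable(candidate, now = Date.now()) {
      circuitBreaker.set(candidate, now + cooldownPeriod);
    }
    
    // Example usage in the fallback chain evaluation.
    // `attempt` is a placeholder for the real request function. It is assumed
    // to return a promise that rejects with `error.reason === 'timeout'`
    // when the candidate times out.
    async function evaluateFallbackChain(candidates, attempt) {
      let lastError;
      for (const candidate of candidates) {
        if (!isCandidateAvailable(candidate)) continue; // still cooling down
        try {
          // Attempt to use the candidate; stop at the first success
          return await attempt(candidate);
        } catch (error) {
          lastError = error;
          if (error.reason === 'timeout') {
            markCandidateUnavailable(candidate);
          }
          // Proceed to the next candidate
        }
      }
      throw lastError ?? new Error('No fallback candidate is currently available');
    }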

Verification

To verify that the fix worked:

  • Set up a test scenario with a model override pointing to an unavailable model.
  • Monitor the logs for the candidate_failed and candidate_succeeded messages.
  • The working candidate (e.g., Sonnet) should succeed within a reasonable time frame without the infinite retry loop.
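
To make the log check less manual, the decision lines can be tallied with a short script. This is a sketch: it relies only on the `[model-fallback/decision] <event> key=value ...` shape visible in the excerpts above, and it assumes that candidate_succeeded lines use the same shape, which the report does not show.

```javascript
// Parse one "[model-fallback/decision] <event> key=value ..." line.
// Returns null for lines from other loggers.
function parseDecisionLine(line) {
  const match = line.match(/^\[model-fallback\/decision\]\s+(\S+)\s+(.*)$/);
  if (!match) return null;
  const [, event, rest] = match;
  const fields = {};
  for (const pair of rest.split(/\s+/)) {
    const eq = pair.indexOf('=');
    if (eq > 0) fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return { ...fields, event };
}

// Count events per candidate, e.g. "candidate_failed:xs/gpt-5.4" -> 3.
function summarize(logText) {
  const counts = {};
  for (const line of logText.split('\n')) {
    const entry = parseDecisionLine(line.trim());
    if (!entry) continue;
    const key = `${entry.event}:${entry.candidate}`;
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

// Sample taken from the log excerpt in this report.
const sample = [
  '[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=google-vertex/gemini-3.1-pro-preview reason=timeout next=xs/gpt-5.4',
  '[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=xs/gpt-5.4 reason=unknown next=anthropic/claude-sonnet-4-6',
  '[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.',
].join('\n');

const summary = summarize(sample);
console.log(summary);
```

After the fix, the count of timeout failures for the unavailable candidate should stop growing during the cooldown, and a success entry for the working candidate should appear within the first run or two.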

Extra Tips

  • Consider implementing a more sophisticated circuit-breaker strategy, such as one that adapts the cooldown period based on the number of consecutive failures.
  • Review the fallback chain configuration to ensure that the order of candidates makes sense and that there are no unnecessary or redundant candidates.
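
The adaptive cooldown mentioned in the first tip could look like the sketch below, which doubles the cooldown on each consecutive timeout up to a cap. The base value matches the 30-second cooldown in the fix plan; the 10-minute cap and all function names are assumptions chosen for illustration.

```javascript
const BASE_COOLDOWN_MS = 30000;          // same 30 s as the fix plan
const MAX_COOLDOWN_MS = 10 * 60 * 1000;  // assumed cap: 10 minutes

// candidate id -> { count: consecutive timeouts, blockedUntil: timestamp in ms }
const failures = new Map();

// Record a timeout and return the cooldown that was applied.
function recordTimeout(candidate, now = Date.now()) {
  const count = (failures.get(candidate)?.count ?? 0) + 1;
  const cooldownMs = Math.min(BASE_COOLDOWN_MS * 2 ** (count - 1), MAX_COOLDOWN_MS);
  failures.set(candidate, { count, blockedUntil: now + cooldownMs });
  return cooldownMs;
}

// A success resets the streak so the next timeout starts from the base again.
function recordSuccess(candidate) {
  failures.delete(candidate);
}

function isAvailable(candidate, now = Date.now()) {
  const state = failures.get(candidate);
  return !state || state.blockedUntil <= now;
}
```

A model that stays down is then probed less and less often, while a single transient timeout costs only the base cooldown.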
