openclaw - 💡(How to fix) Fix Failover loop: timeout-heavy candidates starve later fallbacks in run time budget [1 participants]

openclaw · 2026-03-31 01:47:46

openclaw/openclaw#58049•Fetched 2026-04-08 01:54:32

Author: justin7974

Participants: justin7974

When a session's modelOverride points to an unavailable model (e.g. a Vertex AI preview model that returns 404), the failover mechanism enters an infinite retry loop instead of converging on a working fallback candidate.

Error Message

The run itself times out and is marked as an error:

[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.

Root Cause

The fallback chain is evaluated per embedded run, and each run has a fixed time budget (~20s). A single run then plays out as follows:

1. Primary candidate (e.g. google-vertex/gemini-3.1-pro-preview) times out after ~15s
2. Second candidate (xs/gpt-5.4) fails quickly (~2s)
3. Third candidate (anthropic/claude-sonnet-4-6) gets the remaining ~3s, which is not enough to complete a full LLM request
4. The run itself times out and is marked as error
5. Lane scheduler starts a new run, which resets the fallback chain → back to step 1

This creates a near-endless loop in which the working candidate (Sonnet) is starved of time on every run.
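The arithmetic behind the starvation can be sketched in a few lines. This is an illustrative model, not OpenClaw's scheduler: the function name `simulateRun`, the candidate object shape, and the 8 s figure for a successful Sonnet request are assumptions. The 20 s budget and the 15 s / 2 s failure times are the approximate values reported above.

```javascript
// Illustrative model of one embedded run with a shared time budget.
// All names here are hypothetical; durations are in seconds.
function simulateRun(budget, candidates) {
  let remaining = budget;
  for (const c of candidates) {
    if (remaining <= 0) break;
    if (c.fails) {
      // A failing candidate burns its failure time (capped by what is left).
      remaining -= Math.min(c.failAfter, remaining);
      continue;
    }
    // A healthy candidate only succeeds if its request fits in what is left.
    if (c.needs <= remaining) {
      return { ok: true, winner: c.id, remaining: remaining - c.needs };
    }
    remaining = 0; // the run deadline cuts the request short
  }
  return { ok: false, winner: null, remaining };
}

const chain = [
  { id: 'google-vertex/gemini-3.1-pro-preview', fails: true, failAfter: 15 },
  { id: 'xs/gpt-5.4', fails: true, failAfter: 2 },
  { id: 'anthropic/claude-sonnet-4-6', fails: false, needs: 8 }, // 8 s is assumed
];

// Sonnet is reached with 20 - 15 - 2 = 3 s left, less than the 8 s it needs.
const shared = simulateRun(20, chain);
console.log(shared.ok); // false

// Skipping the timed-out primary leaves 18 s for Sonnet, which is enough.
const skipped = simulateRun(20, chain.slice(1));
console.log(skipped.ok, skipped.winner);
```

In this model the outcome depends only on how much of the budget earlier candidates consume, which is why the order of the chain and the length of the primary's timeout matter more than Sonnet's own health.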


Observed Behavior

From gateway.err.log, every ~20 seconds:

[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=google-vertex/gemini-3.1-pro-preview reason=timeout next=xs/gpt-5.4
[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=xs/gpt-5.4 reason=unknown next=anthropic/claude-sonnet-4-6
[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.

A candidate_succeeded entry for Sonnet appears only after many cycles (3+ minutes), likely when the scheduler happens to give it enough headroom.
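To see at a glance which candidates keep failing, the decision lines can be tallied per candidate and reason. The sketch below assumes the log lines have exactly the `key=value` shape quoted above; `parseDecision` and `countFailures` are hypothetical helper names, not part of OpenClaw.

```javascript
// Parses "[model-fallback/decision] <event> key=value ..." lines of the
// shape quoted above. The helper names are hypothetical.
const DECISION_RE = /^\[model-fallback\/decision\] (\w+) (.*)$/;

function parseDecision(line) {
  const m = DECISION_RE.exec(line.trim());
  if (!m) return null; // not a fallback decision line
  const fields = {};
  for (const pair of m[2].split(/\s+/)) {
    const eq = pair.indexOf('=');
    if (eq > 0) fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return { event: m[1], ...fields };
}

// Counts candidate_failed events, keyed by "candidate (reason)".
function countFailures(lines) {
  const counts = {};
  for (const line of lines) {
    const d = parseDecision(line);
    if (d && d.event === 'candidate_failed') {
      const key = `${d.candidate} (${d.reason})`;
      counts[key] = (counts[key] || 0) + 1;
    }
  }
  return counts;
}

const sample = [
  '[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=google-vertex/gemini-3.1-pro-preview reason=timeout next=xs/gpt-5.4',
  '[model-fallback/decision] candidate_failed requested=google-vertex/gemini-3.1-pro-preview candidate=xs/gpt-5.4 reason=unknown next=anthropic/claude-sonnet-4-6',
  '[agent/embedded] embedded run agent end: isError=true model=gemini-3.1-pro-preview error=LLM request failed: network connection error.',
];
const failureCounts = countFailures(sample);
console.log(failureCounts);
```

Run over a few minutes of gateway.err.log, a tally like this should show the primary candidate's timeout count growing once per cycle while Sonnet never appears as a failed candidate.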

Expected Behavior

- After a candidate fails with timeout, it should be circuit-broken (marked unavailable for a cooldown period) so subsequent runs skip it immediately
- Or: the run time budget should be per-candidate, not shared across the entire fallback chain
- Or: timeout-failed candidates should get a much shorter retry timeout (e.g. 2s probe) on subsequent attempts within the same session
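The second option, a per-candidate budget, could look roughly like this. It is a minimal sketch under stated assumptions: `withTimeout` and `runChain` are hypothetical names, `attempt(candidate)` stands in for whatever actually issues the LLM request, and the `reason` property on the error is an assumed shape modelled on the `reason=timeout` field in the logs.

```javascript
// Sketch of a per-candidate deadline: each attempt gets its own timer
// instead of drawing from one budget shared by the whole chain.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`timed out after ${ms} ms`);
      err.reason = 'timeout'; // assumed error shape
      reject(err);
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// `attempt(candidate)` is a hypothetical async function supplied by the caller.
async function runChain(candidates, attempt, perCandidateMs) {
  const failures = [];
  for (const candidate of candidates) {
    try {
      const result = await withTimeout(attempt(candidate), perCandidateMs);
      return { candidate, result, failures };
    } catch (err) {
      failures.push({ candidate, reason: err.reason || 'unknown' });
    }
  }
  throw new Error('all candidates failed');
}
```

Note that `Promise.race` only stops waiting; it does not cancel the underlying request, so a real implementation would likely also pass an abort signal to the request. The third option (a short probe) fits the same shape: pass a smaller `perCandidateMs` for candidates that timed out earlier in the session.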

Environment

OpenClaw gateway (npm global install)
macOS, modelOverride set to google-vertex/gemini-3.1-pro-preview (the model returns 404 on Vertex AI)
Fallback config: gemini-3.1-pro-preview → xs/gpt-5.4 → anthropic/claude-sonnet-4-6

Workaround

Manually clear modelOverride and providerOverride in sessions.json, then restart the gateway.
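For more than a handful of sessions, the manual edit can be scripted. The layout of sessions.json is not described in this issue, so the sketch below makes no assumption about it beyond the two key names: it walks the whole parsed value and deletes `modelOverride` and `providerOverride` wherever they occur. The function name `clearOverrides` and the sample session entry are made up for illustration.

```javascript
// Removes every `modelOverride` / `providerOverride` key found anywhere in
// a parsed sessions.json value. The file's real layout is not documented
// here, so this walks the whole structure rather than assuming a shape.
const OVERRIDE_KEYS = new Set(['modelOverride', 'providerOverride']);

function clearOverrides(value) {
  let removed = 0;
  if (Array.isArray(value)) {
    for (const item of value) removed += clearOverrides(item);
  } else if (value !== null && typeof value === 'object') {
    for (const key of Object.keys(value)) {
      if (OVERRIDE_KEYS.has(key)) {
        delete value[key];
        removed += 1;
      } else {
        removed += clearOverrides(value[key]);
      }
    }
  }
  return removed; // number of override keys deleted
}

// In-memory example with a made-up session entry:
const sessions = {
  'example-session': {
    modelOverride: 'google-vertex/gemini-3.1-pro-preview',
    providerOverride: 'google-vertex',
    title: 'demo',
  },
};
const removedCount = clearOverrides(sessions);
console.log(removedCount, JSON.stringify(sessions));
```

To apply it to the real file, read and parse it with `fs.readFileSync` and `JSON.parse`, call `clearOverrides`, and write the result back with `JSON.stringify`. Keep a backup copy and stop the gateway first so it does not overwrite the change.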

extent analysis

Fix Plan

To address the infinite retry loop issue, we will implement a circuit-breaker mechanism for candidates that fail with a timeout. This will prevent the fallback chain from repeatedly attempting to use a candidate that is not responding within the allotted time.

Step-by-Step Solution:

1. Implement Circuit Breaker:
   - Introduce a circuitBreaker object to track the status of each candidate.
   - When a candidate times out, mark it as unavailable in the circuitBreaker for a cooldown period.
2. Modify Fallback Logic:
   - Before attempting a candidate, check its status in the circuitBreaker.
   - If a candidate is marked as unavailable, skip it and proceed to the next candidate in the fallback chain.

Example Code Snippet:

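// Maps candidate id -> timestamp (ms) until which the candidate is skipped.
const circuitBreaker = new Map();
const cooldownPeriod = 30000; // 30 seconds

function isCandidateAvailable(candidate, now = Date.now()) {
  const blockedUntil = circuitBreaker.get(candidate);
  return blockedUntil === undefined || blockedUntil <= now;
}

function markCandidateUnavailable(candidate, now = Date.now()) {
  circuitBreaker.set(candidate, now + cooldownPeriod);
}

// Example usage in the fallback chain evaluation.
// `attempt` is a hypothetical caller-supplied async function that runs the
// request against one candidate and throws on failure.
async function evaluateFallbackChain(candidates, attempt) {
  let lastError;
  for (const candidate of candidates) {
    if (!isCandidateAvailable(candidate)) continue; // circuit open: skip
    try {
      return await attempt(candidate); // stop at the first success
    } catch (error) {
      lastError = error;
      // Assumes the thrown error carries reason === 'timeout'.
      if (error.reason === 'timeout') {
        markCandidateUnavailable(candidate);
      }
      // Otherwise proceed to the next candidate.
    }
  }
  throw lastError ?? new Error('no fallback candidate available');
}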

Verification

To verify that the fix worked:

1. Set up a test scenario with a model override pointing to an unavailable model.
2. Monitor the logs for the candidate_failed and candidate_succeeded messages.
3. The working candidate (e.g., Sonnet) should succeed within a reasonable time frame without the infinite retry loop.
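Before testing against a live gateway, the expected convergence can be checked with a small simulation. This is a toy model, not OpenClaw code: the clock is simulated in seconds, the 8 s cost for Sonnet and the 30 s cooldown are assumptions, and the log strings only imitate the candidate_failed / candidate_succeeded event names mentioned above.

```javascript
// Simulated verification: repeated runs share a 20 s budget, and a candidate
// that times out is skipped for COOLDOWN seconds. All values are illustrative.
const BUDGET = 20;
const COOLDOWN = 30;
const blockedUntil = new Map(); // candidate id -> simulated time it reopens

// cost: seconds consumed; ok: whether the request would succeed (assumed).
const candidates = [
  { id: 'google-vertex/gemini-3.1-pro-preview', cost: 15, ok: false, reason: 'timeout' },
  { id: 'xs/gpt-5.4', cost: 2, ok: false, reason: 'unknown' },
  { id: 'anthropic/claude-sonnet-4-6', cost: 8, ok: true },
];

function run(startTime) {
  let clock = startTime;
  const deadline = startTime + BUDGET;
  const log = [];
  for (const c of candidates) {
    if ((blockedUntil.get(c.id) ?? 0) > clock) {
      log.push(`skipped ${c.id}`);
      continue;
    }
    if (clock + c.cost > deadline) {
      log.push(`run_timeout during ${c.id}`);
      return { ok: false, end: deadline, log };
    }
    clock += c.cost;
    if (c.ok) {
      log.push(`candidate_succeeded ${c.id}`);
      return { ok: true, end: clock, log };
    }
    log.push(`candidate_failed ${c.id} reason=${c.reason}`);
    if (c.reason === 'timeout') blockedUntil.set(c.id, clock + COOLDOWN);
  }
  return { ok: false, end: clock, log };
}

const first = run(0);          // primary times out, Sonnet is starved
const second = run(first.end); // primary is skipped, Sonnet has room
console.log(first.ok, second.ok);
console.log(second.log);
```

In this model the first run still fails, but the second run skips the timed-out primary and reaches Sonnet with most of the budget intact. That is the behavior to look for in the real logs.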

Extra Tips

Consider implementing a more sophisticated circuit-breaker strategy, such as one that adapts the cooldown period based on the number of consecutive failures.
Review the fallback chain configuration to ensure that the order of candidates makes sense and that there are no unnecessary or redundant candidates.
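The adaptive cooldown from the first tip can be sketched as exponential backoff: each consecutive timeout doubles the cooldown, up to a cap, and a success resets the streak. The class name `AdaptiveBreaker`, the 30 s base, and the 10 minute cap are illustrative assumptions, not values taken from OpenClaw.

```javascript
// Circuit breaker whose cooldown doubles with each consecutive timeout,
// capped at a maximum. All names and numbers are illustrative assumptions.
class AdaptiveBreaker {
  constructor({ baseMs = 30_000, maxMs = 10 * 60_000 } = {}) {
    this.baseMs = baseMs;
    this.maxMs = maxMs;
    this.state = new Map(); // candidate -> { failures, blockedUntil }
  }

  isAvailable(candidate, now = Date.now()) {
    const s = this.state.get(candidate);
    return !s || s.blockedUntil <= now;
  }

  // Records a timeout and returns the cooldown (ms) that was applied.
  recordTimeout(candidate, now = Date.now()) {
    const failures = (this.state.get(candidate)?.failures ?? 0) + 1;
    const cooldown = Math.min(this.baseMs * 2 ** (failures - 1), this.maxMs);
    this.state.set(candidate, { failures, blockedUntil: now + cooldown });
    return cooldown;
  }

  recordSuccess(candidate) {
    this.state.delete(candidate); // a success resets the failure streak
  }
}

const breaker = new AdaptiveBreaker();
const id = 'google-vertex/gemini-3.1-pro-preview';
console.log(breaker.recordTimeout(id, 0)); // 30000
console.log(breaker.recordTimeout(id, 0)); // 60000
console.log(breaker.isAvailable(id, 59_999)); // false
console.log(breaker.isAvailable(id, 60_000)); // true
```

Passing `now` explicitly keeps the class easy to test with a fake clock. In production code the default `Date.now()` would be used.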
