openclaw - Feature: Configurable overload retry count and circuit breaker for model failover

openclaw · 2026-04-01 21:10:32

openclaw/openclaw#59253•Fetched 2026-04-08 02:26:52

Author

prasith-jobleap

Participants

prasith-jobleap

Problem

When a primary model (e.g., Anthropic Claude Opus) returns overloaded_error, OpenClaw retries 4+ times with exponential backoff (250ms → 500ms → 1000ms → 1500ms) before triggering failover to the next model in the fallback chain. This adds ~30-40 seconds of wasted latency per request even when the fallback model (e.g., GPT-5.4) is healthy and ready.

Worse, there is no cross-request circuit breaker: each new request independently rediscovers that the primary is down and pays the same retry tax every time. During a sustained outage (observed: 2+ hours on 2026-03-31), this makes the experience feel broken even though the fallback model works fine once it finally receives requests.

Observed behavior (from gateway logs)

35 Opus overloaded errors over ~2 hours
Each request retried Opus 4 times (~37s) before falling back to GPT-5.4
GPT-5.4 succeeded 8/8 times when it actually received requests
The retry delay was the entire source of user-perceived failure

Root cause (from source)

The retry policy is hardcoded:

const OVERLOAD_FAILOVER_BACKOFF_POLICY = {
  initialMs: 250, maxMs: 1500, factor: 2, jitter: 0.2
};

The run loop retries up to 24 + (profiles × 8) iterations (min 32) before giving up. There is no configurable knob for overload-specific retry count.
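To make the numbers above concrete, here is a minimal sketch that derives the nominal delay schedule from the quoted policy and the iteration cap from the formula in the text. The helper names `nominalDelayMs` and `maxRunLoopIterations` are hypothetical and not taken from the OpenClaw source, and jitter is left out because its exact semantics are not shown in the issue.

```typescript
// Values copied from the hardcoded policy quoted above.
const OVERLOAD_FAILOVER_BACKOFF_POLICY = {
  initialMs: 250, maxMs: 1500, factor: 2, jitter: 0.2,
};

// Hypothetical helper: nominal (jitter-free) delay before retry `attempt` (0-based).
function nominalDelayMs(attempt: number): number {
  const { initialMs, maxMs, factor } = OVERLOAD_FAILOVER_BACKOFF_POLICY;
  return Math.min(maxMs, initialMs * Math.pow(factor, attempt));
}

// Hypothetical helper: iteration cap as described in the issue text,
// 24 + (profiles x 8) with a floor of 32.
function maxRunLoopIterations(profiles: number): number {
  return Math.max(32, 24 + profiles * 8);
}

const delays = [0, 1, 2, 3].map(nominalDelayMs);
const totalBackoffMs = delays.reduce((sum, d) => sum + d, 0);
console.log(delays, totalBackoffMs); // [ 250, 500, 1000, 1500 ] 3250
```

Under these assumptions the backoff sleeps add up to roughly 3.25 seconds across four retries, so most of the ~37 seconds observed per request is presumably spent waiting on the failing requests themselves rather than in the sleeps. That is an inference from the numbers in this issue, not something confirmed from the source.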

Proposed solution

1. Configurable overload retry count (quick win)

Add a config option to control how many overload retries occur before triggering model failover:

{
  "agents": {
    "defaults": {
      "model": {
        "overloadMaxRetries": 2
      }
    }
  }
}

The default could stay at 4 for backward compatibility, but letting users set it to 1 or 2 would dramatically reduce latency during outages.
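As a rough illustration of how such a knob could bound the retry loop, here is a sketch under stated assumptions. `runWithFailover`, `isOverloaded`, and the `{ type: "overloaded_error" }` error shape are all hypothetical and not OpenClaw's actual API. In this sketch `overloadMaxRetries` counts retries after the first attempt, so each model is tried at most `overloadMaxRetries + 1` times.

```typescript
type ModelCall<T> = (model: string) => Promise<T>;

// Hypothetical check; the real error shape in OpenClaw may differ.
function isOverloaded(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { type?: string }).type === "overloaded_error";
}

// Hypothetical sketch: walk the fallback chain, retrying overload errors at
// most `overloadMaxRetries` times per model before moving to the next one.
async function runWithFailover<T>(
  chain: string[],
  call: ModelCall<T>,
  overloadMaxRetries: number,
  sleep: (ms: number) => Promise<void> =
    (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown = new Error("empty model chain");
  for (const model of chain) {
    for (let attempt = 0; attempt <= overloadMaxRetries; attempt++) {
      try {
        return await call(model);
      } catch (err) {
        if (!isOverloaded(err)) throw err; // only overloads are retried here
        lastError = err;
        if (attempt < overloadMaxRetries) {
          // Same nominal schedule as the quoted policy, without jitter.
          await sleep(Math.min(1500, 250 * Math.pow(2, attempt)));
        }
      }
    }
  }
  throw lastError; // every model in the chain was overloaded
}
```

With `overloadMaxRetries` set to 1, an overloaded primary costs two attempts and a single 250 ms sleep before the fallback is tried.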

2. Cross-request circuit breaker (bigger win)

After N overload failures within a time window, skip the primary model entirely and go straight to the fallback for a cooldown period:

{
  "agents": {
    "defaults": {
      "model": {
        "overloadCircuitBreakerFailures": 3,
        "overloadCircuitBreakerWindowMinutes": 5,
        "overloadCircuitBreakerCooldownMinutes": 10
      }
    }
  }
}

Meaning: "If 3 overload failures happen within 5 minutes, skip the primary for the next 10 minutes and go straight to fallback."

The auth.cooldowns config already exists for billing failures — this would be the overload equivalent.
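The breaker described above could be sketched roughly as follows. This is an in-memory illustration only: the class `OverloadCircuitBreaker` and its methods are hypothetical, only the three config key names come from the proposal, and a real implementation would need to decide where this state lives and how it relates to auth.cooldowns.

```typescript
interface OverloadBreakerConfig {
  overloadCircuitBreakerFailures: number;
  overloadCircuitBreakerWindowMinutes: number;
  overloadCircuitBreakerCooldownMinutes: number;
}

// Hypothetical in-memory breaker keyed by model id; not OpenClaw's implementation.
class OverloadCircuitBreaker {
  private cfg: OverloadBreakerConfig;
  private failures = new Map<string, number[]>(); // timestamps (ms) of recent overloads
  private openUntil = new Map<string, number>();  // model -> time (ms) the cooldown ends

  constructor(cfg: OverloadBreakerConfig) {
    this.cfg = cfg;
  }

  // Record one overload failure; open the breaker once the threshold is hit
  // inside the sliding window.
  recordOverload(model: string, nowMs: number): void {
    const windowMs = this.cfg.overloadCircuitBreakerWindowMinutes * 60000;
    const recent = (this.failures.get(model) ?? []).filter((t) => nowMs - t < windowMs);
    recent.push(nowMs);
    if (recent.length >= this.cfg.overloadCircuitBreakerFailures) {
      const cooldownMs = this.cfg.overloadCircuitBreakerCooldownMinutes * 60000;
      this.openUntil.set(model, nowMs + cooldownMs);
      this.failures.delete(model);
    } else {
      this.failures.set(model, recent);
    }
  }

  // True while the model is in its cooldown and should be skipped.
  shouldSkip(model: string, nowMs: number): boolean {
    const until = this.openUntil.get(model);
    return until !== undefined && nowMs < until;
  }
}

const breaker = new OverloadCircuitBreaker({
  overloadCircuitBreakerFailures: 3,
  overloadCircuitBreakerWindowMinutes: 5,
  overloadCircuitBreakerCooldownMinutes: 10,
});
```

A caller would check `shouldSkip(model, Date.now())` before each attempt on the primary and call `recordOverload` whenever an overload error comes back. Passing the clock in explicitly keeps the sketch easy to test.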

Impact

This would turn a 30-40 second latency penalty into a <1 second transparent failover during provider outages. For users with capable fallback models configured, outages would become nearly invisible.

extent analysis

TL;DR

Implement a configurable overload retry count and a cross-request circuit breaker to reduce latency during model outages.

Guidance

Introduce a configurable overloadMaxRetries option to control the number of retries before triggering model failover, allowing users to set it to 1-2 for reduced latency.
Implement a cross-request circuit breaker with configurable overloadCircuitBreakerFailures, overloadCircuitBreakerWindowMinutes, and overloadCircuitBreakerCooldownMinutes options to skip the primary model after a specified number of failures within a time window.
Consider defaulting overloadMaxRetries to 4 for backward compatibility while allowing users to adjust it.
Review the existing auth.cooldowns config for billing failures as a reference for implementing the overload circuit breaker.

Example

{
  "agents": {
    "defaults": {
      "model": {
        "overloadMaxRetries": 2,
        "overloadCircuitBreakerFailures": 3,
        "overloadCircuitBreakerWindowMinutes": 5,
        "overloadCircuitBreakerCooldownMinutes": 10
      }
    }
  }
}
