openclaw - ✅(Solved) Fix Model fallback: add route-level debounce/circuit breaker for repeated failed primaries [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#56851 (fetched 2026-04-08 01:46:57)
Comments: 0
Participants: 1
Timeline: 0
Reactions: 0

PR fix notes

PR #5083: feat(reliability): circuit breaker for model provider fallback

Description (problem / solution / changelog)

Summary

  • Adds a CircuitBreaker state machine that quarantines failing model providers instead of retrying them on every request
  • Three states: Closed (healthy) → Open (skip provider) → HalfOpen (probe once to test recovery)
  • Integrated into all 7 provider iteration loops in ReliableProvider
  • Configurable via [reliability]: circuit_breaker_enabled, circuit_breaker_failure_threshold, circuit_breaker_recovery_secs
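
To make the three states concrete, here is a minimal Python sketch of such a state machine. It is an illustration only, not the PR's Rust code: the class name, method names, and default values are assumptions, and only the state names and the meaning of the threshold and recovery settings come from the summary above.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy: requests pass through
    OPEN = "open"            # quarantined: provider is skipped
    HALF_OPEN = "half_open"  # recovery window elapsed: allow one probe


class CircuitBreaker:
    """Illustrative breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, recovery_secs=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_secs = recovery_secs
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.probe_in_flight = False

    def allow_request(self):
        """Return True if the provider may be tried right now."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.recovery_secs:
            self.state = State.HALF_OPEN
            self.probe_in_flight = False
        if self.state is State.CLOSED:
            return True
        if self.state is State.HALF_OPEN and not self.probe_in_flight:
            self.probe_in_flight = True  # only one probe at a time
            return True
        return False

    def record_success(self):
        # Any success (including a probe) closes the breaker.
        self.state = State.CLOSED
        self.failures = 0
        self.probe_in_flight = False

    def record_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.probe_in_flight = False
```

A caller would check `allow_request()` before trying a provider and report the outcome with `record_success()` or `record_failure()`. Unlike the Arc<Mutex> version described in the PR, this sketch is not thread-safe; a shared instance would need a lock around each method.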

Inspired by openclaw#56851.

Changes

File | What
src/providers/circuit_breaker.rs | New: state machine with Arc<Mutex> thread safety, 9 unit tests
src/providers/reliable.rs | Integrate CB checks into provider iteration loops
src/providers/mod.rs | Wire CB at construction site
src/config/schema.rs | Add 3 config fields to ReliabilityConfig

Test plan

  • 9 unit tests covering all state transitions (closed→open, open→halfopen, halfopen→closed, halfopen→open)
  • cargo build succeeds
  • cargo fmt --all clean
  • CI gate

Changed files

  • src/config/schema.rs (modified, +25/-0)
  • src/providers/circuit_breaker.rs (added, +270/-0)
  • src/providers/mod.rs (modified, +32/-1)
  • src/providers/reliable.rs (modified, +84/-1)


Problem

OpenClaw currently does request-time failover, but it does not strongly quarantine a repeatedly failing primary model route.

In practice this means each new request still starts by trying the same broken primary again, even when the failure mode is already known (for example rate_limit, auth failure, provider outage, or repeated timeouts).

Real example

Observed in production:

  • main agent configured as claude-opus-4-6 -> gpt-5.4
  • Anthropic/Opus was returning repeated 429 rate_limit_error
  • OpenClaw correctly fell back to gpt-5.4
  • but every new request still retried Opus first, producing another 429 before falling back again

Similar issue also observed earlier with:

  • openai-codex/gpt-5.4 returning OAuth refresh failure (refresh_token_reused)
  • each request retried GPT first, then fell back to Opus

Current behavior

There is some cooldown behavior at the auth-profile level (auth_profile_failure_state_updated, cooldownUntil), but it appears too narrow/short-lived to act like a real route-level circuit breaker.

Today the behavior is effectively:

request
-> try primary
-> fail
-> fallback succeeds

next request
-> try same primary again

Requested behavior

Add a stronger route-level debounce / circuit breaker:

if model/provider fails repeatedly with auth/rate_limit/outage/timeout
-> mark that route unhealthy for N minutes
-> skip it entirely during the unhealthy window
-> probe again later
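
A small simulation illustrates the difference between the current and the requested behavior. This is a sketch with hypothetical names (`run`, `quarantine_secs`); it assumes a primary that always fails, one request per simulated second, and a breaker that trips on the first failure.

```python
def run(num_requests, quarantine_secs):
    """Count failed primary attempts when the primary always fails.

    One request arrives per simulated second. quarantine_secs=0 models
    the current behavior (no quarantine); a positive value skips the
    primary until the window has elapsed.
    """
    failed_primary_attempts = 0
    unhealthy_until = 0  # simulated time at which the primary may be probed again
    for now in range(num_requests):
        if now < unhealthy_until:
            continue  # primary skipped, fallback serves the request directly
        failed_primary_attempts += 1  # primary tried and failed, then fallback
        unhealthy_until = now + quarantine_secs
    return failed_primary_attempts


print(run(600, 0))    # no quarantine: every request hits the broken primary
print(run(600, 300))  # 5-minute quarantine: only the periodic probes fail
```

Under these assumptions, ten minutes of traffic produce 600 failed primary calls without a quarantine and 2 with a 5-minute window, which is the latency and quota saving the request is after.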

Suggested semantics

  • Track health at the model route level (provider/model, optionally plus auth profile)
  • Trigger on repeated failures like:
    • rate_limit / 429
    • auth refresh failure / invalid token
    • connection timeout / transport outage
  • Apply cooldown / quarantine window (e.g. 5m, 15m, exponential backoff)
  • During cooldown, skip the failed primary and go directly to the next fallback
  • Reset health after a successful probe or after cooldown expiry
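
The route key, the trigger classes, and the exponential-backoff option above could be sketched as follows. Everything here is an assumption made for illustration: the key format, the function names, the status-code mapping, and the 5-minute base with a 1-hour cap are not taken from OpenClaw.

```python
from typing import Optional


def route_key(provider: str, model: str, auth_profile: Optional[str] = None) -> str:
    """Build a health-tracking key: provider/model, optionally plus auth profile."""
    key = f"{provider}/{model}"
    return f"{key}@{auth_profile}" if auth_profile else key


def classify_failure(status: Optional[int], timed_out: bool = False) -> Optional[str]:
    """Map a failed call to a breaker-tripping class, or None if it should not trip."""
    if timed_out or status is None:
        return "timeout"      # no HTTP response: timeout or transport outage
    if status == 429:
        return "rate_limit"
    if status in (401, 403):
        return "auth"
    if status >= 500:
        return "outage"
    return None               # e.g. a 400 caused by the request itself


def cooldown_secs(consecutive_trips: int, base: float = 300.0, cap: float = 3600.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... limited to cap."""
    if consecutive_trips < 1:
        return 0.0
    return min(cap, base * 2 ** (consecutive_trips - 1))
```

Returning None for failures caused by the request itself keeps a malformed prompt from quarantining an otherwise healthy route.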

Why this matters

  • Reduces repeated latency spikes
  • Avoids needless repeated provider failures
  • Avoids spamming logs with the same known-bad route
  • Preserves quota on near-limit providers
  • Makes fallback behavior feel intentional rather than reactive

Nice-to-have

A small amount of config would be useful, e.g.:

  • failure threshold
  • cooldown duration
  • which failure classes trip the breaker
  • whether to probe automatically after cooldown
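
These knobs could be grouped into a single settings object. The sketch below is hypothetical: the field names and defaults are invented for illustration and do not correspond to an existing OpenClaw configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    """Hypothetical settings object; field names and defaults are assumptions."""
    failure_threshold: int = 3            # consecutive failures before tripping
    cooldown_secs: float = 300.0          # how long a tripped route is skipped
    tripping_classes: frozenset = frozenset({"rate_limit", "auth", "outage", "timeout"})
    auto_probe: bool = True               # retry the route once the cooldown ends

    def should_trip(self, failure_class: str, consecutive_failures: int) -> bool:
        """True when this failure should quarantine the route."""
        return (
            failure_class in self.tripping_classes
            and consecutive_failures >= self.failure_threshold
        )


# Example: only rate limits trip the breaker, and they trip it immediately.
strict = BreakerConfig(failure_threshold=1, tripping_classes=frozenset({"rate_limit"}))
```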

extent analysis

Fix Plan

To implement a stronger route-level circuit breaker, follow these steps:

  • Introduce a RouteHealthTracker class to monitor the health of each model route.
  • Update the request flow to check the health of the primary route before attempting to use it.
  • If the primary route is marked as unhealthy, skip it and go directly to the next fallback.

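Example code snippet in Python (a simplified sketch that quarantines a route on its first failure; a failure threshold is left out):

import time
from enum import Enum


class FailureType(Enum):
    RATE_LIMIT = 1
    AUTH_FAILURE = 2
    TIMEOUT = 3


class RouteHealthTracker:
    """Tracks which routes are in cooldown after a failure."""

    def __init__(self, cooldown_duration=300):  # 5 minutes
        self.route_health = {}  # route -> (failure_type, time marked unhealthy)
        self.cooldown_duration = cooldown_duration

    def mark_unhealthy(self, route, failure_type):
        # monotonic() is not affected by wall-clock adjustments
        self.route_health[route] = (failure_type, time.monotonic())

    def is_unhealthy(self, route):
        entry = self.route_health.get(route)
        if entry is None:
            return False
        _failure_type, marked_at = entry
        if time.monotonic() - marked_at < self.cooldown_duration:
            return True
        # Cooldown expired: forget the failure so the route is tried again
        del self.route_health[route]
        return False

    def reset_health(self, route):
        self.route_health.pop(route, None)


# Usage example
tracker = RouteHealthTracker()


def classify_failure(exc):
    # Placeholder mapping; real code would inspect the provider's error
    if isinstance(exc, TimeoutError):
        return FailureType.TIMEOUT
    if isinstance(exc, PermissionError):
        return FailureType.AUTH_FAILURE
    return FailureType.RATE_LIMIT


def handle_request(route, call_primary, call_fallback):
    # call_primary and call_fallback are zero-argument callables
    if tracker.is_unhealthy(route):
        # Skip primary route and go to fallback
        print(f"Route {route} is unhealthy, using fallback")
        return call_fallback()
    try:
        # Attempt to use primary route
        print(f"Using primary route {route}")
        result = call_primary()
    except Exception as exc:
        # Mark route as unhealthy on failure, then fall back
        tracker.mark_unhealthy(route, classify_failure(exc))
        return call_fallback()
    tracker.reset_health(route)
    return result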

Verification

To verify the fix, test the following scenarios:

  • Repeatedly failing primary route with rate_limit error
  • Repeatedly failing primary route with auth_failure error
  • Repeatedly failing primary route with timeout error
  • Successful probe after cooldown expiry
  • Successful reset of health after a successful probe
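
The cooldown scenarios can be checked deterministically by injecting a fake clock instead of sleeping. The sketch below uses a compact, hypothetical tracker (`CooldownTracker`) with a `clock` parameter; it is a stand-in for illustration, not the tracker from the fix plan.

```python
import time


class CooldownTracker:
    """Minimal stand-in tracker with an injectable clock (hypothetical)."""

    def __init__(self, cooldown_secs=300, clock=time.monotonic):
        self.cooldown_secs = cooldown_secs
        self.clock = clock
        self.marked_at = {}  # route -> time it was marked unhealthy

    def mark_unhealthy(self, route):
        self.marked_at[route] = self.clock()

    def reset(self, route):
        self.marked_at.pop(route, None)

    def is_unhealthy(self, route):
        marked = self.marked_at.get(route)
        return marked is not None and self.clock() - marked < self.cooldown_secs


class FakeClock:
    """Manually advanced clock so tests never sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


def test_cooldown_expiry_and_reset():
    clock = FakeClock()
    tracker = CooldownTracker(cooldown_secs=300, clock=clock)
    route = "provider/model"

    tracker.mark_unhealthy(route)
    clock.now = 299.0
    assert tracker.is_unhealthy(route)      # still inside the window

    clock.now = 300.0
    assert not tracker.is_unhealthy(route)  # window elapsed, probe allowed

    tracker.mark_unhealthy(route)           # probe failed: quarantined again
    tracker.reset(route)                    # probe succeeded: health restored
    assert not tracker.is_unhealthy(route)


test_cooldown_expiry_and_reset()
```

The same pattern extends to the per-failure-class scenarios by marking the route with each failure type in turn.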

Extra Tips

  • Consider adding configuration options for failure threshold, cooldown duration, and failure classes that trip the breaker.
  • Implement automatic probing after cooldown expiry to reset the health of the route.
  • Monitor the effectiveness of the circuit breaker and adjust the configuration as needed to optimize performance and reduce latency spikes.
