openclaw - ✅(Solved) Fix Model Circuit Breaker: Auto-disable failing models [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#55536 · Fetched 2026-04-08 01:38:18
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Add a circuit breaker that automatically disables models after repeated API failures, preventing wasted calls and improving failover speed.

Error Message

Repeated error messages in logs:

[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable

Error classification should map errors to reasons (rate_limit, overloaded, auth, network, timeout, etc.)

Root Cause

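When a model's API becomes unavailable or starts failing repeatedly, OpenClaw keeps trying it on every request. There is no model-level circuit breaker to disable the model after repeated API failures, so each request pays for a failed call before falling back.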

PR fix notes

PR #55736: feat: security audit dmScope check + configurable circuit breaker

Description (problem / solution / changelog)

Summary

  • #55578: security audit now warns when session.dmScope="main" with multi-user DM channels enabled. Detects the cross-user context leak risk that was previously only flagged in per-channel checks, and provides actionable remediation (openclaw config set session.dmScope "per-channel-peer").

  • #55536 — Exposes two new config options under auth.cooldowns to make the transient failure circuit breaker configurable:

    • transientFailureThreshold (default: 3) — consecutive failures before max cooldown kicks in
    • transientCooldownMinutes (default: 5) — max cooldown duration once threshold is reached

    The existing stepped backoff (30s / 1m / 5m) is preserved as the default. Users can now tune these for their deployment (e.g., { transientFailureThreshold: 5, transientCooldownMinutes: 30 } to tolerate more failures before a longer maximum cooldown applies).
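
To make the interaction between the two options concrete, here is a minimal sketch of a stepped backoff in Python. It is not the code from src/agents/auth-profiles/usage.ts: the function name `transient_cooldown_seconds` is hypothetical, and the behavior between the second failure and the threshold (holding at 1 minute) is an assumption, since the PR notes only state the default steps and the two options.

```python
def transient_cooldown_seconds(failure_count, threshold=3, max_cooldown_minutes=5):
    """Return the cooldown in seconds after `failure_count` consecutive failures.

    Hypothetical helper: `threshold` and `max_cooldown_minutes` stand in for
    transientFailureThreshold and transientCooldownMinutes.
    """
    if failure_count <= 0:
        return 0
    if failure_count >= threshold:
        # Threshold reached: apply the configured maximum cooldown.
        return max_cooldown_minutes * 60
    # Below the threshold: short steps of 30s, then 1m (assumed to hold at 1m).
    early_steps = [30, 60]
    return early_steps[min(failure_count, len(early_steps)) - 1]


# Defaults reproduce the 30s / 1m / 5m steps.
defaults = [transient_cooldown_seconds(n) for n in (1, 2, 3, 4)]

# Tuned values: more tolerance, then a 30-minute cooldown.
tuned = [
    transient_cooldown_seconds(n, threshold=5, max_cooldown_minutes=30)
    for n in (1, 2, 3, 4, 5)
]
```

Under these assumptions, `defaults` is `[30, 60, 300, 300]` and `tuned` is `[30, 60, 60, 60, 1800]`.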

Test plan

  • pnpm vitest run src/agents/auth-profiles.markauthprofilefailure.test.ts — 12 tests pass (2 new for custom thresholds)
  • pnpm tsc --noEmit — clean
  • Manual: run openclaw security audit with session.dmScope="main" and a channel enabled — should emit session.dm_scope_main warning
  • Manual: set auth.cooldowns.transientCooldownMinutes: 30 and trigger 3+ failures — cooldown should be 30 minutes

Changed files

  • docs/.generated/config-baseline.json (modified, +20/-0)
  • docs/.generated/config-baseline.jsonl (modified, +3/-1)
  • src/agents/auth-profiles.markauthprofilefailure.test.ts (modified, +17/-0)
  • src/agents/auth-profiles/usage.ts (modified, +28/-6)
  • src/config/schema.base.generated.ts (modified, +9/-0)
  • src/config/types.auth.ts (modified, +12/-0)
  • src/config/zod-schema.ts (modified, +2/-0)
  • src/security/audit-extra.sync.ts (modified, +33/-0)
  • src/security/audit-extra.ts (modified, +1/-0)
  • src/security/audit.nondeep.runtime.ts (modified, +1/-0)
  • src/security/audit.ts (modified, +1/-0)

Summary

Add a circuit breaker that automatically disables models after repeated API failures, preventing wasted calls and improving failover speed.

Problem to solve

When a model's API becomes unavailable or starts failing repeatedly, OpenClaw continues trying it on every request. This causes:

  1. Wasted time - Each failed call adds latency before falling back
  2. Wasted API calls - Repeatedly hitting a failing endpoint
  3. Poor user experience - Users see errors or delays when the primary model is down

This is especially problematic during provider outages, rate limit exhaustion, or persistent network issues. Current behavior requires manual intervention or waiting for timeouts on each failed attempt.

Proposed solution

Add a Model Circuit Breaker feature:

  1. Track consecutive API failures per model (provider/model combination)
  2. After N consecutive failures (configurable, default 3), "open" the circuit and stop trying that model
  3. Keep the circuit open for a configurable cooldown period (default 24 hours)
  4. Auto-recover after cooldown expires
  5. A successful call resets the failure counter

Configuration in openclaw.json:

{
  "models": {
    "circuitBreaker": {
      "enabled": true,
      "failureThreshold": 3,
      "cooldownMinutes": 1440
    }
  }
}
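
As a rough illustration of how such a config block could be read with defaults, the sketch below parses the JSON and fills in missing keys. The function name `load_circuit_breaker_config` is hypothetical, and defaulting `enabled` to false is an assumption drawn from the requirement that the feature be opt-in.

```python
import json

# Defaults from the proposal; "enabled" is False so the feature stays opt-in.
DEFAULTS = {"enabled": False, "failureThreshold": 3, "cooldownMinutes": 1440}


def load_circuit_breaker_config(raw_json):
    """Return the models.circuitBreaker settings merged over the defaults."""
    config = json.loads(raw_json)
    section = config.get("models", {}).get("circuitBreaker", {})
    merged = {**DEFAULTS, **section}

    if not isinstance(merged["enabled"], bool):
        raise ValueError("enabled must be true or false")
    for key in ("failureThreshold", "cooldownMinutes"):
        value = merged[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
    return merged


settings = load_circuit_breaker_config(
    '{"models": {"circuitBreaker": {"enabled": true, "failureThreshold": 5}}}'
)
```

Here `settings` keeps the default `cooldownMinutes` of 1440 because the input omits it, and an empty config leaves the breaker disabled.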

Alternatives considered

  • Rely on provider-level cooldown only - Existing auth profile cooldown doesn't address model-level failures
  • External monitoring script - Less integrated, requires separate process management
  • Longer timeouts - Wastes time waiting for known-bad endpoints

Impact

Affected users: All OpenClaw users who use multiple models or have fallback configurations

Severity: Medium - Blocks workflow during outages, causes frustration

Frequency: Intermittent - Occurs during provider issues, rate limiting, or network problems

Consequences:

  • 10-30 seconds wasted per failed API call before fallback
  • Repeated error messages in logs
  • Users may think OpenClaw is broken when it's just a provider issue
  • Extra manual work to switch models during outages

Evidence/examples

[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
# ... continues for every request

Additional information

  • Should integrate with existing runWithModelFallback function
  • Error classification should map errors to reasons (rate_limit, overloaded, auth, network, timeout, etc.)
  • Must remain backward-compatible - feature is opt-in via config
  • Consider adding metrics/logging for circuit state changes
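
The error classification mentioned above could look roughly like the following sketch. The function name `classify_error`, the status-code mapping, and the choice of which reasons count toward the breaker are all assumptions for illustration; the issue only lists the reason names.

```python
def classify_error(status_code=None, exc=None):
    """Map an HTTP status code or a Python exception to a failure reason."""
    if status_code == 429:
        return "rate_limit"
    if status_code in (401, 403):
        return "auth"
    if status_code in (502, 503):
        return "overloaded"
    if status_code in (408, 504):
        return "timeout"
    # Check TimeoutError before ConnectionError; both derive from OSError.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "network"
    return "unknown"


# Assumption: only transient reasons should count toward opening the circuit;
# an auth failure will not fix itself after a cooldown.
TRANSIENT_REASONS = {"rate_limit", "overloaded", "network", "timeout"}


def counts_toward_breaker(reason):
    return reason in TRANSIENT_REASONS


reason = classify_error(status_code=503)
```

With this mapping, the 503 Service Unavailable errors from the log excerpt classify as `overloaded` and would count toward the failure threshold.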

extent analysis

Fix Plan

To implement the Model Circuit Breaker feature, follow these steps:

*   **Step 1: Configure Circuit Breaker**
    *   Add the following configuration to `openclaw.json`:

    `{ "models": { "circuitBreaker": { "enabled": true, "failureThreshold": 3, "cooldownMinutes": 1440 } } }`

*   **Step 2: Implement Circuit Breaker Logic**
    *   Create a `CircuitBreaker` class to track consecutive failures and manage the circuit state:
    ```python
from datetime import datetime, timedelta


class CircuitBreaker:
    """Tracks consecutive failures for one model and opens the circuit at a threshold."""

    def __init__(self, failure_threshold, cooldown_minutes):
        self.failure_threshold = failure_threshold
        self.cooldown_minutes = cooldown_minutes
        self.failure_count = 0
        self.circuit_open = False
        self.cooldown_expires = None

    def is_circuit_open(self):
        # Apply auto-recovery first so an expired cooldown never blocks a call.
        self.update()
        return self.circuit_open

    def increment_failure_count(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.open_circuit()

    def open_circuit(self):
        self.circuit_open = True
        self.cooldown_expires = datetime.now() + timedelta(minutes=self.cooldown_minutes)

    def reset(self):
        # Called after a successful call or once the cooldown has expired.
        self.failure_count = 0
        self.circuit_open = False
        self.cooldown_expires = None

    def update(self):
        # Close the circuit once the cooldown period has elapsed.
        if self.circuit_open and datetime.now() >= self.cooldown_expires:
            self.reset()
    ```
*   **Step 3: Integrate with Existing Code**
    *   Modify the `runWithModelFallback` function to check the circuit breaker state before making API calls:
    ```python
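# One breaker per model, kept across calls so failure counts persist.
# CircuitBreaker is the class from Step 2.
_breakers = {}


def runWithModelFallback(model, make_api_call, fallback_model,
                         failure_threshold=3, cooldown_minutes=1440):
    # make_api_call and fallback_model are placeholder callables supplied by the caller.
    if model not in _breakers:
        _breakers[model] = CircuitBreaker(failure_threshold, cooldown_minutes)
    circuit_breaker = _breakers[model]

    # Close the circuit again if the cooldown has expired.
    circuit_breaker.update()
    if circuit_breaker.is_circuit_open():
        # Circuit is open: skip the failing model and fall back immediately.
        return fallback_model()
    try:
        response = make_api_call(model)
    except Exception:
        # Count the failure; the circuit opens once the threshold is reached.
        circuit_breaker.increment_failure_count()
        return fallback_model()
    # A successful call resets the failure counter.
    circuit_breaker.reset()
    return response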
    ```
*   **Step 4: Add Error Classification and Metrics**
    *   Implement error classification to map errors to reasons (e.g., rate limit, overloaded, auth, network, timeout, etc.)
    *   Add metrics/logging for circuit state changes to monitor the effectiveness of the circuit breaker
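
For the logging half of Step 4, one possible shape is a helper that logs only real state changes, so a breaker that stays open does not flood the log on every request. The logger name and the `log_circuit_transition` function are hypothetical, not part of OpenClaw.

```python
import logging

logger = logging.getLogger("circuit_breaker")


def log_circuit_transition(model, was_open, is_open, failure_count, reason=None):
    """Log a circuit state change and return the event name, or None if unchanged."""
    if was_open == is_open:
        return None
    if is_open:
        event = "circuit_opened"
        logger.warning(
            "%s: %s after %d consecutive failures (reason=%s)",
            event, model, failure_count, reason,
        )
    else:
        event = "circuit_closed"
        logger.info("%s: %s is eligible for calls again", event, model)
    return event


event = log_circuit_transition(
    "astroncodingplan/astron-code-latest",
    was_open=False,
    is_open=True,
    failure_count=3,
    reason="overloaded",
)
```

Returning the event name keeps the helper easy to test and leaves room to increment a metrics counter from the same call site.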

### Verification
To verify that the fix worked:

*   Test the circuit breaker with a failing API endpoint and verify that it opens the circuit after the specified number of failures
*   Verify that the circuit breaker resets after a successful API call
*   Test
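
The checks above can be made deterministic by injecting a clock instead of waiting for a real cooldown. The sketch below uses a small stand-in breaker with a fake clock; `FakeClock`, `TestableBreaker`, and `run_checks` are illustrative names, and the stand-in mirrors the proposal rather than any shipped OpenClaw code.

```python
class FakeClock:
    """A manually advanced clock, in seconds."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class TestableBreaker:
    def __init__(self, failure_threshold, cooldown_seconds, clock):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.open_until = None

    def allow(self):
        # Auto-recover once the cooldown has expired.
        if self.open_until is not None and self.clock() >= self.open_until:
            self.record_success()
        return self.open_until is None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open_until = self.clock() + self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.open_until = None


def run_checks():
    clock = FakeClock()
    breaker = TestableBreaker(3, 24 * 60 * 60, clock)

    # Two failures stay below the threshold; the third opens the circuit.
    breaker.record_failure()
    breaker.record_failure()
    assert breaker.allow()
    breaker.record_failure()
    assert not breaker.allow()

    # After the cooldown, the circuit closes on its own.
    clock.advance(24 * 60 * 60)
    assert breaker.allow()

    # A success between failures resets the counter.
    breaker.record_failure()
    breaker.record_failure()
    breaker.record_success()
    breaker.record_failure()
    breaker.record_failure()
    assert breaker.allow()
    return True
```

Calling `run_checks()` exercises opening at the threshold, auto-recovery after the cooldown, and the reset on success without any real waiting.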
