openclaw - ✅(Solved) Fix Model Circuit Breaker: Auto-disable failing models [1 pull requests, 1 comments, 2 participants]

openclaw · 2026-03-27 03:46:26


GitHub stats

openclaw/openclaw#55536 • Fetched 2026-04-08 01:38:18


Author

kkkgkg

Participants

kkkgkg

sophiaashi

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Add a circuit breaker that automatically disables models after repeated API failures, preventing wasted calls and improving failover speed.

Error Message

Repeated error messages in logs:

[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable

Error classification should map errors to reasons (rate_limit, overloaded, auth, network, timeout, etc.)

Root Cause

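OpenClaw keeps trying a model on every request even after its API has become unavailable or started failing repeatedly, so each request pays the latency of a failed call before falling back. The issue asks for a circuit breaker that automatically disables such models after repeated failures.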

PR fix notes

PR #55736: feat: security audit dmScope check + configurable circuit breaker

Repository: openclaw/openclaw
Author: ayushozha
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/55736

Description (problem / solution / changelog)

Summary

#55578 — security audit now warns when session.dmScope="main" with multi-user DM channels enabled. Detects the cross-user context leak risk that was previously only flagged in per-channel checks, and provides actionable remediation (openclaw config set session.dmScope "per-channel-peer").
#55536 — Exposes two new config options under auth.cooldowns to make the transient failure circuit breaker configurable:
- transientFailureThreshold (default: 3) — consecutive failures before max cooldown kicks in
- transientCooldownMinutes (default: 5) — max cooldown duration once threshold is reached
The existing stepped backoff (30s / 1m / 5m) is preserved as the default. Users can now tune these for their deployment (e.g., { transientFailureThreshold: 5, transientCooldownMinutes: 30 } for longer cooldowns after more tolerance).
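How the two options could interact with the stepped backoff can be sketched as follows. This is a minimal illustration, not the logic in `src/agents/auth-profiles/usage.ts`: the function name, the constant, and the way the 30s and 1m steps are held until the threshold is reached are all assumptions.

```python
# Hypothetical sketch: names and the exact step mapping are assumptions,
# not OpenClaw's actual implementation.
STEPPED_BACKOFF_SECONDS = (30, 60)  # steps applied before the threshold is reached


def cooldown_seconds(consecutive_failures, threshold=3, max_cooldown_minutes=5):
    """Return the cooldown after the given number of consecutive failures."""
    if consecutive_failures <= 0:
        return 0
    if consecutive_failures >= threshold:
        # Threshold reached: apply the configured maximum cooldown.
        return max_cooldown_minutes * 60
    # Below the threshold: walk the stepped backoff, holding at the last step.
    index = min(consecutive_failures - 1, len(STEPPED_BACKOFF_SECONDS) - 1)
    return STEPPED_BACKOFF_SECONDS[index]
```

With the defaults this reproduces the 30s / 1m / 5m sequence; with `threshold=5` and `max_cooldown_minutes=30`, the fifth consecutive failure would yield a 30-minute cooldown.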

Test plan

pnpm vitest run src/agents/auth-profiles.markauthprofilefailure.test.ts — 12 tests pass (2 new for custom thresholds)
pnpm tsc --noEmit — clean
Manual: run openclaw security audit with session.dmScope="main" and a channel enabled — should emit session.dm_scope_main warning
Manual: set auth.cooldowns.transientCooldownMinutes: 30 and trigger 3+ failures — cooldown should be 30 minutes

Changed files

docs/.generated/config-baseline.json (modified, +20/-0)
docs/.generated/config-baseline.jsonl (modified, +3/-1)
src/agents/auth-profiles.markauthprofilefailure.test.ts (modified, +17/-0)
src/agents/auth-profiles/usage.ts (modified, +28/-6)
src/config/schema.base.generated.ts (modified, +9/-0)
src/config/types.auth.ts (modified, +12/-0)
src/config/zod-schema.ts (modified, +2/-0)
src/security/audit-extra.sync.ts (modified, +33/-0)
src/security/audit-extra.ts (modified, +1/-0)
src/security/audit.nondeep.runtime.ts (modified, +1/-0)
src/security/audit.ts (modified, +1/-0)


Summary

Add a circuit breaker that automatically disables models after repeated API failures, preventing wasted calls and improving failover speed.

Problem to solve

When a model's API becomes unavailable or starts failing repeatedly, OpenClaw continues trying it on every request. This causes:

- Wasted time: each failed call adds latency before falling back
- Wasted API calls: the failing endpoint is hit repeatedly
- Poor user experience: users see errors or delays when the primary model is down

This is especially problematic during provider outages, rate limit exhaustion, or persistent network issues. Current behavior requires manual intervention or waiting for timeouts on each failed attempt.

Proposed solution

Add a Model Circuit Breaker feature:

- Track consecutive API failures per model (provider/model combination)
- After N consecutive failures (configurable, default 3), "open" the circuit and stop trying that model
- Keep the circuit open for a configurable cooldown period (default 24 hours)
- Auto-recover after the cooldown expires
- A successful call resets the failure counter

Configuration in openclaw.json:

{
  "models": {
    "circuitBreaker": {
      "enabled": true,
      "failureThreshold": 3,
      "cooldownMinutes": 1440
    }
  }
}
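As a rough sketch of how the proposed section could be read with its defaults, the snippet below parses the `models.circuitBreaker` block. The class and function names are hypothetical, the input is assumed to be plain JSON text, and `enabled` defaults to off because the issue describes the feature as opt-in.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerConfig:
    enabled: bool = False         # opt-in: off unless the config enables it
    failure_threshold: int = 3    # default from the proposal
    cooldown_minutes: int = 1440  # 24 hours, default from the proposal


def load_circuit_breaker_config(raw_json):
    """Parse the models.circuitBreaker section, falling back to defaults."""
    section = json.loads(raw_json).get("models", {}).get("circuitBreaker", {})
    config = CircuitBreakerConfig(
        enabled=bool(section.get("enabled", False)),
        failure_threshold=int(section.get("failureThreshold", 3)),
        cooldown_minutes=int(section.get("cooldownMinutes", 1440)),
    )
    if config.failure_threshold < 1 or config.cooldown_minutes < 0:
        raise ValueError("failureThreshold must be >= 1 and cooldownMinutes >= 0")
    return config
```

A config file without the section yields the disabled default, which keeps existing setups unchanged.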

### Alternatives considered

- Rely on provider-level cooldown only: the existing auth profile cooldown doesn't address model-level failures
- External monitoring script: less integrated, requires separate process management
- Longer timeouts: wastes time waiting for known-bad endpoints

### Impact

Affected users: All OpenClaw users who use multiple models or have fallback configurations

Severity: Medium - Blocks workflow during outages, causes frustration

Frequency: Intermittent - Occurs during provider issues, rate limiting, or network problems

Consequences:

- 10-30 seconds wasted per failed API call before fallback
- Repeated error messages in logs
- Users may think OpenClaw is broken when it's just a provider issue
- Extra manual work to switch models during outages

### Evidence/examples

[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable  
[error] Model call failed: astroncodingplan/astron-code-latest - 503 Service Unavailable
# ... continues for every request

### Additional information

- Should integrate with the existing `runWithModelFallback` function
- Error classification should map errors to reasons (rate_limit, overloaded, auth, network, timeout, etc.)
- Must remain backward-compatible: the feature is opt-in via config
- Consider adding metrics/logging for circuit state changes
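One way the error classification could look is sketched below. The reason names come from the issue; the status-code table, the `"unknown"` fallback, and the function name are assumptions made for illustration.

```python
# Hypothetical mapping from HTTP status codes to the reasons named in the issue.
STATUS_REASONS = {
    401: "auth",
    403: "auth",
    408: "timeout",
    429: "rate_limit",
    502: "overloaded",  # assumption: treat bad-gateway like an overloaded provider
    503: "overloaded",
    504: "timeout",
}


def classify_error(status_code=None, exc=None):
    """Map an HTTP status or a raised exception to a failure reason."""
    if status_code in STATUS_REASONS:
        return STATUS_REASONS[status_code]
    # TimeoutError is a subclass of OSError, so it must be checked first.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, OSError):
        return "network"  # covers ConnectionError and other socket-level failures
    return "unknown"
```

For the log lines quoted above, `classify_error(status_code=503)` returns `"overloaded"`. A real integration would also need to decide which reasons count toward opening the circuit.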

extent analysis

Fix Plan

To implement the Model Circuit Breaker feature, follow these steps:

*   **Step 1: Configure Circuit Breaker**
    *   Add the following configuration to `openclaw.json`:

        { "models": { "circuitBreaker": { "enabled": true, "failureThreshold": 3, "cooldownMinutes": 1440 } } }

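*   **Step 2: Implement Circuit Breaker Logic**
    *   Create a `CircuitBreaker` class to track consecutive failures and manage the circuit state. The code is illustrative Python rather than OpenClaw source:
    ```python
    from datetime import datetime, timedelta


    class CircuitBreaker:
        """Tracks consecutive failures for one provider/model combination."""

        def __init__(self, failure_threshold, cooldown_minutes):
            self.failure_threshold = failure_threshold
            self.cooldown_minutes = cooldown_minutes
            self.failure_count = 0
            self.circuit_open = False
            self.cooldown_expires = None

        def is_circuit_open(self):
            # Close the circuit first if the cooldown has already expired.
            self.update()
            return self.circuit_open

        def increment_failure_count(self):
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.open_circuit()

        def open_circuit(self):
            self.circuit_open = True
            self.cooldown_expires = datetime.now() + timedelta(minutes=self.cooldown_minutes)

        def reset(self):
            # Called after a successful call or once the cooldown expires.
            self.failure_count = 0
            self.circuit_open = False
            self.cooldown_expires = None

        def update(self):
            # Auto-recover: re-enable the model once the cooldown is over.
            if self.circuit_open and datetime.now() >= self.cooldown_expires:
                self.reset()
    ```
*   **Step 3: Integrate with Existing Code**
    *   Modify the `runWithModelFallback` function to check the circuit breaker state before making API calls. Breakers must persist across calls, so they are kept in a registry keyed by model instead of being created per request:
    ```python
    # One breaker per provider/model key, kept at module level so that
    # failure counts survive across requests.
    _breakers = {}


    def get_breaker(model, failure_threshold=3, cooldown_minutes=1440):
        if model not in _breakers:
            _breakers[model] = CircuitBreaker(failure_threshold, cooldown_minutes)
        return _breakers[model]


    def runWithModelFallback(models, make_api_call):
        """Try each model in order, skipping any whose circuit is open."""
        last_error = None
        for model in models:
            breaker = get_breaker(model)
            breaker.update()  # close the circuit if its cooldown has expired
            if breaker.is_circuit_open():
                continue  # skip a known-bad model without spending an API call
            try:
                response = make_api_call(model)
            except Exception as exc:
                breaker.increment_failure_count()
                last_error = exc
                continue  # fall back to the next model
            breaker.reset()
            return response
        raise RuntimeError("all models failed or are disabled") from last_error
    ```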
*   **Step 4: Add Error Classification and Metrics**
    *   Implement error classification to map errors to reasons (e.g., rate limit, overloaded, auth, network, timeout, etc.)
    *   Add metrics/logging for circuit state changes to monitor the effectiveness of the circuit breaker
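The logging and metrics part of Step 4 could be sketched as a small recorder that logs each circuit state change and counts transitions. The class name, logger name, and state labels are hypothetical; only the standard `logging` and `collections` modules are used.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")  # hypothetical logger name


class CircuitStateRecorder:
    """Logs circuit transitions and keeps simple per-transition counters."""

    def __init__(self):
        self.states = {}              # model key -> "closed" or "open"
        self.transitions = Counter()  # (model, old_state, new_state) -> count

    def record(self, model, new_state, reason=None):
        """Record a state for a model; return True only if the state changed."""
        old_state = self.states.get(model, "closed")
        if new_state == old_state:
            return False  # no change, nothing to log
        self.states[model] = new_state
        self.transitions[(model, old_state, new_state)] += 1
        logger.warning(
            "circuit %s: %s -> %s (reason=%s)", model, old_state, new_state, reason
        )
        return True
```

Calling `record(model, "open", reason)` when a breaker trips and `record(model, "closed")` when it resets gives one log line per change instead of one per failed request.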

### Verification
To verify that the fix worked:

*   Test the circuit breaker with a failing API endpoint and verify that it opens the circuit after the specified number of failures
*   Verify that the circuit breaker resets after a successful API call
*   Test
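The first two checks can be exercised without waiting for a real cooldown by injecting a fake clock. The breaker below is a deliberately minimal stand-in written for this test sketch; its names and the use of seconds instead of minutes are assumptions.

```python
import time


class TimedBreaker:
    """Minimal breaker with an injectable clock so tests need not sleep."""

    def __init__(self, failure_threshold=3, cooldown_seconds=86400, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.open_until = None

    def is_open(self):
        if self.open_until is not None and self.clock() >= self.open_until:
            self.failures, self.open_until = 0, None  # cooldown expired: auto-recover
        return self.open_until is not None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.open_until = self.clock() + self.cooldown_seconds

    def record_success(self):
        self.failures, self.open_until = 0, None


def check_breaker_behaviour():
    now = [0.0]  # fake clock value, advanced manually
    breaker = TimedBreaker(failure_threshold=3, cooldown_seconds=60, clock=lambda: now[0])
    breaker.record_failure()
    breaker.record_failure()
    assert not breaker.is_open()  # still below the threshold
    breaker.record_failure()
    assert breaker.is_open()      # third consecutive failure opens the circuit
    now[0] = 59.0
    assert breaker.is_open()      # cooldown not yet over
    now[0] = 60.0
    assert not breaker.is_open()  # auto-recovers once the cooldown expires
    breaker.record_failure()
    breaker.record_success()
    assert breaker.failures == 0  # a success resets the counter
    return True
```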
