openclaw - 💡(How to fix) Fix [Bug]: Error masking + global circuit breaker causes total outage when any single provider fails [2 comments, 3 participants]

GitHub stats
openclaw/openclaw#48988, fetched 2026-04-08 00:50:06
Comments: 2
Participants: 3
Timeline: 4
Reactions: 0
Timeline (top): commented ×2, cross-referenced ×1, referenced ×1

Environment

  • OpenClaw Version: 2026.3.8
  • OS: Ubuntu (Docker container on VPS)
  • Channels: Telegram (2 bots)
  • Providers: Z.ai (GLM-4.7), Groq (Llama 3.3 70B), Ollama (Qwen 3)

Problem Summary

A single misconfigured provider (wrong Z.ai endpoint) cascaded into total outage across ALL providers, including local Ollama which has no rate limits. The root cause is two compounding behaviors:

  1. Error masking: ALL provider errors (auth failures, wrong endpoints, plan limitations) are mapped to generic "API rate limit reached"
  2. Global circuit breaker: After repeated "rate limit" errors from one provider, the circuit breaker marks ALL providers as rate-limited, including unrelated ones

This is the same class of bug as #43447 (Kimi/Moonshot rate limit masking), but with a more severe cascade path.

Cascade Sequence (observed)

GLM-5 request → error 1311 (model not on plan) → mapped to "rate limit"
  → fallback to GLM-4.7 → wrong endpoint → error 1113 → mapped to "rate limit"
    → fallback to Groq → retry storm from above failures → actual rate limit
      → fallback to Ollama → circuit breaker rejects instantly (214ms, no actual API call)
        → TOTAL FAILURE: all 4 models in the fallback chain (3 providers) report "rate limited"

Evidence: Ollama runs locally (localhost:11434, air-gapped Docker network) and has no rate limiting of its own. The 214ms rejection time indicates that the circuit breaker rejected the request without making any network call.
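
The cascade can be reproduced in a toy model. The sketch below is not OpenClaw code: `GlobalBreaker`, `run_chain`, and the threshold of 3 are assumptions made for illustration (the issue does not state N). It only shows how one shared failure counter plus a masked error message leads to a healthy local provider being rejected without a call.

```python
# Toy model of the reported behavior; none of these names come from OpenClaw.
GLOBAL_FAILURE_THRESHOLD = 3  # assumed value, the issue does not state N


class GlobalBreaker:
    """A single failure counter shared by every provider (the reported bug)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    def is_open(self):
        return self.failures >= self.threshold

    def record_failure(self):
        self.failures += 1


def run_chain(chain, breaker):
    """Try each (name, call) pair in order and log what happened."""
    log = []
    for name, call in chain:
        if breaker.is_open():
            log.append((name, "rejected by breaker, no call made"))
            continue
        try:
            call()
            log.append((name, "ok"))
            return log
        except Exception:
            # Error masking: the real cause is discarded here.
            breaker.record_failure()
            log.append((name, "API rate limit reached"))
    return log


def failing(message):
    """Build a provider call that always raises with the given message."""
    def call():
        raise RuntimeError(message)
    return call


chain = [
    ("glm-5", failing("1311 model not on plan")),
    ("glm-4.7", failing("1113 wrong endpoint")),
    ("groq", failing("429 too many requests")),
    ("ollama", lambda: "local reply"),  # healthy, but never reached
]

for name, outcome in run_chain(chain, GlobalBreaker(GLOBAL_FAILURE_THRESHOLD)):
    print(f"{name}: {outcome}")
```

In this model the Ollama entry is never invoked: the shared counter reaches the threshold on the third failure, so the fourth entry is rejected before its callable runs.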

Root Causes

RC-1: Error masking (same as #43447)

All provider errors are mapped to "API rate limit reached" regardless of actual error type:

  • Provider error code 1311 (plan limitation) → "rate limit"
  • Provider error code 1113 (wrong endpoint/auth) → "rate limit"
  • HTTP 429 (actual rate limit) → "rate limit"
  • Connection refused → "rate limit"

This prevents users from diagnosing the actual problem and prevents the circuit breaker from distinguishing recoverable vs permanent failures.
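
A minimal sketch of what unmasked classification could look like, assuming a hypothetical `classify` helper and `ProviderError` type (neither exists in OpenClaw as far as this issue shows). The codes come from the issue; the `retryable` flags are an assumption about which failures are recoverable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    PLAN_LIMITATION = "plan limitation"
    AUTH = "auth failure"
    RATE_LIMIT = "rate limit"
    NETWORK = "network"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class ProviderError:
    category: ErrorCategory
    code: Optional[int]  # original code kept for diagnosis, None for network errors
    retryable: bool      # assumption: only rate limits and network errors recover

    def describe(self):
        suffix = f" ({self.code})" if self.code is not None else ""
        return f"{self.category.value}{suffix}"


# Codes taken from the issue; the category assignments mirror its wording.
_CODE_TABLE = {
    1311: (ErrorCategory.PLAN_LIMITATION, False),
    1113: (ErrorCategory.AUTH, False),
    429: (ErrorCategory.RATE_LIMIT, True),
}


def classify(code=None, exc=None):
    """Classify a provider failure without collapsing it to 'rate limit'."""
    if isinstance(exc, ConnectionError):
        return ProviderError(ErrorCategory.NETWORK, None, True)
    category, retryable = _CODE_TABLE.get(code, (ErrorCategory.UNKNOWN, False))
    return ProviderError(category, code, retryable)


print(classify(1311).describe())
print(classify(exc=ConnectionRefusedError()).describe())
```

Keeping the original code next to the category lets a user see "plan limitation (1311)" directly, and gives a circuit breaker a `retryable` flag to act on instead of a single generic string.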

RC-2: Global circuit breaker scope

The circuit breaker appears to track failures globally rather than per-provider. After N failures from Provider A, Provider B and Provider C also get rejected without attempting a real API call. This turns any single-provider outage into a total outage.
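
For contrast, a per-provider scope can be sketched as a failure counter keyed by provider ID. `ProviderBreakers` and its threshold are hypothetical names and values, not OpenClaw's implementation.

```python
class ProviderBreakers:
    """Failure counters keyed by provider ID; all names here are hypothetical."""

    def __init__(self, threshold=3):  # threshold is an assumed value
        self.threshold = threshold
        self._failures = {}

    def record_failure(self, provider_id):
        self._failures[provider_id] = self._failures.get(provider_id, 0) + 1

    def record_success(self, provider_id):
        # A success resets only this provider's counter.
        self._failures[provider_id] = 0

    def is_open(self, provider_id):
        return self._failures.get(provider_id, 0) >= self.threshold


breakers = ProviderBreakers(threshold=3)
for _ in range(3):
    breakers.record_failure("zai")

print(breakers.is_open("zai"), breakers.is_open("ollama"))
```

With this scope, three failures from Z.ai open only the Z.ai breaker; a provider that has never failed is still attempted.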

Impact

  • Severity: Critical (total service outage)
  • Duration: Until manual SSH intervention to fix config and restart
  • User experience: All Telegram bots unresponsive, "rate limit reached" on every message
  • Workaround: Only include verified-working models in the fallback chain. Exclude any model that might fail.

Expected Behavior

  1. Preserve original error codes: "Plan limitation (1311)" not "rate limit". "Auth failure (1113)" not "rate limit".
  2. Per-provider circuit breaker: Failures from Z.ai should not affect Groq or Ollama circuit breaker state.
  3. Graceful degradation: If primary fails, fallback should be attempted with a clean circuit breaker state.
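
The expected fallback behavior can be sketched as a loop that consults only each provider's own breaker and keeps every real failure reason. The `(provider_id, callable)` chain shape, `complete_with_fallback`, and `AllProvidersFailed` are illustrative assumptions.

```python
class AllProvidersFailed(Exception):
    """Raised with every provider's real failure reason attached."""

    def __init__(self, reasons):
        super().__init__("; ".join(f"{p}: {r}" for p, r in reasons))
        self.reasons = reasons


def complete_with_fallback(prompt, chain, open_breakers):
    """Return (provider_id, reply) from the first provider that answers.

    chain: list of (provider_id, callable) pairs, a hypothetical shape.
    open_breakers: set of provider IDs whose own breaker is currently open.
    """
    reasons = []
    for provider_id, call in chain:
        if provider_id in open_breakers:
            reasons.append((provider_id, "circuit open"))
            continue
        try:
            return provider_id, call(prompt)
        except Exception as exc:
            # Keep the original message instead of a generic "rate limit".
            reasons.append((provider_id, str(exc)))
    raise AllProvidersFailed(reasons)


def zai(prompt):
    raise RuntimeError("Auth failure (1113)")


def ollama(prompt):
    return f"echo: {prompt}"


provider, reply = complete_with_fallback(
    "hi", [("zai", zai), ("ollama", ollama)], open_breakers=set()
)
print(provider, reply)
```

Here the Z.ai failure is recorded and the request still reaches the local provider. If every entry fails, the raised exception lists each provider's own reason.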

Proposed Fix

  1. Map provider errors to their actual categories (auth, billing, rate limit, network, plan limitation)
  2. Scope circuit breaker state per provider ID, not globally
  3. Only trigger circuit breaker on actual 429 responses, not on auth/config errors
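
Point 3 can be combined with a cooldown so that a tripped breaker recovers without a manual restart. The sketch below assumes one `RateLimitBreaker` instance per provider; the class name, the 30-second cooldown, and the injectable clock are illustrative choices, not part of the issue.

```python
import time


class RateLimitBreaker:
    """Opens only on HTTP 429 and lets traffic through again after a cooldown.

    One instance per provider is assumed; cooldown_s is an illustrative value.
    """

    def __init__(self, cooldown_s=30.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at = None

    def record(self, status_code):
        if status_code == 429:
            self._opened_at = self._clock()
        # Auth, plan, and config errors are ignored by the breaker.

    def allows_request(self):
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            self._opened_at = None  # cooldown elapsed, close the breaker
            return True
        return False


# A fake clock makes the behavior visible without waiting.
now = [0.0]
breaker = RateLimitBreaker(cooldown_s=30.0, clock=lambda: now[0])

breaker.record(1113)
print(breaker.allows_request())  # auth error: breaker stays closed

breaker.record(429)
print(breaker.allows_request())  # real rate limit: breaker is open

now[0] = 31.0
print(breaker.allows_request())  # cooldown elapsed: requests allowed again
```

Injecting the clock keeps the breaker testable. In production the default `time.monotonic` avoids jumps from wall-clock adjustments.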

Workaround (current)

Only include verified-working models in agents.defaults.model.fallbacks. Any model that returns errors (even temporarily) will cascade to take down all providers.
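
One way to apply this workaround is to probe each candidate once before writing the fallback list. The sketch below is an assumption-heavy illustration: `verified_fallbacks`, the `probe` callable, and the model ID strings are placeholders, and it does not read or write OpenClaw's configuration.

```python
def verified_fallbacks(candidates, probe):
    """Keep only models whose probe call succeeds.

    probe(model_id) is a hypothetical callable that sends one cheap test
    request and raises on any failure.
    """
    verified, rejected = [], {}
    for model_id in candidates:
        try:
            probe(model_id)
        except Exception as exc:
            rejected[model_id] = str(exc)
        else:
            verified.append(model_id)
    return verified, rejected


def fake_probe(model_id):
    # Stand-in for a real request; the failure mirrors the one in the issue.
    failures = {"zai/glm-5": "model not on plan (1311)"}
    if model_id in failures:
        raise RuntimeError(failures[model_id])


# Model IDs below are placeholders, not confirmed OpenClaw identifiers.
verified, rejected = verified_fallbacks(
    ["zai/glm-5", "groq/llama-3.3-70b", "ollama/qwen3"], fake_probe
)
print(verified)
print(rejected)
```

Only the `verified` list would then be copied into agents.defaults.model.fallbacks; the `rejected` map records why each excluded model failed.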

Related

  • #43447 — Same error masking bug (Kimi/Moonshot)
  • #34624 — Global circuit breaker for model failover chains
  • #47988 — Phase 6: LLM circuit breaker

extent analysis

Fix Plan

To address the issue, we need to implement the following changes:

  • Map provider errors to their actual categories:
    • Update the error mapping to preserve original error codes.
    • Create a dictionary to map error codes to their respective categories (e.g., auth, billing, rate limit, network, plan limitation).
  • Scope circuit breaker state per provider ID:
    • Update the circuit breaker to track failures per provider ID.
    • Use a dictionary to store the circuit breaker state for each provider.
  • Only trigger circuit breaker on actual 429 responses:
    • Update the circuit breaker to only trigger on 429 responses.
    • Ignore other error types (e.g., auth, config errors) when triggering the circuit breaker.

Example Code

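# Map provider error codes to categories instead of a generic "rate limit".
ERROR_CATEGORIES = {
    1311: "Plan limitation",
    1113: "Auth failure",
    429: "Rate limit exceeded",
}

# Circuit breaker state keyed by provider ID, so one provider cannot trip another.
circuit_breaker_open = {}


def update_circuit_breaker(provider_id, error_code):
    """Open the breaker for this provider only, and only on a real 429."""
    if error_code == 429:
        circuit_breaker_open[provider_id] = True
    # Auth, plan, and config errors leave the breaker state untouched.


def is_circuit_open(provider_id):
    """Providers with no recorded 429 are treated as healthy."""
    return circuit_breaker_open.get(provider_id, False)


def get_error_message(error_code):
    """Return the category together with the original error code."""
    category = ERROR_CATEGORIES.get(error_code, "Unknown error")
    return f"{category} ({error_code})"


# Example usage:
update_circuit_breaker("Z.ai", 1311)  # plan limitation: breaker stays closed
update_circuit_breaker("Groq", 429)   # real rate limit: only Groq's breaker opens
print(get_error_message(1311))        # Output: Plan limitation (1311)
print(is_circuit_open("Z.ai"), is_circuit_open("Groq"), is_circuit_open("Ollama"))  # Output: False True False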

Verification

To verify the fix, test the following scenarios:

  • Test a successful request to a provider.
  • Test a request that returns a 429 response (rate limit exceeded).
  • Test a request that returns a non-429 error response (e.g., auth failure, plan limitation).
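  • Verify that the circuit breaker is triggered only on 429 responses.
  • Verify that failures from one provider leave every other provider's circuit breaker state unchanged.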
  • Verify that the error messages are correctly mapped to their respective categories.
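
These checks can be written as unit tests. The sketch below uses Python's standard `unittest` module against minimal stand-ins (`breaker_open`, `record_error`), which are hypothetical. A real suite would import the project's own breaker and error mapper instead.

```python
import unittest

# Minimal stand-ins for the code under test (hypothetical names).
breaker_open = {}


def record_error(provider_id, code):
    """Open the breaker for one provider, and only on a 429."""
    if code == 429:
        breaker_open[provider_id] = True


class BreakerScenarios(unittest.TestCase):
    def setUp(self):
        # Each test starts from a clean breaker state.
        breaker_open.clear()

    def test_429_opens_only_that_provider(self):
        record_error("groq", 429)
        self.assertTrue(breaker_open.get("groq", False))
        self.assertFalse(breaker_open.get("ollama", False))

    def test_non_429_errors_leave_breaker_closed(self):
        for code in (1311, 1113):
            record_error("zai", code)
        self.assertFalse(breaker_open.get("zai", False))


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BreakerScenarios)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```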

Extra Tips

  • Regularly review and update the error mapping dictionary to ensure it covers all possible error codes.
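  • Consider retrying only transient failures (for example, network errors), with backoff. Auth and plan-limitation errors are permanent, and retrying them risks the kind of retry storm seen in the cascade above.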
  • Monitor the circuit breaker state and error messages to detect and diagnose issues with providers.
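
One way to keep retries from feeding a retry storm is to retry only failures explicitly marked as transient and to back off between attempts. `TransientError`, `call_with_retry`, and the delay values below are assumptions for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. network blips)."""


def call_with_retry(call, attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff; re-raise the rest."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last transient error
            sleep(base_delay_s * 2 ** attempt)
        # Any other exception (auth, plan limitation) propagates immediately.


calls = {"n": 0}


def flaky():
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("connection reset")
    return "ok"


delays = []
print(call_with_retry(flaky, sleep=delays.append))
print(delays)
```

Passing `sleep=delays.append` records the backoff delays instead of waiting, which makes the schedule easy to inspect.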
