openclaw - 💡(How to fix) Fix Feature Request: Provider Fallback Chains [1 participants]

GitHub stats
openclaw/openclaw#56908 · Fetched 2026-04-08 01:46:14
Comments: 0 · Participants: 1 · Timeline: 1 (cross-referenced ×1) · Reactions: 0


Feature Request: Provider Fallback Chains

Problem Statement

When the active LLM provider fails — rate limits (429), server errors (500/503), authentication expiry, or regional outages — OpenClaw surfaces the error to the user. The user must then manually run /model to switch providers and retry. This breaks flow, loses momentum, and in time-sensitive scenarios (monitoring, alerts, active debugging) can mean missed windows.

Provider outages and rate limits are not edge cases; they happen weekly across all major providers.

Proposed Solution

Fallback Chain Configuration

providers:
  fallback:
    - anthropic/claude-opus-4
    - openai/gpt-4o
    - anthropic/claude-sonnet-4
  retry:
    max_attempts: 2          # per provider before moving to next
    backoff_ms: 1000         # exponential backoff base
    trigger_codes: [429, 500, 502, 503, 504]
    trigger_on_auth_failure: true
    trigger_on_timeout: true  # provider response timeout
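One way the settings above might be represented inside the gateway is a small validated structure. This is only a sketch: the `RetryPolicy` and `FallbackConfig` names and the `from_dict` helper are hypothetical and not part of OpenClaw. The field names and defaults mirror the proposed YAML, and doubling the delay on each retry is an assumed reading of "exponential backoff base".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RetryPolicy:
    """Retry settings, mirroring the `retry:` keys in the proposed YAML."""
    max_attempts: int = 2
    backoff_ms: int = 1000
    trigger_codes: tuple = (429, 500, 502, 503, 504)
    trigger_on_auth_failure: bool = True
    trigger_on_timeout: bool = True

    def delay_ms(self, attempt: int) -> int:
        """Backoff before retry number `attempt` (1-based), doubling each time."""
        return self.backoff_ms * 2 ** (attempt - 1)


@dataclass(frozen=True)
class FallbackConfig:
    """Ordered provider chain plus its retry policy (hypothetical structure)."""
    fallback: tuple
    retry: RetryPolicy = field(default_factory=RetryPolicy)

    @classmethod
    def from_dict(cls, raw: dict) -> "FallbackConfig":
        """Build a config from the already-parsed `providers:` mapping."""
        chain = tuple(raw.get("fallback", ()))
        if not chain:
            raise ValueError("fallback chain must list at least one provider")
        retry_raw = dict(raw.get("retry", {}))
        if "trigger_codes" in retry_raw:
            retry_raw["trigger_codes"] = tuple(retry_raw["trigger_codes"])
        policy = RetryPolicy(**retry_raw)
        if policy.max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        return cls(fallback=chain, retry=policy)


config = FallbackConfig.from_dict({
    "fallback": ["anthropic/claude-opus-4", "openai/gpt-4o", "anthropic/claude-sonnet-4"],
    "retry": {"max_attempts": 2, "backoff_ms": 1000},
})
print(config.fallback[0], config.retry.delay_ms(2))
```

Omitted `retry` keys fall back to the defaults shown in the proposal, which is one possible way to approach the zero-config goal mentioned under Technical Considerations.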

Behavior

  1. Primary provider fails → retry with backoff (up to max_attempts).
  2. Retries exhausted → switch to next provider in chain.
  3. User sees a brief, non-intrusive note: ⚡ Switched to gpt-4o (claude-opus-4 rate limited).
  4. Conversation continues seamlessly — system prompt and tools are re-sent to new provider.
  5. When primary recovers, optionally switch back (configurable).
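The switching steps above (2, 3, and 5) can be sketched as a small piece of state that remembers which provider is active. Everything here is illustrative: `FallbackChain`, `advance`, and `primary_recovered` are hypothetical names, the "Switched back" wording is an assumption, and how the gateway would detect recovery is left open.

```python
from dataclasses import dataclass
from typing import Optional


def short_name(provider: str) -> str:
    """'openai/gpt-4o' -> 'gpt-4o' (display name used in the notice)."""
    return provider.split("/", 1)[-1]


@dataclass
class FallbackChain:
    """Tracks which provider in the chain is active (hypothetical helper)."""
    providers: list
    switch_back: bool = True   # step 5: return to primary once it recovers
    index: int = 0

    @property
    def active(self) -> str:
        return self.providers[self.index]

    def advance(self, reason: str) -> Optional[str]:
        """Step 2: move to the next provider and return the step-3 notice.

        Returns None when the chain is exhausted and nothing is left to try.
        """
        if self.index + 1 >= len(self.providers):
            return None
        failed = self.active
        self.index += 1
        return f"⚡ Switched to {short_name(self.active)} ({short_name(failed)} {reason})."

    def primary_recovered(self) -> Optional[str]:
        """Step 5: call when a health probe of the primary succeeds."""
        if not self.switch_back or self.index == 0:
            return None
        self.index = 0
        return f"⚡ Switched back to {short_name(self.active)}."


chain = FallbackChain(["anthropic/claude-opus-4", "openai/gpt-4o", "anthropic/claude-sonnet-4"])
print(chain.advance("rate limited"))
# prints: ⚡ Switched to gpt-4o (claude-opus-4 rate limited).
```

Keeping the notice text next to the state change means the user-facing message cannot drift from what the gateway actually did.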

Health Monitoring

  • Track provider response times and error rates per session.
  • Proactive degradation detection: if latency spikes >3x baseline, preemptively prepare fallback.
  • Surface provider health in /status.
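A rough sketch of the per-session tracking described above follows. It assumes the "baseline" is the median of recent latencies in a rolling window, which the proposal does not specify. `ProviderHealth`, its parameters, and the `summary` fields are hypothetical.

```python
from collections import deque
from statistics import median


class ProviderHealth:
    """Rolling latency and error statistics for one provider (hypothetical)."""

    def __init__(self, window: int = 50, spike_factor: float = 3.0, min_samples: int = 5):
        self.latencies_ms = deque(maxlen=window)
        self.outcomes = deque(maxlen=window)   # True = success, False = error
        self.spike_factor = spike_factor
        self.min_samples = min_samples

    def record(self, latency_ms: float, ok: bool) -> bool:
        """Record one request; return True if it looks like a latency spike.

        The new latency is compared against the median of earlier samples,
        so a single slow request does not raise its own baseline.
        """
        spike = (
            len(self.latencies_ms) >= self.min_samples
            and latency_ms > self.spike_factor * median(self.latencies_ms)
        )
        self.latencies_ms.append(latency_ms)
        self.outcomes.append(ok)
        return spike

    def error_rate(self) -> float:
        """Fraction of recorded requests in the window that failed."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def summary(self) -> dict:
        """Values a /status view could display."""
        return {
            "samples": len(self.latencies_ms),
            "median_latency_ms": median(self.latencies_ms) if self.latencies_ms else None,
            "error_rate": round(self.error_rate(), 3),
        }


health = ProviderHealth()
for ms in (400, 420, 380, 410, 390):
    health.record(ms, ok=True)
print(health.record(1500, ok=True))  # 1500 ms exceeds 3x the 400 ms median
```

A median baseline is less sensitive to one-off slow responses than a mean. Whether a spike should trigger an immediate switch or only warm up the fallback is a policy choice the proposal leaves configurable.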

User Impact

  • Reliability: Agent stays available even during provider outages.
  • Time-sensitive workflows: Heartbeats, monitoring, and alerts don't silently fail.
  • Cost optimization: Users can set expensive models as primary with cheap fallbacks, getting quality when available and reliability always.

Technical Considerations

  • Prompt portability: System prompts and tool definitions are already provider-agnostic in OpenClaw's gateway. Minor format differences (tool call schemas) are handled at the gateway layer.
  • Token counting: Different models have different tokenizers; compression thresholds need per-model awareness.
  • State continuity: Conversation history format may differ slightly between providers; gateway should normalize.
  • Config UX: Should work with zero config (sensible defaults based on configured providers) and full customization.

Priority

HIGH. Provider reliability is outside OpenClaw's control, but resilience to it is firmly within scope. This is a straightforward improvement with outsized impact on trust and uptime.

extent analysis

Fix Plan

To implement the provider fallback chain, follow these steps:

  • Update the configuration to include the fallback chain and retry settings.
  • Modify the provider selection logic to use the fallback chain when the primary provider fails.
  • Implement retry with exponential backoff for failed requests.
  • Add health monitoring to track provider response times and error rates.

Example configuration update:

providers:
  fallback:
    - anthropic/claude-opus-4
    - openai/gpt-4o
    - anthropic/claude-sonnet-4
  retry:
    max_attempts: 2
    backoff_ms: 1000
    trigger_codes: [429, 500, 502, 503, 504]
    trigger_on_auth_failure: true
    trigger_on_timeout: true

Example code for retry with exponential backoff:

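import random
import time

class RetryableError(Exception):
    """Raised when a response carries one of the configured trigger codes."""

def retry_with_backoff(make_request, max_attempts, backoff_ms, trigger_codes):
    """Call make_request() up to max_attempts times; return the response or None.

    make_request is any zero-argument callable returning an object with a
    status_code attribute (a placeholder for the real provider call).
    """
    delay_ms = backoff_ms
    for attempt in range(1, max_attempts + 1):
        try:
            response = make_request()
            if response.status_code in trigger_codes:
                raise RetryableError(f"HTTP {response.status_code}")
            return response
        except (RetryableError, TimeoutError):
            if attempt == max_attempts:
                break  # no point sleeping after the final attempt
            # Exponential backoff with up to 10% jitter
            time.sleep(delay_ms * (1 + random.random() * 0.1) / 1000)
            delay_ms *= 2
    return None  # retries exhausted; the caller moves to the next provider

def request_with_fallback(providers, make_request_for, **retry_settings):
    """Try each provider in chain order until one returns a response."""
    for provider in providers:
        response = retry_with_backoff(lambda: make_request_for(provider), **retry_settings)
        if response is not None:
            return provider, response
    raise RuntimeError("All providers in the fallback chain failed")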

Verification

To verify that the fix worked, test the following scenarios:

  • Primary provider fails with a 429 or 500 error: the system should retry with backoff and switch to the next provider in the chain.
  • Primary provider fails with an authentication error: the system should retry with backoff and switch to the next provider in the chain.
  • Primary provider fails with a timeout error: the system should retry with backoff and switch to the next provider in the chain.
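The first scenario above could be exercised without real network calls by scripting fake providers. This sketch is self-contained: `FakeProvider`, `FakeResponse`, and `call_with_fallback` are hypothetical stand-ins for the gateway logic, and the `sleep` parameter is injected so the test records delays instead of waiting.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    status_code: int


class FakeProvider:
    """Returns a scripted sequence of status codes, then 200 forever."""

    def __init__(self, name, codes=()):
        self.name = name
        self.codes = list(codes)
        self.calls = 0

    def send(self):
        self.calls += 1
        return FakeResponse(self.codes.pop(0) if self.codes else 200)


def call_with_fallback(providers, max_attempts=2, backoff_ms=1000,
                       trigger_codes=(429, 500, 502, 503, 504), sleep=lambda s: None):
    """Minimal stand-in for the gateway logic under test."""
    for provider in providers:
        for attempt in range(max_attempts):
            response = provider.send()
            if response.status_code not in trigger_codes:
                return provider.name, response
            if attempt + 1 < max_attempts:
                sleep(backoff_ms * 2 ** attempt / 1000)
    raise RuntimeError("all providers failed")


def test_rate_limited_primary_falls_back():
    primary = FakeProvider("anthropic/claude-opus-4", codes=[429, 429])
    secondary = FakeProvider("openai/gpt-4o")
    delays = []
    name, response = call_with_fallback([primary, secondary], sleep=delays.append)
    assert name == "openai/gpt-4o" and response.status_code == 200
    assert primary.calls == 2       # max_attempts honoured before switching
    assert delays == [1.0]          # one backoff between the two attempts


def test_transient_error_recovers_on_retry():
    primary = FakeProvider("anthropic/claude-opus-4", codes=[503])
    secondary = FakeProvider("openai/gpt-4o")
    name, _ = call_with_fallback([primary, secondary])
    assert name == "anthropic/claude-opus-4"
    assert secondary.calls == 0     # fallback never touched


test_rate_limited_primary_falls_back()
test_transient_error_recovers_on_retry()
```

The authentication and timeout scenarios would follow the same pattern, with the fake provider raising an exception instead of returning a status code.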

Extra Tips

  • Make sure to handle token counting and state continuity differences between providers.
  • Implement proactive degradation detection to prepare for fallbacks before they are needed.
  • Surface provider health in /status to provide visibility into provider performance.
