openclaw: [Bug] Live session model switch detector blocks programmatic fallback during rate limits

openclaw/openclaw#57961 (fetched 2026-04-08 01:55:38)
When a running session hits a rate limit (HTTP 429) and the fallback chain attempts to rotate to a different provider, the "live session model switch" detector fires and cancels the fallback attempt, forcing the session back to the rate-limited provider. This creates a loop where no fallback model is ever tried, and the session burns dozens of retries against the same dead endpoint.

Error Message

16:25:16 [agent/embedded] embedded run agent end: isError=true model=claude-opus-4-6 error=429 rate_limit_error


Bug Report


Version

OpenClaw 2026.3.28 (f9b1079) on Ubuntu 24.04 ARM64

Configuration

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.4", "google/gemini-3.1-pro-preview"]
      }
    }
  }
}
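To make the intended rotation order concrete, here is a minimal sketch that derives the candidate chain from the configuration above. It assumes the chain is simply the primary followed by the fallbacks in listed order, which is consistent with the `next=` values in the logs. The helper names `candidate_chain` and `provider_of` are hypothetical and not part of OpenClaw.

```python
import json

# Configuration copied from the report.
CONFIG = """
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.4", "google/gemini-3.1-pro-preview"]
      }
    }
  }
}
"""


def candidate_chain(config: dict) -> list:
    """Return the primary model followed by the fallbacks, in listed order."""
    model = config["agents"]["defaults"]["model"]
    return [model["primary"], *model.get("fallbacks", [])]


def provider_of(model_ref: str) -> str:
    """Return the provider part of a 'provider/model' reference."""
    return model_ref.split("/", 1)[0]


chain = candidate_chain(json.loads(CONFIG))
providers = [provider_of(ref) for ref in chain]
print(chain)
print(providers)
```

Every entry in this chain belongs to a different provider, so a working failover should leave Anthropic entirely after the first 429.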

Steps to Reproduce

  1. Start a session with anthropic/claude-opus-4-6 as default
  2. Hit Anthropic rate limit (429)
  3. Gateway attempts fallback to openai-codex/gpt-5.4
  4. Bug: "live session model switch detected" fires, marks gpt-5.4 as candidate_failed reason=unknown
  5. Gateway attempts fallback to google/gemini-3.1-pro-preview
  6. Bug: Same — "live session model switch detected" fires, blocks gemini
  7. Gateway loops back to opus, hits 429 again
  8. Repeats for the entire retry budget (48+ iterations)
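The loop described in these steps can be reproduced in a small simulation. This is a sketch under stated assumptions, not OpenClaw's code: `run_with_fallback` and the `detector_blocks_fallback` flag are hypothetical, and the detector is modeled as rejecting every candidate that differs from the session's original model.

```python
CHAIN = [
    "anthropic/claude-opus-4-6",
    "openai-codex/gpt-5.4",
    "google/gemini-3.1-pro-preview",
]
RATE_LIMITED = {"anthropic/claude-opus-4-6"}


def run_with_fallback(session_model, chain, rate_limited, retry_budget,
                      detector_blocks_fallback):
    """Walk the chain up to retry_budget times and record each outcome."""
    attempts = []
    for _ in range(retry_budget):
        for candidate in chain:
            if detector_blocks_fallback and candidate != session_model:
                # Modeled bug: the detector cancels any non-session model.
                attempts.append((candidate, "blocked"))
                continue
            if candidate in rate_limited:
                attempts.append((candidate, "rate_limit"))
                continue
            attempts.append((candidate, "succeeded"))
            return attempts
    return attempts


buggy = run_with_fallback(CHAIN[0], CHAIN, RATE_LIMITED, 48, True)
fixed = run_with_fallback(CHAIN[0], CHAIN, RATE_LIMITED, 48, False)

print(sum(1 for _, outcome in buggy if outcome == "rate_limit"))
print(fixed)
```

With the detector blocking fallbacks, every iteration of the budget ends in a 429 against the primary. With the detector out of the way, the second attempt succeeds on the first fallback.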

Expected Behavior

When the fallback chain rotates to a different provider due to rate_limit/timeout/overload, the model switch detector should not fire. Programmatic fallback rotation is not a manual /model switch.

Actual Behavior

The model switch detector treats fallback rotation as a manual model change, cancels it, and resets to the original session model. The session then loops against the rate-limited provider until the retry budget is exhausted.

Log Evidence

16:25:12 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=claude-opus-4-6 reason=rate_limit next=openai-codex/gpt-5.4
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: openai-codex/gpt-5.4 -> anthropic/claude-opus-4-6
16:25:13 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=openai-codex/gpt-5.4 reason=unknown next=google/gemini-3.1-pro-preview
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: google/gemini-3.1-pro-preview -> anthropic/claude-opus-4-6
16:25:16 [agent/embedded] embedded run agent end: isError=true model=claude-opus-4-6 error=429 rate_limit_error
16:25:21 — rate_limit again
16:25:28 — rate_limit again
16:25:38 — rate_limit again, failover decision: try gpt-5.4
16:25:40 — SAME: gpt-5.4 blocked by model switch, gemini blocked, back to opus

53 rate_limit errors in 6 minutes for a single runId, with zero successful fallback attempts.
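To check another log for the same pattern, a short script can count failure reasons and list the fallback candidates the detector reverted. This is a sketch only: the regular expressions are inferred from the five full log lines quoted in this report and may not match other OpenClaw versions or log formats.

```python
import re
from collections import Counter

# The five complete log lines quoted in the report.
LOG = """\
16:25:12 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=claude-opus-4-6 reason=rate_limit next=openai-codex/gpt-5.4
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: openai-codex/gpt-5.4 -> anthropic/claude-opus-4-6
16:25:13 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=openai-codex/gpt-5.4 reason=unknown next=google/gemini-3.1-pro-preview
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: google/gemini-3.1-pro-preview -> anthropic/claude-opus-4-6
16:25:16 [agent/embedded] embedded run agent end: isError=true model=claude-opus-4-6 error=429 rate_limit_error
"""

# Patterns inferred from the lines above (an assumption about the format).
FAILED = re.compile(
    r"candidate_failed .*\bcandidate=(?P<candidate>\S+) reason=(?P<reason>\S+)"
)
SWITCH = re.compile(
    r"live session model switch detected before attempt for (?P<run>\w+): "
    r"(?P<attempted>\S+) -> (?P<restored>\S+)"
)


def summarize(log: str):
    """Return (failure reason counts, list of (attempted, restored) reverts)."""
    reasons = Counter()
    reverted = []
    for line in log.splitlines():
        failed = FAILED.search(line)
        if failed:
            reasons[failed["reason"]] += 1
            continue
        switch = SWITCH.search(line)
        if switch:
            reverted.append((switch["attempted"], switch["restored"]))
    return reasons, reverted


reasons, reverted = summarize(LOG)
print(reasons)
print(reverted)
```

A `reason=unknown` failure paired with a revert to the session model for the same candidate is the signature of this bug, as opposed to a fallback that was actually tried and failed.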

Contrast — fallback works correctly for NEW sessions:

09:54:19 candidate_failed opus reason=rate_limit next=sonnet
09:54:44 candidate_failed sonnet reason=rate_limit next=gpt-5.2
09:55:30 candidate_succeeded gpt-5.2

The difference: the new session was started with the current default. The broken session was a long-running session where the model switch detector sees any non-original model as a "switch."

Impact

  • Long-running sessions (main agent, persistent Telegram sessions) cannot fail over during provider outages
  • All 48 retry iterations burn against the rate-limited endpoint
  • The configured fallback chain is effectively useless for the most important sessions
  • Related: #19249 (runtime failover not always activating on rate limits), #45834 (fallback not triggered on generic errors), #6966 (dynamic model switching)

Suggested Fix

The live session model switch detected check should distinguish between:

  1. Manual /model switch — user intentionally changing models mid-session (should log, may need special handling)
  2. Programmatic fallback rotation — the failover engine rotating through the fallback chain due to errors (should be transparent, never blocked)

A simple check: if the model change originates from model_fallback_decision with reason=rate_limit|timeout|overloaded, skip the model switch detection entirely.
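One way to express that distinction is to tag each model change with its origin and let the detector consult the tag. The sketch below is illustrative Python, not OpenClaw's implementation: `ChangeOrigin`, `ModelChange`, and `should_run_switch_detection` are hypothetical names, and the reason set is taken from the suggestion above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChangeOrigin(Enum):
    MANUAL = "manual"      # user issued /model mid-session
    FALLBACK = "fallback"  # failover engine rotated through the chain


# Reasons named in the suggested fix.
FALLBACK_REASONS = frozenset({"rate_limit", "timeout", "overloaded"})


@dataclass(frozen=True)
class ModelChange:
    session_model: str
    requested_model: str
    origin: ChangeOrigin
    reason: Optional[str] = None


def should_run_switch_detection(change: ModelChange) -> bool:
    """Return True only when the change should go through the detector."""
    if change.requested_model == change.session_model:
        return False  # nothing changed
    if change.origin is ChangeOrigin.FALLBACK and change.reason in FALLBACK_REASONS:
        return False  # programmatic rotation: let it through untouched
    return True


fallback = ModelChange("anthropic/claude-opus-4-6", "openai-codex/gpt-5.4",
                       ChangeOrigin.FALLBACK, "rate_limit")
manual = ModelChange("anthropic/claude-opus-4-6", "openai-codex/gpt-5.4",
                     ChangeOrigin.MANUAL)
print(should_run_switch_detection(fallback))
print(should_run_switch_detection(manual))
```

In this sketch a fallback change whose reason falls outside the set still goes through detection. Whether that is desirable is a design choice the issue does not settle.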

extent analysis

Fix Plan

To resolve the issue, the "live session model switch detected" check must distinguish manual model switches from programmatic fallback rotations.

Here are the steps:

  • Make the check inspect the origin of the model change.
  • If the model change originates from model_fallback_decision with reason=rate_limit|timeout|overloaded, skip the model switch detection entirely.

Example code snippet (Python-style pseudocode; the function names are illustrative, not OpenClaw's actual API):

FALLBACK_REASONS = {'rate_limit', 'timeout', 'overloaded'}

def live_session_model_switch_detected(current_model, new_model, reason=None):
    # A change driven by model_fallback_decision carries a fallback reason
    if reason in FALLBACK_REASONS:
        # Skip model switch detection for programmatic fallback rotations
        return False
    # Placeholder for the existing manual-switch logic:
    # any other change of model counts as a switch
    return new_model != current_model

  • Update the model_fallback_decision function to pass the reason for the model change to the live_session_model_switch_detected function.

def model_fallback_decision(current_model, reason):
    # get_next_fallback_model is assumed to return the next entry in the fallback chain
    new_model = get_next_fallback_model(current_model)
    if live_session_model_switch_detected(current_model, new_model, reason):
        # Treated as a manual switch: stay on the current session model
        return current_model
    # Programmatic fallback rotation: proceed with the next candidate
    return new_model

Verification

To verify the fix, test the following scenarios:

  • Start a long-running session on a primary model that can be pushed into a rate limit (HTTP 429), with at least one fallback configured.
  • Trigger the rate limit and verify that the fallback chain is activated correctly.
  • Verify that the live session model switch detected check does not fire for programmatic fallback rotations.
  • Test manual model switches and verify that the live session model switch detected check fires correctly.

Extra Tips

  • Make sure to update the documentation to reflect the changes to the live session model switch detected check.
  • Consider adding additional logging to track programmatic fallback rotations and manual model switches.
  • Review related issues (#19249, #45834, #6966) to ensure that the fix does not introduce any regressions.
