openclaw - 💡(How to fix) Fix [Bug]: Live session model switch detector blocks programmatic fallback during rate limits [1 participants]

Root Cause

When a running session hits a rate limit (HTTP 429) and the fallback chain attempts to rotate to a different provider, the "live session model switch" detector fires and cancels the fallback attempt, forcing the session back to the rate-limited provider. This creates a loop where no fallback model is ever tried, and the session burns dozens of retries against the same dead endpoint.

Bug Report

Summary

Version

OpenClaw 2026.3.28 (f9b1079) on Ubuntu 24.04 ARM64

Configuration

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.4", "google/gemini-3.1-pro-preview"]
      }
    }
  }
}
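The order in which candidates are tried under this configuration can be derived as sketched below. This assumes the gateway tries `primary` first and then `fallbacks` in listed order, which is consistent with the `next=` fields in the logs. `candidate_chain` is a hypothetical helper for illustration, not an OpenClaw API.

```python
import json

CONFIG = """
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["openai-codex/gpt-5.4", "google/gemini-3.1-pro-preview"]
      }
    }
  }
}
"""


def candidate_chain(config_text):
    """Return the primary model followed by its fallbacks, in configured order."""
    model = json.loads(config_text)["agents"]["defaults"]["model"]
    return [model["primary"], *model.get("fallbacks", [])]


chain = candidate_chain(CONFIG)
```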

Steps to Reproduce

1. Start a session with anthropic/claude-opus-4-6 as the default model.
2. Hit the Anthropic rate limit (429).
3. The gateway attempts fallback to openai-codex/gpt-5.4.
4. Bug: "live session model switch detected" fires, and gpt-5.4 is marked candidate_failed reason=unknown.
5. The gateway attempts fallback to google/gemini-3.1-pro-preview.
6. Bug: the same "live session model switch detected" check fires and blocks gemini.
7. The gateway loops back to opus and hits 429 again.
8. This repeats for the entire retry budget (48+ iterations).
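The loop described in these steps can be modelled in a few lines. This is a toy simulation under stated assumptions, not OpenClaw code: `buggy_detector`, `passthrough_detector`, and `run_with_retries` are hypothetical names, and the detector is reduced to "always resolve to the session's model", which is what the logs suggest happens.

```python
PRIMARY = "anthropic/claude-opus-4-6"
FALLBACKS = ["openai-codex/gpt-5.4", "google/gemini-3.1-pro-preview"]


def buggy_detector(session_model, candidate):
    """Reported behavior: every candidate is reset to the session's model."""
    return session_model


def passthrough_detector(session_model, candidate):
    """Expected behavior for fallback rotation: the candidate is used as-is."""
    return candidate


def run_with_retries(rate_limited, retry_budget, detector):
    """Return (model that succeeded or None, list of models actually called)."""
    called = []
    for _ in range(retry_budget):
        for candidate in [PRIMARY, *FALLBACKS]:
            model = detector(PRIMARY, candidate)
            if model != candidate:
                # Candidate blocked; the logs show candidate_failed reason=unknown.
                continue
            called.append(model)
            if model not in rate_limited:
                return model, called
    return None, called


stuck_winner, stuck_calls = run_with_retries({PRIMARY}, 48, buggy_detector)
ok_winner, ok_calls = run_with_retries({PRIMARY}, 48, passthrough_detector)
```

With the buggy detector, every one of the 48 iterations calls only the rate-limited primary and no model succeeds. With the pass-through detector, the first fallback succeeds on the first iteration.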

Expected Behavior

When the fallback chain rotates to a different provider due to rate_limit/timeout/overload, the model switch detector should not fire. Programmatic fallback rotation is not a manual /model switch.

Actual Behavior

The model switch detector treats fallback rotation as a manual model change, cancels it, and resets to the original session model. The session then loops against the rate-limited provider until the retry budget is exhausted.

Log Evidence

16:25:12 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=claude-opus-4-6 reason=rate_limit next=openai-codex/gpt-5.4
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: openai-codex/gpt-5.4 -> anthropic/claude-opus-4-6
16:25:13 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=openai-codex/gpt-5.4 reason=unknown next=google/gemini-3.1-pro-preview
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: google/gemini-3.1-pro-preview -> anthropic/claude-opus-4-6
16:25:16 [agent/embedded] embedded run agent end: isError=true model=claude-opus-4-6 error=429 rate_limit_error
16:25:21 — rate_limit again
16:25:28 — rate_limit again
16:25:38 — rate_limit again, failover decision: try gpt-5.4
16:25:40 — SAME: gpt-5.4 blocked by model switch, gemini blocked, back to opus

53 rate_limit errors in 6 minutes for a single runId, with zero successful fallback attempts.
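To check a longer log for the same pattern, a small script along these lines may help. It is a sketch that assumes the line formats shown in the excerpt above; the regular expressions and the `summarize` helper are illustrative and would need adjusting if the real log format differs.

```python
import re
from collections import Counter

LOG = """\
16:25:12 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=claude-opus-4-6 reason=rate_limit next=openai-codex/gpt-5.4
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: openai-codex/gpt-5.4 -> anthropic/claude-opus-4-6
16:25:13 [model-fallback/decision] candidate_failed requested=claude-opus-4-6 candidate=openai-codex/gpt-5.4 reason=unknown next=google/gemini-3.1-pro-preview
16:25:13 [agent/embedded] live session model switch detected before attempt for 408589ce: google/gemini-3.1-pro-preview -> anthropic/claude-opus-4-6
16:25:16 [agent/embedded] embedded run agent end: isError=true model=claude-opus-4-6 error=429 rate_limit_error
"""

FAILED = re.compile(r"candidate_failed .*candidate=(\S+) reason=(\S+)")
SWITCH = re.compile(
    r"live session model switch detected before attempt for \S+: (\S+) -> (\S+)"
)


def summarize(log_text):
    """Count candidate failures by reason and list fallbacks the detector reverted."""
    reasons, reverted = Counter(), []
    for line in log_text.splitlines():
        if m := FAILED.search(line):
            reasons[m.group(2)] += 1
        elif m := SWITCH.search(line):
            reverted.append((m.group(1), m.group(2)))
    return reasons, reverted


reasons, reverted = summarize(LOG)
```

A fallback candidate that appears in `reverted` and is then counted under `reason=unknown` was never actually attempted, which is the signature of this bug.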

Contrast — fallback works correctly for NEW sessions:

09:54:19 candidate_failed opus reason=rate_limit next=sonnet
09:54:44 candidate_failed sonnet reason=rate_limit next=gpt-5.2
09:55:30 candidate_succeeded gpt-5.2

The difference: the new session was started with the current default. The broken session was a long-running session where the model switch detector sees any non-original model as a "switch."

Impact

Long-running sessions (main agent, persistent Telegram sessions) cannot fail over during provider outages
All 48 retry iterations burn against the rate-limited endpoint
The configured fallback chain is effectively useless for the most important sessions
Related: #19249 (runtime failover not always activating on rate limits), #45834 (fallback not triggered on generic errors), #6966 (dynamic model switching)

Suggested Fix

The "live session model switch detected" check should distinguish between:

Manual /model switch: the user intentionally changes models mid-session (should log, may need special handling)
Programmatic fallback rotation: the failover engine rotates through the fallback chain due to errors (should be transparent, never blocked)

A simple check: if the model change originates from the fallback engine (logged as model-fallback/decision) with reason=rate_limit|timeout|overloaded, skip the model switch detection entirely.
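One way to make this distinction explicit is to tag each model change with its origin, rather than inferring it from the reason string alone. The sketch below is only an illustration: `SwitchOrigin`, `ModelChange`, and `should_run_switch_detection` are hypothetical names, and nothing here reflects how OpenClaw actually represents model changes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SwitchOrigin(Enum):
    MANUAL = "manual"      # user ran /model mid-session
    FALLBACK = "fallback"  # failover engine rotating through the chain


# Reasons named in the report as legitimate triggers for fallback rotation.
FALLBACK_REASONS = frozenset({"rate_limit", "timeout", "overloaded"})


@dataclass(frozen=True)
class ModelChange:
    session_model: str
    requested_model: str
    origin: SwitchOrigin
    reason: Optional[str] = None


def should_run_switch_detection(change: ModelChange) -> bool:
    """Skip detection only for fallback rotations with a recognized reason."""
    if change.origin is SwitchOrigin.FALLBACK and change.reason in FALLBACK_REASONS:
        return False
    # Everything else is checked as before (assumed: any model difference counts).
    return change.requested_model != change.session_model
```

Carrying the origin explicitly means a fallback with an unrecognized reason is still inspected, while a manual /model switch can never be mistaken for a rotation.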

Fix Plan

To resolve the issue, the "live session model switch detected" check must distinguish manual model switches from programmatic fallback rotations.

Steps:

Add a condition to the check that inspects the origin of the model change.
If the change originates from model_fallback_decision with reason=rate_limit|timeout|overloaded, skip the model switch detection entirely.

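Example code snippet (illustrative only):

FALLBACK_REASONS = {"rate_limit", "timeout", "overloaded"}

def live_session_model_switch_detected(current_model, new_model, reason=None):
    """Return True only for model changes that should be treated as manual switches."""
    # Programmatic fallback rotation: never treat it as a live model switch.
    if reason in FALLBACK_REASONS:
        return False
    # Existing logic for manual model switches goes here.
    # Assumed for this sketch: any change of model counts as a switch.
    return new_model != current_model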

Update the model_fallback_decision function to pass the reason for the model change to the live_session_model_switch_detected function.

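def model_fallback_decision(current_model, reason):
    """Choose the model for the next attempt after current_model failed with reason."""
    # get_next_fallback_model is a placeholder for the real fallback-chain lookup.
    new_model = get_next_fallback_model(current_model)
    if live_session_model_switch_detected(current_model, new_model, reason):
        # Not a recognized fallback reason: handle as a manual model switch.
        # Assumed handling for this sketch: keep the session's current model.
        return current_model
    # Programmatic fallback rotation: proceed with the next candidate.
    return new_model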

Verification

To verify the fix, test the following scenarios:

1. Start a long-running session whose primary model can be rate limited, with at least one fallback configured.
2. Trigger the rate limit and verify that the gateway moves on to the next model in the fallback chain.
3. Verify that the "live session model switch detected" check does not fire for programmatic fallback rotations.
4. Perform a manual /model switch and verify that the "live session model switch detected" check still fires.
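The last two scenarios can be captured as unit tests. The sketch below tests a stand-in function, `resolve_attempt_model`, which is hypothetical and only models the intended behavior. The report does not specify what "fires" should do for a manual switch, so the stand-in assumes detection resolves to the session's stored model. Real tests would target the actual detector.

```python
import unittest

FALLBACK_REASONS = {"rate_limit", "timeout", "overloaded"}
SESSION_MODEL = "anthropic/claude-opus-4-6"


def resolve_attempt_model(session_model, candidate, reason=None):
    """Stand-in for the patched detector: return the model the attempt should use."""
    if reason in FALLBACK_REASONS:
        return candidate      # programmatic fallback passes through
    return session_model      # anything else is resolved to the session's model


class FallbackDetectorTests(unittest.TestCase):
    def test_fallback_reasons_pass_through(self):
        for reason in sorted(FALLBACK_REASONS):
            with self.subTest(reason=reason):
                self.assertEqual(
                    resolve_attempt_model(SESSION_MODEL, "openai-codex/gpt-5.4", reason),
                    "openai-codex/gpt-5.4",
                )

    def test_change_without_fallback_reason_is_still_detected(self):
        self.assertEqual(
            resolve_attempt_model(SESSION_MODEL, "google/gemini-3.1-pro-preview"),
            SESSION_MODEL,
        )


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FallbackDetectorTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` runs both cases. The first two scenarios need an integration test against a provider that actually returns 429, which this sketch does not cover.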

Extra Tips

Make sure to update the documentation to reflect the changes to the live session model switch detected check.
Consider adding additional logging to track programmatic fallback rotations and manual model switches.
Review related issues (#19249, #45834, #6966) to ensure that the fix does not introduce any regressions.
