openclaw - ✅(Solved) Fix [bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#58578 · Fetched 2026-04-08 02:00:44
Comments: 1 · Participants: 2 · Timeline: 2 (commented ×1, cross-referenced ×1) · Reactions: 0

When the primary model (e.g. anthropic/claude-sonnet-4-6) returns overloaded_error (503), the fallback chain starts correctly. However, the overload cooldown is very short (30s after the 1st error, 60s after the 2nd), so it expires while the fallback chain is still executing. On expiry, requestLiveSessionModelSwitch fires and aborts the in-flight fallback attempt, forcing a switch back to the primary, which is still overloaded. The result is a loop in which no model responds until the primary itself recovers (about 7 minutes in the logs below).

Error Message

21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6
21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)
21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)

Pattern repeats every ~50s: Sonnet fails → Opus fails → GPT-5.4 aborted by switch → Gemini aborted by switch → back to Sonnet.


PR fix notes

PR #62682: fix(agents): distinguish terminal aborts from retryable failures (#60388)

Description (problem / solution / changelog)

Addresses #60388 (complementary to #52365, see "Relationship to PR #52365" below)

Today the fallback layer cannot tell the difference between two very different aborts:

  1. "This model failed, try another" -> fallback should retry
  2. "The whole run is over" -> fallback should stop immediately

Two situations where the run is over and retrying with another model wastes resources:

  • Run-budget timeout (#60388): The embedded runner's scheduleAbortTimer fires runAbortController.abort(makeTimeoutAbortReason()) after the configured agents.defaults.timeoutSeconds. The budget is exhausted, so any further candidate starts with ~0 ms remaining, is guaranteed to fail, and only wastes API calls. Issue #60388 reports: "On a fleet of ~100 cron jobs: 30+ model-fallback events per day, almost all triggered by run timeouts."

  • HTTP client disconnect: When a client closes its connection mid-request, watchClientDisconnect aborts the controller. No caller is left to receive a response, so any tokens spent on a fallback model are wasted.

Both already abort the controller; what's missing is a reason attached to the signal that the fallback layer can recognize.
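A minimal sketch of what "a reason attached to the signal" means, relying only on standard AbortController behavior: the value passed to abort() is exposed as signal.reason. makeTimeoutAbortReason is named in this PR, but its body is not shown, so the version below is a hypothetical stand-in.

```javascript
// Hypothetical stand-in for makeTimeoutAbortReason; the real body is not shown in the PR text.
function makeTimeoutAbortReason() {
  const reason = new Error("run budget exhausted");
  reason.name = "TimeoutError";
  return reason;
}

// Tagged abort: the reason travels with the signal.
const runAbortController = new AbortController();
runAbortController.abort(makeTimeoutAbortReason());
const { signal } = runAbortController;
console.log(signal.aborted, signal.reason.name); // true TimeoutError

// Untagged abort: the platform supplies a generic AbortError reason,
// which tells downstream code nothing about why the run ended.
const untagged = new AbortController();
untagged.abort();
console.log(untagged.signal.reason.name); // AbortError
```

The second case is the shape the client-disconnect path has before this PR: the abort happens, but nothing downstream can tell it apart from any other abort.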

Summary

  • Problem: Run-budget timeout aborts (and HTTP client disconnects) currently flow through the model fallback chain as if they were retryable provider failures, wasting API calls and lane occupancy.
  • Why it matters: Each unwanted fallback attempt holds a session lane and burns tokens for no benefit. On busy systems (#60388 reports 30+ events/day from cron timeouts alone) this is meaningful overhead.
  • What changed: The fallback layer now checks AbortSignal.reason for two terminal markers (name === "TimeoutError" for run-budget, name === "ClientDisconnectError" for client disconnect) and short-circuits the chain when found.
  • What did NOT change (scope boundary): No changes to how aborts are triggered, no changes to per-provider timeout/retry semantics, no changes to AbortController lifecycles. Issues #37505 (cron AbortController sharing) and #58578 (mid-flight primary restore) are different abort sources upstream of the fallback layer and remain unchanged.

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue

  • Closes #60388
  • This PR fixes a bug or regression

Root Cause

  • Root cause: shouldRethrowAbort() in model-fallback.ts checks isFallbackAbortError(err) && !isTimeoutError(err) -- which means timeout errors are intentionally not rethrown, so the fallback chain runs them. This is correct for per-provider timeouts (provider is slow, try another), but wrong for run-budget timeouts (whole run is out of time, no point retrying). The two cannot be told apart from the error alone.
  • Missing detection / guardrail: No check for why the abort happened. The signal.reason carrying TimeoutError (set by pi-embedded-runner/run/attempt.ts:1382-1386 via makeTimeoutAbortReason) was never inspected.
  • Contributing context: The HTTP client disconnect path added in #54388 has the same shape -- it currently aborts the controller without attaching a reason, so a downstream client disconnect also flows through fallback retries.
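The root cause above can be sketched as follows. This is an illustration under stated assumptions, not the code from model-fallback.ts: isFallbackAbortError and isTimeoutError are simplified stand-ins for helpers whose bodies are not shown here, and the ordering inside shouldRethrowAbort is inferred from the Before/After diagram in this description.

```javascript
// Reason names the PR treats as terminal.
const TERMINAL_ABORT_NAMES = new Set(["TimeoutError", "ClientDisconnectError"]);

// True when the signal was aborted with a terminal reason,
// either directly or one level down via `cause`.
function isTerminalAbort(signal) {
  if (signal === undefined || !signal.aborted) return false;
  const reason = signal.reason;
  return [reason?.name, reason?.cause?.name].some((name) => TERMINAL_ABORT_NAMES.has(name));
}

// Simplified stand-ins (assumptions) for the helpers named in the PR.
const isFallbackAbortError = (err) => err?.name === "AbortError";
const isTimeoutError = (err) =>
  err?.name === "TimeoutError" || err?.cause?.name === "TimeoutError";

function shouldRethrowAbort(err, signal) {
  if (isTerminalAbort(signal)) return true; // the whole run is over
  return isFallbackAbortError(err) && !isTimeoutError(err); // pre-PR check
}

// A run-budget timeout surfaces as an AbortError wrapping a TimeoutError.
const timeoutReason = Object.assign(new Error("run budget exhausted"), { name: "TimeoutError" });
const controller = new AbortController();
controller.abort(timeoutReason);
const attemptError = Object.assign(new Error("aborted", { cause: timeoutReason }), {
  name: "AbortError",
});

console.log(shouldRethrowAbort(attemptError, undefined)); // false
console.log(shouldRethrowAbort(attemptError, controller.signal)); // true
```

Without the signal, the error alone looks like a retryable timeout and the chain continues; with the signal, the tagged reason stops it.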

Regression Test Plan

  • Coverage level: [x] Unit test
  • Target test or file: src/agents/model-fallback.test.ts
  • Scenario the tests lock in: Six new tests under describe("terminal abort propagation (closes #60388)"):
    • signal.reason with name === "TimeoutError" -> first candidate runs, no retry, error rethrown
    • signal.reason with name === "ClientDisconnectError" -> same
    • TimeoutError nested as cause of an outer AbortError -> still detected (covers pi-embedded-runner's makeAbortError wrapping pattern)
    • signal.reason with a generic error -> fallback runs normally (non-terminal)
    • No abortSignal passed -> fallback runs normally (back-compat for existing callers)
    • abortSignal provided but not aborted -> fallback runs normally (live-signal back-compat)
  • Why this is the smallest reliable guardrail: The tests construct the abort signal directly and assert on run.mock.calls.length === 1 to verify the chain stopped. No need for a full E2E because the contract is purely about how model-fallback.ts interprets AbortSignal.reason.
  • Existing test that already covers this: None.
  • If no new test is added, why not: 6 new tests added.
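The call-count contract these tests lock in can be illustrated with a miniature fallback loop. Everything here is hypothetical: runWithFallbackSketch is not runWithModelFallback, it is kept synchronous for brevity, and the candidate names are placeholders.

```javascript
// Hypothetical, synchronous miniature of a fallback loop.
function runWithFallbackSketch(candidates, run, abortSignal) {
  let lastError;
  for (const candidate of candidates) {
    try {
      return run(candidate);
    } catch (err) {
      const reasonName = abortSignal?.aborted ? abortSignal.reason?.name : undefined;
      // Terminal reasons end the chain; anything else moves on to the next candidate.
      if (reasonName === "TimeoutError" || reasonName === "ClientDisconnectError") throw err;
      lastError = err;
    }
  }
  throw lastError;
}

// Mock runner that records every call and always fails.
function makeFailingRun() {
  const calls = [];
  const run = (candidate) => {
    calls.push(candidate);
    throw new Error(`${candidate} failed`);
  };
  return { run, calls };
}

const candidates = ["model-a", "model-b", "model-c"];

// Terminal abort: only the first candidate runs.
const timedOut = new AbortController();
timedOut.abort(Object.assign(new Error("run budget exhausted"), { name: "TimeoutError" }));
const terminal = makeFailingRun();
try { runWithFallbackSketch(candidates, terminal.run, timedOut.signal); } catch {}
console.log(terminal.calls.length); // 1

// No signal passed: every candidate is tried (back-compat case).
const plain = makeFailingRun();
try { runWithFallbackSketch(candidates, plain.run); } catch {}
console.log(plain.calls.length); // 3
```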

User-visible / Behavior Changes

When a run-budget timeout fires (the agent run exceeds agents.defaults.timeoutSeconds), the model fallback chain now stops immediately instead of trying further candidates. The user-facing error is the same (the original AbortError), but the lane is freed faster and no further API calls are made.

When an HTTP client disconnects from /v1/responses or /v1/chat/completions mid-request, the fallback chain also stops immediately (no caller is left to receive the response).

Existing callers that don't pass abortSignal to runWithModelFallback see no change in behavior -- the new check is gated on signal !== undefined && signal.aborted.

Diagram

Before:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)
  -> agent attempt throws AbortError
  -> shouldRethrowAbort(err) returns false (because isTimeoutError(err) is true)
  -> runFallbackCandidate returns { ok: false } -> tries next candidate
  -> next candidate also times out (~0ms budget left) -> tries next ...
  -> wasted API calls

After:
[run-budget timer fires]
  -> runAbortController.abort(TimeoutError)  // unchanged
  -> agent attempt throws AbortError
  -> isTerminalAbort(signal) returns true (signal.reason.name === "TimeoutError")
  -> shouldRethrowAbort(err, signal) returns true
  -> error rethrown immediately, no further candidates tried

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (this REDUCES network calls in the timeout path)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Docker linux/amd64)
  • Runtime/container: Node 22 / OpenClaw built from this branch
  • Test runner: Vitest

Steps

  1. Configure an agent with a small agents.defaults.timeoutSeconds (e.g. 5s) and one or more model fallbacks
  2. Send a request that takes longer than the timeout
  3. Observe the fallback chain behavior

Expected (after fix)

  • Primary candidate runs, hits the timeout, fallback chain stops, single error returned
  • Logs show no [model-fallback] candidate_failed entries for fallback candidates beyond the first

Actual (before fix)

  • Primary candidate runs, hits the timeout
  • Fallback layer tries the next candidate with ~0ms budget remaining
  • Next candidate also times out, fallback layer tries the next, and so on
  • 2-3x the API calls before the chain is exhausted and returns an error

Evidence

$ npx vitest run src/agents/model-fallback.test.ts

Test Files  1 passed (1)
     Tests  70 passed (70)

The new test cases:

  • rethrows immediately when signal.reason has name=TimeoutError (run-budget timeout)
  • rethrows immediately when signal.reason has name=ClientDisconnectError
  • detects TimeoutError nested as cause of an outer Error
  • falls back normally when signal is aborted with a non-terminal reason
  • falls back normally when no abortSignal is passed (back-compat)
  • falls back normally when signal is provided but not aborted

Human Verification

  • Verified scenarios:
    • All 70 tests in model-fallback.test.ts pass after the change
    • All 20 tests in model-fallback.probe.test.ts pass
    • All 8 tests in agent-command.live-model-switch.test.ts pass
    • End-to-end smoke: container rebuilt from this branch, HTTP client disconnect on /v1/responses produces the expected [openresponses] client disconnected, aborting streaming run runId=... log line and the agent run terminates within ~1s (the upstream watchClientDisconnect plumbing already in aad3bbedd works correctly with the new ClientDisconnectError reason tag)
  • Edge cases checked: cause-chain walking (one level deep) for TimeoutError/ClientDisconnectError wrapped inside an outer AbortError -- this matches pi-embedded-runner/run/attempt.ts:1387's makeAbortError pattern
  • What I did not verify:
    • Direct E2E of the fallback-skip path with a real cron timeout firing (would require setting timeoutSeconds to a very small value; the unit tests cover the contract)
    • The image-model fallback variant (runWithImageModelFallback) -- it gets the new abortSignal? parameter for consistency but no callers currently pass a signal

Compatibility / Migration

  • Backward compatible? Yes -- the new abortSignal? parameter is optional. Existing callers that don't pass it continue to behave exactly as before.
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: A future caller passes an abort signal whose signal.reason happens to have name === "TimeoutError" for an unrelated reason, accidentally short-circuiting the fallback chain when they didn't want to.
    • Mitigation: The check is intentionally narrow -- only the exact name strings TimeoutError and ClientDisconnectError match. Both names are already reserved for run-budget timeout (set by makeTimeoutAbortReason in pi-embedded-runner/run/attempt.ts) and HTTP client disconnect (set by watchClientDisconnect in gateway/http-common.ts after this PR).
  • Risk: TimeoutError from a per-provider request timeout (not run-budget) gets mistaken for a run-budget timeout and skips fallback when it shouldn't.
    • Mitigation: Per-provider timeouts come from the LLM SDK's internal fetch timeout, which throws an error directly -- they don't propagate through runAbortController.abort(). Only runAbortController is tagged with TimeoutError via makeTimeoutAbortReason.

Out of scope (for separate PRs)

  • Issue #37505 (cron timeout aborts entire fallback chain via shared AbortController) -- different abort source (cron service executeJobCoreWithTimeout), needs per-attempt AbortController isolation. PR #42482 already addresses this.
  • Issue #58578 (fallback chain aborted by premature primary restore mid-flight) -- different abort source (requestLiveSessionModelSwitch), needs coordination state between live-model-switch and fallback layers

Relationship to PR #52365

PR #52365 (fix(cron): stop fallback attempts when cron budget is exhausted) addresses the same underlying problem from #60388 but with a different, complementary mechanism:

  • #52365 is proactive: a new beforeAttempt hook in runWithModelFallback that lets the cron layer check its remaining budget before each attempt and stop the chain if budget is too low.
  • This PR is reactive: an isTerminalAbort(signal) check in shouldRethrowAbort that inspects signal.reason after an attempt aborts to decide whether to rethrow or retry.

These are complementary, not exclusive. #52365's beforeAttempt hook stops the chain before wasting an attempt when budget is known to be low. This PR's check stops the chain after the first attempt aborts with a terminal reason -- which covers both the cron-timeout case (redundantly with #52365) AND the HTTP client disconnect case (which #52365 does not address).
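For contrast with the reactive check, here is a rough sketch of what a proactive budget hook could look like. The real beforeAttempt signature in #52365 is not shown in this description, so makeBudgetGuard, runCandidates, and their parameters are all hypothetical.

```javascript
// Hypothetical proactive guard: returns false once too little budget remains.
function makeBudgetGuard(deadlineMs, minRemainingMs, now) {
  return function beforeAttempt() {
    return deadlineMs - now() >= minRemainingMs;
  };
}

// Hypothetical loop that consults the guard before each candidate.
function runCandidates(candidates, run, beforeAttempt) {
  const attempted = [];
  for (const candidate of candidates) {
    if (!beforeAttempt()) break; // stop before wasting an attempt
    attempted.push(candidate);
    try {
      return { ok: true, result: run(candidate), attempted };
    } catch {
      // retryable failure: fall through to the next candidate
    }
  }
  return { ok: false, attempted };
}

// Fake clock: each failing attempt consumes 3 seconds of a 5-second budget.
let clockMs = 0;
const guard = makeBudgetGuard(5000, 1000, () => clockMs);
const outcome = runCandidates(["model-a", "model-b", "model-c"], () => {
  clockMs += 3000;
  throw new Error("provider failed");
}, guard);

console.log(outcome.attempted.length); // 2
```

The third candidate is never started because the budget is already overdrawn when the guard runs. A reactive check would only notice after an attempt aborted.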

Notable differences:

  • Scope: this PR is 8 files / ~276 lines touching only model-fallback.ts, http-common.ts, agent-command.ts, 3 auto-reply callers, and one cron caller. #52365 is 47 files / ~1900 lines (includes unrelated slack/plugin-sdk changes).
  • Client disconnect coverage: this PR handles the HTTP /v1/responses and /v1/chat/completions client-disconnect case by tagging the abort with ClientDisconnectError in watchClientDisconnect. #52365 does not touch the gateway HTTP path.
  • Detection layer: this PR checks signal.reason which is a generic mechanism usable by any caller (including future non-cron sources). #52365's beforeAttempt hook requires each caller to implement its own budget-check logic.

Either PR alone fixes the #60388 cron-timeout case. Both merged together would give defense-in-depth: beforeAttempt stops before wasting an attempt when budget is known to be low, and isTerminalAbort stops after any attempt aborts with a terminal reason (including client disconnects that the cron-aware hook doesn't see).

If the maintainers prefer #52365's approach and would rather not have two overlapping mechanisms, I'd suggest retargeting this PR to cover only the ClientDisconnectError branch of isTerminalAbort (the unique coverage) and letting #52365 handle the cron-timeout case via its proactive hook.

Changed files

  • src/agents/agent-command.ts (modified, +5/-0)
  • src/agents/model-fallback.test.ts (modified, +276/-0)
  • src/agents/model-fallback.ts (modified, +122/-3)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +4/-0)
  • src/auto-reply/reply/agent-runner-memory.ts (modified, +3/-0)
  • src/auto-reply/reply/followup-runner.ts (modified, +4/-0)
  • src/cron/isolated-agent/run-executor.ts (modified, +3/-0)
  • src/gateway/http-common.ts (modified, +15/-1)


[bug] Fallback chain aborted by premature primary restore when cooldown expires mid-flight

Description

When the primary model (e.g. anthropic/claude-sonnet-4-6) returns overloaded_error (503), the fallback chain starts correctly. However, the overload cooldown is very short (30s after the 1st error, 60s after the 2nd), so it expires while the fallback chain is still executing. On expiry, requestLiveSessionModelSwitch fires and aborts the in-flight fallback attempt, forcing a switch back to the primary, which is still overloaded. The result is a loop in which no model responds until the primary itself recovers (about 7 minutes in the logs below).

Steps to Reproduce

  1. Configure primary anthropic/claude-sonnet-4-6 with fallbacks [anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash]
  2. Wait for Anthropic overload (503)
  3. Observe fallback chain starting
  4. Observe cooldown expiring (~30-60s) mid-flight
  5. requestLiveSessionModelSwitch cancels fallback, switches back to Sonnet
  6. Sonnet fails again → loop

Expected Behavior

When a fallback is in-flight, requestLiveSessionModelSwitch should NOT cancel it. The primary should only be restored after the current fallback attempt completes (success or failure).
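One way to express this expected behavior is a small piece of coordination state shared by the switch and fallback layers. The class below is purely illustrative: SessionSwitchGate and its methods are hypothetical names, not part of OpenClaw.

```javascript
// Hypothetical gate that defers switch requests while an attempt is in flight.
class SessionSwitchGate {
  constructor() {
    this.inFlight = 0;
    this.pending = null;
  }

  beginAttempt() {
    this.inFlight += 1;
  }

  // Returns the deferred selection (if any) once no attempt is running.
  endAttempt() {
    this.inFlight -= 1;
    if (this.inFlight > 0) return null;
    const next = this.pending;
    this.pending = null;
    return next;
  }

  // Returns true if the switch may be applied now, false if it was deferred.
  requestSwitch(selection) {
    if (this.inFlight > 0) {
      this.pending = selection;
      return false;
    }
    return true;
  }
}

const gate = new SessionSwitchGate();
gate.beginAttempt(); // a fallback attempt starts
console.log(gate.requestSwitch("anthropic/claude-sonnet-4-6")); // false
console.log(gate.endAttempt()); // anthropic/claude-sonnet-4-6
console.log(gate.requestSwitch("anthropic/claude-sonnet-4-6")); // true
```

The deferred selection is handed back only when the attempt finishes, so the caller can decide whether restoring the primary still makes sense at that point.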

Actual Behavior

Every fallback candidate (Opus, GPT-5.4, Gemini Flash) is aborted mid-attempt with:

errorPreview: "Live session model switch requested: anthropic/claude-sonnet-4-6"

The agent becomes completely unresponsive despite having 4 models configured.

Logs

21:29:09 — candidate_failed: openai-codex/gpt-5.4, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — candidate_failed: anthropic/claude-opus-4-6, error: "Live session model switch requested: anthropic/claude-sonnet-4-6"
21:30:01 — live session model switch detected: openai-codex/gpt-5.4 -> anthropic/claude-sonnet-4-6
21:30:50 — candidate_failed: anthropic/claude-opus-4-6 (same error, loop continues)
21:36:10 — candidate_succeeded: anthropic/claude-sonnet-4-6 (finally works after ~7 minutes of loop)

Pattern repeats every ~50s: Sonnet fails → Opus fails → GPT-5.4 aborted by switch → Gemini aborted by switch → back to Sonnet.

Root Cause (from source analysis)

In auth-profiles-B5ypC5S-.js:

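// Overload cooldown is hardcoded and very short.
// Note: `normalized` is not defined in this excerpt (presumably computed from errorCount in the bundled source).
function calculateAuthProfileCooldownMs(errorCount) {
    if (normalized <= 1) return 30000;   // 30s
    if (normalized <= 2) return 60000;   // 60s
    return 300000;                        // 5 min
}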

In login-B5O9Mtcp.js, around line 169326:

// This fires when cooldown expires, even during active fallback:
log.info(`live session model switch detected before attempt for ${params.sessionId}`);
throw new LiveSessionModelSwitchError(nextSelection);

The cooldown expiry triggers a switch request that aborts any in-flight fallback, regardless of whether the primary is actually healthy.

Suggested Fix

Option A: Don't fire requestLiveSessionModelSwitch if a fallback attempt is currently in-flight. Only check after the current attempt completes.

Option B: Add a minimum grace period after an overload error before allowing switch-back (e.g., don't switch back within 5 minutes of the last overload from that provider).

Option C: Make the overload cooldown configurable via auth.cooldowns (currently only billingBackoffHours is configurable; the overload cooldown in calculateAuthProfileCooldownMs is hardcoded).
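Options B and C could be sketched roughly as below. This is an assumption-heavy illustration: overloadCooldownsMs is a hypothetical setting (not an existing auth.cooldowns key), maySwitchBack is a hypothetical helper, and the default tiers simply mirror the 30s / 60s / 5 min values quoted above.

```javascript
// Defaults mirror the tiers quoted in the issue.
const DEFAULT_OVERLOAD_COOLDOWNS_MS = [30_000, 60_000, 300_000];

// Option C sketch: tiers come from configuration instead of being hardcoded.
// The last tier applies to every error count beyond the list length.
function overloadCooldownMs(errorCount, overloadCooldownsMs = DEFAULT_OVERLOAD_COOLDOWNS_MS) {
  const index = Math.min(Math.max(errorCount, 1), overloadCooldownsMs.length) - 1;
  return overloadCooldownsMs[index];
}

// Option B sketch: refuse switch-back until a grace period has passed
// since the last overload from that provider.
function maySwitchBack(lastOverloadAtMs, nowMs, graceMs = 300_000) {
  return nowMs - lastOverloadAtMs >= graceMs;
}

console.log(overloadCooldownMs(1)); // 30000
console.log(overloadCooldownMs(2)); // 60000
console.log(overloadCooldownMs(7)); // 300000
console.log(overloadCooldownMs(1, [120_000, 600_000])); // 120000
console.log(maySwitchBack(0, 45_000)); // false
console.log(maySwitchBack(0, 300_000)); // true
```

Option B alone would have kept the session on a fallback for the whole incident in the logs above, since each new overload from Sonnet would restart the grace window.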

Environment

  • OpenClaw version: 2026.3.28
  • OS: Ubuntu Linux 6.8.0-90-generic (x64)
  • Node: v22.22.1
  • Primary: anthropic/claude-sonnet-4-6
  • Fallbacks: anthropic/claude-opus-4-6, openai-codex/gpt-5.4, google/gemini-2.5-flash

extent analysis

TL;DR

Implement a minimum grace period after an overload error to prevent immediate switch-back to the primary model, or modify the requestLiveSessionModelSwitch logic to not interrupt in-flight fallback attempts.

Guidance

  • Review the calculateAuthProfileCooldownMs function to consider increasing the cooldown period or making it configurable to prevent rapid switch-backs.
  • Investigate changing the check in login-B5O9Mtcp.js that logs "live session model switch detected" and throws LiveSessionModelSwitchError, so that it only runs after the current fallback attempt completes.
  • Consider adding a check to prevent requestLiveSessionModelSwitch from firing if a fallback is in progress, ensuring the primary model is only restored after the fallback attempt is complete.
  • Evaluate the feasibility of implementing a configurable minimum grace period (e.g., 5 minutes) after an overload error before allowing switch-back to the primary model.

Example

// Sketch: apply a minimum cooldown floor on top of the existing tiers.
// Assumption: `normalized` is errorCount clamped to at least 1; the quoted source does not show how it is derived.
function calculateAuthProfileCooldownMs(errorCount, minCooldownMs = 300000) { // default floor: 5 minutes
    const normalized = Math.max(1, errorCount);
    let cooldownMs = 300000;                      // 5 min for 3+ errors
    if (normalized <= 1) cooldownMs = 30000;      // 30s
    else if (normalized <= 2) cooldownMs = 60000; // 60s
    // With the default 5-minute floor, every tier resolves to 5 minutes.
    return Math.max(cooldownMs, minCooldownMs);
}

Notes

The code snippets and analysis point to two contributing factors: the hardcoded cooldown periods and the logic around requestLiveSessionModelSwitch. Without more information about the system's requirements and constraints, no single fix can be called definitive. The suggestions here target the immediate loop and may need adjusting to fit the broader architecture and performance needs.

Recommendation

As a workaround, introduce a minimum grace period (e.g. 5 minutes) after an overload error before allowing switch-back to the primary model. This is a simple change that breaks the abort-and-retry loop without deep changes to the existing logic, and it leaves room for later tuning of the cooldown and switch-back mechanisms.
