openclaw - ✅(Solved) Fix Model fallback not triggered on agent execution timeout (fallbacks configured but never attempted) [1 pull request, 5 comments, 5 participants]

GitHub stats

openclaw/openclaw#49921 (fetched 2026-04-08 01:01:14)
Comments: 5 · Participants: 5 · Timeline events: 8 · Reactions: 1
Timeline (top): commented ×5, cross-referenced ×2, subscribed ×1

Error Message

model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 126081
model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 125985
... (6 occurrences, same pattern)

Fix Action

Fixed

PR fix notes

PR #51283: Fix model fallback not triggering on unrecognized provider errors

Description (problem / solution / changelog)

When an LLM provider returns a generic error like "Provider returned error" (common with OpenRouter), the inner runner didn't recognize it as a failover-eligible error and surfaced it directly to the user — even when fallback models were configured.

The error classification in classifyFailoverReason handles known patterns (timeouts, rate limits, auth errors, billing, etc.) but anything that doesn't match falls through. The shouldRotate flag only triggers for recognized failover errors or explicit timeouts, so unrecognized errors just got returned as-is.
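To illustrate how a generic message can slip past pattern-based classification, here is a minimal TypeScript sketch. The function name classifyFailoverReasonSketch, the FailoverReason type, and every pattern below are assumptions made for illustration; they are not the actual rules in classifyFailoverReason.

```typescript
// Illustrative sketch only: the reasons and patterns below are assumptions,
// not the real rules used by OpenClaw's classifyFailoverReason.
type FailoverReason = "timeout" | "rate_limit" | "auth" | "billing";

const PATTERNS: Array<[FailoverReason, RegExp]> = [
  ["timeout", /timed? ?out/i],
  ["rate_limit", /rate limit|too many requests|\b429\b/i],
  ["auth", /unauthorized|invalid api key|\b401\b/i],
  ["billing", /insufficient credits|payment required|\b402\b/i],
];

// Returns the first matching reason, or null when nothing matches.
function classifyFailoverReasonSketch(message: string): FailoverReason | null {
  for (const [reason, pattern] of PATTERNS) {
    if (pattern.test(message)) return reason;
  }
  return null;
}

console.log(classifyFailoverReasonSketch("429 Too Many Requests")); // rate_limit
console.log(classifyFailoverReasonSketch("Provider returned error")); // null
```

With a classifier of this shape, "Provider returned error" matches no pattern, so nothing marks it as eligible for failover.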

This adds a catch-all in the inner auth-profile loop: when the assistant returned a stopReason: "error" that wasn't already handled by shouldRotate and fallback models are configured, throw a FailoverError so the outer runWithModelFallback loop gets a chance to try the next candidate.
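The catch-all described above could look roughly like the following sketch. The FailoverError constructor, the AttemptResult shape, and the helper name escalateUnrecognizedError are hypothetical stand-ins; the actual change lives in src/agents/pi-embedded-runner/run.ts and may be structured differently.

```typescript
// Sketch under assumptions: FailoverError's shape and the attempt fields are
// hypothetical stand-ins, not the actual code in run.ts.
class FailoverError extends Error {
  model: string;
  constructor(message: string, model: string) {
    super(message);
    this.name = "FailoverError";
    this.model = model;
  }
}

interface AttemptResult {
  model: string;
  stopReason: "stop" | "error";
  errorMessage?: string;
}

// Escalate an unhandled error stop to the outer fallback loop, but only when
// rotation did not already handle it and fallback models are configured.
function escalateUnrecognizedError(
  attempt: AttemptResult,
  shouldRotate: boolean,
  fallbacks: string[],
): void {
  if (attempt.stopReason === "error" && !shouldRotate && fallbacks.length > 0) {
    throw new FailoverError(attempt.errorMessage ?? "unknown provider error", attempt.model);
  }
}

try {
  escalateUnrecognizedError(
    { model: "openrouter/hunter-alpha", stopReason: "error", errorMessage: "Provider returned error" },
    false,
    ["openrouter/healer-alpha", "openrouter/free"],
  );
} catch (err) {
  // An outer loop such as runWithModelFallback would catch this and move on.
  console.log(err instanceof FailoverError); // true
}
```

Guarding on a non-empty fallback list keeps the previous behavior for setups without fallbacks: the error is still surfaced directly rather than rethrown.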

Also adds some tests for the failover classification to document which errors are and aren't recognized.

Fixes #49921

AI-assisted: Built with Claude, reviewed by human.

Changed files

  • src/agents/pi-embedded-helpers/failover-matches.test.ts (added, +43/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +32/-0)


Describe the bug

Model fallbacks are configured (agents.defaults.model.fallbacks: ["openrouter/healer-alpha", "openrouter/free"]) but when the primary model fails, the fallback is never attempted. The system records a terminal error on the primary model without ever trying the next candidate.

To reproduce

  1. Configure agents.defaults.model.primary: "openrouter/hunter-alpha" with fallbacks: ["openrouter/healer-alpha", "openrouter/free"]
  2. Run cron jobs or agent sessions that intermittently fail on the primary model
  3. Check cron run history (~/.openclaw/cron/runs/<jobId>.jsonl)
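For reference, the settings from step 1 can be written as a nested object. The key path agents.defaults.model.primary / fallbacks comes from the issue; the TypeScript types and the idea of expressing it as an object literal are assumptions, since the issue does not show the on-disk config format.

```typescript
// The key path agents.defaults.model.{primary,fallbacks} comes from the issue;
// the surrounding types and the on-disk file format are assumptions.
interface ModelConfig {
  primary: string;
  fallbacks: string[];
}

const config: { agents: { defaults: { model: ModelConfig } } } = {
  agents: {
    defaults: {
      model: {
        primary: "openrouter/hunter-alpha",
        fallbacks: ["openrouter/healer-alpha", "openrouter/free"],
      },
    },
  },
};

// Order in which candidates would be attempted: primary first, then fallbacks.
const { primary, fallbacks } = config.agents.defaults.model;
const candidates = [primary, ...fallbacks];
console.log(candidates);
```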

Expected behavior

When openrouter/hunter-alpha fails, the system should immediately retry with openrouter/healer-alpha, then openrouter/free, logging which model ultimately succeeded or if all candidates were exhausted.

Actual behavior

  • 6 errors recorded over 6 hours, all on openrouter/hunter-alpha
  • Each error takes ~126 seconds (suggests execution timeout, not API rejection)
  • openrouter/healer-alpha was never attempted (0 uses across all runs)
  • openrouter/free was never attempted

Evidence

Cron run logs show:

model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 126081
model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 125985
... (6 occurrences, same pattern)

No entries show fallback model names.
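A quick way to confirm that no fallback model ever appears is to tally the run history by model. The sketch below assumes each JSONL line is an object with model and status keys, matching the fields in the excerpt above; the real schema of the run file is not shown in the issue and may differ.

```typescript
// Assumes each JSONL line is an object with `model` and `status` keys, matching
// the fields shown in the log excerpt; the real schema may differ.
interface RunRecord {
  model: string;
  status: string;
  error?: string;
  durationMs?: number;
}

// Count runs and errors per model from the raw text of a .jsonl file.
function tallyByModel(jsonl: string): Map<string, { runs: number; errors: number }> {
  const tally = new Map<string, { runs: number; errors: number }>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const record = JSON.parse(line) as RunRecord;
    const entry = tally.get(record.model) ?? { runs: 0, errors: 0 };
    entry.runs += 1;
    if (record.status === "error") entry.errors += 1;
    tally.set(record.model, entry);
  }
  return tally;
}

const sample = [
  '{"model":"openrouter/hunter-alpha","status":"error","error":"Provider returned error","durationMs":126081}',
  '{"model":"openrouter/hunter-alpha","status":"error","error":"Provider returned error","durationMs":125985}',
].join("\n");

const tally = tallyByModel(sample);
console.log(tally.get("openrouter/hunter-alpha")); // { runs: 2, errors: 2 }
console.log(tally.has("openrouter/healer-alpha")); // false
```

To run it against a real history file, pass in the file contents (for example via readFileSync from node:fs) instead of the inline sample.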

Code analysis

The fallback logic exists in query-expansion-DnS6CGY2.js:

  • resolveEffectiveModelFallbacks() correctly returns configured fallbacks
  • resolveFallbackCandidates() correctly builds candidate list (primary + fallbacks)
  • runWithModelFallback() iterates candidates with try/catch

However, it appears the catch block may only handle API-level errors (429, 5xx, rate limits) and not agent execution timeouts. When the primary model times out at the execution level, the error is recorded as terminal without attempting fallback candidates.

Hypothesis

The agent execution timeout (likely from cron job timeout or internal agent timeout) fires before the fallback retry logic gets a chance to run, or the timeout error type is not recognized as a retriable failure by the fallback loop.

Environment

  • OpenClaw version: 2026.3.13
  • Node: v22.22.1
  • OS: Linux 6.12.63 (Debian)
  • Provider: OpenRouter
  • Primary model: openrouter/hunter-alpha
  • Configured fallbacks: openrouter/healer-alpha, openrouter/free

extent analysis

Fix Plan

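One way to address the issue as hypothesized above is to have the runWithModelFallback() function in query-expansion-DnS6CGY2.js treat execution timeouts as retriable failures. Note that the merged fix (PR #51283) took a different route: it added a catch-all in the inner runner (src/agents/pi-embedded-runner/run.ts) that throws a FailoverError for unrecognized provider errors, so the plan below is a sketch of the timeout hypothesis rather than a description of the shipped change.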

Here are the steps:

  • Update the catch block to recognize and handle execution timeouts.
  • Implement a retry mechanism for timeouts.

Example code changes:

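// Sketch of runWithModelFallback(). runModel and logError are placeholders
// for the existing model call and logger, not confirmed OpenClaw APIs.
async function runWithModelFallback(candidates) {
    // Walk the candidate list (primary first, then fallbacks) exactly once,
    // so the number of attempts is bounded by the number of configured models.
    for (const candidate of candidates) {
        try {
            // existing code to run the model
            return await runModel(candidate);
        } catch (error) {
            // existing code to handle API-level errors goes here
            if (!isTimeoutError(error)) {
                // existing code to handle other error types
                throw error;
            }
            // Execution timeout: treat it as retriable and try the next candidate
        }
    }
    // All candidates exhausted, log the error
    logError('All model fallback candidates exhausted');
    return null;
}

// Treat ETIMEDOUT codes and timeout-like messages as retriable timeouts.
// Optional chaining guards against errors that carry no message.
function isTimeoutError(error) {
    const message = String(error?.message ?? '').toLowerCase();
    return error?.code === 'ETIMEDOUT' || message.includes('timeout');
}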

Verification

To verify the fix, follow these steps:

  • Configure the primary model and fallbacks as before.
  • Run cron jobs or agent sessions that intermittently fail on the primary model.
  • Check the cron run history (~/.openclaw/cron/runs/<jobId>.jsonl) for attempts to use the fallback models.
  • Verify that the system logs which model ultimately succeeded or if all candidates were exhausted.

Extra Tips

  • Make sure to test the updated code with different error scenarios to ensure it handles all possible retriable failures.
  • Consider adding a limit to the number of retries to prevent infinite loops.
  • Review the OpenClaw documentation for any updates on handling execution timeouts and fallbacks.
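The first tip can be made concrete with a small table of error scenarios. This is a sketch, not part of the OpenClaw test suite: isTimeoutError is a hypothetical helper that applies the same ETIMEDOUT / "timeout" check as the fix plan, and the scenario list is illustrative.

```typescript
// Hypothetical helper: the same ETIMEDOUT / "timeout" check used in the fix
// plan, written defensively so a missing message cannot throw.
type ErrorLike = { code?: string; message?: string };

function isTimeoutError(error: ErrorLike): boolean {
  const message = (error.message ?? "").toLowerCase();
  return error.code === "ETIMEDOUT" || message.includes("timeout");
}

// Each scenario pairs an error shape with whether the timeout check should match.
const scenarios: Array<{ name: string; error: ErrorLike; retriable: boolean }> = [
  { name: "socket timeout code", error: { code: "ETIMEDOUT" }, retriable: true },
  { name: "timeout in message", error: { message: "Agent execution timeout" }, retriable: true },
  { name: "generic provider error", error: { message: "Provider returned error" }, retriable: false },
  { name: "error without message", error: {}, retriable: false },
];

const failures = scenarios.filter((s) => isTimeoutError(s.error) !== s.retriable);
console.log(failures.length === 0 ? "all scenarios pass" : failures.map((s) => s.name));
```

The third scenario is worth noting: a timeout-only check does not match "Provider returned error", which is the message actually recorded in the logs and the case the merged PR addresses with its catch-all.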
