openclaw - ✅(Solved) Fix Model fallback not triggered on agent execution timeout (fallbacks configured but never attempted) [1 pull requests, 5 comments, 5 participants]

Q: Expected behavior

When `openrouter/hunter-alpha` fails, the system should immediately retry with `openrouter/healer-alpha`, then `openrouter/free`, logging which model ultimately succeeded or if all candidates were exhausted.

openclaw • 2026-03-18 16:42:37

GitHub stats

openclaw/openclaw#49921 • Fetched 2026-04-08 01:01:14

Timeline (top)

commented ×5 • cross-referenced ×2 • subscribed ×1

Error Message

Fix Action

Fixed

Fixed by PR: Fix model fallback not triggering on unrecognized provider errors (https://github.com/openclaw/openclaw/pull/51283)

PR fix notes

PR #51283: Fix model fallback not triggering on unrecognized provider errors

Repository: openclaw/openclaw
Author: BenediktSchackenberg
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/51283

Description (problem / solution / changelog)

When an LLM provider returns a generic error like "Provider returned error" (common with OpenRouter), the inner runner didn't recognize it as a failover-eligible error and surfaced it directly to the user — even when fallback models were configured.

The error classification in classifyFailoverReason handles known patterns (timeouts, rate limits, auth errors, billing, etc.) but anything that doesn't match falls through. The shouldRotate flag only triggers for recognized failover errors or explicit timeouts, so unrecognized errors just got returned as-is.
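The fall-through described above can be illustrated with a minimal sketch. This is not the project's actual classifier: the reason labels, the patterns, and the name `classifySketch` are all assumptions made for illustration. It only shows how a pattern-based classifier returns nothing for a generic message such as "Provider returned error".

```typescript
// Hypothetical reason labels; the real classifyFailoverReason may use different ones.
type FailoverReason = "timeout" | "rate_limit" | "auth" | "billing";

// Illustrative patterns only, not the project's actual matchers.
const PATTERNS: Array<[FailoverReason, RegExp]> = [
  ["timeout", /timed? ?out|ETIMEDOUT/i],
  ["rate_limit", /rate limit|429|too many requests/i],
  ["auth", /unauthorized|invalid api key|401/i],
  ["billing", /billing|insufficient credits|402/i],
];

// Returns the first matching reason, or null when no pattern matches.
function classifySketch(message: string): FailoverReason | null {
  for (const [reason, pattern] of PATTERNS) {
    if (pattern.test(message)) return reason;
  }
  return null;
}

console.log(classifySketch("Request timed out after 120s")); // timeout
console.log(classifySketch("Provider returned error")); // null
```

With a classifier of this shape, any message outside the known patterns yields null, so a rotation flag derived from it stays false.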

This adds a catch-all in the inner auth-profile loop: when the assistant returned a stopReason: "error" that wasn't already handled by shouldRotate and fallback models are configured, throw a FailoverError so the outer runWithModelFallback loop gets a chance to try the next candidate.
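As a rough sketch of the catch-all described in the PR text, assuming simplified types: the `FailoverError` class below is a stand-in for the project's own, and `AttemptResult`, `escalateUnhandledError`, and both boolean parameters are hypothetical names. Only the condition itself (stopReason "error", not handled by shouldRotate, fallbacks configured) comes from the description.

```typescript
// Hypothetical stand-in for the project's FailoverError class.
class FailoverError extends Error {
  model: string;
  constructor(message: string, model: string) {
    super(message);
    this.name = "FailoverError";
    this.model = model;
  }
}

// Assumed result shape; only stopReason "error" is confirmed by the PR description.
interface AttemptResult {
  stopReason: "stop" | "error";
  errorMessage?: string;
}

// Throws so an outer fallback loop can move on to the next candidate.
// shouldRotate: what the existing classification already decided (assumed input).
// fallbacksConfigured: true when at least one fallback model exists (assumed input).
function escalateUnhandledError(
  result: AttemptResult,
  model: string,
  shouldRotate: boolean,
  fallbacksConfigured: boolean,
): void {
  if (result.stopReason === "error" && !shouldRotate && fallbacksConfigured) {
    throw new FailoverError(result.errorMessage ?? "unknown provider error", model);
  }
}
```

When no fallbacks are configured, the function returns normally and the error would surface to the user as before.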

Also adds some tests for the failover classification to document which errors are and aren't recognized.

Fixes #49921

AI-assisted: Built with Claude, reviewed by human.

Changed files

src/agents/pi-embedded-helpers/failover-matches.test.ts (added, +43/-0)
src/agents/pi-embedded-runner/run.ts (modified, +32/-0)

Describe the bug

Model fallbacks are configured (agents.defaults.model.fallbacks: ["openrouter/healer-alpha", "openrouter/free"]) but when the primary model fails, the fallback is never attempted. The system records a terminal error on the primary model without ever trying the next candidate.

To reproduce

Configure agents.defaults.model.primary: "openrouter/hunter-alpha" with fallbacks: ["openrouter/healer-alpha", "openrouter/free"]
Run cron jobs or agent sessions that intermittently fail on the primary model
Check cron run history (~/.openclaw/cron/runs/<jobId>.jsonl)

Expected behavior

When openrouter/hunter-alpha fails, the system should immediately retry with openrouter/healer-alpha, then openrouter/free, logging which model ultimately succeeded or if all candidates were exhausted.

Actual behavior

6 errors recorded over 6 hours, all on openrouter/hunter-alpha
Each error takes ~126 seconds (suggests execution timeout, not API rejection)
openrouter/healer-alpha was never attempted (0 uses across all runs)
openrouter/free was never attempted

Evidence

Cron run logs show:

model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 126081
model: "openrouter/hunter-alpha" | status: "error" | error: "Provider returned error" | durationMs: 125985
... (6 occurrences, same pattern)

No entries show fallback model names.
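A check like this can be scripted. The sketch below assumes each line of the run history is a JSON object with `model` and `status` fields; that shape is inferred from the excerpt above and may not match the real file. The name `tallyByModel` is hypothetical. It takes the file contents as a string, so reading the file is left to the caller.

```typescript
// Assumed record shape, inferred from the log excerpt; real files may differ.
interface RunRecord {
  model: string;
  status: string;
}

// Counts records per model and status from JSONL text (one JSON object per line).
function tallyByModel(jsonl: string): Map<string, Map<string, number>> {
  const tally = new Map<string, Map<string, number>>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const record = JSON.parse(line) as RunRecord;
    const perStatus = tally.get(record.model) ?? new Map<string, number>();
    perStatus.set(record.status, (perStatus.get(record.status) ?? 0) + 1);
    tally.set(record.model, perStatus);
  }
  return tally;
}

const sampleLog = [
  '{"model":"openrouter/hunter-alpha","status":"error","error":"Provider returned error","durationMs":126081}',
  '{"model":"openrouter/hunter-alpha","status":"error","error":"Provider returned error","durationMs":125985}',
].join("\n");

const runTally = tallyByModel(sampleLog);
console.log(runTally.get("openrouter/hunter-alpha")?.get("error")); // 2
console.log(runTally.has("openrouter/healer-alpha")); // false
```

A fallback model that never appears as a key in the tally was never attempted, which is the pattern reported here.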

Code analysis

The fallback logic exists in query-expansion-DnS6CGY2.js:

resolveEffectiveModelFallbacks() correctly returns configured fallbacks
resolveFallbackCandidates() correctly builds candidate list (primary + fallbacks)
runWithModelFallback() iterates candidates with try/catch

However, it appears the catch block may only handle API-level errors (429, 5xx, rate limits) and not agent execution timeouts. When the primary model times out at the execution level, the error is recorded as terminal without attempting fallback candidates.
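To make the suspected failure mode concrete, here is a simplified, synchronous sketch. None of these names come from the OpenClaw source: `buildCandidates`, `modelsTried`, `isApiLevelError`, and the `AttemptError` shape are assumptions. It shows how a loop that only retries on HTTP 429 or 5xx stops at the primary model when the error carries no status code.

```typescript
// Assumed config shape, mirroring agents.defaults.model.primary / fallbacks.
interface ModelConfig {
  primary: string;
  fallbacks: string[];
}

// Hypothetical error carrying an optional HTTP status.
interface AttemptError {
  status?: number;
  message: string;
}

// Primary first, then fallbacks in order, without duplicates.
function buildCandidates(config: ModelConfig): string[] {
  return [config.primary, ...config.fallbacks].filter(
    (model, index, all) => all.indexOf(model) === index,
  );
}

// A deliberately narrow check: only HTTP 429 and 5xx count as retriable.
function isApiLevelError(error: AttemptError): boolean {
  return error.status === 429 || (error.status !== undefined && error.status >= 500);
}

// Synchronous stand-in for the fallback loop; returns the models actually tried.
function modelsTried(
  candidates: string[],
  attempt: (model: string) => AttemptError | null, // null means success
  isRetriable: (error: AttemptError) => boolean,
): string[] {
  const tried: string[] = [];
  for (const model of candidates) {
    tried.push(model);
    const error = attempt(model);
    if (error === null || !isRetriable(error)) break; // success or terminal error
  }
  return tried;
}

const candidateList = buildCandidates({
  primary: "openrouter/hunter-alpha",
  fallbacks: ["openrouter/healer-alpha", "openrouter/free"],
});
// Every attempt fails with a status-less error, like the one in the logs.
const alwaysFails = (): AttemptError => ({ message: "Provider returned error" });

console.log(modelsTried(candidateList, alwaysFails, isApiLevelError)); // only the primary
console.log(modelsTried(candidateList, alwaysFails, () => true)); // all three candidates
```

Under these assumptions the candidate list is correct, yet the fallbacks are never reached, which matches the zero-use counts in the run history.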

Hypothesis

The agent execution timeout (likely from cron job timeout or internal agent timeout) fires before the fallback retry logic gets a chance to run, or the timeout error type is not recognized as a retriable failure by the fallback loop.

Environment

OpenClaw version: 2026.3.13
Node: v22.22.1
OS: Linux 6.12.63 (Debian)
Provider: OpenRouter
Primary model: openrouter/hunter-alpha
Configured fallbacks: openrouter/healer-alpha, openrouter/free

extent analysis

Fix Plan

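One way to address the issue is to have runWithModelFallback() (seen in the bundled query-expansion-DnS6CGY2.js) catch execution timeouts and treat them as retriable failures. Note that PR #51283 takes a different route: it throws a FailoverError from the inner runner in src/agents/pi-embedded-runner/run.ts, so the outer runWithModelFallback loop gets to try the next candidate.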

Here are the steps:

Update the catch block to recognize and handle execution timeouts.
Implement a retry mechanism for timeouts.

Example code changes:

// Sketch of runWithModelFallback(): iterate over the candidates instead of recursing.
// runModel(), isRetriable(), and logError() are placeholders, not confirmed OpenClaw APIs.
async function runWithModelFallback(candidates, runModel) {
    let lastError = null;
    for (const candidate of candidates) {
        try {
            // existing code to run the model
            return await runModel(candidate);
        } catch (error) {
            // Non-retriable errors stay terminal, as before
            if (!isRetriable(error)) throw error;
            // Retriable failure (including execution timeouts): try the next candidate
            lastError = error;
        }
    }
    // All candidates exhausted, log the error
    logError('All model fallback candidates exhausted', lastError);
    return null;
}

// Treats execution timeouts as retriable; extend with the existing API-level checks
function isRetriable(error) {
    const message = String(error?.message ?? '');
    return error?.code === 'ETIMEDOUT' || /timeout|timed out/i.test(message);
}

Verification

To verify the fix, follow these steps:

Configure the primary model and fallbacks as before.
Run cron jobs or agent sessions that intermittently fail on the primary model.
Check the cron run history (~/.openclaw/cron/runs/<jobId>.jsonl) for attempts to use the fallback models.
Verify that the system logs which model ultimately succeeded or if all candidates were exhausted.

Extra Tips

Make sure to test the updated code with different error scenarios to ensure it handles all possible retriable failures.
Consider adding a limit to the number of retries to prevent infinite loops.
Review the OpenClaw documentation for any updates on handling execution timeouts and fallbacks.
