openclaw - 💡(How to fix) Fix Timeouts should mark auth profiles as failed to trigger faster model fallback [1 comments, 2 participants]

openclaw/openclaw#51057 (fetched 2026-04-08 01:04:51)

Problem

When the primary model provider times out (e.g. OpenAI Codex outage), the failover loop retries for the entire agents.defaults.timeoutSeconds budget (default 600s) before falling back — even though the provider is clearly down.

Root cause: In the model selection loop, timeouts are explicitly excluded from marking profile failures:

// model-selection-CU2b7bN6.js
const maybeMarkAuthProfileFailure = async (failure) => {
    const { profileId, reason } = failure;
    if (!profileId || !reason || reason === "timeout") return; // ← timeouts skipped
    ...
};

Because timed-out profiles never enter cooldown, advanceAuthProfile() keeps cycling through them as "available". The loop only terminates when runLoopIterations hits the hardcoded cap:

const BASE_RUN_RETRY_ITERATIONS = 24;
const RUN_RETRY_ITERATIONS_PER_PROFILE = 8;  // 24 + N*8 iterations
const MIN_RUN_RETRY_ITERATIONS = 32;
const MAX_RUN_RETRY_ITERATIONS = 160;

With 3 profiles: 24 + 3×8 = 48 max iterations, each burning ~60s on a timeout = potentially 48 minutes of useless retries (capped only by the 600s turn timeout).
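The arithmetic above can be reproduced with a short sketch. The four constants are copied from the issue; the clamping to the MIN/MAX bounds is an assumption inferred from the constant names, and the helper names `estimateMaxRetryIterations` and `estimateWorstCaseSeconds` are hypothetical.

```javascript
// Constants copied from the issue.
const BASE_RUN_RETRY_ITERATIONS = 24;
const RUN_RETRY_ITERATIONS_PER_PROFILE = 8;
const MIN_RUN_RETRY_ITERATIONS = 32;
const MAX_RUN_RETRY_ITERATIONS = 160;

// Assumption: the cap is base + N * perProfile, clamped to [MIN, MAX].
// The clamping is inferred from the constant names, not confirmed by the source.
function estimateMaxRetryIterations(profileCount) {
  const scaled =
    BASE_RUN_RETRY_ITERATIONS + profileCount * RUN_RETRY_ITERATIONS_PER_PROFILE;
  return Math.min(
    MAX_RUN_RETRY_ITERATIONS,
    Math.max(MIN_RUN_RETRY_ITERATIONS, scaled)
  );
}

// Worst-case wall-clock time if every iteration burns a full request timeout.
function estimateWorstCaseSeconds(profileCount, secondsPerTimeout) {
  return estimateMaxRetryIterations(profileCount) * secondsPerTimeout;
}

const iterations = estimateMaxRetryIterations(3);
const worstCaseMinutes = estimateWorstCaseSeconds(3, 60) / 60;
console.log(`${iterations} iterations, up to ${worstCaseMinutes} minutes`);
```

In practice the 600s turn timeout ends the loop long before the iteration cap is reached, which is why the observed delay is about 10 minutes rather than 48.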

Observed behavior

OpenClaw v2026.3.13, Anna agent with openai-codex/gpt-5.4 primary and 3 OAuth profiles:

12:44:27 embedded run agent end: isError=true error=LLM request timed out.
12:45:20 embedded run agent end: isError=true error=LLM request timed out.
12:46:17 failover decision: decision=rotate_profile reason=timeout
12:47:06 embedded run agent end: isError=true error=LLM request timed out.
12:47:57 embedded run agent end: isError=true error=LLM request timed out.
12:48:50 embedded run agent end: isError=true error=LLM request timed out.
12:49:47 failover decision: decision=rotate_profile reason=timeout
12:50:35 embedded run agent end: isError=true error=LLM request timed out.
12:51:26 embedded run agent end: isError=true error=LLM request timed out.
12:52:19 embedded run agent end: isError=true error=LLM request timed out.
12:53:16 failover decision: decision=fallback_model  ← finally, after ~10 min
12:53:36 model fallback: candidate=anthropic/claude-sonnet-4-6 → succeeded in 20s
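
As a rough check on the "~10 min" figure, the log excerpt can be tallied with a small script. This is only a sketch: the lines are copied from the excerpt above (with the "finally, after ~10 min" annotation dropped), and the helper `toSeconds` is hypothetical. The excerpt begins with a timeout event, so the turn presumably started somewhat before 12:44:27; that start time is not in the log.

```javascript
// Log lines copied from the issue; the trailing annotation on the
// fallback_model line is omitted.
const logLines = [
  "12:44:27 embedded run agent end: isError=true error=LLM request timed out.",
  "12:45:20 embedded run agent end: isError=true error=LLM request timed out.",
  "12:46:17 failover decision: decision=rotate_profile reason=timeout",
  "12:47:06 embedded run agent end: isError=true error=LLM request timed out.",
  "12:47:57 embedded run agent end: isError=true error=LLM request timed out.",
  "12:48:50 embedded run agent end: isError=true error=LLM request timed out.",
  "12:49:47 failover decision: decision=rotate_profile reason=timeout",
  "12:50:35 embedded run agent end: isError=true error=LLM request timed out.",
  "12:51:26 embedded run agent end: isError=true error=LLM request timed out.",
  "12:52:19 embedded run agent end: isError=true error=LLM request timed out.",
  "12:53:16 failover decision: decision=fallback_model",
  "12:53:36 model fallback: candidate=anthropic/claude-sonnet-4-6 → succeeded in 20s",
];

// Convert the leading HH:MM:SS stamp of a log line to seconds since midnight.
function toSeconds(line) {
  const [hours, minutes, seconds] = line.slice(0, 8).split(":").map(Number);
  return hours * 3600 + minutes * 60 + seconds;
}

const timeoutCount = logLines.filter((l) => l.includes("LLM request timed out")).length;
const rotationCount = logLines.filter((l) => l.includes("decision=rotate_profile")).length;
const fallbackLine = logLines.find((l) => l.includes("decision=fallback_model"));
const secondsUntilFallback = toSeconds(fallbackLine) - toSeconds(logLines[0]);

console.log({ timeoutCount, rotationCount, secondsUntilFallback });
```

Counting from the first logged timeout, the fallback decision arrives after eight timeouts and two profile rotations.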

Proposal

One or more of these would fix the issue:

  1. Count consecutive timeouts toward profile cooldown — e.g. after 2 consecutive timeouts on a profile, put it in short cooldown (30-60s). This lets the system exhaust all profiles faster and reach fallback_model sooner.

  2. Add a configurable model.maxFailoverSeconds — a separate timeout that caps the total time spent retrying the primary provider before falling back, independent of agents.defaults.timeoutSeconds (which also covers legitimate long-running turns like deep research via subagents).

  3. Add model.maxRetriesBeforeFallback — explicit cap on how many times the primary model is retried before escalating to the fallback chain.

  4. Per-agent timeoutSeconds — allow individual agents to have different timeout budgets, so a chat agent can have 120s while a research agent keeps 900s.
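
Proposals 2 and 3 could be combined into a single check that the failover loop consults before retrying the primary model. The following is a minimal sketch under stated assumptions: `maxFailoverSeconds` and `maxRetriesBeforeFallback` are the option names suggested in this issue and do not exist in OpenClaw today, and the function name and the shape of `state` are hypothetical.

```javascript
// Hypothetical helper: decide whether to stop retrying the primary model.
// `limits` uses the option names proposed in the issue; neither exists yet.
// `state` is an assumed shape tracking the current failover episode.
function shouldFallBackToNextModel(state, limits, nowMs) {
  const { maxFailoverSeconds, maxRetriesBeforeFallback } = limits;
  const elapsedSeconds = (nowMs - state.firstAttemptAtMs) / 1000;

  // Proposal 2: cap total time spent retrying the primary provider.
  if (maxFailoverSeconds !== undefined && elapsedSeconds >= maxFailoverSeconds) {
    return true;
  }
  // Proposal 3: cap the number of primary attempts.
  if (
    maxRetriesBeforeFallback !== undefined &&
    state.primaryAttempts >= maxRetriesBeforeFallback
  ) {
    return true;
  }
  return false;
}

const limits = { maxFailoverSeconds: 120, maxRetriesBeforeFallback: 3 };

// One 60s timeout so far: keep retrying.
const afterOneTimeout = shouldFallBackToNextModel(
  { firstAttemptAtMs: 0, primaryAttempts: 1 }, limits, 60_000);
// Two 60s timeouts: the time budget is exhausted.
const afterTwoTimeouts = shouldFallBackToNextModel(
  { firstAttemptAtMs: 0, primaryAttempts: 2 }, limits, 120_000);
// Three fast failures: the retry cap is hit before the time budget.
const afterThreeFastFailures = shouldFallBackToNextModel(
  { firstAttemptAtMs: 0, primaryAttempts: 3 }, limits, 30_000);

console.log({ afterOneTimeout, afterTwoTimeouts, afterThreeFastFailures });
```

Keeping both limits optional means either proposal could ship on its own, and neither interferes with `agents.defaults.timeoutSeconds`.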

Why agents.defaults.timeoutSeconds alone doesn't solve this

Lowering the global timeout would also kill legitimate long-running turns (e.g. subagent deep research that takes 15+ minutes). There's no per-agent override, so you can't set a short timeout for the chat agent without breaking the research agent.
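
A per-agent override could resolve roughly as follows. This is a sketch only: `agents.defaults.timeoutSeconds` is the existing setting named in this issue, while the `agents.list` map, the per-agent `timeoutSeconds` key, and the agent ids are hypothetical.

```javascript
// Hypothetical config shape. Only agents.defaults.timeoutSeconds is taken
// from the issue; the per-agent entries illustrate the proposed override.
const config = {
  agents: {
    defaults: { timeoutSeconds: 600 },
    list: {
      chat: { timeoutSeconds: 120 },
      research: { timeoutSeconds: 900 },
      helper: {}, // no override: inherits the default
    },
  },
};

// Prefer the agent's own timeout, falling back to the global default.
function resolveTimeoutSeconds(cfg, agentId) {
  const override = cfg.agents.list?.[agentId]?.timeoutSeconds;
  return override ?? cfg.agents.defaults.timeoutSeconds;
}

console.log(resolveTimeoutSeconds(config, "chat"));     // 120
console.log(resolveTimeoutSeconds(config, "research")); // 900
console.log(resolveTimeoutSeconds(config, "helper"));   // 600
```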

Related issues

  • #45589 — same 10-min unresponsiveness symptom (Discord + Gemini)
  • #32533 — failover doesn't escalate on overload errors
  • #36576 — retry with backoff for transient provider errors

extent analysis

Fix Plan

This plan implements the first proposal: count consecutive timeouts toward profile cooldown. It changes the `maybeMarkAuthProfileFailure` function so that timeouts are no longer excluded when marking profile failures.

Step-by-Step Solution

1. **Modify the `maybeMarkAuthProfileFailure` function**:
   ```javascript
   const maybeMarkAuthProfileFailure = async (failure) => {
       const { profileId, reason } = failure;
       if (!profileId || !reason) return; // Include timeouts in marking profile failures
       // ...
   };
   ```

2. **Implement consecutive timeout tracking**:
   ```javascript
   const consecutiveTimeouts = {};
   const maxConsecutiveTimeouts = 2; // Adjust as needed
   const cooldownDuration = 30 * 1000; // 30 seconds

   const maybeMarkAuthProfileFailure = async (failure) => {
       const { profileId, reason } = failure;
       if (!profileId || !reason) return;
       if (reason === "timeout") {
           // Count this timeout; a missing entry starts at zero
           consecutiveTimeouts[profileId] = (consecutiveTimeouts[profileId] ?? 0) + 1;
           if (consecutiveTimeouts[profileId] >= maxConsecutiveTimeouts) {
               // Put the profile in cooldown
               await putProfileInCooldown(profileId, cooldownDuration);
               delete consecutiveTimeouts[profileId];
           }
       } else {
           // Handle other failure reasons
           // ...
       }
   };
   ```
   For the count to mean "consecutive", it should also be reset whenever a request on that profile succeeds; where that reset belongs is not shown in the issue.

3. **Implement the `putProfileInCooldown` function**:
   ```javascript
   const putProfileInCooldown = async (profileId, duration) => {
       // Update the profile's status to "cooldown" for the specified duration
       // ...
   };
   ```


### Verification
To verify the fix, you can simulate a timeout scenario and check if the profile is put in cooldown after the specified number of consecutive timeouts. Monitor the system's behavior and adjust the `maxConsecutiveTimeouts` and `cooldownDuration` values as needed to achieve the desired failover behavior.
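
One way to simulate this without a real provider is an in-memory tracker driven by an explicit clock. The sketch below is an illustration, not OpenClaw code: `createTimeoutTracker` and its methods are hypothetical names, the thresholds mirror the example values above, and the reset on success is an assumption about how "consecutive" should be interpreted.

```javascript
// Hypothetical in-memory tracker; names and thresholds are illustrative only.
function createTimeoutTracker({ maxConsecutiveTimeouts = 2, cooldownMs = 30_000 } = {}) {
  const consecutive = new Map();   // profileId -> consecutive timeout count
  const cooldownUntil = new Map(); // profileId -> time (ms) when cooldown ends

  return {
    recordFailure(profileId, reason, nowMs) {
      if (!profileId || reason !== "timeout") return;
      const count = (consecutive.get(profileId) ?? 0) + 1;
      if (count >= maxConsecutiveTimeouts) {
        // Threshold reached: start the cooldown and reset the counter.
        cooldownUntil.set(profileId, nowMs + cooldownMs);
        consecutive.delete(profileId);
      } else {
        consecutive.set(profileId, count);
      }
    },
    // Assumption: a success breaks the streak of consecutive timeouts.
    recordSuccess(profileId) {
      consecutive.delete(profileId);
    },
    isAvailable(profileId, nowMs) {
      return nowMs >= (cooldownUntil.get(profileId) ?? 0);
    },
  };
}

// Simulated scenario: two timeouts in a row on the same profile.
const tracker = createTimeoutTracker();
tracker.recordFailure("profile-a", "timeout", 0);
const availableAfterOne = tracker.isAvailable("profile-a", 0);
tracker.recordFailure("profile-a", "timeout", 1_000);
const availableAfterTwo = tracker.isAvailable("profile-a", 1_000);
const availableAfterCooldown = tracker.isAvailable("profile-a", 31_000);

console.log({ availableAfterOne, availableAfterTwo, availableAfterCooldown });
```

Passing `nowMs` explicitly keeps the simulation deterministic, so the thresholds can be tuned without waiting for real timeouts.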

### Extra Tips
* Consider adding logging to track consecutive timeouts and cooldown events for easier debugging and monitoring.
* Adjust the `maxConsecutiveTimeouts` and `cooldownDuration` values based on your system's specific requirements and performance characteristics.
* You may also want to explore implementing the other proposed solutions, such as adding a configurable `model.maxFailoverSeconds` or `model.maxRetriesBeforeFallback`, to further improve the system's failover behavior.
