openclaw - 💡 (How to fix) Timeouts should mark auth profiles as failed to trigger faster model fallback [1 comment, 2 participants]

openclaw · 2026-03-20 13:19:51

openclaw/openclaw#51057 • Fetched 2026-04-08 01:04:51

Author: YoanWai

Participants: Ryce, YoanWai

Timeline (top): commented ×1, cross-referenced ×1, subscribed ×1

Problem

When the primary model provider times out (e.g. OpenAI Codex outage), the failover loop retries for the entire agents.defaults.timeoutSeconds budget (default 600s) before falling back — even though the provider is clearly down.

Root cause: In the model selection loop, timeouts are explicitly excluded from marking profile failures:

```javascript
// model-selection-CU2b7bN6.js
const maybeMarkAuthProfileFailure = async (failure) => {
    const { profileId, reason } = failure;
    if (!profileId || !reason || reason === "timeout") return; // ← timeouts skipped
    ...
};
```

Because timed-out profiles never enter cooldown, advanceAuthProfile() keeps cycling through them as "available". The loop only terminates when runLoopIterations hits the hardcoded cap:

```javascript
const BASE_RUN_RETRY_ITERATIONS = 24;
const RUN_RETRY_ITERATIONS_PER_PROFILE = 8;  // 24 + N*8 iterations
const MIN_RUN_RETRY_ITERATIONS = 32;
const MAX_RUN_RETRY_ITERATIONS = 160;
```

With 3 profiles: 24 + 3×8 = 48 max iterations, each burning ~60s on a timeout = potentially 48 minutes of useless retries (capped only by the 600s turn timeout).
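The arithmetic above can be checked with a short sketch. It assumes the four constants combine as `BASE + N * PER_PROFILE` clamped between the minimum and maximum, which the inline comment suggests but the issue does not show. The helper names and the 60-second per-attempt figure are illustrative, not OpenClaw's actual code.

```javascript
const BASE_RUN_RETRY_ITERATIONS = 24;
const RUN_RETRY_ITERATIONS_PER_PROFILE = 8;
const MIN_RUN_RETRY_ITERATIONS = 32;
const MAX_RUN_RETRY_ITERATIONS = 160;

// Hypothetical helper: assumes the cap is BASE + N * PER_PROFILE, clamped to [MIN, MAX].
function resolveMaxRunRetryIterations(profileCount) {
    const scaled = BASE_RUN_RETRY_ITERATIONS + profileCount * RUN_RETRY_ITERATIONS_PER_PROFILE;
    return Math.min(MAX_RUN_RETRY_ITERATIONS, Math.max(MIN_RUN_RETRY_ITERATIONS, scaled));
}

// Worst case if every iteration waits out a full timeout, bounded by the turn timeout.
function worstCaseRetrySeconds(profileCount, secondsPerTimeout, turnTimeoutSeconds) {
    const uncapped = resolveMaxRunRetryIterations(profileCount) * secondsPerTimeout;
    return Math.min(uncapped, turnTimeoutSeconds);
}

console.log(resolveMaxRunRetryIterations(3));        // 48
console.log(worstCaseRetrySeconds(3, 60, Infinity)); // 2880 (48 minutes)
console.log(worstCaseRetrySeconds(3, 60, 600));      // 600 (the turn timeout wins)
```

Under these assumptions the iteration cap never becomes the limiting factor for three profiles. The 600s turn timeout always fires first, which matches the roughly 10-minute stall in the logs.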

Observed behavior

OpenClaw v2026.3.13, Anna agent with openai-codex/gpt-5.4 primary and 3 OAuth profiles:

```
12:44:27 embedded run agent end: isError=true error=LLM request timed out.
12:45:20 embedded run agent end: isError=true error=LLM request timed out.
12:46:17 failover decision: decision=rotate_profile reason=timeout
12:47:06 embedded run agent end: isError=true error=LLM request timed out.
12:47:57 embedded run agent end: isError=true error=LLM request timed out.
12:48:50 embedded run agent end: isError=true error=LLM request timed out.
12:49:47 failover decision: decision=rotate_profile reason=timeout
12:50:35 embedded run agent end: isError=true error=LLM request timed out.
12:51:26 embedded run agent end: isError=true error=LLM request timed out.
12:52:19 embedded run agent end: isError=true error=LLM request timed out.
12:53:16 failover decision: decision=fallback_model  ← finally, after ~10 min
12:53:36 model fallback: candidate=anthropic/claude-sonnet-4-6 → succeeded in 20s
```

Proposal

One or more of these would fix the issue:

1. Count consecutive timeouts toward profile cooldown: e.g. after 2 consecutive timeouts on a profile, put it in a short cooldown (30-60s). This lets the system exhaust all profiles faster and reach `fallback_model` sooner.
2. Add a configurable `model.maxFailoverSeconds`: a separate timeout that caps the total time spent retrying the primary provider before falling back, independent of `agents.defaults.timeoutSeconds` (which also covers legitimate long-running turns like deep research via subagents).
3. Add `model.maxRetriesBeforeFallback`: an explicit cap on how many times the primary model is retried before escalating to the fallback chain.
4. Per-agent `timeoutSeconds`: allow individual agents to have different timeout budgets, so a chat agent can have 120s while a research agent keeps 900s.
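As a rough illustration of proposals 2 and 3, the decision between rotating profiles and escalating could be bounded by both elapsed time and retry count. Neither `maxFailoverSeconds` nor `maxRetriesBeforeFallback` exists as a setting today, and `decideFailover` and its state shape are made up for this sketch. Only the decision strings `rotate_profile` and `fallback_model` come from the logs above.

```javascript
// Hypothetical limits named after the proposal; not existing OpenClaw options.
const failoverLimits = { maxFailoverSeconds: 120, maxRetriesBeforeFallback: 4 };

// Decide whether to keep retrying the primary model or escalate to the fallback chain.
// `state` is an assumed shape: when the first failure happened and how many retries ran.
function decideFailover(state, limits, nowMs) {
    const elapsedSeconds = (nowMs - state.firstFailureAtMs) / 1000;
    if (state.primaryRetries >= limits.maxRetriesBeforeFallback) return "fallback_model";
    if (elapsedSeconds >= limits.maxFailoverSeconds) return "fallback_model";
    return "rotate_profile";
}

const firstFailureAtMs = 0;
console.log(decideFailover({ firstFailureAtMs, primaryRetries: 1 }, failoverLimits, 50000));
// rotate_profile: 50s elapsed, 1 retry
console.log(decideFailover({ firstFailureAtMs, primaryRetries: 2 }, failoverLimits, 130000));
// fallback_model: the 120s failover budget is spent
```

Keeping this budget separate from the turn timeout is the point of the proposal: a long-running research turn keeps its full allowance, while a dead provider is abandoned after about two minutes.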

Why `agents.defaults.timeoutSeconds` alone doesn't solve this

Lowering the global timeout would also kill legitimate long-running turns (e.g. subagent deep research that takes 15+ minutes). There's no per-agent override, so you can't set a short timeout for the chat agent without breaking the research agent.
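A per-agent override could resolve along these lines. This is a sketch only: the issue mentions `agents.defaults.timeoutSeconds`, while the `overrides` map and `resolveTimeoutSeconds` are assumptions introduced for illustration.

```javascript
// Hypothetical config shape: only agents.defaults.timeoutSeconds appears in the issue;
// the per-agent "overrides" map is an assumption.
const config = {
    agents: {
        defaults: { timeoutSeconds: 900 },
        overrides: { chat: { timeoutSeconds: 120 } },
    },
};

// Prefer the agent's own timeout and fall back to the global default.
function resolveTimeoutSeconds(cfg, agentId) {
    const perAgent = cfg.agents.overrides?.[agentId]?.timeoutSeconds;
    return perAgent ?? cfg.agents.defaults.timeoutSeconds;
}

console.log(resolveTimeoutSeconds(config, "chat"));     // 120
console.log(resolveTimeoutSeconds(config, "research")); // 900
```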

Related issues

#45589 — same 10-min unresponsiveness symptom (Discord + Gemini)
#32533 — failover doesn't escalate on overload errors
#36576 — retry with backoff for transient provider errors

extent analysis

Fix Plan

To address the issue, we will implement the first proposal: count consecutive timeouts toward profile cooldown. This involves modifying `maybeMarkAuthProfileFailure` so that timeouts are no longer excluded when marking profile failures.

Step-by-Step Solution

1. **Modify the `maybeMarkAuthProfileFailure` function** so that timeouts are no longer skipped:
   ```javascript
   const maybeMarkAuthProfileFailure = async (failure) => {
       const { profileId, reason } = failure;
       if (!profileId || !reason) return; // timeouts are no longer excluded here
       // ...
   };
   ```

2. **Implement consecutive timeout tracking**:
   ```javascript
   const consecutiveTimeouts = {}; // profileId -> current run of timeouts
   const maxConsecutiveTimeouts = 2; // Adjust as needed
   const cooldownDuration = 30 * 1000; // 30 seconds, in milliseconds

   const maybeMarkAuthProfileFailure = async (failure) => {
       const { profileId, reason } = failure;
       if (!profileId || !reason) return;
       if (reason === "timeout") {
           consecutiveTimeouts[profileId] = (consecutiveTimeouts[profileId] ?? 0) + 1;
           if (consecutiveTimeouts[profileId] >= maxConsecutiveTimeouts) {
               // Put the profile in cooldown and start counting afresh
               await putProfileInCooldown(profileId, cooldownDuration);
               delete consecutiveTimeouts[profileId];
           }
       } else {
           // A non-timeout failure breaks the run of consecutive timeouts
           delete consecutiveTimeouts[profileId];
           // Handle other failure reasons
           // ...
       }
   };
   ```
   For the count to be truly consecutive, the counter for a profile should also be cleared whenever a request on that profile succeeds.

3. **Implement the `putProfileInCooldown` function**:
   ```javascript
   const putProfileInCooldown = async (profileId, duration) => {
       // Update the profile's status to "cooldown" for the specified duration
       // ...
   };
   ```


### Verification
To verify the fix, you can simulate a timeout scenario and check if the profile is put in cooldown after the specified number of consecutive timeouts. Monitor the system's behavior and adjust the `maxConsecutiveTimeouts` and `cooldownDuration` values as needed to achieve the desired failover behavior.
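One way to run such a simulation without touching a real provider is an in-memory tracker driven by explicit timestamps. Everything here (`createTracker`, the constants, the profile name) is illustrative and stands in for whatever cooldown store the real code uses.

```javascript
// Self-contained simulation; the in-memory store and all names are illustrative.
const MAX_CONSECUTIVE_TIMEOUTS = 2;
const COOLDOWN_MS = 30 * 1000;

function createTracker() {
    const timeouts = new Map();      // profileId -> consecutive timeout count
    const cooldownUntil = new Map(); // profileId -> timestamp (ms) when cooldown ends
    return {
        recordFailure(profileId, reason, nowMs) {
            if (reason !== "timeout") {
                timeouts.delete(profileId); // any other failure breaks the run
                return;
            }
            const count = (timeouts.get(profileId) ?? 0) + 1;
            if (count >= MAX_CONSECUTIVE_TIMEOUTS) {
                cooldownUntil.set(profileId, nowMs + COOLDOWN_MS);
                timeouts.delete(profileId);
            } else {
                timeouts.set(profileId, count);
            }
        },
        isAvailable(profileId, nowMs) {
            return (cooldownUntil.get(profileId) ?? 0) <= nowMs;
        },
    };
}

const tracker = createTracker();
tracker.recordFailure("profile-a", "timeout", 0);
console.log(tracker.isAvailable("profile-a", 0));     // true: one timeout is not enough
tracker.recordFailure("profile-a", "timeout", 1000);
console.log(tracker.isAvailable("profile-a", 1000));  // false: second consecutive timeout
console.log(tracker.isAvailable("profile-a", 31000)); // true: the 30s cooldown has elapsed
```

One thing to watch when tuning: if fallback is only reached once every profile is cooling down at the same time, a 30s cooldown will already have expired by the time a roughly 60s attempt on the next profile times out. In that case the cooldown needs to outlast the time it takes to time out on the remaining profiles.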

### Extra Tips
* Consider adding logging to track consecutive timeouts and cooldown events for easier debugging and monitoring.
* Adjust the `maxConsecutiveTimeouts` and `cooldownDuration` values based on your system's specific requirements and performance characteristics.
* You may also want to explore implementing the other proposed solutions, such as adding a configurable `model.maxFailoverSeconds` or `model.maxRetriesBeforeFallback`, to further improve the system's failover behavior.
