openclaw - ✅(Solved) Fix [Feature]: Single-Model Retry Logic [2 pull requests, 1 comment, 2 participants]

joeldevelops · 2026-03-17T10:06:15Z

[openclaw] Add retry logic for API overload errors (429/529) before failover

# PR #49800: fix: retry same auth profile on transient overloaded errors before failover

- Repository: openclaw/openclaw
- Author: clovericbot
- State: open | merged: False
- Link: https://github.com/openclaw/openclaw/pull/49800

## Description (problem / solution / changelog)

## Summary

- **Problem:** When the LLM API returns a transient `overloaded` error (HTTP 529 / `overloaded_error`), OpenClaw tries to rotate to the next auth profile. If only one profile is configured for the provider, the error is surfaced immediately with "The AI service is temporarily overloaded. Please try again in a moment." and no retry.
- **Why it matters:** During Anthropic capacity spikes (frequent in March 2026), agents with a single API key become unresponsive. Users must manually resend messages. The cron subsystem already has configurable retry (`cron.retry`), but the main LLM request path does not.
- **What changed:** Added a same-profile retry loop (up to 3 attempts, exponential backoff 2s→4s→8s, jitter 25%, capped at 30s) for `overloaded` errors when no other profile is available. Applies to both prompt-side and assistant-side error paths.
- **What did NOT change (scope boundary):** Existing profile rotation, fallback model logic, and backoff-before-failover behavior are untouched. Rate-limit, auth, billing, and format errors are NOT retried. The retry counter resets on success. No new config surface (hardcoded constants; a follow-up could expose these via `agents.defaults.llmRetry`).

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #49376
- Closes #48913
- Related #49696
- Related #24321

## User-visible / Behavior Changes

- When a single-profile provider returns `overloaded` (529), the agent now silently retries up to 3 times with exponential backoff (2s, 4s, 8s) before surfacing the error.
- Log messages at `warn` level: `overloaded — retrying same profile for / : attempt=N/3 delayMs=X`.
- Failover observation logs a new decision value: `retry_same_profile`.
- If all 3 retries are exhausted, behavior is identical to before (fallback model or surface error).

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No` (same LLM API call, just retried)
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: macOS (Darwin arm64)
- Runtime: Node 24
- Model/provider: Anthropic Claude Opus 4.6 (single API key profile)
- Channel: Telegram

### Steps

1. Configure an agent with a single Anthropic auth profile (no fallbacks).
2. Send a message when Anthropic API is overloaded (returns 529).
3. **Before fix:** Error surfaced immediately.
4. **After fix:** Agent retries up to 3 times with backoff, then surfaces error only if still overloaded.

### Expected

Agent retries transparently and recovers when the API comes back within ~15s.

### Actual

Agent retries (visible in warn logs) and succeeds on retry, or surfaces error after 3 failed attempts.

## Evidence

- `tsgo` type check: 0 errors ✅
- `oxlint` on changed files: 0 warnings, 0 errors ✅
- `vitest run` on `failover-observation.test.ts`: 2/2 passed ✅
- `vitest run` on `runs.test.ts`: 6/6 passed ✅

## Human Verification (required)

- Verified: TypeScript compiles, lint passes, existing tests pass.
- Verified: Code review of both prompt-side and assistant-side paths: retry only triggers for `overloaded` reason, counter resets on success and on profile rotation.
- Edge cases checked: abort signal interrupts retry sleep; counter does not interfere with existing `overloadFailoverAttempts`; `markAuthProfileGood` is called before retry to clear cooldown state.
- What I did **not** verify: Live 529 error from Anthropic (cannot reproduce on demand).

## Review Conversations

- [x] N/A (new PR)

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- Revert this single commit to restore previous behavior (immediate surface on overloaded).
- The retry is bounded (3 attempts, ~15s max wall-clock) and respects abort signals, so it cannot cause infinite loops.
- Watch for: warn-level log spam if overload is persistent (3 log lines per failed request vs 0 before).

## Risks and Mitigations

- Risk: Retry adds up to ~15s latency before surfacing a persistent overload error.
- Mitigation: 3 retries with 2s/4s/8s back

GitHub stats

openclaw/openclaw#48913 • Fetched 2026-04-08 00:51:09

Author: joeldevelops

Participants: joeldevelops, miguelmanlyx

Timeline (top): referenced ×3, cross-referenced ×2, commented ×1, labeled ×1
Add retry logic for API overload errors (429/529) before failover

Summary

Add retry logic for API overload errors (429/529) before failover

Problem to solve

Problem

When Anthropic (or other providers) return overload errors (HTTP 529, overloaded_error response), OpenClaw immediately fails over to the next configured profile/provider. For users with a single API key, this means the request fails immediately with:

"The AI service is temporarily overloaded. Please try again in a moment."

Expected behavior: Retry the same provider with exponential backoff (like OpenCode does) before giving up or failing over.

Current Implementation

OpenClaw already has excellent infrastructure for this:

Error detection (model-selection-DPAUAnEm.js):
- ✅ Detects 529 status codes
- ✅ Pattern-matches overloaded_error in responses
- ✅ Classifies as "overloaded" reason
Backoff logic (pi-embedded-D6PpOsxP.js):
- ✅ maybeBackoffBeforeOverloadFailover() function exists
- ✅ Uses computeBackoff(OVERLOAD_FAILOVER_BACKOFF_POLICY, attempts)
- ✅ Respects abort signals

The gap: This backoff only happens between profile switches, not for retrying the same profile.
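
To make the detection step concrete, here is a minimal TypeScript sketch of how an error could be classified as `overloaded`. The `ProviderError` shape and the body of this `classifyFailoverReason` are assumptions for illustration only; the real function in model-selection-DPAUAnEm.js is not shown in this issue and may behave differently.

```typescript
// Hypothetical error shape; the real OpenClaw error type is not shown in the issue.
interface ProviderError {
  status?: number;
  type?: string;
  message?: string;
}

type FailoverReason = "overloaded" | "rate_limit" | "other";

// Illustrative stand-in for classifyFailoverReason, not the real implementation.
function classifyFailoverReason(error: ProviderError): FailoverReason {
  const text = `${error.type ?? ""} ${error.message ?? ""}`.toLowerCase();
  // HTTP 529 or an `overloaded_error` payload counts as a transient overload.
  if (error.status === 529 || text.includes("overloaded_error")) {
    return "overloaded";
  }
  // HTTP 429 is kept separate here: it signals rate limiting, not provider capacity.
  if (error.status === 429) {
    return "rate_limit";
  }
  return "other";
}
```

Keeping 429 in its own bucket matches the scope note in PR #49800, which says rate-limit errors are not retried.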

Proposed solution

Proposed Solution

Add a retry loop before failover in the main run loop:

let overloadRetryAttempts = 0;
const MAX_OVERLOAD_RETRIES = 3; // configurable

while (true) {
  try {
    // Make API call
    const result = await callLLM(...);
    overloadRetryAttempts = 0; // Reset on success
    return result;
    
  } catch (error) {
    const reason = classifyFailoverReason(error);
    
    // NEW: Retry same profile before failover
    if (reason === "overloaded" && overloadRetryAttempts < MAX_OVERLOAD_RETRIES) {
      overloadRetryAttempts++;
      const delayMs = computeBackoff(OVERLOAD_BACKOFF_POLICY, overloadRetryAttempts);
      log.warn(`API overloaded, retry ${overloadRetryAttempts}/${MAX_OVERLOAD_RETRIES} after ${delayMs}ms`);
      await sleepWithAbort(delayMs, params.abortSignal);
      continue; // Retry same request/profile
    }
    
    // Reset counter for non-overload errors or after exhausting retries
    overloadRetryAttempts = 0;
    
    // Existing failover logic...
  }
}
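
The loop above relies on placeholders (`callLLM(...)`, "Existing failover logic"). A self-contained variant might look like the sketch below, with every helper injected so it can run in isolation. All names (`RetryDeps`, `callWithOverloadRetry`, `onRetry`) are hypothetical and only mirror the pseudocode; they are not OpenClaw's actual internals.

```typescript
// All names are illustrative; they mirror the pseudocode, not OpenClaw's real code.
type Reason = "overloaded" | "other";

interface RetryDeps<T> {
  call: () => Promise<T>;                  // stands in for callLLM(...)
  classify: (error: unknown) => Reason;    // stands in for classifyFailoverReason
  backoffMs: (attempt: number) => number;  // stands in for computeBackoff
  sleep: (ms: number) => Promise<void>;    // stands in for sleepWithAbort
  maxRetries: number;
  onRetry?: (attempt: number, delayMs: number) => void; // e.g. a warn-level log
}

async function callWithOverloadRetry<T>(deps: RetryDeps<T>): Promise<T> {
  let attempts = 0;
  while (true) {
    try {
      return await deps.call();
    } catch (error) {
      if (deps.classify(error) === "overloaded" && attempts < deps.maxRetries) {
        attempts++;
        const delayMs = deps.backoffMs(attempts);
        deps.onRetry?.(attempts, delayMs);
        await deps.sleep(delayMs);
        continue; // retry the same request on the same profile
      }
      // Non-overload error or retries exhausted: rethrow so the caller's
      // existing failover logic can take over.
      throw error;
    }
  }
}
```

Rethrowing instead of resetting the counter inside the loop is a design choice of this sketch: it keeps the retry budget bounded per request and leaves failover decisions to the caller.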

Configuration

Add to agent config:

agents:
  main:
    api:
      overloadRetries: 3  # default 3, set to 0 to disable
      overloadBackoff:    # optional custom backoff policy
        initial: 2000
        max: 30000
        multiplier: 2
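
As a rough illustration of how these three fields could translate into delays, the sketch below computes an exponential schedule. `computeBackoffMs` is a hypothetical stand-in for the existing `computeBackoff`, whose real signature is not shown here. The optional `jitter` field is an assumption added because PR #49800 mentions 25% jitter; it is not part of the proposed config.

```typescript
// Field names follow the YAML above; `jitter` is an added assumption.
interface BackoffPolicy {
  initial: number;    // first delay in ms
  max: number;        // upper bound in ms
  multiplier: number; // growth factor per attempt
  jitter?: number;    // fraction in [0, 1], e.g. 0.25 for +/-25%
}

// Illustrative stand-in for computeBackoff; `attempt` starts at 1.
function computeBackoffMs(
  policy: BackoffPolicy,
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = policy.initial * Math.pow(policy.multiplier, attempt - 1);
  const jitter = policy.jitter ?? 0;
  // Spread the delay symmetrically around the base value, then apply the cap.
  const jittered = base * (1 + jitter * (2 * random() - 1));
  return Math.min(policy.max, Math.round(jittered));
}

const policy: BackoffPolicy = { initial: 2000, max: 30000, multiplier: 2 };
const delays = [1, 2, 3, 4, 5].map((n) => computeBackoffMs(policy, n));
console.log(delays.join(", ")); // 2000, 4000, 8000, 16000, 30000
```

With this policy, three retries wait 2 s + 4 s + 8 s = 14 s in total before jitter, which is in line with the "~15s" figure quoted in PR #49800.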

Benefits

Better UX: Users get Claude responses instead of "try again later"
Fewer failed requests: Transient overloads (30-60s) resolve automatically
Backwards compatible: Only retries when overloaded reason detected
Respects existing patterns: Uses existing backoff + abort infrastructure
Works with failover: If retries exhausted, existing failover logic kicks in

Implementation Notes

Location: pi-embedded-D6PpOsxP.js in the main run loop (near overloadFailoverAttempts)

Separate counters needed:

overloadRetryAttempts - retries on same profile
overloadFailoverAttempts - backoff before switching profiles (existing)

Reset conditions:

Reset overloadRetryAttempts on success or non-overload error
Keep separate from failover counter so both mechanisms work together

Edge cases:

Abort signal should interrupt retry wait (already handled by sleepWithAbort)
Log retries clearly so users know what's happening
Consider different retry limits per provider (Anthropic vs others)
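
For the abort edge case, an abortable wait can be built on the standard `AbortSignal` API. The sketch below is a hypothetical stand-in for `sleepWithAbort`; the real helper's signature and error type are not shown in this issue.

```typescript
// Illustrative stand-in for sleepWithAbort; rejects as soon as the signal aborts.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      // Normal completion: drop the listener so it does not leak.
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      // Cancel the pending timer so the retry wait ends immediately.
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A caller would pass the same signal it uses for the request, for example `await sleepWithAbort(delayMs, params.abortSignal)`, so that cancelling the run also cancels a pending retry wait.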

Alternatives considered

Could implement at HTTP client level, but doing it in the run loop:

Gives better logging context (session/model info)
Respects existing error classification
Easier to make configurable per agent
Already has abort signal wiring

Impact

User Impact: High. Anthropic has been overloaded frequently in March 2026, causing many failed requests that could have succeeded with a retry. Due to the recent rise in overloaded errors, many people on non-enterprise plans must deal with broken interactions if they rely on non-locally-hosted models. It would be nice to retry on these models before automatic failover.

Evidence/examples

https://status.claude.com/

Additional information

This pattern matches how OpenCode handles overload errors, which users report works well during Anthropic capacity issues.

extent analysis

Fix Plan

To implement retry logic for API overload errors, follow these steps:

1. **Add retry loop**: Introduce a retry loop in the main run loop (pi-embedded-D6PpOsxP.js) to handle overload errors before failing over to the next provider.
2. **Configure retry attempts**: Add configuration options for overloadRetries and overloadBackoff in the agent config (agents.main.api).
3. **Implement retry logic**: Use the following code snippet to implement the retry logic:
   ```javascript
   let overloadRetryAttempts = 0;
   const MAX_OVERLOAD_RETRIES = config.overloadRetries; // configurable

   while (true) {
     try {
       // Make API call
       const result = await callLLM(...);
       overloadRetryAttempts = 0; // Reset on success
       return result;
     } catch (error) {
       const reason = classifyFailoverReason(error);

       // Retry same profile before failover
       if (reason === "overloaded" && overloadRetryAttempts < MAX_OVERLOAD_RETRIES) {
         overloadRetryAttempts++;
         const delayMs = computeBackoff(config.overloadBackoff, overloadRetryAttempts);
         log.warn(`API overloaded, retry ${overloadRetryAttempts}/${MAX_OVERLOAD_RETRIES} after ${delayMs}ms`);
         await sleepWithAbort(delayMs, params.abortSignal);
         continue; // Retry same request/profile
       }

       // Reset counter for non-overload errors or after exhausting retries
       overloadRetryAttempts = 0;

       // Existing failover logic...
     }
   }
   ```
4. **Update configuration**: Add the following configuration options to the agent config:
   ```yaml
   agents:
     main:
       api:
         overloadRetries: 3  # default 3, set to 0 to disable
         overloadBackoff:    # optional custom backoff policy
           initial: 2000
           max: 30000
           multiplier: 2
   ```

Verification

To verify that the fix worked:

Test with overload errors: Simulate overload errors (e.g., using a test API that returns 529 status codes) and verify that the retry logic kicks in.
Check logs: Verify that the retry attempts are logged correctly, including the number of attempts and the delay between retries.
Verify success: After the retry attempts, verify that the API call is successful and the response is returned as expected.

Extra Tips

Consider implementing different retry limits per provider (e.g., Anthropic vs others).
Make sure to log retries clearly so users know what's happening.
The sleepWithAbort function should interrupt the retry wait if an abort signal is received.

PR fix notes

PR #49800: fix: retry same auth profile on transient overloaded errors before failover

PR #49807: fix: exclude overloaded errors from auth profile cooldown escalation