openclaw - ✅(Solved) Fix [Feature]: Single-Model Retry Logic [2 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#48913 (fetched 2026-04-08 00:51:09)
Comments: 1
Participants: 2
Timeline: 7
Reactions: 1
Timeline (top): referenced ×3, cross-referenced ×2, commented ×1, labeled ×1

Add retry logic for API overload errors (429/529) before failover

Error Message

"The AI service is temporarily overloaded. Please try again in a moment."

Surfaced immediately when the provider returns HTTP 529 / overloaded_error and only one auth profile is configured for that provider.

Root Cause

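The existing overload backoff (maybeBackoffBeforeOverloadFailover()) only runs between profile switches, never as a retry of the same profile. With a single auth profile there is nothing to rotate to, so the first 529 is surfaced to the user without any retry.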

Fix Action

Fixed

PR fix notes

PR #49800: fix: retry same auth profile on transient overloaded errors before failover

Description (problem / solution / changelog)

Summary

  • Problem: When the LLM API returns a transient overloaded error (HTTP 529 / overloaded_error), OpenClaw tries to rotate to the next auth profile. If only one profile is configured for the provider, the error is surfaced immediately with "The AI service is temporarily overloaded. Please try again in a moment." — no retry.
  • Why it matters: During Anthropic capacity spikes (frequent in March 2026), agents with a single API key become unresponsive. Users must manually resend messages. The cron subsystem already has configurable retry (cron.retry), but the main LLM request path does not.
  • What changed: Added a same-profile retry loop (up to 3 attempts, exponential backoff 2s→4s→8s, jitter 25%, capped at 30s) for overloaded errors when no other profile is available. Applies to both prompt-side and assistant-side error paths.
  • What did NOT change (scope boundary): Existing profile rotation, fallback model logic, and backoff-before-failover behavior are untouched. Rate-limit, auth, billing, and format errors are NOT retried. The retry counter resets on success. No new config surface (hardcoded constants — a follow-up could expose these via agents.defaults.llmRetry).
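
The backoff schedule quoted in the summary (2s→4s→8s, 25% jitter, 30s cap) can be sketched as below. This is a minimal illustration, not the code from the PR: the constant names, the function name overloadRetryDelayMs, and the choice to spread jitter symmetrically around the nominal delay are all assumptions.

```typescript
// Hypothetical constants mirroring the values quoted in the PR summary.
const BASE_DELAY_MS = 2000;
const MAX_DELAY_MS = 30000;
const JITTER_RATIO = 0.25;

/**
 * Delay before retry `attempt` (1-based): base * 2^(attempt - 1), capped,
 * then jittered. `random` is injectable so the jitter can be pinned in tests.
 * How the real implementation applies jitter is not stated in the PR text;
 * symmetric +/-25% is an assumption made for this sketch.
 */
function overloadRetryDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  const nominal = Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
  // Map random() from [0, 1) onto [-1, 1) and scale by the jitter ratio.
  const jitter = nominal * JITTER_RATIO * (random() * 2 - 1);
  // Re-apply the cap so positive jitter never exceeds the maximum.
  return Math.min(Math.round(nominal + jitter), MAX_DELAY_MS);
}
```

With the jitter pinned to zero, three attempts wait 2000, 4000 and 8000 ms, which adds up to the ~15s worst case mentioned later in the PR.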

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #49376
  • Closes #48913
  • Related #49696
  • Related #24321

User-visible / Behavior Changes

  • When a single-profile provider returns overloaded (529), the agent now silently retries up to 3 times with exponential backoff (2s, 4s, 8s) before surfacing the error.
  • Log messages at warn level: overloaded — retrying same profile for <provider>/<model>: attempt=N/3 delayMs=X.
  • Failover observation logs a new decision value: retry_same_profile.
  • If all 3 retries are exhausted, behavior is identical to before (fallback model or surface error).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (same LLM API call, just retried)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Darwin arm64)
  • Runtime: Node 24
  • Model/provider: Anthropic Claude Opus 4.6 (single API key profile)
  • Channel: Telegram

Steps

  1. Configure an agent with a single Anthropic auth profile (no fallbacks).
  2. Send a message when Anthropic API is overloaded (returns 529).
  3. Before fix: Error surfaced immediately.
  4. After fix: Agent retries up to 3 times with backoff, then surfaces error only if still overloaded.

Expected

Agent retries transparently and recovers when the API comes back within ~15s.

Actual

Agent retries (visible in warn logs) and succeeds on retry, or surfaces error after 3 failed attempts.

Evidence

  • tsgo type check: 0 errors ✅
  • oxlint on changed files: 0 warnings, 0 errors ✅
  • vitest run on failover-observation.test.ts: 2/2 passed ✅
  • vitest run on runs.test.ts: 6/6 passed ✅

Human Verification (required)

  • Verified: TypeScript compiles, lint passes, existing tests pass.
  • Verified: Code review of both prompt-side and assistant-side paths — retry only triggers for overloaded reason, counter resets on success and on profile rotation.
  • Edge cases checked: abort signal interrupts retry sleep; counter does not interfere with existing overloadFailoverAttempts; markAuthProfileGood is called before retry to clear cooldown state.
  • What I did not verify: Live 529 error from Anthropic (cannot reproduce on demand).
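
The "abort signal interrupts retry sleep" edge case relies on a sleep that rejects as soon as the signal fires. A minimal sketch of such a helper follows; abortableSleep is a hypothetical name used for illustration and is not claimed to match the project's own sleepWithAbort.

```typescript
/**
 * Resolve after `ms` milliseconds, or reject early if `signal` aborts.
 * Hypothetical helper; the project's sleepWithAbort may differ in shape.
 */
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted before sleep started"));
      return;
    }
    const timer = setTimeout(() => {
      // Normal completion: stop listening so the handler is not leaked.
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      // Abort wins: cancel the timer so the process is not kept alive.
      clearTimeout(timer);
      reject(new Error("aborted during sleep"));
    }
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A retry loop that awaits this helper stops waiting immediately when the user cancels, instead of finishing the remaining backoff first.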

Review Conversations

  • N/A (new PR)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • Revert this single commit to restore previous behavior (immediate surface on overloaded).
  • The retry is bounded (3 attempts, ~15s max wall-clock) and respects abort signals, so it cannot cause infinite loops.
  • Watch for: warn-level log spam if overload is persistent (3 log lines per failed request vs 0 before).

Risks and Mitigations

  • Risk: Retry adds up to ~15s latency before surfacing a persistent overload error.
    • Mitigation: 3 retries with 2s/4s/8s backoff is a reasonable tradeoff. Users previously had to manually retry anyway. A future config option could let users tune this.
  • Risk: markAuthProfileGood call before retry could mask a genuinely bad profile.
    • Mitigation: Only called for overloaded reason (transient), not for auth/billing/format errors. The profile was working before the overload.

Changed files

  • src/agents/pi-embedded-runner/run.overflow-compaction.harness.ts (modified, +21/-5)
  • src/agents/pi-embedded-runner/run.overloaded-retry.test.ts (added, +96/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +91/-0)
  • src/agents/pi-embedded-runner/run/failover-observation.ts (modified, +1/-1)

PR #49807: fix: exclude overloaded errors from auth profile cooldown escalation

Description (problem / solution / changelog)

Summary

  • Problem: Transient overloaded errors (HTTP 529) are persisted to auth profile failure stats, triggering exponential cooldown escalation (60s × 5^(n-1), up to 1 hour). This causes a cascading failure where the profile becomes unusable long after the provider recovers.
  • Why it matters: Two agents sharing the same Anthropic API key and model — one works fine, the other is stuck in cooldown for 25+ minutes after a brief 529 spike. The agent that first hit the 529 enters a death spiral: retry → 529 or cooldown → error count ↑ → longer cooldown → repeat.
  • What changed: resolveAuthProfileFailureReason() now returns null for "overloaded" (same treatment as "timeout"), so overloaded errors do not persist to usage stats or trigger cooldown escalation. The profile stays healthy for immediate retry.
  • What did NOT change: Rate-limit, billing, auth, and format errors still trigger cooldown as before. The maybeBackoffBeforeOverloadFailover() backoff between profile rotations is unchanged.
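
To make the escalation concrete, the cooldown formula quoted above (60s × 5^(n-1), up to 1 hour) can be evaluated with a few lines of code. The function and constant names are hypothetical; only the formula comes from the PR description.

```typescript
// Values taken from the formula in the PR summary; names are illustrative.
const COOLDOWN_BASE_MS = 60 * 1000;
const COOLDOWN_FACTOR = 5;
const COOLDOWN_MAX_MS = 60 * 60 * 1000;

/** Cooldown after the n-th consecutive failure (n starts at 1). */
function cooldownMs(errorCount: number): number {
  if (!Number.isInteger(errorCount) || errorCount < 1) {
    throw new RangeError("errorCount must be a positive integer");
  }
  const uncapped = COOLDOWN_BASE_MS * COOLDOWN_FACTOR ** (errorCount - 1);
  return Math.min(uncapped, COOLDOWN_MAX_MS);
}
```

Under this formula the first four failures produce cooldowns of 1, 5, 25 and 60 minutes, so a short burst of 529s is enough to lock a profile out for far longer than the provider outage itself.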

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #49696
  • Related #49376
  • Related #48913
  • Related #24321
  • Related #49800

User-visible / Behavior Changes

  • overloaded (529) errors no longer cause auth profile cooldown escalation.
  • After a 529 spike, the agent can immediately retry instead of waiting out an exponentially growing cooldown window.
  • Two agents sharing the same API key will no longer have divergent availability after one hits a transient 529.
  • Log change: auth_profile_failure_state_updated events will no longer appear for reason: "overloaded".
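
The behavior change amounts to filtering transient reasons out before anything is written to the profile's failure stats. The sketch below shows that idea under stated assumptions: persistableFailureReason and recordFailure are hypothetical names, and the reason strings other than "overloaded" and "timeout" are invented for illustration rather than taken from the codebase.

```typescript
// Only "overloaded" and "timeout" are named in the PR; the rest are assumed.
type FailoverReason =
  | "overloaded"
  | "timeout"
  | "rate_limit"
  | "billing"
  | "auth"
  | "format";

// Reasons treated as transient: they should not count against the profile.
const TRANSIENT_REASONS: ReadonlySet<FailoverReason> = new Set<FailoverReason>([
  "overloaded",
  "timeout",
]);

/** Returns the reason to persist, or null when nothing should be recorded. */
function persistableFailureReason(reason: FailoverReason): FailoverReason | null {
  return TRANSIENT_REASONS.has(reason) ? null : reason;
}

/** Hypothetical stats store: bumps a per-profile error count when allowed. */
function recordFailure(
  stats: Map<string, number>,
  profileId: string,
  reason: FailoverReason,
): void {
  if (persistableFailureReason(reason) === null) return; // transient: skip
  stats.set(profileId, (stats.get(profileId) ?? 0) + 1);
}
```

Because overloaded failures never reach the counter, they cannot feed the cooldown escalation, while rate-limit, billing, auth and format failures still do.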

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS Darwin arm64
  • Runtime: Node 24
  • Model/provider: Anthropic Claude Opus 4.6 (single API key, no fallbacks)
  • Channel: Telegram

Steps

  1. Configure two agents (A and B) with the same Anthropic API key, same model, no fallbacks.
  2. During an Anthropic 529 spike, agent A gets overloaded errors.
  3. Anthropic recovers after ~1 minute.
  4. Before fix: Agent A is stuck in 5-25min cooldown. Agent B (which never hit 529) works fine. Same key, same model, different availability.
  5. After fix: Agent A can immediately retry after the 529 clears. No cooldown escalation.

Expected

Both agents recover as soon as Anthropic's 529 clears.

Actual (before fix)

Agent A stuck in exponential cooldown (observed: 16 consecutive overloaded errors over 28 minutes, cooldown escalating from 60s to 25min+).

Evidence

  • Gateway logs showing cascading cooldown: cooldownUntil values increasing from 1773837921711 → 1773838086435 → 1773838284549 (jumps of roughly 3 minutes between each)
  • tsgo type check: 0 errors ✅
  • oxlint: 0 warnings, 0 errors ✅
  • vitest run failover-observation + runs tests: 8/8 passed ✅

Human Verification (required)

  • Verified: Reproduced the cascading cooldown in gateway logs with real Anthropic 529 errors.
  • Verified: After the fix, resolveAuthProfileFailureReason("overloaded") returns null, preventing markAuthProfileFailure() from being called.
  • Edge cases: overloaded still triggers maybeBackoffBeforeOverloadFailover() for inter-profile rotation backoff (unchanged). Rate-limit errors still escalate cooldown correctly.
  • What I did not verify: Live test with intentional 529 triggering (cannot reproduce on demand).

Review Conversations

  • N/A (new PR)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • Revert this single commit.
  • Watch for: If overloaded errors are persistent (not transient), the profile will no longer be cooled down. But the existing maybeBackoffBeforeOverloadFailover() backoff still provides pacing between failover attempts.

Risks and Mitigations

  • Risk: Without cooldown, a truly overloaded provider gets hammered with retries.
    • Mitigation: The run loop already has MAX_RUN_LOOP_ITERATIONS to cap total retries. The related PR #49800 adds same-profile retry with exponential backoff (2s→4s→8s) which provides proper pacing.

Changed files

  • src/agents/pi-embedded-runner/run.ts (modified, +9/-11)
  • src/agents/pi-embedded-runner/run/resolve-profile-failure-reason.test.ts (added, +36/-0)
  • src/agents/pi-embedded-runner/run/resolve-profile-failure-reason.ts (added, +21/-0)


Summary

Add retry logic for API overload errors (429/529) before failover

Problem to solve

When Anthropic (or other providers) return overload errors (HTTP 529, overloaded_error response), OpenClaw immediately fails over to the next configured profile/provider. For users with a single API key, this means the request fails immediately with:

"The AI service is temporarily overloaded. Please try again in a moment."

Expected behavior: Retry the same provider with exponential backoff (like OpenCode does) before giving up or failing over.

Current Implementation

OpenClaw already has excellent infrastructure for this:

  1. Error detection (model-selection-DPAUAnEm.js):

    • ✅ Detects 529 status codes
    • ✅ Pattern-matches overloaded_error in responses
    • ✅ Classifies as "overloaded" reason
  2. Backoff logic (pi-embedded-D6PpOsxP.js):

    • maybeBackoffBeforeOverloadFailover() function exists
    • ✅ Uses computeBackoff(OVERLOAD_FAILOVER_BACKOFF_POLICY, attempts)
    • ✅ Respects abort signals
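
The detection step described in item 1 can be sketched roughly as follows. This is not the code from model-selection-DPAUAnEm.js: the ProviderErrorLike shape and the looksOverloaded name are assumptions, and real provider SDK errors may carry the status and body under different fields.

```typescript
// Hypothetical error shape used only for this sketch.
interface ProviderErrorLike {
  status?: number;
  message?: string;
  body?: { error?: { type?: string } };
}

/** True when the error looks like a provider overload (529 / overloaded_error). */
function looksOverloaded(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const e = error as ProviderErrorLike;
  // 1. HTTP status check.
  if (e.status === 529) return true;
  // 2. Structured error type in the response body.
  if (e.body?.error?.type === "overloaded_error") return true;
  // 3. Fallback: pattern-match the marker in the error message.
  return typeof e.message === "string" && /overloaded_error/i.test(e.message);
}
```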

The gap: This backoff only happens between profile switches, not for retrying the same profile.

Proposed solution

Add a retry loop before failover in the main run loop:

let overloadRetryAttempts = 0;
const MAX_OVERLOAD_RETRIES = 3; // configurable

while (true) {
  try {
    // Make API call
    const result = await callLLM(...);
    overloadRetryAttempts = 0; // Reset on success
    return result;
    
  } catch (error) {
    const reason = classifyFailoverReason(error);
    
    // NEW: Retry same profile before failover
    if (reason === "overloaded" && overloadRetryAttempts < MAX_OVERLOAD_RETRIES) {
      overloadRetryAttempts++;
      const delayMs = computeBackoff(OVERLOAD_BACKOFF_POLICY, overloadRetryAttempts);
      log.warn(`API overloaded, retry ${overloadRetryAttempts}/${MAX_OVERLOAD_RETRIES} after ${delayMs}ms`);
      await sleepWithAbort(delayMs, params.abortSignal);
      continue; // Retry same request/profile
    }
    
    // Reset counter for non-overload errors or after exhausting retries
    overloadRetryAttempts = 0;
    
    // Existing failover logic...
  }
}
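
The proposal above leaves callLLM(...) and the backoff policy undefined. A self-contained variant with those pieces injected might look like the sketch below. Everything here is illustrative: callWithOverloadRetry, RetryDeps and the reason strings are hypothetical, and instead of running failover inline the sketch simply rethrows so the caller's existing failover logic can take over.

```typescript
type FailoverReason = "overloaded" | "other";

// Dependencies are injected so the loop can run without a real provider.
interface RetryDeps<T> {
  call: () => Promise<T>;                       // the LLM request
  classify: (error: unknown) => FailoverReason; // error classification
  delayMs: (attempt: number) => number;         // backoff policy
  sleep: (ms: number) => Promise<void>;         // abort-aware in real code
  maxRetries?: number;                          // defaults to 3
}

async function callWithOverloadRetry<T>(deps: RetryDeps<T>): Promise<T> {
  const maxRetries = deps.maxRetries ?? 3;
  let attempts = 0; // local to this request, so success needs no reset
  while (true) {
    try {
      return await deps.call();
    } catch (error) {
      const retryable = deps.classify(error) === "overloaded";
      if (!retryable || attempts >= maxRetries) {
        throw error; // hand off to the caller's failover / error surfacing
      }
      attempts += 1;
      await deps.sleep(deps.delayMs(attempts));
    }
  }
}
```

Keeping the counter local to one request means it cannot leak into the separate failover counter, which is the separation the implementation notes ask for.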

Configuration

Add to agent config:

agents:
  main:
    api:
      overloadRetries: 3  # default 3, set to 0 to disable
      overloadBackoff:    # optional custom backoff policy
        initial: 2000
        max: 30000
        multiplier: 2
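
If the proposed keys were adopted, the loader would need to fill in defaults and reject invalid values. The sketch below shows one way to do that. It covers the proposal only: PR #49800 states that it shipped hardcoded constants with no new config surface, and the type and function names here are hypothetical.

```typescript
interface OverloadBackoffConfig {
  initial: number;    // first delay in ms
  max: number;        // upper bound in ms
  multiplier: number; // growth factor per attempt
}

interface OverloadRetryConfig {
  overloadRetries: number;
  overloadBackoff: OverloadBackoffConfig;
}

// Defaults copied from the YAML in the proposal.
const DEFAULT_OVERLOAD_CONFIG: OverloadRetryConfig = {
  overloadRetries: 3,
  overloadBackoff: { initial: 2000, max: 30000, multiplier: 2 },
};

/** Merge a partial `agents.<name>.api` section with the defaults. */
function resolveOverloadRetryConfig(
  raw: {
    overloadRetries?: number;
    overloadBackoff?: Partial<OverloadBackoffConfig>;
  } = {},
): OverloadRetryConfig {
  const retries = raw.overloadRetries ?? DEFAULT_OVERLOAD_CONFIG.overloadRetries;
  if (!Number.isInteger(retries) || retries < 0) {
    throw new RangeError("overloadRetries must be a non-negative integer");
  }
  return {
    overloadRetries: retries, // 0 disables same-profile retry
    overloadBackoff: { ...DEFAULT_OVERLOAD_CONFIG.overloadBackoff, ...raw.overloadBackoff },
  };
}
```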

Benefits

  1. Better UX: Users get Claude responses instead of "try again later"
  2. Fewer failed requests: Transient overloads (30-60s) resolve automatically
  3. Backwards compatible: Only retries when overloaded reason detected
  4. Respects existing patterns: Uses existing backoff + abort infrastructure
  5. Works with failover: If retries exhausted, existing failover logic kicks in

Implementation Notes

Location: pi-embedded-D6PpOsxP.js in the main run loop (near overloadFailoverAttempts)

Separate counters needed:

  • overloadRetryAttempts - retries on same profile
  • overloadFailoverAttempts - backoff before switching profiles (existing)

Reset conditions:

  • Reset overloadRetryAttempts on success or non-overload error
  • Keep separate from failover counter so both mechanisms work together

Edge cases:

  • Abort signal should interrupt retry wait (already handled by sleepWithAbort)
  • Log retries clearly so users know what's happening
  • Consider different retry limits per provider (Anthropic vs others)

Alternatives considered

Could implement at HTTP client level, but doing it in the run loop:

  • Gives better logging context (session/model info)
  • Respects existing error classification
  • Easier to make configurable per agent
  • Already has abort signal wiring

Impact

User Impact: High. Anthropic has been overloaded frequently in March 2026, causing many failed requests that could have succeeded with a retry. With the recent rise in overloaded errors, many people on non-enterprise plans have to deal with broken interactions if they rely on models that are not hosted locally. It would be nice to retry on these models before automatic failover.

Evidence/examples

https://status.claude.com/

Additional information

Related

This pattern matches how OpenCode handles overload errors, which users report works well during Anthropic capacity issues.

extent analysis

Fix Plan

To implement retry logic for API overload errors, follow these steps:

  1. Add retry loop: Introduce a retry loop in the main run loop (pi-embedded-D6PpOsxP.js) to handle overload errors before failing over to the next provider.
  2. Configure retry attempts: Add configuration options for overloadRetries and overloadBackoff in the agent config (agents.main.api).
  3. Implement retry logic: Use the following code snippet to implement the retry logic:

let overloadRetryAttempts = 0;
const MAX_OVERLOAD_RETRIES = config.overloadRetries; // configurable

while (true) {
  try {
    // Make API call
    const result = await callLLM(...);
    overloadRetryAttempts = 0; // Reset on success
    return result;

  } catch (error) {
    const reason = classifyFailoverReason(error);

    // Retry same profile before failover
    if (reason === "overloaded" && overloadRetryAttempts < MAX_OVERLOAD_RETRIES) {
      overloadRetryAttempts++;
      const delayMs = computeBackoff(config.overloadBackoff, overloadRetryAttempts);
      log.warn(`API overloaded, retry ${overloadRetryAttempts}/${MAX_OVERLOAD_RETRIES} after ${delayMs}ms`);
      await sleepWithAbort(delayMs, params.abortSignal);
      continue; // Retry same request/profile
    }

    // Reset counter for non-overload errors or after exhausting retries
    overloadRetryAttempts = 0;

    // Existing failover logic...
  }
}

  4. Update configuration: Add the following configuration options to the agent config:

agents:
  main:
    api:
      overloadRetries: 3  # default 3, set to 0 to disable
      overloadBackoff:    # optional custom backoff policy
        initial: 2000
        max: 30000
        multiplier: 2

Verification

To verify that the fix worked:

  1. Test with overload errors: Simulate overload errors (e.g., using a test API that returns 529 status codes) and verify that the retry logic kicks in.
  2. Check logs: Verify that the retry attempts are logged correctly, including the number of attempts and the delay between retries.
  3. Verify success: After the retry attempts, verify that the API call is successful and the response is returned as expected.

Extra Tips

  • Consider implementing different retry limits per provider (e.g., Anthropic vs others).
  • Make sure to log retries clearly so users know what's happening.
  • The sleepWithAbort function should interrupt the retry wait if an abort signal is received.
