openclaw - ✅ (Solved) Fix Feature: configurable LLM retry with backoff on transient errors (overloaded/529) [2 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#49376 (fetched 2026-04-08 00:55:51)
Comments: 1 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): referenced ×3, cross-referenced ×2, commented ×1

When Anthropic returns a transient error (HTTP 529 overloaded, 503, or 500), the gateway currently classifies it and surfaces the error to the user. There's no configurable retry-with-backoff for the main LLM request path.

Root Cause

During Anthropic outages (529 overloaded), the agent becomes unresponsive. Users have to manually retry or wait. A built-in retry with backoff would let the gateway transparently recover when the API comes back, without the user needing to re-send their message.

Currently we work around this with an external watchdog script that polls the API and sends notifications, but native gateway retry would be cleaner and faster to recover.

Fix Action

Fixed

PR fix notes

PR #49800: fix: retry same auth profile on transient overloaded errors before failover

Description (problem / solution / changelog)

Summary

  • Problem: When the LLM API returns a transient overloaded error (HTTP 529 / overloaded_error), OpenClaw tries to rotate to the next auth profile. If only one profile is configured for the provider, the error is surfaced immediately with "The AI service is temporarily overloaded. Please try again in a moment." — no retry.
  • Why it matters: During Anthropic capacity spikes (frequent in March 2026), agents with a single API key become unresponsive. Users must manually resend messages. The cron subsystem already has configurable retry (cron.retry), but the main LLM request path does not.
  • What changed: Added a same-profile retry loop (up to 3 attempts, exponential backoff 2s→4s→8s, jitter 25%, capped at 30s) for overloaded errors when no other profile is available. Applies to both prompt-side and assistant-side error paths.
  • What did NOT change (scope boundary): Existing profile rotation, fallback model logic, and backoff-before-failover behavior are untouched. Rate-limit, auth, billing, and format errors are NOT retried. The retry counter resets on success. No new config surface (hardcoded constants — a follow-up could expose these via agents.defaults.llmRetry).
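The backoff schedule described above (2s→4s→8s, 25% jitter, 30s cap, 3 attempts) can be sketched as a small pure function. This is an illustration only, not the code from the PR: the constant and function names are hypothetical, and symmetric jitter is an assumption because the description only says "jitter 25%".

```typescript
// Values taken from the PR description; all names here are hypothetical.
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;
const JITTER_RATIO = 0.25;
const MAX_SAME_PROFILE_RETRIES = 3;

// Returns the delay before retry number `attempt` (1-based).
// `random` is injectable so the schedule can be checked deterministically.
function overloadedRetryDelayMs(
  attempt: number,
  random: () => number = Math.random
): number {
  const exponential = BASE_DELAY_MS * 2 ** (attempt - 1); // 2s, 4s, 8s, ...
  const capped = Math.min(exponential, MAX_DELAY_MS);
  // Assumption: jitter is symmetric, in the range [-25%, +25%] of the delay.
  const jitter = capped * JITTER_RATIO * (random() * 2 - 1);
  return Math.round(Math.min(capped + jitter, MAX_DELAY_MS));
}

// With random() fixed at 0.5 the jitter term is zero.
const schedule = Array.from({ length: MAX_SAME_PROFILE_RETRIES }, (_, i) =>
  overloadedRetryDelayMs(i + 1, () => 0.5)
);
console.log(schedule); // [ 2000, 4000, 8000 ]
console.log(schedule.reduce((a, b) => a + b, 0)); // 14000
```

The jitter-free total of 14s is where the "~15s" wall-clock figure quoted later in the PR comes from; with positive jitter the total can be slightly higher.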

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #49376
  • Closes #48913
  • Related #49696
  • Related #24321

User-visible / Behavior Changes

  • When a single-profile provider returns overloaded (529), the agent now silently retries up to 3 times with exponential backoff (2s, 4s, 8s) before surfacing the error.
  • Log messages at warn level: overloaded — retrying same profile for <provider>/<model>: attempt=N/3 delayMs=X.
  • Failover observation logs a new decision value: retry_same_profile.
  • If all 3 retries are exhausted, behavior is identical to before (fallback model or surface error).

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (same LLM API call, just retried)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Darwin arm64)
  • Runtime: Node 24
  • Model/provider: Anthropic Claude Opus 4.6 (single API key profile)
  • Channel: Telegram

Steps

  1. Configure an agent with a single Anthropic auth profile (no fallbacks).
  2. Send a message when Anthropic API is overloaded (returns 529).
  3. Before fix: Error surfaced immediately.
  4. After fix: Agent retries up to 3 times with backoff, then surfaces error only if still overloaded.

Expected

Agent retries transparently and recovers when the API comes back within ~15s.

Actual

Agent retries (visible in warn logs) and succeeds on retry, or surfaces error after 3 failed attempts.

Evidence

  • tsgo type check: 0 errors ✅
  • oxlint on changed files: 0 warnings, 0 errors ✅
  • vitest run on failover-observation.test.ts: 2/2 passed ✅
  • vitest run on runs.test.ts: 6/6 passed ✅

Human Verification (required)

  • Verified: TypeScript compiles, lint passes, existing tests pass.
  • Verified: Code review of both prompt-side and assistant-side paths — retry only triggers for overloaded reason, counter resets on success and on profile rotation.
  • Edge cases checked: abort signal interrupts retry sleep; counter does not interfere with existing overloadFailoverAttempts; markAuthProfileGood is called before retry to clear cooldown state.
  • What I did not verify: Live 529 error from Anthropic (cannot reproduce on demand).
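The "abort signal interrupts retry sleep" edge case can be illustrated with an abortable sleep helper. This is a minimal sketch using the standard AbortSignal API; `sleepWithAbort` is a hypothetical name, not a function confirmed to exist in the repository.

```typescript
// Resolves after `ms`, or rejects early if `signal` aborts first.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stop the pending backoff timer
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Usage: cancelling during an 8s backoff rejects immediately.
const controller = new AbortController();
sleepWithAbort(8_000, controller.signal).catch((err: Error) =>
  console.log(err.message) // logs "aborted"
);
controller.abort();
```

Clearing the timer on abort matters here: without it, a cancelled run would still keep the process waiting for the full backoff delay.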

Review Conversations

  • N/A (new PR)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • Revert this single commit to restore previous behavior (immediate surface on overloaded).
  • The retry is bounded (3 attempts, ~15s max wall-clock) and respects abort signals, so it cannot cause infinite loops.
  • Watch for: warn-level log spam if overload is persistent (3 log lines per failed request vs 0 before).

Risks and Mitigations

  • Risk: Retry adds up to ~15s latency before surfacing a persistent overload error.
    • Mitigation: 3 retries with 2s/4s/8s backoff is a reasonable tradeoff. Users previously had to manually retry anyway. A future config option could let users tune this.
  • Risk: markAuthProfileGood call before retry could mask a genuinely bad profile.
    • Mitigation: Only called for overloaded reason (transient), not for auth/billing/format errors. The profile was working before the overload.

Changed files

  • src/agents/pi-embedded-runner/run.overflow-compaction.harness.ts (modified, +21/-5)
  • src/agents/pi-embedded-runner/run.overloaded-retry.test.ts (added, +96/-0)
  • src/agents/pi-embedded-runner/run.ts (modified, +91/-0)
  • src/agents/pi-embedded-runner/run/failover-observation.ts (modified, +1/-1)

PR #49807: fix: exclude overloaded errors from auth profile cooldown escalation

Description (problem / solution / changelog)

Summary

  • Problem: Transient overloaded errors (HTTP 529) are persisted to auth profile failure stats, triggering exponential cooldown escalation (60s × 5^(n-1), up to 1 hour). This causes a cascading failure where the profile becomes unusable long after the provider recovers.
  • Why it matters: Two agents sharing the same Anthropic API key and model — one works fine, the other is stuck in cooldown for 25+ minutes after a brief 529 spike. The agent that first hit the 529 enters a death spiral: retry → 529 or cooldown → error count ↑ → longer cooldown → repeat.
  • What changed: resolveAuthProfileFailureReason() now returns null for "overloaded" (same treatment as "timeout"), so overloaded errors do not persist to usage stats or trigger cooldown escalation. The profile stays healthy for immediate retry.
  • What did NOT change: Rate-limit, billing, auth, and format errors still trigger cooldown as before. The maybeBackoffBeforeOverloadFailover() backoff between profile rotations is unchanged.
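To make the escalation formula concrete, here is a worked sketch of 60s × 5^(n-1) capped at 1 hour. The function and constant names are hypothetical; only the formula and the cap come from the description above.

```typescript
const BASE_COOLDOWN_MS = 60_000; // 60s
const MAX_COOLDOWN_MS = 60 * 60_000; // 1 hour

// Cooldown after the n-th consecutive persisted failure (n >= 1).
function cooldownMs(failureCount: number): number {
  const n = Math.max(1, failureCount);
  return Math.min(BASE_COOLDOWN_MS * 5 ** (n - 1), MAX_COOLDOWN_MS);
}

for (const n of [1, 2, 3, 4]) {
  console.log(`${n}: ${cooldownMs(n) / 60_000} min`);
}
// 1: 1 min
// 2: 5 min
// 3: 25 min
// 4: 60 min
```

Three persisted overloaded errors are enough to reach a 25-minute cooldown, which matches the 5-25min window reported in the repro steps.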

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #49696
  • Related #49376
  • Related #48913
  • Related #24321
  • Related #49800

User-visible / Behavior Changes

  • overloaded (529) errors no longer cause auth profile cooldown escalation.
  • After a 529 spike, the agent can immediately retry instead of waiting out an exponentially growing cooldown window.
  • Two agents sharing the same API key will no longer have divergent availability after one hits a transient 529.
  • Log change: auth_profile_failure_state_updated events will no longer appear for reason: "overloaded".

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS Darwin arm64
  • Runtime: Node 24
  • Model/provider: Anthropic Claude Opus 4.6 (single API key, no fallbacks)
  • Channel: Telegram

Steps

  1. Configure two agents (A and B) with the same Anthropic API key, same model, no fallbacks.
  2. During an Anthropic 529 spike, agent A gets overloaded errors.
  3. Anthropic recovers after ~1 minute.
  4. Before fix: Agent A is stuck in 5-25min cooldown. Agent B (which never hit 529) works fine. Same key, same model, different availability.
  5. After fix: Agent A can immediately retry after the 529 clears. No cooldown escalation.

Expected

Both agents recover as soon as Anthropic's 529 clears.

Actual (before fix)

Agent A stuck in exponential cooldown (observed: 16 consecutive overloaded errors over 28 minutes, cooldown escalating from 60s to 25min+).

Evidence

  • Gateway logs showing cascading cooldown: cooldownUntil values increasing from 1773837921711 → 1773838086435 → 1773838284549 (3+ minute jumps between each)
  • tsgo type check: 0 errors ✅
  • oxlint: 0 warnings, 0 errors ✅
  • vitest run failover-observation + runs tests: 8/8 passed ✅

Human Verification (required)

  • Verified: Reproduced the cascading cooldown in gateway logs with real Anthropic 529 errors.
  • Verified: After the fix, resolveAuthProfileFailureReason("overloaded") returns null, preventing markAuthProfileFailure() from being called.
  • Edge cases: overloaded still triggers maybeBackoffBeforeOverloadFailover() for inter-profile rotation backoff (unchanged). Rate-limit errors still escalate cooldown correctly.
  • What I did not verify: Live test with intentional 529 triggering (cannot reproduce on demand).
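The behavior verified above can be sketched as follows. This is a hedged illustration of the idea, not the actual contents of resolve-profile-failure-reason.ts: the reason union, the `resolveFailureReasonSketch` and `recordFailure` names, and the profile id are all assumptions.

```typescript
// Hypothetical reason union; the real classification may differ.
type FailoverReason =
  | "overloaded"
  | "timeout"
  | "rate_limit"
  | "auth"
  | "billing"
  | "format";

// Returns the reason to persist to profile failure stats,
// or null when the failure is transient and must not be recorded.
function resolveFailureReasonSketch(
  reason: FailoverReason | null
): FailoverReason | null {
  if (reason === null) return null;
  if (reason === "timeout" || reason === "overloaded") return null;
  return reason;
}

// Stand-in for the caller: only persisted reasons bump the failure count
// that drives cooldown escalation.
function recordFailure(
  stats: Map<string, number>,
  profileId: string,
  reason: FailoverReason | null
): void {
  const persisted = resolveFailureReasonSketch(reason);
  if (persisted === null) return; // transient: leave stats untouched
  stats.set(profileId, (stats.get(profileId) ?? 0) + 1);
}

const stats = new Map<string, number>();
recordFailure(stats, "anthropic:default", "overloaded"); // ignored
recordFailure(stats, "anthropic:default", "rate_limit"); // counted
console.log(stats.get("anthropic:default")); // 1
```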

Review Conversations

  • N/A (new PR)

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • Revert this single commit.
  • Watch for: If overloaded errors are persistent (not transient), the profile will no longer be cooled down. But the existing maybeBackoffBeforeOverloadFailover() backoff still provides pacing between failover attempts.

Risks and Mitigations

  • Risk: Without cooldown, a truly overloaded provider gets hammered with retries.
    • Mitigation: The run loop already has MAX_RUN_LOOP_ITERATIONS to cap total retries. The related PR #49800 adds same-profile retry with exponential backoff (2s→4s→8s) which provides proper pacing.

Changed files

  • src/agents/pi-embedded-runner/run.ts (modified, +9/-11)
  • src/agents/pi-embedded-runner/run/resolve-profile-failure-reason.test.ts (added, +36/-0)
  • src/agents/pi-embedded-runner/run/resolve-profile-failure-reason.ts (added, +21/-0)

Summary

When Anthropic returns a transient error (HTTP 529 overloaded, 503, or 500), the gateway currently classifies it and surfaces the error to the user. There's no configurable retry-with-backoff for the main LLM request path.

Current behavior

  • Cron jobs have cron.retry with configurable maxAttempts, backoffMs, and retryOn — great design.
  • Channel message delivery has retry config (attempts, minDelayMs, maxDelayMs, jitter).
  • The generic retryAsync infra exists in src/infra/retry.ts.
  • Main LLM conversation requests have no exposed retry config for transient provider errors.

Proposed behavior

Add an agent-level (or global) retry config for LLM requests on transient errors:

{
  "agents": {
    "defaults": {
      "llmRetry": {
        "enabled": true,
        "maxAttempts": 5,
        "backoffMs": [60000, 60000, 120000, 180000, 300000],
        "retryOn": ["overloaded", "server_error", "network"],
        "maxTotalMs": 900000
      }
    }
  }
}

Key properties:

  • backoffMs: Array of delays between retries (supports Fibonacci-style or custom progression)
  • retryOn: Which error classes to retry (reuse existing classification: overloaded, server_error, network, timeout)
  • maxTotalMs: Total wall-clock cap to prevent infinite waits
  • Should NOT retry on: rate_limit (quota, not transient), auth, billing, format errors
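The key properties above can be expressed as a typed config plus a retry-eligibility check. This is a sketch under stated assumptions: the `LlmErrorClass` union, `NEVER_RETRY`, and `shouldRetry` are hypothetical names, and the proposed `llmRetry` option is not an existing config surface.

```typescript
type LlmErrorClass =
  | "overloaded" | "server_error" | "network" | "timeout"
  | "rate_limit" | "auth" | "billing" | "format";

interface LlmRetryConfig {
  enabled: boolean;
  maxAttempts: number;
  backoffMs: number[];
  retryOn: LlmErrorClass[];
  maxTotalMs: number;
}

// Classes the proposal says must never be retried, even if misconfigured.
const NEVER_RETRY = new Set<LlmErrorClass>([
  "rate_limit", "auth", "billing", "format",
]);

// `attempt` is the 1-based number of the attempt that just failed.
function shouldRetry(
  config: LlmRetryConfig,
  errorClass: LlmErrorClass,
  attempt: number,
  elapsedMs: number
): boolean {
  if (!config.enabled) return false;
  if (NEVER_RETRY.has(errorClass)) return false;
  if (!config.retryOn.includes(errorClass)) return false;
  if (attempt >= config.maxAttempts) return false;
  // Reuse the last delay if there are fewer delays than attempts.
  const index = Math.min(attempt - 1, config.backoffMs.length - 1);
  const delay = config.backoffMs[index] ?? 0;
  return elapsedMs + delay <= config.maxTotalMs; // respect the wall-clock cap
}

// The values from the proposed config.
const proposed: LlmRetryConfig = {
  enabled: true,
  maxAttempts: 5,
  backoffMs: [60000, 60000, 120000, 180000, 300000],
  retryOn: ["overloaded", "server_error", "network"],
  maxTotalMs: 900000,
};
console.log(shouldRetry(proposed, "overloaded", 1, 0)); // true
console.log(shouldRetry(proposed, "rate_limit", 1, 0)); // false
```

Keeping the never-retry set separate from `retryOn` is a design choice of this sketch: it prevents a config that lists `rate_limit` from retrying quota errors.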

Why this matters

During Anthropic outages (529 overloaded), the agent becomes unresponsive. Users have to manually retry or wait. A built-in retry with backoff would let the gateway transparently recover when the API comes back, without the user needing to re-send their message.

Currently we work around this with an external watchdog script that polls the API and sends notifications, but native gateway retry would be cleaner and faster to recover.

Prior art

The cron.retry schema is a perfect model — same shape could be reused for LLM requests.

extent analysis

Fix Plan

To implement a retry mechanism for LLM requests, follow these steps:

  • Update the src/infra/retry.ts file to include a new function for LLM retries, utilizing the existing retryAsync infrastructure.
  • Add a new configuration option for LLM retries in the agent-level or global configuration.
  • Modify the LLM request handler to use the new retry function.

Example Code

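// src/infra/retry.ts
// Standalone sketch; wiring into the existing retryAsync helper is left out
// because its signature is not shown here.
export interface LlmRetryConfig {
  maxAttempts: number;
  backoffMs: number[];
  retryOn: string[];
  maxTotalMs: number;
}

// Reads the error class from a thrown value. The `class` field is an
// assumption of this example, not a confirmed gateway error shape.
function errorClassOf(error: unknown): string | undefined {
  if (typeof error === 'object' && error !== null && 'class' in error) {
    const value = (error as { class?: unknown }).class;
    return typeof value === 'string' ? value : undefined;
  }
  return undefined;
}

export async function retryLlmRequest<T>(
  request: () => Promise<T>,
  retryConfig: LlmRetryConfig
): Promise<T> {
  const { maxAttempts, backoffMs, retryOn, maxTotalMs } = retryConfig;
  const startTime = Date.now();

  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      const errorClass = errorClassOf(error);
      const retryable = errorClass !== undefined && retryOn.includes(errorClass);
      // Non-retryable class or attempts exhausted: surface the original error
      // instead of sleeping again and throwing a generic one.
      if (!retryable || attempt >= maxAttempts) {
        throw error;
      }

      // Reuse the last delay when there are more attempts than configured delays.
      const delay = backoffMs[Math.min(attempt - 1, backoffMs.length - 1)] ?? 0;
      // Give up if sleeping would exceed the total wall-clock budget.
      if (Date.now() - startTime + delay > maxTotalMs) {
        throw error;
      }

      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// src/llm-handler.ts
import { retryLlmRequest } from './infra/retry';

const llmRetryConfig = {
  maxAttempts: 5,
  backoffMs: [60000, 60000, 120000, 180000, 300000],
  retryOn: ['overloaded', 'server_error', 'network'],
  maxTotalMs: 900000,
};

// `sendRequest` wraps the original LLM request logic.
export async function handleLlmRequest<T>(
  sendRequest: () => Promise<T>
): Promise<T> {
  return retryLlmRequest(sendRequest, llmRetryConfig);
}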

Verification

To verify the fix, test the LLM request handler with different error scenarios, including:

  • Transient errors (e.g., 529 overloaded, 503, 500)
  • Non-transient errors (e.g., rate limit, auth, billing, format errors)
  • Successful requests

Verify that the retry mechanism works as expected, with the correct number of attempts and backoff delays.
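One way to exercise these scenarios without real delays is to inject the sleep function and use a fake request that fails a fixed number of times. The following is a self-contained sketch: `ClassifiedError`, `retryWith`, and `failingThenOk` are hypothetical stand-ins written for this example, not the gateway's retry code or test helpers.

```typescript
// Error carrying a classification string (hypothetical shape).
class ClassifiedError extends Error {
  readonly errorClass: string;
  constructor(errorClass: string) {
    super(errorClass);
    this.errorClass = errorClass;
  }
}

interface RetryOpts { maxAttempts: number; backoffMs: number[]; retryOn: string[] }

// Minimal stand-in for the retry wrapper, with an injectable sleep.
async function retryWith<T>(
  request: () => Promise<T>,
  opts: RetryOpts,
  sleep: (ms: number) => Promise<void>
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      const retryable =
        error instanceof ClassifiedError && opts.retryOn.includes(error.errorClass);
      if (!retryable || attempt >= opts.maxAttempts) throw error;
      const index = Math.min(attempt - 1, opts.backoffMs.length - 1);
      await sleep(opts.backoffMs[index] ?? 0);
    }
  }
}

// Fake request: fails with the given classes in order, then succeeds.
function failingThenOk(failures: string[]) {
  let calls = 0;
  return {
    calls: () => calls,
    request: async (): Promise<string> => {
      const failure = failures[calls++];
      if (failure !== undefined) throw new ClassifiedError(failure);
      return "ok";
    },
  };
}

async function runScenarios(): Promise<string[]> {
  const opts: RetryOpts = {
    maxAttempts: 3,
    backoffMs: [10, 20],
    retryOn: ["overloaded", "server_error", "network"],
  };
  const delays: number[] = [];
  const sleep = async (ms: number): Promise<void> => { delays.push(ms); };
  const results: string[] = [];

  // Transient: two overloaded failures, then success on the third call.
  const transient = failingThenOk(["overloaded", "overloaded"]);
  const value = await retryWith(transient.request, opts, sleep);
  results.push(`${value} after ${transient.calls()} calls`);

  // Non-transient: an auth error must surface on the first call.
  const auth = failingThenOk(["auth"]);
  try {
    await retryWith(auth.request, opts, sleep);
  } catch {
    results.push(`auth surfaced after ${auth.calls()} call`);
  }
  results.push(`delays ${delays.join(",")}`);
  return results;
}

runScenarios().then((lines) => lines.forEach((line) => console.log(line)));
// ok after 3 calls
// auth surfaced after 1 call
// delays 10,20
```

Recording the requested delays instead of sleeping keeps the check fast and lets the backoff sequence be asserted directly.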

Extra Tips

  • Make sure to handle errors properly and provide informative error messages to users.
  • Consider adding logging and monitoring to track retry attempts and errors.
  • Review and adjust the retry configuration options to suit your specific use case.
