openclaw - ✅(Solved) Fix Feature: configurable LLM retry with backoff on transient errors (overloaded/529) [2 pull requests, 1 comment, 2 participants]

zenfish · 2026-03-18T02:25:14Z

When Anthropic returns a transient error (HTTP 529 overloaded, 503, or 500), the gateway currently classifies it and surfaces the error to the user. There's no configurable retry-with-backoff for the main LLM request path.

# PR #49800: fix: retry same auth profile on transient overloaded errors before failover

- Repository: openclaw/openclaw
- Author: clovericbot
- State: open | merged: False
- Link: https://github.com/openclaw/openclaw/pull/49800

## Description (problem / solution / changelog)

## Summary

- **Problem:** When the LLM API returns a transient `overloaded` error (HTTP 529 / `overloaded_error`), OpenClaw tries to rotate to the next auth profile. If only one profile is configured for the provider, the error is surfaced immediately with "The AI service is temporarily overloaded. Please try again in a moment." and no retry.
- **Why it matters:** During Anthropic capacity spikes (frequent in March 2026), agents with a single API key become unresponsive. Users must manually resend messages. The cron subsystem already has configurable retry (`cron.retry`), but the main LLM request path does not.
- **What changed:** Added a same-profile retry loop (up to 3 attempts, exponential backoff 2s→4s→8s, jitter 25%, capped at 30s) for `overloaded` errors when no other profile is available. Applies to both prompt-side and assistant-side error paths.
- **What did NOT change (scope boundary):** Existing profile rotation, fallback model logic, and backoff-before-failover behavior are untouched. Rate-limit, auth, billing, and format errors are NOT retried. The retry counter resets on success. No new config surface (hardcoded constants; a follow-up could expose these via `agents.defaults.llmRetry`).

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #49376
- Closes #48913
- Related #49696
- Related #24321

## User-visible / Behavior Changes

- When a single-profile provider returns `overloaded` (529), the agent now silently retries up to 3 times with exponential backoff (2s, 4s, 8s) before surfacing the error.
- Log messages at `warn` level: `overloaded — retrying same profile for / : attempt=N/3 delayMs=X`.
- Failover observation logs a new decision value: `retry_same_profile`.
- If all 3 retries are exhausted, behavior is identical to before (fallback model or surface error).

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No` (same LLM API call, just retried)
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: macOS (Darwin arm64)
- Runtime: Node 24
- Model/provider: Anthropic Claude Opus 4.6 (single API key profile)
- Channel: Telegram

### Steps

1. Configure an agent with a single Anthropic auth profile (no fallbacks).
2. Send a message when Anthropic API is overloaded (returns 529).
3. **Before fix:** Error surfaced immediately.
4. **After fix:** Agent retries up to 3 times with backoff, then surfaces error only if still overloaded.

### Expected

Agent retries transparently and recovers when the API comes back within ~15s.

### Actual

Agent retries (visible in warn logs) and succeeds on retry, or surfaces error after 3 failed attempts.

## Evidence

- `tsgo` type check: 0 errors ✅
- `oxlint` on changed files: 0 warnings, 0 errors ✅
- `vitest run` on `failover-observation.test.ts`: 2/2 passed ✅
- `vitest run` on `runs.test.ts`: 6/6 passed ✅

## Human Verification (required)

- Verified: TypeScript compiles, lint passes, existing tests pass.
- Verified: Code review of both prompt-side and assistant-side paths: retry only triggers for `overloaded` reason, counter resets on success and on profile rotation.
- Edge cases checked: abort signal interrupts retry sleep; counter does not interfere with existing `overloadFailoverAttempts`; `markAuthProfileGood` is called before retry to clear cooldown state.
- What I did **not** verify: Live 529 error from Anthropic (cannot reproduce on demand).

## Review Conversations

- [x] N/A (new PR)

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- Revert this single commit to restore previous behavior (immediate surface on overloaded).
- The retry is bounded (3 attempts, ~15s max wall-clock) and respects abort signals, so it cannot cause infinite loops.
- Watch for: warn-level log spam if overload is persistent (3 log lines per failed request vs 0 befo
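
The backoff schedule quoted in the PR summary (2s, 4s, 8s, 25% jitter, 30s cap) can be sketched as below. This is only an illustration under stated assumptions, not the PR's code: the constant names, the function name `overloadedRetryDelayMs`, and the choice of symmetric jitter are all hypothetical, since the PR text does not show how jitter is applied.

```typescript
// Hypothetical constants mirroring the numbers quoted in the PR description.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 2_000;
const MAX_DELAY_MS = 30_000;
const JITTER_RATIO = 0.25;

/**
 * Delay before retry number `attempt` (1-based).
 * `random` is injectable so the schedule can be checked deterministically.
 */
function overloadedRetryDelayMs(
  attempt: number,
  random: () => number = Math.random
): number {
  const exponential = BASE_DELAY_MS * 2 ** (attempt - 1); // 2s, 4s, 8s, ...
  // Assumption: symmetric jitter of up to +/-25% around the nominal delay.
  const jitter = exponential * JITTER_RATIO * (random() * 2 - 1);
  return Math.min(MAX_DELAY_MS, Math.round(exponential + jitter));
}

// Nominal schedule with the jitter term neutralised (random() === 0.5).
const nominalSchedule = Array.from({ length: MAX_RETRIES }, (_, i) =>
  overloadedRetryDelayMs(i + 1, () => 0.5)
);
console.log(nominalSchedule); // 2000, 4000, 8000
```

The nominal delays sum to 14s, which is consistent with the "~15s max wall-clock" figure in the PR; with the assumed jitter the worst case would be somewhat higher.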



GitHub stats

openclaw/openclaw#49376 • Fetched 2026-04-08 00:55:51

Author: zenfish

Participants: tudorrusskii, zenfish

Timeline (top): referenced ×3, cross-referenced ×2, commented ×1

When Anthropic returns a transient error (HTTP 529 overloaded, 503, or 500), the gateway currently classifies it and surfaces the error to the user. There's no configurable retry-with-backoff for the main LLM request path.
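
To make the transient versus non-transient split concrete, here is a small sketch of a status-code classifier. The class names come from the lists in this issue (`overloaded`, `server_error`, `rate_limit`, `auth`); the function names and the exact status mapping are assumptions for illustration and are not taken from the gateway's actual classifier.

```typescript
// Hypothetical class names, borrowed from the retryOn / do-not-retry lists in this issue.
type ErrorClass = "overloaded" | "server_error" | "rate_limit" | "auth" | "other";

// Assumed mapping from HTTP status to error class.
function classifyStatus(status: number): ErrorClass {
  if (status === 529) return "overloaded";
  if (status === 500 || status === 503) return "server_error";
  if (status === 429) return "rate_limit";
  if (status === 401 || status === 403) return "auth";
  return "other";
}

// Only these classes are treated as transient in this sketch.
const TRANSIENT = new Set<ErrorClass>(["overloaded", "server_error"]);

function isTransient(status: number): boolean {
  return TRANSIENT.has(classifyStatus(status));
}

console.log([529, 503, 500, 429, 401].map(isTransient));
```

Under these assumptions 529, 503, and 500 are candidates for retry, while 429 and 401 are surfaced immediately.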


Root Cause

During Anthropic outages (529 overloaded), the agent becomes unresponsive. Users have to manually retry or wait. A built-in retry with backoff would let the gateway transparently recover when the API comes back, without the user needing to re-send their message.




Summary

Current behavior

Cron jobs have cron.retry with configurable maxAttempts, backoffMs, and retryOn — great design.
Channel message delivery has retry config (attempts, minDelayMs, maxDelayMs, jitter).
The generic retryAsync infra exists in src/infra/retry.ts.
Main LLM conversation requests have no exposed retry config for transient provider errors.

Proposed behavior

Add an agent-level (or global) retry config for LLM requests on transient errors:

{
  "agents": {
    "defaults": {
      "llmRetry": {
        "enabled": true,
        "maxAttempts": 5,
        "backoffMs": [60000, 60000, 120000, 180000, 300000],
        "retryOn": ["overloaded", "server_error", "network"],
        "maxTotalMs": 900000
      }
    }
  }
}

Key properties:

backoffMs: Array of delays between retries (supports Fibonacci-style or custom progression)
retryOn: Which error classes to retry (reuse existing classification: overloaded, server_error, network, timeout)
maxTotalMs: Total wall-clock cap to prevent infinite waits
Should NOT retry on: rate_limit (quota, not transient), auth, billing, format errors
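
The key properties above could be captured in a type plus a small validation step. The following is a sketch only: `LlmRetryConfig`, `validateLlmRetryConfig`, `shouldRetry`, and the `NEVER_RETRY` set are hypothetical names, and the rule that `retryOn` must not contain the excluded classes is one possible reading of "Should NOT retry on".

```typescript
// Hypothetical shape for agents.defaults.llmRetry, following the proposal.
interface LlmRetryConfig {
  enabled: boolean;
  maxAttempts: number;
  backoffMs: number[];
  retryOn: string[];
  maxTotalMs: number;
}

// Classes the proposal says must never be retried.
const NEVER_RETRY = new Set(["rate_limit", "auth", "billing", "format"]);

// Returns a list of human-readable problems; an empty list means the config is usable.
function validateLlmRetryConfig(c: LlmRetryConfig): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(c.maxAttempts) || c.maxAttempts < 1) {
    problems.push("maxAttempts must be a positive integer");
  }
  if (c.backoffMs.length === 0 || c.backoffMs.some((ms) => !(ms >= 0))) {
    problems.push("backoffMs must be a non-empty list of non-negative delays");
  }
  if (!(c.maxTotalMs > 0)) {
    problems.push("maxTotalMs must be positive");
  }
  for (const cls of c.retryOn) {
    if (NEVER_RETRY.has(cls)) problems.push(`retryOn must not include ${cls}`);
  }
  return problems;
}

// Decides whether an already-classified error is eligible for retry.
function shouldRetry(errorClass: string, c: LlmRetryConfig): boolean {
  return c.enabled && !NEVER_RETRY.has(errorClass) && c.retryOn.includes(errorClass);
}

const proposed: LlmRetryConfig = {
  enabled: true,
  maxAttempts: 5,
  backoffMs: [60000, 60000, 120000, 180000, 300000],
  retryOn: ["overloaded", "server_error", "network"],
  maxTotalMs: 900000,
};
console.log(validateLlmRetryConfig(proposed), shouldRetry("overloaded", proposed));
```

Note that with the proposed values `timeout` is a recognised class but is not listed in `retryOn`, so this sketch would not retry it.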

Why this matters

Currently we work around this with an external watchdog script that polls the API and sends notifications, but native gateway retry would be cleaner and faster to recover.

Prior art

The cron.retry schema is a perfect model — same shape could be reused for LLM requests.

extent analysis

Fix Plan

To implement a retry mechanism for LLM requests, follow these steps:

Update the src/infra/retry.ts file to include a new function for LLM retries, utilizing the existing retryAsync infrastructure.
Add a new configuration option for LLM retries in the agent-level or global configuration.
Modify the LLM request handler to use the new retry function.

Example Code

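// src/infra/retry.ts
export interface LlmRetryConfig {
  maxAttempts: number;
  backoffMs: number[];
  retryOn: string[];
  maxTotalMs: number;
}

// Assumes errors carry a `class` field set by the gateway's existing
// error classification; the real field name may differ.
function errorClass(error: unknown): string | undefined {
  return typeof error === 'object' && error !== null && 'class' in error
    ? String((error as { class: unknown }).class)
    : undefined;
}

export async function retryLlmRequest<T>(
  request: () => Promise<T>,
  { maxAttempts, backoffMs, retryOn, maxTotalMs }: LlmRetryConfig
): Promise<T> {
  const startTime = Date.now();

  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      const cls = errorClass(error);
      // Non-retryable class, or no attempts left: surface the original error
      // instead of sleeping again and throwing a generic one.
      if (cls === undefined || !retryOn.includes(cls) || attempt >= maxAttempts) {
        throw error;
      }

      // Reuse the last delay when there are more attempts than entries.
      const delay = backoffMs[Math.min(attempt - 1, backoffMs.length - 1)] ?? 0;
      // Give up if waiting would exceed the total wall-clock budget.
      if (Date.now() - startTime + delay > maxTotalMs) {
        throw error;
      }

      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// src/llm-handler.ts
import { retryLlmRequest, type LlmRetryConfig } from './infra/retry';

// Hardcoded for illustration; in practice this would be read from
// agents.defaults.llmRetry.
const llmRetryConfig: LlmRetryConfig = {
  maxAttempts: 5,
  backoffMs: [60000, 60000, 120000, 180000, 300000],
  retryOn: ['overloaded', 'server_error', 'network'],
  maxTotalMs: 900000,
};

// `sendRequest` wraps the original LLM request logic.
export async function handleLlmRequest<T>(sendRequest: () => Promise<T>): Promise<T> {
  return retryLlmRequest(sendRequest, llmRetryConfig);
}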
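
The PR notes mention that an abort signal interrupts the retry sleep, whereas the example code waits on a plain `setTimeout` that cannot be cancelled. With backoff entries as long as 300000 ms, a cancellable wait seems worth having. A minimal sketch, assuming the caller has an `AbortSignal` available; `abortableSleep` is a hypothetical helper, not an existing function in the repository:

```typescript
// Resolves after `ms`, or rejects as soon as `signal` is aborted.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // stop the pending wake-up so nothing lingers
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

// Usage: a long backoff wait that is cancelled shortly after it starts.
const demoController = new AbortController();
setTimeout(() => demoController.abort(), 10);
abortableSleep(60_000, demoController.signal).catch((err: Error) =>
  console.log("retry sleep interrupted:", err.message)
);
```

In a retry loop, this would replace the `setTimeout` promise, so a user cancelling the request does not have to wait out the remaining backoff.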

Verification

To verify the fix, test the LLM request handler with different error scenarios, including:

Transient errors (e.g., 529 overloaded, 503, 500)
Non-transient errors (e.g., rate limit, auth, billing, format errors)
Successful requests

Verify that the retry mechanism works as expected, with the correct number of attempts and backoff delays.
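
The scenarios listed above can be exercised without a live 529 by using a fake request that fails a fixed number of times. The sketch below is self-contained and therefore carries its own compact `retry` stand-in rather than importing the real helper; `flakyRequest`, `classifiedError`, `retry`, and `runChecks` are hypothetical names, and millisecond delays are used so the checks finish quickly.

```typescript
type Classified = Error & { class: string };

function classifiedError(cls: string): Classified {
  return Object.assign(new Error(cls), { class: cls });
}

// Fails with the given class `failures` times, then resolves with "ok".
function flakyRequest(failures: number, cls: string) {
  let calls = 0;
  const request = async (): Promise<string> => {
    calls++;
    if (calls <= failures) throw classifiedError(cls);
    return "ok";
  };
  return { request, calls: () => calls };
}

// Compact stand-in for the retry helper so this sketch runs by itself.
async function retry<T>(
  request: () => Promise<T>,
  maxAttempts: number,
  backoffMs: number[],
  retryOn: string[]
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await request();
    } catch (error) {
      const cls = (error as Partial<Classified>).class;
      if (cls === undefined || !retryOn.includes(cls) || attempt >= maxAttempts) throw error;
      const delay = backoffMs[Math.min(attempt - 1, backoffMs.length - 1)] ?? 0;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

async function runChecks(): Promise<string[]> {
  const results: string[] = [];

  // Transient error that clears after two failures: expect 3 calls in total.
  const transient = flakyRequest(2, "overloaded");
  await retry(transient.request, 5, [1], ["overloaded"]);
  results.push(`transient: ${transient.calls()} calls`);

  // Non-transient error: expect exactly 1 call and no retry.
  const auth = flakyRequest(1, "auth");
  await retry(auth.request, 5, [1], ["overloaded"]).catch(() => undefined);
  results.push(`auth: ${auth.calls()} calls`);

  // Persistent overload: expect the attempt cap (3) to be respected.
  const persistent = flakyRequest(10, "overloaded");
  await retry(persistent.request, 3, [1], ["overloaded"]).catch(() => undefined);
  results.push(`persistent: ${persistent.calls()} calls`);

  return results;
}
```

Calling `runChecks()` yields one line per scenario with the observed call count, which can then be compared against the expected number of attempts in a test runner of your choice.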

Extra Tips

Make sure to handle errors properly and provide informative error messages to users.
Consider adding logging and monitoring to track retry attempts and errors.
Review and adjust the retry configuration options to suit your specific use case.

PR fix notes

PR #49800: fix: retry same auth profile on transient overloaded errors before failover


PR #49807: fix: exclude overloaded errors from auth profile cooldown escalation
