openclaw - 💡(How to fix) Fix [Feature]: Configurable retry backoff (exponential) for API rate limit errors [1 participants]

GitHub stats
openclaw/openclaw#59711Fetched 2026-04-08 02:41:27
Comments: 0
Participants: 1
Timeline: 1 (labeled ×1)
Reactions: 0

When OpenClaw hits TPM/rate limit errors, retries should back off exponentially (e.g. 2s → 8s → 15–30s, then give up after N attempts and surface the error) instead of retrying immediately. Immediate retries compound the TPM problem.

Error Message

When a request hits Anthropic's TPM limit, OpenClaw gets a 429 Too Many Requests error. Right now it retries almost immediately, which means it sends another large request into an API that just told it "you're sending too much." That retry also hits the limit, and possibly triggers another agent's fallback, which also retries, until 4 agents are all hammering the API in a tight loop. The TPM bucket never gets a chance to refill. End users experience it as the agent going silent, giving a degraded response, or returning an error mid-conversation.

Root Cause

OpenClaw retries almost immediately after a 429 instead of backing off. Each retry lands in the same rate limit window, fails again, and can trigger other agents' fallbacks, which compounds the TPM problem.

Fix Action

Fix / Workaround

The current workaround is manual config tuning (reducing concurrency, capping context), which degrades performance to avoid hitting limits. A structural fix via retry backoff would eliminate this tradeoff.

RAW_BUFFER

Summary

When OpenClaw hits TPM/rate limit errors, retries should back off exponentially (e.g. 2s → 8s → 15–30s, then give up after N attempts and surface the error) instead of retrying immediately. Immediate retries compound the TPM problem.

Problem to solve

When a request hits Anthropic's TPM limit, OpenClaw gets a 429 Too Many Requests error. Right now it retries almost immediately — which means it sends another large request into an API that just told it "you're sending too much." That retry also hits the limit, and possibly triggers another agent's fallback, which also retries… and now you have 4 agents all hammering the API in a tight loop. The TPM bucket never gets a chance to refill.

Proposed solution

Instead of retrying immediately, the system waits progressively longer between each attempt:

  • 1st retry: wait 2 seconds
  • 2nd retry: wait 8 seconds
  • 3rd retry: wait 15–30 seconds
  • Give up after N attempts and surface the error

This gives the API rate limit window (usually 60 seconds) time to reset before the next attempt hits it. Backoff between retries is the industry-standard way to handle rate limits: AWS, Google, and Stripe all recommend it or build it into their SDKs.
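To make the schedule concrete, the sketch below computes when each attempt would be sent under the proposed delays and compares that with a near-immediate retry loop. The helper name `attempt_times`, the choice of 30 seconds for the third delay (the upper end of 15–30), and the 0.5-second "immediate" delay are illustrative assumptions, not OpenClaw internals.

```python
from itertools import accumulate

def attempt_times(delays):
    """Return the send time (seconds after the first request) of every attempt."""
    return [0.0] + list(accumulate(delays))

# Delays from the proposal; 30.0 is the assumed upper end of "15-30 seconds".
proposed = attempt_times([2.0, 8.0, 30.0])
# Assumed near-immediate retry loop for comparison.
immediate = attempt_times([0.5, 0.5, 0.5])

print(proposed)   # [0.0, 2.0, 10.0, 40.0]
print(immediate)  # [0.0, 0.5, 1.0, 1.5]
```

Under these assumptions the backoff schedule spreads four attempts over 40 seconds of a 60-second window, while the near-immediate loop sends all four within 1.5 seconds.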

Alternatives considered

No response

Impact

Affected users/systems/channels:
  • Any OpenClaw deployment running 2+ agents simultaneously; the problem scales with agent count
  • Specifically affects multi-agent setups using Anthropic models (Sonnet/Haiku) where concurrent requests are common
  • All channels are affected (Discord, Telegram, WhatsApp) since the failure happens at the API layer, not the channel layer
  • End users experience it as the agent going silent, giving a degraded response, or returning an error mid-conversation

Severity: Blocks workflow
  • When a 429 triggers a retry storm, all active agents become unresponsive for the duration, not just the one that hit the limit
  • The fallback chain (Sonnet → Haiku → GPT-4.1) fails silently when retries exhaust, leaving the user with no response and no explanation
  • In a business/executive assistant context (time-sensitive decisions, live conversations), a 30–60 second blackout is a trust-breaking failure

Frequency: Intermittent but predictable
  • Occurs reliably under moderate-to-high load: multiple agents active + long conversation contexts
  • In a 5-agent setup with 30K context windows, hitting the limit during peak usage (morning, active work sessions) happens multiple times per day
  • Not an edge case: it is a structural ceiling that grows more frequent as agent usage increases

Consequences:
  • Immediate: Requests fail or return degraded fallback responses; users lose context mid-conversation
  • Cascading: Without backoff, a single 429 can knock out all agents for 30–90 seconds as retries pile up
  • Operational: No visibility into why the agent went quiet; users assume the agent is broken, not rate-limited
  • Cost: Failed requests that consumed tokens before erroring still count against the TPM budget, making the problem self-reinforcing
  • Workaround tax: Currently requires manual config tuning (reducing concurrency, capping context) that degrades performance to avoid hitting limits; a structural fix via backoff would eliminate this tradeoff
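The "Operational" consequence suggests that once retries are exhausted, the failure should become an explicit message instead of silence. A minimal sketch of that idea follows; the `RateLimitExhausted` exception and the `user_message` helper are hypothetical names invented for illustration and are not existing OpenClaw APIs.

```python
class RateLimitExhausted(Exception):
    """Hypothetical error raised once every attempt has hit a 429."""

    def __init__(self, provider, attempts, waited_s):
        super().__init__(f"{provider} rate limit: gave up after {attempts} attempts")
        self.provider = provider
        self.attempts = attempts
        self.waited_s = waited_s

def user_message(err):
    """Turn the error into text a channel (Discord, Telegram, ...) could display."""
    return (
        f"The model provider ({err.provider}) is rate-limiting requests. "
        f"I tried {err.attempts} times over {err.waited_s:.0f}s without success; "
        "please try again in about a minute."
    )

err = RateLimitExhausted("Anthropic", attempts=4, waited_s=40)
print(user_message(err))
```

Carrying the attempt count and total wait on the exception is one possible design; it lets the channel layer explain that the agent is rate-limited, not broken.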

Evidence/examples

No response

Additional information

No response

extent analysis

TL;DR

Implement an exponential backoff strategy for retries when hitting TPM/rate limit errors to prevent compounding the problem.

Guidance

  • Identify the current retry mechanism and modify it to introduce exponential backoff (e.g., 2s, 8s, 15s) between attempts.
  • Determine a suitable value for N (the number of attempts before giving up and surfacing the error) based on the specific use case and API rate limit window.
  • Consider adding random jitter to the backoff delays so that multiple agents do not retry at the same moment.
  • Monitor the system's behavior after implementing the backoff strategy to ensure it effectively prevents retry storms and reduces the frequency of 429 errors.
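The guidance above can be sketched as a small configuration object. The class `BackoffPolicy` and all of its field names are hypothetical and do not correspond to any existing OpenClaw setting. The defaults (base 2s, factor 4, cap 60s) are assumptions chosen to approximate the 2s / 8s / 15–30s example, and the jitter shown is "equal jitter": a uniform delay between half the ceiling and the ceiling.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class BackoffPolicy:
    base_s: float = 2.0     # delay ceiling before the first retry
    factor: float = 4.0     # growth per retry: 2s, 8s, 32s, ...
    cap_s: float = 60.0     # assumed length of the rate limit window
    max_attempts: int = 4   # N: total tries before the error is surfaced

    def ceiling(self, retry_index):
        """Upper bound on the delay before retry number retry_index (0-based)."""
        return min(self.cap_s, self.base_s * self.factor ** retry_index)

    def delay(self, retry_index, rng=random):
        """Equal jitter: uniform between ceiling/2 and ceiling."""
        top = self.ceiling(retry_index)
        return rng.uniform(top / 2, top)

policy = BackoffPolicy()
print([policy.ceiling(i) for i in range(policy.max_attempts - 1)])  # [2.0, 8.0, 32.0]

# Five agents failing at the same moment draw different first-retry delays.
rng = random.Random(0)
print([round(policy.delay(0, rng), 2) for _ in range(5)])
```

With these defaults the third retry waits between 16 and 32 seconds, close to the 15–30 second range in the proposal. Other jitter strategies would work as well.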

Example

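import random
import time

class RateLimitError(Exception):
    """Placeholder for whatever exception the API client raises on HTTP 429."""

def exponential_backoff(attempt, base=2.0, factor=4.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based): 2, 8, 32, then capped at 60."""
    return min(cap, base * factor ** attempt)

def retry_with_backoff(max_attempts, func):
    """Call func(), retrying only on RateLimitError, for at most max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = exponential_backoff(attempt)
            # Equal jitter: sleep between delay/2 and delay so agents do not retry in lockstep.
            time.sleep(random.uniform(delay / 2, delay))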

Notes

The exact implementation of the exponential backoff strategy may vary depending on the programming language and framework used. It's essential to test and fine-tune the backoff parameters to ensure they effectively mitigate the TPM/rate limit errors.
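One way to test backoff parameters without real waiting is to make the sleep function injectable and record the delays it is asked for. The names below (`RateLimitError`, `call_with_backoff`, the `sleep` parameter) are assumptions made for this sketch and are not part of any library; jitter is left out so the recorded delays are deterministic.

```python
import time

class RateLimitError(Exception):
    """Stand-in for the exception an API client raises on HTTP 429."""

def call_with_backoff(func, delays=(2.0, 8.0, 30.0), sleep=time.sleep):
    """Call func(); on RateLimitError wait the next delay and retry."""
    for delay in delays:
        try:
            return func()
        except RateLimitError:
            sleep(delay)
    return func()  # final attempt: a RateLimitError here reaches the caller

# Fake API call that is rate-limited twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "ok"

recorded = []
result = call_with_backoff(flaky, sleep=recorded.append)  # no real sleeping
print(result, recorded)  # ok [2.0, 8.0]
```

Passing `recorded.append` as the sleep function lets a test assert on the exact delay sequence in milliseconds of wall-clock time.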

Recommendation

Apply exponential backoff to retries so that a single 429 does not escalate into a retry storm and the API rate limit window can reset before the next attempt. This is an industry-standard approach to rate limits, and it could remove the need for the manual config tuning that currently degrades performance.
