openclaw: OpenClaw should auto-pause and auto-retry on provider 429/rate-limit errors [2 comments, 1 participant]

GitHub stats
openclaw/openclaw#62540 · Fetched 2026-04-08 03:02:46
Comments: 2 · Participants: 1 · Timeline: 2 (commented ×2) · Reactions: 0


When a model call hits a provider rate limit, OpenClaw currently behaves like the run has effectively stopped until the user prompts again. This is frustrating during long-running work because the provider often includes an explicit retry delay, but OpenClaw does not automatically resume the run.

OpenClaw should treat 429/rate-limit failures as transient resumable pauses, not terminal run failures.

Problem

Current behavior:

  • model call hits 429 / TPM limit
  • provider returns retry hint like
  • run stops progressing
  • user must manually prompt again to continue

This creates a poor user experience, especially for:

  • long multi-step runs
  • coding/debugging sessions
  • background work
  • high-context GPT-5.x turns that temporarily exceed TPM

Desired behavior

On provider 429/rate-limit:

  1. detect the rate-limit condition
  2. parse retry delay from:

    • provider error body
    • structured message text
  3. mark the run/session as paused for rate limit
  4. schedule an automatic retry
  5. resume the pending step automatically without needing a new user prompt

Important note: This does not require resuming partial provider generation state. It only requires resuming at the OpenClaw orchestration step by retrying the pending model call.
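
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not OpenClaw's actual code: the `retry_after_seconds` field, the `error.message` layout, and the "try again in Ns" wording are all assumptions about what a provider might send, and real providers may differ.

```python
import json
import re
from typing import Optional

# Hypothetical message pattern, e.g. "Please try again in 1.5s" or "... in 250ms".
_RETRY_TEXT = re.compile(r"try again in\s+(\d+(?:\.\d+)?)\s*(ms|s)\b", re.IGNORECASE)


def is_rate_limit(status_code: int) -> bool:
    """Step 1: detect the rate-limit condition."""
    return status_code == 429


def parse_retry_delay(body: str) -> Optional[float]:
    """Step 2: return the retry delay in seconds, or None if no hint is found."""
    message = body
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None  # body is not JSON; treat it as plain message text
    if isinstance(payload, dict):
        error = payload.get("error")
        if not isinstance(error, dict):
            error = {}
        # Assumed structured field; real providers may name it differently.
        hint = error.get("retry_after_seconds")
        if isinstance(hint, (int, float)) and not isinstance(hint, bool):
            return float(hint)
        message = str(error.get("message", ""))
    match = _RETRY_TEXT.search(message)
    if match is None:
        return None
    value = float(match.group(1))
    return value / 1000.0 if match.group(2).lower() == "ms" else value
```

Returning `None` when no hint is found lets the caller fall through to a fallback backoff instead of guessing a delay.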

Proposed run states

Required implementation

  • Catch provider 429 errors centrally
  • Parse retry delay if available
  • Persist retry metadata:
    • run/session id
    • pending step
    • provider
    • model
    • retry count
    • retry_at
  • Schedule wake/retry using existing scheduling/wake primitives
  • Retry automatically at with small jitter
  • Continue run normally if retry succeeds
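
One possible shape for the persisted retry metadata is sketched below. The field names follow the list above, but the types, the JSON encoding, and the `save`/`load`/`due` helpers are assumptions made for illustration; OpenClaw's existing scheduling and wake primitives are not shown here.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RetryMetadata:
    """Fields taken from the list above; the types are assumptions."""
    run_id: str
    pending_step: str
    provider: str
    model: str
    retry_count: int
    retry_at: float  # Unix timestamp (seconds) of the next attempt


def save(meta: RetryMetadata) -> str:
    """Serialize to JSON so the record can survive a process restart."""
    return json.dumps(asdict(meta), sort_keys=True)


def load(raw: str) -> RetryMetadata:
    """Rebuild a record from its JSON form."""
    return RetryMetadata(**json.loads(raw))


def due(records: List[RetryMetadata], now: float) -> List[RetryMetadata]:
    """Return the records whose retry time has arrived, earliest first."""
    ready = [r for r in records if r.retry_at <= now]
    return sorted(ready, key=lambda r: r.retry_at)
```

A wake handler could call `due(...)` with the current time and retry each returned record's pending step.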

Retry policy

  • default max retries:
  • use provider retry hint first
  • otherwise use fallback backoff
  • repeated 429s should use exponential backoff with cap
  • apply jitter to avoid synchronized retries
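
A minimal sketch of this policy, under stated assumptions: the base delay, cap, and jitter fraction are placeholder values, not numbers from the issue, and the reading that a provider hint always takes precedence over the computed backoff is one interpretation of the bullets above.

```python
import random
from typing import Optional


def next_delay(
    attempt: int,
    provider_hint: Optional[float] = None,
    base: float = 2.0,    # assumed fallback base delay, in seconds
    cap: float = 300.0,   # assumed upper bound on the fallback backoff
    jitter: float = 0.1,  # assumed: add up to 10% random extra delay
    rng: Optional[random.Random] = None,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    source = rng if rng is not None else random
    if provider_hint is not None:
        # Use the provider's hint first; it is not capped, because retrying
        # earlier than the provider asked would likely fail again.
        delay = provider_hint
    else:
        # Fallback: exponential backoff (base, 2*base, 4*base, ...) with a cap.
        delay = min(cap, base * 2 ** (attempt - 1))
    # Jitter only ever lengthens the wait, so synchronized clients spread out
    # without any of them retrying before the computed delay.
    return delay + source.uniform(0.0, delay * jitter)
```

With `jitter=0.0` the fallback sequence is 2, 4, 8, ... seconds until it reaches the cap.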

UX

  • retry delay < 30s:
    • retry silently by default
  • retry delay >= 30s:
    • surface a short status update like:
      • “Rate-limited, retrying automatically at 09:20:03”
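
The UX rule above could be expressed as a small helper like the one below. The function name and the choice to return `None` for silent retries are illustrative assumptions; only the 30-second threshold and the message wording come from the issue.

```python
from datetime import datetime, timedelta
from typing import Optional

SILENT_THRESHOLD_S = 30.0  # threshold from the UX rule above


def status_update(delay_s: float, now: datetime) -> Optional[str]:
    """Return a status line for long waits, or None to retry silently."""
    if delay_s < SILENT_THRESHOLD_S:
        return None
    retry_at = now + timedelta(seconds=delay_s)
    return f"Rate-limited, retrying automatically at {retry_at:%H:%M:%S}"
```

For example, a 45-second delay starting at 09:19:18 yields the message shown in the bullet above, while a 5-second delay yields `None`.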

Failure behavior

After max retries:

  • mark run
  • surface concise error including:
    • provider/model
    • limit type if known
    • retry attempts
    • suggested mitigations:
      • reduce context
      • reduce concurrency
      • use smaller model
      • request higher limits
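
As an illustration of the concise error, the sketch below assembles the listed fields into a short message. The function name and exact wording are hypothetical, and the terminal run state is left out because the issue text does not name it here.

```python
from typing import Optional

# Mitigations copied from the list above.
MITIGATIONS = (
    "reduce context",
    "reduce concurrency",
    "use smaller model",
    "request higher limits",
)


def rate_limit_failure_message(
    provider: str, model: str, attempts: int, limit_type: Optional[str] = None
) -> str:
    """Build the concise error shown once retries are exhausted."""
    # Include the limit type only if it is known.
    limit = f" ({limit_type} limit)" if limit_type else ""
    lines = [
        f"Rate limit not cleared for {provider}/{model}{limit} "
        f"after {attempts} retry attempts.",
        "Suggested mitigations:",
    ]
    lines += [f"  - {m}" for m in MITIGATIONS]
    return "\n".join(lines)
```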

Optional follow-ons

  • pre-retry context shrinking
  • model fallback policy
  • per-provider concurrency throttling
  • projected TPM budgeting before dispatch
  • queueing large runs instead of firing immediately into known TPM exhaustion
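
To make the "projected TPM budgeting before dispatch" idea concrete, here is one possible sketch. It assumes a simple sliding 60-second window and a caller-supplied token estimate; the `TpmBudget` class is hypothetical, and real providers may account for tokens differently.

```python
from collections import deque
from typing import Deque, Tuple


class TpmBudget:
    """Sliding 60-second token budget for one provider/model pair."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.spent: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _used(self, now: float) -> int:
        # Drop entries that have aged out of the 60-second window.
        while self.spent and self.spent[0][0] <= now - 60.0:
            self.spent.popleft()
        return sum(tokens for _, tokens in self.spent)

    def try_dispatch(self, projected_tokens: int, now: float) -> bool:
        """Record and allow the call if it fits; otherwise the caller should queue it."""
        if self._used(now) + projected_tokens > self.limit:
            return False
        self.spent.append((now, projected_tokens))
        return True
```

A `False` result is the signal to queue the run rather than send it into a limit that is already known to be exhausted.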

Acceptance criteria

  1. A provider 429 with a retry delay causes the run to pause and auto-resume without user prompting
  2. Session/status shows paused-for-rate-limit state and retry time
  3. Successful retry continues the original task
  4. Repeated 429s back off and eventually fail cleanly after max retries
  5. No duplicate final replies are emitted across retries
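
The criteria above can be exercised against a fake provider, as in the sketch below. `RateLimited`, `run_step`, and `flaky_provider` are hypothetical names invented for this example, and the injected `sleep` callable stands in for whatever wake mechanism OpenClaw would actually use.

```python
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error raised by a provider client on HTTP 429."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def run_step(
    call_model: Callable[[], str],
    sleep: Callable[[float], None],
    max_retries: int,
) -> Optional[str]:
    """Retry the pending model call; return its reply, or None on clean failure."""
    for attempt in range(max_retries + 1):
        try:
            return call_model()  # at most one reply is returned per step
        except RateLimited as err:
            if attempt == max_retries:
                return None  # retries exhausted: fail cleanly
            sleep(err.retry_after)  # paused for rate limit until the hint elapses
    return None


def flaky_provider(failures: int) -> Callable[[], str]:
    """Fake provider that raises RateLimited `failures` times, then succeeds."""
    remaining = [failures]

    def call() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RateLimited(retry_after=2.0)
        return "final reply"

    return call
```

Passing a list's `append` method as `sleep` records the waits without actually sleeping: with `flaky_provider(2)` and `max_retries=3` the step returns `"final reply"` after two recorded waits, and with `flaky_provider(5)` it returns `None` after three.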

Why this matters

This would make OpenClaw feel much more resilient during real work. Right now, a brief TPM spike can make a session appear to “go stale” until the user nudges it again, even when the provider explicitly tells us when to retry.

extent analysis

TL;DR

Implement a retry mechanism that catches provider 429 errors, parses retry delays, and schedules automatic retries to resume paused runs without user prompting.

Guidance

  • Catch provider 429 errors centrally and parse retry delays from error bodies or structured message text to determine the retry timing.
  • Implement a retry policy with a default maximum number of retries: use the provider retry hint first, fall back to a backoff delay otherwise, and apply capped exponential backoff with jitter on repeated 429s to avoid overwhelming the provider.
  • Surface status updates to the user when retry delays are 30 seconds or longer, and provide concise error messages with suggested mitigations after max retries.
  • Consider implementing additional features like pre-retry context shrinking, model fallback policy, and per-provider concurrency throttling to further improve resilience.

Example

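# Pseudo-code sketch of the retry mechanism; all helpers are hypothetical.
def handle_provider_error(error, run):
    if error.status_code != 429:
        raise error  # not a rate limit: leave it to normal error handling
    # Use the provider's retry hint when present, else a fallback backoff.
    retry_delay = parse_retry_delay(error.body) or fallback_backoff(run)
    # Mark the run as paused for rate limit and persist retry metadata
    # before scheduling, so the wake handler can find the pending step.
    mark_paused_for_rate_limit(run, retry_delay)
    schedule_retry(run, retry_delay)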

Notes

The core of the implementation is catching 429 errors, parsing retry delays, and scheduling automatic retries. The retry policy should avoid overwhelming the provider while keeping waits as short as the provider allows. Pre-retry context shrinking and a model fallback policy are optional follow-ons.

Recommendation

Implement a retry mechanism that catches provider 429 errors and schedules automatic retries. This removes the need for a manual nudge after a rate limit, which matters most during long-running work and high-context GPT-5.x turns.
