openclaw - 💡(How to fix) Fix Auth profile failover should differentiate 401 (dead key) from 429/529 (transient) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#58565Fetched 2026-04-08 02:01:02
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Error Message

  • openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
  • Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
  • DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Root Cause

  • openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
  • Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
  • DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Fix Action

Fix / Workaround

For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile.

This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally.

  • OpenClaw gateway running on macOS (arm64)
  • Two Anthropic auth profiles in order, primary key expired
  • Scheduler dispatching isolated agent tasks via /v1/chat/completions
RAW_BUFFERClick to expand / collapse

Problem

When an auth profile's API key is invalid or expired (HTTP 401), OpenClaw puts the profile into cooldown with the same timing as rate limits (429) or overloaded responses (529). This means:

  1. Every request first tries the dead key
  2. Waits for the 401 failure
  3. Enters cooldown (1 minute observed)
  4. Falls back to the next profile in the order
  5. The HTTP request has been waiting the entire time

For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile.

Observed behavior

  • openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
  • Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
  • DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Expected behavior

Error-specific failover:

  • 401 (authentication_error): Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as disabled and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows.
  • 429 (rate_limit_error): Cooldown using the Retry-After header value (or a reasonable default). The key is valid but temporarily throttled.
  • 529 (overloaded_error): Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key.
  • 402 (billing): Extended backoff (the billingBackoffHours config already exists for this).

Impact

This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally.

Reproduction

  1. Add an invalid API key as the primary auth profile
  2. Add a valid API key as the secondary
  3. Send a chat completion request via the HTTP API
  4. Observe the request takes 45-90 seconds as it tries the dead key, enters cooldown, then falls back

Environment

  • OpenClaw gateway running on macOS (arm64)
  • Two Anthropic auth profiles in order, primary key expired
  • Scheduler dispatching isolated agent tasks via /v1/chat/completions

extent analysis

TL;DR

Implement error-specific failover logic to immediately disable profiles with invalid or expired API keys (401) instead of applying a cooldown.

Guidance

  • Modify the OpenClaw gateway to handle 401 errors by marking the profile as disabled and skipping it on subsequent requests.
  • Update the cooldown logic to differentiate between 401, 429, and 529 errors, applying distinct cooldown strategies for each.
  • Consider adding a warning log when a profile is disabled due to a 401 error to notify the operator.
  • Review the current failover mechanism to ensure it correctly handles the new error-specific logic.

Example

# Pseudocode example of error-specific failover logic
def handle_error(status_code, profile):
    if status_code == 401:
        # Mark profile as disabled and skip on subsequent requests
        profile.disabled = True
        logger.warning("Profile disabled due to invalid or expired API key")
    elif status_code == 429:
        # Cooldown using Retry-After header value or a reasonable default
        cooldown_time = get_retry_after_header() or 60
        profile.cooldown_expires = time.time() + cooldown_time
    elif status_code == 529:
        # Short cooldown and immediate failover
        cooldown_time = 10
        profile.cooldown_expires = time.time() + cooldown_time
        # Failover to the next profile
        return next_profile

Notes

The provided guidance assumes that the OpenClaw gateway has the necessary logic to handle different error codes and apply distinct cooldown strategies. The example pseudocode illustrates the error-specific failover logic but may require modifications to fit the actual implementation.

Recommendation

Apply workaround: Implement error-specific failover logic to handle 401 errors by immediately disabling the profile and skipping it on subsequent requests. This will help reduce latency in scheduler-dispatched agent jobs and improve overall system performance.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Error-specific failover:

  • 401 (authentication_error): Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as disabled and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows.
  • 429 (rate_limit_error): Cooldown using the Retry-After header value (or a reasonable default). The key is valid but temporarily throttled.
  • 529 (overloaded_error): Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key.
  • 402 (billing): Extended backoff (the billingBackoffHours config already exists for this).

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING