Error-specific failover: - **401 (authentication_error)**: Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as `disabled` and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows. - **429 (rate_limit_error)**: Cooldown using the `Retry-After` header value (or a reasonable default). The key is valid but temporarily throttled. - **529 (overloaded_error)**: Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key. - **402 (billing)**: Extended backoff (the `billingBackoffHours` config already exists for this).

openclaw - 💡(How to fix) Fix Auth profile failover should differentiate 401 (dead key) from 429/529 (transient) [1 participants]

amittell · 2026-03-31T21:17:54Z

[openclaw] Problem When an auth profile's API key is invalid or expired HTTP 401 , OpenClaw puts the profile into cooldown with the same timing as rate limits… ## Fix / Workaround For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile. This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally. - OpenClaw gateway running on macOS (arm64) - Two Anthropic auth profiles in order, primary key expired - Scheduler dispatching isolated agent tasks via `/v1/chat/completions` ## Problem When an auth profile's API key is invalid or expired (HTTP 401), OpenClaw puts the profile into cooldown with the same timing as rate limits (429) or overloaded responses (529). This means: 1. Every request first tries the dead key 2. Waits for the 401 failure 3. Enters cooldown (1 minute observed) 4. Falls back to the next profile in the order 5. The HTTP request has been waiting the entire time For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile. ## Observed behavior - `openclaw doctor` reports: `anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.` - Gateway returns `{"error":{"message":"internal error","type":"api_error"}}` (HTTP 500) to the caller while the cooldown resolves - DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out ## Expected behavior Error-specific failover: - **401 (authentication_error)**: Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as `disabled` and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows. - **429 (rate_limit_error)**: Cooldown using the `Retry-After` header value (or a reasonable default). The key is valid but temporarily throttled. - **529 (overloaded_error)**: Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key. - **402 (billing)**: Extended backoff (the `billingBackoffHours` config already exists for this). ## Impact This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally. ## Reproduction 1. Add an invalid API key as the primary auth profile 2. Add a valid API key as the secondary 3. Send a chat completion request via the HTTP API 4. Observe the request takes 45-90 seconds as it tries the dead key, enters cooldown, then falls back ## Environment - OpenClaw gateway running on macOS (arm64) - Two Anthropic auth profiles in order, primary key expired - Scheduler dispatching isolated agent tasks via `/v1/chat/completions`

openclaw2026-03-31 21:17:54

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#58565•Fetched 2026-04-08 02:01:02

View on GitHub

Comments

Participants

Timeline

Reactions

Author

amittell

Participants

amittell

Error Message

openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Root Cause

openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Fix Action

Fix / Workaround

For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile.

This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally.

OpenClaw gateway running on macOS (arm64)
Two Anthropic auth profiles in order, primary key expired
Scheduler dispatching isolated agent tasks via /v1/chat/completions

RAW_BUFFERClick to expand / collapse

Problem

When an auth profile's API key is invalid or expired (HTTP 401), OpenClaw puts the profile into cooldown with the same timing as rate limits (429) or overloaded responses (529). This means:

Every request first tries the dead key
Waits for the 401 failure
Enters cooldown (1 minute observed)
Falls back to the next profile in the order
The HTTP request has been waiting the entire time

For scheduler-dispatched jobs, this adds 30-90 seconds of dead time per request as the gateway ping-pongs through cooldown on the dead key before reaching a working profile.

Observed behavior

openclaw doctor reports: anthropic:me.com: cooldown (1m) -- Wait for cooldown or switch provider.
Gateway returns {"error":{"message":"internal error","type":"api_error"}} (HTTP 500) to the caller while the cooldown resolves
DMs work (eventually) because the cooldown expires and fallback kicks in, but scheduled jobs consistently time out

Expected behavior

Error-specific failover:

401 (authentication_error): Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as disabled and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows.
429 (rate_limit_error): Cooldown using the Retry-After header value (or a reasonable default). The key is valid but temporarily throttled.
529 (overloaded_error): Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key.
402 (billing): Extended backoff (the billingBackoffHours config already exists for this).

Impact

This primarily affects scheduler-dispatched agent jobs where latency matters and every request goes through the same failover path. Interactive DMs are less affected because users retry naturally.

Reproduction

Add an invalid API key as the primary auth profile
Add a valid API key as the secondary
Send a chat completion request via the HTTP API
Observe the request takes 45-90 seconds as it tries the dead key, enters cooldown, then falls back

Environment

OpenClaw gateway running on macOS (arm64)
Two Anthropic auth profiles in order, primary key expired
Scheduler dispatching isolated agent tasks via /v1/chat/completions

extent analysis

TL;DR

Implement error-specific failover logic to immediately disable profiles with invalid or expired API keys (401) instead of applying a cooldown.

Guidance

Modify the OpenClaw gateway to handle 401 errors by marking the profile as disabled and skipping it on subsequent requests.
Update the cooldown logic to differentiate between 401, 429, and 529 errors, applying distinct cooldown strategies for each.
Consider adding a warning log when a profile is disabled due to a 401 error to notify the operator.
Review the current failover mechanism to ensure it correctly handles the new error-specific logic.

Example

# Pseudocode example of error-specific failover logic
def handle_error(status_code, profile):
    if status_code == 401:
        # Mark profile as disabled and skip on subsequent requests
        profile.disabled = True
        logger.warning("Profile disabled due to invalid or expired API key")
    elif status_code == 429:
        # Cooldown using Retry-After header value or a reasonable default
        cooldown_time = get_retry_after_header() or 60
        profile.cooldown_expires = time.time() + cooldown_time
    elif status_code == 529:
        # Short cooldown and immediate failover
        cooldown_time = 10
        profile.cooldown_expires = time.time() + cooldown_time
        # Failover to the next profile
        return next_profile

Notes

The provided guidance assumes that the OpenClaw gateway has the necessary logic to handle different error codes and apply distinct cooldown strategies. The example pseudocode illustrates the error-specific failover logic but may require modifications to fit the actual implementation.

Recommendation

Apply workaround: Implement error-specific failover logic to handle 401 errors by immediately disabling the profile and skipping it on subsequent requests. This will help reduce latency in scheduler-dispatched agent jobs and improve overall system performance.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

Error-specific failover:

401 (authentication_error): Immediately disable the profile. The key is dead, not rate-limited. Don't cooldown -- mark the profile as disabled and skip it on all subsequent requests until the operator explicitly re-enables it or replaces the key. Log a warning so the operator knows.
429 (rate_limit_error): Cooldown using the Retry-After header value (or a reasonable default). The key is valid but temporarily throttled.
529 (overloaded_error): Short cooldown (5-10s) + immediate failover to the next profile. The provider is busy, not the key.
402 (billing): Extended backoff (the billingBackoffHours config already exists for this).

#api #embedding generation #cache error #pipeline error #runtime error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

openclaw - 💡(How to fix) Fix Auth profile failover should differentiate 401 (dead key) from 429/529 (transient) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Problem

Observed behavior

Expected behavior

Impact

Reproduction

Environment

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

openclaw - 💡(How to fix) Fix Auth profile failover should differentiate 401 (dead key) from 429/529 (transient) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix / Workaround

Problem

Observed behavior

Expected behavior

Impact

Reproduction

Environment

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING