openclaw - Persistent file-based provider cooldown blocks user for hours after billing recovery

mattglover11 · 2026-04-24T03:10:41Z


GitHub issue: openclaw/openclaw#70903 (fetched 2026-04-24 10:38:04)


When an Anthropic (or any) provider returns a 402 billing error, OpenClaw writes a disabledUntil timestamp to the agent auth-state file that persists across gateway restarts. On repeated failures the timestamp extends forward, which means even after the user tops up credit, they remain locked out for hours (or up to 24h) until either a human edits the JSON or the timestamp naturally expires.

This affects every user who runs a single-provider configuration (e.g. Opus-only) or whose fallback chain shares a single auth profile.

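Error Message

billing issue (skipping all models)

Root Cause

On a 402 billing error, OpenClaw persists a disabledUntil timestamp in the agent's auth-state.json. The timestamp survives gateway restarts and is pushed further forward by each repeated failure, so the provider stays disabled after the billing problem has been fixed.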

Fix Action

Workaround

A ~/bin/opus-reset script that:

1. Scans all ~/.openclaw/agents/*/agent/auth-state.json files.
2. Strips the disabledUntil, disabledReason, cooldownUntil, pausedUntil, failureCounts, errorCount, lastFailureAt, and quarantineUntil fields.
3. Runs openclaw gateway restart.

This is the minimum needed to recover today. It should not be necessary.
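
The issue does not include the script itself, so the following is only a minimal Python sketch of the three steps described above. The field list, the file glob, and the restart command come from the issue text. The function names (`strip_cooldown`, `reset_auth_state`, `main`) are hypothetical, and because the full layout of auth-state.json is not shown, the sketch assumes the fields may sit at any nesting depth and strips them recursively.

```python
import glob
import json
import os
import subprocess

# Field names taken from the workaround description in the issue.
COOLDOWN_FIELDS = {
    "disabledUntil", "disabledReason", "cooldownUntil", "pausedUntil",
    "failureCounts", "errorCount", "lastFailureAt", "quarantineUntil",
}


def strip_cooldown(node):
    """Recursively delete cooldown fields from parsed JSON; return the number removed."""
    removed = 0
    if isinstance(node, dict):
        for key in list(node):
            if key in COOLDOWN_FIELDS:
                del node[key]
                removed += 1
            else:
                removed += strip_cooldown(node[key])
    elif isinstance(node, list):
        for item in node:
            removed += strip_cooldown(item)
    return removed


def reset_auth_state(pattern="~/.openclaw/agents/*/agent/auth-state.json"):
    """Rewrite every matching file that held cooldown fields; return the paths changed."""
    changed = []
    for path in glob.glob(os.path.expanduser(pattern)):
        with open(path, encoding="utf-8") as fh:
            state = json.load(fh)
        if strip_cooldown(state):
            with open(path, "w", encoding="utf-8") as fh:
                json.dump(state, fh, indent=2)
            changed.append(path)
    return changed


def main():
    for path in reset_auth_state():
        print("cleared cooldown fields in", path)
    # Restart command as quoted in the issue; requires openclaw on PATH.
    subprocess.run(["openclaw", "gateway", "restart"], check=False)
```

Calling `main()` performs the full reset. It may be worth backing up the auth-state files first, since the sketch rewrites them in place.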



Reproduction

1. Run OpenClaw with a single-provider config (e.g. primary: anthropic/claude-opus-4-7, fallback: anthropic/claude-opus-4-6, both resolving to the same anthropic:default auth profile).
2. Let the Anthropic account hit 402 "out of extra usage".
3. Top up credit in the Anthropic console. Confirm a curl to GET https://api.anthropic.com/v1/models with the same key returns 200.
4. Fire a normal message. OpenClaw still returns billing issue (skipping all models).
5. Run openclaw gateway restart. The gateway restarts cleanly but the provider remains disabled.
6. Inspect ~/.openclaw/agents/main/agent/auth-state.json and observe:

"anthropic:default": {
  "errorCount": 2,
  "lastFailureAt": 1776983458318,
  "failureCounts": { "billing": 2 },
  "disabledUntil": 1777001449315,
  "disabledReason": "billing"
}

7. disabledUntil is several hours or up to 24h in the future. Only manually editing this file (removing disabledUntil, disabledReason, and zeroing counters) AND restarting the gateway restores service.
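
To make the numbers in that entry easier to read, here is a small Python sketch that converts them to a UTC time and a cooldown length. It assumes both timestamps are Unix epoch milliseconds, which the issue does not state but which matches the reported dates. `describe_cooldown` is a hypothetical helper, not part of OpenClaw.

```python
from datetime import datetime, timezone


def describe_cooldown(entry, now_ms):
    """Summarise a profile entry, assuming its timestamps are epoch milliseconds."""
    until_ms = entry.get("disabledUntil")
    if until_ms is None:
        return None  # profile is not in a cooldown
    until = datetime.fromtimestamp(until_ms // 1000, tz=timezone.utc)
    return {
        "disabled_until_utc": until.isoformat(),
        # Length of the lockout measured from the last recorded failure.
        "window_hours": round((until_ms - entry["lastFailureAt"]) / 3_600_000, 2),
        # Whole minutes still to wait at the given moment, never negative.
        "remaining_minutes": max(0, (until_ms - now_ms) // 60_000),
    }


entry = {
    "errorCount": 2,
    "lastFailureAt": 1776983458318,
    "failureCounts": {"billing": 2},
    "disabledUntil": 1777001449315,
    "disabledReason": "billing",
}
print(describe_cooldown(entry, now_ms=entry["lastFailureAt"]))
```

Under that assumption, the entry above encodes a lockout of roughly five hours after the last failure.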

Impact

- User is silently locked out long after the underlying cause is resolved.
- Remote users cannot recover without SSH / local shell access; slash commands are dead when the gateway thinks the provider is disabled.
- Increasing lock on repeat failures: each new 402 during the disabled window extends disabledUntil further forward, compounding the problem.
- No observability: nothing surfaces "you are in a persistent cooldown until X" to the user. They just see blank responses or cron failures labelled "billing issue".

Observed in the wild

- 2026-04-23: Anthropic credit depleted ~08:00 AEST. Credit restored ~10:00 AEST. Provider stayed disabled until a manual auth-state.json edit + gateway restart some 60 minutes later.
- 2026-04-24: Anthropic credit depleted evening 2026-04-23. Credit restored ~08:00 AEST the next morning. Provider stayed disabled until ~13:00 AEST (5+ hours after top-up), only resolved by manually clearing disabledUntil via Claude Code in the local terminal.

Proposed fix (any of these would resolve it)

1. Don't persist disabledUntil to disk. Keep it in-memory only with a short TTL (e.g. 5 minutes). On restart, probe the provider fresh.
2. Clear disabledUntil on successful probe. Add a periodic health check (every 2–5 minutes) that hits /v1/models (or equivalent) on disabled providers. First success clears the cooldown immediately.
3. Cap disabledUntil at a sensible ceiling (e.g. 15 minutes maximum, regardless of failure count). Billing errors are usually user-recoverable on a minutes timescale, not hours.
4. Expose a first-class CLI reset: openclaw auth reset --provider anthropic that clears the state cleanly. This at least gives remote users a documented escape hatch.

Option 2 is probably the cleanest: a provider marked "disabled" should be opportunistically re-probed, and if it responds healthy, the cooldown should be cleared without any user action.
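
As a rough illustration of options 2 and 3, the Python sketch below shows the intended behaviour on a single profile entry. It is not OpenClaw code: `reprobe`, `capped_disabled_until`, and `MAX_COOLDOWN_MS` are hypothetical names, and the probe is passed in as a plain callable so that the real check (for example a GET on /v1/models) stays outside the sketch.

```python
from typing import Callable

MAX_COOLDOWN_MS = 15 * 60 * 1000  # 15-minute ceiling suggested in option 3


def capped_disabled_until(now_ms: int, requested_until_ms: int) -> int:
    """Option 3: a cooldown may never end more than MAX_COOLDOWN_MS after now."""
    return min(requested_until_ms, now_ms + MAX_COOLDOWN_MS)


def reprobe(profile: dict, probe: Callable[[], bool]) -> bool:
    """Option 2: probe a disabled profile and clear its cooldown on the first success.

    Returns True if the profile is usable after the call.
    """
    if "disabledUntil" not in profile:
        return True  # not disabled, nothing to do
    if not probe():
        return False  # still failing; leave the cooldown untouched
    # Healthy again: drop the cooldown and zero the failure counters.
    profile.pop("disabledUntil", None)
    profile.pop("disabledReason", None)
    profile["errorCount"] = 0
    profile["failureCounts"] = {}
    return True


profile = {
    "errorCount": 2,
    "failureCounts": {"billing": 2},
    "disabledUntil": 1777001449315,
    "disabledReason": "billing",
}
print(reprobe(profile, probe=lambda: True), profile)
```

A scheduler would call `reprobe` for each disabled profile at the chosen interval. Injecting the probe keeps the clearing logic testable without network access.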

Environment

- OpenClaw version: 2026.4.21 (commit f788c88)
- OS: macOS 25.3.0 arm64
- Auth mode: api_key for anthropic:default
- Fallback chain: Opus-only (no cross-provider safety net, which makes this failure mode fully blocking)


extent analysis

TL;DR

The most likely fix is to implement a periodic health check that clears the disabledUntil timestamp when a disabled provider becomes available again.

Guidance

- Consider implementing option 2 from the proposed fixes: add a periodic health check (every 2–5 minutes) that hits /v1/models on disabled providers, and clear the cooldown immediately on the first success.
- To verify the fix, run through the reproduction steps and check that the provider is re-enabled after the credit is topped up.
- To mitigate the issue temporarily, use the ~/bin/opus-reset script to reset the auth state and restart the gateway.
- Review the proposed fixes and choose the one that best fits the requirements, weighing simplicity, effectiveness, and potential impact on the system.


Notes

The chosen solution should be tested thoroughly to ensure it resolves the issue without introducing new problems. The periodic health check should be configured to run at a reasonable interval to avoid overwhelming the provider with requests.

Recommendation

Apply workaround: use the ~/bin/opus-reset script to reset the auth state and restart the gateway until a permanent fix is implemented. This provides a temporary solution to recover from the issue without requiring manual editing of the auth-state.json file.
