openclaw - ✅(Solved) Fix Auth key rotation not triggered by 400 billing errors (out of extra usage) [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#62375 (fetched 2026-04-08 03:05:18)
Comments: 0
Participants: 1
Timeline: 0
Reactions: 0

Error Message

2026-04-07T07:25:56.628+02:00 LLM request rejected: You're out of extra usage...
2026-04-07T07:36:13.288+02:00 LLM request rejected: You're out of extra usage...
... (repeated every 5-10 min for 2+ hours, no rotation attempted)

PR fix notes

PR #66315: fix(agents): detect billing from FallbackSummaryError structured reasons

Description (problem / solution / changelog)

Summary

  • Problem: When auth profiles enter billing cooldown, model-fallback skips all candidates with "Provider X has billing issue (skipping all models)". This message does not match any pattern in isBillingErrorMessage(), so users see "Something went wrong while processing your request" instead of BILLING_ERROR_USER_MESSAGE.
  • Why it matters: Only the very first billing failure (containing the raw "out of extra usage" text) is detected correctly. Every subsequent attempt, which accounts for most of what users see, falls through to the generic error path. This is especially common for OAuth users who exhaust their personal Anthropic "extra usage" quota.
  • What changed: Added isPureBillingSummary() in agent-runner-execution.ts that checks the structured attempt.reason === "billing" on FallbackSummaryError, used as a fast-path before isBillingErrorMessage() string matching. Added test covering this path.
  • What did NOT change (scope boundary): isBillingErrorMessage() patterns in failover-matches.ts are untouched. BILLING_ERROR_USER_MESSAGE text is unchanged. No changes to model-fallback logic itself.
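The structured check described under "What changed" can be sketched as follows. This is a minimal Python illustration, not the TypeScript code from the PR: the `Attempt` shape and its field names are assumptions, and treating an empty attempt list as "not billing" is a choice made here rather than something the PR states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    """One fallback attempt; field names are assumed for illustration."""
    model: str
    reason: str  # e.g. "billing" or "rate_limit"


def is_pure_billing_summary(attempts: List[Attempt]) -> bool:
    """Return True only when there is at least one attempt and every one is billing."""
    return bool(attempts) and all(a.reason == "billing" for a in attempts)


pure = [
    Attempt("anthropic/claude-opus-4-6", "billing"),
    Attempt("anthropic/claude-sonnet-4-6", "billing"),
]
mixed = [
    Attempt("anthropic/claude-opus-4-6", "billing"),
    Attempt("anthropic/claude-sonnet-4-6", "rate_limit"),
]

print(is_pure_billing_summary(pure))   # True
print(is_pure_billing_summary(mixed))  # False
print(is_pure_billing_summary([]))     # False
```

Checking the structured reason avoids depending on the wording of the formatted message, which is what the string-matching path is sensitive to.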

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #66314
  • Related #48526, #64224, #64308, #62375, #61608
  • This PR fixes a bug or regression

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: agent-runner-execution.ts line 1337 classifies billing errors using isBillingErrorMessage(message), which string-matches the formatted FallbackSummaryError message. When all candidates are skipped due to billing cooldown, the message contains "has billing issue (skipping all models)", and none of the patterns in ERROR_PATTERNS.billing match this string.
  • Missing detection / guardrail: The rate-limit path already uses structured FallbackSummaryError data via isPureTransientRateLimitSummary() (checking attempt.reason), but the billing path had no equivalent and relied solely on string matching.
  • Contributing context (if known): PR #61608 added "out of extra usage" to the billing patterns, fixing detection for the raw Anthropic API error. But the cooldown-generated skip message uses different wording that was never added to the pattern list.
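The gap in string matching can be reproduced with a small Python sketch. The pattern tuple below is a hypothetical stand-in: only "out of extra usage" is confirmed by the PR text (added in PR #61608), and the real ERROR_PATTERNS.billing list is not shown on this page.

```python
# Hypothetical subset of the billing patterns; the real list is not shown here.
BILLING_PATTERNS = ("out of extra usage",)


def is_billing_error_message(message: str) -> bool:
    """Case-insensitive substring match against the known billing patterns."""
    lowered = message.lower()
    return any(pattern in lowered for pattern in BILLING_PATTERNS)


# First failure: the raw provider error text.
raw_provider_error = (
    "You're out of extra usage. Add more at claude.ai/settings/usage and keep going."
)
# Later failures: the message generated by the cooldown skip.
cooldown_skip = "Provider anthropic has billing issue (skipping all models)"

print(is_billing_error_message(raw_provider_error))  # True
print(is_billing_error_message(cooldown_skip))       # False
```

Under these assumptions the first failure is classified correctly, while the cooldown skip message slips past the matcher even though its cause is the same.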

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/agent-runner-execution.test.ts
  • Scenario the test should lock in: FallbackSummaryError where every attempt.reason === "billing" (cooldown skip path) must return BILLING_ERROR_USER_MESSAGE, not the generic fallback.
  • Why this is the smallest reliable guardrail: Unit test directly verifies the classification logic in the catch block without needing a live provider or gateway.
  • Existing test that already covers this (if any): "does not show a rate-limit countdown for mixed-cause fallback exhaustion" covers the mixed billing+rate_limit case (which should NOT be classified as pure billing). The new test covers the pure-billing case.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

When all model candidates are skipped due to billing cooldown, users now see:

⚠️ API provider returned a billing error — your API key has run out of credits or has an insufficient balance. Check your provider's billing dashboard and top up or switch to a different API key.

Instead of:

⚠️ Something went wrong while processing your request. Please try again, or use /new to start a fresh session.

Diagram (if applicable)

For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write N/A.

Before:
[billing cooldown] -> FallbackSummaryError("...has billing issue...")
    -> isBillingErrorMessage() = false
    -> buildExternalRunFailureText()
    -> "Something went wrong"

After:
[billing cooldown] -> FallbackSummaryError(attempts[].reason="billing")
    -> isPureBillingSummary() = true
    -> BILLING_ERROR_USER_MESSAGE
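
The before/after flow in the diagram can be approximated in Python as below. This is a simplified sketch under stated assumptions: the message constants are placeholders rather than the real texts, `choose_user_message` is a hypothetical name, and the other branches of the real catch block (rate limits and so on) are left out.

```python
from typing import Optional, Sequence

# Placeholder texts; the real constants live in the OpenClaw source.
BILLING_ERROR_USER_MESSAGE = "billing error (placeholder text)"
GENERIC_FAILURE_MESSAGE = "something went wrong (placeholder text)"


def choose_user_message(message: str, reasons: Optional[Sequence[str]] = None) -> str:
    """Pick the user-facing text for a failed run (simplified)."""
    # Fast path: structured reasons, available when the error is a fallback summary.
    if reasons and all(reason == "billing" for reason in reasons):
        return BILLING_ERROR_USER_MESSAGE
    # Fallback: string matching on the formatted message (one known pattern only).
    if "out of extra usage" in message.lower():
        return BILLING_ERROR_USER_MESSAGE
    # Other branches of the real catch block are omitted in this sketch.
    return GENERIC_FAILURE_MESSAGE


skip = "Provider anthropic has billing issue (skipping all models)"
print(choose_user_message(skip))                          # generic: the "Before" path
print(choose_user_message(skip, ["billing", "billing"]))  # billing: the "After" path
```

Calling the function without reasons mimics the old behavior, where only the formatted string was available to the classifier.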

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Debian 12 (Bookworm) — Docker container (ghcr.io/openclaw/openclaw)
  • Runtime/container: Docker
  • Model/provider: anthropic/claude-opus-4-6, anthropic/claude-sonnet-4-6 (OAuth)
  • Integration/channel (if any): Web UI / gateway
  • Relevant config (redacted): Single-provider setup with Anthropic OAuth authentication

Steps

  1. Configure OpenClaw with Anthropic provider using OAuth authentication.
  2. The personal "extra usage" quota on claude.ai becomes exhausted.
  3. A background/scheduled agent task hits the billing error first, putting the auth profile into billing cooldown.
  4. User sends a message; all candidates are skipped due to billing cooldown.
  5. Observe the user-facing error message.

Expected

  • User sees BILLING_ERROR_USER_MESSAGE on every billing failure, including cooldown skips.

Actual

  • User sees generic "Something went wrong while processing your request" with no billing context.

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
# Cooldown skip path: isBillingErrorMessage() misses this:
[model-fallback] decision=skip_candidate reason=billing detail=Provider anthropic has billing issue (skipping all models)
Embedded agent failed before reply: All models failed (2): anthropic/claude-opus-4-6: Provider anthropic has billing issue (skipping all models) (billing) | anthropic/claude-sonnet-4-6: Provider anthropic has billing issue (skipping all models) (billing)

Test results: 31/31 pass in agent-runner-execution.test.ts, 150/150 pass in related billing/error tests.

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: New test confirms pure-billing FallbackSummaryError returns BILLING_ERROR_USER_MESSAGE. Ran full test suites for agent-runner-execution.test.ts (31 pass), pi-embedded-helpers.isbillingerrormessage.test.ts and formatassistanterrortext.test.ts (150 pass).
  • Edge cases checked: Mixed-cause summary (billing + rate_limit) is NOT classified as pure billing; the existing test "does not show a rate-limit countdown for mixed-cause fallback exhaustion" still passes.
  • What you did not verify: Live end-to-end test with an actual exhausted OAuth account. The fix is a classification-only change with no side effects on the model-fallback or auth-profile logic.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps:

Risks and Mitigations

None

Changed files

  • src/auto-reply/reply/agent-runner-execution.test.ts (modified, +59/-0)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +15/-1)


Bug

When an Anthropic API key returns a 400 error with message:

You're out of extra usage. Add more at claude.ai/settings/usage and keep going.

The gateway does not rotate to the next key in auth.order.anthropic. It fails the request immediately.

Expected Behavior

The gateway should treat billing/quota exhaustion errors (400 with "out of extra usage") the same as 429 rate limits — try the next key in auth.order before failing.

Current Behavior

  • auth.order.anthropic has 6 keys configured: [key1, key2, key3, key4, key5, key6]
  • When key1 returns 400 billing error, the gateway stops and returns the error to the agent
  • No attempt is made to try key2..key6
  • Meanwhile, other keys on different billing accounts still have credits
  • Result: agents fail repeatedly on every task while working keys sit unused

Impact

All agents (15 in our setup) running on anthropic/claude-opus-4-6 were blocked for hours. Parts Molty failed ~20+ tasks in a row, each dying in <500ms with $0 cost.

Logs

2026-04-07T07:25:56.628+02:00 LLM request rejected: You're out of extra usage...
2026-04-07T07:36:13.288+02:00 LLM request rejected: You're out of extra usage...
... (repeated every 5-10 min for 2+ hours, no rotation attempted)

Environment

  • OpenClaw 2026.4.5
  • macOS 26.2 (arm64)
  • 6 Anthropic auth profiles in rotation order
  • All agents using anthropic/claude-opus-4-6 with no fallbacks

Suggested Fix

In the LLM request handler, when a 400 is received with a billing/quota message, treat it as a retryable-on-next-key error (same path as 429). Specifically check for:

  • out of extra usage
  • billing
  • quota
  • credit

in the 400 error body before falling through to the next auth profile.
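
A rotation loop along the lines of this suggestion might look like the Python sketch below. It is not OpenClaw's implementation: `send_with_rotation`, `should_try_next_key`, and `fake_send` are hypothetical names, the response is modelled as a plain (status, body) tuple, and the marker list simply mirrors the phrases listed above.

```python
from typing import Callable, Iterable, Tuple

BILLING_MARKERS = ("out of extra usage", "billing", "quota", "credit")

Response = Tuple[int, str]  # (HTTP status, body text)


def should_try_next_key(status: int, body: str) -> bool:
    """429 always rotates; 400 rotates only when the body looks like billing/quota."""
    if status == 429:
        return True
    lowered = body.lower()
    return status == 400 and any(marker in lowered for marker in BILLING_MARKERS)


def send_with_rotation(keys: Iterable[str], send: Callable[[str], Response]) -> Response:
    """Try each key in order and return the first response that is not retryable."""
    last: Response = (0, "no auth keys configured")
    for key in keys:
        last = send(key)
        if not should_try_next_key(*last):
            return last
    # Every key was exhausted: surface the last error to the caller.
    return last


def fake_send(key: str) -> Response:
    """Simulated provider: key1 and key2 are out of credit, other keys work."""
    if key in ("key1", "key2"):
        return (400, "You're out of extra usage. Add more at claude.ai/settings/usage and keep going.")
    return (200, f"ok via {key}")


print(send_with_rotation(["key1", "key2", "key3"], fake_send))  # (200, 'ok via key3')
```

Broad markers such as "credit" could also match unrelated 400 bodies, so a real handler would likely want a narrower check than this sketch uses.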

extent analysis

TL;DR

Modify the LLM request handler to treat 400 errors with billing/quota messages as retryable-on-next-key errors, similar to 429 rate limits.

Guidance

  • Check the error message body for specific phrases ("out of extra usage", "billing", "quota", "credit") when a 400 error is received from the Anthropic API.
  • If a match is found, attempt to rotate to the next key in auth.order.anthropic before failing the request.
  • Verify that the gateway is correctly configured to use the auth.order.anthropic list and that all keys are properly set up.
  • Test the modified handler with a simulated 400 error to ensure it correctly rotates to the next key.

Example

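BILLING_MARKERS = ("out of extra usage", "billing", "quota", "credit")

def handle_response(response, get_next_key, retry_request, fail_request):
    """Rotate to the next key when a 400 response carries a billing/quota message.

    get_next_key, retry_request and fail_request are placeholder callables
    supplied by the caller; they are not part of any real gateway API.
    """
    body = response.text.lower()  # match case-insensitively
    if response.status_code == 400 and any(m in body for m in BILLING_MARKERS):
        # Rotate to next key in auth.order.anthropic
        next_key = get_next_key()
        if next_key:
            # Retry request with next key
            return retry_request(next_key)
        # No more keys available, fail request
        return fail_request()
    # Not a billing error: leave the response to the normal handling path
    return response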

Notes

This fix assumes that the auth.order.anthropic list is correctly configured and that all keys are properly set up. Additionally, this fix may not work if the Anthropic API changes its error messaging or behavior.

Recommendation

Apply the suggested fix to the LLM request handler to treat 400 errors with billing/quota messages as retryable-on-next-key errors, allowing the gateway to rotate to the next key in auth.order.anthropic before failing the request. This should help prevent agents from being blocked due to a single key's billing/quota issues.
