openclaw - 💡(How to fix) Fix v2026.3.28: Runaway API calls burned entire monthly budget across all providers without user action [1 participant]

openclaw · 2026-04-03 17:34:19


GitHub stats

openclaw/openclaw#60450•Fetched 2026-04-08 02:51:01

Author: jarvisaibowen

Participants: jarvisaibowen

Summary

OpenClaw release 2026.3.28, installed on March 31, 2026, caused uncontrolled runaway API calls across all configured LLM providers simultaneously on March 31 and April 1, 2026. No user interaction triggered them; they ran autonomously. The entire monthly budget across Anthropic, OpenAI, Google (Gemini), and xAI (Grok) was consumed in approximately 2 days.

Environment

OpenClaw version: 2026.3.28 (upgraded from 2026.3.24 on March 31)
OS: macOS Darwin 24.6.0 (arm64) — Mac Studio
Gateway: launchd-managed, running locally
Providers affected: Anthropic, OpenAI, Google Gemini, xAI Grok, Ollama

What happened

Upgraded from 2026.3.24 → 2026.3.28 on March 31
Shortly after, observed massive API spend across all providers
Auth profiles began returning 401 errors (tokens invalidated or rotated unexpectedly)
OpenClaw's failover system cycled through all providers repeatedly in a loop
Each failed attempt billed the next provider in the fallback chain
This continued autonomously over ~48 hours (March 31 – April 1)
Monthly budget caps were hit on ALL providers

Evidence from logs

Repeated pattern observed in gateway logs:

embedded_run_agent_end: HTTP 401 authentication_error: Invalid bearer token
auth_profile_failure_state_updated: cooldown set
embedded_run_failover_decision: rotate_profile
model_fallback_decision: candidate_failed → trying next provider

This cycle repeated hundreds of times across all providers simultaneously with no user-facing alert or pause.
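The four event names in the excerpt above can be counted in a saved gateway log to gauge how many times the loop ran. This is a minimal sketch. It assumes each event appears on a log line containing exactly the text shown. The real log format, the log location, and the helper names `count_cycle_events` and `estimated_cycles` are assumptions and are not part of OpenClaw. The cycle ran across providers simultaneously, so lines may interleave. For that reason the sketch counts each event separately instead of matching the four lines in strict order.

```python
from collections import Counter

# Event names copied from the gateway log excerpt in the issue. Treating each
# one as a plain substring of a log line is an assumption about the log format.
CYCLE_EVENTS = (
    "embedded_run_agent_end: HTTP 401",
    "auth_profile_failure_state_updated: cooldown set",
    "embedded_run_failover_decision: rotate_profile",
    "model_fallback_decision: candidate_failed",
)


def count_cycle_events(lines):
    """Count how many log lines contain each event in CYCLE_EVENTS."""
    counts = Counter()
    for line in lines:
        for event in CYCLE_EVENTS:
            if event in line:
                counts[event] += 1
    return counts


def estimated_cycles(counts):
    """Each full cycle emits every event once, so the rarest event bounds the cycle count."""
    return min(counts[event] for event in CYCLE_EVENTS)
```

With a saved log, `with open("gateway.log", encoding="utf-8") as fh: print(estimated_cycles(count_cycle_events(fh)))` would give a rough count. The name `gateway.log` is a placeholder, because the issue does not say where the launchd-managed gateway writes its logs. If one event's count is far below the others, the log format probably differs from the excerpt.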

Root cause hypothesis

In 2026.3.28, three problems appear to have combined:

Existing auth profile tokens were invalidated or corrupted, causing immediate 401s on all Anthropic profiles
The failover/retry system had no effective rate limiting or spend guard, so it cycled through all providers indefinitely
No user-facing alert was generated despite runaway spending across all providers

Impact

Financial: Entire monthly API budget consumed across 4 paid providers in ~48 hours
Service: The assistant was non-functional for 2+ days and required extensive manual repair
Trust: Restoring service required deep diagnosis of auth-profiles.json and gateway logs, plus manual config surgery

Suggested fixes

Hard failover limit: Stop after N total cross-provider failover attempts per session
Spend rate guard: If N failed auth attempts occur within a short window, pause all API calls and alert the user immediately
Auth invalidation alert: If ALL profiles for a provider return 401 simultaneously, surface an immediate user-facing alert rather than silently retrying
Billing lockout default: The default billing lockout should be 24h, not 5h; we had to set it manually
Release notes: Breaking auth changes should be prominently flagged in release notes
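The hard failover limit and the all-profiles-401 alert could work together as one guard around the rotation loop. The sketch below is a hypothetical illustration and is not OpenClaw's failover code. The names `FailoverGuard`, `run_with_failover`, and `max_attempts` are all assumptions, as is the idea that `call` returns a plain integer HTTP status. The guard stops rotation in two cases. The first is when every configured profile for a provider has returned 401, because the credentials are then probably broken and retrying elsewhere only spends money. The second is when the session has used up its attempt budget.

```python
from dataclasses import dataclass, field


class FailoverHalted(Exception):
    """Raised to stop failover; the message is intended for a user-facing alert."""


@dataclass
class FailoverGuard:
    max_attempts: int  # hypothetical per-session cap on failed attempts
    profiles_per_provider: dict  # provider name -> number of configured auth profiles
    attempts: int = 0
    auth_failures: dict = field(default_factory=dict)  # provider -> profiles that returned 401

    def record_failure(self, provider, profile, status):
        self.attempts += 1
        if status == 401:
            failed = self.auth_failures.setdefault(provider, set())
            failed.add(profile)
            # Unknown providers are treated as having a single profile.
            if len(failed) >= self.profiles_per_provider.get(provider, 1):
                raise FailoverHalted(
                    f"all {len(failed)} auth profiles for {provider} returned 401; check credentials"
                )
        if self.attempts >= self.max_attempts:
            raise FailoverHalted(f"stopped after {self.attempts} failed attempts this session")


def run_with_failover(candidates, call, guard):
    """Try (provider, profile) pairs in order until one succeeds or the guard halts."""
    for provider, profile in candidates:
        status = call(provider, profile)
        if status == 200:
            return provider, profile
        guard.record_failure(provider, profile, status)
    raise FailoverHalted("no candidate succeeded")
```

Passing `itertools.cycle(...)` as `candidates` models the endless rotation described in the issue. The guard, not the candidate list, is what ends it.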

Workarounds applied

Manually identified and replaced corrupted auth profiles in ~/.openclaw/agents/*/agent/auth-profiles.json
Set auth.cooldowns.billingBackoffHours: 24
Set auth.cooldowns.overloadedProfileRotations: 1
Added daily spend monitoring cron job as a custom guard
Set hard monthly caps at each provider billing console
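The first workaround, finding damaged `auth-profiles.json` files, can be partly automated. This sketch checks only that each file matching the path pattern from the issue parses as JSON. It assumes nothing about the file's schema. It also cannot tell whether a token inside is still valid; only a request to the provider would show that. The function name `find_unreadable_auth_profiles` is hypothetical.

```python
import json
from pathlib import Path


def find_unreadable_auth_profiles(openclaw_home):
    """Return (path, error) for each agent auth-profiles.json that cannot be read as JSON.

    Uses the same layout as the issue: <openclaw_home>/agents/*/agent/auth-profiles.json.
    Well-formed JSON can still hold revoked or rotated tokens; this does not detect those.
    """
    problems = []
    for path in sorted(Path(openclaw_home).glob("agents/*/agent/auth-profiles.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
        except (OSError, ValueError) as exc:  # ValueError covers JSON and UTF-8 decode errors
            problems.append((str(path), str(exc)))
    return problems
```

On the affected machine, the call would be `find_unreadable_auth_profiles(Path.home() / ".openclaw")`. It returns an empty list when every file parses.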

Stable version

2026.3.24 — confirmed running without issues before the March 31 upgrade.

Extent analysis

TL;DR

Implementing a hard failover limit and spend rate guard can help prevent uncontrolled runaway API calls across all configured LLM providers.

Guidance

Review the release notes for any breaking auth changes before upgrading to a new version.
Set a hard failover limit to stop after a certain number of total cross-provider failover attempts per session.
Implement a spend rate guard to pause all API calls and alert the user immediately if a certain number of failed auth attempts occur within a short window.
Consider raising the billing lockout to 24 hours (auth.cooldowns.billingBackoffHours: 24) instead of relying on the 5-hour default, to limit excessive spending.
Monitor daily spend and set hard monthly caps at each provider's billing console.
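The spend rate guard described in the guidance could take the form of a sliding-window counter of failed auth attempts. The sketch below is hypothetical. The class name, the thresholds, and the assumption that callers check `allow_request()` before every provider call are guesses about how it might be wired in; none of them are existing OpenClaw settings. The injectable `clock` lets the guard be tested without real waiting.

```python
import time
from collections import deque


class AuthFailureRateGuard:
    """Trip after `max_failures` failed auth attempts within `window_seconds`."""

    def __init__(self, max_failures, window_seconds, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = deque()  # timestamps of recent failures, oldest first
        self.tripped = False

    def record_failure(self):
        """Record one failed auth attempt; return True if the guard is now tripped."""
        now = self.clock()
        self.failures.append(now)
        # Drop failures that have aged out of the window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.tripped = True  # caller should pause all API calls and alert the user
        return self.tripped

    def allow_request(self):
        return not self.tripped

    def reset(self):
        """Re-arm the guard, e.g. after someone has fixed the credentials."""
        self.failures.clear()
        self.tripped = False
```

Once tripped, the guard stays tripped until `reset()` is called. API calls therefore stay paused until someone reviews the credentials. This matches the issue's request to pause and alert instead of silently retrying.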

Example

The issue does not include a code snippet or point to specific code changes.

Notes

The suggested fixes and workarounds are based on the information provided in the issue and may not be applicable to all scenarios. It is essential to thoroughly test any changes before implementing them in a production environment.

Recommendation

Apply the workarounds listed in the issue, such as a 24-hour billing backoff, hard monthly caps at each provider's billing console, and daily spend monitoring, to limit exposure now. A hard failover limit and spend rate guard would have to be built into OpenClaw itself. Workarounds are recommended because they take effect immediately, whereas a fixed release may not be available yet.
