openclaw - ✅(Solved) Fix feat: Provider circuit breaker — detect quota exhaustion and auto-trip fallback [2 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#64085 · Fetched 2026-04-11 06:16:25
Comments: 0
Participants: 1
Timeline: 6
Reactions: 0
Timeline (top): cross-referenced ×4, mentioned ×1, subscribed ×1

Error Message

  • Gateway log evidence: [agent] embedded run agent end: isError=true model=gemini-3-pro-preview error=API rate limit reached
  • Provider response: Gemini 429 "You exceeded your current quota"

Root Cause

  • Real incident: Gemini free-tier hit daily token limit on 2026-04-10, silently broke workflow for hours
  • Gateway log evidence: [agent] embedded run agent end: isError=true model=gemini-3-pro-preview error=API rate limit reached
  • The [model-fallback] system handled the immediate request but didn't prevent re-trying the same dead provider on subsequent requests
  • Copilot and Anthropic already have partial detection via auth profile failure tracking

Fix Action

Fixed

PR fix notes

PR #64127: feat: Provider circuit breaker for quota exhaustion

Description (problem / solution / changelog)

Resolves #64085

This PR introduces proper handling for daily/weekly/monthly quota exhaustion errors:

  1. Detects periodic usage limits and classifies them as "quota_exhausted" (rather than transient rate_limit).
  2. Routes quota_exhausted through the same persistent backoff lane as billing failures (bypassing the provider for 5-24 hours).
  3. Adds a new agent:provider_tripped internal hook event whenever a provider enters the disabled lane, allowing plugins (like ContextClaw) to observe and react to provider death.

Tested via local inspection; handles the Gemini 429 loops by correctly stepping back for the day.
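
The classification step described above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #64127: the names classify429, backoffMs, and QUOTA_PATTERNS, the message patterns, the 60-second transient cooldown, and the reading of "5-24 hours" as a clamp are all assumptions.

```typescript
// Hypothetical names throughout; these are not the identifiers used in PR #64127.
type FailureReason = "rate_limit" | "quota_exhausted";

// Phrases that suggest a periodic (daily/weekly/monthly) limit. Illustrative only.
const QUOTA_PATTERNS: RegExp[] = [
  /exceeded your current quota/i,
  /\b(daily|weekly|monthly)\b.*\b(limit|quota)\b/i,
];

// Decide whether a 429 message looks like quota exhaustion or a transient spike.
function classify429(message: string): FailureReason {
  return QUOTA_PATTERNS.some((p) => p.test(message)) ? "quota_exhausted" : "rate_limit";
}

const HOUR_MS = 60 * 60 * 1000;

// Transient limits get a short cooldown (assumed 60 s); quota exhaustion gets a
// long disabled window, clamped here to 5-24 hours (an assumed interpretation).
function backoffMs(reason: FailureReason, msUntilReset?: number): number {
  if (reason === "rate_limit") return 60 * 1000;
  const wanted = msUntilReset ?? 24 * HOUR_MS;
  return Math.min(24 * HOUR_MS, Math.max(5 * HOUR_MS, wanted));
}
```

Keeping classification separate from the backoff decision means new provider messages only require a new pattern, while the backoff policy stays in one place.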

Changed files

  • src/agents/auth-profiles/state-observation.ts (modified, +18/-1)
  • src/agents/auth-profiles/types.ts (modified, +1/-0)
  • src/agents/auth-profiles/usage.ts (modified, +6/-2)
  • src/agents/failover-error.ts (modified, +1/-0)
  • src/agents/failover-policy.ts (modified, +1/-0)
  • src/agents/model-fallback.ts (modified, +1/-1)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +1/-1)
  • src/agents/pi-embedded-helpers/types.ts (modified, +1/-0)
  • src/gateway/server-plugins.test.ts (modified, +1/-0)
  • src/gateway/test-helpers.plugin-registry.ts (modified, +1/-0)
  • src/hooks/internal-hooks.ts (modified, +25/-0)
  • src/plugin-sdk/plugin-entry.ts (modified, +2/-0)
  • src/plugins/api-builder.ts (modified, +2/-0)
  • src/plugins/registry-empty.ts (modified, +1/-0)
  • src/plugins/registry-types.ts (modified, +7/-0)
  • src/plugins/registry.ts (modified, +9/-0)
  • src/plugins/status.test-helpers.ts (modified, +1/-0)
  • src/plugins/types.ts (modified, +11/-0)
  • src/test-utils/channel-plugins.ts (modified, +1/-0)
  • src/tui/tui-session-actions.ts (modified, +3/-0)
  • src/tui/tui-types.ts (modified, +1/-0)
  • src/tui/tui.ts (modified, +5/-0)
  • test/helpers/plugins/plugin-api.ts (modified, +1/-0)

PR #64436: feat: expose model pricing to plugins via runtime.usage API

Description (problem / solution / changelog)

Summary

Adds runtime.usage namespace to the plugin runtime, exposing resolveModelCostConfig() and estimateUsageCost() so context engine plugins can calculate real dollar savings using actual model pricing instead of hardcoded heuristics.

Motivation

Context engine plugins (like ContextClaw) truncate stale/large context to save tokens. Today they can only estimate savings using a fixed $3/M-token heuristic. The gateway already has per-model pricing via resolveModelCostConfig — this PR just threads it through the plugin runtime so plugins can use it.
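
To make the difference between the fixed heuristic and per-model pricing concrete, here is a small arithmetic sketch. It does not call the real resolveModelCostConfig() or estimateUsageCost(), whose signatures are not shown in the source; the PricePerMillion shape and both function names are hypothetical.

```typescript
// Hypothetical pricing shape; the real return type of resolveModelCostConfig()
// is not shown in the source.
interface PricePerMillion {
  input: number;  // USD per 1M input tokens
  output: number; // USD per 1M output tokens
}

// The fixed $3/M-token heuristic mentioned in the text.
const FLAT_HEURISTIC_USD_PER_M = 3;

function savingsUsd(tokensSaved: number, usdPerMillion: number): number {
  return (tokensSaved / 1e6) * usdPerMillion;
}

// Truncated context reduces input tokens, so the input price is used when
// per-model pricing is available; otherwise fall back to the flat heuristic.
function estimateSavings(tokensSaved: number, pricing?: PricePerMillion): number {
  const rate = pricing ? pricing.input : FLAT_HEURISTIC_USD_PER_M;
  return savingsUsd(tokensSaved, rate);
}
```

With these assumed numbers, trimming 500,000 tokens is worth $1.50 under the flat heuristic but $0.625 for a model priced at $1.25 per million input tokens, which is the kind of gap the runtime.usage API is meant to close.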

Changes (24 lines, 4 files)

  • src/plugins/runtime/runtime-usage.ts — New module exposing resolveModelCostConfig and estimateUsageCost (follows existing runtime-*.ts pattern)
  • src/plugins/runtime/types-core.ts — Added usage namespace to PluginRuntimeCore
  • src/plugins/runtime/index.ts — Wired createRuntimeUsage() into assembly
  • test/helpers/plugins/plugin-runtime-mock.ts — Updated mock

Testing

  • tsgo --noEmit clean (only pre-existing errors in msteams + child.test.ts)
  • Follows exact pattern of existing runtime modules (runtime-config.ts, runtime-agent.ts, etc.)
  • No new dependencies

Related

  • Issue #64085 — Provider circuit breaker (uses this to show real cost in TUI)
  • PR #64127 — registerStatusProvider API (companion feature)

Changed files

  • src/plugins/runtime/index.ts (modified, +2/-0)
  • src/plugins/runtime/runtime-usage.ts (added, +9/-0)
  • src/plugins/runtime/types-core.ts (modified, +6/-0)
  • test/helpers/plugins/plugin-runtime-mock.ts (modified, +8/-0)

Raw issue text

Problem

When a provider hits its daily quota limit (e.g., Gemini free-tier), the gateway treats repeated 429s as transient errors and keeps retrying. The fallback chain handles individual request failures, but there is no mechanism to:

  1. Detect that a provider is persistently failing (quota exhausted, not a transient spike)
  2. Mark the provider as "tripped" so it gets skipped entirely in the fallback chain
  3. Notify the user that a provider is down for the day
  4. Auto-reset the trip at midnight or after a configurable cooldown

Current behavior

  • Gemini 429 "You exceeded your current quota" → gateway retries → eventually falls back per-request
  • Sub-agents and cron jobs that pin a specific model don't benefit from the per-request fallback
  • No user notification — workflow silently degrades or breaks
  • Hours of lost work before the user discovers the issue

Existing partial solutions

The gateway already tracks auth profile failures for some providers:

[agent] auth profile failure state updated: provider=github-copilot reason=format window=cooldown
[agent] auth profile failure state updated: provider=anthropic reason=billing window=disabled

But Google/Gemini quota errors are not classified as auth/billing failures — they come back as generic 429s in the model routing layer.

Proposed solution

1. Provider health tracker (gateway-level)

Track consecutive failures per provider keyed by error category:

  • 429 + quota-related message → increment quota failure counter
  • After N consecutive quota failures (configurable, default 3) → mark provider as "tripped"
  • Tripped providers are skipped in fallback chain resolution
  • Auto-reset at configurable interval (default: midnight UTC, or provider-specific reset windows)
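
The four bullets above could be sketched as a small in-memory tracker. This is one possible shape under stated assumptions, not the gateway's implementation: ProviderHealthTracker, its method names, and the fixed resetAfterMs cooldown are all hypothetical, and time is passed in explicitly so the logic is easy to test.

```typescript
// Hypothetical sketch; class and method names are not from the OpenClaw codebase.
interface TrackerOptions {
  tripAfterFailures: number; // N consecutive quota failures before tripping
  resetAfterMs: number;      // cooldown before a tripped provider is retried
}

interface ProviderState {
  consecutiveQuotaFailures: number;
  trippedUntil: number | null; // epoch ms; null while healthy
}

class ProviderHealthTracker {
  private readonly states = new Map<string, ProviderState>();
  private readonly opts: TrackerOptions;

  constructor(opts: TrackerOptions) {
    this.opts = opts;
  }

  private stateFor(provider: string): ProviderState {
    let s = this.states.get(provider);
    if (!s) {
      s = { consecutiveQuotaFailures: 0, trippedUntil: null };
      this.states.set(provider, s);
    }
    return s;
  }

  // Record a 429 with a quota-related message. Returns true only on the call
  // that trips the provider, so the caller can emit a single hook event.
  recordQuotaFailure(provider: string, now: number): boolean {
    const s = this.stateFor(provider);
    s.consecutiveQuotaFailures += 1;
    if (s.trippedUntil === null && s.consecutiveQuotaFailures >= this.opts.tripAfterFailures) {
      s.trippedUntil = now + this.opts.resetAfterMs;
      return true;
    }
    return false;
  }

  // Any successful request breaks the run of consecutive failures.
  recordSuccess(provider: string): void {
    this.stateFor(provider).consecutiveQuotaFailures = 0;
  }

  // Auto-resets the provider once its cooldown has elapsed.
  isTripped(provider: string, now: number): boolean {
    const s = this.stateFor(provider);
    if (s.trippedUntil !== null && now >= s.trippedUntil) {
      s.trippedUntil = null;
      s.consecutiveQuotaFailures = 0;
    }
    return s.trippedUntil !== null;
  }

  // Drop tripped providers from a fallback chain, preserving order.
  filterChain(chain: string[], now: number): string[] {
    return chain.filter((p) => !this.isTripped(p, now));
  }
}
```

A caller would run filterChain over the configured fallback chain before each request, so a tripped provider is skipped without spending a request on it.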

2. Hook event: model:provider-tripped

Emit a new internal hook event when a provider trips:

createInternalHookEvent("model", "provider-tripped", sessionKey, {
  provider: "google",
  model: "gemini-3.1-pro-preview",
  reason: "quota_exceeded",
  consecutiveFailures: 3,
  nextReset: "2026-04-11T04:00:00Z"
})

This lets plugins/hooks react (send notifications, log, adjust behavior).
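
On the consuming side, a plugin might turn the payload into a user-facing notice. The sketch below only formats a message; how a handler is registered for the event is not shown in the source, so no registration call is assumed. The ProviderTrippedPayload interface mirrors the fields in the proposed event, and formatTripNotice is a hypothetical helper.

```typescript
// Field names copied from the proposed event payload; the helper is hypothetical.
interface ProviderTrippedPayload {
  provider: string;
  model: string;
  reason: string;
  consecutiveFailures: number;
  nextReset: string; // ISO 8601 timestamp
}

// Build a one-line notice with the time remaining until the provider resets.
function formatTripNotice(p: ProviderTrippedPayload, now: Date): string {
  const msLeft = Math.max(0, new Date(p.nextReset).getTime() - now.getTime());
  const hours = Math.floor(msLeft / 3600000);
  const minutes = Math.floor((msLeft % 3600000) / 60000);
  return (
    `Provider ${p.provider} (${p.model}) tripped: ${p.reason} after ` +
    `${p.consecutiveFailures} consecutive failures; retry in ${hours}h ${minutes}m`
  );
}
```

The same string could feed a notification, a log line, or the status output proposed in the next section.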

3. Status visibility

  • openclaw status should show provider health (tripped providers, cooldown remaining)
  • TUI footer could show provider health indicators

4. Config

{
  "agents": {
    "defaults": {
      "providerHealth": {
        "enabled": true,
        "tripAfterFailures": 3,
        "tripWindowMs": 300000,
        "resetPolicy": "midnight-utc",
        "notify": true
      }
    }
  }
}
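
As an illustration of how the resetPolicy value might be interpreted, the sketch below computes the next reset time for "midnight-utc", the only policy value that appears in the proposal. The ProviderHealthConfig interface mirrors the config keys above; nextMidnightUtc and nextReset are hypothetical helpers, and rejecting unknown policies is an assumed design choice.

```typescript
// Mirrors the proposed providerHealth config block; helper names are hypothetical.
interface ProviderHealthConfig {
  enabled: boolean;
  tripAfterFailures: number;
  tripWindowMs: number;
  resetPolicy: string;
  notify: boolean;
}

// Next 00:00:00 UTC strictly after `now`. Date.UTC normalizes day overflow,
// so month and year boundaries roll over correctly.
function nextMidnightUtc(now: Date): Date {
  return new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1));
}

// Only "midnight-utc" appears in the proposal; anything else is rejected here
// rather than guessed at.
function nextReset(cfg: ProviderHealthConfig, now: Date): Date {
  if (cfg.resetPolicy === "midnight-utc") return nextMidnightUtc(now);
  throw new Error(`Unsupported resetPolicy: ${cfg.resetPolicy}`);
}
```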

Context

  • Real incident: Gemini free-tier hit daily token limit on 2026-04-10, silently broke workflow for hours
  • Gateway log evidence: [agent] embedded run agent end: isError=true model=gemini-3-pro-preview error=API rate limit reached
  • The [model-fallback] system handled the immediate request but didn't prevent re-trying the same dead provider on subsequent requests
  • Copilot and Anthropic already have partial detection via auth profile failure tracking

Additional context

I'm building ContextClaw, a context engine plugin for OpenClaw. Happy to implement the plugin-side hook consumer and help with the gateway-side circuit breaker if pointed in the right direction.

/cc @steipete

extent analysis

TL;DR

Implement a provider health tracker to detect and handle persistently failing providers due to quota limits, and integrate it with the existing fallback chain and notification system.

Guidance

  • Introduce a quota failure counter for each provider, incrementing it when a 429 error with a quota-related message is encountered.
  • Mark a provider as "tripped" after a configurable number of consecutive quota failures, and skip it in the fallback chain.
  • Emit a model:provider-tripped internal hook event when a provider trips, allowing plugins to react and send notifications.
  • Add provider health visibility to the openclaw status command and TUI footer.

Example

// Example of emitting the model:provider-tripped hook event
createInternalHookEvent("model", "provider-tripped", sessionKey, {
  provider: "google",
  model: "gemini-3.1-pro-preview",
  reason: "quota_exceeded",
  consecutiveFailures: 3,
  nextReset: "2026-04-11T04:00:00Z"
})

Notes

The example shows the hook event as proposed in the issue. PR #64127 describes different identifiers: it classifies the failure as "quota_exhausted" (not quota_exceeded) and adds an agent:provider_tripped hook event (not model:provider-tripped), so check the PR before relying on the names above. The proposal touches gateway-level provider health tracking, fallback chain resolution, and user notification.

Recommendation

Implement the provider health tracker and integrate it with the existing fallback chain. This addresses the root cause, which is that a provider with an exhausted quota keeps being retried on subsequent requests, rather than only handling individual request failures.
