When all credentials return `model_cooldown` and `reset_seconds > 3600` (i.e. hours-long, not transient): 1. **Stop retrying immediately** — do not continue the ~20s retry loop 2. **Write a cooldown state marker** to the session (e.g. `cooling_until: `) so the session knows it is in cooldown 3. **Surface an alert** to the user via the configured delivery channel (e.g. Telegram message: "All credentials cooling down until . Will resume automatically.") 4. **Auto-resume** when `reset_seconds` has elapsed — either via a scheduled wake or on next gateway restart For short cooldowns (`reset_seconds < 3600`) the current retry behaviour may be acceptable, but hours-long cooldowns should be treated as a distinct failure mode requiring a hard stop + user notification.

openclaw - ✅(Solved) Fix [Bug]: No circuit breaker for model_cooldown — session retries indefinitely against hours-long cooldown [1 pull requests, 1 participants]

openclaw2026-04-06 02:24:54

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openclaw/openclaw#61622•Fetched 2026-04-08 02:56:41

View on GitHub

Comments

Participants

Timeline

Reactions

Author

quanyeomans

Participants

quanyeomans

Timeline (top)

referenced ×6labeled ×2cross-referenced ×1

When all configured credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability, OpenClaw has no circuit breaker. The session retries every ~20 seconds indefinitely, producing thousands of failed attempts and making the session completely unusable until manually reset.

Error Message

Multi-credential setup: multiple Anthropic OAuth tokens configured across separate agent accounts (builder, growth, consultant, shape, coach). All credentials share the same underlying Anthropic rate limit pool. When one credential hits model_cooldown, all credentials for that model return the same error — the gateway has no per-credential fallback for model_cooldown (only for overload/timeout errors).

Root Cause

Fix Action

Fixed

Fixed by PR: fix(agents): add model cooldown circuit breaker (#61622) (https://github.com/openclaw/openclaw/pull/61693)

PR fix notes

PR #61693: fix(agents): add model cooldown circuit breaker (#61622)

Repository: openclaw/openclaw
Author: Linux2010
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/61693

Description (problem / solution / changelog)

Summary

Fixes #61622

When all credentials return model_cooldown with long reset_seconds, sessions now enter a cooldown state and reject inbound messages with a clear "I'll auto-resume when ready" notice, preventing infinite retry loops that silently consume resources for hours.

Problem

As reported in #61622, a user experienced:

2658 retry attempts over 16 hours
Session transcript grew to 8MB
No circuit breaker to stop the loop
No user notification
No automatic recovery

Solution

Added a model cooldown circuit breaker that:

Detects long cooldowns: When FallbackSummaryError.soonestCooldownExpiry exceeds the configured threshold (default: 60 minutes)
Enters cooldown state: Writes modelCooldownUntil to session entry
Rejects inbound messages: Returns a clear message to users:

⚠️ All credentials are cooling down for ~2 hours. I'll auto-resume when ready. No need to resend.
Auto-recovers: Clears cooldown state when:
- Cooldown expires (next message triggers cleanup)
- A run succeeds after cooldown

Changes

SessionEntry: Added modelCooldownUntil field for cooldown state
Config: Added agents.defaults.llm.modelCooldownThresholdMinutes (default: 60)
Entry check: Reject messages when session is in active cooldown
State persistence: Write/read cooldown state to session store
Auto-clear: Clear on success or expiry

Configuration

agents:
  defaults:
    llm:
      modelCooldownThresholdMinutes: 60  # Default: 60 minutes

Set to 0 to disable the circuit breaker.

Tests

Added 4 new test cases:

Rejects inbound messages when session is in model cooldown state
Clears expired cooldown state at session entry
Writes cooldown state when FallbackSummaryError has long soonestCooldownExpiry
Clears cooldown state on successful run after cooldown

All 23 tests pass.

Checklist

Code follows project conventions
Tests added and passing
CHANGELOG.md updated
No breaking changes

Changed files

CHANGELOG.md (modified, +2/-0)
src/auto-reply/reply/agent-runner-execution.test.ts (modified, +209/-0)
src/auto-reply/reply/agent-runner-execution.ts (modified, +146/-11)
src/config/sessions/types.ts (modified, +8/-0)
src/config/types.agent-defaults.ts (modified, +9/-0)
src/config/zod-schema.agent-defaults.ts (modified, +8/-0)

RAW_BUFFERClick to expand / collapse

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Exhaust all claude-sonnet (or equivalent) credentials across all configured accounts so every credential returns 429 model_cooldown
Send any inbound message to an affected agent session
Observe the session retry every ~20 seconds with no backoff and no halt condition
Session becomes an infinite retry loop — never surfaces an alert, never pauses, never auto-recovers

Expected behavior

When all credentials return model_cooldown and reset_seconds > 3600 (i.e. hours-long, not transient):

Stop retrying immediately — do not continue the ~20s retry loop
Write a cooldown state marker to the session (e.g. cooling_until: <ISO timestamp>) so the session knows it is in cooldown
Surface an alert to the user via the configured delivery channel (e.g. Telegram message: "All credentials cooling down until <time>. Will resume automatically.")
Auto-resume when reset_seconds has elapsed — either via a scheduled wake or on next gateway restart

For short cooldowns (reset_seconds < 3600) the current retry behaviour may be acceptable, but hours-long cooldowns should be treated as a distinct failure mode requiring a hard stop + user notification.

Actual behavior

Observed Behaviour (production incident, 2026-04-03/04)

A single inbound Telegram message was delivered to an agent session during a period when all claude-sonnet-4-6 credentials were cooling down (reset time: ~102 hours). The session retried every ~20 seconds for 16 hours with zero circuit breaking:

Total retry attempts: 2,658
Session transcript size: 8MB / 13,296 message objects
Token cost: $0 (all errors pre-model, but session was completely unusable)
Resolution: manual session reset by a human operator
The session never surfaced an alert, never wrote a cooldown state marker, never paused

This was compounded by a separate inbound dedup issue (see #58611) but the circuit breaker failure is independent — even with a single delivery, 2,658 retries against a 102h cooldown is unacceptable behaviour.

OpenClaw version

OpenClaw v2026.4.2

Operating system

Host: Azure VPS, Linux systemd

Install method

No response

Model

Model provider: Anthropic (claude-sonnet-4-6, claude-proxy wrapper via CLI-Proxy)

Provider / routing chain

Telegram (long polling) → OpenClaw gateway → claude-proxy (internal HTTP proxy on :8318) → Anthropic API (claude-sonnet-4-6)

Additional provider/model setup details

claude-proxy is a transparent token-counting proxy; it does not add retry logic. The retry loop is entirely within the OpenClaw session runtime.

No per-agent model override was active at the time — all agents default to claude-sonnet-4-6. No fallback model was configured for model_cooldown specifically (only for overload).

Config: OpenClaw v2026.4.2, hosted on Azure VPS (Standard D4as v5, Ubuntu 24.04).

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Implement a circuit breaker in OpenClaw to stop retrying when all credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability.

Guidance

Identify the threshold for reset_seconds that indicates hours-long unavailability (e.g., reset_seconds > 3600) and implement a conditional check for this threshold in the retry logic.
When the threshold is met, stop retrying immediately and write a cooldown state marker to the session (e.g., cooling_until: <ISO timestamp>).
Surface an alert to the user via the configured delivery channel (e.g., Telegram message) to notify them of the cooldown period.
Implement a mechanism for auto-resuming the session when the reset_seconds has elapsed, either via a scheduled wake or on next gateway restart.

Example

A possible implementation of the circuit breaker could involve modifying the retry logic to include a conditional check for the reset_seconds threshold:

if reset_seconds > 3600:
    # Stop retrying and write cooldown state marker
    session.cooling_until = datetime.now() + timedelta(seconds=reset_seconds)
    # Surface alert to user
    send_alert(f"All credentials cooling down until {session.cooling_until}")
    # Stop retrying
    return

Notes

The exact implementation details may vary depending on the specific requirements and constraints of the OpenClaw system. Additionally, the threshold value for reset_seconds may need to be adjusted based on the specific use case and requirements.

Recommendation

Apply a workaround by implementing a circuit breaker in OpenClaw to stop retrying when all credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability. This will prevent the infinite retry loop and allow for a more robust and reliable system.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

FAQ

Expected behavior

When all credentials return model_cooldown and reset_seconds > 3600 (i.e. hours-long, not transient):

Stop retrying immediately — do not continue the ~20s retry loop
Write a cooldown state marker to the session (e.g. cooling_until: <ISO timestamp>) so the session knows it is in cooldown
Surface an alert to the user via the configured delivery channel (e.g. Telegram message: "All credentials cooling down until <time>. Will resume automatically.")
Auto-resume when reset_seconds has elapsed — either via a scheduled wake or on next gateway restart

#api #ssr #installation #API versioning #request timeout

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

openclaw - ✅(Solved) Fix [Bug]: No circuit breaker for model_cooldown — session retries indefinitely against hours-long cooldown [1 pull requests, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fixed

PR fix notes

PR #61693: fix(agents): add model cooldown circuit breaker (#61622)

Description (problem / solution / changelog)

Summary

Problem

Solution

Changes

Configuration

Tests

Checklist

Changed files

Bug type

Beta release blocker

Summary

Steps to reproduce

Expected behavior

Actual behavior

Observed Behaviour (production incident, 2026-04-03/04)

OpenClaw version

Operating system

Install method

Model

Provider / routing chain

Additional provider/model setup details

Logs, screenshots, and evidence

Impact and severity

Additional information

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING