openclaw - ✅(Solved) Fix [Bug]: No circuit breaker for model_cooldown — session retries indefinitely against hours-long cooldown [1 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openclaw/openclaw#61622Fetched 2026-04-08 02:56:41
View on GitHub
Comments
0
Participants
1
Timeline
9
Reactions
0
Participants
Timeline (top)
referenced ×6labeled ×2cross-referenced ×1

When all configured credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability, OpenClaw has no circuit breaker. The session retries every ~20 seconds indefinitely, producing thousands of failed attempts and making the session completely unusable until manually reset.

Error Message

Multi-credential setup: multiple Anthropic OAuth tokens configured across separate agent accounts (builder, growth, consultant, shape, coach). All credentials share the same underlying Anthropic rate limit pool. When one credential hits model_cooldown, all credentials for that model return the same error — the gateway has no per-credential fallback for model_cooldown (only for overload/timeout errors).

Root Cause

When all configured credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability, OpenClaw has no circuit breaker. The session retries every ~20 seconds indefinitely, producing thousands of failed attempts and making the session completely unusable until manually reset.

Fix Action

Fixed

PR fix notes

PR #61693: fix(agents): add model cooldown circuit breaker (#61622)

Description (problem / solution / changelog)

Summary

Fixes #61622

When all credentials return model_cooldown with long reset_seconds, sessions now enter a cooldown state and reject inbound messages with a clear "I'll auto-resume when ready" notice, preventing infinite retry loops that silently consume resources for hours.

Problem

As reported in #61622, a user experienced:

  • 2658 retry attempts over 16 hours
  • Session transcript grew to 8MB
  • No circuit breaker to stop the loop
  • No user notification
  • No automatic recovery

Solution

Added a model cooldown circuit breaker that:

  1. Detects long cooldowns: When FallbackSummaryError.soonestCooldownExpiry exceeds the configured threshold (default: 60 minutes)

  2. Enters cooldown state: Writes modelCooldownUntil to session entry

  3. Rejects inbound messages: Returns a clear message to users:

    ⚠️ All credentials are cooling down for ~2 hours. I'll auto-resume when ready. No need to resend.

  4. Auto-recovers: Clears cooldown state when:

    • Cooldown expires (next message triggers cleanup)
    • A run succeeds after cooldown

Changes

  • SessionEntry: Added modelCooldownUntil field for cooldown state
  • Config: Added agents.defaults.llm.modelCooldownThresholdMinutes (default: 60)
  • Entry check: Reject messages when session is in active cooldown
  • State persistence: Write/read cooldown state to session store
  • Auto-clear: Clear on success or expiry

Configuration

agents:
  defaults:
    llm:
      modelCooldownThresholdMinutes: 60  # Default: 60 minutes

Set to 0 to disable the circuit breaker.

Tests

Added 4 new test cases:

  • Rejects inbound messages when session is in model cooldown state
  • Clears expired cooldown state at session entry
  • Writes cooldown state when FallbackSummaryError has long soonestCooldownExpiry
  • Clears cooldown state on successful run after cooldown

All 23 tests pass.

Checklist

  • Code follows project conventions
  • Tests added and passing
  • CHANGELOG.md updated
  • No breaking changes

Changed files

  • CHANGELOG.md (modified, +2/-0)
  • src/auto-reply/reply/agent-runner-execution.test.ts (modified, +209/-0)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +146/-11)
  • src/config/sessions/types.ts (modified, +8/-0)
  • src/config/types.agent-defaults.ts (modified, +9/-0)
  • src/config/zod-schema.agent-defaults.ts (modified, +8/-0)
RAW_BUFFERClick to expand / collapse

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

When all configured credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability, OpenClaw has no circuit breaker. The session retries every ~20 seconds indefinitely, producing thousands of failed attempts and making the session completely unusable until manually reset.

Steps to reproduce

  1. Exhaust all claude-sonnet (or equivalent) credentials across all configured accounts so every credential returns 429 model_cooldown
  2. Send any inbound message to an affected agent session
  3. Observe the session retry every ~20 seconds with no backoff and no halt condition
  4. Session becomes an infinite retry loop — never surfaces an alert, never pauses, never auto-recovers

Expected behavior

When all credentials return model_cooldown and reset_seconds > 3600 (i.e. hours-long, not transient):

  1. Stop retrying immediately — do not continue the ~20s retry loop
  2. Write a cooldown state marker to the session (e.g. cooling_until: <ISO timestamp>) so the session knows it is in cooldown
  3. Surface an alert to the user via the configured delivery channel (e.g. Telegram message: "All credentials cooling down until <time>. Will resume automatically.")
  4. Auto-resume when reset_seconds has elapsed — either via a scheduled wake or on next gateway restart

For short cooldowns (reset_seconds < 3600) the current retry behaviour may be acceptable, but hours-long cooldowns should be treated as a distinct failure mode requiring a hard stop + user notification.

Actual behavior

Observed Behaviour (production incident, 2026-04-03/04)

A single inbound Telegram message was delivered to an agent session during a period when all claude-sonnet-4-6 credentials were cooling down (reset time: ~102 hours). The session retried every ~20 seconds for 16 hours with zero circuit breaking:

  • Total retry attempts: 2,658
  • Session transcript size: 8MB / 13,296 message objects
  • Token cost: $0 (all errors pre-model, but session was completely unusable)
  • Resolution: manual session reset by a human operator
  • The session never surfaced an alert, never wrote a cooldown state marker, never paused

This was compounded by a separate inbound dedup issue (see #58611) but the circuit breaker failure is independent — even with a single delivery, 2,658 retries against a 102h cooldown is unacceptable behaviour.

OpenClaw version

OpenClaw v2026.4.2

Operating system

Host: Azure VPS, Linux systemd

Install method

No response

Model

Model provider: Anthropic (claude-sonnet-4-6, claude-proxy wrapper via CLI-Proxy)

Provider / routing chain

Telegram (long polling) → OpenClaw gateway → claude-proxy (internal HTTP proxy on :8318) → Anthropic API (claude-sonnet-4-6)

Additional provider/model setup details

Multi-credential setup: multiple Anthropic OAuth tokens configured across separate agent accounts (builder, growth, consultant, shape, coach). All credentials share the same underlying Anthropic rate limit pool. When one credential hits model_cooldown, all credentials for that model return the same error — the gateway has no per-credential fallback for model_cooldown (only for overload/timeout errors).

claude-proxy is a transparent token-counting proxy; it does not add retry logic. The retry loop is entirely within the OpenClaw session runtime.

No per-agent model override was active at the time — all agents default to claude-sonnet-4-6. No fallback model was configured for model_cooldown specifically (only for overload).

Config: OpenClaw v2026.4.2, hosted on Azure VPS (Standard D4as v5, Ubuntu 24.04).

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

Implement a circuit breaker in OpenClaw to stop retrying when all credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability.

Guidance

  • Identify the threshold for reset_seconds that indicates hours-long unavailability (e.g., reset_seconds > 3600) and implement a conditional check for this threshold in the retry logic.
  • When the threshold is met, stop retrying immediately and write a cooldown state marker to the session (e.g., cooling_until: <ISO timestamp>).
  • Surface an alert to the user via the configured delivery channel (e.g., Telegram message) to notify them of the cooldown period.
  • Implement a mechanism for auto-resuming the session when the reset_seconds has elapsed, either via a scheduled wake or on next gateway restart.

Example

A possible implementation of the circuit breaker could involve modifying the retry logic to include a conditional check for the reset_seconds threshold:

if reset_seconds > 3600:
    # Stop retrying and write cooldown state marker
    session.cooling_until = datetime.now() + timedelta(seconds=reset_seconds)
    # Surface alert to user
    send_alert(f"All credentials cooling down until {session.cooling_until}")
    # Stop retrying
    return

Notes

The exact implementation details may vary depending on the specific requirements and constraints of the OpenClaw system. Additionally, the threshold value for reset_seconds may need to be adjusted based on the specific use case and requirements.

Recommendation

Apply a workaround by implementing a circuit breaker in OpenClaw to stop retrying when all credentials return model_cooldown with a reset_seconds value indicating hours-long unavailability. This will prevent the infinite retry loop and allow for a more robust and reliable system.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

When all credentials return model_cooldown and reset_seconds > 3600 (i.e. hours-long, not transient):

  1. Stop retrying immediately — do not continue the ~20s retry loop
  2. Write a cooldown state marker to the session (e.g. cooling_until: <ISO timestamp>) so the session knows it is in cooldown
  3. Surface an alert to the user via the configured delivery channel (e.g. Telegram message: "All credentials cooling down until <time>. Will resume automatically.")
  4. Auto-resume when reset_seconds has elapsed — either via a scheduled wake or on next gateway restart

For short cooldowns (reset_seconds < 3600) the current retry behaviour may be acceptable, but hours-long cooldowns should be treated as a distinct failure mode requiring a hard stop + user notification.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

openclaw - ✅(Solved) Fix [Bug]: No circuit breaker for model_cooldown — session retries indefinitely against hours-long cooldown [1 pull requests, 1 participants]