openclaw - 💡(How to fix) Fix [Bug]: auth.cooldowns config change forces full gateway restart, drops in-flight CLI runs [1 pull requests]

Error Message

12:04:53 [agent/cli-backend] claude live session turn failed: error=FailoverError

Fix Action

Fixed

Symptom

When the gateway writes a new auth.cooldowns entry into config (typically after an Anthropic billing 400 or rate-limit response), the gateway's own config-reload detector classifies the change as restart-required. systemd then stops the unit and sends SIGTERM to the gateway, which aborts any in-flight CLI runs with FailoverError. User-trigger messages are silently dropped.

This compounds with #71709 (slug-generator-as-billing-misclassify): every spurious 400 in the helper lane triggers a full gateway bounce, even though the user-facing OAuth lane is healthy and the cooldown only needs to apply to one lane.

Repro / observed cascade

Pi 5 host, Ubuntu 24.04, OpenClaw 2026.4.25, gateway under systemd (user unit).

Today's incident timeline from journalctl --user -u openclaw-gateway:

12:02:19 [agent/cli-backend] cli exec: provider=claude-cli model=opus trigger=user
12:02:25 [diagnostic] liveness warning: phase=channels.telegram.start-account
         (coincidental — slow start-account, not causal)
12:04:25 [diagnostic] stuck session: reason=queued_work_without_active_run
12:04:50 [reload] config change detected; evaluating reload (auth.cooldowns)
12:04:50 [reload] config change requires gateway restart (auth.cooldowns)
                  — deferring until 1 reply(ies), 1 embedded run(s) complete
12:04:53 systemd[910]: Stopping openclaw-gateway.service
12:04:53 [gateway] signal SIGTERM received
12:04:53 [gateway] received SIGTERM; shutting down
12:04:53 [agent/cli-backend] claude live session turn failed: error=FailoverError

This pattern repeated 4× between 12:02–13:10 EDT today; each restart dropped one user-trigger Telegram message.

Liveness warnings at 13:12:59 (33.5s event-loop block) and 13:16:00 produced no SIGTERM, confirming liveness isn't the cause and the gateway tolerates slow phases when no cooldown event fires.
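
To check how often this cascade appears in a saved journal excerpt, a small scanner along these lines may help. It is a minimal sketch: findCascades and the Cascade shape are hypothetical names, and the patterns assume the abbreviated line format quoted above (raw journalctl output carries a longer prefix and would need adjusted patterns).

```typescript
// Hypothetical helper, not part of OpenClaw: pairs each "requires gateway restart"
// line with the next SIGTERM line. Matched substrings are copied from the log above.
interface Cascade {
  reloadAt: string;  // timestamp of the restart-required reload line
  key: string;       // config key named in parentheses, e.g. "auth.cooldowns"
  sigtermAt: string; // timestamp of the SIGTERM that followed
}

function findCascades(lines: string[]): Cascade[] {
  const restartRe =
    /^(\d{2}:\d{2}:\d{2}) \[reload\] config change requires gateway restart \(([^)]+)\)/;
  const sigtermRe = /^(\d{2}:\d{2}:\d{2}) \[gateway\] signal SIGTERM received/;
  const found: Cascade[] = [];
  let pending: { at: string; key: string } | null = null;

  for (const line of lines) {
    const restart = restartRe.exec(line);
    if (restart) {
      // Remember the most recent restart-required decision.
      pending = { at: restart[1], key: restart[2] };
      continue;
    }
    const sigterm = sigtermRe.exec(line);
    if (sigterm && pending) {
      found.push({ reloadAt: pending.at, key: pending.key, sigtermAt: sigterm[1] });
      pending = null; // one SIGTERM closes one cascade
    }
  }
  return found;
}
```

A SIGTERM with no preceding restart-required line is ignored, so ordinary service stops are not counted.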

Root cause

The reload classifier treats auth.cooldowns as restart-required. Two adjacent design points cause the user-visible damage:

  1. Cooldown state is short-lived, lane-scoped, and observability-shaped — it's a poor fit for durable config that gets persisted and re-read on boot.
  2. Even granting that auth.cooldowns lives in config, applying a cooldown only requires the auth subsystem to refresh its in-memory state. A full process restart with in-flight CLI runs is disproportionate.
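
To illustrate the first point, here is a minimal sketch of what short-lived, lane-scoped cooldown state could look like when held purely in memory. CooldownTable and the lane names are hypothetical and do not reflect OpenClaw's actual auth subsystem.

```typescript
// Hypothetical in-memory cooldown table; lane names and durations are illustrative.
class CooldownTable {
  // lane name -> expiry time in milliseconds since the epoch
  private readonly until = new Map<string, number>();

  // Put one lane on cooldown; other lanes are unaffected.
  set(lane: string, durationMs: number, now: number = Date.now()): void {
    this.until.set(lane, now + durationMs);
  }

  // True while the lane's cooldown has not yet expired.
  isCoolingDown(lane: string, now: number = Date.now()): boolean {
    const expiry = this.until.get(lane);
    if (expiry === undefined) return false;
    if (now >= expiry) {
      this.until.delete(lane); // expired entries are dropped on read
      return false;
    }
    return true;
  }
}

// Usage: a cooldown on a helper lane leaves the user-facing lane available.
const table = new CooldownTable();
table.set("helper", 60_000);
const helperBlocked = table.isCoolingDown("helper");
const oauthBlocked = table.isCoolingDown("oauth");
```

Nothing in this shape needs to touch persisted config, which is why a config write (and the reload evaluation it triggers) looks avoidable for this kind of state.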

Suggested fix (in priority order)

  1. Hot-reload auth.cooldowns — mark the key as reload-safe; refresh the auth subsystem's cooldown table in-place. No restart, no in-flight abort.
  2. Alternatively, move cooldowns out of persisted config entirely — keep them in-memory only, optionally re-derive from a state DB or journal on boot.
  3. As a stopgap that operators can apply: add a config knob to suppress restart-on-cooldown-change (accepting that cooldowns won't survive a real restart).
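
Option 1 could be sketched roughly as follows. This is an illustration under assumptions, not OpenClaw's reload code: ReloadClassifier, registerHotReload, and the "gateway.port" key are hypothetical names chosen for the example.

```typescript
type ReloadAction = "hot-reload" | "restart";

// Hypothetical in-place refresh callback for one config key.
type HotReloadHandler = (newValue: unknown) => void;

class ReloadClassifier {
  private readonly handlers = new Map<string, HotReloadHandler>();

  // Mark a config key (and everything nested under it) as reload-safe.
  registerHotReload(key: string, handler: HotReloadHandler): void {
    this.handlers.set(key, handler);
  }

  // Keys without a registered handler keep the conservative default: restart.
  classify(changedKey: string): ReloadAction {
    return this.findHandler(changedKey) ? "hot-reload" : "restart";
  }

  // Runs the handler if one exists; otherwise reports that a restart is needed.
  apply(changedKey: string, newValue: unknown): ReloadAction {
    const handler = this.findHandler(changedKey);
    if (!handler) return "restart"; // the caller decides how to schedule it
    handler(newValue);
    return "hot-reload";
  }

  // Walk up the dotted path so "auth.cooldowns.x" matches "auth.cooldowns".
  private findHandler(changedKey: string): HotReloadHandler | undefined {
    let key = changedKey;
    while (true) {
      const handler = this.handlers.get(key);
      if (handler) return handler;
      const dot = key.lastIndexOf(".");
      if (dot < 0) return undefined;
      key = key.slice(0, dot);
    }
  }
}

// Usage: cooldown changes refresh in-memory state; unknown keys still restart.
const reloads = new ReloadClassifier();
let refreshCount = 0;
reloads.registerHotReload("auth.cooldowns", () => {
  refreshCount += 1; // a real handler would refresh the auth cooldown table here
});
const cooldownAction = reloads.apply("auth.cooldowns", {});
const otherAction = reloads.classify("gateway.port"); // hypothetical key
```

Defaulting unregistered keys to "restart" keeps the existing behaviour for everything except the keys explicitly opted in, so the change stays narrow.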

Why this matters

Every billing classifier false-positive (#71709) currently causes user-visible message loss, even when the affected lane is internal. Cooldowns should degrade gracefully, not bounce the supervisor.

Environment

  • OpenClaw 2026.4.25 (eeef486)
  • Node 22.22.2, kernel 6.8.0-1056-raspi (Pi 5 ARM64)
  • Gateway loopback, user-unit systemd, KillMode=control-group, Restart=always
  • Single agent (main), Telegram channel direct-message mode
