[Bug]: auth.cooldowns config change forces full gateway restart, drops in-flight CLI runs

openclaw · 2026-05-30 20:51:09

Error Message

12:04:53 [agent/cli-backend] claude live session turn failed: error=FailoverError

Fix Action

Fixed

Fixed by PR: [AI-assisted] fix(gateway): avoid restarts for auth cooldown reloads (https://github.com/openclaw/openclaw/pull/88474)

Symptom

When the gateway writes a new auth.cooldowns entry into config (typically after an Anthropic billing 400 or rate-limit response), the gateway's own config-reload detector classifies the change as restart-required. Systemd then stops the unit, SIGTERMs the gateway, and aborts any in-flight CLI runs with FailoverError. User-trigger messages are silently dropped.

This compounds with #71709 (slug-generator-as-billing-misclassify): every spurious 400 in the helper lane triggers a full gateway bounce, even though the user-facing OAuth lane is healthy and the cooldown only needs to apply to one lane.

Repro / observed cascade

Pi 5 host, Ubuntu 24.04, OpenClaw 2026.4.25, gateway under systemd (user unit).

Today's incident timeline from journalctl --user -u openclaw-gateway:

12:02:19 [agent/cli-backend] cli exec: provider=claude-cli model=opus trigger=user
12:02:25 [diagnostic] liveness warning: phase=channels.telegram.start-account
         (coincidental — slow start-account, not causal)
12:04:25 [diagnostic] stuck session: reason=queued_work_without_active_run
12:04:50 [reload] config change detected; evaluating reload (auth.cooldowns)
12:04:50 [reload] config change requires gateway restart (auth.cooldowns)
                  — deferring until 1 reply(ies), 1 embedded run(s) complete
12:04:53 systemd[910]: Stopping openclaw-gateway.service
12:04:53 [gateway] signal SIGTERM received
12:04:53 [gateway] received SIGTERM; shutting down
12:04:53 [agent/cli-backend] claude live session turn failed: error=FailoverError

This pattern repeated 4× between 12:02–13:10 EDT today; each restart dropped one user-trigger Telegram message.

Liveness warnings at 13:12:59 (33.5s event-loop block) and 13:16:00 produced no SIGTERM, confirming liveness isn't the cause and the gateway tolerates slow phases when no cooldown event fires.
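
To check how quickly a restart decision is followed by SIGTERM in a longer journal, the timeline above can be scanned mechanically. The following is only an illustrative sketch: the function names are made up for this example, the match strings are copied from the excerpt, and timestamps are assumed to stay within a single day.

```typescript
// Illustrative sketch; helper names are hypothetical, match strings come from the excerpt.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// For each "requires gateway restart" line, report the seconds until the next SIGTERM line.
function restartToSigtermGaps(lines: string[]): number[] {
  const gaps: number[] = [];
  let pending: number | null = null;
  for (const line of lines) {
    const match = /^(\d{2}:\d{2}:\d{2}) /.exec(line);
    if (!match) continue; // indented continuation lines carry no timestamp
    const t = toSeconds(match[1]);
    if (line.includes("config change requires gateway restart")) {
      pending = t;
    } else if (pending !== null && line.includes("signal SIGTERM received")) {
      gaps.push(t - pending);
      pending = null;
    }
  }
  return gaps;
}
```

Run against the excerpt above, this yields a single gap of 3 seconds (12:04:50 to 12:04:53), during which the CLI run started at 12:02:19 was still in flight.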

Root cause

The reload classifier treats auth.cooldowns as restart-required. Two adjacent design points cause the user-visible damage:

1. Cooldown state is short-lived, lane-scoped, and observability-shaped; it's a poor fit for durable config that gets persisted and re-read on boot.
2. Even granting that auth.cooldowns lives in config, applying a cooldown only requires the auth subsystem to refresh its in-memory state. A full process restart with in-flight CLI runs is disproportionate.
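
The classification problem described above can be pictured with a small sketch. This is not OpenClaw's actual classifier: the rule table, type names, and the `gateway.port` key mentioned in the usage note are hypothetical, and the only detail taken from the report is that `auth.cooldowns` is the key being classified.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.
type ReloadAction = "hot-reload" | "restart";

interface ReloadRule {
  prefix: string;       // config key prefix the rule applies to
  action: ReloadAction; // what a change under that prefix requires
}

// Assumed rule table: auth.cooldowns is reload-safe; unlisted keys default to restart.
const RULES: ReloadRule[] = [
  { prefix: "auth.cooldowns", action: "hot-reload" },
];

function matchesPrefix(key: string, prefix: string): boolean {
  return key === prefix || key.startsWith(prefix + ".");
}

// A change set needs a restart as soon as one changed key is not reload-safe.
function classifyReload(changedKeys: string[]): ReloadAction {
  for (const key of changedKeys) {
    const rule = RULES.find((r) => matchesPrefix(key, r.prefix));
    if (!rule || rule.action === "restart") return "restart";
  }
  return "hot-reload";
}
```

Under these assumptions, a change touching only `auth.cooldowns.helper` classifies as `hot-reload`, while a change set that also touches an unlisted key such as `gateway.port` still falls back to `restart`. Defaulting unknown keys to `restart` keeps the conservative behavior for everything except the key named in this report.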

Suggested fix (in priority order)

1. Hot-reload auth.cooldowns: mark the key as reload-safe and refresh the auth subsystem's cooldown table in place. No restart, no in-flight abort.
2. Alternatively, move cooldowns out of persisted config entirely: keep them in memory only, optionally re-deriving them from a state DB or journal on boot.
3. As a stopgap that operators can apply themselves: a config knob to suppress restart-on-cooldown-change (accepting that cooldowns won't survive a real restart).
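
The in-place refresh in the first option could look roughly like the sketch below. The shape of a cooldown entry (a lane name plus an expiry timestamp) and every identifier here are assumptions made for illustration; the report does not show how auth.cooldowns is actually stored.

```typescript
// Hypothetical sketch: the entry shape and all names are illustrative assumptions.
interface CooldownEntry {
  lane: string;    // lane the cooldown applies to, e.g. a helper lane
  untilMs: number; // epoch milliseconds at which the cooldown expires
}

class CooldownTable {
  private entries = new Map<string, number>();

  // Replace the table contents in place from freshly read config.
  // The object identity is unchanged, so in-flight runs keep a valid reference.
  refresh(fromConfig: CooldownEntry[], nowMs: number): void {
    this.entries.clear();
    for (const e of fromConfig) {
      if (e.untilMs > nowMs) this.entries.set(e.lane, e.untilMs); // skip expired entries
    }
  }

  // A lane is cooling down only while its own entry is unexpired.
  isCoolingDown(lane: string, nowMs: number): boolean {
    const until = this.entries.get(lane);
    return until !== undefined && until > nowMs;
  }
}
```

Because lookups are keyed by lane, a cooldown recorded for an internal helper lane leaves a healthy user-facing lane untouched, which is the graceful degradation the report asks for. Passing `nowMs` explicitly rather than reading the clock inside the class is just a choice that keeps the sketch easy to test.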

Why this matters

Every billing classifier false-positive (#71709) currently causes user-visible message loss, even when the affected lane is internal. Cooldowns should degrade gracefully, not bounce the supervisor.

Environment

OpenClaw 2026.4.25 (eeef486)
Node 22.22.2, kernel 6.8.0-1056-raspi (Pi 5 ARM64)
Gateway loopback, user-unit systemd, KillMode=control-group, Restart=always
Single agent (main), Telegram channel direct-message mode
