openclaw: doctor unconditionally prompts 'Restart gateway service now?' for any running gateway, causing post-update restart loops on macOS

StepCodex · 2026-05-25T14:24:17Z

Environment

OpenClaw: 2026.5.22 (a374c3a)
Platform: macOS 26.3.1 (arm64), Node 25.8.0
Install: npm global, LaunchAgent at ai.openclaw.gateway
Channel: stable

Summary

openclaw doctor offers the "Restart gateway service now?" prompt any time the gateway service is running, with no check for whether it's actually unhealthy and with initialValue: true (Enter defaults to Yes). When run shortly after an upgrade — which is the common case — the gateway has already auto-restarted via update.run → SIGUSR1, so this prompt causes a redundant kill+respawn that races the supervisor (launchctl on macOS, presumably systemd on Linux) and can put the service into a restart cycle for several minutes.

Source location

dist/doctor-gateway-daemon-flow-DrfrAktV.js:299-329 (file basename will vary per build):

if (serviceRuntime?.status === "running") {
  if (serviceRepairExternal) {
    note(EXTERNAL_SERVICE_REPAIR_NOTE, "Gateway");
    return;
  }
  if (await confirmDoctorServiceRepair(params.prompter, {
    message: "Restart gateway service now?",
    initialValue: true                          // <-- defaults to Yes
  }, serviceRepairPolicy)) {
    const restartStatus = describeGatewayServiceRestart("Gateway", await service.restart({ ... }));
    ...
  }
}

The gate is serviceRuntime?.status === "running" alone. No probe of /health, no check for staleGatewayPids, no consideration of how recently the service started.
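
As an illustration of the missing health check, a probe with a timeout could look roughly like the sketch below. This is not OpenClaw's code: `probeGatewayHealth`, the injectable `fetchImpl` parameter, and the 2s default timeout are hypothetical names and values, and the caller must supply the real gateway URL, since the issue only names the /health path.

```js
// Hypothetical helper, not part of OpenClaw. The URL is supplied by the
// caller; the 2000 ms timeout is an assumed default.
async function probeGatewayHealth(url, { timeoutMs = 2000, fetchImpl = globalThis.fetch } = {}) {
  const controller = new AbortController();
  // Abort the request if the gateway does not answer in time.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(url, { signal: controller.signal });
    return res.ok; // any 2xx response counts as healthy
  } catch {
    return false; // connection refused, timeout, or any other failure
  } finally {
    clearTimeout(timer);
  }
}
```

A caller would treat a `false` result as one possible reason to offer a restart, rather than relying on the `running` status alone.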

Reproduction

1. Run an OpenClaw upgrade via the control UI (pnpm/npm, doesn't matter).
2. Update completes, update.run emits SIGUSR1 → in-process restart → gateway is healthy ~3-5s later.
3. Run openclaw doctor (or accept the doctor prompt that fires from the control UI flow).
4. Prompt appears: "Restart gateway service now?" with default = Yes.
5. Hitting Enter triggers service.restart(), which calls cleanStaleGatewayProcessesSync → SIGTERM/SIGKILL of the running gateway → launchctl KeepAlive respawns it.

Observed log sequence (real incident, 2026-05-25 22:06-22:11 AWST)

22:06:58 [gateway] update.run completed ... restartReason=update.run status=skipped
22:07:00 [gateway] signal SIGUSR1 received
22:07:00 [gateway] received SIGUSR1; restarting
22:07:00 [shutdown] completed cleanly in 238ms
22:07:02 [gateway] loading configuration…
22:07:04 [gateway] http server listening (14 plugins...; 2.9s)
22:07:04 [gateway] signal SIGTERM received                ← doctor/UI restart trigger
22:07:04 [gateway] received SIGTERM; shutting down
22:09:01 [gateway] loading configuration…                 ← launchctl KeepAlive respawn
22:09:04 [gateway] ready
22:09:05 [restart] killing 1 stale gateway process(es) before restart: 36434
22:09:05 Restarted LaunchAgent: gui/501/ai.openclaw.gateway
22:09:21 [gateway] loading configuration…
22:09:24 [gateway] ready
22:09:41 [gateway] signal SIGTERM received                ← another trigger
22:10:53 [gateway] loading configuration…
22:10:58 [gateway] ready                                  ← finally stable

Roughly four restart cycles over four minutes. The user reported anxiety and did a full Mac reboot because they assumed the system was broken; in reality launchctl would have recovered on its own, and did.
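
The "four cycles over four minutes" figure can be checked against the log above with a short script. This is only a sketch: it assumes each "loading configuration" line marks one gateway start, and `summariseRestarts` is a hypothetical helper written for this check. The log lines are copied from the incident with the arrow annotations removed.

```js
// Incident log lines from the report, without the arrow annotations.
const incidentLog = [
  "22:06:58 [gateway] update.run completed ... restartReason=update.run status=skipped",
  "22:07:00 [gateway] signal SIGUSR1 received",
  "22:07:00 [gateway] received SIGUSR1; restarting",
  "22:07:00 [shutdown] completed cleanly in 238ms",
  "22:07:02 [gateway] loading configuration…",
  "22:07:04 [gateway] http server listening (14 plugins...; 2.9s)",
  "22:07:04 [gateway] signal SIGTERM received",
  "22:07:04 [gateway] received SIGTERM; shutting down",
  "22:09:01 [gateway] loading configuration…",
  "22:09:04 [gateway] ready",
  "22:09:05 [restart] killing 1 stale gateway process(es) before restart: 36434",
  "22:09:05 Restarted LaunchAgent: gui/501/ai.openclaw.gateway",
  "22:09:21 [gateway] loading configuration…",
  "22:09:24 [gateway] ready",
  "22:09:41 [gateway] signal SIGTERM received",
  "22:10:53 [gateway] loading configuration…",
  "22:10:58 [gateway] ready",
];

// Convert a leading HH:MM:SS timestamp to seconds since midnight.
function toSeconds(line) {
  const [h, m, s] = line.slice(0, 8).split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Count gateway starts and measure the span between first and last line.
function summariseRestarts(lines) {
  const startups = lines.filter((l) => l.includes("loading configuration")).length;
  const spanSeconds = toSeconds(lines[lines.length - 1]) - toSeconds(lines[0]);
  return { startups, spanSeconds };
}

console.log(summariseRestarts(incidentLog)); // { startups: 4, spanSeconds: 240 }
```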

Impact

- User trust: "Restart gateway service now?" with default-Yes, fired against a healthy service, looks like the tool is asserting the service needs restarting. Users click Yes and trigger the very loop the prompt was ostensibly there to prevent.
- Cascading work: Each forced restart triggers channel reconnects (Telegram, iMessage, etc.) and pre-warm (warmCurrentProviderAuthState was already flagged as a 60s event-loop blocker in #85999), so the cost is not just a kill+respawn; it's 5-15s of degraded behaviour per cycle, multiplied.
- macOS-specific amplifier: launchctl KeepAlive races the manual service.restart() flow, producing the "stale pid" warnings users see in the dialog (killing N stale gateway process(es) before restart).

Suggested fix (cheapest to most thorough)

1. Minimum: flip initialValue: true to initialValue: false at line ~306. Enter should not default to "restart a healthy service".
2. Better: only prompt when there's a documented reason to restart:
   - staleGatewayPids.length > 0, or
   - A health probe to /health failed/timed out, or
   - A doctor finding above warning severity has been raised.

   If none of those hold, log "Gateway running (pid X, uptime Ys) — no restart required" and continue.
3. Best: add an uptimeMs check. If the service started within the last (say) 30s, never offer restart; it's almost certainly the post-update auto-restart that just landed. Pair with (2).

A similar gate already exists upstream of this — serviceRuntime?.status !== "running" correctly distinguishes "not running" from "running" and only offers Start in that case. The "running" branch just needs the same level of care.
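
The three options above can be combined into one pure decision function, sketched below. None of these names exist in OpenClaw: `decideRestartPrompt`, `hasSevereFinding`, the reason strings, and `RECENT_START_WINDOW_MS` are hypothetical, and the 30s window is the "(say) 30s" value from option 3. The caller is assumed to have already gathered the stale pids, the health-probe result, and the uptime.

```js
// Illustrative sketch only; none of these names exist in OpenClaw.
const RECENT_START_WINDOW_MS = 30000; // the "(say) 30s" window from option 3

function decideRestartPrompt({
  staleGatewayPids = [],
  healthOk = true,
  hasSevereFinding = false, // a doctor finding above warning severity
  uptimeMs = Infinity,
} = {}) {
  // Option 3: a gateway that only just started is most likely the
  // post-update auto-restart, so never offer to restart it.
  if (uptimeMs < RECENT_START_WINDOW_MS) {
    return { offer: false, initialValue: false, reasons: [] };
  }

  // Option 2: collect the documented reasons to restart.
  const reasons = [];
  if (staleGatewayPids.length > 0) reasons.push("stale-pids");
  if (!healthOk) reasons.push("health-probe-failed");
  if (hasSevereFinding) reasons.push("doctor-finding");

  // Option 1: even when a restart is offered, Enter defaults to No.
  return { offer: reasons.length > 0, initialValue: false, reasons };
}
```

When `offer` is false, the caller would log the "no restart required" note and continue instead of prompting. Keeping `initialValue` at false in every case is a conservative choice made for this sketch; defaulting to Yes when a concrete reason exists would also be consistent with the suggestion.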

Workaround for users

Until this is fixed, after upgrading on macOS:

1. Wait ~30s, run openclaw status. If gateway shows running with a fresh pid, do not run openclaw doctor.
2. If the "Restart gateway service now?" prompt appears against a known-healthy gateway, answer No.
3. Do not reboot the host; launchctl KeepAlive will stabilise the service within ~3-5 minutes even if a thrash starts.

Related

- #55563: similar symptom on Linux/WSL2 but root cause was a broken systemd unit (nvm Node path), not a redundant prompt against a healthy service.
- #85999: provider auth pre-warm event-loop block, which makes each spurious restart from this issue more painful.
- #66675: openclaw gateway restart can return false failure after a healthy systemd restart, same family.
