openclaw: doctor unconditionally prompts 'Restart gateway service now?' for any running gateway, causing post-update restart loops on macOS


Environment

  • OpenClaw: 2026.5.22 (a374c3a)
  • Platform: macOS 26.3.1 (arm64), Node 25.8.0
  • Install: npm global, LaunchAgent at ai.openclaw.gateway
  • Channel: stable

Summary

openclaw doctor offers the "Restart gateway service now?" prompt any time the gateway service is running, with no check for whether it's actually unhealthy and with initialValue: true (Enter defaults to Yes). When run shortly after an upgrade — which is the common case — the gateway has already auto-restarted via update.run → SIGUSR1, so this prompt causes a redundant kill+respawn that races the supervisor (launchctl on macOS, presumably systemd on Linux) and can put the service into a restart cycle for several minutes.

Source location

dist/doctor-gateway-daemon-flow-DrfrAktV.js:299-329 (file basename will vary per build):

if (serviceRuntime?.status === "running") {
  if (serviceRepairExternal) {
    note(EXTERNAL_SERVICE_REPAIR_NOTE, "Gateway");
    return;
  }
  if (await confirmDoctorServiceRepair(params.prompter, {
    message: "Restart gateway service now?",
    initialValue: true                          // <-- defaults to Yes
  }, serviceRepairPolicy)) {
    const restartStatus = describeGatewayServiceRestart("Gateway", await service.restart({ ... }));
    ...
  }
}

The gate is serviceRuntime?.status === "running" alone. No probe of /health, no check for staleGatewayPids, no consideration of how recently the service started.

Reproduction

  1. Run an OpenClaw upgrade via the control UI (pnpm/npm, doesn't matter).
  2. Update completes, update.run emits SIGUSR1 → in-process restart → gateway is healthy ~3-5s later.
  3. Run openclaw doctor (or accept the doctor prompt that fires from the control UI flow).
  4. Prompt appears: "Restart gateway service now?" with default = Yes.
  5. Hitting Enter triggers service.restart(), which calls cleanStaleGatewayProcessesSync → SIGTERM/SIGKILL of the running gateway → launchctl KeepAlive respawns it.

Observed log sequence (real incident, 2026-05-25 22:06-22:11 AWST)

22:06:58 [gateway] update.run completed ... restartReason=update.run status=skipped
22:07:00 [gateway] signal SIGUSR1 received
22:07:00 [gateway] received SIGUSR1; restarting
22:07:00 [shutdown] completed cleanly in 238ms
22:07:02 [gateway] loading configuration…
22:07:04 [gateway] http server listening (14 plugins...; 2.9s)
22:07:04 [gateway] signal SIGTERM received                ← doctor/UI restart trigger
22:07:04 [gateway] received SIGTERM; shutting down
22:09:01 [gateway] loading configuration…                 ← launchctl KeepAlive respawn
22:09:04 [gateway] ready
22:09:05 [restart] killing 1 stale gateway process(es) before restart: 36434
22:09:05 Restarted LaunchAgent: gui/501/ai.openclaw.gateway
22:09:21 [gateway] loading configuration…
22:09:24 [gateway] ready
22:09:41 [gateway] signal SIGTERM received                ← another trigger
22:10:53 [gateway] loading configuration…
22:10:58 [gateway] ready                                  ← finally stable

Roughly four restart cycles over four minutes. The user was alarmed enough to reboot the Mac, assuming the system was broken; in reality launchctl would have recovered on its own, and did.
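
The cycle count and duration can be checked against the log itself. The following is a minimal sketch, not part of OpenClaw: the helper names toSeconds and summarizeRestarts are hypothetical, and it assumes that each "loading configuration" line marks one gateway start and that the excerpt does not cross midnight. The log lines are copied from the incident above, trimmed to the events that matter.

```javascript
// Log lines copied from the incident above (trimmed to the relevant events).
const incidentLog = `
22:07:00 [gateway] received SIGUSR1; restarting
22:07:02 [gateway] loading configuration…
22:07:04 [gateway] received SIGTERM; shutting down
22:09:01 [gateway] loading configuration…
22:09:05 Restarted LaunchAgent: gui/501/ai.openclaw.gateway
22:09:21 [gateway] loading configuration…
22:09:41 [gateway] signal SIGTERM received
22:10:53 [gateway] loading configuration…
22:10:58 [gateway] ready
`;

// Hypothetical helper: convert an "HH:MM:SS" prefix to seconds since midnight.
function toSeconds(hhmmss) {
  const [h, m, s] = hhmmss.split(":").map(Number);
  return h * 3600 + m * 60 + s;
}

// Hypothetical helper: count gateway starts and measure the span of the excerpt.
// Assumption: every "loading configuration" line is one start of the gateway.
function summarizeRestarts(logText) {
  const lines = logText.split("\n").map((l) => l.trim()).filter(Boolean);
  const starts = lines.filter((l) => l.includes("loading configuration"));
  const first = toSeconds(lines[0].slice(0, 8));
  const last = toSeconds(lines[lines.length - 1].slice(0, 8));
  return { cycles: starts.length, spanSeconds: last - first };
}

const summary = summarizeRestarts(incidentLog);
console.log(summary); // { cycles: 4, spanSeconds: 238 }
```

Four starts in 238 seconds matches the "four restart cycles over four minutes" estimate; only the first of them (SIGUSR1 from update.run) was needed.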

Impact

  • User trust: "Restart gateway service now?" with default-Yes, fired against a healthy service, looks like the tool is asserting the service needs restarting. Users click Yes and trigger the very loop the prompt was ostensibly there to prevent.
  • Cascading work: Each forced restart triggers channel reconnects (Telegram, iMessage, etc.) and pre-warm (warmCurrentProviderAuthState was already flagged as a 60s event-loop blocker in #85999), so the cost is not just a kill+respawn — it's 5-15s of degraded behaviour per cycle, multiplied.
  • macOS-specific amplifier: launchctl KeepAlive races the manual service.restart() flow, producing the "stale pid" warnings users see in the dialog (killing N stale gateway process(es) before restart).

Suggested fix (cheapest to most thorough)

  1. Minimum: flip initialValue: true to initialValue: false at line ~306. Enter should not default to "restart a healthy service".
  2. Better: only prompt when there's a documented reason to restart:
    • staleGatewayPids.length > 0, or
    • a health probe to /health failed or timed out, or
    • a doctor finding above warning severity has been raised.
     If none of those hold, log "Gateway running (pid X, uptime Ys) — no restart required" and continue.
  3. Best: add an uptimeMs check. If the service started within the last (say) 30s, never offer restart — it's almost certainly the post-update auto-restart that just landed. Pair with (2).
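
The three suggestions combine into a single decision function. The sketch below is illustrative only and is not OpenClaw's code: decideRestartPrompt, RECENT_START_GRACE_MS, healthOk and findingSeverities are hypothetical names, the 30s window is the example figure from suggestion 3, and "above warning severity" is assumed to mean a severity of "error". Only staleGatewayPids, uptimeMs and initialValue come from the issue text.

```javascript
// Hypothetical sketch of the proposed gate; names and thresholds are assumptions.
const RECENT_START_GRACE_MS = 30_000; // the "(say) 30s" window from suggestion 3

/**
 * Decide whether doctor should offer a restart for a gateway that is running.
 * @param {object} state
 * @param {number[]} [state.staleGatewayPids]  leftover gateway pids, if any
 * @param {boolean}  [state.healthOk]          outcome of a /health probe
 * @param {string[]} [state.findingSeverities] severities raised so far
 * @param {number}   [state.uptimeMs]          time since the service started
 */
function decideRestartPrompt(state) {
  const {
    staleGatewayPids = [],
    healthOk = true,
    findingSeverities = [],
    uptimeMs = Infinity,
  } = state;

  // Suggestion 3: a service that started moments ago is most likely the
  // post-update auto-restart, so never offer a restart inside the window.
  if (uptimeMs < RECENT_START_GRACE_MS) {
    return { offer: false, initialValue: false, reason: "recently started" };
  }

  // Suggestion 2: collect the documented reasons to restart.
  const reasons = [];
  if (staleGatewayPids.length > 0) reasons.push("stale gateway pids");
  if (!healthOk) reasons.push("health probe failed");
  if (findingSeverities.includes("error")) reasons.push("finding above warning severity");

  if (reasons.length === 0) {
    return { offer: false, initialValue: false, reason: "no restart required" };
  }

  // Suggestion 1: even when prompting, Enter should not default to Yes.
  return { offer: true, initialValue: false, reason: reasons.join(", ") };
}

// Usage: a healthy gateway that restarted 5s ago is left alone.
console.log(decideRestartPrompt({ uptimeMs: 5_000 }));
// A long-running gateway with a stale pid still gets the prompt.
console.log(decideRestartPrompt({ uptimeMs: 600_000, staleGatewayPids: [36434] }));
```

Keeping the decision a pure function of already-collected state means it can be unit-tested without a live service; the /health probe and pid scan would run before it and pass their results in.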

A similar gate already exists upstream of this — serviceRuntime?.status !== "running" correctly distinguishes "not running" from "running" and only offers Start in that case. The "running" branch just needs the same level of care.

Workaround for users

Until this is fixed, after upgrading on macOS:

  • Wait ~30s, run openclaw status. If gateway shows running with a fresh pid, do not run openclaw doctor.
  • If the "Restart gateway service now?" prompt appears against a known-healthy gateway, answer No.
  • Do not reboot the host — launchctl KeepAlive will stabilise the service within ~3-5 minutes even if a thrash starts.

Related

  • #55563 — similar symptom on Linux/WSL2 but root cause was a broken systemd unit (nvm Node path), not a redundant prompt against a healthy service.
  • #85999 — provider auth pre-warm event-loop block, which makes each spurious restart from this issue more painful.
  • #66675 — openclaw gateway restart can return false failure after a healthy systemd restart, same family.
