openclaw - ✅(Solved) Fix Gateway left dead after update.run / SIGUSR1 supervisor restart — systemd sees clean exit, does not relaunch [1 pull requests, 1 comments, 2 participants]

openclaw/openclaw#70354, fetched 2026-04-23 07:25:46

Root Cause

When a user triggers an update from the Control UI, the gateway receives SIGUSR1 and performs an in-process supervisor restart. If the supervisor's re-exec fails, the process exits with code 0. A systemd user service with the typical Restart=on-failure policy treats exit 0 as "clean", does not relaunch, and leaves the gateway dead. The user sees 502 Bad Gateway on the tailnet-served dashboard URL and has no in-band way to recover — the Control UI itself can't fix it because the Control UI needs the gateway to talk to.
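
The failure mode above comes down to how systemd's Restart= setting reacts to an exit code. A minimal TypeScript sketch of that decision follows. It models only plain process exit codes (not signals, timeouts, or watchdog events), assumes no SuccessExitStatus= override, and the name restartsOnExitCode is made up for illustration.

```typescript
// Simplified model of systemd's Restart= policy for a process that exits on its own.
// Assumption: only exit code 0 counts as "clean" (no SuccessExitStatus= configured).
type RestartPolicy =
  | "no"
  | "on-success"
  | "on-failure"
  | "on-abnormal"
  | "on-watchdog"
  | "on-abort"
  | "always";

function restartsOnExitCode(policy: RestartPolicy, exitCode: number): boolean {
  const clean = exitCode === 0;
  switch (policy) {
    case "always":
      return true; // exit code is irrelevant
    case "on-success":
      return clean;
    case "on-failure":
      return !clean; // a clean exit is NOT restarted
    default:
      // "no", "on-abnormal", "on-watchdog", "on-abort" do not react to plain exit codes
      return false;
  }
}

console.log(restartsOnExitCode("on-failure", 0)); // false: the case in this issue
console.log(restartsOnExitCode("on-failure", 1)); // true
console.log(restartsOnExitCode("always", 0)); // true
```

Under this model, exit 0 with Restart=on-failure is the one combination that leaves the unit dead, which matches the behavior reported here.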

Fix Action

PR fix notes

PR #70466: fix(gateway): exit non-zero on supervised restart so systemd Restart=on-failure recovers

Description (problem / solution / changelog)

After an update.run / SIGUSR1 supervisor restart, the gateway exited with code 0. Under systemd with Restart=on-failure (the policy used by the standard install), a clean exit does not trigger a restart, leaving the gateway dead until manual intervention.

Exit with code 1 instead, so that Restart=on-failure restarts the service. KeepAlive on launchd ignores the exit code and is unaffected. The spawned (non-supervised) case continues to exit 0, because the detached child takes over as the new gateway.

Closes #70354

Changed files

  • src/cli/gateway-cli/run-loop.test.ts (modified, +2/-2)
  • src/cli/gateway-cli/run-loop.ts (modified, +11/-11)
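
The behavior change described in the PR notes above can be sketched as a single decision. This is a hypothetical illustration, not the code in run-loop.ts: the RestartMode type and the exitCodeForRestart function are names invented here.

```typescript
// Hypothetical sketch; these names are illustrative and not taken from run-loop.ts.
type RestartMode = "supervised" | "spawned";

/**
 * Pick the exit code for the outgoing gateway process after a restart request.
 * - "supervised": a service manager is expected to relaunch the gateway, so exit
 *   non-zero to make systemd's Restart=on-failure fire.
 * - "spawned": a detached child has already taken over as the new gateway, so the
 *   old process can exit cleanly.
 */
function exitCodeForRestart(mode: RestartMode): number {
  return mode === "supervised" ? 1 : 0;
}

console.log(exitCodeForRestart("supervised")); // 1
console.log(exitCodeForRestart("spawned")); // 0
```

One consequence of this design is that a supervised restart shows up as a failed exit in the journal even when nothing went wrong, which is the trade-off for being recoverable under Restart=on-failure.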


Summary

When a user triggers an update from the Control UI, the gateway receives SIGUSR1 and performs an in-process supervisor restart. If the supervisor's re-exec fails, the process exits with code 0. A systemd user service with the typical Restart=on-failure policy treats exit 0 as "clean", does not relaunch, and leaves the gateway dead. The user sees 502 Bad Gateway on the tailnet-served dashboard URL and has no in-band way to recover — the Control UI itself can't fix it because the Control UI needs the gateway to talk to.

Impact

Hit deterministically on two separate nodes on 2026-04-22 when I pressed Update in the dashboard. Both stayed down for ~19 hours until noticed. For less-technical users who run OpenClaw behind Tailscale + HTTPS serve, this is a hard lock-out: the only recovery path requires SSH and knowledge of systemctl --user restart. Friends I was planning to hand this setup to would absolutely have been stuck here.

Reproduction

  1. Run OpenClaw gateway as a systemd user service with Restart=on-failure (the default if you follow the standard install path).
  2. Open the Control UI, click Update.
  3. Observe: dashboard returns 502. systemctl --user status openclaw-gateway shows inactive (dead) with status=0/SUCCESS (clean exit). Process does not come back.

Journal evidence (redacted, from my node)

Apr 22 00:43:24 node[1035]: [gateway] update.run completed ... status=ok
Apr 22 00:43:26 node[1035]: [gateway] signal SIGUSR1 received
Apr 22 00:43:26 node[1035]: [gateway] received SIGUSR1; restarting
Apr 22 00:43:26 node[1035]: [gateway] restart mode: full process restart (supervisor restart)
Apr 22 00:43:27 systemd[883]: openclaw-gateway.service: Consumed 3min 20.664s CPU time.
# ← no further journal entries from _UID=openclaw until manual intervention 19h later

The Consumed ... CPU time line is systemd's last word on the unit. There is no follow-up "Started openclaw-gateway.service." line.

Expected

The gateway should come back after SIGUSR1, either by the supervisor succeeding at re-exec, or by systemd relaunching the unit. Neither happened here.

Possible fixes (pick one or both of the first two; the third is optional)

  1. Exit non-zero when re-exec fails. If the supervisor cannot hand off cleanly, abort with exit 1 so systemd's default Restart=on-failure catches it. Current behavior exits 0 even when the post-update process is gone, which hides the failure from systemd.

  2. Recommend Restart=always in the packaged openclaw-gateway.service template (not Restart=on-failure). Makes the exit code irrelevant; systemd always relaunches. This is the pragmatic fix and it's what I've now baked into my deploy kit.

  3. Optional but nice: have the Control UI monitor the gateway's /health post-restart and display a "gateway recovering…" spinner + "click here if stuck for > 2 minutes" escape link. Right now the dashboard just hangs or shows 502 with no context.
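
For the third option, the UI-side decision could look roughly like the sketch below. It assumes the Control UI already has some way to probe /health and to know how long ago the restart was requested. The names recoveryState and STUCK_AFTER_MS are hypothetical; only the 2-minute threshold comes from the text above.

```typescript
// Hypothetical sketch of the "gateway recovering…" state described in option 3.
type RecoveryState = "healthy" | "recovering" | "stuck";

// "stuck for > 2 minutes" threshold, taken from the issue text.
const STUCK_AFTER_MS = 2 * 60 * 1000;

/**
 * Map the latest /health probe result and the time since the restart was
 * requested to the state the dashboard should display.
 */
function recoveryState(healthOk: boolean, msSinceRestart: number): RecoveryState {
  if (healthOk) return "healthy";
  // Show the spinner first; only offer the escape link once the threshold has passed.
  return msSinceRestart > STUCK_AFTER_MS ? "stuck" : "recovering";
}

console.log(recoveryState(false, 30_000)); // "recovering"
console.log(recoveryState(false, 150_000)); // "stuck"
console.log(recoveryState(true, 150_000)); // "healthy"
```

Keeping this as a pure function leaves the polling interval and the actual HTTP probe to the caller, which also makes the thresholds easy to test.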

Workaround I deployed

In ~/.config/systemd/user/openclaw-gateway.service.d/20-restart.conf:

[Service]
Restart=always
RestartSec=10s

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=20

Plus a user-level 60-second /health watchdog timer as belt-and-suspenders for the StartLimitBurst edge case. Both reproducibly recover the gateway from a forced systemctl --user kill --signal=SIGUSR1 openclaw-gateway within 25–50 seconds on my two nodes.
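
The issue does not spell out the StartLimitBurst edge case. One plausible reading is a crash loop that exhausts the start limit, after which systemd stops relaunching the unit. The sketch below is a simplified model of that arithmetic for the values in the drop-in above. The function name tripsStartLimit and the fixed-window approximation are assumptions, not systemd's exact accounting.

```typescript
// Simplified model: does a crash loop exhaust the start limit?
// Assumption: every start fails after the same runtime, and the rate-limit window
// opens at the first start.
interface RestartSettings {
  restartSec: number; // RestartSec=
  startLimitIntervalSec: number; // StartLimitIntervalSec=
  startLimitBurst: number; // StartLimitBurst=
}

function tripsStartLimit(s: RestartSettings, runtimeSec: number): boolean {
  // Each cycle is "run until exit" plus the RestartSec= delay.
  const cycleSec = s.restartSec + runtimeSec;
  // Time at which the start AFTER the allowed burst would be attempted.
  const nextStartAtSec = s.startLimitBurst * cycleSec;
  // If that attempt still falls inside the window, systemd refuses to start the unit.
  return nextStartAtSec < s.startLimitIntervalSec;
}

const workaround: RestartSettings = {
  restartSec: 10,
  startLimitIntervalSec: 300,
  startLimitBurst: 20,
};

console.log(tripsStartLimit(workaround, 0)); // true: 20 * 10 = 200 s, inside the 300 s window
console.log(tripsStartLimit(workaround, 5)); // false: 20 * 15 = 300 s, window has elapsed
```

Under this model, a gateway that dies within about 5 seconds of every start would hit the limit even with Restart=always, which is the kind of gap an external /health watchdog can cover.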

Version

OpenClaw 2026.4.15 → updating to 2026.4.20 triggered this. Reproduced on two independent nodes (Oracle Cloud Ubuntu 24.04 aarch64, Node 24.15.0 via nvm).

Extent analysis

TL;DR

If the OpenClaw gateway stays dead after a SIGUSR1 supervisor restart, switch the openclaw-gateway.service unit to Restart=always instead of Restart=on-failure. PR #70466 addresses the same problem from the gateway side by exiting non-zero on a supervised restart.

Guidance

  • The root cause is that the gateway process exits with code 0 after the supervisor restart; systemd treats that as a clean exit, so Restart=on-failure does not relaunch the service.
  • To verify the fix, trigger an update from the Control UI and confirm the gateway comes back by running systemctl --user status openclaw-gateway.
  • Consider a watchdog timer that polls the gateway's /health endpoint, and have the Control UI display a recovery message if the gateway is stuck.
  • Update the openclaw-gateway.service template to include Restart=always and RestartSec=10s so the service restarts regardless of exit code.

Example

[Service]
Restart=always
RestartSec=10s

[Unit]
StartLimitIntervalSec=300
StartLimitBurst=20

This configuration can be added to a drop-in file (e.g., ~/.config/systemd/user/openclaw-gateway.service.d/20-restart.conf) to override the default service settings.

Notes

The provided workaround using Restart=always and a watchdog timer has been successfully tested on two independent nodes and can be used as a reliable fix for this issue.

Recommendation

Apply the workaround by updating the openclaw-gateway.service template to use Restart=always, as it provides a reliable and pragmatic fix for the issue.
