openclaw - Fix systemd unit: RestartSec=5 vs warmup time defeats gateway-lock exit-78 protection

Summary

The default openclaw-gateway.service unit file has a RestartSec value that is shorter than the actual gateway startup-to-healthy time, which makes the gateway-lock exit-78 protection ineffective and creates a sustained restart cascade under common trigger conditions.

Environment

  • OpenClaw 2026.5.7 gitSha b8fe34a
  • Ubuntu 24.04 in WSL2 on Windows
  • systemd 256 user service

Current default unit values (the relevant fields)

[Service]
Restart=always
RestartSec=5
TimeoutStartSec=30
RestartPreventExitStatus=78
KillMode=control-group

Observed behavior

After any external trigger that kills the gateway (e.g. a WSL clock-sync event, an out-of-band openclaw gateway restart, or a port-conflict during startup of the wsl-localhost-bridge service), the unit enters a self-sustaining restart loop:

T+0:00   gateway A killed (any cause)
T+0:05   systemd Restart=always fires → spawns gateway B
T+0:15   gateway B is still warming up, /healthz not yet responding
T+0:15   <competing start (orphan, watchdog, or manual)> spawns gateway C
T+0:15   gateway C runs its own [restart] kill-stale logic, kills B
T+0:25   gateway C is itself still warming up
T+0:30   systemd Restart=always fires again → cycle repeats

The gateway-lock mechanism documented at /docs/gateway/gateway-lock.md is designed to break this loop: a duplicate starter that sees a healthy /healthz responder exits with code 78, and RestartPreventExitStatus=78 stops Restart=always from looping. But this only works if the previous gateway has had time to become healthy. With RestartSec=5 and a real warmup time of ~20-25s on a typical WSL2 host, the new starter always arrives too early: it finds a previous instance that is not yet healthy, classifies it as "stale," kills it, and exit-78 never fires.
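
The timing argument above can be sketched as a toy model. This is a simplification for illustration, not the gateway's actual logic: it assumes a second starter arrives RestartSec seconds after the previous instance was spawned and that the only thing that matters is whether the previous instance has finished warming up. The function names and the outcome strings are hypothetical.

```python
def duplicate_starter_outcome(seconds_since_previous_start, warmup_sec):
    """Return what a second starter does when it finds an existing gateway.

    Simplified model of the behaviour described in the issue (not real code):
    a predecessor that has finished warming up answers /healthz and is
    deferred to with exit 78; a younger one is treated as stale and killed.
    """
    if seconds_since_previous_start >= warmup_sec:
        return "exit-78"
    return "kill-stale"


def loop_breaks(restart_sec, warmup_sec):
    """True if a starter arriving restart_sec after the previous spawn defers."""
    return duplicate_starter_outcome(restart_sec, warmup_sec) == "exit-78"


if __name__ == "__main__":
    # 25 s is the upper end of the warmup range reported in the issue.
    for restart_sec in (5, 45, 60):
        print(restart_sec, loop_breaks(restart_sec, warmup_sec=25))
```

Under these assumptions, the shipped RestartSec=5 can never let the protection trigger, while any value above the warmup time can.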

Concrete evidence from a real incident on this host

The cascade ran for ~30 minutes (restart counters 1 → 14+) during a single trigger event. The journal pattern repeated every ~15-20 seconds:

HH:MM:05  node[A]: [restart] killing 1 stale gateway process(es) before restart: <old PID>
HH:MM:24  node[A]: [health-monitor] started (interval: 300s, startup-grace: 60s, ...)
HH:MM:26  systemd: Main process exited, code=killed, status=9/KILL
HH:MM:31  systemd: Scheduled restart job, restart counter is at N+1
HH:MM:34  node[B]: [restart] killing 1 stale gateway process(es) before restart: <A's PID>
... loop continues

Note the timing: every gateway instance gets through the basic startup (config load → secrets resolve → HTTP server bind → health-monitor start) within ~13-15 seconds, but is killed by the next instance ~2 seconds later. Exit-78 never gets a chance to fire because no gateway ever reaches "healthy on /healthz" before being killed.
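
To check how long each instance survives in a capture like the one above, a small script can pair each kill-stale line with the next SIGKILL exit. The sketch below assumes the abbreviated "HH:MM:SS  source: message" layout used in this excerpt, with two spaces after the timestamp. Real journalctl output has a different prefix and would need its own parsing. The hour, the PIDs, and the restart counter in the sample are placeholders.

```python
from datetime import datetime

# Sample in the abbreviated layout of the excerpt; hour, PIDs and counter are placeholders.
SAMPLE = """\
12:00:05  node[A]: [restart] killing 1 stale gateway process(es) before restart: 1111
12:00:24  node[A]: [health-monitor] started (interval: 300s, startup-grace: 60s)
12:00:26  systemd: Main process exited, code=killed, status=9/KILL
12:00:31  systemd: Scheduled restart job, restart counter is at 2
12:00:34  node[B]: [restart] killing 1 stale gateway process(es) before restart: 2222
"""


def instance_lifetimes(journal_text):
    """Seconds from each kill-stale line to the next SIGKILL exit line.

    Does not handle captures that cross midnight.
    """
    lifetimes = []
    started = None
    for line in journal_text.splitlines():
        stamp, _, message = line.partition("  ")
        if not message:
            continue  # blank lines or lines without the two-space separator
        when = datetime.strptime(stamp.strip(), "%H:%M:%S")
        if "killing" in message and "stale gateway" in message:
            started = when
        elif "status=9/KILL" in message and started is not None:
            lifetimes.append((when - started).total_seconds())
            started = None
    return lifetimes


if __name__ == "__main__":
    print(instance_lifetimes(SAMPLE))
```

An instance that is still running at the end of the capture, like node[B] in the sample, is not counted.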

Operator-side workaround (verified to fix the issue)

Adding a drop-in override that raises RestartSec gives the previous instance time to become healthy, so exit-78 fires and the loop breaks on the next attempted respawn:

# /root/.config/systemd/user/openclaw-gateway.service.d/30-restartsec-race-fix.conf
[Service]
RestartSec=60
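
For operators who script their setup, the drop-in can also be generated programmatically. The helper below is a minimal sketch: the function name is hypothetical, the unit and file names are taken from this issue, and the target directory is passed in explicitly so the demo never touches a real systemd configuration.

```python
import tempfile
from pathlib import Path

UNIT_NAME = "openclaw-gateway.service"  # unit name taken from the issue


def write_restartsec_dropin(user_unit_dir, restart_sec=60,
                            filename="30-restartsec-race-fix.conf"):
    """Write a drop-in that overrides only RestartSec and return its path."""
    if restart_sec <= 0:
        raise ValueError("restart_sec must be positive")
    dropin_dir = Path(user_unit_dir) / f"{UNIT_NAME}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    dropin_path = dropin_dir / filename
    dropin_path.write_text(f"[Service]\nRestartSec={restart_sec}\n", encoding="utf-8")
    return dropin_path


if __name__ == "__main__":
    # Demo against a throwaway directory so nothing real is modified.
    with tempfile.TemporaryDirectory() as scratch:
        path = write_restartsec_dropin(scratch)
        print(path.read_text(encoding="utf-8"))
```

On a real host the directory would be the user unit directory (for the path above, /root/.config/systemd/user), and systemd only picks up the change after `systemctl --user daemon-reload`.
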

After applying and reloading, the journal shows the documented exit-78 flow firing correctly:

T+0  systemd spawns gateway A
T+25 gateway A becomes healthy on /healthz
T+60 systemd's next respawn timer fires, spawns gateway B
T+62 gateway B: "gateway already running under systemd; existing gateway 
      is healthy, exiting with code 78 to prevent a systemd Restart=always loop"
T+62 systemd: Main process exited, code=exited, status=78/CONFIG (loop ends)

Suggested fix in the shipped unit file

Pick one of:

  1. Raise RestartSec in the upstream unit file to a value that exceeds typical warmup (e.g. RestartSec=60 or RestartSec=45). This is the simplest fix, with no behavior change beyond a longer recovery delay after a real crash.

  2. Tie service readiness to /healthz: use Type=notify with sd_notify(READY=1) emitted only after /healthz is responsive, plus TimeoutStartSec=120. Then RestartSec can stay short because systemd won't treat the service as "started" until it's actually healthy. This is the more robust answer for distributions that may have variable warmup.

  3. Have the gateway's own "[restart] killing 1 stale gateway process(es)" codepath check /healthz of the existing process first. If the existing one is responsive (even if not "fully healthy"), defer to it with exit 78 instead of killing it. This avoids the race even without changing systemd timing.
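
Option 2 relies on systemd's notify protocol: the service sends the datagram READY=1 to the Unix socket named in the NOTIFY_SOCKET environment variable. The gateway itself runs on Node, so the Python below is only an illustration of the protocol and of gating it on /healthz, not the project's implementation. The function names and the health URL passed in by the caller are assumptions.

```python
import os
import socket
import time
import urllib.error
import urllib.request


def sd_notify(state):
    """Send a state string to systemd's notify socket; False if none is set."""
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return False
    if address.startswith("@"):
        address = "\0" + address[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), address)
    return True


def wait_until_healthy(url, timeout_sec=120.0, interval_sec=1.0):
    """Poll url until it answers HTTP 200 or timeout_sec elapses."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or not healthy yet
        time.sleep(interval_sec)
    return False


def announce_ready_when_healthy(url):
    """Tell systemd the service is ready only after the health check passes."""
    if wait_until_healthy(url):
        return sd_notify("READY=1")
    return False
```

With Type=notify in the unit, systemd would keep the service in the "activating" state until this message arrives or TimeoutStartSec expires.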

Impact

When this happens, every restart cycle sends a "gateway shutting down, your task will be interrupted" notification to connected messaging platforms (Telegram, etc.). A 30-minute cascade produces ~12-15 such notifications to every connected operator account. The gateway is functionally degraded throughout (each instance survives ~15 seconds; in-flight work is dropped repeatedly).

Related docs

Cross-reference

  • Memory inefficiency in the same WSL2 setup: openclaw/openclaw#80665
