openclaw - 💡(How to fix) Fix systemd unit: RestartSec=5 vs warmup time defeats gateway-lock exit-78 protection

Root Cause

Note the timing: every gateway instance gets through the basic startup (config load → secrets resolve → HTTP server bind → health-monitor start) within ~13-15 seconds, but is killed by the next instance ~2 seconds later. Exit-78 never gets a chance to fire because no gateway ever reaches "healthy on /healthz" before being killed.

Summary

The default openclaw-gateway.service unit file has a RestartSec value that is shorter than the actual gateway startup-to-healthy time, which makes the gateway-lock exit-78 protection ineffective and creates a sustained restart cascade under common trigger conditions.

Environment

OpenClaw 2026.5.7 gitSha b8fe34a
Ubuntu 24.04 in WSL2 on Windows
systemd 256 user service

Current default unit values (the relevant fields)

[Service]
Restart=always
RestartSec=5
TimeoutStartSec=30
RestartPreventExitStatus=78
KillMode=control-group

Observed behavior

After any external trigger that kills the gateway (e.g. a WSL clock-sync event, an out-of-band openclaw gateway restart, or a port-conflict during startup of the wsl-localhost-bridge service), the unit enters a self-sustaining restart loop:

T+0:00   gateway A killed (any cause)
T+0:05   systemd Restart=always fires → spawns gateway B
T+0:15   gateway B is still warming up, /healthz not yet responding
T+0:15   <competing start (orphan, watchdog, or manual)> spawns gateway C
T+0:15   gateway C runs its own [restart] kill-stale logic, kills B
T+0:25   gateway C is itself still warming up
T+0:30   systemd Restart=always fires again → cycle repeats

The gateway-lock mechanism documented at /docs/gateway/gateway-lock.md is designed to break this loop: a duplicate starter that sees a healthy /healthz responder exits with code 78, and RestartPreventExitStatus=78 stops Restart=always from restarting it. But this only works if the previous gateway has had time to become healthy. With RestartSec=5 and a real warmup time of ~20-25s on a typical WSL2 host, the new starter is always too early: it sees a not-yet-healthy previous instance, classifies it as "stale," kills it, and exit-78 never fires.
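The timing argument above can be checked with a small model. This is a simplified sketch, not the gateway's actual logic: the function name starter_outcome and the single-threshold rule (a duplicate starter defers only if the existing instance has been up for at least its warmup time) are assumptions made for illustration, using the warmup and RestartSec figures quoted in this report.

```python
def starter_outcome(seconds_since_previous_start: float, warmup_sec: float) -> str:
    """Model what a duplicate starter does when it finds an existing gateway.

    Assumption (hypothetical model): the existing gateway answers /healthz
    only once it has been running for at least warmup_sec.
    """
    if seconds_since_previous_start >= warmup_sec:
        # Existing gateway is healthy: the duplicate defers and exits with 78.
        return "exit-78"
    # Existing gateway is not yet healthy: it is classified as stale and killed.
    return "kill-stale"


if __name__ == "__main__":
    warmup = 25.0  # upper end of the ~20-25 s warmup reported above
    for gap in (5.0, 10.0, 45.0, 60.0):
        print(f"gap={gap:.0f}s warmup={warmup:.0f}s -> {starter_outcome(gap, warmup)}")
```

Under this model any competing start that arrives sooner than the warmup time kills the instance it finds, which matches the cascade described above.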

Concrete evidence from a real incident on this host

The cascade ran for ~30 minutes (restart counters 1 → 14+) during a single trigger event. The journal pattern repeated every ~15-20 seconds:

HH:MM:05  node[A]: [restart] killing 1 stale gateway process(es) before restart: <old PID>
HH:MM:24  node[A]: [health-monitor] started (interval: 300s, startup-grace: 60s, ...)
HH:MM:26  systemd: Main process exited, code=killed, status=9/KILL
HH:MM:31  systemd: Scheduled restart job, restart counter is at N+1
HH:MM:34  node[B]: [restart] killing 1 stale gateway process(es) before restart: <A's PID>
... loop continues
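To gauge how long such a cascade ran, the journal pattern above can be summarized with a short script. This is a minimal sketch: the function name summarize_cascade is hypothetical, and the two regular expressions assume the message fragments appear exactly as quoted in the excerpt ("restart counter is at N" and "code=killed, status=9/KILL").

```python
import re
from typing import Dict, Iterable

# Fragments taken from the journal excerpt in this report.
RESTART_RE = re.compile(r"restart counter is at (\d+)")
KILLED_RE = re.compile(r"code=killed, status=9/KILL")


def summarize_cascade(lines: Iterable[str]) -> Dict[str, int]:
    """Return the highest restart counter seen and the number of SIGKILL exits."""
    highest = 0
    sigkills = 0
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            highest = max(highest, int(match.group(1)))
        if KILLED_RE.search(line):
            sigkills += 1
    return {"highest_restart_counter": highest, "sigkill_exits": sigkills}
```

The function accepts any iterable of lines, so it can be fed the text output of journalctl --user -u openclaw-gateway.service read from a file or from standard input.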

Operator-side workaround (verified to fix the issue)

Adding a drop-in override that raises RestartSec gives the previous instance time to become healthy, so exit-78 fires and the loop breaks on the next attempted respawn:

# /root/.config/systemd/user/openclaw-gateway.service.d/30-restartsec-race-fix.conf
[Service]
RestartSec=60

After applying the override and reloading the user manager (systemctl --user daemon-reload), the journal shows the documented exit-78 flow firing correctly:

T+0  systemd spawns gateway A
T+25 gateway A becomes healthy on /healthz
T+60 systemd's next respawn timer fires, spawns gateway B
T+62 gateway B: "gateway already running under systemd; existing gateway 
      is healthy, exiting with code 78 to prevent a systemd Restart=always loop"
T+62 systemd: Main process exited, code=exited, status=78/CONFIG (loop ends)

Suggested fix in the shipped unit file

Pick one of:

1. Raise RestartSec in the upstream unit file to a value that exceeds typical warmup (e.g. RestartSec=45 or RestartSec=60). Simplest fix; the only behavior change is a longer recovery delay after a real crash.
2. Tie readiness to /healthz: use Type=notify with sd_notify(READY=1) emitted only after /healthz is responsive, plus TimeoutStartSec=120. Then RestartSec can stay short because systemd won't treat the service as "started" until it's actually healthy. This is the more robust answer for distributions that may have variable warmup.
3. Have the gateway's own "[restart] killing 1 stale gateway process(es)" codepath check /healthz of the existing process first: if the existing one is responsive (even if not "fully healthy"), defer to it with exit 78 instead of killing it. This avoids the race even without changing systemd timing.
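
The second and third options above (signalling readiness only once /healthz responds, and probing /healthz before killing an existing process) could look roughly like the following. This is an illustrative sketch in Python, not the gateway's code (the journal shows the gateway running as a node process): the function names, the return strings, and whatever /healthz URL is passed in are all hypothetical. The NOTIFY_SOCKET datagram carrying READY=1 is the standard systemd readiness protocol for Type=notify services.

```python
import http.client
import os
import socket
import urllib.request

EXIT_ALREADY_RUNNING = 78  # exit code named in this report


def healthz_responds(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on url yields a 2xx response within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, malformed reply, ...
        return False


def startup_action(existing_healthz_url: str) -> str:
    """Probe first: defer to a responsive gateway instead of killing it.

    Returns "exit-78" (caller should exit with EXIT_ALREADY_RUNNING) or
    "kill-stale" (caller may proceed with the existing kill-stale logic).
    """
    if healthz_responds(existing_healthz_url):
        return "exit-78"
    return "kill-stale"


def sd_notify_ready() -> bool:
    """Send READY=1 to systemd; returns False when not run under Type=notify."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False
    if addr.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", addr)
    return True
```

For the Type=notify variant, the service would call sd_notify_ready() only after healthz_responds() returns True for its own endpoint, so systemd does not consider the unit started before it is actually serving.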

Impact

When this happens, every restart cycle sends a "gateway shutting down, your task will be interrupted" notification to connected messaging platforms (Telegram, etc.). A 30-minute cascade produces ~12-15 such notifications to every connected operator account. The gateway is functionally degraded throughout (each instance survives ~15 seconds; in-flight work is dropped repeatedly).

Related docs

/docs/gateway/gateway-lock.md — documents the exit-78 mechanism that is supposed to prevent exactly this scenario
/docs/gateway/troubleshooting.md — documents the EADDRINUSE / "Another process is listening" signature

Cross-reference

Memory inefficiency in the same WSL2 setup: openclaw/openclaw#80665
