openclaw - ✅(Solved) Fix [Bug]: openclaw doctor regenerates systemd user unit on every update, clobbering custom TimeoutStartSec / ExecStart / EnvironmentFile [1 pull requests, 2 comments, 3 participants]

openclaw/openclaw#80462 · Fetched 2026-05-11 03:14:20

openclaw doctor (which runs inside openclaw update) regenerates the systemd user unit at ~/.config/systemd/user/openclaw-gateway.service on every upgrade, wholesale-replacing it with a template. A .bak of the prior content is preserved, but the live unit is silently overwritten — operators only discover the regeneration when symptoms appear post-restart.

Fields overwritten in our case:

  • TimeoutStartSec (120 → 30)
  • ExecStart node binary path (e.g., ~/.nvm/versions/node/v26.1.0/bin/node → /usr/bin/node)
  • EnvironmentFile (e.g., ~/.openclaw/.env → -~/.openclaw/gateway.systemd.env, switching to a different file with the optional-load - prefix)
  • Inline Environment= keys (replaced with a fixed set listed in OPENCLAW_SERVICE_MANAGED_ENV_KEYS)
  • Adds StartLimitBurst=5, StartLimitIntervalSec=60, RestartPreventExitStatus=78

For our install, the load-bearing impact was the TimeoutStartSec drop to 30s. We use an ExecStartPost script that needs ~60–90s (health-wait poll + post-restart housekeeping + reconnect buffer). With only 30s, systemd SIGTERMs the otherwise-healthy gateway every restart cycle. NRestarts climbed to 22 in ~12 minutes before manual systemctl --user stop intervention.
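
One way to keep an operator-owned setting out of the regenerated file is a systemd drop-in, which lives in a .d directory next to the unit and is merged over it at load time. The sketch below assumes that doctor rewrites only openclaw-gateway.service itself and leaves the drop-in directory alone, which the issue does not confirm. The function name write_timeout_dropin and the file name override.conf are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: keep TimeoutStartSec in a drop-in instead of the main unit file.
# Assumption (not confirmed by the issue): doctor only rewrites the main unit
# and leaves <unit>.d/ untouched.

write_timeout_dropin() {
  local unit_dir="$1"   # e.g. ~/.config/systemd/user
  local unit="$2"       # e.g. openclaw-gateway.service
  local seconds="$3"    # e.g. 120
  local dropin_dir="${unit_dir}/${unit}.d"

  mkdir -p "${dropin_dir}"
  # A setting in a drop-in overrides the same setting in the main unit.
  printf '[Service]\nTimeoutStartSec=%s\n' "${seconds}" > "${dropin_dir}/override.conf"
}

# Usage (reload afterwards so systemd picks up the drop-in):
#   write_timeout_dropin "${HOME}/.config/systemd/user" openclaw-gateway.service 120
#   systemctl --user daemon-reload
```

This only covers single-valued settings such as TimeoutStartSec. Overriding ExecStart from a drop-in needs an empty ExecStart= line first, and it would not stop the main unit from switching EnvironmentFile.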

Suggested fix

  • Option A (preferred): Don't overwrite the live unit. Write the template to ~/.config/systemd/user/openclaw-gateway.service.suggested and emit a diff + recommended action.
  • Option B: Strict merge — preserve any user-set field that's not invalid; only replace fields that are missing or malformed.
  • Option C (smaller blast-radius mitigation if regeneration must persist): Ship the template with more generous timeouts (TimeoutStartSec=120, TimeoutStopSec=60) so the common ExecStartPost pattern works out of the box.
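
Option A can be sketched in a few lines of shell. This is an illustration of the proposed behavior, not OpenClaw's implementation: suggest_unit is a hypothetical name, and the candidate file stands in for whatever template doctor would render.

```shell
#!/usr/bin/env bash
# Sketch of Option A: never touch the live unit; write the candidate next to
# it as <unit>.suggested and show the operator what would change.

suggest_unit() {
  local live="$1"        # path to the live unit file
  local candidate="$2"   # path to a freshly rendered template (hypothetical input)
  local suggested="${live}.suggested"

  if cmp -s "${live}" "${candidate}"; then
    echo "unit is up to date: ${live}"
    return 0
  fi

  cp "${candidate}" "${suggested}"
  echo "service drift detected; live unit left unchanged"
  echo "review the diff, then apply with: cp '${suggested}' '${live}'"
  # diff exits 1 when the files differ, which is the expected case here.
  diff -u "${live}" "${suggested}" || true
}
```

The live file is only ever read, so a noninteractive update cannot change the running service's configuration; applying the suggestion stays an explicit operator action.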

PR fix notes

PR #80485: [codex] defer update-time systemd service rewrites

Description (problem / solution / changelog)

Summary

  • In Linux/systemd update-mode doctor repair, report gateway service drift but leave the live unit unchanged unless the operator explicitly forces replacement later.
  • Preserve existing non-systemd update-mode staging behavior.
  • Add focused doctor tests for the systemd deferral path and the non-systemd path.

Refs #80462.

Draft Status

Opening as Draft intentionally. This patch is a conservative policy proposal, not a final claim that the broader systemd ownership question is settled. Maintainer direction needed: should update-mode doctor defer live systemd rewrites like this, write a .suggested unit/diff, or perform a managed-field merge?

Root Cause

Update invokes noninteractive doctor repair mode. When doctor finds service drift, the Linux/systemd stage path rewrites the live unit from the canonical template. Because that template owns whole-file output, update-time doctor can replace operator-owned directives such as TimeoutStartSec, ExecStart, EnvironmentFile, and inline Environment= entries.

Real behavior proof

Behavior or issue addressed: openclaw update should not silently overwrite a customized live systemd gateway unit while running noninteractive doctor repair.

Real environment tested: local OpenClaw checkout on this branch using the doctor gateway service unit test harness.

Exact steps or command run after this patch:

env CI=true pnpm test src/commands/doctor-gateway-services.test.ts
pnpm exec oxfmt --check --threads=1 CHANGELOG.md src/commands/doctor-gateway-services.ts src/commands/doctor-gateway-services.test.ts
git diff --check

Evidence after fix:

src/commands/doctor-gateway-services.test.ts: 26 tests passed
oxfmt check completed successfully for the touched files
git diff --check completed with no whitespace errors

Observed result after fix: the systemd update-mode test detects service drift, emits the “left the live systemd unit unchanged” note, and does not call the service stage or install writer. A companion non-systemd update-mode test still calls stage.

What was not tested: no live WSL2/systemd update was run; no broad pnpm check:changed/full build gate was run for this draft patch.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/commands/doctor-gateway-services.test.ts (modified, +27/-3)
  • src/commands/doctor-gateway-services.ts (modified, +24/-0)

Original issue body

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Yes — reproduces on 2026.5.10-beta.1 and 2026.5.10-beta.2

Steps to reproduce

  1. Customize the systemd user unit (any field doctor doesn't expect, e.g., TimeoutStartSec=120):
    sed -i 's/^TimeoutStartSec=.*/TimeoutStartSec=120/' ~/.config/systemd/user/openclaw-gateway.service
    systemctl --user daemon-reload
  2. Run any operation that invokes openclaw doctor (any openclaw update does):
    openclaw update --tag 2026.5.10-beta.2 --no-restart --yes
  3. Inspect the unit afterward:
    grep TimeoutStartSec ~/.config/systemd/user/openclaw-gateway.service
    diff ~/.config/systemd/user/openclaw-gateway.service{,.bak}
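
The inspection in step 3 can be turned into a check that fails loudly after an update. This is a minimal sketch: check_timeout_start is a hypothetical helper, and it only understands TimeoutStartSec written as a plain number of seconds, not systemd time spans such as 2min.

```shell
#!/usr/bin/env bash
# Sketch: fail if the unit's TimeoutStartSec is below what ExecStartPost needs.

check_timeout_start() {
  local unit_file="$1" min_seconds="$2"
  local value
  # Use the last TimeoutStartSec= assignment in the file.
  value="$(sed -n 's/^TimeoutStartSec=\([0-9][0-9]*\)$/\1/p' "${unit_file}" | tail -n 1)"

  if [ -z "${value}" ]; then
    echo "TimeoutStartSec not set as plain seconds in ${unit_file}" >&2
    return 2
  fi
  if [ "${value}" -lt "${min_seconds}" ]; then
    echo "TimeoutStartSec=${value} is below the required ${min_seconds}s" >&2
    return 1
  fi
  echo "TimeoutStartSec=${value} ok"
}

# Usage:
#   check_timeout_start "${HOME}/.config/systemd/user/openclaw-gateway.service" 120
```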

Expected behavior

doctor should either:

  • Leave the live unit alone and emit a recommended diff for the operator to review/apply, or
  • Strictly merge — preserve any field the user has set that's not invalid (only replace fields that are missing or malformed)
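
The strict-merge idea can be illustrated with a small awk filter. This is a deliberately narrow sketch, not a proposal for doctor's real merge logic: merge_unit is a hypothetical name, the caller supplies the list of directives to preserve, and it ignores section boundaries, repeated directives such as Environment=, and directives that exist only in the live unit.

```shell
#!/usr/bin/env bash
# Sketch of a strict merge: print the template, but keep the live unit's value
# for each directive named in the preserve list.

merge_unit() {
  local live="$1" template="$2" keys="$3"   # keys: space-separated directive names
  awk -v keys="${keys}" '
    BEGIN { n = split(keys, k, " "); for (i = 1; i <= n; i++) keep[k[i]] = 1 }
    # First file (live unit): remember the operator line for each preserved key.
    NR == FNR {
      eq = index($0, "=")
      if (eq > 1) { name = substr($0, 1, eq - 1); if (name in keep) live[name] = $0 }
      next
    }
    # Second file (template): swap in the remembered live line where one exists.
    {
      eq = index($0, "=")
      name = (eq > 1) ? substr($0, 1, eq - 1) : ""
      if (name in live) print live[name]; else print
    }
  ' "${live}" "${template}"
}

# Usage:
#   merge_unit live.service template.service "TimeoutStartSec TimeoutStopSec" > merged.service
```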

Actual behavior

The live unit is replaced wholesale with a template. The previous copy is preserved as openclaw-gateway.service.bak, but no warning is emitted about which fields were replaced.

OpenClaw version

2026.5.10-beta.1 (regenerated unit caused the restart loop); 2026.5.10-beta.2 (same regeneration behavior observed)

Operating system

WSL2 Ubuntu (Linux 6.6.x-microsoft-standard-WSL2)

Install method

npm global, nvm-managed Node 26.1.0

Model

N/A — bug is in update path

Provider / routing chain

N/A

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Diff observed (pre-doctor .bak → post-doctor live):

< Description=OpenClaw Gateway (v2026.5.7)
> Description=OpenClaw Gateway (v2026.5.10-beta.1)
> StartLimitBurst=5
> StartLimitIntervalSec=60
< ExecStart=/home/<user>/.nvm/versions/node/v26.1.0/bin/node /home/<user>/.nvm/versions/node/v26.1.0/lib/node_modules/openclaw/dist/index.js gateway --port 18789
> ExecStart=/usr/bin/node /home/<user>/.nvm/versions/node/v26.1.0/lib/node_modules/openclaw/dist/index.js gateway --port 18789
> RestartPreventExitStatus=78
< TimeoutStartSec=120
> TimeoutStartSec=30
< EnvironmentFile=/home/<user>/.openclaw/.env
< Environment=WSL_HOST_IP=...
< Environment=LCM_LEAF_CHUNK_TOKENS=100000
< Environment=LCM_FRESH_TAIL_COUNT=64
> EnvironmentFile=-/home/<user>/.openclaw/gateway.systemd.env
> Environment=OPENCLAW_SERVICE_MANAGED_ENV_KEYS=<TOKEN_KEY_LIST>

Systemd journal during the resulting restart loop:

openclaw-gateway.service: start-post operation timed out. Terminating.
openclaw-gateway.service: Control process exited, code=killed, status=15/TERM

The [gateway] ready log line was reached ~14–17s into startup; the ExecStartPost work was then cut off at the 30s ceiling, which triggered the SIGTERM cascade.
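
To confirm that a restart loop has this cause, the journal can be searched for the timeout message quoted above. The helper name count_start_post_timeouts is illustrative; the journalctl and systemctl invocations in the usage comment are standard, but the 15-minute window is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: count start-post timeouts in journal text supplied on stdin.

count_start_post_timeouts() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the count.
  grep -c 'start-post operation timed out' || true
}

# Usage:
#   journalctl --user -u openclaw-gateway.service --since "15 minutes ago" --no-pager \
#     | count_start_post_timeouts
#   systemctl --user show openclaw-gateway.service -p NRestarts
```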

Adjacent observation

The doctor-written EnvironmentFile=-/home/<user>/.openclaw/gateway.systemd.env uses the - (optional) prefix. If the user had EnvironmentFile=/home/<user>/.openclaw/.env (required) and the new file doesn't exist or has different keys, the gateway starts without the expected env vars — channel auth, model API keys, etc. can silently fail.
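
Because an optional EnvironmentFile fails silently, it can help to check after each update that every file the unit references actually exists. The sketch below uses a hypothetical helper, check_environment_files; it does not expand systemd specifiers such as %h, and it does not compare the keys inside the files.

```shell
#!/usr/bin/env bash
# Sketch: report EnvironmentFile= entries whose target is missing or unreadable.

check_environment_files() {
  local unit_file="$1" status=0 path optional
  while IFS= read -r path; do
    optional=no
    # A leading "-" tells systemd to ignore a missing file.
    if [ "${path#-}" != "${path}" ]; then
      optional=yes
      path="${path#-}"
    fi
    if [ ! -r "${path}" ]; then
      echo "missing EnvironmentFile (optional=${optional}): ${path}"
      status=1
    fi
  done < <(sed -n 's/^EnvironmentFile=//p' "${unit_file}")
  return "${status}"
}

# Usage:
#   check_environment_files "${HOME}/.config/systemd/user/openclaw-gateway.service"
```

Reporting optional entries as well is deliberate: systemd would start the service without them, which is exactly the silent failure described above.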

Impact and severity

High when combined with an ExecStartPost script that exceeds the default 30s — produces a tight restart loop on an otherwise-healthy gateway. Also high when the EnvironmentFile swap causes secrets to silently drop out.

Additional information

Today's 2026.5.10-beta.1 upgrade hit this in production; the gateway entered a 22-restart loop until we restored TimeoutStartSec=120 manually and re-issued daemon-reload + start. We've since added a post-upgrade fixup step that re-applies the customizations after every openclaw update to mitigate, but the underlying regeneration behavior remains.
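
A post-upgrade fixup of this kind might look like the sketch below. It is not the reporter's actual script: set_directive is a hypothetical helper that generalizes the sed command from the reproduction steps, it assumes GNU sed, and it assumes values contain no | or & characters.

```shell
#!/usr/bin/env bash
# Sketch: re-apply a single-valued directive to a unit file after an update.

set_directive() {
  local unit_file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "${unit_file}"; then
    # Directive present: replace its value in place.
    sed -i "s|^${key}=.*|${key}=${value}|" "${unit_file}"
  else
    # Directive absent: add it directly under the [Service] header.
    sed -i "/^\[Service\]$/a ${key}=${value}" "${unit_file}"
  fi
}

# Usage (run after every openclaw update, then reload):
#   unit="${HOME}/.config/systemd/user/openclaw-gateway.service"
#   set_directive "${unit}" TimeoutStartSec 120
#   systemctl --user daemon-reload
```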
