openclaw - ✅ (Solved) doctor --fix self-terminates with SIGTERM when gateway is running [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#78217 · Fetched 2026-05-07 03:39:31
Comments: 1 · Participants: 2 · Timeline: 2 · Reactions: 2
Timeline (top): commented ×1, cross-referenced ×1

openclaw doctor --fix aborts with SIGTERM when the gateway service is running, instead of either completing safe-live fixes or cleanly skipping them.

Observed multiple times in the same session:

  • Gateway active under systemd user supervision (systemctl --user status openclaw-gateway.service)
  • /healthz returning {"ok":true,"status":"live"}
  • openclaw doctor --fix --non-interactive emits some output, then exits via SIGTERM before completing all fix actions
  • Gateway-port section of doctor output includes:
    Health check failed: GatewayTransportError: gateway timeout after 3000ms
    Gateway target: ws://127.0.0.1:18789
    ...
    Port 18789 is already in use.
    - pid <X> shadeform: openclaw (127.0.0.1:18789)
    - Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.

The doctor self-confuses: it can't talk to the gateway over WS (3-second timeout), then concludes the port is "already in use" (by the same gateway that just answered /healthz), then SIGTERMs itself.

Error Message

Health check failed: GatewayTransportError: gateway timeout after 3000ms
  Gateway target: ws://127.0.0.1:18789
  ...
  Port 18789 is already in use.
  - pid <X> shadeform: openclaw (127.0.0.1:18789)
  - Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.

Fix Action

PR fix notes

PR #78223: fix: avoid doctor live gateway restarts

Description (problem / solution / changelog)

Summary

Fixes doctor --fix --non-interactive treating a live, supervised gateway as a repair target after a short status RPC timeout.

Changes

  • Use the normal 10s gateway health timeout for non-interactive doctor instead of the previous 3s timeout.
  • When the gateway service is loaded/running and owns the configured port, skip doctor-triggered gateway restart after a failed status RPC.
  • Add regression coverage for the running-service-owns-port case.
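
The second change can be pictured as a small decision function. This is a minimal sketch, not the code from PR #78223: every name here (GatewayObservation, decideGatewayAction, the PID-equality ownership test) is an assumption made for illustration, and the real flow in src/commands/doctor-gateway-daemon-flow.ts may determine port ownership differently.

```typescript
// Hypothetical sketch; names and the ownership test are assumptions, not the PR's code.

// The PR describes moving from a 3s timeout to the normal 10s gateway health timeout.
const HEALTH_TIMEOUT_MS = 10000;

interface GatewayObservation {
  statusRpcOk: boolean;        // did the status RPC answer within HEALTH_TIMEOUT_MS?
  serviceRunning: boolean;     // is the supervised gateway service loaded and running?
  servicePid: number | null;   // main PID reported by the supervisor, if known
  portOwnerPid: number | null; // PID listening on the configured port, if any
}

type GatewayAction = "none" | "skip-restart" | "restart";

function decideGatewayAction(obs: GatewayObservation): GatewayAction {
  // A healthy status RPC means there is nothing to repair.
  if (obs.statusRpcOk) return "none";

  // Assumption: "owns the configured port" is modelled as PID equality.
  const serviceOwnsPort =
    obs.serviceRunning &&
    obs.servicePid !== null &&
    obs.servicePid === obs.portOwnerPid;

  // A running service that owns the port is left alone after a failed RPC.
  return serviceOwnsPort ? "skip-restart" : "restart";
}
```

Under these assumptions, a timed-out status RPC against a supervised gateway that holds port 18789 yields "skip-restart" rather than a doctor-triggered restart.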

Testing

  • PATH="/tmp/openclaw-pnpm-shim:$PATH" pnpm exec oxfmt --check src/commands/doctor-gateway-daemon-flow.ts src/commands/doctor-gateway-daemon-flow.test.ts src/flows/doctor-health-contributions.ts
  • git diff --check
  • Targeted vitest attempted but blocked before tests by missing local @openclaw/fs-safe/config.
  • PATH="/tmp/openclaw-pnpm-shim:$PATH" node scripts/check-changed.mjs passed early lanes and failed in unrelated existing core typecheck diagnostics including missing @openclaw/fs-safe/*.

Fixes openclaw/openclaw#78217

Changed files

  • src/commands/doctor-gateway-daemon-flow.test.ts (modified, +19/-0)
  • src/commands/doctor-gateway-daemon-flow.ts (modified, +27/-1)
  • src/flows/doctor-health-contributions.ts (modified, +1/-1)


Environment

  • OpenClaw 2026.5.4 (commit 325df3e)
  • Linux (Ubuntu 24.04)
  • Gateway running under systemd user supervision (systemctl --user)

Reproduction

  1. Have gateway running and healthy: curl http://127.0.0.1:18789/healthz returns {"ok":true,"status":"live"}
  2. Run openclaw doctor --fix --non-interactive
  3. Observe the run aborts via SIGTERM before completing all fixes
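
Step 1 can be checked mechanically before running doctor. The helper below is a minimal sketch that only interprets the /healthz body shown above; the function name isGatewayLive is hypothetical, and it assumes the body has already been fetched (for example with curl).

```typescript
// Hypothetical helper: interprets a /healthz response body as "live" or not.
function isGatewayLive(rawBody: string): boolean {
  try {
    const parsed = JSON.parse(rawBody);
    // Matches the body reported in the issue: {"ok":true,"status":"live"}
    return parsed.ok === true && parsed.status === "live";
  } catch (_err) {
    // Non-JSON or null bodies are treated as "not live".
    return false;
  }
}
```

If this returns true and doctor --fix --non-interactive still exits via SIGTERM, the run matches the behavior reported here.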

Expected behavior

doctor --fix should either:

  • (a) preflight-classify each fix as safe-live vs requires-restart, apply the safe-live ones, and skip-with-explanation the others, OR
  • (b) refuse to run with --fix against a live gateway and tell the user to stop it first

What it should NOT do is partially run, then SIGTERM itself mid-stream, leaving the user uncertain about what got applied.
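
Option (a) amounts to planning before acting. The following is a minimal sketch under stated assumptions: the types, the planFixes function, and the fix ids in the usage note are all hypothetical, and nothing here reflects how doctor actually models its fixes.

```typescript
// Hypothetical sketch of option (a): classify first, then apply or skip.
type FixSafety = "safe-live" | "requires-restart";

interface DoctorFix {
  id: string;        // illustrative identifier, not a real doctor fix name
  safety: FixSafety; // preflight classification
}

interface FixPlan {
  apply: string[];
  skipped: { id: string; reason: string }[];
}

function planFixes(fixes: DoctorFix[], gatewayLive: boolean): FixPlan {
  const plan: FixPlan = { apply: [], skipped: [] };
  for (const fix of fixes) {
    if (gatewayLive && fix.safety === "requires-restart") {
      // Skip with an explanation instead of touching the live gateway.
      plan.skipped.push({
        id: fix.id,
        reason: "gateway is live; stop it before applying this fix",
      });
    } else {
      plan.apply.push(fix.id);
    }
  }
  return plan;
}
```

With a live gateway, a made-up list such as archive-orphan-transcripts (safe-live) and rebind-gateway-port (requires-restart) would produce one applied fix and one skipped fix with a reason, so the user knows exactly what was and was not done.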

Impact

  • Operators can't use --fix for routine maintenance against a running gateway
  • Manual cleanup is required (renaming orphan transcripts, archiving stale agent dirs, etc.)
  • Self-termination during incident response increases risk and confusion

Workarounds

  • Do cleanup steps manually: archive orphan transcripts by renaming each *.jsonl file to a *.deleted.<ts> name (one mv per file, since mv does not rewrite a glob destination), move stale agent dirs to a .archived/ sibling, etc.
  • Use openclaw doctor --non-interactive (no --fix) to validate state — that one runs cleanly against a live gateway

Suggested fix

  1. The gateway-port check should not emit the "Port already in use" warning when the PID holding the port is the running gateway service's PID (it's the same process answering /healthz and listening on 18789, so there's no conflict).
  2. The 3s WS timeout on a loopback gateway is too short under load; bump to 10s or read from config.
  3. The SIGTERM appears to come from doctor's own port-bind probe trying to bind 18789 itself to "verify" it's free. That probe should be skipped when a live gateway is detected.
  4. Consider whether --fix should ever be allowed against a running gateway — if not, exit cleanly with code 78 ("config valid, but live gateway detected; stop it first") rather than SIGTERM.
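
Items 1 to 4 can be sketched together as follows. This is an illustration under stated assumptions, not doctor's implementation: classifyPort, shouldRunBindProbe, resolveWsTimeoutMs, and the constant names are hypothetical, and the bind-probe explanation in item 3 is itself the reporter's hypothesis.

```typescript
// Hypothetical sketch of the suggested fixes; all names are assumptions.

const DEFAULT_WS_TIMEOUT_MS = 10000; // item 2: 10s instead of 3s
const EXIT_LIVE_GATEWAY = 78;        // item 4: proposed clean exit code

type PortState = "free" | "owned-by-gateway" | "conflict";

// Item 1: a port held only by the gateway service's own PID is not a conflict.
function classifyPort(listenerPids: number[], gatewayServicePid: number | null): PortState {
  if (listenerPids.length === 0) return "free";
  const onlyGateway =
    gatewayServicePid !== null &&
    listenerPids.every((pid) => pid === gatewayServicePid);
  return onlyGateway ? "owned-by-gateway" : "conflict";
}

// Item 3: skip the bind probe when the live gateway already holds the port.
function shouldRunBindProbe(state: PortState): boolean {
  return state !== "owned-by-gateway";
}

// Item 2: prefer a configured timeout, otherwise fall back to the 10s default.
function resolveWsTimeoutMs(configured?: number): number {
  return typeof configured === "number" && configured > 0
    ? configured
    : DEFAULT_WS_TIMEOUT_MS;
}

// Item 4: if --fix is disallowed against a live gateway, exit cleanly.
function exitCodeForFix(state: PortState, allowLiveFix: boolean): number {
  return state === "owned-by-gateway" && !allowLiveFix ? EXIT_LIVE_GATEWAY : 0;
}
```

The design point is that "port in use" becomes a three-way classification, so the warning and the probe fire only for "conflict" or "free" and never for the gateway's own listener.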
