openclaw: [Bug] node-host silent-zombie after WS close 1012 (service restart); stays alive under KeepAlive but never reconnects and writes no log [2 comments, 3 participants]

GitHub stats
openclaw/openclaw#69800 · Fetched 2026-04-22 07:48:11
Comments: 2 · Participants: 3 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×2, mentioned ×2, subscribed ×1

On macOS, after a burst of WS close code 1012 (service restart) + ECONNREFUSED during a gateway restart, the node-host process remains alive under launchd (KeepAlive=true), stops writing to stderr, stops appearing in openclaw nodes list (stale Last Connect), and only recovers via launchctl kickstart -k.

Error Message

Asymmetry: same token, HTTP works, WS from openclaw doctor fails

$ openclaw doctor
... gateway connect failed: GatewayClientRequestError: unauthorized: gateway token mismatch ...

Meanwhile, curl with the same token returned 200 (see the gateway health check under Code Example below).

Impact and severity (root cause not yet identified)

  • Affected: macOS installs using user-level launchd agents for node + gateway on loopback with token auth (the standard single-host layout).
  • Severity: High when triggered — every node-backed capability (skills-remote, exec, subagents that route through the node) silently degrades. There is no user-visible error because the process is alive and logging nothing.
  • Frequency: Observed once in ~2 weeks of continuous operation on 2026.4.15. Related #24988 reports 3 occurrences after gateway updates. Not a one-off.
  • Consequence: All node-backed flows fail or time out until someone notices and runs kickstart. Since the failure is silent, detection depends on external monitoring.

Fix Action

Fix / Workaround

  • Related issues that appear to describe the same failure family, all closed stale, not fixed:
    • #24988 — "After gateway update, companion app node connectivity drops silently — remote bin probe times out, requires host reboot". Closed 2026-03-26; close comment explicitly invites reopening on newer releases — which is what this report does.
    • #11944 — "Ghost node blocks all exec calls after unclean disconnect".
    • #3296 — "Node frequently disconnects when connected via SSH tunnel over Tailscale" (different trigger, possibly same reconnect codepath).
  • Current workaround: 5-minute local watchdog that polls openclaw nodes list, and issues launchctl kickstart -k gui/$UID/ai.openclaw.node when Last Connect > 5m, with a 30-minute debounce.
  • Happy to capture a structured trace with OPENCLAW_DIAGNOSTICS=gateway.* on next reproduction if that would help narrow the codepath.
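
The watchdog described above could look roughly like the following. This is a minimal sketch, not the reporter's actual script: the function names, the stamp-file path, and the unit suffixes other than `h` (`s`, `m`, `d`) are assumptions, and the parser only understands the `Last Connect: ... ago` / `just now` format quoted in this report.

```shell
#!/bin/bash
# Hypothetical watchdog sketch; not shipped with openclaw.
# Assumes `openclaw nodes list` prints "Last Connect: <N><unit> ago" or
# "Last Connect: just now", as in the output quoted in this report.

LABEL="gui/$(id -u)/ai.openclaw.node"
STALE_AFTER=300     # kick when Last Connect is older than 5 minutes
DEBOUNCE=1800       # never kick more than once per 30 minutes
STAMP="${TMPDIR:-/tmp}/openclaw-node-watchdog.stamp"   # hypothetical path

# Convert "just now", "42s ago", "7m ago", "25h ago" or "3d ago" to seconds.
# Returns 1 (and prints nothing) for anything it does not recognise.
age_to_seconds() {
  local text="$1" num unit
  case "$text" in
    "just now"*) echo 0; return 0 ;;
  esac
  num="${text%%[!0-9]*}"          # leading digits
  [ -n "$num" ] || return 1
  unit="${text#"$num"}"
  unit="${unit%% *}"              # unit letter directly after the number
  case "$unit" in
    s) echo "$num" ;;
    m) echo $((num * 60)) ;;
    h) echo $((num * 3600)) ;;
    d) echo $((num * 86400)) ;;
    *) return 1 ;;
  esac
}

# Read `openclaw nodes list` output on stdin; print the first node's age.
last_connect_age() {
  local value
  value="$(sed -n 's/.*Last Connect: *//p' | head -n 1)"
  age_to_seconds "$value"
}

# should_kick AGE NOW LAST_KICK: succeed when the node is stale and the
# previous kickstart is outside the debounce window (all values in seconds).
should_kick() {
  local age="$1" now="$2" last_kick="$3"
  [ "$age" -gt "$STALE_AFTER" ] && [ $((now - last_kick)) -ge "$DEBOUNCE" ]
}

# One polling pass; does nothing if the age cannot be determined.
watchdog_once() {
  local age now last_kick
  age="$(openclaw nodes list 2>/dev/null | last_connect_age)" || return 0
  now="$(date +%s)"
  last_kick="$(cat "$STAMP" 2>/dev/null || echo 0)"
  if should_kick "$age" "$now" "$last_kick"; then
    echo "$now" > "$STAMP"
    launchctl kickstart -k "$LABEL"
  fi
}
```

To use it, call `watchdog_once` at the end of the script and run the script every 5 minutes, for example from a separate launchd agent or cron entry. Keeping the parsing and the decision in separate functions makes the stale/debounce logic checkable without a running gateway.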

Code Example

# Last lines of node.err.log before it goes silent (2026-04-20 ~12:12 -03).
# These come from the PRIOR instance; the zombie instance that started
# at 12:54:44 never wrote to this file.

node host gateway connect failed: connect ECONNREFUSED 127.0.0.1:18789
node host gateway closed (1006):
node host gateway closed (1012): service restart
node host gateway connect failed: connect ECONNREFUSED 127.0.0.1:18789
node host gateway closed (1006):
node host gateway closed (1012): service restart
... [repeating for dozens of lines]

---

# Process state at detection (2026-04-21 ~13:30 -03, ~25h after the bad event)
$ ps -o pid,etime,command -p <pid>
  PID     ELAPSED COMMAND
86501 01-01:05:58 openclaw-node

$ ls -la ~/.openclaw/logs/node.err.log
-rw------- ... Apr 20 12:12 ...   # mtime 42 min BEFORE the process started

---

# Gateway healthy during the entire zombie window
$ curl -sS -H "Authorization: Bearer $TOKEN" http://127.0.0.1:18789/health
{"ok":true,"status":"live"}

$ openclaw nodes list
Paired: 1
Mac Studio ...                                              Last Connect: 25h ago

# Recovery
$ launchctl kickstart -k gui/$(id -u)/ai.openclaw.node
$ sleep 10 && openclaw nodes list
Mac Studio ...                                              Last Connect: just now

---

# Asymmetry: same token, HTTP works, WS from openclaw doctor fails
$ openclaw doctor
... gateway connect failed: GatewayClientRequestError: unauthorized: gateway token mismatch ...
# Meanwhile curl with the same token returned 200 in the block above.

Bug type

Crash (process/app exits or hangs) — process stays alive but hangs functionally (no reconnect, no log).

Beta release blocker

No

Steps to reproduce

Not deterministic. Observed trigger sequence:

  1. A quarterly token rotation boots out and re-bootstraps (launchctl bootout/bootstrap) the gateway, then the node (in that order).
  2. The node, already running, receives a burst of WS close 1012 + ECONNREFUSED events over a few seconds while the gateway is restarting.
  3. A node process that starts shortly after the rotation (verified via ps etime) never writes to its stderr log file, never appears as connected in openclaw nodes list, and stays this way indefinitely (observed: 25h).

There is not enough information from a single observation to give a deterministic repro path; I can re-capture with OPENCLAW_DIAGNOSTICS=gateway.* next time it occurs.

Expected behavior

Same behavior as a healthy session after a clean restart, i.e. one of:

  • keep retrying with bounded exponential backoff indefinitely, logging every attempt to stderr;
  • exit the process, so KeepAlive=true respawns it cleanly;
  • expose a documented liveness signal or health field that operators can poll without parsing nodes list.

Any of these would let operators detect and recover without external polling + kickstart. Reference: #24988 describes the same failure mode and explicitly requests one of these paths.
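
The first two options can be combined: retry with capped exponential backoff, log every attempt to stderr, and return a failure after a fixed number of attempts so the caller can exit and let launchd respawn it. The sketch below illustrates that shape in shell only. It is not the node-host reconnect code (which lives in the Node.js package), and `backoff_delay`, `retry_with_backoff`, `BACKOFF_BASE` and `BACKOFF_CAP` are hypothetical names.

```shell
#!/bin/bash
# Illustrative only: bounded exponential backoff with per-attempt logging.

# backoff_delay ATTEMPT [BASE] [CAP]: delay in seconds for a 0-based attempt,
# doubling from BASE and never exceeding CAP.
backoff_delay() {
  local attempt="$1" base="${2:-1}" cap="${3:-60}" delay
  delay="$base"
  while [ "$attempt" -gt 0 ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    attempt=$((attempt - 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay="$cap"; fi
  echo "$delay"
}

# retry_with_backoff MAX CMD...: run CMD until it succeeds. Every failure is
# logged to stderr. After MAX failures return 1, so the caller can exit and
# a KeepAlive=true launchd job gets respawned instead of hanging silently.
retry_with_backoff() {
  local max="$1" attempt=0 delay
  shift
  while :; do
    if "$@"; then return 0; fi
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      echo "attempt $attempt/$max failed; giving up" >&2
      return 1
    fi
    delay="$(backoff_delay $((attempt - 1)) "${BACKOFF_BASE:-1}" "${BACKOFF_CAP:-60}")"
    echo "attempt $attempt/$max failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
}
```

As a usage example, `retry_with_backoff 10 curl -fsS -o /dev/null http://127.0.0.1:18789/health || exit 1` uses the HTTP health endpoint from this report as a stand-in for the WebSocket connect. Either outcome leaves a trace in stderr, which is the property missing in the zombie state.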

Actual behavior

  • Node process alive (ps -o etime = 25h) under launchd.
  • ~/.openclaw/logs/node.err.log mtime stays at a point 42 minutes before the process start — i.e. the zombie instance never wrote to stderr at all.
  • openclaw nodes list → Mac Studio ... Last Connect: 25h ago continuously during the 25h window.
  • Gateway is healthy the whole time: curl -H "Authorization: Bearer $TOKEN" http://127.0.0.1:18789/health → {"ok":true,"status":"live"} with the same token the node has in its plist.
  • launchctl kickstart -k gui/$(id -u)/ai.openclaw.node → Last Connect: just now within seconds.
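
The "log mtime predates process start" signal in the list above can be checked mechanically. The following is a rough sketch under stated assumptions: the helper names are hypothetical, and the parser handles only the `[[dd-]hh:]mm:ss` elapsed-time format that `ps -o etime` prints.

```shell
#!/bin/bash
# Hypothetical helpers for spotting a process that has never written its log.

# strip_zeros N: drop leading zeros so "08" is not treated as octal.
strip_zeros() {
  local v="$1"
  while [ "${v#0}" != "$v" ] && [ -n "${v#0}" ]; do v="${v#0}"; done
  echo "$v"
}

# etime_to_seconds ETIME: convert ps etime ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  local etime="$1" days=0 hours=0 mins secs rest
  case "$etime" in
    *-*) days="${etime%%-*}"; etime="${etime#*-}" ;;
  esac
  secs="${etime##*:}"
  rest="${etime%:*}"
  case "$rest" in
    *:*) hours="${rest%%:*}"; mins="${rest#*:}" ;;
    *)   mins="$rest" ;;
  esac
  echo $(( $(strip_zeros "$days") * 86400 + $(strip_zeros "$hours") * 3600 \
         + $(strip_zeros "$mins") * 60 + $(strip_zeros "$secs") ))
}

# log_predates_process LOG_MTIME NOW ELAPSED: succeed when the log was last
# modified before the process started (all values in epoch seconds/seconds).
log_predates_process() {
  local log_mtime="$1" now="$2" elapsed="$3"
  [ "$log_mtime" -lt $((now - elapsed)) ]
}

# Example wiring on macOS (BSD stat); pid is the node-host process id:
#   elapsed="$(etime_to_seconds "$(ps -o etime= -p "$pid" | tr -d ' ')")"
#   mtime="$(stat -f %m ~/.openclaw/logs/node.err.log)"
#   log_predates_process "$mtime" "$(date +%s)" "$elapsed" && echo "suspect"
```

A freshly started healthy process would also match for a moment before its first write, so in practice this check probably needs a grace period before it is treated as a zombie indicator.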

OpenClaw version

2026.4.15 (041266a)

Operating system

macOS 26.4.1 (Darwin 25.4.0, arm64) — Mac Studio

Install method

Homebrew global (/opt/homebrew/lib/node_modules/openclaw), launchd user agents (gui/$UID/ai.openclaw.gateway + gui/$UID/ai.openclaw.node), loopback bind 127.0.0.1:18789, KeepAlive=true + ThrottleInterval=1 on both plists.

Model

N/A — bug is in node-host WS reconnect lifecycle, independent of inference model.

Provider / routing chain

N/A — bug is on the local node ↔ gateway WS loopback; no external provider involved.

Additional provider/model setup details

N/A

Logs, screenshots, and evidence

Tokens verified identical (first 16 + last 8 chars compared) across: SOPS, Keychain, ~/.openclaw/openclaw.json (auth+remote), gateway plist, node plist. No config drift.
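
A comparison of this kind (first 16 + last 8 characters) might be scripted as below. This is a sketch with hypothetical function names. How each token is read out of SOPS, the Keychain, openclaw.json, or the plists is deliberately left out, since those steps are not shown in the report.

```shell
#!/bin/bash
# Hypothetical helpers: compare tokens by prefix/suffix without printing them
# in full.

# token_fingerprint TOKEN: print "<first 16>...<last 8>".
# Refuses tokens shorter than 24 characters, where the two parts would overlap.
token_fingerprint() {
  local token="$1" len="${#1}"
  if [ "$len" -lt 24 ]; then
    echo "token too short to fingerprint" >&2
    return 1
  fi
  printf '%s...%s\n' \
    "$(printf '%s' "$token" | cut -c1-16)" \
    "$(printf '%s' "$token" | cut -c$((len - 7))-"$len")"
}

# all_fingerprints_match TOKEN...: succeed only if every token has the same
# fingerprint as the first one.
all_fingerprints_match() {
  local first fp token
  first="$(token_fingerprint "$1")" || return 1
  shift
  for token in "$@"; do
    fp="$(token_fingerprint "$token")" || return 1
    [ "$fp" = "$first" ] || return 1
  done
  return 0
}
```

Note that a matching fingerprint does not prove the tokens are byte-identical: a difference in the middle of the string would go unnoticed. Comparing a hash of each full value would be a stricter check.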

extent analysis

TL;DR

The node process hangs after a burst of WS close code 1012 and ECONNREFUSED events during a gateway restart, and can only be recovered via launchctl kickstart -k.

Guidance

  • Investigate the reconnect logic in the node-host WS lifecycle to identify why the process hangs after a gateway restart.
  • Verify that the token rotation and authentication mechanisms are working correctly, as the openclaw doctor command fails with a token mismatch error despite the token being verified as identical across all configurations.
  • Consider implementing a bounded exponential backoff mechanism for the node to retry connecting to the gateway after a restart, to prevent the process from hanging indefinitely.
  • Review the related issues (#24988, #11944, #3296) to see if there are any common patterns or codepaths that could be contributing to this issue.

Example

No code snippet is provided as the issue is more related to the overall system behavior and token authentication rather than a specific code block.

Notes

The issue seems to be related to the node-host WS reconnect lifecycle and token authentication. The fact that the openclaw doctor command fails with a token mismatch error despite the token being verified as identical across all configurations suggests that there might be an issue with the token validation or authentication mechanism.

Recommendation

Apply a workaround, such as the 5-minute local watchdog that polls openclaw nodes list and issues launchctl kickstart -k gui/$UID/ai.openclaw.node when Last Connect > 5m, until a more permanent fix can be implemented. This will at least provide a way to detect and recover from the issue, even if it's not a complete solution.
