openclaw - ✅(Solved) Fix [Bug] 2026.4.24 on WSL2: Ghost EADDRINUSE loop & systemd split-brain [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#72693 (fetched 2026-04-28 06:33:24)
Comments: 0
Participants: 1
Timeline: 2
Reactions: 0
Timeline (top): cross-referenced ×2

Root Cause

Root Cause Analysis: The "Ghost" EADDRINUSE

Fix Action

Fixed

PR fix notes

PR #72820: Fix: Issue 72693 gateway service lifecycle

Description (problem / solution / changelog)

Summary

Describe the problem and fix in 2–5 bullets:

If this PR fixes a plugin beta-release blocker, title it fix(<plugin-id>): beta blocker - <summary> and link the matching Beta blocker: <plugin-name> - <summary> issue labeled beta-blocker. Contributors cannot label PRs, so the title is the PR-side signal for maintainers and automation.

  • Problem: openclaw doctor --fix could install a user-level gateway service even when a system-level OpenClaw gateway service already existed, and installed services could keep using an old pinned --port after gateway.port changed.
  • Why it matters: Both cases can leave users with confusing gateway lifecycle drift, including duplicate supervisors or a service restarting on the wrong port.
  • What changed: Doctor now blocks automatic Linux user-service install when a system-level OpenClaw gateway-like service is detected, and service audit/repair now detects and rewrites stale pinned gateway ports.
  • What did NOT change (scope boundary): This does not change WSL2 ghost listener handling, health monitor behavior, Gmail watcher behavior, broad gateway startup lifecycle, or intentional multi-gateway setups with isolated ports/config/state.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #72693
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

For bug fixes or regressions, explain why this happened, not just what changed. Otherwise write N/A. If the cause is unclear, write Unknown.

  • Root cause: Gateway service repair audited several persisted supervisor fields, but it did not compare the installed command’s pinned --port against the current resolved gateway port. The missing-service install path also treated “user service not loaded” as enough to offer installation, without first checking for system-level OpenClaw gateway services.
  • Missing detection / guardrail: No test covered stale service --port drift, and no doctor daemon-flow test covered system-level service presence while the user-level service was missing.
  • Contributing context (if known): Existing deep service discovery already had the information needed for duplicate supervisor detection, but the install flow did not use it as a blocker.
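
The missing comparison described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper names `parsePinnedPort` and `auditGatewayPort` and the `AuditIssue` shape are hypothetical, and only the `gateway-port-mismatch` code and the two `--port` argument forms come from the PR text.

```typescript
// Hypothetical sketch of a pinned-port audit; names are illustrative only.

interface AuditIssue {
  code: "gateway-port-mismatch"; // issue code named in the PR's test plan
  installedPort: number;
  expectedPort: number;
}

/** Return the port pinned in a service command's argv, or null if none is pinned. */
function parsePinnedPort(argv: string[]): number | null {
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i] ?? "";
    let raw: string | undefined;
    if (arg === "--port") {
      raw = argv[i + 1]; // space-separated form: --port 18888
    } else if (arg.startsWith("--port=")) {
      raw = arg.slice("--port=".length); // equals form: --port=18888
    }
    if (raw !== undefined && /^\d+$/.test(raw)) {
      const port = Number(raw);
      if (port >= 1 && port <= 65535) return port;
    }
  }
  return null;
}

/** Report a mismatch only when a port is pinned and differs from the resolved config port. */
function auditGatewayPort(argv: string[], expectedPort: number): AuditIssue | null {
  const installedPort = parsePinnedPort(argv);
  if (installedPort === null || installedPort === expectedPort) return null;
  return { code: "gateway-port-mismatch", installedPort, expectedPort };
}
```

With the values from the repro steps, a command pinned to `--port 18789` audited against a resolved port of 18888 would yield one issue, while matching ports yield none.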

Regression Test Plan (if applicable)

For bug fixes or regressions, name the smallest reliable test coverage that should catch this. Otherwise write N/A.

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • src/daemon/service-audit.test.ts
    • src/commands/doctor-gateway-services.test.ts
    • src/commands/doctor-gateway-daemon-flow.test.ts
  • Scenario the test should lock in:
    • Service audit reports gateway-port-mismatch when installed --port differs from resolved config port.
    • Doctor repair passes the expected port into service audit and reinstalls the service with the current port.
    • Doctor does not auto-install a Linux user-level service when a system-level OpenClaw gateway-like service exists.
  • Why this is the smallest reliable guardrail: These tests cover the exact audit and repair decision points without needing a real systemd install.
  • Existing test that already covers this (if any): None.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

  • openclaw doctor / service audit can now report and repair a gateway service whose command still pins an old port.
  • On Linux, doctor refuses to automatically install a user-level gateway service when it detects a system-level OpenClaw gateway-like service, and tells the user to inspect duplicates or mark service repair as externally managed.

Diagram (if applicable)

For UI changes or non-trivial logic flows, include a small ASCII diagram reviewers can scan quickly. Otherwise write N/A.

  Before:
  [gateway.port changes] -> [service keeps old --port] -> [restart uses old port]
  [user service missing + system service exists] -> [doctor installs user service] -> [duplicate supervisors]

  After:
  [gateway.port changes] -> [doctor detects port mismatch] -> [service metadata rewritten]
  [user service missing + system service exists] -> [doctor blocks auto-install] -> [user inspects/removes duplicate or sets external policy]
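
The second "After" path in the diagram amounts to a guard in front of the install step. A minimal sketch of such a guard, assuming service discovery already returns each unit's scope and whether it looks like an OpenClaw gateway; the `DiscoveredService` shape and the function name are hypothetical and not taken from the PR:

```typescript
// Hypothetical sketch of the split-brain install guard; types and names are assumptions.

type ServiceScope = "user" | "system";

interface DiscoveredService {
  name: string;
  scope: ServiceScope;
  looksLikeGateway: boolean; // assumed to be set by deep service discovery
}

interface InstallDecision {
  allowed: boolean;
  blockers: string[]; // system-level units that caused the block
}

/** Decide whether doctor may auto-install a user-level gateway service. */
function decideUserServiceInstall(
  platform: string,
  discovered: DiscoveredService[],
): InstallDecision {
  // The PR scopes the guard to the Linux install path only.
  if (platform !== "linux") return { allowed: true, blockers: [] };
  const blockers = discovered
    .filter((s) => s.scope === "system" && s.looksLikeGateway)
    .map((s) => s.name);
  return { allowed: blockers.length === 0, blockers };
}
```

When the decision is blocked, the PR text says doctor points the user to `openclaw gateway status --deep`, `openclaw doctor --deep`, and `OPENCLAW_SERVICE_REPAIR_POLICY=external` rather than installing anything.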

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS local dev environment; Linux behavior covered with mocked platform/service discovery tests.
  • Runtime/container: Node 22 / pnpm workspace.
  • Model/provider: N/A
  • Integration/channel (if any): Gateway service lifecycle.
  • Relevant config (redacted): gateway.port changed from installed service port.

Steps

  1. Install or mock a gateway service command pinned to --port 18789.
  2. Resolve current config port as 18888.
  3. Run doctor service audit/repair.
  4. For split-brain behavior, mock Linux with user service missing and a system-scope OpenClaw gateway-like service present.

Expected

  • Doctor reports port drift and rewrites the service with the current port.
  • Doctor does not install a second user-level service when a system-level OpenClaw gateway service exists.

Actual

  • Before this PR, port drift was not audited, and the missing user-service path could proceed to install even with a system-level service present.
  • After this PR, both cases are detected and guarded.

Evidence

Attach at least one:

  • Failing test/log before + passing after

  • Trace/log snippets

  • Screenshot/recording

  • Perf numbers (if relevant)

    Passing local verification:

    pnpm test src/daemon/service-audit.test.ts src/commands/doctor-gateway-services.test.ts src/commands/doctor-gateway-daemon-flow.test.ts
    pnpm check:changed
    pnpm build
    codex review --base origin/main

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • Focused tests for port parsing, port mismatch audit, doctor service repair using the current port, and split-brain install blocking.
    • Changed-lane validation with pnpm check:changed.
    • Build validation with pnpm build.
    • Local Codex review with codex review --base origin/main.
  • Edge cases checked:
    • --port 18888 and --port=18888 argument forms.
    • Matching service/config ports do not produce a mismatch.
    • Split-brain guard only blocks system-scope OpenClaw gateway-like services in the Linux install path.
  • What you did not verify:
    • A live systemd system-service install on a Linux host.
    • WSL2 ghost listener behavior, because it is intentionally out of scope.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A. Users with stale service metadata can run openclaw doctor --fix or openclaw gateway install --force.

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk: A user intentionally running a system-level OpenClaw gateway may expect doctor to also install a user-level service.
    • Mitigation: The new behavior only blocks automatic install in that duplicate-supervisor case and points users to openclaw gateway status --deep, openclaw doctor --deep, and OPENCLAW_SERVICE_REPAIR_POLICY=external.

Changed files

  • docs/gateway/doctor.md (modified, +4/-0)
  • docs/gateway/index.md (modified, +10/-0)
  • docs/gateway/troubleshooting.md (modified, +2/-0)
  • src/commands/doctor-gateway-daemon-flow.test.ts (modified, +34/-0)
  • src/commands/doctor-gateway-daemon-flow.ts (modified, +27/-0)
  • src/commands/doctor-gateway-services.test.ts (modified, +39/-0)
  • src/commands/doctor-gateway-services.ts (modified, +1/-0)
  • src/commands/doctor-service-audit.test-helpers.ts (modified, +1/-0)
  • src/daemon/service-audit.test.ts (modified, +46/-0)
  • src/daemon/service-audit.ts (modified, +46/-0)
  • src/plugins/cli-registry-loader.ts (modified, +27/-1)
  • src/plugins/cli.test.ts (modified, +87/-0)

Code Example

[gateway] starting HTTP server...
[canvas] host mounted at http://127.0.0.1:18789/__openclaw__/canvas/
[health-monitor] started

---

[gmail-watcher] gmail watcher stopped
Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789
listen EADDRINUSE: address already in use 127.0.0.1:18789

---

ExecStart=/usr/bin/node .../dist/index.js gateway --port 18789

Environment

  • Host: WSL2 Ubuntu on Windows (T14sGen1)
  • OpenClaw Version: 2026.4.24 (npm latest)
  • Node: 22.22.0
  • Kernel: linux 6.6.87.2-microsoft-standard-WSL2 x64
  • Gateway port: 18789
  • Service modes tested: Manual foreground, system-level systemd (/etc/systemd/system), and native user-level systemd (~/.config/systemd/user).

Main Symptoms

After upgrading to OpenClaw 2026.4.24, the gateway became entirely unreachable on WSL2 and entered a crash loop that repeated every 30 to 50 seconds.

Observed logs repeatedly showed:

[gateway] starting HTTP server...
[canvas] host mounted at http://127.0.0.1:18789/__openclaw__/canvas/
[health-monitor] started

However, high-frequency polling (lsof -nP -iTCP:18789 -sTCP:LISTEN) confirmed no process was actually listening on the port.

Between 30 and 50 seconds later, the gateway would self-terminate with:

[gmail-watcher] gmail watcher stopped
Gateway failed to start: another gateway instance is already listening on ws://127.0.0.1:18789
listen EADDRINUSE: address already in use 127.0.0.1:18789

The CLI (openclaw gateway probe) consistently failed with timeout, socket hang up, or read ECONNRESET.

Diagnostics & Findings

Finding 1: Systemd "Split-Brain" created by doctor --fix

OpenClaw 2026.4.24 strictly expects to manage a native user-level service (~/.config/systemd/user/openclaw-gateway.service).

If a user has a legacy manual system-level service (/etc/systemd/system/openclaw.service), running openclaw doctor --fix will silently generate the user-level unit. This creates a split-brain scenario where two systemd layers fight for the port.

Fix applied: Completely decommissioned the /etc/systemd manual units and isolated management to the native user-level service.

Finding 2: Stricter Schema Enforcement in 2026.4.24

The new version strictly lints openclaw.json. Old config keys like "adapters" or "paperclip", or manual additions like "system", trigger schema validation panics. Furthermore, setting "gateway": { "bind": "0.0.0.0" } is now flagged as a legacy alias (mapped to lan).
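
For illustration, removing the rejected keys before upgrading could look like the sketch below. It assumes the three keys named above sit at the top level of openclaw.json, which the report does not state explicitly; the function name is hypothetical, and the `bind` alias is deliberately left untouched.

```typescript
// Hypothetical sketch: strip keys the report says trigger schema validation failures.
// Assumption: "adapters", "paperclip", and "system" are top-level keys in openclaw.json.

const LEGACY_KEYS = ["adapters", "paperclip", "system"] as const;

interface SanitizeResult {
  cleaned: Record<string, unknown>;
  removed: string[];
}

/** Return a shallow copy of the config without the legacy keys, plus the keys that were dropped. */
function stripLegacyKeys(config: Record<string, unknown>): SanitizeResult {
  const cleaned: Record<string, unknown> = { ...config }; // do not mutate the caller's object
  const removed: string[] = [];
  for (const key of LEGACY_KEYS) {
    if (key in cleaned) {
      delete cleaned[key];
      removed.push(key);
    }
  }
  return { cleaned, removed };
}
```

Returning the list of removed keys makes it easy to log what changed before writing the file back.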

Finding 3: Port Shifts Ignore Config

We attempted to bypass the 18789 loop by updating openclaw.json to use "port": 18888. However, the auto-generated systemd unit hardcodes the port:

ExecStart=/usr/bin/node .../dist/index.js gateway --port 18789

This forced the service to ignore the JSON config and crash on the same port.
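
As a rough picture of what "rewriting the stale port" means at the string level, the sketch below swaps the pinned value in an ExecStart line. It is only an assumption about the effect: per the PR text, doctor reinstalls the service with the current port, and the function name here is hypothetical.

```typescript
// Hypothetical sketch: replace a pinned --port value in an ExecStart line.
// Handles both "--port 18789" and "--port=18789"; other text is left unchanged.

function rewritePinnedPort(execStart: string, port: number): string {
  return execStart.replace(
    /(--port)(\s+|=)\d+(?=\s|$)/g,
    (_match: string, flag: string, sep: string) => `${flag}${sep}${port}`,
  );
}

// Example using the unit line quoted in the report (the path is elided in the source).
const staleLine = "ExecStart=/usr/bin/node .../dist/index.js gateway --port 18789";
const repairedLine = rewritePinnedPort(staleLine, 18888);
console.log(repairedLine);
```

After editing a unit by hand, systemd still needs to reload it (for a user unit, `systemctl --user daemon-reload`) before a restart picks up the new port.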

Root Cause Analysis: The "Ghost" EADDRINUSE

The issue is a lifecycle/watchdog race condition specific to 2026.4.24 under WSL2 networking, not a standard port conflict.

  1. The Hang: The gateway process starts and mounts the canvas, but asynchronously hangs while trying to bind to the WSL2 loopback bridge. lsof remains empty.
  2. The Watchdog Timeout: After ~30 seconds, internal health monitors (like health-monitor or gmail-watcher) time out and initiate a teardown.
  3. The Self-Collision: The gateway attempts to internally restart/rebind while its original socket request is still trapped in the kernel as an unresolved/pending state. It trips over its own ghost socket, throws EADDRINUSE against itself, and exits.

Confirmed Resolution (Downgrade)

Because the internal watchdog race cannot be bypassed via config in 2026.4.24 without triggering strict schema panics, the only stable fix is downgrading to 2026.4.22.

Recovery Steps Taken:

  1. Stopped and physically removed all OpenClaw systemd units (both system and user) and killed all detached node processes.
  2. Downgraded via: npm install -g openclaw@2026.4.22
  3. Sanitized openclaw.json (removed adapters, paperclip, system keys to prevent legacy panics).
  4. Ran openclaw doctor --fix on the downgraded version to rebuild the native user service cleanly.

Result: Immediate stability. The gateway bound successfully, the crash loop stopped, and openclaw status --all reported Gateway: reachable 42ms.

Questions for Maintainers

Could the core team confirm if 2026.4.24 introduced undocumented regressions to any of the following?

  1. Gateway listener lifecycle or self-probe/watchdog logic (health-monitor timeouts).
  2. User-level systemd service generation (specifically hardcoding --port in ExecStart vs reading openclaw.json).
  3. gateway.bind strict schema changes and WSL2-specific loopback handling.
  4. gmail-watcher startup/shutdown blocking the main event loop.

The key confusing symptom is [canvas] host mounted alongside an empty lsof output, followed 30 seconds later by a self-inflicted EADDRINUSE crash. For now, WSL2 users should remain pinned to 2026.4.22.

extent analysis

TL;DR

Downgrade to OpenClaw version 2026.4.22 to resolve the gateway crash loop issue on WSL2.

Guidance

  1. Stop and remove all OpenClaw systemd units: Ensure no conflicting services are running by stopping and physically removing all OpenClaw systemd units (both system and user) and killing all detached node processes.
  2. Downgrade OpenClaw: Use npm install -g openclaw@2026.4.22 to downgrade to the stable version.
  3. Sanitize openclaw.json: Remove legacy keys like adapters, paperclip, and system to prevent schema validation panics.
  4. Rebuild the native user service: Run openclaw doctor --fix on the downgraded version to cleanly rebuild the native user service.

Example

No code snippet is necessary for this solution, as it involves downgrading and configuration changes rather than code modifications.

Notes

This solution is specific to WSL2 users and may not apply to other environments. The root cause is a lifecycle/watchdog race condition in OpenClaw version 2026.4.24, which cannot be bypassed via config without triggering strict schema panics.

Recommendation

Apply the workaround by downgrading to version 2026.4.22, as it is the confirmed stable fix for this issue. This is necessary due to the undocumented regressions introduced in version 2026.4.24 that cause the gateway crash loop on WSL2.
