openclaw - ✅ (Solved) Fix [Bug]: [macOS] findGatewayPidsOnPortSync drops all PIDs due to lsof p_comm vs argv[0] mismatch [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#70664
Fetched 2026-04-24 05:54:59
Comments: 0
Participants: 1
Timeline: 1
Reactions: 0
Timeline (top): cross-referenced ×1

On macOS, findGatewayPidsOnPortSync in src/infra/restart-stale-pids.ts never recognizes live openclaw gateway processes. parsePidsFromLsofOutput filters candidate PIDs by checking whether the lsof -Fpc command-name token contains "openclaw", but the openclaw gateway rewrites its argv[0] to "openclaw-gateway" after exec while the kernel's p_comm field (what lsof reports) stays "node". The filter therefore drops every real gateway PID, and cleanStaleGatewayProcessesSync silently no-ops even when there is a verifiable gateway listening on the port.

The result is a self-sustaining launchd KeepAlive respawn loop any time the managed PID changes: the new spawn can't detect the existing listener, fails EADDRINUSE, exits, and launchd spawns another one under ThrottleInterval. With the default plist written by openclaw doctor --repair (KeepAlive=true, RunAtLoad=true, ThrottleInterval=1), the cadence is one failed respawn every ~18 seconds, indefinitely.

Root Cause

In src/infra/restart-stale-pids.ts, parsePidsFromLsofOutput:

if (
  currentPid != null &&
  currentCmd &&
  normalizeLowercaseStringOrEmpty(currentCmd).includes("openclaw")
) {
  pids.push(currentPid);
}

currentCmd comes from lsof -Fpc output — the c field is the kernel p_comm / BSD_COMM field. On macOS (and Linux), that field is the basename of the exec'd binary as recorded at execve time, truncated to MAXCOMLEN. For a node-based gateway it is "node". The gateway sets its own process title via process.title = "openclaw-gateway" (or equivalent argv[0] rewrite) after startup — ps reads this rewritten argv[0] and prints "openclaw-gateway", but lsof does not; lsof stays with p_comm.

Because "node".includes("openclaw") is false, every real gateway PID is dropped. The outer filter pid !== process.pid is then irrelevant because the set is already empty.

Confirmed locally with a stock install:

$ /usr/sbin/lsof -nP -iTCP:18789 -sTCP:LISTEN -Fpc
p12345
cnode
f15

$ ps -p 12345 -o command=
openclaw-gateway
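
To make the failure concrete, here is a minimal sketch of a comm-based filter applied to the lsof output shown above. The function name parsePidsWithCommFilter and the parsing loop are assumptions written for illustration; only the filter condition mirrors the snippet quoted from restart-stale-pids.ts, and the real parser may differ in detail.

```typescript
// Hypothetical re-creation of a comm-based filter, for illustration only.
// Only the "includes('openclaw')" condition is taken from the issue text.
function parsePidsWithCommFilter(lsofOutput: string): number[] {
  const pids: number[] = [];
  let currentPid: number | undefined;
  let currentCmd: string | undefined;

  // Decide whether the process record collected so far should be kept.
  const flush = (): void => {
    if (
      currentPid != null &&
      currentCmd &&
      currentCmd.toLowerCase().includes("openclaw")
    ) {
      pids.push(currentPid);
    }
  };

  for (const line of lsofOutput.split("\n")) {
    if (line.startsWith("p")) {
      flush(); // a new process record begins
      currentPid = Number.parseInt(line.slice(1), 10);
      currentCmd = undefined;
    } else if (line.startsWith("c")) {
      currentCmd = line.slice(1); // the comm field, "node" for the gateway
    }
    // Other field lines (for example "f15") are ignored.
  }
  flush();
  return pids;
}

// The exact field lines from the lsof run above.
const lsofSample = "p12345\ncnode\nf15\n";
const keptPids = parsePidsWithCommFilter(lsofSample);
console.log(keptPids.length); // 0: the live gateway PID is dropped
```

The listener is found by lsof, but because the comm token is "node" the record never passes the filter, so the caller sees an empty list.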

The Windows code path in the same file (filterVerifiedWindowsGatewayPids / isGatewayArgv) does not rely on a comm filter — it inspects the full process argv via PowerShell and matches against openclaw entry-point patterns. That's the correct abstraction; the Unix path was missing the equivalent step.

Fix Action

Fix / Workaround

I have a working patch against current main. With it applied, gateway.err.log stopped growing, the line service-mode: cleared 1 stale gateway pid(s) began appearing correctly, and the EADDRINUSE respawn cycle was broken. Happy to open a PR.

PR fix notes

PR #70681: fix(infra/restart): verify gateway PIDs via ps argv on Unix, not lsof p_comm

Description (problem / solution / changelog)

Summary

Fixes #70664. On macOS, lsof -Fpc reports the kernel p_comm field (the exec'd binary basename — "node") rather than the rewritten argv[0] ("openclaw-gateway"). The .includes("openclaw") filter in parsePidsFromLsofOutput therefore silently dropped every real gateway PID, making cleanStaleGatewayProcessesSync no-op and pollPortOnceUnix report busy ports as {free: true}. With KeepAlive=true in the LaunchAgent plist this manifests as a self-sustaining launchd respawn loop.

Reproduction and full impact analysis in #70664.

Change

  • parsePidsFromLsofOutput — dropped the broken comm filter; now returns every listening PID. The self+ancestor exclusion from 8aadca4c3e is preserved (the invariant that cleanStaleGatewayProcessesSync must never terminate an ancestor of its caller is unchanged).
  • verifyGatewayPidByArgvSync(pid) — new Unix-side verifier. Calls ps -ww -p <pid> -o command= (which reads the rewritten argv[0]) and matches against "openclaw-gateway" plus common entry-file patterns (/dist/index.js gateway, /openclaw.mjs gateway, openclaw gateway). Returns false on any ps failure so we never mis-attribute a non-gateway process as a gateway.
  • findGatewayPidsOnPortSync (Unix path) — now runs lsof, then filters the returned PIDs through verifyGatewayPidByArgvSync. Symmetric with the Windows path's existing filterVerifiedWindowsGatewayPids + isGatewayArgv structure.
  • pollPortOnceUnix — both the status === 0 and the status === 1 + non-empty stdout branches that previously relied on the comm filter now apply the same verifier. Without this, a non-gateway listener on the port (caddy/nginx) would loop the port-free wait forever; with it, the port is reported as free-from-our-perspective and the downstream bind fails with a more informative error.
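
The two-stage structure described in the list above can be sketched as follows. This is an illustration under assumptions, not the code from the PR: parseListeningPids, filterVerifiedGatewayPids and the PidVerifier type are hypothetical names, and the verifier is passed in as a parameter so the sketch never spawns ps.

```typescript
// Hypothetical names throughout; the PR's real functions are
// parsePidsFromLsofOutput, verifyGatewayPidByArgvSync and findGatewayPidsOnPortSync.
type PidVerifier = (pid: number) => boolean;

// Stage 1: collect every listening PID from `lsof -Fpc`-style output,
// skipping PIDs the caller must never terminate (itself and its ancestors).
function parseListeningPids(
  lsofOutput: string,
  excluded: ReadonlySet<number>,
): number[] {
  const pids = new Set<number>();
  for (const line of lsofOutput.split("\n")) {
    if (!line.startsWith("p")) continue; // only "p<pid>" lines carry a PID
    const pid = Number.parseInt(line.slice(1), 10);
    if (Number.isInteger(pid) && pid > 0 && !excluded.has(pid)) {
      pids.add(pid);
    }
  }
  return Array.from(pids);
}

// Stage 2: keep only PIDs the verifier accepts. A throwing verifier counts
// as "not a gateway", so an unrelated process is never mis-attributed.
function filterVerifiedGatewayPids(
  pids: number[],
  verify: PidVerifier,
): number[] {
  return pids.filter((pid) => {
    try {
      return verify(pid);
    } catch {
      return false;
    }
  });
}

// Usage with a fake verifier: 100 is a gateway, 200 is some other listener,
// 300 stands in for the calling process.
const listening = parseListeningPids(
  "p100\ncnode\np200\ncnginx\np300\ncnode\n",
  new Set([300]),
);
const gateways = filterVerifiedGatewayPids(listening, (pid) => pid === 100);
console.log(listening, gateways);
```

Keeping classification out of the parse step means the parser has a single job, and the gateway-vs-other decision lives in one place that both the find path and the port-poll path can share.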

Tests

  • Added __testing.setVerifyGatewayPidByArgvOverride hook so tests can drive the Unix gateway-vs-other classification directly. Most existing tests use a single mockSpawnSync fake that can't distinguish lsof from ps invocations; the default bypass in beforeEach preserves legacy test semantics.
  • Replaced the obsolete test "parsePidsFromLsofOutput — branch coverage (lines 67-69) > skips a mid-loop entry when the command does not include 'openclaw'", which specifically asserted the removed comm filter's behavior, with a new test exercising the verifier-driven classification.
  • Updated the two findGatewayPidsOnPortSync tests that exercised non-gateway filtering ("excludes pids whose argv does not identify them as an openclaw gateway" and "returns [] when status 0 but only non-gateway pids present") to drive the verifier override so that they match the intended behavior.
  • Updated the one pollPortOnce test ("treats status 1 + non-gateway stdout as port-free") to use a pid-specific verifier override so the post-kill polling correctly sees the unrelated caddy listener as "free-from-our-perspective".
  • Updated restart.test.ts to reset the override hook and to install a verifier override for its "parses lsof output and filters non-openclaw/current processes" test.
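
For readers unfamiliar with the override-hook pattern the tests rely on, a minimal sketch follows. The names here (verifierOverride, defaultVerifier, verifyGatewayPid, testingHooks) are hypothetical stand-ins; the PR's actual hook is __testing.setVerifyGatewayPidByArgvOverride and its internals are not shown in this page.

```typescript
// Hypothetical sketch of a test-only override hook.
type PidVerifier = (pid: number) => boolean;

let verifierOverride: PidVerifier | undefined;

// Stand-in for the real ps-based check. It always answers false here so the
// sketch never spawns a process.
function defaultVerifier(_pid: number): boolean {
  return false;
}

// Production code calls this; tests can swap the behavior underneath it.
function verifyGatewayPid(pid: number): boolean {
  return (verifierOverride ?? defaultVerifier)(pid);
}

const testingHooks = {
  // Pass undefined to restore the default verifier.
  setVerifierOverride(fn: PidVerifier | undefined): void {
    verifierOverride = fn;
  },
};

// In a test: classify only PID 4242 as a gateway, then reset.
testingHooks.setVerifierOverride((pid) => pid === 4242);
const duringOverride = [verifyGatewayPid(4242), verifyGatewayPid(7)];
testingHooks.setVerifierOverride(undefined);
const afterReset = verifyGatewayPid(4242);
console.log(duringOverride, afterReset);
```

With a hook like this a test states the classification it wants per PID, instead of scripting the order of lsof and ps calls through a single spawnSync mock.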

Test plan

  • pnpm vitest run --project infra src/infra/restart-stale-pids.test.ts — 46 passed / 2 skipped (Windows-only)
  • pnpm vitest run --project infra src/infra/restart.test.ts src/infra/gateway-processes.test.ts — all green
  • pnpm vitest run --project cli src/cli/gateway-cli/ — all green (no regressions in consumers of findGatewayPidsOnPortSync)
  • node scripts/tsdown-build.mjs — clean build
  • Manual reproduction on macOS: before this patch, gateway.err.log grew with one Gateway failed to start: another gateway instance is already listening cycle every ~18 s. After this patch, service-mode: cleared N stale gateway pid(s) before bind on port 18789 appears correctly in /tmp/openclaw/openclaw-*.log when stale gateway PIDs are found, and the cleanup proceeds as intended.

What's deliberately out of scope

  • The PORT_FREE_TIMEOUT_MS bump (2 s → 30 s) that #70664 lists as "optional hardening". Discovered mid-test-update that the existing test suite assumes ≤ 2 s of port-free polling (beforeEach doesn't install a setDateNowOverride, so a longer timeout let some tests run long enough to hit OOM on a 4 GB V8 heap). Left as a separate follow-up — the core correctness fix in this PR is the comm filter.
  • A run.ts service-mode "yield-if-peer-exists" behavior change I tested locally: when launchd respawns the gateway but an openclaw-gateway is already listening, exit 0 quickly instead of kill-and-take-over. That's a behavior change (not just a bug fix) so I pulled it out to keep this PR purely corrective; happy to open a follow-up issue/PR if maintainers want to discuss it.

Notes for reviewers

  • Two tests that previously looked as if they should never reach spawnSync / ps now do reach it. That is expected: the corrected code path runs the verifier. The override hook keeps the tests deterministic without needing each test to track its lsof-vs-ps call order through a multi-call mock.
  • The implementation of verifyGatewayPidByArgvSync uses a ps call per PID. In the typical steady-state (a single listener on the port) this is one extra ~5 ms call per findGatewayPidsOnPortSync invocation — well inside the existing SPAWN_TIMEOUT_MS = 2000. If this ever gets called for many PIDs, a short-circuit on first non-gateway match could be added; current behavior is the safe default.

Changed files

  • src/infra/restart-stale-pids.test.ts (modified, +45/-18)
  • src/infra/restart-stale-pids.ts (modified, +100/-27)
  • src/infra/restart.test.ts (modified, +11/-0)

Bug type

Behavior bug (correctness) — silent no-op in a safety-critical cleanup path, observable as launchd respawn churn and log-file bloat.

Impact

  • Every macOS install running the gateway under launchd with KeepAlive=true is vulnerable. The bug is latent: the managed gateway itself keeps serving traffic, so functionality isn't visibly broken — the observable damage is respawn churn and log-file growth.
  • gateway.err.log grows quickly (a few MB/day under normal conditions, tens of MB/day once the loop is sustained). After a few days of uptime the file routinely hits tens to hundreds of MB, each EADDRINUSE cycle emitting ~7 lines.
  • The same filter is shared with pollPortOnceUnix, so waitForPortFreeSync returns {free: true} while the port is genuinely busy. This additionally masks the condition from any diagnostic that relies on that poll.
  • launchctl print reports runs=N, but N is misleadingly low because failed spawns that exit in under MinimumRuntime=10 don't increment the counter. The orphan loop is therefore invisible to a casual launchctl print check; you have to watch ps directly or notice the log growth.
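
The port-poll consequence in the list above can be illustrated with a small sketch. Everything here is hypothetical (the Listener shape, classifyPortPoll, the two classifier functions); it only shows how the same listener yields opposite answers depending on whether classification uses the comm token or the full command line.

```typescript
// Hypothetical data shape: one listener as seen by lsof (comm) and ps (command).
interface Listener {
  pid: number;
  comm: string; // what `lsof -Fc` reports
  command: string; // what `ps -o command=` reports
}

interface PortPollResult {
  free: boolean;
  blockingPids: number[];
}

// The port counts as "free from our perspective" when no listener is
// classified as a gateway.
function classifyPortPoll(
  listeners: Listener[],
  isGateway: (l: Listener) => boolean,
): PortPollResult {
  const blockingPids = listeners.filter(isGateway).map((l) => l.pid);
  return { free: blockingPids.length === 0, blockingPids };
}

const byComm = (l: Listener): boolean => l.comm.toLowerCase().includes("openclaw");
const byCommand = (l: Listener): boolean => l.command.includes("openclaw-gateway");

// The gateway from the issue: comm is "node", command is "openclaw-gateway".
const gatewayListener: Listener[] = [
  { pid: 12345, comm: "node", command: "openclaw-gateway" },
];

const oldResult = classifyPortPoll(gatewayListener, byComm);
const newResult = classifyPortPoll(gatewayListener, byCommand);
console.log(oldResult.free, newResult.free); // true false
```

The comm-based classifier reports the busy port as free, which is the masking effect described above; the command-based one reports it as blocked by PID 12345.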

Reproduction

  1. Install openclaw on macOS.

  2. openclaw doctor --repair to install the default LaunchAgent with KeepAlive=true, RunAtLoad=true, ThrottleInterval=1.

  3. openclaw gateway start.

  4. Trigger any scenario where launchd's currently-tracked gateway PID exits briefly (e.g., the gateway's own in-process restart path, a /restart command, a direct kill <launchd-tracked-pid> where the actual listening PID is a different orphan).

  5. Tail ~/.openclaw/logs/gateway.err.log. Every ~18 seconds a fresh spawn reports:

    [gateway] ⚠️  Gateway is binding to a non-loopback address. …
    Gateway failed to start: another gateway instance is already listening on ws://0.0.0.0:18789 | listen EADDRINUSE: address already in use 0.0.0.0:18789
    If the gateway is supervised, stop it with: openclaw gateway stop
    Port 18789 is already in use.
    - pid <N> <user>: openclaw-gateway (*:18789)
    - Gateway already running locally. …
  6. Tail ~/.openclaw/logs/gateway.log for service-mode: cleared N stale gateway pid(s) before bind on port. The line never appears, because cleanStaleGatewayProcessesSync silently returned [].

  7. Optional definitive proof: launchctl bootout gui/$UID/ai.openclaw.gateway for ~60s and observe that no new orphan spawns occur. Re-bootstrap; the cycle resumes. This rules out any non-launchd source.

Proposed fix

Remove the lsof-comm-based filter and add a ps-based argv verifier on Unix, mirroring the Windows path's structure.

  1. parsePidsFromLsofOutput returns every listening PID (minus process.pid). No gateway-vs-other classification at the parse layer.
  2. New verifyGatewayPidByArgvSync(pid) helper: runs ps -ww -p <pid> -o command= and matches the result against the openclaw-gateway argv[0] rewrite and common entry-file patterns (/dist/index.js gateway, /openclaw.mjs gateway, /openclaw gateway, openclaw_repo…gateway for dev-mode invocations).
  3. findGatewayPidsOnPortSync on Unix runs lsof, then filters the returned PIDs through verifyGatewayPidByArgvSync.
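
The matching step in item 2 could look roughly like the sketch below. The function name looksLikeGatewayCommand and the exact regular expressions are assumptions; they cover the patterns spelled out above and deliberately omit the dev-mode pattern, which is elided in the issue text.

```typescript
// Hypothetical matcher for the output of `ps -ww -p <pid> -o command=`.
// The regexes are illustrative and may be looser or stricter than the real ones.
const GATEWAY_COMMAND_PATTERNS: RegExp[] = [
  /(^|[\s/])openclaw-gateway(\s|$)/, // rewritten process title
  /\/dist\/index\.js\s+gateway(\s|$)/, // built entry file
  /\/openclaw\.mjs\s+gateway(\s|$)/, // module entry file
  /(^|[\s/])openclaw\s+gateway(\s|$)/, // CLI binary followed by "gateway"
];

function looksLikeGatewayCommand(commandLine: string): boolean {
  const normalized = commandLine.trim();
  if (normalized.length === 0) return false; // empty ps output: not a gateway
  return GATEWAY_COMMAND_PATTERNS.some((re) => re.test(normalized));
}

const samples: Array<[string, boolean]> = [
  ["openclaw-gateway", true],
  ["node /opt/app/dist/index.js gateway --port 18789", true],
  ["/usr/local/bin/openclaw gateway", true],
  ["caddy run --config /etc/caddy/Caddyfile", false],
  ["node", false],
];
for (const [command, expected] of samples) {
  console.log(command, looksLikeGatewayCommand(command) === expected);
}
```

Anchoring each pattern on a path separator or whitespace avoids matching a longer word that merely contains "openclaw", which a plain substring test would accept.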

Optional, separable hardening:

  • Bump PORT_FREE_TIMEOUT_MS from 2000 to 30000. Under load the kernel can take multiple seconds to release a TCP socket after SIGKILL (TIME_WAIT + teardown), and the current 2 s window doesn't leave useful slack.
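
A sketch of how the timeout interacts with the poll loop, assuming a simple deadline-based wait. The names (WaitOptions, waitForPortFree) and the injected clock are hypothetical and do not describe the real waitForPortFreeSync; the fake clock lets the example run instantly.

```typescript
// Hypothetical deadline-based wait with injectable clock and sleep.
interface WaitOptions {
  timeoutMs: number;
  intervalMs: number;
  isPortFree: () => boolean;
  now: () => number;
  sleep: (ms: number) => void;
}

function waitForPortFree(opts: WaitOptions): { free: boolean; waitedMs: number } {
  const start = opts.now();
  for (;;) {
    if (opts.isPortFree()) {
      return { free: true, waitedMs: opts.now() - start };
    }
    const elapsed = opts.now() - start;
    if (elapsed >= opts.timeoutMs) {
      return { free: false, waitedMs: elapsed };
    }
    // Never sleep past the deadline.
    opts.sleep(Math.min(opts.intervalMs, opts.timeoutMs - elapsed));
  }
}

// Simulate a socket that the kernel releases 5 s after the kill.
function simulate(timeoutMs: number): { free: boolean; waitedMs: number } {
  let clock = 0;
  return waitForPortFree({
    timeoutMs,
    intervalMs: 50,
    isPortFree: () => clock >= 5000,
    now: () => clock,
    sleep: (ms) => {
      clock += ms; // advance the fake clock instead of blocking
    },
  });
}

const shortWait = simulate(2000);
const longWait = simulate(30000);
console.log(shortWait, longWait);
```

With a 2000 ms budget the simulated wait gives up while the port is still held; with 30000 ms it succeeds after 5000 ms. The flip side is that a port which never frees is polled for the whole budget, so a longer timeout lengthens every such wait unless the clock is controlled.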

I have a working patch against current main. With it applied, gateway.err.log stopped growing, the line service-mode: cleared 1 stale gateway pid(s) began appearing correctly, and the EADDRINUSE respawn cycle was broken. Happy to open a PR.

Environment

  • macOS (Apple Silicon, Darwin 25.x)
  • openclaw 2026.4.x
  • Node 25.x
  • Default LaunchAgent as written by openclaw doctor --repair

Related (checked before filing)

  • #39222 — CLI can leave orphan openclaw-gateway processes on Linux/systemd. Same symptom class (orphan gateway, port conflict), different platform, different root cause (CLI vs systemd race, not lsof comm filter). Fix is complementary.
  • #68153 — Telegram channel health-monitor EADDRINUSE loop. Same error surface, triggered by channel-health-monitor restart cascading into a full gateway bind, not launchd KeepAlive.
  • #22169 — msteams provider EADDRINUSE restart loop. Already fixed, same error class.
  • #44881 — Gateway holds port after npm upgrade. Upgrade-triggered, different root cause.
  • #62276 — Port 18789 is in use on OpenClaw 2026.4.5 startup. Generic.
  • PR #61948 — fix: stale pid cleaner race condition (SIGTERM cascade). Adjacent fix in the same file, different bug (selfPid vs callerPid passthrough). Complementary — the PR's fix is load-bearing once the comm-filter bug is fixed, because real gateway PIDs will start being returned.
  • PR #66627 — Windows-specific split of diagnostic vs verified stale PIDs. Same architectural direction as the proposed fix here, Windows-only.
  • commit 8aadca4c3e (fix(infra/restart): exclude ancestor pids from stale-gateway cleanup) — already in main. Extends the self-pid filter to self+ancestors. Complementary — it assumes findGatewayPidsOnPortSync returns correct PIDs; on macOS, today, it doesn't.

This bug is orthogonal to all of the above. It's been silently present on every macOS install since the comm-filter was introduced, and no issue on the tracker names it. Filing this to get it on the record.

extent analysis

TL;DR

The proposed fix involves removing the lsof-comm-based filter and adding a ps-based argv verifier on Unix to correctly identify openclaw gateway processes.

Guidance

  • Remove the currentCmd.includes("openclaw") filter in parsePidsFromLsofOutput to prevent dropping real gateway PIDs.
  • Implement a new verifyGatewayPidByArgvSync(pid) helper to verify the argv of a given PID against openclaw-gateway patterns.
  • Update findGatewayPidsOnPortSync to use the new verifyGatewayPidByArgvSync helper for filtering PIDs on Unix.
  • Consider increasing PORT_FREE_TIMEOUT_MS to 30000 to account for potential delays in TCP socket release.

Example

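// Sketch of verifyGatewayPidByArgvSync (illustrative; the PR description lists more patterns)
import { spawnSync } from "node:child_process";

function verifyGatewayPidByArgvSync(pid: number): boolean {
  // ps prints the rewritten argv[0]; -ww requests untruncated output
  const res = spawnSync("ps", ["-ww", "-p", String(pid), "-o", "command="], { encoding: "utf8" });
  if (res.error || res.status !== 0) return false; // treat any ps failure as "not a gateway"
  const argv = res.stdout.trim();
  return argv.includes("openclaw-gateway") || argv.includes("/dist/index.js gateway");
}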

Notes

The proposed fix assumes that the ps command will accurately report the argv of the openclaw gateway process. If this is not the case, additional debugging may be necessary.

Recommendation

Apply the proposed workaround by removing the lsof-comm-based filter and adding the ps-based argv verifier on Unix, as this directly addresses the root cause of the issue and has been confirmed to work in a local test environment.
