openclaw - 💡(How to fix) Fix Gateway leak triad on plugin restart: Manifest EADDRINUSE retry loop, signal-handler accumulation, sync I/O on session JSONL → WS handshake starvation [5 comments, 2 participants]

GitHub stats: openclaw/openclaw#73655, fetched 2026-04-29 06:16:53. Comments: 5 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): commented ×5, cross-referenced ×1, mentioned ×1, subscribed ×1

Summary

After a normal post-update gateway restart on 2026.4.26 (homebrew install, gateway PID 39890), three independent leaks compounded over ~10 minutes and starved the WS upgrade handler. Result: every WebSocket open from a client (Paperclip's adapter, in our case) timed out at 26s with closed before connect / handshake timeout. HTTP /healthz continued to return 200 in <500ms, so naive port-listening checks reported healthy — only WS upgrades were affected.

launchctl kickstart -k cleared all three. Steady-state post-restart was clean (zero leak warnings) — but the leaks are still in the binary, and we expect them to recur the next time the gateway restarts under load.

The three leaks

1. Manifest plugin EADDRINUSE retry loop on 127.0.0.1:2099 / :2100

lsof -nP -iTCP:2099 -iTCP:2100 showed both bound to the same gateway PID (39890). The embedded Manifest plugin ([Manifest] Loading embedded server...) kept attempting app.listen(2099/2100) and emitting:

[Manifest] ERROR [NestApplication] Error: listen EADDRINUSE: address already in use 127.0.0.1:2099
[Manifest] ERROR [NestApplication] Error: listen EADDRINUSE: address already in use 127.0.0.1:2100

…every few seconds, alongside [🦚 Manifest] Reusing existing server. Open the dashboard to connect a provider and start routing. It looks like the gateway parent and the Manifest plugin's bootstrap path race on the same port, and the plugin's retry runs Nest's full module-init synchronously on each attempt.
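
A minimal sketch of the "bind once, never retry" idea, assuming plain Node `net` rather than Nest's `app.listen()` (the issue does not show the plugin's real bootstrap code). `listenOnce` is a hypothetical helper name: it treats EADDRINUSE as a terminal "already listening" result instead of a reason to loop.

```javascript
const net = require("node:net");

// Hypothetical helper, not part of OpenClaw or Manifest.
// Tries to bind exactly once and reports the outcome instead of retrying.
function listenOnce(port, host = "127.0.0.1") {
  return new Promise((resolve, reject) => {
    const server = net.createServer();
    server.once("error", (err) => {
      if (err.code === "EADDRINUSE") {
        // Something (possibly this same process) already owns the port:
        // short-circuit rather than re-running any expensive init.
        resolve({ status: "already-listening", server: null });
      } else {
        reject(err);
      }
    });
    server.listen(port, host, () => resolve({ status: "listening", server }));
  });
}
```

A caller would branch on `status` and skip module initialisation entirely when the result is "already-listening".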

2. Signal-handler listener leak

(node:39890) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
  11 SIGINT listeners added to [process]. MaxListeners is 10. ...
  11 SIGTERM listeners added to [process]. ...
  11 SIGQUIT / SIGABRT / SIGHUP / SIGTRAP / SIGUSR2 / SIGILL / SIGBUS / SIGFPE / SIGSEGV listeners ...

(Verbatim from gateway.err.log, captured 09:33:56–09:38:40.) Each restarted plugin instance attaches its own signal handlers without removing prior ones.
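
The accumulation can be avoided by tracking what each plugin attached and removing it on re-initialisation. The following is a sketch only: the registry, the function names, and the plugin id are hypothetical, and it relies solely on the standard `process.on` / `process.removeListener` calls.

```javascript
// Hypothetical registry: plugin id -> the signals and handler it attached.
const registered = new Map();

function attachSignalHandlers(pluginId, signals, handler) {
  // Drop whatever a previous init of this plugin attached, so the
  // listener count per signal stays at one per plugin.
  detachSignalHandlers(pluginId);
  for (const sig of signals) process.on(sig, handler);
  registered.set(pluginId, { signals, handler });
}

function detachSignalHandlers(pluginId) {
  const prev = registered.get(pluginId);
  if (!prev) return;
  for (const sig of prev.signals) process.removeListener(sig, prev.handler);
  registered.delete(pluginId);
}
```

Calling `attachSignalHandlers("manifest", ["SIGINT", "SIGTERM"], onShutdown)` repeatedly then leaves exactly one listener per signal for that plugin, however many times it restarts.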

3. Stuck sessions + 30s+ session-write-lock holds doing sync I/O

[diagnostic] stuck session: sessionId=polymarket-trader sessionKey=agent:polymarket-trader:cron:0128eb86-1a5d-4226-824c-81b1228760f7:run:4fe385b1-3dda-45a2-a690-2914e08e56c0 state=processing age=173s queueDepth=0
[session-write-lock] releasing lock held for 30445ms (max=15000ms): /Users/.openclaw/agents/coding-helper/sessions/sessions.json.lock

Sync I/O on session JSONL files held the event loop for 30+ seconds at a time, well past the lock's max=15000ms.
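
As an illustration of the async alternative, the sketch below appends JSONL records through `fs/promises` and chains writes per file so ordering is preserved without blocking the loop. `appendJsonl` and the per-file queue are assumptions for illustration; the issue does not show how OpenClaw actually writes session files or takes its lock.

```javascript
const fs = require("node:fs/promises");

// Hypothetical per-file queue: each append waits for the previous one,
// so records stay ordered while the event loop remains free between writes.
const tails = new Map();

function appendJsonl(file, record) {
  const prev = tails.get(file) ?? Promise.resolve();
  const next = prev
    .catch(() => {}) // a failed earlier write must not block later ones
    .then(() => fs.appendFile(file, JSON.stringify(record) + "\n"));
  tails.set(file, next);
  return next; // callers can await this to learn when their record landed
}
```

Because each write yields back to the event loop, other work such as a WS upgrade can run between appends instead of waiting for the whole batch.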

Compounded effect

The WS upgrade handler couldn't run inside the 26s budget the client was willing to wait. TCP accept worked, HTTP /healthz worked (lighter codepath), but:

[ws] handshake timeout conn=06ade20f-...
[ws] closed before connect ... code=1006 reason=n/a
[ws] closed before connect ... code=1000 reason=paperclip-complete

(That paperclip-complete reason is the client adapter giving up at its 26s timeout.) Once the gateway was kickstarted, all three leak symptoms went to zero post-restart and cron.list round-trips returned in 67–215ms.
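
Since /healthz stayed green throughout, a health check that exercises the upgrade path itself would have caught this. Below is a rough sketch of such a probe using only Node's `http` module. `probeWsUpgrade` is a hypothetical name, the host, port, and path are placeholders (the issue does not state the gateway's WS endpoint), and the probe checks only for an HTTP 101 response, not the Sec-WebSocket-Accept value.

```javascript
const http = require("node:http");
const crypto = require("node:crypto");

// Hypothetical probe: resolves true only if the server answers the
// WebSocket upgrade request with HTTP 101 within timeoutMs.
function probeWsUpgrade({ host, port, path = "/", timeoutMs = 5000 }) {
  return new Promise((resolve) => {
    const req = http.request({
      host, port, path,
      agent: false, // one-off connection, no keep-alive pooling
      headers: {
        Connection: "Upgrade",
        Upgrade: "websocket",
        "Sec-WebSocket-Version": "13",
        "Sec-WebSocket-Key": crypto.randomBytes(16).toString("base64"),
      },
    });
    const timer = setTimeout(() => { req.destroy(); resolve(false); }, timeoutMs);
    const finish = (ok) => { clearTimeout(timer); resolve(ok); };
    req.on("upgrade", (res, socket) => {
      socket.destroy(); // we only care that the handshake completed
      finish(res.statusCode === 101);
    });
    req.on("response", (res) => { res.resume(); finish(false); });
    req.on("error", () => finish(false));
    req.end();
  });
}
```

A monitor could run this alongside the existing HTTP check and alert when the HTTP check passes but the probe returns false, which is the exact split seen in this incident.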

Reproduce

Hard to reproduce on demand because it's an emergent leak — but easy to test for:

  • Run gateway under load for 15+ minutes with the Manifest plugin enabled and at least one cron-driven session (so session locks get exercised).
  • Watch listener counts (process.listenerCount("SIGINT")) over time.
  • Watch lsof -a -nP -i :2099 -i :2100 -p <gateway-pid> (the -a flag ANDs the port and PID filters, which lsof otherwise ORs). If the parent already owns the port, every Manifest restart pass logs an EADDRINUSE and runs Nest module-init.
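
The listener-count check above can be automated from inside the process. This is a sketch under assumptions: `startLeakWatch` is a hypothetical helper, and it pairs `process.listenerCount` with `perf_hooks.monitorEventLoopDelay` so that long sync I/O stalls show up in the same sample.

```javascript
const { monitorEventLoopDelay } = require("node:perf_hooks");

// Hypothetical sampler: reports signal listener counts and the worst
// event-loop delay observed since the previous sample.
function startLeakWatch({
  signals = ["SIGINT", "SIGTERM"],
  intervalMs = 10000,
  log = console.log,
} = {}) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const sample = () => {
    const counts = Object.fromEntries(
      signals.map((s) => [s, process.listenerCount(s)])
    );
    const maxDelayMs = histogram.max / 1e6; // histogram values are nanoseconds
    histogram.reset();
    return { counts, maxDelayMs };
  };

  const timer = setInterval(() => log(JSON.stringify(sample())), intervalMs);
  timer.unref(); // do not keep the process alive just for sampling

  return {
    sample,
    stop: () => { clearInterval(timer); histogram.disable(); },
  };
}
```

Listener counts that climb with every plugin restart, or a max delay in the tens of seconds, would correspond to leaks 2 and 3 respectively.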

Suggested fixes

  • Manifest plugin port collision: don't retry app.listen() if the parent already owns the port; either reuse the existing socket or short-circuit to "already listening".
  • Signal handlers: when a plugin re-initialises, remove its previous handlers before attaching new ones. (Centralised registry would also help.)
  • Session JSONL writes: move to async I/O or a dedicated worker thread; the 30s+ block on sessions.json.lock is plenty long to starve other event-loop work.

The MaxListenersExceededWarning is the loudest signal: it makes the leak visible without a debugger.

Environment

  • macOS 26.4.1 / arm64, Node v24.13.1
  • OpenClaw 2026.4.26 (be8c246), homebrew install
  • Plugins active during incident: Manifest (clawrouter), mem9, feishu, whatsapp, …

extent analysis

TL;DR

Implement fixes for the three identified leaks: Manifest plugin port collision, signal-handler listener leak, and stuck sessions with long session-write-lock holds.

Guidance

  • Address the Manifest plugin port collision by reusing the existing socket or short-circuiting to "already listening" if the parent already owns the port.
  • Remove previous signal handlers before attaching new ones when a plugin re-initializes to prevent the signal-handler listener leak.
  • Move session JSONL writes to async I/O or a dedicated worker thread to prevent the 30s+ block on sessions.json.lock.

Example

The issue itself includes no code snippet; it provides only log excerpts and prose descriptions of the suggested fixes.

Notes

The provided information suggests that the leaks are caused by specific issues with the Manifest plugin, signal handlers, and session JSONL writes. Addressing these issues should help prevent the compounded effect that led to the WebSocket upgrade handler timeouts.

Recommendation

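Until fixes land, restarting the gateway with launchctl kickstart -k clears all three leak symptoms, although the issue expects them to recur on the next restart under load. The durable fix is the set of code changes listed under Guidance, in the Manifest plugin's port handling, signal-handler registration, and session write path.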
