openclaw: Gateway leaks MCP child process trees on session end (mcp-server-filesystem, mcp-server-github): ~70 MB RSS each, accumulates 50+/day [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#74774 · Fetched 2026-05-01 05:41:30
Comments: 1 · Participants: 2 · Timeline: 2 · Reactions: 2
Timeline (top): closed ×1, commented ×1


Summary

The gateway spawns mcp-server-{filesystem,github} (and similar npm-distributed MCP servers) as child process trees per agent session. When an agent session ends, the child process tree is never reaped. Each leaked tree is npm exec → sh -c → node mcp-server-* and holds ~70 MB RSS. Over hours of normal multi-agent activity the leaked MCP tree count climbs from ~6 to 300+, eating multiple GB of RAM and eventually starving the host.

Reproduction

A multi-agent setup with frequent subagent spawns:

# steady-state baseline
$ pgrep -f 'mcp-server-(filesystem|github)' | wc -l
6

# after ~24h of normal activity (subagent spawns, council/ultraplan, etc.)
$ pgrep -f 'mcp-server-(filesystem|github)' | wc -l
326

$ ps -o pid,ppid,etimes,sid,cmd -p $(pgrep -f 'mcp-server-(filesystem|github)') | head
   PID    PPID ELAPSED     SID CMD
 99484   99466   23044     408 sh -c mcp-server-filesystem ...
 99485   99484   23044     408 node .../mcp-server-filesystem ...
 99508   99496   23041     408 sh -c mcp-server-github
 99509   99508   23041     408 node .../mcp-server-github
...

Every leaked tree:

  • has SID = the gateway's session id (so it survives gateway restart-detection logic if any)
  • has PPID = an npm exec @modelcontextprotocol/server-* process whose parent IS the gateway (PID 416 in this trace)
  • holds a unix socketpair where the gateway-side fd was closed but the gateway never sent SIGTERM/KILL to the child
$ ps -o pid,ppid,cmd -p 99466
   PID    PPID CMD
 99466     416 npm exec @modelcontextprotocol/server-filesystem /home/manuel/...
$ ps -o pid,cmd -p 416
   PID CMD
   416 openclaw-gateway

So the gateway is the direct parent of the npm-exec, and ought to be in a position to kill it.
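
Because each leaked tree shows up as several processes, it can help to count trees rather than PIDs. The following is a minimal sketch, assuming rows shaped like the `ps -o pid,ppid,cmd` output above; `countMcpTrees` and the row objects are hypothetical names, and the command strings are abbreviated from the trace.

```javascript
// Hypothetical helper: group MCP processes into trees rooted at direct
// children of the gateway, using rows shaped like `ps -o pid,ppid,cmd` output.
function countMcpTrees(rows, gatewayPid) {
  const parentOf = new Map(rows.map((r) => [r.pid, r.ppid]));
  const roots = new Set();
  for (const row of rows) {
    if (!/mcp-server-(filesystem|github)/.test(row.cmd)) continue;
    // Walk up until we reach a process whose parent is the gateway.
    let pid = row.pid;
    const seen = new Set(); // guards against cycles in malformed input
    while (parentOf.has(pid) && parentOf.get(pid) !== gatewayPid && !seen.has(pid)) {
      seen.add(pid);
      pid = parentOf.get(pid);
    }
    if (parentOf.get(pid) === gatewayPid) roots.add(pid);
  }
  return roots.size;
}

// Rows abbreviated from the filesystem tree in the trace above.
const rows = [
  { pid: 99466, ppid: 416, cmd: 'npm exec @modelcontextprotocol/server-filesystem' },
  { pid: 99484, ppid: 99466, cmd: 'sh -c mcp-server-filesystem' },
  { pid: 99485, ppid: 99484, cmd: 'node mcp-server-filesystem' },
];
console.log(countMcpTrees(rows, 416)); // two rows match the pattern, but they form one tree
```

Counting by root also gives the PID of each `npm exec` process, which is the process the gateway spawned directly.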

Why the children don't self-exit on EOF

The npm-distributed MCP servers (e.g. @modelcontextprotocol/server-filesystem) are Node.js processes listening on process.stdin for JSON-RPC. When the gateway closes the parent end of the socketpair, the child's stdin gets EOF. Node does not treat stdin EOF as a reason to exit: the process ends only when nothing else keeps its event loop alive or when it exits explicitly, so the child keeps running with no work to do. The child has no built-in "if my creator hung up, exit" logic; that's the parent's job.

Side-effect: cleanup-script difficulty

I attempted three workspace-side cleanup strategies, all of which had subtle correctness bugs:

  1. v1 — match by cmdline + keep newest: unsafe under multi-agent activity, kills live MCPs of older sessions.
  2. v2 — pipe-peer liveness check: unsafe — these MCPs use socketpairs, not pipes, and the /proc/net/unix state stays CONNECTED (St=03) because the gateway side keeps the peer fd open even after the subagent finishes.
  3. v3 — age + last-subagent-activity heuristic: killed the MCPs the calling orchestrator's own tool client was bound to, breaking that session's filesystem/github tools mid-task. The gateway maintains a pool that's not 1:1 with sessions, so age alone doesn't identify abandoned children.

Net: there is no safe way to clean this up from outside the gateway. Only the gateway knows which MCP children are still bound to live sessions.

Suggested fixes

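Option A (preferred): child-process-group teardown on session end. When an agent session terminates (whether main or subagent), the gateway already runs cleanup hooks. Extend the MCP plugin/runtime to kill the child process group of every MCP it spawned for that session: process.kill(-pgid, 'SIGTERM'), then SIGKILL after a short grace period. This is the standard pattern, but it only works if each MCP tree is spawned as the leader of its own process group (in Node, spawn with detached: true). In the trace above the children share the gateway's SID, so they may also share its process group, in which case a negative-PID kill would signal the gateway as well.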
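
One way the per-session teardown could be structured is sketched below. It is an illustration under stated assumptions, not the gateway's actual code: `McpSessionRegistry`, `spawnFor`, `track` and `endSession` are hypothetical names, and the sketch assumes a POSIX host where a negative PID addresses a process group.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical registry: sessionId -> set of process-group ids (PGIDs).
class McpSessionRegistry {
  constructor(kill = (pid, signal) => process.kill(pid, signal), graceMs = 5000) {
    this.kill = kill;       // injectable so the logic can be exercised without real processes
    this.graceMs = graceMs; // how long to wait between SIGTERM and SIGKILL
    this.groups = new Map();
  }

  // Spawn an MCP server as the leader of its own process group.
  spawnFor(sessionId, command, args) {
    const child = spawn(command, args, {
      detached: true,                     // new process group; child.pid is also the PGID
      stdio: ['pipe', 'pipe', 'inherit'],
    });
    if (child.pid !== undefined) this.track(sessionId, child.pid);
    return child;
  }

  track(sessionId, pgid) {
    if (!this.groups.has(sessionId)) this.groups.set(sessionId, new Set());
    this.groups.get(sessionId).add(pgid);
  }

  // Intended to be called from the session-end cleanup hook.
  endSession(sessionId) {
    const pgids = [...(this.groups.get(sessionId) ?? [])];
    this.groups.delete(sessionId);
    for (const pgid of pgids) this.signalGroup(pgid, 'SIGTERM');
    const timer = setTimeout(() => {
      for (const pgid of pgids) this.signalGroup(pgid, 'SIGKILL');
    }, this.graceMs);
    timer.unref(); // do not keep the gateway alive just for the escalation timer
    return pgids;
  }

  signalGroup(pgid, signal) {
    try {
      this.kill(-pgid, signal); // negative PID signals every process in the group
    } catch (err) {
      if (err.code !== 'ESRCH') throw err; // ESRCH: the group has already exited
    }
  }
}
```

Signalling the group rather than the `npm exec` PID alone matters here, because the `sh -c` and `node` descendants would otherwise survive their parent.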

Option B — pool the MCP children by config, not by session. Maintain one filesystem MCP and one github MCP per (workspace, env) pair, share across all sessions, and only tear them down when the gateway shuts down. Trade-off: shared MCP processes are slightly more complex to manage (per-call sandboxing, per-session permissions) but eliminate the leak.
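
The pooling idea can be sketched as a map keyed by configuration. This is a rough illustration only: `McpPool`, `keyFor`, and the `startServer` factory (assumed to return a handle with a `stop()` method) are hypothetical, and the per-call sandboxing and per-session permissions mentioned above are left out.

```javascript
// Hypothetical pool: one MCP instance per (server, workspace, env) key.
class McpPool {
  constructor(startServer) {
    this.startServer = startServer; // assumed factory: (config) => handle with a stop() method
    this.instances = new Map();
  }

  static keyFor({ server, workspace, env = {} }) {
    // Sort env entries so that property order does not create distinct pool entries.
    const sortedEnv = Object.entries(env).sort(([a], [b]) => a.localeCompare(b));
    return JSON.stringify([server, workspace, sortedEnv]);
  }

  // Every session asking for the same configuration gets the same instance.
  acquire(config) {
    const key = McpPool.keyFor(config);
    if (!this.instances.has(key)) this.instances.set(key, this.startServer(config));
    return this.instances.get(key);
  }

  // Called once, when the gateway itself shuts down.
  shutdown() {
    for (const instance of this.instances.values()) instance.stop();
    this.instances.clear();
  }
}
```

With this shape the number of MCP processes is bounded by the number of distinct configurations rather than by the number of sessions.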

Option C — ask the children to exit. The MCP servers could be patched to exit on stdin EOF, or the gateway could send a JSON-RPC shutdown request before closing the socket. This requires upstream changes to the @modelcontextprotocol/server-* packages and is the least clean option.
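
The child-side half of Option C could be as small as the sketch below. It is an illustration only, not code from the @modelcontextprotocol/server-* packages; `exitOnParentHangup` and its parameters are hypothetical names.

```javascript
// Hypothetical helper for an MCP server process: exit once the parent hangs up.
// Note: 'end' is only emitted while stdin is being read, which a stdio
// JSON-RPC server already does in order to receive requests.
function exitOnParentHangup(stdin = process.stdin, exit = () => process.exit(0)) {
  let done = false;
  const onHangup = () => {
    if (done) return; // 'end' and 'close' can both fire; exit only once
    done = true;
    exit();
  };
  stdin.once('end', onHangup);   // EOF: the parent closed its end of the socketpair
  stdin.once('close', onHangup); // the underlying handle was destroyed
}

// Usage inside a server's startup code:
//   exitOnParentHangup();
```

Even with such a patch, the gateway would still need a fallback for servers that do not include it, which is part of why this option depends on upstream changes.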

Option A is the clear winner: smallest scope, no protocol changes, fixes the exact bug, and matches the explicit intent of "this MCP belongs to that session."

Detection (for users on affected versions)

Quick check:

# normal steady state ≈ 2 × N (where N = number of MCP plugins enabled, typically 3-4)
# > 50 = leak almost certainly accumulating
ps aux | grep -E 'mcp-server-(filesystem|github)' | grep -v grep | wc -l

If RAM pressure builds up, the only safe reclaim is openclaw gateway restart (which kills all gateway children including the leaked MCPs) or rebooting. Manual kill of the MCP process trees can break in-flight tool calls in any active session because of the pool-sharing described above.
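
For periodic monitoring, the quick check can be wrapped in a small Node script. The sketch below assumes `pgrep` is available on the host; `countMcpProcesses` and `classifyMcpCount` are hypothetical names, and the middle "elevated" band is an assumption added for illustration (the issue only gives the 2 × N baseline and the > 50 threshold).

```javascript
const { execFileSync } = require('node:child_process');

// Count processes whose command line matches the MCP server pattern.
function countMcpProcesses() {
  try {
    const out = execFileSync('pgrep', ['-f', 'mcp-server-(filesystem|github)'], { encoding: 'utf8' });
    return out.split('\n').filter(Boolean).length;
  } catch (err) {
    if (err.status === 1) return 0; // pgrep exits with status 1 when nothing matches
    throw err;
  }
}

// Thresholds from the quick check: about 2 × N is normal, above 50 suggests a leak.
// The 'elevated' band in between is an assumption, not part of the original check.
function classifyMcpCount(count, pluginCount) {
  if (count > 50) return 'leaking';
  if (count <= 2 * pluginCount) return 'normal';
  return 'elevated';
}

// Usage (on a host with pgrep installed):
//   console.log(classifyMcpCount(countMcpProcesses(), 3));
```

This only reports the count; it deliberately does not kill anything, for the pool-sharing reason given above.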

Impact

  • ~70 MB RSS per leaked MCP tree, both filesystem and github variants
  • 50+ MCP trees per day under typical multi-agent workloads
  • ~4 GB RAM consumed per 24h of activity in my workspace
  • Eventually OOM territory on smaller hosts; mine has 32 GB and was hitting <2 GB available within ~3 days

Mitigations users have to live with right now: periodic openclaw gateway restart, larger swap, or more RAM. None are great.

Environment

  • OpenClaw installed via npm: openclaw package in ~/.nvm/versions/node/v22.22.2/lib/node_modules/openclaw
  • Node v22.22.2
  • Linux 6.6.87.2-microsoft-standard-WSL2 (x64)
  • MCP plugins active: @modelcontextprotocol/server-filesystem, @modelcontextprotocol/server-github
  • Multi-agent setup with worker subagent (subagent runtime, lightContext spawns at typical rate of 5-15/hour)

Related

  • #74448 — separate bug, also subagent-lifecycle related: agent.wait resolves on session-compaction (not actual task completion). Same general theme of subagent termination signals not being reliable.

extent analysis

TL;DR

The most likely fix is to implement child-process-group teardown on session end, using process.kill(-pgid, 'SIGTERM') followed by SIGKILL after a short grace period.

Guidance

  • Identify the agent session termination points in the gateway code and extend the MCP plugin/runtime to kill the child process group of every MCP it spawned for that session.
  • Consider implementing a pool-based approach for MCP children, where one filesystem MCP and one github MCP are maintained per (workspace, env) pair, and only torn down when the gateway shuts down.
  • Verify the fix by monitoring the number of leaked MCP trees over time using the provided ps command and checking for RAM consumption.
  • If unsure about the implementation, start by logging the process groups and session terminations to understand the workflow and identify potential issues.

Example

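// Example of killing a child process group.
// Assumes the MCP child was spawned with { detached: true }, so its PID doubles as its PGID.
function killProcessGroup(pgid, graceMs = 5000) {
  const signalGroup = (signal) => {
    try { process.kill(-pgid, signal); } // negative PID signals the whole group
    catch (err) { if (err.code !== 'ESRCH') throw err; } // ESRCH: group already gone
  };
  signalGroup('SIGTERM');
  setTimeout(() => signalGroup('SIGKILL'), graceMs); // adjust the grace period as needed
}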

Notes

The provided ps commands can be used to detect and verify the leak, but may require adjustments based on the specific system and workflow. The example code snippet is a basic illustration and may need to be adapted to the actual implementation.

Recommendation

Apply the child-process-group teardown on session end (Option A) as it is the most straightforward and effective solution, requiring minimal changes to the existing codebase.
