openclaw - 💡(How to fix) Fix Gateway lazy-spawns duplicate stdio MCP children with identical ppid+config (memory + CPU leak) [2 comments, 2 participants]

GitHub stats
openclaw/openclaw#75621 (fetched 2026-05-02 05:32:43)
Comments: 2, Participants: 2, Timeline events: 4, Reactions: 3
Timeline (top): commented ×2, cross-referenced ×2

Summary

On 2026.4.29, the OpenClaw gateway sometimes lazy-spawns two stdio MCP child processes for the same configured server, with identical ppid and identical --config argument. Both children stay alive indefinitely, doubling memory cost and CPU during the index-rebuild / handshake phase that runs at child startup. Restarting the gateway is currently the only way to clear the duplicates.

This appears related to but distinct from #75437: that issue is about per-message bundle-tools restaging; this is about persistent duplicate child processes from a single MCP config entry.

Evidence (steady state, no recent gateway restart)

$ ps -eo pid,ppid,etimes,%mem,rss,cmd --no-headers | grep graphiti.*main\.py
 464302  460593    471  5.2 850940 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
 473502  460593    131  5.2 850908 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
   6260    1273  21470  5.2 849132 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
   6032    1273  21480  5.2 851336 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml

PID 460593 is the OpenClaw gateway. It owns two graphiti-mcp children with the same config. Their etimes values (471 s vs 131 s) show they were spawned 340 s (about 5.7 min) apart, suggesting the second was spawned during a later agent session even though the first was still healthy.

(For context, PIDs 6032/6260 are an unrelated supervisor exhibiting the same pattern; reporting only the OpenClaw side here.)

After systemctl --user restart openclaw-gateway, both children disappear (clean shutdown). On the next inbound agent session that needs the MCP, exactly one new child is spawned — but eventually a second one appears again.

Symptoms when this happens

Concurrent gateway logs:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayMaxMs=11316.2 eventLoopUtilization=1 cpuCoreRatio=1.033
[fetch-timeout] fetch timeout reached; aborting operation        (×N, repeated)
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
typing TTL reached (2m); stopping typing indicator
[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
  workspace-sandbox:2217ms,
  core-plugin-tools:11841ms,
  bootstrap-context:5364ms,
  bundle-tools:29245ms,        ← duplicate stdio MCP startup running
  system-prompt:26264ms,
  session-resource-loader:9203ms,
  agent-session:7ms,
  stream-setup:18680ms

When two graphiti-mcp processes run build_indices_and_constraints() simultaneously against the same Neo4j instance (each fires ~30 index queries), the gateway event loop saturates, leading to the cascading fetch-timeout aborts above. End-user symptom on Telegram: messages take 100+ seconds and the typing indicator times out before the reply lands. Killing one of the duplicates with kill <youngest-pid> immediately drops gateway CPU and restores normal latency.
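
To see how the 102822 ms prep time in the trace above breaks down, the stage list can be ranked by share. This is only a rough sketch: the function name and regex are illustrative and not part of OpenClaw, and the stage values are copied from the trace.

```python
import re


def stage_shares(trace: str) -> list[tuple[str, int, float]]:
    """Parse 'name:NNNms' pairs; return (name, ms, share of summed ms), slowest first."""
    stages = [(name, int(ms)) for name, ms in re.findall(r"([\w-]+):(\d+)ms", trace)]
    total = sum(ms for _, ms in stages) or 1  # avoid division by zero
    return sorted(
        ((name, ms, ms / total) for name, ms in stages),
        key=lambda stage: stage[1],
        reverse=True,
    )


# Stage timings copied from the [trace:embedded-run] line in the report.
trace = (
    "workspace-sandbox:2217ms,core-plugin-tools:11841ms,bootstrap-context:5364ms,"
    "bundle-tools:29245ms,system-prompt:26264ms,session-resource-loader:9203ms,"
    "agent-session:7ms,stream-setup:18680ms"
)

for name, ms, share in stage_shares(trace):
    print(f"{name:24s} {ms:6d} ms  {share:5.1%}")
```

On these numbers bundle-tools is the largest single stage at roughly 28% of the summed time, with system-prompt close behind at roughly 26%.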

Recovery / mitigation we deployed

Until upstream fixes this, we run a per-5-min cron that:

  1. parses ps -eo pid,ppid,etimes,cmd --no-headers for graphiti.*main\.py,
  2. groups by (ppid, --config),
  3. for any group with count > 1, keeps the process with the largest etimes (the oldest) and SIGTERMs the rest,
  4. notifies via Telegram if duplicates are found or if the number of [diagnostic] liveness warning entries in the last 10 min exceeds a threshold.

This works as a band-aid but should not be necessary.
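
Steps 1 to 3 of the cron job above can be sketched roughly as follows. This is an illustrative reconstruction, not the script we actually run: the function name, the --config regex, and the choice to return PIDs instead of signalling them directly are all assumptions.

```python
import re
from collections import defaultdict

# Same pattern as the grep in the report; CONFIG_PATTERN is an assumed helper.
CMD_PATTERN = re.compile(r"graphiti.*main\.py")
CONFIG_PATTERN = re.compile(r"--config[ =](\S+)")


def find_duplicates(ps_output: str) -> list[int]:
    """Return the PIDs to terminate: every process except the oldest in each
    (ppid, --config) group that has more than one member.

    Expects the output of: ps -eo pid,ppid,etimes,cmd --no-headers
    """
    groups: dict[tuple[int, str], list[tuple[int, int]]] = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, ppid, etimes, cmd
        if len(fields) < 4 or not CMD_PATTERN.search(fields[3]):
            continue
        config = CONFIG_PATTERN.search(fields[3])
        if config is None:
            continue
        pid, ppid, etimes = (int(field) for field in fields[:3])
        groups[(ppid, config.group(1))].append((etimes, pid))

    victims: list[int] = []
    for members in groups.values():
        if len(members) > 1:
            members.sort(reverse=True)  # largest etimes (oldest) first
            victims.extend(pid for _, pid in members[1:])
    return victims
```

Each returned PID could then be passed to os.kill(pid, signal.SIGTERM). Keeping the selection logic as a pure function makes it easy to check against captured ps output before letting it signal anything.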

Suggested investigation

The bundle-mcp / lazy-load path appears to start a new child without checking whether an existing healthy child for the same (plugin, config) tuple is already running. Possible causes:

  • Race in agent-session bootstrap (two concurrent inbound messages each spawn before the registry sees the other's child).
  • The earlier child not being registered into the per-gateway "live MCP children" map after the previous gateway restart’s child reaping completed asynchronously.
  • The MCP server registry keying on something that differs across attempts (PID, run-id) instead of (plugin id, config path).

A simple fix worth considering: hold a Map<configKey, Promise<Child>> and have all callers await the same in-flight spawn promise.
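
The first possible cause above (a check-then-spawn race) and the shared-promise fix can be illustrated with a small simulation. This is a sketch under stated assumptions, not OpenClaw code: fake_spawn, config_key, and both registry classes are hypothetical, and an asyncio task stands in for the Promise.

```python
import asyncio
import os

spawn_count = 0  # counts simulated child spawns, for demonstration only


async def fake_spawn(key: str) -> str:
    """Hypothetical stand-in for a slow stdio MCP child startup."""
    global spawn_count
    spawn_count += 1
    await asyncio.sleep(0.01)  # simulates handshake / index rebuild
    return f"child-for-{key}"


def config_key(plugin_id: str, config_path: str) -> str:
    """Stable key from (plugin id, config path), never a PID or run id."""
    return f"{plugin_id}:{os.path.abspath(config_path)}"


class NaiveRegistry:
    """Check-then-spawn: records the child only after the spawn finishes."""

    def __init__(self) -> None:
        self.children: dict[str, str] = {}

    async def get(self, key: str) -> str:
        if key not in self.children:
            # A second caller can pass the check above while this await is pending.
            self.children[key] = await fake_spawn(key)
        return self.children[key]


class SingleFlightRegistry:
    """Stores the in-flight task before awaiting, so callers share one spawn."""

    def __init__(self) -> None:
        self.inflight: dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> str:
        task = self.inflight.get(key)
        if task is None:
            task = asyncio.create_task(fake_spawn(key))
            self.inflight[key] = task
        return await task


async def spawns_for(registry) -> int:
    """Run two concurrent callers and report how many spawns happened."""
    global spawn_count
    spawn_count = 0
    key = config_key("graphiti-mcp", "config-openclaw.yaml")
    await asyncio.gather(registry.get(key), registry.get(key))
    return spawn_count


naive = asyncio.run(spawns_for(NaiveRegistry()))
shared = asyncio.run(spawns_for(SingleFlightRegistry()))
print(naive, shared)  # prints "2 1"
```

A real implementation would also need to evict an entry when its child exits or its spawn fails, otherwise later callers would keep receiving a dead child or a cached error.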

Environment

OpenClaw   2026.4.29 (a448042)
Node       v22.22.2
Platform   Linux 6.17.0 (Ubuntu)
MCP server graphiti-mcp (stdio) via openclaw bundle-mcp; LLM=Gemini, DB=Neo4j (3 isolated instances)

Happy to attach a fuller journalctl span or strace/perf capture if useful — just say the word.

extent analysis

TL;DR

Implementing a Map<configKey, Promise<Child>> to track in-flight spawn promises for MCP children may resolve the issue of duplicate child processes.

Guidance

  • Investigate the bundle-mcp/lazy-load path to determine why a new child is started without checking for an existing healthy child with the same (plugin, config) tuple.
  • Review the agent-session bootstrap process to identify potential races that could lead to concurrent child spawns.
  • Consider adding logging or debugging statements to track the registration of children into the "live MCP children" map after a gateway restart.
  • Evaluate the MCP server registry keying mechanism to ensure it uses the correct identifier (e.g., (plugin id, config path)) instead of PID or run-id.

Example

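A possible implementation of the suggested fix, sketched here in Python with an asyncio task standing in for the Promise in Map<configKey, Promise<Child>>. spawn_mcp_child is a hypothetical placeholder for the real spawn routine:

import asyncio

# One spawn task per config key; all callers await the same task.
spawn_tasks: dict[str, asyncio.Task] = {}

async def spawn_mcp_child(config_key: str):
    raise NotImplementedError  # hypothetical: the real stdio spawn goes here

def spawn_child(config_key: str) -> asyncio.Task:
    """Must be called from a running event loop."""
    task = spawn_tasks.get(config_key)
    if task is None:
        # Register the task before anyone awaits it, so concurrent callers share it.
        task = asyncio.create_task(spawn_mcp_child(config_key))
        spawn_tasks[config_key] = task
    return task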

Notes

The cron-based mitigation only removes duplicates after the fact, up to five minutes after they appear. It does not prevent the duplicate spawn or the startup load that comes with it, so a fix in the spawn path is still needed.

Recommendation

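Apply the suggested fix by implementing a Map<configKey, Promise<Child>> to track in-flight spawn promises. This addresses the concurrent-spawn race, the first of the possible causes listed in the issue. If the registry turns out to be keyed on a per-attempt value such as PID or run-id, the key must also be made stable, for example (plugin id, config path).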
