openclaw - Gateway lazy-spawns duplicate stdio MCP children with identical ppid+config (memory + CPU leak) [2 comments, 2 participants]

Root Cause

In OpenClaw 2026.4.29, the gateway sometimes lazy-spawns two stdio MCP child processes for the same configured server, with identical ppid and identical --config argument. Both children stay alive indefinitely, which doubles the memory cost and doubles the CPU load of the index-rebuild / handshake phase that runs at child startup. Restarting the gateway is currently the only way to clear the duplicates.

This appears related to but distinct from #75437: that issue is about per-message bundle-tools restaging; this is about persistent duplicate child processes from a single MCP config entry.

Evidence (steady state, no recent gateway restart)

$ ps -eo pid,ppid,etimes,%mem,rss,cmd --no-headers | grep graphiti.*main\.py
 464302  460593    471  5.2 850940 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
 473502  460593    131  5.2 850908 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-openclaw.yaml
   6260    1273  21470  5.2 849132 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml
   6032    1273  21480  5.2 851336 .../graphiti/mcp_server/.venv/bin/python .../main.py --config .../config-zackchiu.yaml

PID 460593 is the OpenClaw gateway. It owns two graphiti-mcp children with the same config. Their etimes values (471 s vs 131 s) show they were spawned about 340 s (~5.7 min) apart, which suggests the second was spawned during a later agent session even though the first was still healthy.

(For context, PIDs 6032/6260 are children of an unrelated supervisor, PID 1273, that exhibits the same pattern; reporting only the OpenClaw side here.)

After systemctl --user restart openclaw-gateway, both children disappear (clean shutdown). On the next inbound agent session that needs the MCP, exactly one new child is spawned, but a second one eventually appears again.

Symptoms when this happens

Concurrent gateway logs:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayMaxMs=11316.2 eventLoopUtilization=1 cpuCoreRatio=1.033
[fetch-timeout] fetch timeout reached; aborting operation        (×N, repeated)
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
typing TTL reached (2m); stopping typing indicator
[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
  workspace-sandbox:2217ms,
  core-plugin-tools:11841ms,
  bootstrap-context:5364ms,
  bundle-tools:29245ms,        ← duplicate stdio MCP startup running
  system-prompt:26264ms,
  session-resource-loader:9203ms,
  agent-session:7ms,
  stream-setup:18680ms

When two graphiti-mcp processes run build_indices_and_constraints() against the same Neo4j at the same time (each fires ~30 index queries), the gateway event loop saturates, which leads to the cascading fetch-timeout aborts above. End-user symptom on Telegram: messages take 100+ seconds and the typing indicator times out before the reply lands. Killing the youngest duplicate with kill <youngest-pid> immediately drops gateway CPU and restores normal latency.
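To see which prep stages dominate a slow run, the trace entry above can be broken down per stage. This is a minimal sketch that only parses the log text as quoted in this issue; the function name parse_prep_stages is made up for illustration and is not part of OpenClaw.

```python
import re

# Trace entry copied from the gateway log quoted above.
TRACE = """[trace:embedded-run] prep stages: phase=stream-ready totalMs=102822 stages=
  workspace-sandbox:2217ms,
  core-plugin-tools:11841ms,
  bootstrap-context:5364ms,
  bundle-tools:29245ms,        ← duplicate stdio MCP startup running
  system-prompt:26264ms,
  session-resource-loader:9203ms,
  agent-session:7ms,
  stream-setup:18680ms
"""


def parse_prep_stages(trace: str) -> dict[str, int]:
    """Return {stage_name: milliseconds} for every "name:<digits>ms" pair."""
    # "totalMs=102822" and "phase=stream-ready" use "=" and so do not match.
    return {name: int(ms) for name, ms in re.findall(r"([a-z][a-z-]*):(\d+)ms", trace)}


stages = parse_prep_stages(TRACE)
total = sum(stages.values())
ranked = sorted(stages.items(), key=lambda kv: kv[1], reverse=True)

for name, ms in ranked:
    print(f"{name:<26}{ms:>7} ms  {ms / total:6.1%}")
```

The eight stages sum to 102821 ms, in line with the reported totalMs=102822. bundle-tools is the largest single stage, and together with system-prompt and stream-setup it accounts for roughly 72% of the prep time in this run.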

Recovery / mitigation we deployed

Until upstream fixes this, we run a per-5-min cron that:

parses ps -eo pid,ppid,etimes,cmd --no-headers for graphiti.*main\.py,
groups by (ppid, --config),
for any group with count > 1, keeps the process with the largest etimes (the oldest) and SIGTERMs the rest,
notifies via Telegram if duplicates are found or if the number of [diagnostic] liveness warnings in the last 10 min exceeds a threshold.

This works as a band-aid but should not be necessary.
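The cron job described above can be sketched roughly as follows. This is not the exact script we run: the function names are illustrative, the Telegram notification is left out, and the regular expression for --config assumes the argument is passed as "--config <path>" or "--config=<path>".

```python
import os
import re
import signal
import subprocess
from collections import defaultdict

PROCESS_PATTERN = re.compile(r"graphiti.*main\.py")
CONFIG_PATTERN = re.compile(r"--config[= ](\S+)")


def find_duplicates(ps_output: str) -> list[int]:
    """Return the PIDs to terminate: all but the oldest per (ppid, --config)."""
    groups = defaultdict(list)  # (ppid, config) -> [(etimes, pid), ...]
    for line in ps_output.splitlines():
        fields = line.split(None, 3)  # pid, ppid, etimes, cmd
        if len(fields) < 4 or not PROCESS_PATTERN.search(fields[3]):
            continue
        config = CONFIG_PATTERN.search(fields[3])
        if config is None:
            continue
        pid, ppid, etimes = int(fields[0]), int(fields[1]), int(fields[2])
        groups[(ppid, config.group(1))].append((etimes, pid))

    victims = []
    for procs in groups.values():
        procs.sort(reverse=True)  # largest etimes (oldest process) first
        victims.extend(pid for _, pid in procs[1:])
    return victims


def terminate_duplicates() -> list[int]:
    """Run ps, SIGTERM every duplicate, and return the PIDs that were signalled."""
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,ppid,etimes,cmd", "--no-headers"],
        capture_output=True, text=True, check=True,
    ).stdout
    victims = find_duplicates(ps_output)
    for pid in victims:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited between ps and kill
    return victims
```

Calling terminate_duplicates() from a script that cron runs every 5 minutes reproduces the mitigation. Keeping find_duplicates() a pure function over the ps text makes it easy to check against captured output before letting it signal anything.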

Suggested investigation

The bundle-mcp / lazy-load path appears to start a new child without checking whether an existing healthy child for the same (plugin, config) tuple is already running. Possible causes:

Race in agent-session bootstrap (two concurrent inbound messages each spawn before the registry sees the other's child).
The earlier child never being registered in the per-gateway "live MCP children" map, for example because the child reaping from the previous gateway restart completed asynchronously and raced with that registration.
The MCP server registry keying on something that differs across attempts (PID, run-id) instead of (plugin id, config path).

A simple fix worth considering: hold a Map<configKey, Promise<Child>> and have all callers await the same in-flight spawn promise.
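To illustrate the first hypothesis and why the map should hold the in-flight promise rather than the finished child, here is a small asyncio simulation. It is only a sketch under assumptions: fake_spawn, RacyRegistry and SingleFlightRegistry are hypothetical names, nothing here is OpenClaw code, and an asyncio.Task stands in for the Promise<Child>.

```python
import asyncio

spawn_count = 0


async def fake_spawn(config_key):
    """Hypothetical stand-in for spawning a stdio child and finishing its handshake."""
    global spawn_count
    spawn_count += 1
    await asyncio.sleep(0.01)  # startup takes time; other callers run meanwhile
    return f"child-{spawn_count}"


class RacyRegistry:
    """Records the child only after the spawn finished (check-then-act race)."""

    def __init__(self):
        self.children = {}

    async def get(self, key):
        if key not in self.children:
            # A second caller passes the check above while this await is pending.
            self.children[key] = await fake_spawn(key)
        return self.children[key]


class SingleFlightRegistry:
    """Records the in-flight task before awaiting, so callers share one spawn."""

    def __init__(self):
        self.tasks = {}

    async def get(self, key):
        task = self.tasks.get(key)
        if task is None:
            task = asyncio.create_task(fake_spawn(key))
            self.tasks[key] = task
            task.add_done_callback(lambda t: self._evict_if_failed(key, t))
        return await task

    def _evict_if_failed(self, key, task):
        # Forget failed or cancelled spawns so that a later call can retry.
        failed = task.cancelled() or task.exception() is not None
        if failed and self.tasks.get(key) is task:
            del self.tasks[key]


async def count_spawns(registry):
    """Issue two concurrent requests for the same key and count the spawns."""
    global spawn_count
    spawn_count = 0
    key = ("graphiti-mcp", "config-openclaw.yaml")
    await asyncio.gather(registry.get(key), registry.get(key))
    return spawn_count


racy = asyncio.run(count_spawns(RacyRegistry()))
single = asyncio.run(count_spawns(SingleFlightRegistry()))
print(racy, single)  # prints "2 1"
```

A real implementation would presumably also remove the entry when the child exits, so that a crashed MCP server can be respawned on the next request.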

Environment

OpenClaw   2026.4.29 (a448042)
Node       v22.22.2
Platform   Linux 6.17.0 (Ubuntu)
MCP server graphiti-mcp (stdio) via openclaw bundle-mcp; LLM=Gemini, DB=Neo4j (3 isolated instances)

Happy to attach a fuller journalctl span or strace/perf capture if useful — just say the word.

extent analysis

TL;DR

Implementing a Map<configKey, Promise<Child>> to track in-flight spawn promises for MCP children may resolve the issue of duplicate child processes.

Guidance

Investigate the bundle-mcp/lazy-load path to determine why a new child is started without checking for an existing healthy child with the same (plugin, config) tuple.
Review the agent-session bootstrap process to identify potential races that could lead to concurrent child spawns.
Consider adding logging or debugging statements to track the registration of children into the "live MCP children" map after a gateway restart.
Evaluate the MCP server registry keying mechanism to ensure it uses the correct identifier (e.g., (plugin id, config path)) instead of PID or run-id.
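As an illustration of the last point, a registry key can be derived only from values that stay stable across attempts. The helper below is a hypothetical sketch (config_key is not an OpenClaw function) that normalises the config path so different spellings of the same file map to the same entry.

```python
from pathlib import Path


def config_key(plugin_id: str, config_path: str) -> tuple[str, str]:
    """Build a registry key from stable inputs only (no PID, no run-id)."""
    # resolve() makes the path absolute and collapses ".", ".." and symlinks.
    return (plugin_id, str(Path(config_path).expanduser().resolve()))


a = config_key("graphiti-mcp", "conf/../conf/config-openclaw.yaml")
b = config_key("graphiti-mcp", "./conf/config-openclaw.yaml")
print(a == b)  # prints "True"
```

If the server's behaviour also depends on its arguments or environment, those would need to be part of the key as well; that is a design choice this sketch leaves open.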

Example

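A possible implementation of the suggested fix could involve a Map<configKey, Promise<Child>> that holds in-flight spawn promises. The sketch below uses Python's asyncio, with a Task playing the role of the Promise; spawn_mcp_child is a placeholder, not a real OpenClaw function:

import asyncio

# In-flight and completed spawns, keyed by config key, e.g. (plugin id, config path).
spawn_tasks = {}

async def spawn_mcp_child(config_key):
    # Placeholder: the real spawn and MCP handshake would go here.
    await asyncio.sleep(0)
    return {"config_key": config_key}

def spawn_child(config_key):
    # Must be called from a running event loop. The task is stored before
    # any await, so concurrent callers for the same key share one spawn.
    if config_key not in spawn_tasks:
        spawn_tasks[config_key] = asyncio.create_task(spawn_mcp_child(config_key))
    return spawn_tasks[config_key]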

Notes

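The cron-based mitigation only cleans up after the fact: it runs every 5 minutes, so a duplicate can stay alive, and slow the gateway during its startup, until the next run. A fix in the spawn path is still needed to prevent duplicate child processes from being created in the first place.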

Recommendation

Apply the suggested fix by implementing a Map<configKey, Promise<Child>> to track in-flight spawn promises, as this approach addresses the likely cause of the issue and provides a clear path forward for resolving the problem.
