openclaw - ✅(Solved) Fix [Bug] MCP stdio server processes accumulate as children of gateway — never reaped when new session spawns fresh pool [1 pull requests, 1 participants]

GitHub stats
openclaw/openclaw#64169 · Fetched 2026-04-11 06:16:05
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Every time an isolated agentTurn session starts (heartbeat, cron jobs), the gateway spawns a fresh set of MCP stdio server processes as children. The old processes are never terminated. Over time they accumulate indefinitely.

Root Cause

Each isolated agentTurn session initialises its own MCP client connection pool, spawning new child processes. When the session ends, the gateway does not reap the MCP children it spawned for that session.

Fix Action

Workaround

Periodic cleanup cron to kill duplicate processes, keeping only the newest instance of each MCP server type.

PR fix notes

PR #64316: fix(agents): release bundle MCP runtime on mid-run session reset

Description (problem / solution / changelog)

Summary

  • Problem: resetReplyRunSession rotated the active sessionId after auto-compaction failure, context overflow, or role-ordering conflicts, but it never released the previous session id's entry from the bundle MCP runtime cache.
  • Why it matters: runtimesBySessionId in src/agents/pi-bundle-mcp-runtime.ts holds Client -> Transport -> stdio ChildProcess references with no TTL/LRU, so old MCP workers like gate-mcp, bnbchain-mcp, and chrome-devtools-mcp stayed alive and accumulated until memory-constrained deployments OOM'd.
  • What changed: resetReplyRunSession now disposes the previous session id's bundle MCP runtime in the background with void ... .catch(...) after the replacement session is fully established, and logs disposal failures through the existing deps.error(...) seam without blocking the retry path.
  • What did NOT change (scope boundary): this does not alter how the new session is established, and it does not add broader cache eviction or runtime lifecycle changes outside the session-reset path.
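
The non-blocking disposal described in the summary can be sketched as follows. This is a minimal illustration, not the code from the PR: `ResetDeps`, `disposeRuntime`, and `resetSession` are hypothetical names, and only the `void ... .catch(...)` shape and the `error(...)` logging seam come from the description above.

```typescript
// Hypothetical dependency seam; names are illustrative, not the real openclaw API.
type ResetDeps = {
  disposeRuntime: (sessionId: string) => Promise<void>;
  error: (message: string) => void;
};

// Rotates to a new session id, then releases the old runtime without awaiting it.
function resetSession(
  previousSessionId: string,
  nextSessionId: string,
  deps: ResetDeps,
): { sessionId: string } {
  // ...the replacement session would be fully established here...

  // Fire-and-forget: `void` marks the promise as intentionally unawaited,
  // and `.catch` keeps a rejection from becoming an unhandled rejection.
  void deps.disposeRuntime(previousSessionId).catch((err: unknown) => {
    deps.error(`MCP runtime disposal failed for ${previousSessionId}: ${String(err)}`);
  });

  return { sessionId: nextSessionId };
}

// Usage: disposal rejects, yet the reset still returns the new session id.
const disposed: string[] = [];
const logged: string[] = [];
const result = resetSession("session-old", "session-new", {
  disposeRuntime: async (id) => {
    disposed.push(id);
    throw new Error("transport already closed");
  },
  error: (message) => {
    logged.push(message);
  },
});
```

Because the promise is not awaited, the failure message reaches `logged` only after the current call stack unwinds. That is the trade-off named under Risks and Mitigations: the retry path is never delayed, but a hung disposal is not detected by the caller.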

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #64169
  • Related #60656
  • Related #62026
  • Related #62731
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: mid-run session rotation persisted the new session metadata and optionally deleted the old transcript, but it never released the previous session id from the bundle MCP runtime cache.
  • Missing detection / guardrail: the reset-path tests did not assert MCP runtime disposal or cover disposal failures.
  • Contributing context (if known): the cache has no TTL/LRU and holds stdio-backed MCP client/process references, so each reset could strand another worker pool.
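
To make the stranding mechanism concrete, here is a small sketch of a session-keyed cache with no TTL/LRU. Only the name `runtimesBySessionId` is taken from the PR text; `McpRuntime`, `RuntimeCache`, and `release` are assumptions made for illustration and do not reflect the real module's shape.

```typescript
// Stand-in for a Client -> Transport -> ChildProcess chain; hypothetical shape.
interface McpRuntime {
  close(): Promise<void>;
}

class RuntimeCache {
  private readonly runtimesBySessionId = new Map<string, McpRuntime>();

  getOrCreate(sessionId: string, create: () => McpRuntime): McpRuntime {
    let runtime = this.runtimesBySessionId.get(sessionId);
    if (!runtime) {
      runtime = create();
      this.runtimesBySessionId.set(sessionId, runtime);
    }
    return runtime;
  }

  // Without a call like this, an entry stays referenced until the process exits.
  async release(sessionId: string): Promise<void> {
    const runtime = this.runtimesBySessionId.get(sessionId);
    if (!runtime) return;
    this.runtimesBySessionId.delete(sessionId);
    await runtime.close();
  }

  get size(): number {
    return this.runtimesBySessionId.size;
  }
}

let closedCount = 0;
const makeRuntime = (): McpRuntime => ({
  close: async () => {
    closedCount += 1;
  },
});

const cache = new RuntimeCache();
cache.getOrCreate("s1", makeRuntime);
cache.getOrCreate("s2", makeRuntime); // a reset rotated s1 -> s2; s1 is now stranded
const sizeBeforeRelease = cache.size;
void cache.release("s1"); // releasing the previous id drops the stranded entry
const sizeAfterRelease = cache.size;
```

Each reset that skips the release step leaves one more entry (and, in the real system, one more worker pool) behind.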

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/agent-runner-session-reset.test.ts
  • Scenario the test should lock in: resetting a reply run session disposes the previous session runtime, still succeeds when disposal throws, and still deletes the old transcript when cleanupTranscripts: true even if disposal fails.
  • Why this is the smallest reliable guardrail: the behavior is owned by resetReplyRunSession, and the relevant failure handling is exposed through that helper's dependency seam.
  • Existing test that already covers this (if any): none before this change.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

None.

Diagram (if applicable)

Before:
[reset trigger] -> [new session metadata persisted] -> [old MCP runtime remains cached/alive]

After:
[reset trigger] -> [new session metadata persisted] -> [background dispose of old MCP runtime] -> [retry continues]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local repo checkout
  • Model/provider: N/A
  • Integration/channel (if any): MCP stdio runtime path
  • Relevant config (redacted): N/A

Steps

  1. Trigger resetReplyRunSession from an auto-compaction failure, context overflow, or role-ordering conflict while a bundle MCP runtime is cached for the current session id.
  2. Observe that the helper rotates sessionId, persists the replacement session metadata, and optionally deletes old transcript files.
  3. Verify that the previous session id's MCP runtime is disposed in the background and that disposal failures are logged without preventing the reset from succeeding.

Expected

  • Old session MCP workers are released when the session resets.
  • Reset still returns success even if runtime disposal throws.
  • Transcript cleanup still occurs when requested.

Actual

  • Matches expected after this change.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm tsgo; pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm format on the touched files.
  • Edge cases checked: disposal success, disposal failure logging, and transcript cleanup when disposal fails.
  • What you did not verify: long-running production memory behavior outside the targeted reset/test surface.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: background disposal could fail or hang independently of the retry path.
    • Mitigation: disposal is intentionally non-blocking, failures are logged through deps.error(...), and tests cover the failure path plus transcript cleanup continuity.

Tests

  • pnpm tsgo
  • pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts (4 passed)
  • pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts (0 warnings / 0 errors)
  • pnpm format on touched files

Changed files

  • src/auto-reply/reply/agent-runner-session-reset.test.ts (modified, +76/-0)
  • src/auto-reply/reply/agent-runner-session-reset.ts (modified, +12/-0)


Environment

  • OpenClaw: 2026.4.9
  • OS: Linux 6.12 (arm64, Raspberry Pi)
  • MCP servers: mcp-server-fetch (uvx), mcp-yfinance-server (uv + python)
  • Config: native mcp.servers in openclaw.json

Description

Every time an isolated agentTurn session starts (heartbeat, cron jobs), the gateway spawns a fresh set of MCP stdio server processes as children. The old processes are never terminated. Over time they accumulate indefinitely.

Steps to Reproduce

  1. Configure 2+ stdio MCP servers in openclaw.json under mcp.servers
  2. Enable heartbeat (any interval)
  3. Wait for heartbeat to fire
  4. Run pstree -p <gateway-pid>: new MCP child processes appear each cycle while the old ones remain

Evidence

# After ~90 minutes (3 heartbeat cycles at 30 min):
$ ps aux | grep -E "mcp-server-fetch|mcp-yfinance" | grep -v grep | wc -l
9  # 3 sets of 3 processes

# pstree confirms all are children of openclaw-gateway (pid 131886)
uv(132045)→python(132070)   # spawned 22:54
uv(132299)→python(132321)   # spawned 23:10 (heartbeat 1)
uv(132676)→python(132691)   # spawned 23:40 (heartbeat 2)
uv(133194)→python(133213)   # spawned 00:12 (heartbeat 3)

Key Findings

  • 15 consecutive tool calls → still 1 process (not per-call spawning like #15337)
  • Heartbeat fires → new process triplet spawned ~1 min later
  • Processes are children of the gateway (not orphaned/detached)
  • No MCP timeout config exists to tune this behaviour
  • Session store prunes stale sessions after 24h (maxAgeMs: 86400000) but does not terminate associated MCP child processes
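
The last finding suggests that age-based pruning and process cleanup are decoupled. One way to couple them, sketched under assumptions, is an eviction hook on the prune step. `SessionEntry`, `pruneStaleSessions`, and `onEvict` are hypothetical names; only the `maxAgeMs` value of 86400000 comes from the issue.

```typescript
interface SessionEntry {
  lastActiveMs: number;
}

// Removes sessions older than maxAgeMs and reports each evicted id to onEvict,
// which is where a caller could dispose the MCP runtime tied to that session.
function pruneStaleSessions(
  sessions: Map<string, SessionEntry>,
  nowMs: number,
  maxAgeMs: number,
  onEvict: (sessionId: string) => void,
): void {
  sessions.forEach((entry, sessionId) => {
    if (nowMs - entry.lastActiveMs > maxAgeMs) {
      sessions.delete(sessionId);
      onEvict(sessionId);
    }
  });
}

const MAX_AGE_MS = 86400000; // 24h, as reported in the issue
const sessions = new Map<string, SessionEntry>([
  ["stale", { lastActiveMs: 0 }],
  ["fresh", { lastActiveMs: 86400000 }],
]);
const evicted: string[] = [];
pruneStaleSessions(sessions, 90000000, MAX_AGE_MS, (id) => {
  evicted.push(id);
});
```

A 24-hour horizon would still allow many heartbeat cycles to pile up, so a hook like this would complement, not replace, cleanup at session end.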

Root Cause

Each isolated agentTurn session initialises its own MCP client connection pool, spawning new child processes. When the session ends, the gateway does not reap the MCP children it spawned for that session.

Expected Behaviour

Either:

  • Share a single MCP process pool at the gateway level across all sessions, or
  • Reap MCP child processes when the session that spawned them ends
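
The two options can be combined in a reference-counted pool: sessions share one process per server, and the process is stopped when the last session releases it. The sketch below is a design illustration under assumptions, not gateway code. `SharedMcpPool`, `McpServerHandle`, and the `start` factory are hypothetical, and a real `stop()` would terminate a stdio child process.

```typescript
interface McpServerHandle {
  stop(): void;
}

interface PoolEntry {
  handle: McpServerHandle;
  refs: number;
}

class SharedMcpPool {
  private readonly entries = new Map<string, PoolEntry>();
  private readonly start: (serverName: string) => McpServerHandle;

  constructor(start: (serverName: string) => McpServerHandle) {
    this.start = start;
  }

  // Returns the shared handle, starting the server only on first use.
  acquire(serverName: string): McpServerHandle {
    let entry = this.entries.get(serverName);
    if (!entry) {
      entry = { handle: this.start(serverName), refs: 0 };
      this.entries.set(serverName, entry);
    }
    entry.refs += 1;
    return entry.handle;
  }

  // Stops the server once no session holds a reference to it.
  release(serverName: string): void {
    const entry = this.entries.get(serverName);
    if (!entry) return;
    entry.refs -= 1;
    if (entry.refs <= 0) {
      this.entries.delete(serverName);
      entry.handle.stop();
    }
  }
}

let started = 0;
let stopped = 0;
const pool = new SharedMcpPool(() => {
  started += 1;
  return {
    stop: () => {
      stopped += 1;
    },
  };
});

pool.acquire("mcp-server-fetch"); // heartbeat session 1
pool.acquire("mcp-server-fetch"); // heartbeat session 2 reuses the same process
const startedAfterTwoSessions = started;
pool.release("mcp-server-fetch");
pool.release("mcp-server-fetch"); // last reference gone, server is stopped
```

A gateway that prefers a permanently warm pool could skip the stop-at-zero step; the reference count then only serves as a diagnostic.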

Impact

On a Raspberry Pi (4GB RAM), ~6 new processes per hour × ~110MB each = OOM within hours of uptime without manual intervention.
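
As a rough check of that estimate, the arithmetic can be written out. The figures are the ones quoted above; treating all 4 GB as available is a simplifying assumption, so the result is an upper bound on time to exhaustion.

```typescript
const processesPerHour = 6;
const megabytesPerProcess = 110;
const totalRamMegabytes = 4 * 1024;

// Memory consumed by newly stranded processes each hour.
const growthPerHour = processesPerHour * megabytesPerProcess; // 660 MB per hour

// Hours until the stranded processes alone would fill RAM.
const hoursToFillRam = totalRamMegabytes / growthPerHour; // about 6.2 hours
```

The OS, the gateway, and the live MCP servers already occupy part of that memory, so the practical limit arrives sooner.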

Workaround

Periodic cleanup cron to kill duplicate processes, keeping only the newest instance of each MCP server type.

Related

#15337 (per-call spawning variant of the same family)

Extended analysis

TL;DR

Implement a periodic cleanup mechanism to terminate redundant MCP child processes spawned by the gateway.

Guidance

  • Identify the process IDs of the redundant MCP child processes using pstree and ps aux commands.
  • Develop a cron job to periodically kill the duplicate processes, ensuring only the newest instance of each MCP server type remains.
  • Consider implementing a shared MCP process pool at the gateway level to prevent redundant process spawning.
  • Monitor system resources to prevent out-of-memory errors due to excessive process accumulation.

Example

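A cron job can combine pgrep and kill to remove duplicates. Note that pkill -o takes no argument and signals only the single oldest match, so it cannot by itself keep just the newest instance:

# Kill all but the newest instance of each MCP server type among the gateway's direct children
gateway_pid=$(pgrep -o -f "openclaw-gateway") || exit 0
for pattern in "mcp-server-fetch" "mcp-yfinance-server"; do
  newest=$(pgrep -n -P "$gateway_pid" -f "$pattern") || continue
  pgrep -P "$gateway_pid" -f "$pattern" | grep -vx "$newest" | xargs -r kill
done

Note: This example is a simplified illustration and may require modifications to fit the specific use case. It assumes procps pgrep, GNU xargs, and a gateway command line that contains openclaw-gateway. Whether terminating a uv wrapper also stops its python child should be verified on the target system.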
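
The selection rule behind the workaround (keep the newest instance per server type, terminate the rest) can also be expressed as a pure function, which is easier to test than a shell pipeline. The `ProcessInfo` shape, the pids, and the start times below are made up for illustration.

```typescript
interface ProcessInfo {
  pid: number;
  serverType: string;
  startedAtMs: number;
}

// Returns the pids of every process that is not the newest of its server type.
function selectDuplicatePids(processes: ProcessInfo[]): number[] {
  const newestByType = new Map<string, ProcessInfo>();
  for (const proc of processes) {
    const current = newestByType.get(proc.serverType);
    if (!current || proc.startedAtMs > current.startedAtMs) {
      newestByType.set(proc.serverType, proc);
    }
  }
  return processes
    .filter((proc) => newestByType.get(proc.serverType)?.pid !== proc.pid)
    .map((proc) => proc.pid);
}

// Illustrative data: three generations of one server, one generation of another.
const duplicates = selectDuplicatePids([
  { pid: 101, serverType: "mcp-server-fetch", startedAtMs: 1000 },
  { pid: 102, serverType: "mcp-server-fetch", startedAtMs: 2000 },
  { pid: 103, serverType: "mcp-server-fetch", startedAtMs: 3000 },
  { pid: 201, serverType: "mcp-yfinance-server", startedAtMs: 1500 },
]);
```

Here `duplicates` contains pids 101 and 102; the newest fetch server (103) and the only yfinance server (201) are kept.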

Notes

The provided workaround is a temporary solution to mitigate the issue. A more permanent fix would involve modifying the gateway to reap MCP child processes when the session ends or implementing a shared MCP process pool.

Recommendation

Apply the workaround by implementing a periodic cleanup cron job to kill duplicate processes, as this provides a temporary solution to prevent out-of-memory errors and allows for further investigation into a more permanent fix.
