openclaw - ✅(Solved) Fix [Bug] MCP stdio server processes accumulate as children of gateway — never reaped when new session spawns fresh pool [1 pull requests, 1 participants]

podulator · 2026-04-10T07:27:00Z

openclaw/openclaw#64169•Fetched 2026-04-11 06:16:05

Every time an isolated agentTurn session starts (heartbeat, cron jobs), the gateway spawns a fresh set of MCP stdio server processes as children. The old processes are never terminated. Over time they accumulate indefinitely.

Root Cause

Each isolated agentTurn session initialises its own MCP client connection pool, spawning new child processes. When the session ends, the gateway does not reap the MCP children it spawned for that session.

Fix Action

Workaround

Periodic cleanup cron to kill duplicate processes, keeping only the newest instance of each MCP server type.

PR fix notes

PR #64316: fix(agents): release bundle MCP runtime on mid-run session reset

Repository: openclaw/openclaw
Author: xxxxxmax
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/64316

Description (problem / solution / changelog)

Summary

Problem: resetReplyRunSession rotated the active sessionId after auto-compaction failure, context overflow, or role-ordering conflicts, but it never released the previous session id's entry from the bundle MCP runtime cache.
Why it matters: runtimesBySessionId in src/agents/pi-bundle-mcp-runtime.ts holds Client -> Transport -> stdio ChildProcess references with no TTL/LRU, so old MCP workers like gate-mcp, bnbchain-mcp, and chrome-devtools-mcp stayed alive and accumulated until memory-constrained deployments OOM'd.
What changed: resetReplyRunSession now disposes the previous session id's bundle MCP runtime in the background with void ... .catch(...) after the replacement session is fully established, and logs disposal failures through the existing deps.error(...) seam without blocking the retry path.
What did NOT change (scope boundary): this does not alter how the new session is established, and it does not add broader cache eviction or runtime lifecycle changes outside the session-reset path.
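
The fire-and-forget disposal described above can be illustrated with a minimal sketch. This is not the PR's code: `McpRuntime`, `ResetDeps`, `disposeRuntimeForSession`, and the body of `resetReplyRunSession` are simplified stand-ins, and the cache is assumed to be a plain `Map` keyed by session id, as the name `runtimesBySessionId` suggests.

```typescript
// Hypothetical stand-in for a cached bundle MCP runtime
// (client + transport + stdio child process).
interface McpRuntime {
  dispose(): Promise<void>;
}

// Assumed shape: a plain Map with no TTL/LRU, as the PR description states.
const runtimesBySessionId = new Map<string, McpRuntime>();

interface ResetDeps {
  error: (message: string) => void; // logging seam, mirroring deps.error(...)
  newSessionId: () => string;       // hypothetical id generator
}

// Drop the cache entry first so nothing can reuse it, then dispose the runtime.
async function disposeRuntimeForSession(sessionId: string): Promise<void> {
  const runtime = runtimesBySessionId.get(sessionId);
  if (!runtime) return;
  runtimesBySessionId.delete(sessionId);
  await runtime.dispose();
}

function resetReplyRunSession(previousSessionId: string, deps: ResetDeps): string {
  const nextSessionId = deps.newSessionId();
  // ...persist replacement session metadata, optionally delete the old transcript...

  // Fire-and-forget: the retry path never awaits disposal, and a rejection is
  // logged instead of surfacing as an unhandled promise rejection.
  void disposeRuntimeForSession(previousSessionId).catch((err: unknown) => {
    deps.error(`failed to dispose MCP runtime for ${previousSessionId}: ${String(err)}`);
  });

  return nextSessionId;
}
```

Because the disposal promise is never awaited, a slow or failing `dispose()` cannot delay the retry. The trade-off, which the PR lists under its risks, is that the caller gets no signal about whether the old workers actually exited.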

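Change Type (select all)

[x] Bug fix
[ ] Feature
[ ] Refactor required for the fix
[ ] Docs
[ ] Security hardening
[ ] Chore/infra

Scope (select all touched areas)

[x] Gateway / orchestration
[x] Skills / tool execution
[ ] Auth / tokens
[ ] Memory / storage
[x] Integrations
[ ] API / contracts
[ ] UI / DX
[ ] CI/CD / infra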

Linked Issue/PR

Closes #
Related #64169
Related #60656
Related #62026
Related #62731
[x] This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: mid-run session rotation persisted the new session metadata and optionally deleted the old transcript, but it never released the previous session id from the bundle MCP runtime cache.
Missing detection / guardrail: the reset-path tests did not assert MCP runtime disposal or cover disposal failures.
Contributing context (if known): the cache has no TTL/LRU and holds stdio-backed MCP client/process references, so each reset could strand another worker pool.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- [x] Unit test
- [ ] Seam / integration test
- [ ] End-to-end test
- [ ] Existing coverage already sufficient
Target test or file: src/auto-reply/reply/agent-runner-session-reset.test.ts
Scenario the test should lock in: resetting a reply run session disposes the previous session runtime, still succeeds when disposal throws, and still deletes the old transcript when cleanupTranscripts: true even if disposal fails.
Why this is the smallest reliable guardrail: the behavior is owned by resetReplyRunSession, and the relevant failure handling is exposed through that helper's dependency seam.
Existing test that already covers this (if any): none before this change.
If no new test is added, why not: N/A

User-visible / Behavior Changes

None.

Diagram (if applicable)

Before:
[reset trigger] -> [new session metadata persisted] -> [old MCP runtime remains cached/alive]

After:
[reset trigger] -> [new session metadata persisted] -> [background dispose of old MCP runtime] -> [retry continues]

Security Impact (required)

New permissions/capabilities? (No)
Secrets/tokens handling changed? (No)
New/changed network calls? (No)
Command/tool execution surface changed? (No)
Data access scope changed? (No)
If any Yes, explain risk + mitigation:

Repro + Verification

Environment

OS: Linux
Runtime/container: local repo checkout
Model/provider: N/A
Integration/channel (if any): MCP stdio runtime path
Relevant config (redacted): N/A

Steps

1. Trigger resetReplyRunSession from an auto-compaction failure, context overflow, or role-ordering conflict while a bundle MCP runtime is cached for the current session id.
2. Observe that the helper rotates sessionId, persists the replacement session metadata, and optionally deletes old transcript files.
3. Verify that the previous session id's MCP runtime is disposed in the background and that disposal failures are logged without preventing the reset from succeeding.

Expected

Old session MCP workers are released when the session resets.
Reset still returns success even if runtime disposal throws.
Transcript cleanup still occurs when requested.

Actual

Matches expected after this change.

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Human Verification (required)

Verified scenarios: pnpm tsgo; pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm format on the touched files.
Edge cases checked: disposal success, disposal failure logging, and transcript cleanup when disposal fails.
What you did not verify: long-running production memory behavior outside the targeted reset/test surface.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? (Yes)
Config/env changes? (No)
Migration needed? (No)
If yes, exact upgrade steps:

Risks and Mitigations

Risk: background disposal could fail or hang independently of the retry path.
- Mitigation: disposal is intentionally non-blocking, failures are logged through deps.error(...), and tests cover the failure path plus transcript cleanup continuity.

Tests

pnpm tsgo
pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts (4 passed)
pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts (0 warnings / 0 errors)
pnpm format on touched files

Changed files

src/auto-reply/reply/agent-runner-session-reset.test.ts (modified, +76/-0)
src/auto-reply/reply/agent-runner-session-reset.ts (modified, +12/-0)

Environment

OpenClaw: 2026.4.9
OS: Linux 6.12 (arm64, Raspberry Pi)
MCP servers: mcp-server-fetch (uvx), mcp-yfinance-server (uv + python)
Config: native mcp.servers in openclaw.json

Description

Steps to Reproduce

1. Configure 2+ stdio MCP servers in openclaw.json under mcp.servers
2. Enable heartbeat (any interval)
3. Wait for heartbeat to fire
4. Run pstree -p <gateway-pid>: new MCP child processes appear each cycle, and the old ones remain

Evidence

# After ~90 minutes (3 heartbeat cycles at 30 min):
$ ps aux | grep -E "mcp-server-fetch|mcp-yfinance" | grep -v grep | wc -l
9  # 3 sets of 3 processes

# pstree confirms all are children of openclaw-gateway (pid 131886)
uv(132045)→python(132070)   # spawned 22:54
uv(132299)→python(132321)   # spawned 23:10 (heartbeat 1)
uv(132676)→python(132691)   # spawned 23:40 (heartbeat 2)
uv(133194)→python(133213)   # spawned 00:12 (heartbeat 3)

Key Findings

15 consecutive tool calls → still 1 process (not per-call spawning like #15337)
Heartbeat fires → new process triplet spawned ~1 min later
Processes are children of the gateway (not orphaned/detached)
No MCP timeout config exists to tune this behaviour
Session store prunes stale sessions after 24h (maxAgeMs: 86400000) but does not terminate associated MCP child processes

Root Cause

Expected Behaviour

Either:

Share a single MCP process pool at the gateway level across all sessions, or
Reap MCP child processes when the session that spawned them ends
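
The second option (reaping on session end) could look roughly like the sketch below. Everything in it is hypothetical: `SessionChildRegistry` and `KillableChild` are illustrative names, not OpenClaw APIs. The `kill` signature is chosen so that a Node.js `ChildProcess` should satisfy the interface.

```typescript
// Structural stand-in for a spawned stdio server process.
interface KillableChild {
  kill(signal?: "SIGTERM" | "SIGKILL"): boolean;
}

// Hypothetical registry: records which MCP children belong to which session.
class SessionChildRegistry {
  private readonly childrenBySession = new Map<string, Set<KillableChild>>();

  // Call right after spawning an MCP server for a session.
  track(sessionId: string, child: KillableChild): void {
    let children = this.childrenBySession.get(sessionId);
    if (!children) {
      children = new Set();
      this.childrenBySession.set(sessionId, children);
    }
    children.add(child);
  }

  // Call when a session ends or is pruned; returns how many children were signalled.
  reap(sessionId: string): number {
    const children = this.childrenBySession.get(sessionId);
    if (!children) return 0;
    this.childrenBySession.delete(sessionId);
    let signalled = 0;
    for (const child of children) {
      if (child.kill("SIGTERM")) signalled += 1;
    }
    return signalled;
  }

  liveSessionCount(): number {
    return this.childrenBySession.size;
  }
}
```

A session store that already prunes stale sessions could call `reap(sessionId)` from the same code path, so that pruning a session also terminates the processes it spawned.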

Impact

On a Raspberry Pi (4GB RAM), ~6 new processes per hour × ~110MB each = OOM within hours of uptime without manual intervention.
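
The arithmetic behind this estimate can be checked with a short sketch. It uses the issue's approximate figures and treats 4GB as 4096 MB; the function name is illustrative. It ignores memory used by the OS and the gateway itself, so the real headroom is smaller than the result suggests.

```typescript
// Rough estimate of hours until leaked processes alone consume the given memory.
function hoursUntilMemoryExhausted(
  totalMb: number,
  leakedProcessesPerHour: number,
  mbPerProcess: number,
): number {
  const leakedMbPerHour = leakedProcessesPerHour * mbPerProcess;
  if (leakedMbPerHour <= 0) return Infinity; // nothing leaks, memory is never exhausted
  return totalMb / leakedMbPerHour;
}

// ~6 processes/hour at ~110 MB each is ~660 MB/hour.
const hours = hoursUntilMemoryExhausted(4 * 1024, 6, 110);
console.log(hours.toFixed(1)); // prints "6.2"
```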

Workaround

Periodic cleanup cron to kill duplicate processes, keeping only the newest instance of each MCP server type.

Related

#15337 (per-call spawning variant of the same family)

extent analysis

TL;DR

Implement a periodic cleanup mechanism to terminate redundant MCP child processes spawned by the gateway.

Guidance

Identify the process IDs of the redundant MCP child processes using pstree and ps aux commands.
Develop a cron job to periodically kill the duplicate processes, ensuring only the newest instance of each MCP server type remains.
Consider implementing a shared MCP process pool at the gateway level to prevent redundant process spawning.
Monitor system resources to prevent out-of-memory errors due to excessive process accumulation.

Example

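A cleanup script run from cron can kill the duplicate processes. pkill has no "all but the newest" selector (-o matches only the oldest process, -n only the newest), so this sketch combines pgrep with kill:

# Kill all but the newest instance of each MCP server type
for pattern in "mcp-server-fetch" "mcp-yfinance-server"; do
  newest=$(pgrep -n -f "$pattern") || continue   # no match: nothing to clean up
  pgrep -f "$pattern" | grep -vx "$newest" | xargs -r kill   # -r (GNU xargs): skip kill when the list is empty
done

Note: This example is a simplified illustration and may require modifications to fit the specific use case. In particular, if a pattern matches both a uv launcher and its python child, keeping only the single newest PID also kills that pair's other process, so the patterns may need to be narrowed.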

Notes

The provided workaround is a temporary solution to mitigate the issue. A more permanent fix would involve modifying the gateway to reap MCP child processes when the session ends or implementing a shared MCP process pool.

Recommendation

Apply the workaround by implementing a periodic cleanup cron job to kill duplicate processes, as this provides a temporary solution to prevent out-of-memory errors and allows for further investigation into a more permanent fix.
