openclaw - ✅(Solved) Fix [MCP] Process pool lifecycle bug: uvx minimax-coding-plan-mcp processes never cleaned up, causing memory leak (88 processes, ~6GB) [3 pull requests, 1 comment, 1 participant]

GitHub stats

openclaw/openclaw#62026 · Fetched 2026-04-08 03:10:10

  • Comments: 1
  • Participants: 1
  • Timeline: 2 (commented ×1, cross-referenced ×1)
  • Reactions: 0

Root Cause

The MCP server is spawned via uvx minimax-coding-plan-mcp and the gateway maintains a connection pool, but the pooled processes are not terminated when idle or after use. This is a lifecycle management bug in the gateway's MCP integration.

Suggested Fix

  1. Implement proper process cleanup after MCP tool execution
  2. Or implement a max-size process pool with LRU eviction
  3. Add monitoring/metrics for MCP process count per server
  4. Consider adding a gateway config option to limit max MCP subprocesses
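The second suggested fix, a max-size pool with LRU eviction, could look roughly like the sketch below. It is not taken from the OpenClaw codebase: LruRuntimePool, DisposableRuntime, and acquire are hypothetical names, and the sketch assumes each pooled runtime exposes a dispose() method that terminates its child process.

```typescript
// Hypothetical sketch: none of these names come from the OpenClaw codebase.
interface DisposableRuntime {
  dispose(): Promise<void> | void;
}

class LruRuntimePool<T extends DisposableRuntime> {
  // A Map iterates in insertion order, so its first key is the least recently used.
  private readonly entries = new Map<string, T>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    if (maxSize < 1) throw new Error("maxSize must be at least 1");
    this.maxSize = maxSize;
  }

  // Return the cached runtime for a key, creating it on a miss.
  acquire(key: string, create: () => T): T {
    const existing = this.entries.get(key);
    if (existing !== undefined) {
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, existing);
      return existing;
    }
    const created = create();
    this.entries.set(key, created);
    this.evictOverflow();
    return created;
  }

  // Dispose least recently used entries until the pool fits maxSize.
  private evictOverflow(): void {
    while (this.entries.size > this.maxSize) {
      const oldestKey = this.entries.keys().next().value as string;
      const oldest = this.entries.get(oldestKey) as T;
      this.entries.delete(oldestKey);
      try {
        // Best-effort: a rejected dispose must not break the caller.
        void Promise.resolve(oldest.dispose()).catch(() => {});
      } catch {
        // A synchronous dispose failure is ignored in this sketch.
      }
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

With maxSize set to 2, acquiring keys "a", "b", "a", "c" in that order evicts "b", because the second lookup of "a" marks it as recently used. A real pool would also need to avoid evicting a runtime that is still serving a tool call, which this sketch does not attempt.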

Fix Action

Temporary Workaround

Restart the gateway:

openclaw gateway restart

PR fix notes

PR #62925: fix(cron): dispose MCP runtimes after isolated cron run completes

Description (problem / solution / changelog)

Summary

Add cleanupBundleMcpOnRunEnd: true to cron isolated agent runs to prevent MCP process accumulation.

Root Cause

When cron jobs run in isolated mode, the runEmbeddedPiAgent call was not passing cleanupBundleMcpOnRunEnd: true, causing MCP runtimes to accumulate over time. Each cron run spawned new MCP processes that were never cleaned up.

  • CLI local mode: passes cleanupBundleMcpOnRunEnd: true (agent-via-gateway.ts:186)
  • Cron isolated mode: missing

Fix

Add the same cleanup flag to the cron isolated agent run at src/cron/isolated-agent/run.ts
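As a rough illustration of what such a flag controls, the sketch below disposes a run's MCP runtime in a finally block only when the caller opts in. Only the flag name cleanupBundleMcpOnRunEnd comes from the PR; runAgentOnce, RunOptions, McpRuntime, and the runtimes map are hypothetical stand-ins and do not reflect the real runEmbeddedPiAgent signature.

```typescript
// Illustrative only: runAgentOnce, RunOptions, and McpRuntime are hypothetical
// stand-ins, not the real runEmbeddedPiAgent API.
interface McpRuntime {
  dispose(): Promise<void>;
}

interface RunOptions {
  sessionId: string;
  // Flag name taken from the PR; the semantics shown here are an assumption.
  cleanupBundleMcpOnRunEnd?: boolean;
}

// Stand-in for a per-session runtime cache.
const runtimes = new Map<string, McpRuntime>();

async function runAgentOnce(
  options: RunOptions,
  createRuntime: () => McpRuntime,
  work: (runtime: McpRuntime) => Promise<string>,
): Promise<string> {
  const runtime = createRuntime();
  runtimes.set(options.sessionId, runtime);
  try {
    return await work(runtime);
  } finally {
    if (options.cleanupBundleMcpOnRunEnd) {
      // Without this branch the runtime stays cached after the run ends.
      runtimes.delete(options.sessionId);
      await runtime.dispose();
    }
  }
}
```

In this model, a caller that omits the flag leaves one cached runtime behind per run, which matches the accumulation pattern the PR describes for isolated cron runs.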

Impact

  • Before: ~2.1 process pairs/hour accumulated, ~6GB memory growth over 5 days
  • After: MCP runtimes properly disposed after each cron run

Fixes #62026

Changed files

  • src/cron/isolated-agent/run.ts (modified, +2/-0)

PR #64316: fix(agents): release bundle MCP runtime on mid-run session reset

Description (problem / solution / changelog)

Summary

  • Problem: resetReplyRunSession rotated the active sessionId after auto-compaction failure, context overflow, or role-ordering conflicts, but it never released the previous session id's entry from the bundle MCP runtime cache.
  • Why it matters: runtimesBySessionId in src/agents/pi-bundle-mcp-runtime.ts holds Client -> Transport -> stdio ChildProcess references with no TTL/LRU, so old MCP workers like gate-mcp, bnbchain-mcp, and chrome-devtools-mcp stayed alive and accumulated until memory-constrained deployments OOM'd.
  • What changed: resetReplyRunSession now disposes the previous session id's bundle MCP runtime in the background with void ... .catch(...) after the replacement session is fully established, and logs disposal failures through the existing deps.error(...) seam without blocking the retry path.
  • What did NOT change (scope boundary): this does not alter how the new session is established, and it does not add broader cache eviction or runtime lifecycle changes outside the session-reset path.
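The non-blocking disposal described in the summary can be sketched as follows. This is a minimal model of the pattern, not the PR's code: resetSession, ResetDeps, and SessionState are hypothetical, and only the fire-and-forget shape with a logged failure is taken from the description.

```typescript
// Hypothetical sketch of the pattern; only the fire-and-forget disposal and
// the error-logging seam are described in the PR.
interface ResetDeps {
  newSessionId(): string;
  disposeRuntime(sessionId: string): Promise<void>;
  error(message: string): void;
}

interface SessionState {
  sessionId: string;
}

function resetSession(state: SessionState, deps: ResetDeps): boolean {
  const previousSessionId = state.sessionId;
  // Establish the replacement session first.
  state.sessionId = deps.newSessionId();
  // Fire-and-forget: the retry path does not wait for disposal, and a failed
  // disposal is logged instead of propagating to the caller.
  void Promise.resolve()
    .then(() => deps.disposeRuntime(previousSessionId))
    .catch((err: unknown) => {
      deps.error(`failed to dispose MCP runtime for ${previousSessionId}: ${String(err)}`);
    });
  return true;
}
```

Starting the chain from Promise.resolve() means a synchronous throw inside disposeRuntime is also routed to the catch handler, so the reset returns success either way.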

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #64169
  • Related #60656
  • Related #62026
  • Related #62731
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: mid-run session rotation persisted the new session metadata and optionally deleted the old transcript, but it never released the previous session id from the bundle MCP runtime cache.
  • Missing detection / guardrail: the reset-path tests did not assert MCP runtime disposal or cover disposal failures.
  • Contributing context (if known): the cache has no TTL/LRU and holds stdio-backed MCP client/process references, so each reset could strand another worker pool.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/agent-runner-session-reset.test.ts
  • Scenario the test should lock in: resetting a reply run session disposes the previous session runtime, still succeeds when disposal throws, and still deletes the old transcript when cleanupTranscripts: true even if disposal fails.
  • Why this is the smallest reliable guardrail: the behavior is owned by resetReplyRunSession, and the relevant failure handling is exposed through that helper's dependency seam.
  • Existing test that already covers this (if any): none before this change.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

None.

Diagram (if applicable)

Before:
[reset trigger] -> [new session metadata persisted] -> [old MCP runtime remains cached/alive]

After:
[reset trigger] -> [new session metadata persisted] -> [background dispose of old MCP runtime] -> [retry continues]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local repo checkout
  • Model/provider: N/A
  • Integration/channel (if any): MCP stdio runtime path
  • Relevant config (redacted): N/A

Steps

  1. Trigger resetReplyRunSession from an auto-compaction failure, context overflow, or role-ordering conflict while a bundle MCP runtime is cached for the current session id.
  2. Observe that the helper rotates sessionId, persists the replacement session metadata, and optionally deletes old transcript files.
  3. Verify that the previous session id's MCP runtime is disposed in the background and that disposal failures are logged without preventing the reset from succeeding.

Expected

  • Old session MCP workers are released when the session resets.
  • Reset still returns success even if runtime disposal throws.
  • Transcript cleanup still occurs when requested.

Actual

  • Matches expected after this change.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm tsgo; pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm format on the touched files.
  • Edge cases checked: disposal success, disposal failure logging, and transcript cleanup when disposal fails.
  • What you did not verify: long-running production memory behavior outside the targeted reset/test surface.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: background disposal could fail or hang independently of the retry path.
    • Mitigation: disposal is intentionally non-blocking, failures are logged through deps.error(...), and tests cover the failure path plus transcript cleanup continuity.

Tests

  • pnpm tsgo
  • pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts (4 passed)
  • pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts (0 warnings / 0 errors)
  • pnpm format on touched files

Changed files

  • src/auto-reply/reply/agent-runner-session-reset.test.ts (modified, +76/-0)
  • src/auto-reply/reply/agent-runner-session-reset.ts (modified, +12/-0)

PR #68450: fix(mcp): dispose bundled MCP runtimes after isolated one-shot runs

Description (problem / solution / changelog)

Summary

  • Problem: isolated cron, isolated heartbeat, and one-shot follow-up runs can create bundle MCP runtimes and leave them alive after the one-shot run ends, which causes connection/process accumulation over time.
  • Why it matters: these paths create fresh session ids frequently, so leaked runtimes ratchet upward under steady autonomous load.
  • What changed: add one-shot bundle MCP cleanup for isolated cron and heartbeat paths, move cleanup to the outer run boundary after fallback/continuation settles, and retire both initial and post-compaction session ids.
  • What did NOT change (scope boundary): this PR does not change persistent session targets, does not add gateway shutdown/global runtime cleanup, and does not remove eager bundle MCP materialization.
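A minimal sketch of cleanup at the outer run boundary, assuming the run may rotate its session id while it executes (for example after compaction). The names runOneShot, RunContext, and DisposeRuntime are illustrative and are not taken from the PR diff.

```typescript
// Hypothetical sketch; names are illustrative and not taken from the PR diff.
type DisposeRuntime = (sessionId: string) => Promise<void>;

interface RunContext {
  // May be rotated by compaction while the run is in progress.
  sessionId: string;
}

async function runOneShot<T>(
  initialSessionId: string,
  run: (ctx: RunContext) => Promise<T>,
  disposeRuntime: DisposeRuntime,
  logError: (message: string) => void,
): Promise<T> {
  const ctx: RunContext = { sessionId: initialSessionId };
  try {
    // Covers fallback attempts and continuations, so cleanup only happens
    // once the whole run has settled.
    return await run(ctx);
  } finally {
    // Retire both ids; skip the duplicate when no rotation happened.
    const toRetire =
      ctx.sessionId === initialSessionId
        ? [initialSessionId]
        : [initialSessionId, ctx.sessionId];
    for (const id of toRetire) {
      try {
        await disposeRuntime(id);
      } catch (err) {
        // Best-effort: log and continue with the remaining ids.
        logError(`dispose failed for ${id}: ${String(err)}`);
      }
    }
  }
}
```

Placing the disposal in the outermost finally, rather than inside each attempt, is what keeps a fallback or continuation from losing its runtime partway through the run.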

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #62026
  • Related #62925
  • Related #67567
  • Related #64316
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: isolated one-shot runs created bundle MCP runtimes but did not deterministically dispose them at the outer run boundary, so runtimes survived after the run completed.
  • Missing detection / guardrail: there was no regression coverage for isolated cron/heartbeat/follow-up cleanup across fallback completion and session-id rotation after compaction.
  • Contributing context (if known): eager bundle MCP tool materialization amplifies the leak because runtimes are initialized before the model actually uses MCP tools.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file:
    • src/agents/pi-bundle-mcp-runtime.streamable-http.test.ts
    • src/agents/pi-bundle-mcp-runtime.test.ts
    • src/auto-reply/reply/agent-runner-execution.test.ts
    • src/auto-reply/reply/followup-runner.test.ts
    • src/auto-reply/reply/get-reply-run.media-only.test.ts
    • src/cron/isolated-agent/run.skill-filter.test.ts
    • src/infra/heartbeat-runner.returns-default-unset.test.ts
  • Scenario the test should lock in: isolated one-shot runs dispose bundle MCP runtimes only after the whole run settles, including fallback completion and compacted session-id rotation, and streamable-http disposal terminates the remote session.
  • Why this is the smallest reliable guardrail: the leak spans both the runtime manager and the outer one-shot orchestration boundary, so runtime-only or runner-only coverage is insufficient by itself.
  • Existing test that already covers this (if any): None before this PR.
  • If no new test is added, why not: N/A
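For the streamable-http part of the scenario, the MCP Streamable HTTP transport describes explicit session termination as an HTTP DELETE to the MCP endpoint carrying the Mcp-Session-Id header, which a server may reject with 405. The sketch below models that request with an injected fetch-like function so it can be exercised without a network. It is an assumption about what "terminates the remote session" involves, not the PR's implementation; terminateMcpSession and FetchLike are hypothetical names.

```typescript
// Hypothetical helper; the PR's actual disposal code is not shown in this page.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ status: number }>;

// Ask the server to end a session. Resolves to true when the server accepted
// the request and false when it refused or does not support termination.
async function terminateMcpSession(
  endpoint: string,
  sessionId: string,
  fetchImpl: FetchLike,
): Promise<boolean> {
  const response = await fetchImpl(endpoint, {
    method: "DELETE",
    headers: { "Mcp-Session-Id": sessionId },
  });
  // 405 means the server does not allow client-initiated termination.
  if (response.status === 405) return false;
  return response.status >= 200 && response.status < 300;
}
```

A regression test in this style can pass a fake fetchImpl and assert that disposal issued exactly one DELETE with the expected session id.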

User-visible / Behavior Changes

  • Operationally, isolated cron, isolated heartbeat, and one-shot follow-up runs no longer accumulate bundle MCP connections/processes indefinitely.
  • No config, default, or user-facing command changes.

Diagram (if applicable)

Before:
[isolated cron/heartbeat/follow-up]
  -> [create bundle MCP runtime]
  -> [run/fallback/compaction ends]
  -> [runtime remains cached]
  -> [connections/processes accumulate]

After:
[isolated cron/heartbeat/follow-up]
  -> [create bundle MCP runtime]
  -> [whole one-shot run ends]
  -> [dispose initial + latest session runtimes]
  -> [connections stay bounded]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS (Darwin)
  • Runtime/container: local gateway worktrees via pnpm
  • Model/provider: openai/gpt-5.4
  • Integration/channel (if any): bundled MCP via streamable-http (mcp.ticktick.com)
  • Relevant config (redacted): isolated mcp.servers.repro entry with bearer auth, 5 isolated cron jobs at 15s, 2 isolated heartbeat agents at 20s

Steps

  1. Start a local gateway against an isolated tmp root with bundled MCP repro configured.
  2. Create 5 isolated cron jobs and 2 isolated heartbeat agents against that gateway.
  3. Sample established connections from the gateway PID to the current mcp.ticktick.com IPs every 10s.

Expected

  • Established bundle MCP connection count remains bounded after each isolated one-shot run completes.

Actual

  • main: 340 -> 881 in about 3 minutes
  • fresh upstream/main: 618 -> 1654 in about 4 minutes
  • this branch: 2 -> 2 flat under the same load
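To compare the three measurements, the growth rates can be worked out directly from the figures in the list above. The durations are the approximate ones reported; no duration is given for the branch run, so the 3 minutes used for it here is a placeholder that does not affect its zero rate.

```typescript
interface Sample {
  label: string;
  start: number;
  end: number;
  minutes: number;
}

// Figures copied from the "Actual" list; durations are approximate.
const samples: Sample[] = [
  { label: "main", start: 340, end: 881, minutes: 3 },
  { label: "fresh upstream/main", start: 618, end: 1654, minutes: 4 },
  // Duration assumed; the rate is 0 regardless of the value chosen.
  { label: "this branch", start: 2, end: 2, minutes: 3 },
];

// Average number of new established connections per minute.
function connectionsPerMinute(sample: Sample): number {
  return (sample.end - sample.start) / sample.minutes;
}

for (const sample of samples) {
  console.log(`${sample.label}: ${connectionsPerMinute(sample).toFixed(1)} connections/minute`);
}
```

This works out to roughly 180 new connections per minute on main and 259 per minute on fresh upstream/main, against zero on the branch.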

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios:
    • Live A/B repro on main vs this branch using the same isolated cron + heartbeat harness
    • Fresh upstream/main repro after pull
    • Same repro against #62925, #67567, and #64316
  • Edge cases checked:
    • Cleanup happens after fallback settles
    • Isolated cron continuation prompt path
    • One-shot follow-up cleanup
    • Auto-compaction session-id rotation
    • Failed-start cleanup
    • Streamable-http remote session termination on disposal
  • What you did not verify:
    • A live stdio/uvx minimax-specific repro with the same harness
    • A fully green pnpm test on latest main, because there are unrelated/pre-existing suite issues

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk:

    • Cleanup could affect long-lived sessions if applied too broadly.
  • Mitigation:

    • Cleanup is gated to isolated one-shot paths only; persistent session targets are out of scope.
  • Risk:

    • Session-id rotation during compaction could leave the newest runtime behind.
  • Mitigation:

    • Dispose both initial and latest session ids and cover that path in tests.
  • Risk:

    • Disposal failures could surface at run teardown.
  • Mitigation:

    • Cleanup runs in finally with best-effort logging, and streamable-http termination is covered by dedicated tests.

AI-assisted: opencode <gpt-5.4>

Changed files

  • src/agents/pi-bundle-mcp-runtime.streamable-http.test.ts (added, +186/-0)
  • src/agents/pi-bundle-mcp-runtime.test.ts (modified, +45/-2)
  • src/agents/pi-bundle-mcp-runtime.ts (modified, +82/-30)
  • src/auto-reply/get-reply-options.types.ts (modified, +2/-0)
  • src/auto-reply/reply/agent-runner-execution.test.ts (modified, +64/-0)
  • src/auto-reply/reply/agent-runner-execution.ts (modified, +811/-763)
  • src/auto-reply/reply/followup-runner.test.ts (modified, +75/-1)
  • src/auto-reply/reply/followup-runner.ts (modified, +37/-0)
  • src/auto-reply/reply/get-reply-run.media-only.test.ts (modified, +13/-0)
  • src/auto-reply/reply/get-reply-run.ts (modified, +1/-0)
  • src/auto-reply/reply/queue/types.ts (modified, +1/-0)
  • src/cron/isolated-agent/run-executor.ts (modified, +152/-111)
  • src/cron/isolated-agent/run.skill-filter.test.ts (modified, +58/-0)
  • src/cron/isolated-agent/run.test-harness.ts (modified, +7/-0)
  • src/infra/heartbeat-runner.returns-default-unset.test.ts (modified, +65/-0)
  • src/infra/heartbeat-runner.ts (modified, +3/-0)

Original issue

Bug Description

When using MCP servers (specifically minimax-coding-plan-mcp via uvx), the OpenClaw gateway spawns new processes for each tool call but never cleans them up after completion. This causes a progressive memory leak that eventually consumes all available RAM.

Environment

  • OpenClaw version: 2026.4.2 (d74a122)
  • Node.js: v22.22.0
  • OS: Linux 6.17.0-19-generic (Ubuntu)
  • MCP Server: minimax-coding-plan-mcp (uvx)

Steps to Reproduce

  1. Configure minimax-web MCP in OpenClaw with minimax-coding-plan-mcp
  2. Have multiple agents (or one active agent) use the web_search tool frequently
  3. Observe the process list over time

Observed Behavior

  • Each time an agent calls a minimax MCP tool, the gateway spawns a new uvx minimax-coding-plan-mcp process pair (uv tool + python child)
  • These processes are never terminated after the tool call completes
  • Process count grows linearly with usage
  • After several days of normal usage: 88 processes consuming ~6GB RAM
  • Only fix: restart openclaw-gateway (which clears the process pool temporarily)

Expected Behavior

  • MCP server processes should be cleaned up after tool execution completes
  • Or: a process pool with fixed size should be used instead of spawning new processes per call

Evidence

# After several days of normal MCP usage:
$ ps aux | grep minimax-coding-plan-mcp | wc -l
88

# Each process consumes ~70MB (every call leaves a uv tool + python pair):
# Total: 88 processes × ~70MB ≈ 6GB

Temporary Workaround

Restart the gateway:

openclaw gateway restart

Root Cause Analysis

The MCP server is spawned via uvx minimax-coding-plan-mcp and the gateway maintains a connection pool, but the pooled processes are not terminated when idle or after use. This is a lifecycle management bug in the gateway's MCP integration.

Suggested Fix

  1. Implement proper process cleanup after MCP tool execution
  2. Or implement a max-size process pool with LRU eviction
  3. Add monitoring/metrics for MCP process count per server
  4. Consider adding a gateway config option to limit max MCP subprocesses

extent analysis

TL;DR

Implementing a proper process cleanup mechanism after MCP tool execution or using a max-size process pool with LRU eviction can fix the memory leak issue.

Guidance

  • Verify the issue by monitoring the process count and memory usage over time, for example with ps aux | grep minimax-coding-plan-mcp | wc -l (the count may include the grep process itself), and by checking the total memory consumption.
  • Implement a process cleanup mechanism after each MCP tool execution to prevent the accumulation of unused processes.
  • Consider adding a gateway configuration option to limit the maximum number of MCP subprocesses to prevent excessive memory usage.
  • Monitor the process count and memory usage after implementing the fix to ensure the issue is resolved.
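The monitoring steps above can be sketched as a small parser over process-listing output. It assumes input in the format produced by ps -eo pid,rss,args on Linux, where RSS is reported in KiB. summarizeProcesses and ProcessSummary are hypothetical names, and the sketch skips grep and pgrep lines so the search command does not inflate the count.

```typescript
interface ProcessSummary {
  count: number;
  rssMiB: number;
}

// Summarize the lines of `ps -eo pid,rss,args` output whose command line
// contains the given needle. RSS is reported by ps in KiB.
function summarizeProcesses(psOutput: string, needle: string): ProcessSummary {
  let count = 0;
  let rssKiB = 0;
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(\d+)\s+(.*)$/.exec(line);
    if (match === null) continue; // header or blank line
    const args = match[3];
    // Skip unrelated processes and the grep/pgrep doing the search.
    if (!args.includes(needle) || /^(grep|pgrep)\b/.test(args)) continue;
    count += 1;
    rssKiB += Number(match[2]);
  }
  return { count, rssMiB: Math.round(rssKiB / 1024) };
}
```

In Node.js the input could come from execFileSync("ps", ["-eo", "pid,rss,args"], { encoding: "utf8" }) in node:child_process. Sampling the result periodically before and after an upgrade shows whether the count stays bounded.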

Example

No specific code snippet is provided as the issue is related to the lifecycle management of processes in the gateway's MCP integration, which requires a more comprehensive solution.

Notes

The provided temporary workaround of restarting the gateway can provide temporary relief but does not address the root cause of the issue. A permanent fix requires modifications to the gateway's MCP integration to properly manage process lifecycles.

Recommendation

Apply a workaround by implementing a max-size process pool with LRU eviction, as this can help mitigate the memory leak issue until a permanent fix is implemented. This approach can help limit the number of subprocesses and prevent excessive memory usage.
