openclaw - ✅(Solved) Fix MCP stdio servers accumulate across turns and are not cleaned up on config reload (memory leak) [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#60656 · Fetched 2026-04-08 02:48:38
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

PR fix notes

PR #64316: fix(agents): release bundle MCP runtime on mid-run session reset

Description (problem / solution / changelog)

Summary

  • Problem: resetReplyRunSession rotated the active sessionId after auto-compaction failure, context overflow, or role-ordering conflicts, but it never released the previous session id's entry from the bundle MCP runtime cache.
  • Why it matters: runtimesBySessionId in src/agents/pi-bundle-mcp-runtime.ts holds Client -> Transport -> stdio ChildProcess references with no TTL/LRU, so old MCP workers like gate-mcp, bnbchain-mcp, and chrome-devtools-mcp stayed alive and accumulated until memory-constrained deployments OOM'd.
  • What changed: resetReplyRunSession now disposes the previous session id's bundle MCP runtime in the background with void ... .catch(...) after the replacement session is fully established, and logs disposal failures through the existing deps.error(...) seam without blocking the retry path.
  • What did NOT change (scope boundary): this does not alter how the new session is established, and it does not add broader cache eviction or runtime lifecycle changes outside the session-reset path.
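
The disposal pattern described in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `McpRuntime`, `disposeSessionRuntime`, and `rotateSession` are hypothetical names, and only `runtimesBySessionId` and the `void ... .catch(...)` shape come from the description above.

```typescript
// Hypothetical shape; the real types live in src/agents/pi-bundle-mcp-runtime.ts.
interface McpRuntime {
  dispose(): Promise<void>;
}

// Session-keyed cache with no TTL/LRU, as described in the summary.
const runtimesBySessionId = new Map<string, McpRuntime>();

// Remove the cache entry before disposing so no caller can pick up a dying runtime.
async function disposeSessionRuntime(sessionId: string): Promise<void> {
  const runtime = runtimesBySessionId.get(sessionId);
  if (!runtime) return;
  runtimesBySessionId.delete(sessionId);
  await runtime.dispose();
}

// Hypothetical stand-in for the session-reset path.
function rotateSession(
  previousSessionId: string,
  nextSessionId: string,
  logError: (message: string) => void,
): string {
  // ... the replacement session would be fully established here ...

  // Fire-and-forget: the retry path does not wait on disposal,
  // and a rejection is logged instead of propagating.
  void disposeSessionRuntime(previousSessionId).catch((err: unknown) => {
    logError(`failed to dispose MCP runtime for ${previousSessionId}: ${String(err)}`);
  });
  return nextSessionId;
}
```

Deleting the map entry before awaiting `dispose()` means the stale runtime stops being reachable immediately, even if the child processes take a while to exit.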

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #64169
  • Related #60656
  • Related #62026
  • Related #62731
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: mid-run session rotation persisted the new session metadata and optionally deleted the old transcript, but it never released the previous session id from the bundle MCP runtime cache.
  • Missing detection / guardrail: the reset-path tests did not assert MCP runtime disposal or cover disposal failures.
  • Contributing context (if known): the cache has no TTL/LRU and holds stdio-backed MCP client/process references, so each reset could strand another worker pool.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/agent-runner-session-reset.test.ts
  • Scenario the test should lock in: resetting a reply run session disposes the previous session runtime, still succeeds when disposal throws, and still deletes the old transcript when cleanupTranscripts: true even if disposal fails.
  • Why this is the smallest reliable guardrail: the behavior is owned by resetReplyRunSession, and the relevant failure handling is exposed through that helper's dependency seam.
  • Existing test that already covers this (if any): none before this change.
  • If no new test is added, why not: N/A
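
The scenario this test plan locks in can be illustrated with a small self-contained harness. It is a sketch under assumptions: `ResetDeps`, `resetSession`, and `runScenario` are invented for illustration and do not reproduce the real `resetReplyRunSession` signature or the contents of agent-runner-session-reset.test.ts.

```typescript
// Hypothetical dependency seam; only the error(...) hook is named in the PR text.
interface ResetDeps {
  disposeRuntime(sessionId: string): Promise<void>;
  deleteTranscript(sessionId: string): void;
  error(message: string): void;
}

// Hypothetical reset helper: transcript cleanup and the return value
// must not depend on whether background disposal succeeds.
function resetSession(
  oldId: string,
  newId: string,
  cleanupTranscripts: boolean,
  deps: ResetDeps,
): { ok: boolean; sessionId: string } {
  if (cleanupTranscripts) {
    deps.deleteTranscript(oldId);
  }
  void deps.disposeRuntime(oldId).catch((err: unknown) => {
    deps.error(String(err));
  });
  return { ok: true, sessionId: newId };
}

// Runs the failure scenario with fake dependencies and reports what happened.
async function runScenario(): Promise<{ ok: boolean; deleted: string[]; errors: string[] }> {
  const deleted: string[] = [];
  const errors: string[] = [];
  const result = resetSession("old", "new", true, {
    disposeRuntime: async () => {
      throw new Error("dispose failed");
    },
    deleteTranscript: (id) => {
      deleted.push(id);
    },
    error: (message) => {
      errors.push(message);
    },
  });
  // Give the background disposal promise a chance to settle.
  await new Promise<void>((resolve) => setTimeout(resolve, 0));
  return { ok: result.ok, deleted, errors };
}
```

Injecting the disposal and logging functions through a dependency object is what makes the failure path observable without spawning real MCP processes.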

User-visible / Behavior Changes

None.

Diagram (if applicable)

Before:
[reset trigger] -> [new session metadata persisted] -> [old MCP runtime remains cached/alive]

After:
[reset trigger] -> [new session metadata persisted] -> [background dispose of old MCP runtime] -> [retry continues]

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: local repo checkout
  • Model/provider: N/A
  • Integration/channel (if any): MCP stdio runtime path
  • Relevant config (redacted): N/A

Steps

  1. Trigger resetReplyRunSession from an auto-compaction failure, context overflow, or role-ordering conflict while a bundle MCP runtime is cached for the current session id.
  2. Observe that the helper rotates sessionId, persists the replacement session metadata, and optionally deletes old transcript files.
  3. Verify that the previous session id's MCP runtime is disposed in the background and that disposal failures are logged without preventing the reset from succeeding.

Expected

  • Old session MCP workers are released when the session resets.
  • Reset still returns success even if runtime disposal throws.
  • Transcript cleanup still occurs when requested.

Actual

  • Matches expected after this change.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm tsgo; pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts; pnpm format on the touched files.
  • Edge cases checked: disposal success, disposal failure logging, and transcript cleanup when disposal fails.
  • What you did not verify: long-running production memory behavior outside the targeted reset/test surface.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: background disposal could fail or hang independently of the retry path.
    • Mitigation: disposal is intentionally non-blocking, failures are logged through deps.error(...), and tests cover the failure path plus transcript cleanup continuity.

Tests

  • pnpm tsgo
  • pnpm test src/auto-reply/reply/agent-runner-session-reset.test.ts (4 passed)
  • pnpm lint src/auto-reply/reply/agent-runner-session-reset.ts src/auto-reply/reply/agent-runner-session-reset.test.ts (0 warnings / 0 errors)
  • pnpm format on touched files

Changed files

  • src/auto-reply/reply/agent-runner-session-reset.test.ts (modified, +76/-0)
  • src/auto-reply/reply/agent-runner-session-reset.ts (modified, +12/-0)

Summary

On OpenClaw 2026.4.2, MCP stdio servers appear to accumulate across turns and are not cleaned up properly.

Observed servers:

  • @modelcontextprotocol/server-sequential-thinking
  • @upstash/context7-mcp
  • mcp-deepwiki / mcp-instruct

This caused repeated memory growth on a 15 GiB VPS until restart.

Environment

  • OpenClaw version: 2026.4.2
  • Host OS: Linux x64
  • Session type: Telegram direct chat
  • Agents affected: not limited to one agent; global mcp.servers exposure made this reproducible across agents

What I observed

  1. Multiple batches of the same MCP-related processes were spawned under openclaw-gateway over time.
  2. They were created repeatedly across conversation turns, even when logs did not show explicit calls to all corresponding tools on each turn.
  3. The processes did not get cleaned up after the turn finished.
  4. Memory usage kept increasing until service restart.

At peak I observed roughly:

  • about 180 MCP-related processes
  • about 13.5 GiB RSS combined

Representative process families:

  • npm exec @modelcontextprotocol/server-sequential-thinking
  • node /root/.npm/_npx/.../mcp-server-sequential-thinking
  • npm exec @upstash/context7-mcp --api-key ...
  • node /root/.npm/_npx/.../context7-mcp --api-key ...
  • npm exec mcp-deepwiki
  • node /root/.npm/_npx/.../mcp-instruct

Important behavior

I removed these MCP servers from config by changing:

"mcp": {
  "servers": {}
}

However, after config apply / hot reload, old leaked MCP child processes were still present.

So there seem to be two related issues:

  1. MCP server lifecycle leak / accumulation during normal turns
  2. Config reload does not clean up already spawned MCP child processes that are no longer configured
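
The second point amounts to a missing reconciliation step on reload. A minimal sketch of that step, assuming running servers are tracked by the same names used as keys under `mcp.servers` (an assumption, as is the `findOrphanedServers` helper itself):

```typescript
interface McpConfig {
  mcp: { servers: Record<string, unknown> };
}

// Returns the names of running servers that the reloaded config no longer lists.
// These are the processes a reload would be expected to stop and reap.
function findOrphanedServers(runningNames: string[], config: McpConfig): string[] {
  const configured = new Set(Object.keys(config.mcp.servers));
  return runningNames.filter((name) => !configured.has(name));
}
```

With the emptied config shown above, every previously running server counts as orphaned, which matches the expectation that removing all entries should leave no MCP child processes behind.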

Evidence

Before mitigation

Representative counts from live inspection:

  • count=180 total_rss=13471.2 MiB

Later, after restart and more turns, more batches reappeared.

After removing MCP config

Config correctly became:

"mcp": {
  "servers": {}
}

But old processes were still alive, for example:

  • server-sequential-thinking
  • context7-mcp
  • mcp-deepwiki
  • mcp-instruct

and process count was still non-zero because old instances were not reaped by reload.

Why this looks like an OpenClaw runtime issue

This does not look like a single third-party MCP server bug, because multiple different MCP servers accumulated in the same pattern under openclaw-gateway.

It looks more like OpenClaw's MCP process lifecycle management is:

  • spawning new stdio servers repeatedly
  • not reusing or reaping them correctly
  • and not fully cleaning removed servers on config reload

Expected behavior

  • MCP stdio servers should either be reused safely or terminated after use
  • completed turns should not leave stale MCP child processes behind indefinitely
  • removing mcp.servers from config should stop and reap previously managed MCP child processes

Actual behavior

  • MCP child processes accumulate over time
  • memory usage keeps growing
  • hot reload/config apply does not fully clean old MCP child processes
  • only stronger restart/cleanup restores memory

Suggested investigation areas

  • MCP stdio server lifecycle ownership under openclaw-gateway
  • per-turn tool/mcp bootstrap path spawning behavior
  • child process cleanup on turn completion
  • child process cleanup on config reload / server removal
  • whether MCP tool listing/bootstrap is causing eager respawn each turn

If helpful, I can also provide a more detailed timestamped process timeline from live inspection.

Extended analysis

TL;DR

The most likely fix involves modifying OpenClaw's MCP process lifecycle management to properly reuse or terminate stdio servers after use and ensure that completed turns do not leave stale child processes behind.

Guidance

  1. Investigate MCP stdio server lifecycle ownership: Review the code responsible for managing the lifecycle of MCP stdio servers under openclaw-gateway to identify why servers are not being properly cleaned up.
  2. Examine per-turn tool/mcp bootstrap path spawning behavior: Analyze how MCP tools are being spawned on each turn to determine if there's an issue with eager respawn or if the spawning mechanism is not properly reusing existing servers.
  3. Implement child process cleanup on turn completion and config reload: Develop a mechanism to ensure that child processes are terminated after each turn and when their corresponding configuration is removed or reloaded.
  4. Review MCP tool listing/bootstrap to prevent eager respawn: Investigate if the MCP tool listing or bootstrap process is causing the eager respawn of servers on each turn and adjust the logic to prevent unnecessary spawning.

Example

No specific code snippet can be provided without access to the OpenClaw codebase. In general, process cleanup means calling kill() on each spawned child process handle, sending SIGTERM first and escalating to SIGKILL if the process has not exited within a timeout, while tolerating processes that have already exited. Note that child_process.exec() launches commands; it is not a cleanup mechanism.
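
As a generic illustration of that kind of cleanup (not OpenClaw's implementation), the sketch below terminates a child process gracefully and escalates after a grace period. The `Killable` interface is a hypothetical minimal shape modeled on the subset of Node's `ChildProcess` that the function needs; `terminate` and its default timeout are assumptions.

```typescript
// Minimal shape modeled on Node's ChildProcess: exitCode/signalCode are null
// while the process is still running.
interface Killable {
  exitCode: number | null;
  signalCode: string | null;
  kill(signal?: string): boolean;
  once(event: "exit", listener: () => void): unknown;
}

// Sends SIGTERM, then SIGKILL if the process has not exited within graceMs.
// Resolves once the process reports exit (or immediately if it already has).
function terminate(child: Killable, graceMs = 5000): Promise<void> {
  return new Promise<void>((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve();
      return;
    }
    const timer = setTimeout(() => {
      child.kill("SIGKILL");
    }, graceMs);
    child.once("exit", () => {
      clearTimeout(timer);
      resolve();
    });
    child.kill("SIGTERM");
  });
}
```

Registering the exit listener before sending SIGTERM avoids missing an exit that happens immediately after the signal is delivered.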

Notes

The provided information suggests that the issue is related to OpenClaw's management of MCP stdio servers, but without direct access to the code, the exact solution will depend on the specifics of the implementation. It's also important to consider potential edge cases, such as handling server crashes or network issues that might affect the cleanup process.

Recommendation

Apply a workaround by implementing a custom script or modifying the existing code to manually clean up MCP child processes after each turn and on config reload, until a permanent fix can be integrated into the OpenClaw codebase. This approach will help mitigate the memory growth issue while a more comprehensive solution is developed.
