openclaw - ✅(Solved) Fix [Bug]: MCP child process leak: sessions_send via gateway never calls disposeSessionMcpRuntime [4 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#70364 (fetched 2026-04-23 07:25:39)
Comments: 0
Participants: 1
Timeline: 6
Reactions: 0
Timeline (top): cross-referenced ×4, labeled ×2

OpenClaw Version: 2026.4.15 (041266a)
Platform: Ubuntu 24.04, systemd user service (openclaw-gateway)
Affects: Multi-agent fleet setups with agentToAgent.enabled: true and per-agent MCP servers configured in openclaw.json

Summary: Every call to sessions_send targeting another agent leaks a full cohort of MCP child processes. With 9 agents configured, the baseline is 9 MCP children after a clean gateway start. One sessions_send spawns 9 additional MCP processes, and the original cohort is never cleaned up. The leak is deterministic and reproduces 100% of the time.

Root Cause

Code-level: the cleanup flag cleanupBundleMcpOnRunEnd is only ever set to true in local/embedded mode.

Fix Action

Fix / Workaround

Impact

  • Fleet setups with cross-agent communication via sessions_send accumulate MCP child processes indefinitely.
  • Each sessions_send leaks one cohort (N processes, where N = number of configured MCP servers).
  • The gateway eventually becomes unstable under load, and agent-to-agent comms degrade.
  • The only workaround is a periodic systemctl --user restart openclaw-gateway to reset the process count.

PR fix notes

PR #1: fix(gateway): clean up MCP child processes after nested lane runs end

Description (problem / solution / changelog)

Fixes openclaw/openclaw#70364

Problem

Every sessions_send call targeting another agent leaks a full cohort of MCP child processes. With 9 agents configured, each sessions_send adds 9 new child processes and the original cohort is never cleaned up.

Root cause: cleanupBundleMcpOnRunEnd was only set to true in the CLI --local path (agentCliCommand). When sessions_send dispatches a run through the gateway (dispatchAgentRunFromGateway), the ingressOpts never included cleanupBundleMcpOnRunEnd, so the finally block in pi-embedded-runner/run.ts that calls retireSessionMcpRuntime never fired for gateway-path nested sessions.

Fix

Import isNestedAgentLane in src/gateway/server-methods/agent.ts and add cleanupBundleMcpOnRunEnd: isNestedAgentLane(request.lane) to the ingressOpts passed to dispatchAgentRunFromGateway.

Nested lane runs are ephemeral and should tear down their MCP cohort when done. Top-level gateway sessions keep processes warm across turns.

Test

Added test in agent.test.ts asserting cleanupBundleMcpOnRunEnd === true for nested lane requests and false for regular requests.


Generated by Claude Code


Changed files

  • .agent/workflows/update_clawdbot.md (removed, +0/-380)
  • .agents/maintainers.md (removed, +0/-1)
  • .agents/skills/blacksmith-testbox/SKILL.md (added, +340/-0)
  • .agents/skills/openclaw-ghsa-maintainer/SKILL.md (added, +87/-0)
  • .agents/skills/openclaw-parallels-smoke/SKILL.md (added, +151/-0)
  • .agents/skills/openclaw-pr-maintainer/SKILL.md (added, +75/-0)
  • .agents/skills/openclaw-qa-testing/SKILL.md (added, +148/-0)
  • .agents/skills/openclaw-qa-testing/agents/openai.yaml (added, +4/-0)
  • .agents/skills/openclaw-release-maintainer/SKILL.md (added, +456/-0)
  • .agents/skills/openclaw-secret-scanning-maintainer/SKILL.md (added, +220/-0)
  • .agents/skills/openclaw-secret-scanning-maintainer/scripts/secret-scanning.mjs (added, +797/-0)
  • .agents/skills/openclaw-test-heap-leaks/SKILL.md (added, +75/-0)
  • .agents/skills/openclaw-test-heap-leaks/agents/openai.yaml (added, +4/-0)
  • .agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs (added, +553/-0)
  • .agents/skills/openclaw-test-performance/SKILL.md (added, +134/-0)
  • .agents/skills/openclaw-test-performance/agents/openai.yaml (added, +6/-0)
  • .agents/skills/optimizetests/SKILL.md (added, +41/-0)
  • .agents/skills/optimizetests/agents/openai.yaml (added, +6/-0)
  • .agents/skills/parallels-discord-roundtrip/SKILL.md (added, +62/-0)
  • .agents/skills/security-triage/SKILL.md (added, +111/-0)
  • .agents/skills/tag-duplicate-prs-issues/SKILL.md (added, +485/-0)
  • .agents/skills/tag-duplicate-prs-issues/agents/openai.yaml (added, +4/-0)
  • .codex (renamed, +0/-0)
  • .dockerignore (modified, +8/-0)
  • .env.example (modified, +9/-4)
  • .github/CODEOWNERS (added, +54/-0)
  • .github/ISSUE_TEMPLATE/bug_report.yml (modified, +36/-25)
  • .github/actionlint.yaml (modified, +3/-0)
  • .github/actions/ensure-base-commit/action.yml (modified, +16/-2)
  • .github/actions/setup-node-env/action.yml (modified, +11/-25)
  • .github/actions/setup-pnpm-store-cache/action.yml (modified, +6/-19)
  • .github/instructions/copilot.instructions.md (modified, +3/-3)
  • .github/labeler.yml (modified, +137/-16)
  • .github/pr-assets/compaction-checkpoints/sessions-checkpoints-inline.png (added, +0/-0)
  • .github/pr-assets/compaction-checkpoints/sessions-overview-inline.png (added, +0/-0)
  • .github/pull_request_template.md (modified, +39/-7)
  • .github/workflows/auto-response.yml (modified, +18/-5)
  • .github/workflows/ci-check-testbox.yml (added, +100/-0)
  • .github/workflows/ci.yml (modified, +2018/-507)
  • .github/workflows/codeql.yml (modified, +13/-9)
  • .github/workflows/control-ui-locale-refresh.yml (added, +172/-0)
  • .github/workflows/docker-release.yml (modified, +137/-48)
  • .github/workflows/docs-sync-publish.yml (added, +70/-0)
  • .github/workflows/docs-translate-trigger-release.yml (added, +42/-0)
  • .github/workflows/install-smoke.yml (modified, +168/-33)
  • .github/workflows/labeler.yml (modified, +181/-18)
  • .github/workflows/macos-release.yml (added, +93/-0)
  • .github/workflows/openclaw-cross-os-release-checks-reusable.yml (added, +472/-0)
  • .github/workflows/openclaw-live-and-e2e-checks-reusable.yml (added, +664/-0)
  • .github/workflows/openclaw-npm-release.yml (modified, +370/-29)
  • .github/workflows/openclaw-release-checks.yml (added, +198/-0)
  • .github/workflows/openclaw-scheduled-live-checks.yml (added, +74/-0)
  • .github/workflows/parity-gate.yml (added, +114/-0)
  • .github/workflows/plugin-clawhub-release.yml (added, +273/-0)
  • .github/workflows/plugin-npm-release.yml (added, +214/-0)
  • .github/workflows/sandbox-common-smoke.yml (modified, +11/-3)
  • .github/workflows/stale.yml (modified, +12/-9)
  • .github/workflows/workflow-sanity.yml (modified, +41/-8)
  • .gitignore (modified, +26/-5)
  • .jscpd.json (added, +16/-0)
  • .markdownlint-cli2.jsonc (modified, +3/-0)
  • .npmignore (modified, +2/-0)
  • .npmrc (modified, +3/-0)
  • .oxfmtrc.jsonc (modified, +3/-2)
  • .oxlintrc.json (modified, +65/-10)
  • .pi/prompts/landpr.md (removed, +0/-73)
  • .pi/prompts/reviewpr.md (removed, +0/-134)
  • .pre-commit-config.yaml (modified, +2/-2)
  • .prettierignore (added, +1/-0)
  • .secrets.baseline (modified, +4/-4)
  • .vscode/settings.json (modified, +1/-1)
  • AGENTS.md (modified, +201/-297)
  • CHANGELOG.md (modified, +2845/-398)
  • CONTRIBUTING.md (modified, +44/-9)
  • Dockerfile (modified, +45/-11)
  • Dockerfile.sandbox (modified, +2/-1)
  • Dockerfile.sandbox-browser (modified, +3/-1)
  • Dockerfile.sandbox-common (modified, +1/-0)
  • INCIDENT_RESPONSE.md (added, +52/-0)
  • Makefile (added, +4/-0)
  • README.md (modified, +315/-391)
  • SECURITY.md (modified, +38/-2)
  • Swabble/Sources/SwabbleKit/WakeWordGate.swift (modified, +7/-13)
  • Swabble/Tests/SwabbleKitTests/WakeWordGateTests.swift (modified, +19/-0)
  • appcast.xml (modified, +278/-620)
  • apps/android/README.md (modified, +69/-3)
  • apps/android/app/build.gradle.kts (modified, +96/-19)
  • apps/android/app/proguard-rules.pro (modified, +0/-20)
  • apps/android/app/src/main/AndroidManifest.xml (modified, +16/-0)
  • apps/android/app/src/main/java/ai/openclaw/app/AssistantLaunch.kt (added, +43/-0)
  • apps/android/app/src/main/java/ai/openclaw/app/MainActivity.kt (modified, +28/-6)
  • apps/android/app/src/main/java/ai/openclaw/app/MainViewModel.kt (modified, +273/-94)
  • apps/android/app/src/main/java/ai/openclaw/app/NodeApp.kt (modified, +12/-1)
  • apps/android/app/src/main/java/ai/openclaw/app/NodeForegroundService.kt (modified, +6/-2)
  • apps/android/app/src/main/java/ai/openclaw/app/NodeRuntime.kt (modified, +567/-125)
  • apps/android/app/src/main/java/ai/openclaw/app/NotificationForwardingPolicy.kt (added, +102/-0)
  • apps/android/app/src/main/java/ai/openclaw/app/PermissionRequester.kt (modified, +89/-22)
  • apps/android/app/src/main/java/ai/openclaw/app/SecurePrefs.kt (modified, +202/-0)
  • apps/android/app/src/main/java/ai/openclaw/app/SessionKey.kt (modified, +11/-0)
  • apps/android/app/src/main/java/ai/openclaw/app/chat/ChatController.kt (modified, +157/-53)

PR #70442: fix(sandbox): use dedicated dm bucket for Telegram DMs so they are never the main session

Description (problem / solution / changelog)

Summary

Telegram DMs with dmScope="main" (the default) were resolving to the same session key as the agent main session (agent:main:main). This caused shouldSandboxSession to return false when sessionKey === mainSessionKey, even when mode="all", bypassing sandbox isolation entirely — a security regression.

Root cause

In src/routing/session-key.ts, buildAgentPeerSessionKey for direct chats with dmScope="main" was calling buildAgentMainSessionKey, producing agent:main:main. Since the Telegram DM session key matched the main session key, shouldSandboxSession excluded it from sandboxing.

Fix

Use a dedicated "dm" bucket (agent:<agentId>:dm) for all direct chats when dmScope="main", giving Telegram DMs their own sandbox context that is never the main session.

-    return buildAgentMainSessionKey({
-      agentId: params.agentId,
-      mainKey: params.mainKey,
-    });
+    // Use a dedicated DM bucket so Telegram (and other direct-chat) sessions always get
+    // their own sandbox context distinct from the agent main session.
+    return `agent:${normalizeAgentId(params.agentId)}:dm`;

Test coverage

Added src/agents/sandbox/runtime-status.regression.test.ts covering:

  • mode="all": agent:main:dm IS sandboxed
  • mode="non-main": agent:main:dm IS sandboxed (not the main session)
  • mode="off": agent:main:dm is NOT sandboxed
  • mode="all": agent:main:main IS sandboxed
  • mode="non-main": agent:main:main is NOT sandboxed
  • dm bucket distinct from main session
  • per-peer DM scope still works

Also added src/routing/session-key.continuity.test.ts for session key continuity.
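
The test matrix above can be read as a small decision table. The following is a minimal sketch of that table, assuming the sandbox decision depends only on the mode and on whether the session key equals the main session key. The bodies of normalizeAgentId and shouldSandboxSession here are hypothetical stand-ins written for illustration, not the project's actual implementations.

```typescript
type SandboxMode = "off" | "non-main" | "all";

// Hypothetical stand-in: the real normalizeAgentId may apply different rules.
function normalizeAgentId(agentId: string): string {
  return agentId.trim().toLowerCase();
}

// Mirrors the return expression shown in the diff above.
function buildDmBucketKey(agentId: string): string {
  return `agent:${normalizeAgentId(agentId)}:dm`;
}

// Assumed decision table, derived only from the listed test cases.
function shouldSandboxSession(
  mode: SandboxMode,
  sessionKey: string,
  mainSessionKey: string,
): boolean {
  if (mode === "off") {
    return false;
  }
  if (mode === "all") {
    return true;
  }
  // mode === "non-main": sandbox everything except the main session.
  return sessionKey !== mainSessionKey;
}
```

Under these assumptions, buildDmBucketKey("main") yields agent:main:dm, which never equals agent:main:main, so the DM bucket is sandboxed in both "all" and "non-main" modes.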

Verification

pnpm test -- --run src/agents/sandbox/runtime-status.regression.test.ts src/routing/session-key.continuity.test.ts — 56 tests passing.


Fixes #70342

Changed files

  • extensions/brave/src/brave-web-search-provider.test.ts (modified, +64/-0)
  • extensions/slack/src/accounts.test.ts (modified, +38/-0)
  • extensions/slack/src/channel.ts (modified, +2/-2)
  • extensions/slack/src/monitor/provider.ts (modified, +1/-0)
  • src/agents/acp-spawn.test.ts (modified, +25/-0)
  • src/agents/acp-spawn.ts (modified, +4/-0)
  • src/agents/command/attempt-execution.ts (modified, +0/-1)
  • src/agents/command/types.ts (modified, +0/-2)
  • src/agents/pi-embedded-runner.cache.live.test.ts (modified, +0/-1)
  • src/agents/pi-embedded-runner.e2e.test.ts (modified, +47/-2)
  • src/agents/pi-embedded-runner/compact.hooks.test.ts (modified, +106/-0)
  • src/agents/pi-embedded-runner/compact.queued.ts (modified, +20/-0)
  • src/agents/pi-embedded-runner/openrouter-model-capabilities.test.ts (modified, +49/-0)
  • src/agents/pi-embedded-runner/openrouter-model-capabilities.ts (modified, +6/-1)
  • src/agents/pi-embedded-runner/run.ts (modified, +5/-7)
  • src/agents/pi-embedded-runner/run/params.ts (modified, +0/-6)
  • src/agents/sandbox/runtime-status.regression.test.ts (added, +86/-0)
  • src/agents/tools/sessions-spawn-tool.test.ts (modified, +28/-0)
  • src/agents/tools/sessions-spawn-tool.ts (modified, +2/-0)
  • src/commands/agent-via-gateway.test.ts (modified, +0/-6)
  • src/commands/agent-via-gateway.ts (modified, +0/-1)
  • src/commands/agent.test.ts (modified, +0/-1)
  • src/infra/bonjour.test.ts (modified, +64/-0)
  • src/infra/bonjour.ts (modified, +66/-17)
  • src/routing/resolve-route.test.ts (modified, +4/-4)
  • src/routing/session-key.continuity.test.ts (modified, +2/-1)
  • src/routing/session-key.ts (modified, +7/-4)
  • test/vitest/vitest.agents.config.ts (modified, +1/-1)

PR #70465: fix(gateway): cleanup MCP runtime for nested-lane agent runs to plug sessions_send leak (#70364)

Description (problem / solution / changelog)

Summary

  • Problem: Every sessions_send / runAgentStep call into another agent leaks a full cohort of MCP child processes. Reporter measured: 9 baseline MCP children → 18 after one sessions_send → 27 after two — the original cohort is never reclaimed. Each call is N processes (N = configured MCP servers per agent). Eventually the gateway becomes unstable and the only recovery is systemctl --user restart openclaw-gateway.
  • Why it matters: Multi-agent fleet setups with agentToAgent.enabled: true and per-agent MCP servers are unusable in steady state. The leak is deterministic and reproduces 100% of the time across every cross-agent call (Jarvis → Atlas, Jarvis → Forge, Jarvis → Spark all confirmed by reporter on 2026.4.15).
  • What changed: Default cleanupBundleMcpOnRunEnd to true at the gateway agent handler when the request lane is nested AND the caller hasn't explicitly opted out. disposeSessionMcpRuntime(sessionId) then fires from the existing pi-embedded-runner finally-block when the ephemeral run ends, freeing the cohort.
  • What did NOT change (scope boundary): No change to disposeSessionMcpRuntime itself, the SessionMcpRuntimeManager, or the pi-embedded-runner lifecycle. Local-CLI (--local), subagent-spawn, and isolated-cron callers continue to set the flag themselves; their behavior is unchanged. Session-mode subagent spawns (which deliberately keep the runtime alive across nested runs) can still pass cleanupBundleMcpOnRunEnd: false explicitly and are honoured.

Credit to @aiedvlyman for a fully root-caused report — including exact filenames, line numbers, and a working Option-A patch sketch in the issue body.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Gateway / orchestration

Linked Issue/PR

  • Closes #70364
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: cleanupBundleMcpOnRunEnd is set to true at the CLI layer only when opts.local === true (src/commands/agent-via-gateway.ts:185). The gateway agent handler at src/gateway/server-methods/agent.ts:950 then forwarded request.cleanupBundleMcpOnRunEnd === true straight through. Nested gateway-routed runs (runAgentStep → lane: "nested" | "nested:...") don't pass that flag, so the embedded runner's finally-block at src/agents/pi-embedded-runner/run.ts:2136 (if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(...)) never fires for them.
  • Missing detection / guardrail: no test asserted the gateway agent handler's cleanupBundleMcpOnRunEnd decision for nested-lane requests. subagent-spawn.ts and cron/isolated-agent/run-executor.ts had their own per-caller logic but there was no contract test at the gateway-method seam.
  • Contributing context: the flag's name (cleanupBundleMcpOnRunEnd) is shared by both opt-in and opt-out callers; a centralised "nested lane = default true" rule is the smallest place to make the cleanup policy explicit.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
  • Target test file: src/gateway/server-methods/agent.test.ts (extended).
  • Scenarios the test should lock in:
    • lane="nested" → default true
    • lane="nested:agent:spark:main" → default true (prefix match via isNestedAgentLane)
    • lane="main" → default false (unchanged)
    • lane unset → default false (unchanged)
    • explicit cleanupBundleMcpOnRunEnd: false on a nested run → still false (opt-out preserved)
  • Why this is the smallest reliable guardrail: the policy decision is a single conditional in agent.ts; five matrix cases cover the full decision table.
  • Existing test that already covers this (if any): none — agent.test.ts exercised cleanupBundleMcpOnRunEnd only via end-to-end runner tests where the flag was passed explicitly.

User-visible / Behavior Changes

  • Cross-agent sessions_send / runAgentStep calls no longer leak MCP child-process cohorts. Steady-state process count drops from N × (1 + calls_per_session) to N.
  • Operators do not need to set anything new. Behavior change is opt-out, not opt-in.
  • No config schema change. No public-API change.
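
The process-count formula above can be checked against the reporter's numbers (9, then 18, then 27). This is a worked-arithmetic sketch only: mcpChildCount is a hypothetical helper name, and it assumes every nested run has already finished when the count is taken.

```typescript
// n: MCP children in one cohort (9 in the report).
// calls: number of completed sessions_send calls since gateway start.
// cleanupOnRunEnd: whether nested runs dispose their cohort when they end.
function mcpChildCount(
  n: number,
  calls: number,
  cleanupOnRunEnd: boolean,
): number {
  // Before the fix every call leaves one extra cohort behind: N × (1 + calls).
  // After the fix only the baseline cohort remains: N.
  return cleanupOnRunEnd ? n : n * (1 + calls);
}
```

With n = 9, two calls without cleanup give 9 × 3 = 27, while the same two calls with cleanup leave the count at 9.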

Diagram (if applicable)

Before:
  sessions_send -> runAgentStep -> callGateway({method:"agent", lane:"nested", ...})
    -> gateway agent handler:
         cleanupBundleMcpOnRunEnd: false        (request flag undefined)
    -> embedded-runner finally:
         if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(...)
         // never fires -> N MCP children leak per call

After:
  sessions_send -> runAgentStep -> callGateway({method:"agent", lane:"nested", ...})
    -> gateway agent handler:
         cleanupBundleMcpOnRunEnd:
           request.cleanupBundleMcpOnRunEnd === true
           || (request.cleanupBundleMcpOnRunEnd === undefined
               && isNestedAgentLane(request.lane))
         // -> true for nested / nested:* lanes
    -> embedded-runner finally:
         await disposeSessionMcpRuntime(sessionId)   // fires; cohort freed
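
The "After" expression in the diagram can be isolated as a pure function, which makes the five-case decision table easy to check. This is a sketch under stated assumptions: the body of isNestedAgentLane below is a hypothetical stand-in based on the "bare nested or nested: prefix" description, and AgentRequest and resolveCleanupFlag are illustrative names, not identifiers from the project.

```typescript
// Hypothetical stand-in for the project's isNestedAgentLane predicate.
function isNestedAgentLane(lane: string | undefined): boolean {
  if (lane === undefined) {
    return false;
  }
  return lane === "nested" || lane.startsWith("nested:");
}

// Illustrative request shape: only the two fields the decision reads.
interface AgentRequest {
  lane?: string;
  cleanupBundleMcpOnRunEnd?: boolean;
}

// Same boolean expression as the "After" diagram: an explicit true wins,
// an explicit false is honoured, and an omitted flag defaults by lane.
function resolveCleanupFlag(request: AgentRequest): boolean {
  return (
    request.cleanupBundleMcpOnRunEnd === true ||
    (request.cleanupBundleMcpOnRunEnd === undefined &&
      isNestedAgentLane(request.lane))
  );
}
```

Keeping the decision in one expression means the opt-out case (an explicit false on a nested lane) falls out of the `=== undefined` check rather than needing a separate branch.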

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No (only the cleanup signal is changing; disposeSessionMcpRuntime itself was already there)
  • Data access scope changed? No
  • Net: this is purely a runtime-cleanup signal, gating an existing teardown helper. It tightens process hygiene without expanding what runs.

Repro + Verification

Environment

  • OS: macOS 26.5 (arm64) for development; reporter on Ubuntu 24.04 systemd user service
  • Runtime/container: Node v25.9.0
  • Model/provider: irrelevant (leak is in gateway MCP lifecycle, not inference path) — reporter confirmed across openrouter/minimax/minimax-m2.7 + openrouter/google/gemini-2.5-flash + openai-codex/gpt-5.4
  • Integration/channel: any channel that surfaces sessions_send
  • Relevant config: agentToAgent.enabled: true, tools.sessions.visibility: all, per-agent MCP servers

Steps (per reporter)

  1. systemctl --user restart openclaw-gateway
  2. pgrep -a -f "mcp/server.py" → 9 children (one per agent)
  3. From any agent's session, sessions_send to another agent.
  4. pgrep -a -f "mcp/server.py" → 18 children (original 9 still running, 9 new ones added)

Expected

  • After step 4, child count returns to 9 once the nested run finishes.

Actual (before fix)

  • Child count grows by N on every sessions_send and never decays.

Actual (after fix)

  • The nested run's finally-block calls disposeSessionMcpRuntime(sessionId); the cohort spawned for that run is reclaimed. Steady-state count returns to N.

Evidence

  • Failing test/log before + passing after
$ git diff --stat
 src/gateway/server-methods/agent.test.ts | 44 ++++++++++++++++++++++++++++++++
 src/gateway/server-methods/agent.ts      | 16 +++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)
$ npx -p typescript@5 tsc --noEmit --skipLibCheck \
    --target ES2022 --module ESNext --moduleResolution Bundler \
    --esModuleInterop --strict \
    src/gateway/server-methods/agent.ts \
  | grep "server-methods/agent.ts"
# (no errors in modified file beyond missing transitive types in tmp clone)

5 new regression cases (4 parametrized + 1 explicit opt-out) added to the existing gateway agent handler describe block. CI on this PR will exercise the full vitest project.

Human Verification (required)

  • Verified scenarios: walked the call graph end-to-end: sessions_send → runAgentStep (src/agents/tools/agent-step.ts) → callGateway({method:"agent", lane: ..., ...}) → agentHandlers.agent (src/gateway/server-methods/agent.ts:950) → agentCommandFromIngress(..., cleanupBundleMcpOnRunEnd: ...) → embedded-runner params.cleanupBundleMcpOnRunEnd === true finally-block (src/agents/pi-embedded-runner/run.ts:2136). Confirmed isNestedAgentLane already correctly recognises both bare "nested" and prefixed "nested:..." forms. Confirmed subagent-spawn.ts and cron/isolated-agent/run-executor.ts callers will continue to set the flag themselves and are unaffected.
  • Edge cases checked:
    • bare "nested" lane → defaulted true
    • prefixed "nested:agent:spark:main" lane → defaulted true
    • non-nested lane (e.g. "main") → unchanged false default
    • missing lane → unchanged false default
    • explicit cleanupBundleMcpOnRunEnd: false on a nested run → preserved (opt-out for session-mode subagents)
  • What I did NOT verify: running the full Ubuntu systemd repro on a multi-agent fleet — I don't have a 9-agent gateway with per-agent MCP servers locally. The unit-level decision is precisely covered by the new tests and the downstream disposeSessionMcpRuntime path is already exercised by pi-embedded-runner.cache.live.test.ts:327 / pi-embedded-runner.e2e.test.ts:473,516.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • Existing callers that pass cleanupBundleMcpOnRunEnd explicitly are honoured. The only behavior change is for callers that omit the flag entirely AND target a nested lane — which is exactly the leak case.

Risks and Mitigations

  • Risk: a future caller that wants a long-lived nested MCP runtime (e.g. an interactive nested REPL) would silently get the runtime torn down on first turn.
    • Mitigation: the explicit-opt-out test (respects an explicit cleanupBundleMcpOnRunEnd=false on a nested-lane request) locks in the escape hatch. Such callers can pass cleanupBundleMcpOnRunEnd: false and the gateway will honour it.
  • Risk: isNestedAgentLane mis-classifies a custom lane string that happens to start with nested:.
    • Mitigation: isNestedAgentLane is the project's canonical "is this a nested lane?" predicate and is already used to drive other nested-lane behavior; reusing it keeps the policy consistent. If the predicate ever needs tightening, both this PR's site and the others move together.
  • Risk: Reporter's Option B (wire teardown into the gateway session lifecycle) is more robust against future code paths that spawn nested sessions outside agentCliCommand.
    • Mitigation: Option A is the minimal safe fix and resolves the reported leak deterministically. Option B is a follow-up worth doing as a separate PR — it's a wider refactor of the lifecycle event surface and deserves its own design + review.

Changed files

  • src/gateway/server-methods/agent.test.ts (modified, +44/-0)
  • src/gateway/server-methods/agent.ts (modified, +15/-1)

PR #70480: fix(gateway): tear down nested-lane MCP cohort on run end

Description (problem / solution / changelog)

Summary

Fixes #70364. Nested agent runs dispatched via sessions_send (one agent sending to another) spawn their own MCP cohort per session but never called retireSessionMcpRuntime on completion. Each dispatch leaked a full cohort of MCP child processes, growing unboundedly until gateway restart.

Root cause

cleanupBundleMcpOnRunEnd was only set to true in the CLI --local path (src/cli/cli-runner.ts). When a gateway nested lane dispatched through dispatchAgentRunFromGateway in src/gateway/server-methods/agent.ts:958, the ingressOpts had no cleanupBundleMcpOnRunEnd, so pi-embedded-runner never tore down the session's MCP runtime. Top-level gateway sessions are fine (they keep MCP warm across turns by design), but nested lane runs should be ephemeral.

Fix

Wire isNestedAgentLane(request.lane) into the ingressOpts passed to dispatchAgentRunFromGateway. Nested lane runs now tear down their MCP cohort on completion. Top-level gateway sessions continue to keep processes warm.

Evidence

  • Symptom: reporter in #70364 describes 9 agents configured, each sessions_send adds 9 new MCP children that are never reaped.
  • Root cause in code: src/gateway/server-methods/agent.ts:958 was missing the cleanup flag that existed in the --local CLI path.
  • Fix touches the implicated path: single-line addition at the same site, plus reusing the existing isNestedAgentLane helper from src/agents/lanes.ts.
  • Regression test: new test in src/gateway/server-methods/agent.test.ts asserts that nested-lane dispatches pass cleanupBundleMcpOnRunEnd: true while top-level dispatches do not.

Test plan

  • New regression test covers nested-lane cleanup flag
  • Top-level lane path unchanged (verified by existing tests)
  • All 32 agent.test.ts tests pass
  • Fix diff under 40 lines

Closes #70364.

Changed files

  • src/gateway/server-methods/agent.test.ts (modified, +30/-0)
  • src/gateway/server-methods/agent.ts (modified, +5/-0)

Code Example

Suggested Fix

The gateway needs to call disposeSessionMcpRuntime(sessionId) when a nested/ephemeral agent run ends. Two options:

Option A: Set cleanupBundleMcpOnRunEnd: true for nested lane runs in the gateway's agent handler (similar to how --local sets it):

  cleanupBundleMcpOnRunEnd: opts.local === true || opts.lane === AGENT_LANE_NESTED

Option B: Wire disposeSessionMcpRuntime into the gateway's session lifecycle end handler (onSessionLifecycleEvent in subagent-registry-BrNWizSY.js) so cleanup fires on session end regardless of how the run was initiated.

Option B is more robust, as it handles any future code paths that spawn nested sessions without going through agentCliCommand.
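
To make the difference between the two options concrete, here is a rough sketch of the Option B idea: cleanup subscribes to a session-end event instead of depending on a per-run flag. Everything in it is hypothetical. SessionLifecycleBus, RuntimeRegistry, wireLifecycleCleanup, and the event shape are invented for illustration and do not reflect the actual onSessionLifecycleEvent signature or the real runtime manager.

```typescript
// Hypothetical event shape; the real lifecycle event may carry other fields.
type SessionLifecycleEvent = { type: "start" | "end"; sessionId: string };
type LifecycleListener = (event: SessionLifecycleEvent) => void;

// Minimal in-memory event bus standing in for the gateway's lifecycle hook.
class SessionLifecycleBus {
  private readonly listeners: LifecycleListener[] = [];

  on(listener: LifecycleListener): void {
    this.listeners.push(listener);
  }

  emit(event: SessionLifecycleEvent): void {
    for (const listener of this.listeners) {
      listener(event);
    }
  }
}

// Toy registry of sessions that currently own an MCP runtime.
class RuntimeRegistry {
  private readonly live = new Set<string>();

  create(sessionId: string): void {
    this.live.add(sessionId);
  }

  dispose(sessionId: string): void {
    this.live.delete(sessionId);
  }

  get size(): number {
    return this.live.size;
  }
}

// Option B in miniature: dispose on every "end" event, whatever started the run.
function wireLifecycleCleanup(
  bus: SessionLifecycleBus,
  registry: RuntimeRegistry,
): void {
  bus.on((event) => {
    if (event.type === "end") {
      registry.dispose(event.sessionId);
    }
  });
}
```

In this shape no caller has to remember to pass a flag, which is the robustness argument made above; the trade-off is that any session meant to stay warm must not emit "end" until it is truly finished.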

Additional Notes

  • pkill -f mcp/server.py clears the child symptoms but does not fix the leak; it reproduces immediately on the next sessions_send.
  • socat processes are a red herring; confirmed the real unit is the user service.
  • Logs around the repro show [agent:nested], ANNOUNCE_SKIP, and repeated webchat reconnect churn consistent with leaked runtimes.
  • agentToAgent.enabled: true and tools.sessions.visibility: all are set in openclaw.json.
RAW_BUFFER

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No


Steps to reproduce

Repro Steps

  1. Start a fresh gateway: systemctl --user restart openclaw-gateway
  2. Confirm the baseline (1 gateway parent + 9 MCP children): pgrep -a -f "mcp/server.py"
     → 9 processes (one per configured agent)
  3. Send one minimal cross-agent message via sessions_send to any agent (e.g. Spark), reply REPRO_OK
  4. Check processes again: pgrep -a -f "mcp/server.py"
     → 18 processes: the original 9 still running, 9 new ones added

Confirmed repro with fresh gateway PID 179821:

  • Baseline MCP children: 179966 179969 179972 179977 179982 179985 179988 179991 179994
  • After one sessions_send to Spark: 180076 180079 180082 180089 180092 180095 180098 180101 180104
  • The original cohort is still running and is not cleaned up.
  • Repeats on every sessions_send. Each call adds another full cohort.

Root Cause (Code-Level)

The cleanup flag cleanupBundleMcpOnRunEnd is only ever set to true in local/embedded mode.

register.agent-COPfBHma.js, line 115:

  cleanupBundleMcpOnRunEnd: opts.local === true

The cleanup itself lives in pi-embedded-runner-DN0VbqlW.js, line 9713:

  if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(params.sessionId).catch(...)

When sessions_send triggers a nested agent run via runAgentStep() in subagent-registry-BrNWizSY.js:

  const response = await agentStepDeps.callGateway({ method: "agent", params: { lane: params.lane ?? AGENT_LANE_NESTED, ... } });

This goes through the gateway path, not --local. cleanupBundleMcpOnRunEnd is false (or unset), so the finally block that calls disposeSessionMcpRuntime never fires, and the MCP child processes spawned for that nested session's runtime are never cleaned up.

server.impl-GQ72oJBa.js (the gateway implementation) does not reference cleanupBundleMcpOnRunEnd at all.

The SessionMcpRuntimeManager (pi-bundle-mcp-tools-vusm-AE2.js, line 483) correctly tracks runtimes by sessionId and has a working disposeSession() method, but it is never called for gateway-path nested sessions because the flag that triggers it is hardcoded to local === true.
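
The mechanism described above (a working dispose method that is gated behind a flag in a finally block) can be modelled in a few lines. This is a toy simulation, not the project's code: FakeMcpRuntimeManager, RunParams, and runNestedSession are hypothetical names, child processes are represented by plain counters, and the calls are synchronous for brevity, whereas the real disposeSessionMcpRuntime is awaited.

```typescript
// Toy stand-in for SessionMcpRuntimeManager: tracks cohort size per sessionId.
class FakeMcpRuntimeManager {
  private readonly cohorts = new Map<string, number>();

  spawnCohort(sessionId: string, size: number): void {
    this.cohorts.set(sessionId, size);
  }

  disposeSession(sessionId: string): void {
    this.cohorts.delete(sessionId);
  }

  // Total number of simulated MCP children still alive.
  liveChildren(): number {
    let total = 0;
    this.cohorts.forEach((size) => {
      total += size;
    });
    return total;
  }
}

interface RunParams {
  sessionId: string;
  cleanupBundleMcpOnRunEnd?: boolean;
}

// Models one nested run: spawn a cohort, do the work, then clean up
// only if the flag is exactly true (the gate described in the report).
function runNestedSession(
  manager: FakeMcpRuntimeManager,
  params: RunParams,
  cohortSize: number,
): void {
  manager.spawnCohort(params.sessionId, cohortSize);
  try {
    // The agent turn would execute here.
  } finally {
    if (params.cleanupBundleMcpOnRunEnd === true) {
      manager.disposeSession(params.sessionId);
    }
  }
}
```

Starting from a baseline cohort of 9, a run with the flag unset leaves liveChildren() at 18, matching the reported symptom, while a run with the flag set to true leaves the count where it was before the run.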

OpenClaw version

Version: 2026.4.15 (041266a)

Operating system

Platform: Ubuntu 24.04, systemd user service (openclaw-gateway)

Install method

No response

Model

GPT-5.4

Provider / routing chain

openclaw

Additional provider/model setup details

Provider config during repro:

  • main (Jarvis): openai-codex/gpt-5.4 primary, fallbacks: openrouter/minimax/minimax-m2.7, openrouter/google/gemini-2.5-flash
  • All other agents: openrouter/minimax/minimax-m2.7 primary, fallback: openrouter/google/gemini-2.5-flash
  • sessions_send target during repro: Spark (openrouter/minimax/minimax-m2.7)
  • Bug also confirmed cross-agent: Jarvis → Atlas, Jarvis → Forge
  • Provider chain does not appear relevant — leak is in gateway MCP lifecycle, not inference path

Logs, screenshots, and evidence

Suggested Fix
The gateway needs to call disposeSessionMcpRuntime(sessionId) when a nested/ephemeral agent run ends. Two options:

Option A — Set cleanupBundleMcpOnRunEnd: true for nested lane runs in the gateway's agent handler (similar to how --local sets it):

cleanupBundleMcpOnRunEnd: opts.local === true || opts.lane === AGENT_LANE_NESTED
Option B — Wire disposeSessionMcpRuntime into the gateway's session lifecycle end handler (onSessionLifecycleEvent in subagent-registry-BrNWizSY.js) so cleanup fires on session end regardless of how the run was initiated.

Option B is more robust as it handles any future code paths that spawn nested sessions without going through agentCliCommand.
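A minimal sketch of what Option B could look like follows. It is written under several assumptions: the event shape ({ type, sessionId }), the "sessionEnd" type value, and the createSessionEndCleanup helper are all hypothetical, and the real onSessionLifecycleEvent signature may differ. The dispose function is passed in rather than imported so the sketch stays self-contained.

```javascript
// Hypothetical sketch of Option B; the event shape and names are assumptions.
function createSessionEndCleanup(disposeSessionMcpRuntime, log = console.warn) {
  const inFlight = new Map(); // sessionId -> pending dispose promise

  return function onSessionLifecycleEvent(event) {
    // "sessionEnd" is an assumed event type, not a confirmed OpenClaw value.
    if (event.type !== "sessionEnd") return Promise.resolve(false);
    // Reuse a pending dispose so duplicate end events do not dispose twice.
    if (inFlight.has(event.sessionId)) return inFlight.get(event.sessionId);

    const pending = Promise.resolve()
      .then(() => disposeSessionMcpRuntime(event.sessionId))
      .then(() => true)
      .catch((err) => {
        // Cleanup failures are logged and do not break the lifecycle handler.
        log(`MCP cleanup failed for ${event.sessionId}: ${err}`);
        return false;
      })
      .finally(() => inFlight.delete(event.sessionId));

    inFlight.set(event.sessionId, pending);
    return pending;
  };
}
```

Because the handler keys off the session ending rather than how the run was started, it would fire for nested runs regardless of the cleanupBundleMcpOnRunEnd flag. Sharing the pending promise is a design choice to tolerate duplicate end events while a dispose is still running.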

Additional Notes
  • pkill -f mcp/server.py clears the child-process symptoms but does not fix the leak; it reproduces immediately on the next sessions_send.
  • socat processes are a red herring; the real unit is confirmed to be the user service.
  • Logs around the repro show [agent:nested], ANNOUNCE_SKIP, and repeated webchat reconnect churn consistent with leaked runtimes.
  • agentToAgent.enabled: true and tools.sessions.visibility: all are set in openclaw.json.

Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

The MCP child process leak is most likely fixed in one of two ways: set cleanupBundleMcpOnRunEnd to true for nested lane runs in the gateway's agent handler, or wire disposeSessionMcpRuntime into the gateway's session lifecycle end handler.

Guidance

  • Root cause: the cleanupBundleMcpOnRunEnd flag is set only when opts.local === true, so runs that go through the gateway path never dispose their MCP child processes.
  • Consider two possible solutions:
    • Option A: Modify the cleanupBundleMcpOnRunEnd condition to include opts.lane === AGENT_LANE_NESTED.
    • Option B: Integrate disposeSessionMcpRuntime into the gateway's session lifecycle end handler to ensure cleanup regardless of the run initiation method.
  • Verify the fix by checking the number of MCP child processes before and after sending a cross-agent message via sessions_send.
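The verification step can be sketched as a before/after comparison of process listings. The helper below is an illustration, not an OpenClaw tool: it assumes input text shaped like the output of `ps -eo pid= -o args=` (one "PID command" pair per line) and uses the mcp/server.py marker mentioned in the report. The function names are hypothetical.

```javascript
// Counts process lines whose command contains a marker string.
// Input is expected to look like the output of `ps -eo pid= -o args=`.
function countMatchingProcesses(psOutput, marker) {
  return psOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .filter((line) => {
      // Drop the leading PID column so the marker only matches the command.
      const firstSpace = line.indexOf(" ");
      const command = firstSpace === -1 ? "" : line.slice(firstSpace + 1);
      return command.includes(marker);
    }).length;
}

// Returns how many more matching processes exist after than before.
function leakedSince(beforeOutput, afterOutput, marker) {
  return (
    countMatchingProcesses(afterOutput, marker) -
    countMatchingProcesses(beforeOutput, marker)
  );
}
```

Capture one listing after a clean gateway start and another after a single sessions_send, then call leakedSince(before, after, "mcp/server.py"). A result of 0 would indicate the nested run's cohort was cleaned up; a positive result suggests the leak is still present.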

Example

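// Option A: also enable cleanup for nested lane runs
cleanupBundleMcpOnRunEnd: opts.local === true || opts.lane === AGENT_LANE_NESTED

// Option B: dispose the MCP runtime from the session lifecycle end handler
// ('sessionEnd' is an illustrative event type, not a confirmed OpenClaw value)
onSessionLifecycleEvent: async (event) => {
  if (event.type === 'sessionEnd') {
    // Await the dispose and swallow failures so a cleanup error cannot reject the handler
    await disposeSessionMcpRuntime(event.sessionId).catch(() => {});
  }
}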

Notes

  • Both options assume that disposeSessionMcpRuntime works as intended; the report states that SessionMcpRuntimeManager tracks runtimes by sessionId and has a working disposeSession() method.
  • Option A is the smaller change, since it only touches the flag expression. Option B also covers nested sessions started by code paths that do not go through agentCliCommand.
  • Test the chosen fix by confirming that the MCP child process count stops growing after a sessions_send.

Recommendation

Apply Option B. It is the more robust fix because it handles any future code paths that spawn nested sessions without going through agentCliCommand, so cleanup is triggered regardless of how the run was initiated.


