openclaw - ✅(Solved) Fix [Bug]: MCP child process leak: sessions_send via gateway never calls disposeSessionMcpRuntime [4 pull requests, 1 participant]

aiedvlyman · 2026-04-22T22:14:45Z




openclaw/openclaw#70364 • Fetched 2026-04-23 07:25:39


OpenClaw Version: 2026.4.15 (041266a)
Platform: Ubuntu 24.04, systemd user service (openclaw-gateway)
Affects: Multi-agent fleet setups with agentToAgent.enabled: true and per-agent MCP servers configured in openclaw.json

Summary

Every call to sessions_send targeting another agent leaks a full cohort of MCP child processes. With 9 agents configured, baseline is 9 MCP children after a clean gateway start. One sessions_send causes 9 additional MCP processes to spawn and the original cohort is never cleaned up. The leak is deterministic and reproduces 100% of the time.

Root Cause

The cleanup flag cleanupBundleMcpOnRunEnd is only ever set to true in local/embedded mode.

Fix Action

Fix / Workaround

Impact

Fleet setups with cross-agent communication via sessions_send will accumulate MCP child processes indefinitely.
Each sessions_send = one leaked cohort (N processes, where N = number of configured MCP servers).
Gateway eventually becomes unstable under load; agent-to-agent comms degrade.
Only workaround is periodic systemctl --user restart openclaw-gateway to reset process count.

PR fix notes

PR #1: fix(gateway): clean up MCP child processes after nested lane runs end

Repository: suboss87/openclaw
Author: suboss87
State: open | merged: False
Link: https://github.com/suboss87/openclaw/pull/1

Description (problem / solution / changelog)

Fixes openclaw/openclaw#70364

Problem

Every sessions_send call targeting another agent leaks a full cohort of MCP child processes. With 9 agents configured, each sessions_send adds 9 new child processes and the original cohort is never cleaned up.

Root cause: cleanupBundleMcpOnRunEnd was only set to true in the CLI --local path (agentCliCommand). When sessions_send dispatches a run through the gateway (dispatchAgentRunFromGateway), the ingressOpts never included cleanupBundleMcpOnRunEnd, so the finally block in pi-embedded-runner/run.ts that calls retireSessionMcpRuntime never fired for gateway-path nested sessions.

Fix

Import isNestedAgentLane in src/gateway/server-methods/agent.ts and add cleanupBundleMcpOnRunEnd: isNestedAgentLane(request.lane) to the ingressOpts passed to dispatchAgentRunFromGateway.

Nested lane runs are ephemeral and should tear down their MCP cohort when done. Top-level gateway sessions keep processes warm across turns.

Test

Added test in agent.test.ts asserting cleanupBundleMcpOnRunEnd === true for nested lane requests and false for regular requests.

Generated by Claude Code

Changed files

.agent/workflows/update_clawdbot.md (removed, +0/-380)
.agents/maintainers.md (removed, +0/-1)
.agents/skills/blacksmith-testbox/SKILL.md (added, +340/-0)
.agents/skills/openclaw-ghsa-maintainer/SKILL.md (added, +87/-0)
.agents/skills/openclaw-parallels-smoke/SKILL.md (added, +151/-0)
.agents/skills/openclaw-pr-maintainer/SKILL.md (added, +75/-0)
.agents/skills/openclaw-qa-testing/SKILL.md (added, +148/-0)
.agents/skills/openclaw-qa-testing/agents/openai.yaml (added, +4/-0)
.agents/skills/openclaw-release-maintainer/SKILL.md (added, +456/-0)
.agents/skills/openclaw-secret-scanning-maintainer/SKILL.md (added, +220/-0)
.agents/skills/openclaw-secret-scanning-maintainer/scripts/secret-scanning.mjs (added, +797/-0)
.agents/skills/openclaw-test-heap-leaks/SKILL.md (added, +75/-0)
.agents/skills/openclaw-test-heap-leaks/agents/openai.yaml (added, +4/-0)
.agents/skills/openclaw-test-heap-leaks/scripts/heapsnapshot-delta.mjs (added, +553/-0)
.agents/skills/openclaw-test-performance/SKILL.md (added, +134/-0)
.agents/skills/openclaw-test-performance/agents/openai.yaml (added, +6/-0)
.agents/skills/optimizetests/SKILL.md (added, +41/-0)
.agents/skills/optimizetests/agents/openai.yaml (added, +6/-0)
.agents/skills/parallels-discord-roundtrip/SKILL.md (added, +62/-0)
.agents/skills/security-triage/SKILL.md (added, +111/-0)
.agents/skills/tag-duplicate-prs-issues/SKILL.md (added, +485/-0)
.agents/skills/tag-duplicate-prs-issues/agents/openai.yaml (added, +4/-0)
.codex (renamed, +0/-0)
.dockerignore (modified, +8/-0)
.env.example (modified, +9/-4)
.github/CODEOWNERS (added, +54/-0)
.github/ISSUE_TEMPLATE/bug_report.yml (modified, +36/-25)
.github/actionlint.yaml (modified, +3/-0)
.github/actions/ensure-base-commit/action.yml (modified, +16/-2)
.github/actions/setup-node-env/action.yml (modified, +11/-25)
.github/actions/setup-pnpm-store-cache/action.yml (modified, +6/-19)
.github/instructions/copilot.instructions.md (modified, +3/-3)
.github/labeler.yml (modified, +137/-16)
.github/pr-assets/compaction-checkpoints/sessions-checkpoints-inline.png (added, +0/-0)
.github/pr-assets/compaction-checkpoints/sessions-overview-inline.png (added, +0/-0)
.github/pull_request_template.md (modified, +39/-7)
.github/workflows/auto-response.yml (modified, +18/-5)
.github/workflows/ci-check-testbox.yml (added, +100/-0)
.github/workflows/ci.yml (modified, +2018/-507)
.github/workflows/codeql.yml (modified, +13/-9)
.github/workflows/control-ui-locale-refresh.yml (added, +172/-0)
.github/workflows/docker-release.yml (modified, +137/-48)
.github/workflows/docs-sync-publish.yml (added, +70/-0)
.github/workflows/docs-translate-trigger-release.yml (added, +42/-0)
.github/workflows/install-smoke.yml (modified, +168/-33)
.github/workflows/labeler.yml (modified, +181/-18)
.github/workflows/macos-release.yml (added, +93/-0)
.github/workflows/openclaw-cross-os-release-checks-reusable.yml (added, +472/-0)
.github/workflows/openclaw-live-and-e2e-checks-reusable.yml (added, +664/-0)
.github/workflows/openclaw-npm-release.yml (modified, +370/-29)
.github/workflows/openclaw-release-checks.yml (added, +198/-0)
.github/workflows/openclaw-scheduled-live-checks.yml (added, +74/-0)
.github/workflows/parity-gate.yml (added, +114/-0)
.github/workflows/plugin-clawhub-release.yml (added, +273/-0)
.github/workflows/plugin-npm-release.yml (added, +214/-0)
.github/workflows/sandbox-common-smoke.yml (modified, +11/-3)
.github/workflows/stale.yml (modified, +12/-9)
.github/workflows/workflow-sanity.yml (modified, +41/-8)
.gitignore (modified, +26/-5)
.jscpd.json (added, +16/-0)
.markdownlint-cli2.jsonc (modified, +3/-0)
.npmignore (modified, +2/-0)
.npmrc (modified, +3/-0)
.oxfmtrc.jsonc (modified, +3/-2)
.oxlintrc.json (modified, +65/-10)
.pi/prompts/landpr.md (removed, +0/-73)
.pi/prompts/reviewpr.md (removed, +0/-134)
.pre-commit-config.yaml (modified, +2/-2)
.prettierignore (added, +1/-0)
.secrets.baseline (modified, +4/-4)
.vscode/settings.json (modified, +1/-1)
AGENTS.md (modified, +201/-297)
CHANGELOG.md (modified, +2845/-398)
CONTRIBUTING.md (modified, +44/-9)
Dockerfile (modified, +45/-11)
Dockerfile.sandbox (modified, +2/-1)
Dockerfile.sandbox-browser (modified, +3/-1)
Dockerfile.sandbox-common (modified, +1/-0)
INCIDENT_RESPONSE.md (added, +52/-0)
Makefile (added, +4/-0)
README.md (modified, +315/-391)
SECURITY.md (modified, +38/-2)
Swabble/Sources/SwabbleKit/WakeWordGate.swift (modified, +7/-13)
Swabble/Tests/SwabbleKitTests/WakeWordGateTests.swift (modified, +19/-0)
appcast.xml (modified, +278/-620)
apps/android/README.md (modified, +69/-3)
apps/android/app/build.gradle.kts (modified, +96/-19)
apps/android/app/proguard-rules.pro (modified, +0/-20)
apps/android/app/src/main/AndroidManifest.xml (modified, +16/-0)
apps/android/app/src/main/java/ai/openclaw/app/AssistantLaunch.kt (added, +43/-0)
apps/android/app/src/main/java/ai/openclaw/app/MainActivity.kt (modified, +28/-6)
apps/android/app/src/main/java/ai/openclaw/app/MainViewModel.kt (modified, +273/-94)
apps/android/app/src/main/java/ai/openclaw/app/NodeApp.kt (modified, +12/-1)
apps/android/app/src/main/java/ai/openclaw/app/NodeForegroundService.kt (modified, +6/-2)
apps/android/app/src/main/java/ai/openclaw/app/NodeRuntime.kt (modified, +567/-125)
apps/android/app/src/main/java/ai/openclaw/app/NotificationForwardingPolicy.kt (added, +102/-0)
apps/android/app/src/main/java/ai/openclaw/app/PermissionRequester.kt (modified, +89/-22)
apps/android/app/src/main/java/ai/openclaw/app/SecurePrefs.kt (modified, +202/-0)
apps/android/app/src/main/java/ai/openclaw/app/SessionKey.kt (modified, +11/-0)
apps/android/app/src/main/java/ai/openclaw/app/chat/ChatController.kt (modified, +157/-53)

PR #70442: fix(sandbox): use dedicated dm bucket for Telegram DMs so they are never the main session

Repository: openclaw/openclaw
Author: EronFan
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/70442

Description (problem / solution / changelog)

Summary

Telegram DMs with dmScope="main" (the default) were resolving to the same session key as the agent main session (agent:main:main). This caused shouldSandboxSession to return false when sessionKey === mainSessionKey, even when mode="all", bypassing sandbox isolation entirely — a security regression.

Root cause

In src/routing/session-key.ts, buildAgentPeerSessionKey for direct chats with dmScope="main" was calling buildAgentMainSessionKey, producing agent:main:main. Since the Telegram DM session key matched the main session key, shouldSandboxSession excluded it from sandboxing.

Fix

Use a dedicated "dm" bucket (agent:<agentId>:dm) for all direct chats when dmScope="main", giving Telegram DMs their own sandbox context that is never the main session.

-    return buildAgentMainSessionKey({
-      agentId: params.agentId,
-      mainKey: params.mainKey,
-    });
+    // Use a dedicated DM bucket so Telegram (and other direct-chat) sessions always get
+    // their own sandbox context distinct from the agent main session.
+    return `agent:${normalizeAgentId(params.agentId)}:dm`;

Test coverage

Added src/agents/sandbox/runtime-status.regression.test.ts covering:

mode="all": agent:main:dm IS sandboxed
mode="non-main": agent:main:dm IS sandboxed (not the main session)
mode="off": agent:main:dm is NOT sandboxed
mode="all": agent:main:main IS sandboxed
mode="non-main": agent:main:main is NOT sandboxed
dm bucket distinct from main session
per-peer DM scope still works

Also added src/routing/session-key.continuity.test.ts for session key continuity.
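
The coverage matrix above can be read as a small decision table. The following is a minimal sketch of that table, not the project's shouldSandboxSession: `SandboxMode`, `shouldSandboxSketch`, and the two key builders are hypothetical stand-ins, and the sketch assumes the only inputs are the mode and whether the session key equals the agent's main session key.

```typescript
// Hypothetical stand-ins; names and shapes are illustrative only.
type SandboxMode = "off" | "non-main" | "all";

// Key shapes as quoted in the PR description.
const mainSessionKey = (agentId: string): string => `agent:${agentId}:main`;
const dmSessionKey = (agentId: string): string => `agent:${agentId}:dm`;

// Encodes the expected outcomes from the coverage list:
// "off" never sandboxes, "all" always does, "non-main" skips only the main session.
function shouldSandboxSketch(
  mode: SandboxMode,
  sessionKey: string,
  mainKey: string,
): boolean {
  if (mode === "off") return false;
  if (mode === "all") return true;
  return sessionKey !== mainKey;
}
```

Under these assumptions, the dm bucket is sandboxed in "non-main" mode precisely because its key can no longer equal the main session key.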

Verification

pnpm test -- --run src/agents/sandbox/runtime-status.regression.test.ts src/routing/session-key.continuity.test.ts — 56 tests passing.

Fixes #70342

Changed files

extensions/brave/src/brave-web-search-provider.test.ts (modified, +64/-0)
extensions/slack/src/accounts.test.ts (modified, +38/-0)
extensions/slack/src/channel.ts (modified, +2/-2)
extensions/slack/src/monitor/provider.ts (modified, +1/-0)
src/agents/acp-spawn.test.ts (modified, +25/-0)
src/agents/acp-spawn.ts (modified, +4/-0)
src/agents/command/attempt-execution.ts (modified, +0/-1)
src/agents/command/types.ts (modified, +0/-2)
src/agents/pi-embedded-runner.cache.live.test.ts (modified, +0/-1)
src/agents/pi-embedded-runner.e2e.test.ts (modified, +47/-2)
src/agents/pi-embedded-runner/compact.hooks.test.ts (modified, +106/-0)
src/agents/pi-embedded-runner/compact.queued.ts (modified, +20/-0)
src/agents/pi-embedded-runner/openrouter-model-capabilities.test.ts (modified, +49/-0)
src/agents/pi-embedded-runner/openrouter-model-capabilities.ts (modified, +6/-1)
src/agents/pi-embedded-runner/run.ts (modified, +5/-7)
src/agents/pi-embedded-runner/run/params.ts (modified, +0/-6)
src/agents/sandbox/runtime-status.regression.test.ts (added, +86/-0)
src/agents/tools/sessions-spawn-tool.test.ts (modified, +28/-0)
src/agents/tools/sessions-spawn-tool.ts (modified, +2/-0)
src/commands/agent-via-gateway.test.ts (modified, +0/-6)
src/commands/agent-via-gateway.ts (modified, +0/-1)
src/commands/agent.test.ts (modified, +0/-1)
src/infra/bonjour.test.ts (modified, +64/-0)
src/infra/bonjour.ts (modified, +66/-17)
src/routing/resolve-route.test.ts (modified, +4/-4)
src/routing/session-key.continuity.test.ts (modified, +2/-1)
src/routing/session-key.ts (modified, +7/-4)
test/vitest/vitest.agents.config.ts (modified, +1/-1)

PR #70465: fix(gateway): cleanup MCP runtime for nested-lane agent runs to plug sessions_send leak (#70364)

Repository: openclaw/openclaw
Author: Sanjays2402
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/70465

Description (problem / solution / changelog)

Summary

Problem: Every sessions_send / runAgentStep call into another agent leaks a full cohort of MCP child processes. Reporter measured: 9 baseline MCP children → 18 after one sessions_send → 27 after two — the original cohort is never reclaimed. Each call is N processes (N = configured MCP servers per agent). Eventually the gateway becomes unstable and the only recovery is systemctl --user restart openclaw-gateway.
Why it matters: Multi-agent fleet setups with agentToAgent.enabled: true and per-agent MCP servers are unusable in steady state. The leak is deterministic and reproduces 100% of the time across every cross-agent call (Jarvis → Atlas, Jarvis → Forge, Jarvis → Spark all confirmed by reporter on 2026.4.15).
What changed: Default cleanupBundleMcpOnRunEnd to true at the gateway agent handler when the request lane is nested AND the caller hasn't explicitly opted out. disposeSessionMcpRuntime(sessionId) then fires from the existing pi-embedded-runner finally-block when the ephemeral run ends, freeing the cohort.
What did NOT change (scope boundary): No change to disposeSessionMcpRuntime itself, the SessionMcpRuntimeManager, or the pi-embedded-runner lifecycle. Local-CLI (--local), subagent-spawn, and isolated-cron callers continue to set the flag themselves; their behavior is unchanged. Session-mode subagent spawns (which deliberately keep the runtime alive across nested runs) can still pass cleanupBundleMcpOnRunEnd: false explicitly and are honoured.

Credit to @aiedvlyman for a fully root-caused report — including exact filenames, line numbers, and a working Option-A patch sketch in the issue body.

Change Type (select all)

Bug fix

Scope (select all touched areas)

Gateway / orchestration

Linked Issue/PR

Closes #70364
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: cleanupBundleMcpOnRunEnd is set to true at the CLI layer only when opts.local === true (src/commands/agent-via-gateway.ts:185). The gateway agent handler at src/gateway/server-methods/agent.ts:950 then forwarded request.cleanupBundleMcpOnRunEnd === true straight through. Nested gateway-routed runs (runAgentStep → lane: "nested" | "nested:...") don't pass that flag, so the embedded runner's finally-block at src/agents/pi-embedded-runner/run.ts:2136 (if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(...)) never fires for them.
Missing detection / guardrail: no test asserted the gateway agent handler's cleanupBundleMcpOnRunEnd decision for nested-lane requests. subagent-spawn.ts and cron/isolated-agent/run-executor.ts had their own per-caller logic but there was no contract test at the gateway-method seam.
Contributing context: the flag's name (cleanupBundleMcpOnRunEnd) is shared by both opt-in and opt-out callers; a centralised "nested lane = default true" rule is the smallest place to make the cleanup policy explicit.

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
Target test file: src/gateway/server-methods/agent.test.ts (extended).
Scenarios the test should lock in:
- lane="nested" → default true
- lane="nested:agent:spark:main" → default true (prefix match via isNestedAgentLane)
- lane="main" → default false (unchanged)
- lane unset → default false (unchanged)
- explicit cleanupBundleMcpOnRunEnd: false on a nested run → still false (opt-out preserved)
Why this is the smallest reliable guardrail: the policy decision is a single conditional in agent.ts; five matrix cases cover the full decision table.
Existing test that already covers this (if any): none — agent.test.ts exercised cleanupBundleMcpOnRunEnd only via end-to-end runner tests where the flag was passed explicitly.
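
The five scenarios above fully determine the gateway-side decision. The following is a minimal sketch of that decision, assuming the lane predicate matches bare "nested" and the "nested:" prefix as this PR describes. `isNestedLaneSketch`, `AgentRequestSketch`, and `resolveCleanupOnRunEnd` are hypothetical names used only for illustration; the project's real isNestedAgentLane may be implemented differently.

```typescript
// Assumed behavior of the lane predicate: bare "nested" or a "nested:" prefix.
function isNestedLaneSketch(lane: string | undefined): boolean {
  if (lane === undefined) return false;
  return lane === "nested" || lane.startsWith("nested:");
}

// Hypothetical subset of the gateway request relevant to the decision.
interface AgentRequestSketch {
  lane?: string;
  cleanupBundleMcpOnRunEnd?: boolean;
}

// An explicit true always wins, an explicit false is an opt-out,
// and an omitted flag defaults to true only for nested lanes.
function resolveCleanupOnRunEnd(request: AgentRequestSketch): boolean {
  return (
    request.cleanupBundleMcpOnRunEnd === true ||
    (request.cleanupBundleMcpOnRunEnd === undefined &&
      isNestedLaneSketch(request.lane))
  );
}
```

The explicit `=== undefined` check is what preserves the opt-out: a caller passing false on a nested lane keeps its runtime alive.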

User-visible / Behavior Changes

Cross-agent sessions_send / runAgentStep calls no longer leak MCP child-process cohorts. Steady-state process count drops from N × (1 + calls_per_session) to N.
Operators do not need to set anything new. Behavior change is opt-out, not opt-in.
No config schema change. No public-API change.
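
The steady-state claim above can be checked with simple arithmetic. The following is a minimal sketch, assuming every nested call leaks exactly one cohort of N children when cleanup is off and none when it is on; `expectedMcpChildren` is a hypothetical helper, not part of OpenClaw.

```typescript
// Hypothetical helper: expected MCP child count after a number of nested calls.
// n is the cohort size (one child per configured MCP server).
function expectedMcpChildren(
  n: number,
  nestedCalls: number,
  cleanupOnRunEnd: boolean,
): number {
  // Without cleanup each nested call leaves one extra cohort behind.
  const leakedCohorts = cleanupOnRunEnd ? 0 : nestedCalls;
  return n * (1 + leakedCohorts);
}
```

With n = 9 this reproduces the reporter's 9, 18, 27 progression when cleanup is off, and stays at 9 when cleanup is on.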

Diagram (if applicable)

Before:
  sessions_send -> runAgentStep -> callGateway({method:"agent", lane:"nested", ...})
    -> gateway agent handler:
         cleanupBundleMcpOnRunEnd: false        (request flag undefined)
    -> embedded-runner finally:
         if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(...)
         // never fires -> N MCP children leak per call

After:
  sessions_send -> runAgentStep -> callGateway({method:"agent", lane:"nested", ...})
    -> gateway agent handler:
         cleanupBundleMcpOnRunEnd:
           request.cleanupBundleMcpOnRunEnd === true
           || (request.cleanupBundleMcpOnRunEnd === undefined
               && isNestedAgentLane(request.lane))
         // -> true for nested / nested:* lanes
    -> embedded-runner finally:
         await disposeSessionMcpRuntime(sessionId)   // fires; cohort freed

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No (only the cleanup signal is changing; disposeSessionMcpRuntime itself was already there)
Data access scope changed? No
Net: this is purely a runtime-cleanup signal, gating an existing teardown helper. It tightens process hygiene without expanding what runs.

Repro + Verification

Environment

OS: macOS 26.5 (arm64) for development; reporter on Ubuntu 24.04 systemd user service
Runtime/container: Node v25.9.0
Model/provider: irrelevant (leak is in gateway MCP lifecycle, not inference path) — reporter confirmed across openrouter/minimax/minimax-m2.7 + openrouter/google/gemini-2.5-flash + openai-codex/gpt-5.4
Integration/channel: any channel that surfaces sessions_send
Relevant config: agentToAgent.enabled: true, tools.sessions.visibility: all, per-agent MCP servers

Steps (per reporter)

1. systemctl --user restart openclaw-gateway
2. pgrep -a -f "mcp/server.py" → 9 children (one per agent)
3. From any agent's session, sessions_send to another agent.
4. pgrep -a -f "mcp/server.py" → 18 children (original 9 still running, 9 new ones added)

Expected

After step 4, child count returns to 9 once the nested run finishes.

Actual (before fix)

Child count grows by N on every sessions_send and never decays.

Actual (after fix)

The nested run's finally-block calls disposeSessionMcpRuntime(sessionId); the cohort spawned for that run is reclaimed. Steady-state count returns to N.

Evidence

Failing test/log before + passing after

$ git diff --stat
 src/gateway/server-methods/agent.test.ts | 44 ++++++++++++++++++++++++++++++++
 src/gateway/server-methods/agent.ts      | 16 +++++++++++-
 2 files changed, 59 insertions(+), 1 deletion(-)

$ npx -p typescript@5 tsc --noEmit --skipLibCheck \
    --target ES2022 --module ESNext --moduleResolution Bundler \
    --esModuleInterop --strict \
    src/gateway/server-methods/agent.ts \
  | grep "server-methods/agent.ts"
# (no errors in modified file beyond missing transitive types in tmp clone)

5 new regression cases (4 parametrized + 1 explicit opt-out) added to the existing gateway agent handler describe block. CI on this PR will exercise the full vitest project.

Human Verification (required)

Verified scenarios: walked the call graph end-to-end: sessions_send → runAgentStep (src/agents/tools/agent-step.ts) → callGateway({method:"agent", lane: ..., ...}) → agentHandlers.agent (src/gateway/server-methods/agent.ts:950) → agentCommandFromIngress(..., cleanupBundleMcpOnRunEnd: ...) → embedded-runner params.cleanupBundleMcpOnRunEnd === true finally-block (src/agents/pi-embedded-runner/run.ts:2136). Confirmed isNestedAgentLane already correctly recognises both bare "nested" and prefixed "nested:..." forms. Confirmed subagent-spawn.ts and cron/isolated-agent/run-executor.ts callers will continue to set the flag themselves and are unaffected.
Edge cases checked:
- bare "nested" lane → defaulted true
- prefixed "nested:agent:spark:main" lane → defaulted true
- non-nested lane (e.g. "main") → unchanged false default
- missing lane → unchanged false default
- explicit cleanupBundleMcpOnRunEnd: false on a nested run → preserved (opt-out for session-mode subagents)
What I did NOT verify: running the full Ubuntu systemd repro on a multi-agent fleet — I don't have a 9-agent gateway with per-agent MCP servers locally. The unit-level decision is precisely covered by the new tests and the downstream disposeSessionMcpRuntime path is already exercised by pi-embedded-runner.cache.live.test.ts:327 / pi-embedded-runner.e2e.test.ts:473,516.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
Existing callers that pass cleanupBundleMcpOnRunEnd explicitly are honoured. The only behavior change is for callers that omit the flag entirely AND target a nested lane — which is exactly the leak case.

Risks and Mitigations

Risk: a future caller that wants a long-lived nested MCP runtime (e.g. an interactive nested REPL) would silently get the runtime torn down on first turn.
- Mitigation: the explicit-opt-out test (respects an explicit cleanupBundleMcpOnRunEnd=false on a nested-lane request) locks in the escape hatch. Such callers can pass cleanupBundleMcpOnRunEnd: false and the gateway will honour it.
Risk: isNestedAgentLane mis-classifies a custom lane string that happens to start with nested:.
- Mitigation: isNestedAgentLane is the project's canonical "is this a nested lane?" predicate and is already used to drive other nested-lane behavior; reusing it keeps the policy consistent. If the predicate ever needs tightening, both this PR's site and the others move together.
Risk: Reporter's Option B (wire teardown into the gateway session lifecycle) is more robust against future code paths that spawn nested sessions outside agentCliCommand.
- Mitigation: Option A is the minimal safe fix and resolves the reported leak deterministically. Option B is a follow-up worth doing as a separate PR — it's a wider refactor of the lifecycle event surface and deserves its own design + review.

Changed files

src/gateway/server-methods/agent.test.ts (modified, +44/-0)
src/gateway/server-methods/agent.ts (modified, +15/-1)

PR #70480: fix(gateway): tear down nested-lane MCP cohort on run end

Repository: openclaw/openclaw
Author: suboss87
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/70480

Description (problem / solution / changelog)

Summary

Fixes #70364. Nested agent runs dispatched via sessions_send (one agent sending to another) spawn their own MCP cohort per session but never called retireSessionMcpRuntime on completion. Each dispatch leaked a full cohort of MCP child processes, growing unboundedly until gateway restart.

Root cause

cleanupBundleMcpOnRunEnd was only set to true in the CLI --local path (src/cli/cli-runner.ts). When a gateway nested lane dispatched through dispatchAgentRunFromGateway in src/gateway/server-methods/agent.ts:958, the ingressOpts had no cleanupBundleMcpOnRunEnd, so pi-embedded-runner never tore down the session's MCP runtime. Top-level gateway sessions are fine (they keep MCP warm across turns by design), but nested lane runs should be ephemeral.

Fix

Wire isNestedAgentLane(request.lane) into the ingressOpts passed to dispatchAgentRunFromGateway. Nested lane runs now tear down their MCP cohort on completion. Top-level gateway sessions continue to keep processes warm.

Evidence

Symptom: reporter in #70364 describes 9 agents configured, each sessions_send adds 9 new MCP children that are never reaped.
Root cause in code: src/gateway/server-methods/agent.ts:958 was missing the cleanup flag that existed in the --local CLI path.
Fix touches the implicated path: single-line addition at the same site, plus reusing the existing isNestedAgentLane helper from src/agents/lanes.ts.
Regression test: new test in src/gateway/server-methods/agent.test.ts asserts that nested-lane dispatches pass cleanupBundleMcpOnRunEnd: true while top-level dispatches do not.

Test plan

New regression test covers nested-lane cleanup flag
Top-level lane path unchanged (verified by existing tests)
All 32 agent.test.ts tests pass
Fix diff under 40 lines

Closes #70364.

Changed files

src/gateway/server-methods/agent.test.ts (modified, +30/-0)
src/gateway/server-methods/agent.ts (modified, +5/-0)

Code Example

Suggested Fix
The gateway needs to call disposeSessionMcpRuntime(sessionId) when a nested/ephemeral agent run ends. Two options:

Option A: Set cleanupBundleMcpOnRunEnd: true for nested lane runs in the gateway's agent handler (similar to how --local sets it):

    cleanupBundleMcpOnRunEnd: opts.local === true || opts.lane === AGENT_LANE_NESTED

Option B: Wire disposeSessionMcpRuntime into the gateway's session lifecycle end handler (onSessionLifecycleEvent in subagent-registry-BrNWizSY.js) so cleanup fires on session end regardless of how the run was initiated.

Option B is more robust as it handles any future code paths that spawn nested sessions without going through agentCliCommand.
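
To make the difference between the two options concrete, here is a minimal sketch of the Option B idea: teardown is driven by a session-end event rather than by a per-call flag. Everything in it is hypothetical. `LifecycleBusSketch`, `RuntimeRegistrySketch`, and the event shape are invented for illustration, and the real onSessionLifecycleEvent signature is not shown in this issue.

```typescript
// Hypothetical event shape; the real lifecycle event may carry different fields.
type SessionLifecycleEventSketch = { type: "start" | "end"; sessionId: string };
type ListenerSketch = (event: SessionLifecycleEventSketch) => void;

// Minimal in-memory event bus standing in for the gateway's lifecycle surface.
class LifecycleBusSketch {
  private listeners: ListenerSketch[] = [];

  on(listener: ListenerSketch): void {
    this.listeners.push(listener);
  }

  emit(event: SessionLifecycleEventSketch): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Stand-in for a manager that tracks one MCP runtime per session ID.
class RuntimeRegistrySketch {
  private live = new Set<string>();

  create(sessionId: string): void {
    this.live.add(sessionId);
  }

  dispose(sessionId: string): void {
    this.live.delete(sessionId);
  }

  liveCount(): number {
    return this.live.size;
  }
}

// Cleanup is attached once, so any code path that ends a session triggers it.
function wireCleanup(bus: LifecycleBusSketch, registry: RuntimeRegistrySketch): void {
  bus.on((event) => {
    if (event.type === "end") registry.dispose(event.sessionId);
  });
}
```

The design point is that callers no longer need to remember a flag: whichever path starts a nested session, the end event reclaims its runtime.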

Additional Notes
pkill -f mcp/server.py clears child symptoms but does not fix the leak — it reproduces immediately on next sessions_send
socat processes are a red herring — confirmed the real unit is the user service
Logs around repro show [agent:nested], ANNOUNCE_SKIP, and repeated webchat reconnect churn consistent with leaked runtimes
agentToAgent.enabled: true and tools.sessions.visibility: all are set in openclaw.json


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

Repro Steps

1. Start fresh gateway: systemctl --user restart openclaw-gateway
2. Confirm baseline (1 gateway parent + 9 MCP children): pgrep -a -f "mcp/server.py"
   → 9 processes (one per configured agent)
3. Send one minimal cross-agent message via sessions_send to any agent (e.g. Spark), reply REPRO_OK
4. Check processes again: pgrep -a -f "mcp/server.py"
   → 18 processes: original 9 still running, 9 new ones added

Confirmed repro with fresh gateway PID 179821:

Baseline MCP children: 179966 179969 179972 179977 179982 179985 179988 179991 179994
After one sessions_send to Spark: 180076 180079 180082 180089 180092 180095 180098 180101 180104
Original cohort still running, not cleaned up.
Repeats on every sessions_send. Each call adds another full cohort.
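
The two PID cohorts listed above make the leak easy to check mechanically. The following is a minimal sketch, assuming pgrep -a prints one "PID command" pair per line; `parsePgrepPids` and `diffCohorts` are hypothetical helper names, not part of OpenClaw.

```typescript
// Hypothetical helpers for comparing two `pgrep -a -f "mcp/server.py"` snapshots.
// Assumes each non-empty output line starts with the PID, followed by the command line.
function parsePgrepPids(output: string): number[] {
  return output
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => Number(line.split(/\s+/)[0]))
    .filter((pid) => Number.isInteger(pid) && pid > 0);
}

interface CohortDiff {
  surviving: number[]; // baseline PIDs still present in the later snapshot
  added: number[]; // PIDs that appear only in the later snapshot
}

function diffCohorts(baseline: number[], later: number[]): CohortDiff {
  const baselineSet = new Set(baseline);
  const laterSet = new Set(later);
  return {
    surviving: baseline.filter((pid) => laterSet.has(pid)),
    added: later.filter((pid) => !baselineSet.has(pid)),
  };
}
```

In the reported repro, the whole baseline cohort would land in `surviving` and the nine new PIDs in `added`; once a nested run has finished with cleanup in place, `added` would be expected to be empty.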

Root Cause (Code-Level)

The cleanup flag cleanupBundleMcpOnRunEnd is only ever set to true in local/embedded mode:

    cleanupBundleMcpOnRunEnd: opts.local === true

The cleanup itself lives in pi-embedded-runner-DN0VbqlW.js, line 9713:

    if (params.cleanupBundleMcpOnRunEnd === true) await disposeSessionMcpRuntime(params.sessionId).catch(...)

When sessions_send triggers a nested agent run via runAgentStep() in subagent-registry-BrNWizSY.js:

    const response = await agentStepDeps.callGateway({ method: "agent", params: { lane: params.lane ?? AGENT_LANE_NESTED, ... } });

This goes through the gateway path, not --local. cleanupBundleMcpOnRunEnd is false (or unset). The finally block that calls disposeSessionMcpRuntime never fires. The MCP child processes spawned for that nested session's runtime are never cleaned up.

server.impl-GQ72oJBa.js — the gateway implementation — does not reference cleanupBundleMcpOnRunEnd at all.

The SessionMcpRuntimeManager (pi-bundle-mcp-tools-vusm-AE2.js, line 483) correctly tracks runtimes by sessionId and has a working disposeSession() method — but it is never called for gateway-path nested sessions because the flag that triggers it is hardcoded to local === true.
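The gating described above can be reproduced in a few lines. This is a toy model under stated assumptions, not the code from pi-embedded-runner or pi-bundle-mcp-tools: ToyMcpRuntimeManager, runAgent, the session IDs, and the fixed count of 9 children per session runtime are hypothetical. The real cleanup is awaited; the sketch is synchronous for brevity.

```typescript
// Toy reproduction of the flag-gated cleanup; not OpenClaw's real code.
class ToyMcpRuntimeManager {
  // sessionId -> number of MCP child processes owned by that session's runtime
  private readonly runtimes = new Map<string, number>();

  ensure(sessionId: string, children: number): void {
    if (!this.runtimes.has(sessionId)) {
      this.runtimes.set(sessionId, children);
    }
  }

  disposeSession(sessionId: string): void {
    this.runtimes.delete(sessionId);
  }

  liveChildren(): number {
    let total = 0;
    this.runtimes.forEach((count) => {
      total += count;
    });
    return total;
  }
}

interface RunParams {
  sessionId: string;
  cleanupBundleMcpOnRunEnd?: boolean;
}

function runAgent(manager: ToyMcpRuntimeManager, params: RunParams): void {
  manager.ensure(params.sessionId, 9); // assumed: one child per configured agent
  try {
    // ... the agent turn would run here ...
  } finally {
    // Cleanup only happens when the caller opted in.
    if (params.cleanupBundleMcpOnRunEnd === true) {
      manager.disposeSession(params.sessionId);
    }
  }
}

const manager = new ToyMcpRuntimeManager();
runAgent(manager, { sessionId: "main" });     // top-level session, kept warm
runAgent(manager, { sessionId: "nested-1" }); // gateway path: flag unset, cohort leaks
console.log(manager.liveChildren()); // 18
runAgent(manager, { sessionId: "nested-2", cleanupBundleMcpOnRunEnd: true });
console.log(manager.liveChildren()); // 18 (nested-2 was disposed, nested-1 is still leaked)
```

The manager's disposeSession works in both cases; the only difference between the leaking and non-leaking run is whether the caller set the flag, which matches the report's observation that the gateway path never sets it.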

Expected behavior

Actual behavior


OpenClaw version

Version: 2026.4.15 (041266a)

Operating system

Platform: Ubuntu 24.04, systemd user service (openclaw-gateway)

Install method

No response

Model

GPT-5.4

Provider / routing chain

openclaw

Additional provider/model setup details

Provider config during repro:

main (Jarvis): openai-codex/gpt-5.4 primary, fallbacks: openrouter/minimax/minimax-m2.7, openrouter/google/gemini-2.5-flash
All other agents: openrouter/minimax/minimax-m2.7 primary, fallback: openrouter/google/gemini-2.5-flash
sessions_send target during repro: Spark (openrouter/minimax/minimax-m2.7)
Bug also confirmed cross-agent: Jarvis → Atlas, Jarvis → Forge
Provider chain does not appear relevant — leak is in gateway MCP lifecycle, not inference path

Logs, screenshots, and evidence


Impact and severity

No response

Additional information

No response

extent analysis

TL;DR

The most likely fix for the MCP child process leak is to modify the gateway's agent handler to set cleanupBundleMcpOnRunEnd to true for nested lane runs or wire disposeSessionMcpRuntime into the gateway's session lifecycle end handler.

Guidance

Identify the root cause of the issue, which is the cleanupBundleMcpOnRunEnd flag being hardcoded to local === true, preventing cleanup of MCP child processes in non-local modes.
Consider two possible solutions:
- Option A: Modify the cleanupBundleMcpOnRunEnd condition to include opts.lane === AGENT_LANE_NESTED.
- Option B: Integrate disposeSessionMcpRuntime into the gateway's session lifecycle end handler to ensure cleanup regardless of the run initiation method.
Verify the fix by checking the number of MCP child processes before and after sending a cross-agent message via sessions_send.

Example

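// Option A: also enable cleanup for nested lane runs
// (fragment of the options object passed to the run)
cleanupBundleMcpOnRunEnd: opts.local === true || opts.lane === AGENT_LANE_NESTED

// Option B: dispose the session's MCP runtime from the session lifecycle end handler.
// The event shape (type, sessionId) is an assumption; check the real lifecycle event.
onSessionLifecycleEvent: (event) => {
  if (event.type === 'sessionEnd') {
    // disposeSessionMcpRuntime is awaited in the runner, so handle its rejection here
    void disposeSessionMcpRuntime(event.sessionId).catch((err) => console.warn(err));
  }
}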

Notes

The provided fix options assume that the disposeSessionMcpRuntime function is correctly implemented and functional.
Option A is the smaller change (a single flag in the gateway's agent handler); Option B also covers code paths that start nested sessions without going through that handler.
It is essential to thoroughly test the chosen solution to ensure it resolves the MCP child process leak issue.

Recommendation

Prefer Option B: it handles any future code paths that spawn nested sessions without going through agentCliCommand, so cleanup runs regardless of how the run was initiated.

