openclaw - ✅(Solved) Fix Subagent bundle-MCP runtimes can leak stdio child processes across sessions [2 pull requests, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#70389 · Fetched 2026-04-23 07:25:26
Comments: 1
Participants: 2
Timeline: 3
Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1

Subagent runs can leak bundle-MCP child processes for two reasons: the cleanup path is not reliably enabled for :subagent: sessions, and even when cleanup runs, it closes the MCP client before the stdio transport, which leaves the child processes unreclaimed.

In real usage this leaves lingering minimax-mcp, minimax-coding-plan-mcp, and codex mcp-server processes under openclaw-gateway, causing steady RSS growth and eventually system instability.

PR fix notes

PR #70400: fix(mcp): dispose transport before client so stdio children reclaim

Description (problem / solution / changelog)

Summary

Partial fix for #70389 (ordering side). In `src/agents/pi-bundle-mcp-runtime.ts:disposeSession`, closing `session.client` before `session.transport` leaves stdio transport children unreclaimed in practice — the client-level protocol close doesn't itself signal EOF on the child's stdio pipes, and the subsequent `transport.close` races the child's own shutdown.

Reporter observed:

```json
{
  "before": { "minimax": 0, "minimax_coding": 0, "codex": 0 },
  "first.after_wait": { "minimax": 1, "minimax_coding": 1, "codex": 1 },
  "second.after_wait": { "minimax": 2, "minimax_coding": 2, "codex": 2 }
}
```

…with `minimax-mcp`, `minimax-coding-plan-mcp`, and `codex mcp-server` accumulating across subagent runs and feeding OOM pressure in production.

Fix

Swap the close order:

```ts
// before
await session.client.close().catch(() => {});
await session.transport.close().catch(() => {});

// after (this PR)
await session.transport.close().catch(() => {});
await session.client.close().catch(() => {});
```

Closing the transport first closes the stdio pipes the child process reads from, letting it exit cleanly. The client.close() then tears down MCP protocol state on a transport that's already dead — safe because `client.close()` doesn't require a live transport to return. Reporter verified this makes bundle MCP child count return to zero after run completion in live testing.
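To make the ordering concrete, here is a minimal sketch of a transport-first dispose together with a recording fake that observes the call order. The `Closable` and `BundleMcpSession` shapes and the `makeRecordingSession` helper are hypothetical stand-ins for illustration; the real session type in `pi-bundle-mcp-runtime.ts` is not shown in the PR text.

```typescript
// Hypothetical minimal shapes; the real types are not shown in the PR.
interface Closable {
  close(): Promise<void>;
}

interface BundleMcpSession {
  client: Closable;
  transport: Closable;
}

// Close the transport first so the stdio child sees its pipes close,
// then tear down client-side protocol state. Each error is swallowed so
// a failed close on one side cannot skip the other.
async function disposeSession(session: BundleMcpSession): Promise<void> {
  await session.transport.close().catch(() => {});
  await session.client.close().catch(() => {});
}

// Illustrative fake that records the order in which close() is called.
function makeRecordingSession(log: string[]): BundleMcpSession {
  return {
    client: {
      close: async () => {
        log.push("client");
      },
    },
    transport: {
      close: async () => {
        log.push("transport");
      },
    },
  };
}
```

A fake like this is one way the close order could be asserted in a unit test if `disposeSession` were ever exposed for testing; it does not prove that a real child process exits.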

Scope note

The issue also identifies a separate gap — that `cleanupBundleMcpOnRunEnd` isn't reliably enabled for `:subagent:` sessions — which requires tracing the gateway-ingress agent-opts path and wiring a subagent-detection condition into the ingress layer. That's a broader change in a different file, deferred to a follow-up so this ordering fix lands clean.
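The deferred detection condition could look roughly like the sketch below. It combines the lane check with the session-key check the reporter suggested. The `RunRequestLike` shape and the function name `shouldCleanupBundleMcpOnRunEnd` are assumptions for illustration; only `lane`, `resolvedSessionKey`, and the `":subagent:"` marker come from the issue, and the actual ingress wiring is not shown here.

```typescript
// Hypothetical request shape: only `lane` and `resolvedSessionKey`
// are named in the issue; everything else here is illustrative.
interface RunRequestLike {
  lane?: string;
  resolvedSessionKey?: string | null;
}

// Treat a run as a subagent run if either signal says so. The issue
// reports that the lane check alone is insufficient in practice.
function shouldCleanupBundleMcpOnRunEnd(req: RunRequestLike): boolean {
  const laneIsSubagent = req.lane === "subagent";
  const keyIsSubagent = Boolean(req.resolvedSessionKey?.includes(":subagent:"));
  return laneIsSubagent || keyIsSubagent;
}
```

For example, a request with no lane but a key of the form `agent:<agentId>:subagent:<id>` would enable cleanup, while a request with neither signal would not.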

Test

The ordering fix is one line. I checked existing tests in `pi-bundle-mcp-runtime.test.ts` which use a mocked `createRuntime` so the internals of `disposeSession` are not unit-tested — adding coverage for the close order would require exposing `disposeSession` on `__testing`, which felt disproportionate. Reporter's live-test verification of the bundle-MCP child count returning to zero is the empirical proof.

oxlint clean.

Closes #70389 (ordering fix — the `cleanupBundleMcpOnRunEnd` subagent-detection side is intentionally scoped out).

Changed files

  • src/agents/pi-bundle-mcp-runtime.ts (modified, +8/-1)

PR #70419: fix(gateway): raise child oom_score_adj on linux to spare the gateway under OOM

Description (problem / solution / changelog)

Closes #70404.

Root Cause

On Linux, child processes inherit the gateway's oom_score_adj. In a memory-constrained cgroup, the gateway is often the largest-RSS process because it keeps long-lived WebSocket state and V8 heap resident, while transient children such as agent workers, MCP stdio servers, PTY shells, and Chrome/browser helpers are smaller individually. When the cgroup hits its memory limit, the kernel can therefore kill openclaw-gateway instead of the transient child that pushed the cgroup over the edge. The gateway exits with 137 and all connected sessions drop.

The important constraint: lowering the gateway's OOM score, or having the parent process write a lower score into children, is capability-sensitive in hardened containers. The reliable unprivileged operation is the opposite: a Linux process may voluntarily increase its own OOM kill likelihood.

Fix

Add a shared Linux-only spawn helper that wraps eligible child commands in a short /bin/sh shim:

/bin/sh -c 'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"' <cmd> <args...>

The shim runs in the post-fork child, raises that child's own oom_score_adj, then execs the real command. There is no extra long-lived shell process, and after exec the process identity, PID, stdio, exit, and kill semantics remain the target process.
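As a rough illustration of the argv shape this produces, the sketch below builds the wrapped command from the original command and arguments. The names `OOM_SHIM_SCRIPT`, `SpawnSpec`, and `wrapWithOomShim` are hypothetical; the actual helper in `src/process/linux-oom-score.ts` is not reproduced in the PR text.

```typescript
// Fixed shim text, as quoted in the PR description.
const OOM_SHIM_SCRIPT =
  'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"';

// Hypothetical return shape for illustration.
interface SpawnSpec {
  command: string;
  args: string[];
}

// With `sh -c <script> <cmd> <args...>`, the shell binds <cmd> to $0 and
// the remaining values to "$@", so user-supplied values are passed as
// positional parameters and never spliced into the script text.
function wrapWithOomShim(command: string, args: string[]): SpawnSpec {
  return {
    command: "/bin/sh",
    args: ["-c", OOM_SHIM_SCRIPT, command, ...args],
  };
}
```

Calling `wrapWithOomShim("node", ["server.js"])` would yield `/bin/sh` with arguments `-c`, the shim script, `node`, `server.js`.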

Current covered spawn surfaces:

  • src/process/supervisor/adapters/child.ts for regular supervisor-managed children.
  • src/process/supervisor/adapters/pty.ts for PTY-backed shell children.
  • src/agents/mcp-stdio-transport.ts for MCP stdio server children.
  • extensions/browser/src/browser/chrome.ts for launched browser/Chrome processes, through the public plugin SDK process-runtime seam.

The helper is no-op when:

  • the platform is not Linux,
  • OPENCLAW_CHILD_OOM_SCORE_ADJ=0 / false / no / off is set in the child env,
  • /bin/sh is unavailable, so distroless/scratch images degrade to previous behavior instead of failing with ENOENT,
  • the argv is already wrapped,
  • the command name starts with -, because POSIX sh implementations do not support exec -- and a leading-dash command could be parsed as an exec option.
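The no-op conditions above can be read as a single predicate. The sketch below is an assumption-laden illustration, not the PR's code: the input shape, the function names, the way an already-wrapped argv is recognised, and the trim/lowercase normalisation of the opt-out value are all hypothetical.

```typescript
const OOM_SHIM_SCRIPT =
  'echo 1000 > /proc/self/oom_score_adj 2>/dev/null; exec "$0" "$@"';

// Hypothetical input: the caller supplies platform and /bin/sh
// availability so the decision itself stays a pure function.
interface WrapDecisionInput {
  platform: string; // e.g. the value of process.platform
  env: Record<string, string | undefined>;
  shAvailable: boolean; // whether /bin/sh exists
  command: string;
  args: string[];
}

// Opt-out values listed in the PR; normalising case and whitespace
// is an assumption made for this sketch.
function isOptedOut(env: Record<string, string | undefined>): boolean {
  const raw = env["OPENCLAW_CHILD_OOM_SCORE_ADJ"];
  if (raw === undefined) return false;
  return ["0", "false", "no", "off"].includes(raw.trim().toLowerCase());
}

// Assumed detection: the argv already starts with the shim invocation.
function isAlreadyWrapped(command: string, args: string[]): boolean {
  return command === "/bin/sh" && args[0] === "-c" && args[1] === OOM_SHIM_SCRIPT;
}

function shouldWrapWithOomShim(input: WrapDecisionInput): boolean {
  if (input.platform !== "linux") return false;
  if (isOptedOut(input.env)) return false;
  if (!input.shAvailable) return false;
  if (isAlreadyWrapped(input.command, input.args)) return false;
  // A leading-dash command could be parsed as an option to exec.
  if (input.command.startsWith("-")) return false;
  return true;
}
```

Keeping every check in one predicate means any failed condition falls back to the original direct-spawn path.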

Safety Notes

  • Linux-only behavior. macOS, Windows, and other platforms keep their existing spawn shape.
  • Argument-safe execution. The wrapper script is fixed text. The real command and args are passed as shell positional parameters and executed with POSIX-compatible exec "$0" "$@", so user args are not re-parsed as shell source. Leading-dash command names are intentionally left on the original direct-spawn path.
  • Shell env hardening. Wrapped spawns strip BASH_ENV, ENV, and CDPATH so the /bin/sh -c shim cannot source caller-influenced startup files before exec.
  • Transparent failure mode. If /proc/self/oom_score_adj is unavailable or unwritable, stderr is suppressed and the child still runs normally. It just does not get the OOM bias.
  • Plugin boundary kept clean. Browser plugin code uses openclaw/plugin-sdk/process-runtime; it does not deep-import core internals.
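The shell env hardening note can be illustrated with a small pure function that drops the three variables before the wrapped spawn. The names `SHELL_STARTUP_VARS` and `hardenShellEnv` are hypothetical; how the real helper builds the child environment is not shown in the PR text.

```typescript
// Variables the PR says are stripped from wrapped spawns.
const SHELL_STARTUP_VARS = ["BASH_ENV", "ENV", "CDPATH"] as const;

// Return a copy of the environment without the shell startup variables.
// The caller's object is left untouched.
function hardenShellEnv(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const out: Record<string, string | undefined> = { ...env };
  for (const name of SHELL_STARTUP_VARS) {
    delete out[name];
  }
  return out;
}
```

Returning a copy rather than mutating the input keeps the parent's own environment unchanged, which matters if the same object is reused for spawns that are not wrapped.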

Scope Boundary / Related Work

This PR is intentionally a kernel victim-selection fix. It does not try to solve every child-process OOM class.

Related issues/PRs that remain separate work:

  • #70400, #70389, #69145, #64169, #64984: MCP stdio/runtime lifecycle leaks. This PR makes leaked or transient MCP children better OOM victims than the gateway, but it does not replace proper runtime disposal and transport shutdown ordering.
  • #70270, #55698, #30130, #31504: browser/Chrome renderer cleanup and container hardening. This PR covers launched browser process trees with the OOM bias, but stale renderer cleanup/resource caps remain separate lifecycle work.
  • #23409, #28629: broader child resource controls such as cgroup v2 limits, systemd MemoryMax=, spawn caps, and watchdogs. Those are stronger resource-governance features and should not be folded into this focused fix.
  • #68680, #69242: SIGKILL observability. Once children are intentionally preferred OOM victims, surfacing signal-killed subprocesses clearly becomes more useful, but it is an independent reporting improvement.
  • #52205, #47776: process-group and orphan cleanup. The shim uses exec, so it preserves the existing process-tree cleanup model rather than changing it.

Documentation

Added Linux docs for OOM victim selection, covered child process surfaces, opt-out env values, and /proc/<pid>/oom_score_adj verification:

  • docs/platforms/linux.md
  • docs/vps.md

Live Linux Docker Validation

Ran on node:22-bookworm inside Docker and verified real /proc/<pid>/oom_score_adj values for all covered spawn paths:

  • direct shared helper wrapped spawn: 1000
  • direct helper opt-out with OPENCLAW_CHILD_OOM_SCORE_ADJ=0: 0
  • supervisor child adapter: 1000
  • PTY adapter: 1000
  • MCP stdio transport: 1000
  • browser launch path with a fake Chrome executable: 1000

Also ran a cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, a gateway-like parent holding ~179 MB RSS, and a child allocating memory in 4 MB chunks:

  • baseline/no wrapper: child inherited oom_score_adj=0; the parent/container was killed with exit 137 while the child was around 141 MB RSS.
  • wrapper enabled: child had oom_score_adj=1000; the child was killed with SIGKILL while the parent stayed alive at ~179 MB RSS.

This live pass also caught a portability bug in the earlier wrapper: Debian's /bin/sh is dash and rejects exec --. The PR now uses portable exec "$0" "$@" and skips wrapping leading-dash command names.

Tests Run

  • pnpm docs:list
  • pnpm test src/process/linux-oom-score.test.ts src/process/supervisor/adapters/child.test.ts src/process/supervisor/adapters/pty.test.ts src/agents/mcp-stdio-transport.test.ts extensions/browser/src/browser/chrome.internal.test.ts
  • node scripts/run-vitest.mjs run --config test/vitest/vitest.extension-browser.config.ts extensions/browser/src/browser/chrome.internal.test.ts
  • pnpm tsgo:prod
  • pnpm plugin-sdk:check-exports
  • pnpm plugin-sdk:api:check
  • pnpm check:changed
  • Linux Docker live harness against node:22-bookworm verifying /proc/<pid>/oom_score_adj for helper, opt-out, supervisor child, PTY, MCP stdio, and browser launch paths.
  • Linux Docker cgroup memory-pressure simulation with --memory=256m --memory-swap=256m, confirming the wrapper changes victim selection from parent/container to child.

Note: after the full pnpm check:changed passed locally on the prior commit, later repeated pnpm check:changed / combined targeted test invocations hit a Vitest unit-fast process stuck at 0% CPU. The focused test lanes above were rerun split by lane and passed.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • docs/.generated/plugin-sdk-api-baseline.sha256 (modified, +2/-2)
  • docs/platforms/linux.md (modified, +37/-0)
  • docs/vps.md (modified, +3/-0)
  • extensions/browser/src/browser/chrome.ts (modified, +6/-2)
  • src/agents/mcp-stdio-transport.test.ts (modified, +20/-3)
  • src/agents/mcp-stdio-transport.ts (modified, +12/-5)
  • src/plugin-sdk/process-runtime.ts (modified, +2/-0)
  • src/process/linux-oom-score.test.ts (added, +105/-0)
  • src/process/linux-oom-score.ts (added, +143/-0)
  • src/process/supervisor/adapters/child.test.ts (modified, +53/-1)
  • src/process/supervisor/adapters/child.ts (modified, +7/-2)
  • src/process/supervisor/adapters/pty.test.ts (modified, +55/-8)
  • src/process/supervisor/adapters/pty.ts (modified, +5/-2)

Issue body

Summary

Subagent runs can leak bundle-MCP child processes because the cleanup path is not reliably enabled for :subagent: sessions, and even when cleanup runs, stdio MCP child shutdown order is wrong.

In real usage this leaves lingering minimax-mcp, minimax-coding-plan-mcp, and codex mcp-server processes under openclaw-gateway, causing steady RSS growth and eventually system instability.

Environment

  • OpenClaw: 2026.4.8 (reproduced on Linux x86_64, global pnpm install)
  • Gateway mode: long-running openclaw-gateway user service
  • Subagent sessions: agent:<agentId>:subagent:<id>
  • MCP servers involved: stdio transports (minimax-mcp, minimax-coding-plan-mcp, codex mcp-server)

Reproduction

  1. Start openclaw-gateway with stdio MCP servers configured in bundle MCP.
  2. Trigger a subagent run against one child session, wait for completion.
  3. Trigger another subagent run against a different :subagent: session.
  4. Inspect child processes under the gateway PID.
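For step 4, one way to get counts comparable to the reporter's JSON is to parse process-listing text (for example the output of `ps -eo pid,ppid,args`) and count matching direct children of the gateway PID. This is a sketch under assumptions: the function names are hypothetical, matching is by simple substring on the command line, and only direct children are counted.

```typescript
interface ProcRow {
  pid: number;
  ppid: number;
  args: string;
}

// Parse "PID PPID ARGS" rows; header or malformed lines are skipped.
function parsePsOutput(text: string): ProcRow[] {
  const rows: ProcRow[] = [];
  for (const line of text.split("\n")) {
    const m = /^\s*(\d+)\s+(\d+)\s+(.+)$/.exec(line);
    if (m) {
      rows.push({ pid: Number(m[1]), ppid: Number(m[2]), args: m[3] ?? "" });
    }
  }
  return rows;
}

// Keys mirror the reporter's JSON; values are the process names
// mentioned in the issue, matched as substrings of the command line.
const MCP_MARKERS = {
  minimax: "minimax-mcp",
  minimax_coding: "minimax-coding-plan-mcp",
  codex: "codex mcp-server",
};

type McpCounts = { minimax: number; minimax_coding: number; codex: number };

// Count direct children of the gateway whose command line matches.
function countBundleMcpChildren(rows: ProcRow[], gatewayPid: number): McpCounts {
  const counts: McpCounts = { minimax: 0, minimax_coding: 0, codex: 0 };
  for (const row of rows) {
    if (row.ppid !== gatewayPid) continue;
    if (row.args.includes(MCP_MARKERS.minimax)) counts.minimax += 1;
    if (row.args.includes(MCP_MARKERS.minimax_coding)) counts.minimax_coding += 1;
    if (row.args.includes(MCP_MARKERS.codex)) counts.codex += 1;
  }
  return counts;
}
```

Taking one snapshot before the first run and one after each completed run gives the before/after comparison described in the report.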

Actual behavior

The count of bundle MCP child processes grows across subagent sessions instead of being reclaimed.

Observed pattern before patch:

{
  "before": { "minimax": 0, "minimax_coding": 0, "codex": 0 },
  "first.after_wait": { "minimax": 1, "minimax_coding": 1, "codex": 1 },
  "second.after_wait": { "minimax": 2, "minimax_coding": 2, "codex": 2 }
}

This accumulation was enough to contribute to OOM pressure in production.

Expected behavior

Subagent-scoped bundle MCP runtimes should be disposed after run end, and stdio MCP child processes should be terminated reliably, so repeated runs across different subagent sessions do not accumulate lingering child processes.

Root cause found locally

There appear to be two issues:

  1. cleanupBundleMcpOnRunEnd is not reliably enabled for remote subagent runs.
    • The subagent path can be identified by resolvedSessionKey containing ":subagent:", but relying only on request.lane === "subagent" is insufficient in practice.
  2. In bundle MCP runtime cleanup, disposeSession(session) closes the MCP client before closing the transport:
     await session.client.close().catch(() => {});
     await session.transport.close().catch(() => {});

For stdio transports, reversing the order to close the transport first made child-process reclamation behave correctly in our live test.

Local patch that fixed it for us

  1. Ensure cleanup is enabled for subagent sessions in the gateway ingress path, e.g. also detect:
     Boolean(resolvedSessionKey?.includes(":subagent:"))
  2. In bundle MCP cleanup, change order to:
     await session.transport.close().catch(() => {});
     await session.client.close().catch(() => {});

Verification

After applying both changes locally:

  • Different subagent sessions no longer accumulated 1 -> 2 -> 3 lingering MCP children.
  • After completion and a short delay, gateway MCP child count returned to zero.

Related context

This feels adjacent to other subagent wrapper / lane issues, but this report is specifically about bundle MCP child-process leakage and subagent cleanup not being reliably applied.

extent analysis

TL;DR

Enable cleanup for subagent sessions and reverse the order of closing transport and client in bundle MCP runtime cleanup to prevent child process leakage.

Guidance

  • Identify subagent sessions by checking if resolvedSessionKey contains ":subagent:" to ensure cleanup is enabled.
  • Modify the bundle MCP cleanup to close the transport before closing the client, using the corrected order: `await session.transport.close().catch(() => {}); await session.client.close().catch(() => {});`


