codex - 💡(How to fix) Fix MCP child processes leak when McpConnectionManager is replaced [4 comments, 4 participants]

7etsuo · 2026-04-21T20:29:08Z

[codex] What issue are you seeing? A long-running codex --dangerously-bypass-approvals-and-sandbox daemon leaks stdio MCP child processes over time. A single d… ## Fix / Workaround ### Operational workaround ## What issue are you seeing? A long-running `codex --dangerously-bypass-approvals-and-sandbox` daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 orphaned MCP children, 123 of each of 4 configured stdio MCP servers, all direct children (PPID) of the daemon process. Each leaked child was busy-looping at roughly 35% CPU, and cumulative RSS of the leaked set was about 82 GB. The spawn/leak rate was approximately one full cycle per 7 minutes. Configured stdio MCP servers in the affected daemon: - `mcp-codebase-index` - `codebase-memory-mcp` - `github-mcp-server --toolsets actions,issues,pull_requests,repos stdio` - `ast-grep-server` Confirmed via `ps`: - 492 processes with PPID equal to the daemon PID - exactly 123 instances per configured server (123 × 4 = 492) - killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent ## What steps can reproduce the bug? I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly: 1. Session MCP refresh, via `refresh_mcp_servers_inner` in `codex-rs/core/src/session/mcp.rs:177`. 2. Accessible-connectors refresh, via `compute_accessible_connectors` in `codex-rs/core/src/connectors.rs` (reached through `list_accessible_connectors_from_mcp_tools*`). Both paths construct a fresh `McpConnectionManager` (which spawns all configured MCP server processes), perform tool listing, then release the manager. Field leak rate (about one cycle per 7 minutes across 15 hours) matches "one refresh cycle per interval" rather than "one per daemon start". A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes. ## What is the expected behavior? When an `McpConnectionManager` is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on `Arc ` refcount reaching zero, because cloned `Arc`s legitimately escape the manager's ownership boundary during in-flight operations. ## Additional information ### Root cause `McpConnectionManager` (`codex-rs/codex-mcp/src/mcp_connection_manager.rs:657`) has no `Drop` impl. Child-process cleanup is entirely downstream of `Arc ` refcount reaching zero. `StdioServerTransport` owns both a `TokioChildProcess` with `kill_on_drop(true)` and a `ProcessGroupGuard` that SIGTERMs the child's process group on drop: - `codex-rs/rmcp-client/src/stdio_server_launcher.rs:211`, `command.kill_on_drop(true)` - `codex-rs/rmcp-client/src/stdio_server_launcher.rs:218`, `command.process_group(0)` - `codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296`, `impl Drop for ProcessGroupGuard` calling `terminate_process_group` But `StdioServerTransport` is held inside `RunningService`, which is held inside `Arc >` inside `ClientState::Ready`, which is held inside `RmcpClient`, which is wrapped by `AsyncManagedClient` as a cloneable shared future: - `codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483`, `AsyncManagedClient { client: Shared >>, ... }` - `codex-rs/rmcp-client/src/rmcp_client.rs:494`, `pub struct RmcpClient` - `codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049`, `run_service_operation` clones the `Arc ` into every in-flight operation Every in-flight `run_service_operation_once` clones the `Arc`. Dropping the manager drops only its reference; any concurrent or detached task holding an `Arc` clone keeps the transport, and therefore the child PID, alive indefinitely. The `Shared >` at `AsyncManagedClient.client` additionally retains the completed `ManagedClient` for replay; any other clone of that `Shared` extends client lifetime past manager drop. ### Why the observed symptoms match this mechanism - **PPID equals daemon**: `LocalStdioServerLauncher::launch_server` spawns via `tokio::process::Command` directly in the daemon (`stdio_server_launcher.rs:209-225`). No subagent intermediary. - **123 of each of 4 servers**: each `McpConnectionManager::new(...)` call spawns one process per configured server, so multiples of 4 are expected. - **Roughly 35% CPU per leaked child**: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM. - **Scales with uptime, not startup**: the daemon boots once; refresh paths fire repeatedly. ### Replacement sites with no te

codex2026-04-21 20:29:08

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

openai/codex#18881•Fetched 2026-04-22 07:51:08

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×4labeled ×3cross-referenced ×1unlabeled ×1

Root Cause

McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.

StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:

codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group

But StdioServerTransport is held inside RunningService, which is held inside Arc<RunningService<...>> inside ClientState::Ready, which is held inside RmcpClient, which is wrapped by AsyncManagedClient as a cloneable shared future:

codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation

Every in-flight run_service_operation_once clones the Arc. Dropping the manager drops only its reference; any concurrent or detached task holding an Arc clone keeps the transport, and therefore the child PID, alive indefinitely. The Shared<BoxFuture<...>> at AsyncManagedClient.client additionally retains the completed ManagedClient for replay; any other clone of that Shared extends client lifetime past manager drop.

Fix Action

Fix / Workaround

Operational workaround

Code Example

let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()

---

let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()

RAW_BUFFERClick to expand / collapse

What issue are you seeing?

A long-running codex --dangerously-bypass-approvals-and-sandbox daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 orphaned MCP children, 123 of each of 4 configured stdio MCP servers, all direct children (PPID) of the daemon process. Each leaked child was busy-looping at roughly 35% CPU, and cumulative RSS of the leaked set was about 82 GB. The spawn/leak rate was approximately one full cycle per 7 minutes.

Configured stdio MCP servers in the affected daemon:

mcp-codebase-index
codebase-memory-mcp
github-mcp-server --toolsets actions,issues,pull_requests,repos stdio
ast-grep-server

Confirmed via ps:

492 processes with PPID equal to the daemon PID
exactly 123 instances per configured server (123 × 4 = 492)
killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent

What steps can reproduce the bug?

I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly:

Session MCP refresh, via refresh_mcp_servers_inner in codex-rs/core/src/session/mcp.rs:177.
Accessible-connectors refresh, via compute_accessible_connectors in codex-rs/core/src/connectors.rs (reached through list_accessible_connectors_from_mcp_tools*).

Both paths construct a fresh McpConnectionManager (which spawns all configured MCP server processes), perform tool listing, then release the manager. Field leak rate (about one cycle per 7 minutes across 15 hours) matches "one refresh cycle per interval" rather than "one per daemon start".

A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes.

What is the expected behavior?

When an McpConnectionManager is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on Arc<RunningService> refcount reaching zero, because cloned Arcs legitimately escape the manager's ownership boundary during in-flight operations.

Additional information

Root cause

McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.

StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:

codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group

codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation

Why the observed symptoms match this mechanism

PPID equals daemon: LocalStdioServerLauncher::launch_server spawns via tokio::process::Command directly in the daemon (stdio_server_launcher.rs:209-225). No subagent intermediary.
123 of each of 4 servers: each McpConnectionManager::new(...) call spawns one process per configured server, so multiples of 4 are expected.
Roughly 35% CPU per leaked child: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM.
Scales with uptime, not startup: the daemon boots once; refresh paths fire repeatedly.

Replacement sites with no teardown

codex-rs/core/src/session/mcp.rs:200-229:

let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()

codex-rs/core/src/connectors.rs:238-320 (compute_accessible_connectors):

let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()

Proposed fix direction

A. Explicit async shutdown. Add pub async fn shutdown(&mut self) on McpConnectionManager that drives each AsyncManagedClient to readiness and calls a new RmcpClient::shutdown() which closes the transport and kills the process group. Invoke shutdown() at both the replacement and the scope-exit sites.

B. Manager-level Drop plus hoisted PGID. Hoist the child's process group ID out of the StdioServerTransport and Arc<RunningService> chain and store it at the AsyncManagedClient level, outside the Arc. Add impl Drop for McpConnectionManager that iterates self.clients and calls terminate_process_group on each stored PGID synchronously (the existing ProcessGroupGuard::drop is already sync via libc kill, so no async is required). This handles panic-driven drops and requires no caller changes.

Option B more closely matches the invariant that the manager owns these processes' lifetime. Per docs/contributing.md external PRs are invitation-only. Happy to prepare one if helpful.

Operational workaround

Restart the daemon to reap leaked children.

Environment

Platform: Linux, kernel 6.17.0-20-generic
Codex: codex-cli 0.122.0
MCP transports: 4 configured stdio servers (listed above)

extent analysis

TL;DR

Implement an explicit async shutdown or add a Drop implementation to McpConnectionManager to ensure child processes are terminated when the manager is replaced or dropped.

Guidance

Investigate McpConnectionManager shutdown: Implement pub async fn shutdown(&mut self) to drive each AsyncManagedClient to readiness and call RmcpClient::shutdown() to close the transport and kill the process group.
Verify Drop implementation: Consider adding impl Drop for McpConnectionManager to iterate self.clients and call terminate_process_group on each stored PGID synchronously.
Test shutdown and replacement sites: Ensure shutdown() is invoked at both replacement and scope-exit sites, such as codex-rs/core/src/session/mcp.rs:200-229 and codex-rs/core/src/connectors.rs:238-320.
Monitor for leaked processes: After implementing the shutdown or Drop solution, monitor the system for leaked processes to verify the fix.

Example

// Example implementation of McpConnectionManager shutdown
impl McpConnectionManager {
    pub async fn shutdown(&mut self) {
        for client in self.clients.iter_mut() {
            client.shutdown().await;
        }
    }
}

// Example implementation of RmcpClient shutdown
impl RmcpClient {
    pub async fn shutdown(&mut self) {
        self.transport.close().await;
        self.process_group.terminate();
    }
}

Notes

The proposed fix direction involves either explicit async shutdown or adding a Drop implementation to McpConnectionManager. The choice between these approaches depends on the specific requirements and constraints of the system.

Recommendation

Apply workaround: Implement an explicit async shutdown for MpConnectionManager to ensure child processes are terminated when the manager is replaced or dropped. This approach provides a more explicit and controlled shutdown mechanism.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#container setup #orchestration issue #cache issue #memory leak #API versioning

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

codex - 💡(How to fix) Fix MCP child processes leak when McpConnectionManager is replaced [4 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Operational workaround

Code Example

What issue are you seeing?

What steps can reproduce the bug?

What is the expected behavior?

Additional information

Root cause

Why the observed symptoms match this mechanism

Replacement sites with no teardown

Proposed fix direction

Operational workaround

Environment

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

codex - 💡(How to fix) Fix MCP child processes leak when McpConnectionManager is replaced [4 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Operational workaround

Code Example

What issue are you seeing?

What steps can reproduce the bug?

What is the expected behavior?

Additional information

Root cause

Why the observed symptoms match this mechanism

Replacement sites with no teardown

Proposed fix direction

Operational workaround

Environment

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING