codex - 💡(How to fix) Fix MCP child processes leak when McpConnectionManager is replaced [4 comments, 4 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
openai/codex#18881Fetched 2026-04-22 07:51:08
View on GitHub
Comments
4
Participants
4
Timeline
9
Reactions
0
Author
Timeline (top)
commented ×4labeled ×3cross-referenced ×1unlabeled ×1

Root Cause

McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.

StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:

  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group

But StdioServerTransport is held inside RunningService, which is held inside Arc<RunningService<...>> inside ClientState::Ready, which is held inside RmcpClient, which is wrapped by AsyncManagedClient as a cloneable shared future:

  • codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
  • codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
  • codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation

Every in-flight run_service_operation_once clones the Arc. Dropping the manager drops only its reference; any concurrent or detached task holding an Arc clone keeps the transport, and therefore the child PID, alive indefinitely. The Shared<BoxFuture<...>> at AsyncManagedClient.client additionally retains the completed ManagedClient for replay; any other clone of that Shared extends client lifetime past manager drop.

Fix Action

Fix / Workaround

Operational workaround

Code Example

let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()

---

let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()
RAW_BUFFERClick to expand / collapse

What issue are you seeing?

A long-running codex --dangerously-bypass-approvals-and-sandbox daemon leaks stdio MCP child processes over time. A single daemon with roughly 15 hours of uptime accumulated 492 orphaned MCP children, 123 of each of 4 configured stdio MCP servers, all direct children (PPID) of the daemon process. Each leaked child was busy-looping at roughly 35% CPU, and cumulative RSS of the leaked set was about 82 GB. The spawn/leak rate was approximately one full cycle per 7 minutes.

Configured stdio MCP servers in the affected daemon:

  • mcp-codebase-index
  • codebase-memory-mcp
  • github-mcp-server --toolsets actions,issues,pull_requests,repos stdio
  • ast-grep-server

Confirmed via ps:

  • 492 processes with PPID equal to the daemon PID
  • exactly 123 instances per configured server (123 × 4 = 492)
  • killing the daemon PID reaped all 492 children immediately, ruling out a non-codex parent

What steps can reproduce the bug?

I do not have a minimal isolated repro yet. Observed on a long-lived daemon with four configured stdio MCP servers. Based on code inspection the leak should reproduce by driving either of these paths repeatedly:

  1. Session MCP refresh, via refresh_mcp_servers_inner in codex-rs/core/src/session/mcp.rs:177.
  2. Accessible-connectors refresh, via compute_accessible_connectors in codex-rs/core/src/connectors.rs (reached through list_accessible_connectors_from_mcp_tools*).

Both paths construct a fresh McpConnectionManager (which spawns all configured MCP server processes), perform tool listing, then release the manager. Field leak rate (about one cycle per 7 minutes across 15 hours) matches "one refresh cycle per interval" rather than "one per daemon start".

A targeted integration test would configure a stub stdio MCP server that records its own PID at startup, drive the refresh path N times, and assert that previous PIDs are no longer alive by the time the N-th refresh completes.

What is the expected behavior?

When an McpConnectionManager is replaced or dropped, every MCP child process it spawned should be terminated (SIGTERM, escalating to SIGKILL after a bounded grace period) during that drop. Teardown should not depend on Arc<RunningService> refcount reaching zero, because cloned Arcs legitimately escape the manager's ownership boundary during in-flight operations.

Additional information

Root cause

McpConnectionManager (codex-rs/codex-mcp/src/mcp_connection_manager.rs:657) has no Drop impl. Child-process cleanup is entirely downstream of Arc<RunningService> refcount reaching zero.

StdioServerTransport owns both a TokioChildProcess with kill_on_drop(true) and a ProcessGroupGuard that SIGTERMs the child's process group on drop:

  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:211, command.kill_on_drop(true)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:218, command.process_group(0)
  • codex-rs/rmcp-client/src/stdio_server_launcher.rs:290-296, impl Drop for ProcessGroupGuard calling terminate_process_group

But StdioServerTransport is held inside RunningService, which is held inside Arc<RunningService<...>> inside ClientState::Ready, which is held inside RmcpClient, which is wrapped by AsyncManagedClient as a cloneable shared future:

  • codex-rs/codex-mcp/src/mcp_connection_manager.rs:478-483, AsyncManagedClient { client: Shared<BoxFuture<..., Result<ManagedClient, _>>>, ... }
  • codex-rs/rmcp-client/src/rmcp_client.rs:494, pub struct RmcpClient
  • codex-rs/rmcp-client/src/rmcp_client.rs:1014-1049, run_service_operation clones the Arc<RunningService> into every in-flight operation

Every in-flight run_service_operation_once clones the Arc. Dropping the manager drops only its reference; any concurrent or detached task holding an Arc clone keeps the transport, and therefore the child PID, alive indefinitely. The Shared<BoxFuture<...>> at AsyncManagedClient.client additionally retains the completed ManagedClient for replay; any other clone of that Shared extends client lifetime past manager drop.

Why the observed symptoms match this mechanism

  • PPID equals daemon: LocalStdioServerLauncher::launch_server spawns via tokio::process::Command directly in the daemon (stdio_server_launcher.rs:209-225). No subagent intermediary.
  • 123 of each of 4 servers: each McpConnectionManager::new(...) call spawns one process per configured server, so multiples of 4 are expected.
  • Roughly 35% CPU per leaked child: consistent with the child's stdio reader loop spinning after the parent dropped its pipe end without delivering EOF or SIGTERM.
  • Scales with uptime, not startup: the daemon boots once; refresh paths fire repeatedly.

Replacement sites with no teardown

codex-rs/core/src/session/mcp.rs:200-229:

let (refreshed_manager, cancel_token) = McpConnectionManager::new(...).await;
...
let mut manager = self.services.mcp_connection_manager.write().await;
*manager = refreshed_manager;        // old manager dropped here, no shutdown()

codex-rs/core/src/connectors.rs:238-320 (compute_accessible_connectors):

let (mcp_connection_manager, cancel_token) = McpConnectionManager::new(...).await;
...
// function returns; manager falls out of scope, no shutdown()

Proposed fix direction

A. Explicit async shutdown. Add pub async fn shutdown(&mut self) on McpConnectionManager that drives each AsyncManagedClient to readiness and calls a new RmcpClient::shutdown() which closes the transport and kills the process group. Invoke shutdown() at both the replacement and the scope-exit sites.

B. Manager-level Drop plus hoisted PGID. Hoist the child's process group ID out of the StdioServerTransport and Arc<RunningService> chain and store it at the AsyncManagedClient level, outside the Arc. Add impl Drop for McpConnectionManager that iterates self.clients and calls terminate_process_group on each stored PGID synchronously (the existing ProcessGroupGuard::drop is already sync via libc kill, so no async is required). This handles panic-driven drops and requires no caller changes.

Option B more closely matches the invariant that the manager owns these processes' lifetime. Per docs/contributing.md external PRs are invitation-only. Happy to prepare one if helpful.

Operational workaround

Restart the daemon to reap leaked children.

Environment

  • Platform: Linux, kernel 6.17.0-20-generic
  • Codex: codex-cli 0.122.0
  • MCP transports: 4 configured stdio servers (listed above)

extent analysis

TL;DR

Implement an explicit async shutdown or add a Drop implementation to McpConnectionManager to ensure child processes are terminated when the manager is replaced or dropped.

Guidance

  1. Investigate McpConnectionManager shutdown: Implement pub async fn shutdown(&mut self) to drive each AsyncManagedClient to readiness and call RmcpClient::shutdown() to close the transport and kill the process group.
  2. Verify Drop implementation: Consider adding impl Drop for McpConnectionManager to iterate self.clients and call terminate_process_group on each stored PGID synchronously.
  3. Test shutdown and replacement sites: Ensure shutdown() is invoked at both replacement and scope-exit sites, such as codex-rs/core/src/session/mcp.rs:200-229 and codex-rs/core/src/connectors.rs:238-320.
  4. Monitor for leaked processes: After implementing the shutdown or Drop solution, monitor the system for leaked processes to verify the fix.

Example

// Example implementation of McpConnectionManager shutdown
impl McpConnectionManager {
    pub async fn shutdown(&mut self) {
        for client in self.clients.iter_mut() {
            client.shutdown().await;
        }
    }
}

// Example implementation of RmcpClient shutdown
impl RmcpClient {
    pub async fn shutdown(&mut self) {
        self.transport.close().await;
        self.process_group.terminate();
    }
}

Notes

The proposed fix direction involves either explicit async shutdown or adding a Drop implementation to McpConnectionManager. The choice between these approaches depends on the specific requirements and constraints of the system.

Recommendation

Apply workaround: Implement an explicit async shutdown for MpConnectionManager to ensure child processes are terminated when the manager is replaced or dropped. This approach provides a more explicit and controlled shutdown mechanism.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING