hermes - 💡(How to fix) Fix: MCP multi-server coroutine timeout causes event loop starvation and zombie processes [1 comment, 2 participants]

GitHub stats
NousResearch/hermes-agent#20269 (fetched 2026-05-06 06:37:42)
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 0
Timeline (top): labeled ×3, commented ×1

When using multiple MCP servers with coroutines, timed-out connection attempts leave zombie coroutines running on the shared background event loop. These coroutines hold ClientSession references and subprocess handles open indefinitely, starving the event loop and causing all subsequent tool calls to time out. Orphaned node/npx subprocesses accumulate as zombie processes.

Root Cause

future.cancel() in MCPManager._run_async() does not stop a running coroutine—it only abandons the caller's wait. The coroutine continues executing on the background loop, holding resources and blocking new requests.
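
The "abandoned wait" half of this can be illustrated with a minimal sketch. It assumes (the issue does not show the code) that _run_async submits work to a background loop with asyncio.run_coroutine_threadsafe and waits on the returned future with a timeout. The names start_background_loop, slow_connect, and state are hypothetical and exist only for this demonstration.

```python
import asyncio
import concurrent.futures
import threading


def start_background_loop():
    """Run a dedicated event loop on a daemon thread (stand-in for the shared loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def slow_connect(state):
    """Hypothetical slow connection attempt that records whether it is still alive."""
    state["running"] = True
    try:
        await asyncio.sleep(0.5)  # stands in for a slow or unreachable server
        state["finished"] = True
    finally:
        state["running"] = False


def demo():
    loop = start_background_loop()
    state = {"running": False, "finished": False}
    future = asyncio.run_coroutine_threadsafe(slow_connect(state), loop)
    try:
        future.result(timeout=0.1)  # the caller gives up after 0.1 s
        timed_out = False
    except concurrent.futures.TimeoutError:
        timed_out = True
    # The caller has stopped waiting, but nothing has stopped the coroutine.
    still_running = state["running"]
    future.result(timeout=2)  # wait so the demo can observe completion
    loop.call_soon_threadsafe(loop.stop)
    return timed_out, still_running, state["finished"]
```

Here the caller's timeout fires while the coroutine carries on to completion on the background loop. Whether a later cancellation request actually reaches the coroutine depends on how the future was created and on whether the coroutine is suspended at an await point, so the sketch deliberately stops at the timeout.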

Affected Components

  • MCPManager._run_async (openzess/backend/mcp_manager.py)
  • MCPClientManager.start_server (wintermute/integrations/mcp_runtime.py)
  • MCPRuntime.make_handler (wintermute/integrations/mcp_runtime.py)
  • McpHub singleton lifecycle (auto-coder/src/autocoder/common/mcp_hub.py)
  • stdio transport / subprocess lifecycle

Reproduction

  1. Start backend with empty MCP registry
  2. Connect two MCP servers simultaneously—one fast (stdio), one slow/unreachable
  3. Wait 15 seconds for timeout on slow server
  4. Attempt tool call on fast server → times out (event loop starved by zombie coroutine)
  5. Repeat tool calls → progressive degradation, all timeout
  6. Check processes → orphaned node/npx subprocesses still running
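
Step 6 can be scripted. The sketch below is one possible check, not part of the original report: it treats a node or npx process whose parent PID is 1 as orphaned, which is a heuristic that assumes orphans are reparented to init (containers and subreapers may behave differently). The helper names find_orphans and list_orphans are hypothetical.

```python
import subprocess


def find_orphans(ps_output, names=("node", "npx")):
    """Return PIDs of processes named in `names` whose parent PID is 1.

    Expects `ps -A -o pid,ppid,comm` style output with one header line.
    """
    orphans = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, ppid, comm = parts
        name = comm.strip().rsplit("/", 1)[-1]  # drop any leading path
        if ppid == "1" and name in names:
            orphans.append(int(pid))
    return orphans


def list_orphans():
    """Run ps on a Unix-like system and report suspected orphaned processes."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,ppid,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_orphans(result.stdout)
```

Calling list_orphans() before step 2 and again after step 5 gives a rough count of subprocesses left behind by timed-out connections.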

Related Issues

  • bytedance/deer-flow#2615: Event loop closed due to globally-cached async clients
  • bytedance/deer-flow#2627: Fix using persistent event loop + run_coroutine_threadsafe
  • Fluid-AI/fluidmcp#515: MCP stdio/subprocess lifecycle failures
  • dhyansraj/mcp-mesh#851: Zombie probe detection and lifecycle races
  • motsognirr/olmlx#263: MCP tool timeout and connection retry patterns

extent analysis

TL;DR

Calling future.cancel() in MCPManager._run_async() does not stop the running coroutine, which leaves zombie coroutines behind and starves the event loop, so the coroutine needs to be cancelled explicitly and awaited until its cleanup finishes.

Guidance

  • Investigate calling task.cancel() on the asyncio.Task and then awaiting the task (or passing it to asyncio.wait()) so the coroutine finishes its cleanup before the caller moves on.
  • Review the MCPManager._run_async() method to ensure it handles coroutine cancellation correctly and releases resources.
  • Consider implementing a timeout and retry mechanism for MCP server connections to prevent indefinite resource holding.
  • Look into using a more robust event loop management system to prevent starvation and resource leaks.

Example

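import asyncio

async def _run_async(self, coroutine, timeout=10):
    # Wrap the coroutine in a Task so it can be cancelled explicitly.
    task = asyncio.ensure_future(coroutine)
    try:
        # shield() stops wait_for from cancelling the task itself,
        # so the cancellation and cleanup below are explicit.
        return await asyncio.wait_for(asyncio.shield(task), timeout=timeout)
    except asyncio.TimeoutError:
        task.cancel()
        try:
            await task  # Let the coroutine run its cleanup (finally blocks)
        except asyncio.CancelledError:
            pass
        raise  # Surface the timeout to the caller instead of returning None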
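
The example above only works when the caller is itself running on the event loop. For a synchronous caller on another thread, the same idea might look like the following sketch. It assumes a shared background loop driven by run_forever on its own thread. The names start_background_loop, run_bounded, and slow_connect are hypothetical and are not taken from the hermes-agent code.

```python
import asyncio
import concurrent.futures
import threading


def start_background_loop():
    """Run a dedicated event loop on a daemon thread (stand-in for the shared loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def run_bounded(loop, coro, timeout, grace=1.0):
    """Run coro on loop from a synchronous caller, cancelling it on timeout."""

    async def bounded():
        # wait_for cancels coro on timeout and waits for that cancellation
        # to finish, so finally blocks inside coro run before this returns.
        return await asyncio.wait_for(coro, timeout)

    future = asyncio.run_coroutine_threadsafe(bounded(), loop)
    try:
        # The grace period gives the coroutine's cleanup time to complete.
        return future.result(timeout + grace)
    except concurrent.futures.TimeoutError:
        # Reached when cleanup overruns the grace period. On Python 3.11+
        # the wait_for timeout also lands here, where cancel() is a no-op.
        future.cancel()
        raise


async def slow_connect(events):
    """Hypothetical slow connection attempt with cleanup in a finally block."""
    try:
        await asyncio.sleep(30)
    finally:
        events.append("cleaned up")
```

Enforcing the timeout inside the loop rather than only in the calling thread means the coroutine is cancelled where it runs, and the caller sees the timeout only after the cleanup code has had a chance to execute.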

Notes

The provided solution is a starting point and may need adapting to the specific use case. asyncio behavior also differs between Python versions (for example, asyncio.TimeoutError became an alias of the built-in TimeoutError in Python 3.11), so check the implementation against the version in use.

Recommendation

Apply a workaround by implementing a proper cancellation mechanism for coroutines in MCPManager._run_async() to prevent resource starvation and zombie coroutines. This will help mitigate the issue until a more permanent fix can be implemented.
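
Cancelling the coroutine only helps with the orphaned node/npx processes if the coroutine also shuts its subprocess down on the way out. The sketch below shows one way to do that with asyncio's subprocess API. It is an illustration under assumptions, not the project's stdio transport: run_server_session and on_started are hypothetical names, and a sleeping Python child stands in for an MCP server.

```python
import asyncio
import sys


async def run_server_session(cmd, on_started=None):
    """Spawn a child process and guarantee it is reaped, even on cancellation."""
    proc = await asyncio.create_subprocess_exec(*cmd)
    if on_started is not None:
        on_started(proc)
    try:
        await proc.wait()  # stands in for the lifetime of the MCP session
    finally:
        if proc.returncode is None:
            proc.terminate()  # ask the child to exit
            try:
                await asyncio.wait_for(proc.wait(), timeout=2)
            except asyncio.TimeoutError:
                proc.kill()  # escalate if it ignores the request
                await proc.wait()  # reap it so no zombie is left behind


async def demo():
    procs = []
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    try:
        await asyncio.wait_for(run_server_session(cmd, procs.append), timeout=1)
    except asyncio.TimeoutError:
        pass
    return procs[0].returncode  # set once the child has been reaped
```

One gap remains in this sketch: if the cancellation arrives while create_subprocess_exec is still starting the child, the finally block is never entered, so a production version would need to cover that window as well.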
