hermes - 💡(How to fix) Fix MCP multi-server coroutine timeout causes event loop starvation and zombie processes [1 comments, 2 participants]

hermes · 2026-05-05 15:17:22

GitHub stats

NousResearch/hermes-agent#20269 • Fetched 2026-05-06 06:37:42

Author

aircode610

Participants

aircode610

alt-glitch

Timeline (top)

labeled ×3, commented ×1

When using multiple MCP servers with coroutines, timed-out connection attempts leave zombie coroutines running on the shared background event loop. These coroutines hold ClientSession references and subprocess handles open indefinitely, starving the event loop and causing all subsequent tool calls to time out. Orphaned node/npx subprocesses accumulate as zombie processes.

Root Cause

According to the report, calling future.cancel() in MCPManager._run_async() does not stop a coroutine that is already running; it only abandons the caller's wait. The coroutine keeps executing on the background loop, holding resources and blocking new requests.
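The "abandoned wait" half of this can be sketched in isolation. The issue does not show the body of _run_async(), so the sketch below assumes it bridges synchronous callers to a background loop with asyncio.run_coroutine_threadsafe(); slow_connect, start_background_loop and demo_abandoned_wait are hypothetical names used only for illustration. It shows that a timed-out future.result() gives up waiting without touching the coroutine, which then runs to completion on the loop.

```python
import asyncio
import concurrent.futures
import threading


def start_background_loop():
    """Start an event loop in a daemon thread (assumed shape of the shared loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def slow_connect(state):
    """Hypothetical stand-in for a slow MCP connection attempt."""
    state["started"] = True
    await asyncio.sleep(0.5)
    state["finished"] = True


def demo_abandoned_wait():
    loop = start_background_loop()
    state = {"started": False, "finished": False}
    future = asyncio.run_coroutine_threadsafe(slow_connect(state), loop)

    timed_out = False
    try:
        # The caller stops waiting after 50 ms; nothing is cancelled here.
        future.result(timeout=0.05)
    except concurrent.futures.TimeoutError:
        timed_out = True

    # The coroutine is still pending on the background loop.
    still_running = not future.done()

    # Waiting longer shows that it was never stopped: it runs to completion.
    future.result(timeout=2)
    loop.call_soon_threadsafe(loop.stop)
    return timed_out, still_running, state["finished"]
```

Whether an explicit future.cancel() reaches the coroutine depends on how the bridge is written and on whether the coroutine is at an await point that honours cancellation, so the behaviour of the real _run_async() should be checked against its actual code.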

Affected Components

MCPManager._run_async (openzess/backend/mcp_manager.py)
MCPClientManager.start_server (wintermute/integrations/mcp_runtime.py)
MCPRuntime.make_handler (wintermute/integrations/mcp_runtime.py)
McpHub singleton lifecycle (auto-coder/src/autocoder/common/mcp_hub.py)
stdio transport / subprocess lifecycle

Reproduction

1. Start the backend with an empty MCP registry.
2. Connect two MCP servers simultaneously: one fast (stdio), one slow or unreachable.
3. Wait 15 seconds for the slow server to time out.
4. Attempt a tool call on the fast server; it times out (event loop starved by the zombie coroutine).
5. Repeat tool calls; degradation is progressive until all of them time out.
6. Check processes; orphaned node/npx subprocesses are still running.
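The symptom in step 4 can be imitated without any MCP server. One way a leftover coroutine can starve a shared loop is by doing blocking work on the loop thread; the issue does not establish that this is the mechanism in the reported setup, so treat the following as an illustration of the symptom rather than a diagnosis. The names blocking_zombie, fast_tool_call and demo_starvation are hypothetical.

```python
import asyncio
import concurrent.futures
import threading
import time


async def blocking_zombie(started):
    """Hypothetical leftover coroutine that blocks the loop thread."""
    started.set()
    time.sleep(0.5)  # blocking call: nothing else can run on this loop meanwhile


async def fast_tool_call():
    """Hypothetical tool call that would normally return immediately."""
    return "ok"


def demo_starvation():
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()

    started = threading.Event()
    asyncio.run_coroutine_threadsafe(blocking_zombie(started), loop)
    started.wait(timeout=2)  # make sure the zombie is already occupying the loop

    fast = asyncio.run_coroutine_threadsafe(fast_tool_call(), loop)
    try:
        fast.result(timeout=0.1)
        starved = False
    except concurrent.futures.TimeoutError:
        starved = True  # the fast call could not even start

    # Once the loop is free again, the same call completes normally.
    late_result = fast.result(timeout=2)
    loop.call_soon_threadsafe(loop.stop)
    return starved, late_result
```

The fast call is not slow in itself; it simply never gets scheduled while the loop thread is occupied, which matches the "all subsequent tool calls time out" pattern described above.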

Related Issues

bytedance/deer-flow#2615: Event loop closed due to globally-cached async clients
bytedance/deer-flow#2627: Fix using persistent event loop + run_coroutine_threadsafe
Fluid-AI/fluidmcp#515: MCP stdio/subprocess lifecycle failures
dhyansraj/mcp-mesh#851: Zombie probe detection and lifecycle races
motsognirr/olmlx#263: MCP tool timeout and connection retry patterns

extent analysis

TL;DR

Cancelling the future in MCPManager._run_async() does not stop the underlying coroutine, which leaves zombie coroutines behind and starves the shared loop of resources. The coroutine itself has to be cancelled and awaited so that its cleanup actually runs.

Guidance

Investigate calling asyncio.Task.cancel() on the task and then awaiting the task itself (or passing it to asyncio.wait()) so that the coroutine is actually cancelled and its cleanup has finished before the caller moves on.
Review the MCPManager._run_async() method to ensure it handles coroutine cancellation correctly and releases resources.
Consider implementing a timeout and retry mechanism for MCP server connections to prevent indefinite resource holding.
Look into using a more robust event loop management system to prevent starvation and resource leaks.
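One way to combine the first three points is to enforce the timeout on the loop thread rather than in the calling thread. The sketch below assumes a background loop driven through asyncio.run_coroutine_threadsafe(), which the issue does not confirm; run_with_timeout, slow_connect and the grace parameter are hypothetical. Because asyncio.wait_for() runs inside the loop, it cancels the coroutine on timeout and waits for that cancellation to be processed, so finally blocks run before the caller sees the error.

```python
import asyncio
import threading


def start_background_loop():
    """Start an event loop in a daemon thread (assumed shape of the shared loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def run_with_timeout(loop, coro, timeout, grace=5.0):
    """Run coro on the background loop; cancel it there if it exceeds timeout."""

    async def bounded():
        # wait_for cancels the inner coroutine on timeout and awaits the
        # cancellation, so cleanup code has run when TimeoutError is raised.
        return await asyncio.wait_for(coro, timeout)

    future = asyncio.run_coroutine_threadsafe(bounded(), loop)
    # The outer wait is only a safety net in case cleanup itself hangs.
    return future.result(timeout=timeout + grace)


async def slow_connect(events):
    """Hypothetical connection attempt that releases resources in finally."""
    try:
        await asyncio.sleep(10)
    finally:
        events.append("cleaned up")


def demo():
    loop = start_background_loop()
    events = []
    try:
        run_with_timeout(loop, slow_connect(events), timeout=0.1)
    except asyncio.TimeoutError:
        events.append("timed out")
    loop.call_soon_threadsafe(loop.stop)
    return events
```

This only helps if the coroutine cooperates with cancellation. A coroutine that swallows CancelledError or blocks the loop thread still needs to be fixed at the source.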

Example

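import asyncio

async def _run_async(self, coroutine, timeout=10):
    # Wrap the coroutine in a task so it can be cancelled explicitly.
    task = asyncio.create_task(coroutine)
    try:
        # wait_for already cancels the task on timeout; the handler below
        # is a safety net that confirms the task has really finished.
        return await asyncio.wait_for(task, timeout=timeout)
    except asyncio.TimeoutError:
        task.cancel()
        try:
            await task  # Wait until the cancellation has been processed
        except asyncio.CancelledError:
            pass
        raise  # Re-raise the timeout so the caller can react to it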

Notes

The provided solution is a starting point and may require further modifications to fit the specific use case. The asyncio library version and the Python version being used may also impact the implementation.

Recommendation

Apply a workaround by implementing a proper cancellation mechanism for coroutines in MCPManager._run_async() to prevent resource starvation and zombie coroutines. This will help mitigate the issue until a more permanent fix can be implemented.
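For the zombie-process side of the problem, cancellation only helps if the connection code also reaps the child it spawned. The following is a minimal sketch, assuming the stdio server is launched with asyncio.create_subprocess_exec(); the issue does not show how the affected projects spawn their node/npx children, and reap and demo are hypothetical names. A Python child that sleeps stands in for an unresponsive server.

```python
import asyncio
import sys


async def reap(proc, grace=2.0):
    """Terminate a child process and wait for it, escalating to kill if needed."""
    if proc.returncode is not None:
        return  # already exited and reaped
    proc.terminate()
    try:
        await asyncio.wait_for(proc.wait(), grace)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()


async def demo():
    # Stand-in for an unresponsive stdio MCP server: it never writes a line.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)",
        stdout=asyncio.subprocess.PIPE,
    )
    try:
        # Stand-in for the handshake: wait briefly for the first output line.
        await asyncio.wait_for(proc.stdout.readline(), 0.2)
    except asyncio.TimeoutError:
        # Timed out: make sure the child does not outlive the attempt.
        await reap(proc)
    return proc.returncode
```

After demo() returns, returncode is set, which means the child has exited and been waited on instead of lingering as an orphan. A real implementation would also need to handle CancelledError on the same path.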
