hermes - ✅(Solved) Fix Lazy import of model_tools blocks asyncio event loop on first gateway message when an MCP server is slow/unreachable [2 pull requests, 1 participant]

GitHub stats
NousResearch/hermes-agent#16856
Fetched 2026-04-29 06:38:28
Comments: 0
Participants: 1
Timeline: 10
Reactions: 0
Timeline (top): labeled ×5, cross-referenced ×2, referenced ×2, closed ×1

model_tools.py runs discover_mcp_tools() as a module-level side effect (line 143). The gateway lazy-imports run_agent (which imports model_tools) the first time a user message reaches _handle_message_with_agent — meaning the very first message after gateway start triggers MCP discovery inside the asyncio event loop thread. Since _run_on_mcp_loop uses a blocking future.result(timeout=120) rather than await, this freezes the Discord/Telegram/etc. WebSocket heartbeat for up to 120 seconds whenever any configured MCP server is unreachable. After ~50s Discord force-closes the shard.

This is distinct from #10138 (which is about a nested-call deadlock inside register_mcp_servers). Even if #10138 is fixed, a slow/unreachable MCP server will still freeze the loop because the discovery is invoked synchronously from an async context.
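
The difference between blocking on future.result() and awaiting the same future can be seen in a small, self-contained sketch. Nothing here is Hermes code: start_background_loop, slow_discovery, heartbeat and measure are hypothetical stand-ins, with a second event loop on a daemon thread playing the role of the MCP loop and a ticking task playing the role of the platform heartbeat.

```python
import asyncio
import threading
import time


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Start a second event loop on a daemon thread (stand-in for the MCP loop)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def slow_discovery(delay: float) -> str:
    """Stand-in for discovery against a slow MCP server."""
    await asyncio.sleep(delay)
    return "done"


async def heartbeat(ticks: list, interval: float) -> None:
    """Record a timestamp every `interval` seconds, like a WebSocket heartbeat."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def measure(blocking: bool, delay: float = 0.5, interval: float = 0.05) -> float:
    """Return the largest gap between heartbeats while discovery is in flight."""
    bg_loop = start_background_loop()
    ticks: list = []
    hb = asyncio.create_task(heartbeat(ticks, interval))
    await asyncio.sleep(interval * 2)  # let the heartbeat tick a few times
    future = asyncio.run_coroutine_threadsafe(slow_discovery(delay), bg_loop)
    if blocking:
        future.result(timeout=5)  # blocks this thread, and the loop with it
    else:
        await asyncio.wrap_future(future)  # yields to the loop while waiting
    await asyncio.sleep(interval * 2)  # let the heartbeat tick again
    hb.cancel()
    bg_loop.call_soon_threadsafe(bg_loop.stop)
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))
```

Running asyncio.run(measure(True)) should report a heartbeat gap of at least the discovery delay, while asyncio.run(measure(False)) should stay close to the heartbeat interval. In the issue the same effect is scaled up to a 120 second timeout against a watchdog that warns after 10 seconds.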

Error Message

2026-04-28 05:54:59 WARNING discord.gateway: Shard ID None heartbeat blocked for more than 40 seconds.
Loop thread traceback (most recent call last):
  ...
  File "gateway/platforms/base.py", line 2072, in _process_message_background
    response = await self._message_handler(event)
  File "gateway/run.py", line 3871, in _handle_message
    return await self._handle_message_with_agent(...)
  File "gateway/run.py", line 4516, in _handle_message_with_agent
    agent_result = await self._run_agent(...)
  File "gateway/run.py", line 9334, in _run_agent
    from run_agent import AIAgent              # lazy import
  File "run_agent.py", line 67, in <module>
    from model_tools import (...)              # transitive
  File "model_tools.py", line 143, in <module>
    discover_mcp_tools()                       # module-level side effect
  File "tools/mcp_tool.py", line 2455, in discover_mcp_tools
    tool_names = register_mcp_servers(servers)
  File "tools/mcp_tool.py", line 2408, in register_mcp_servers
    _run_on_mcp_loop(_discover_all(), timeout=120)
  File "tools/mcp_tool.py", line 1577, in _run_on_mcp_loop
    return future.result(timeout=wait_timeout)  # BLOCKS asyncio loop
  File ".../concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)

Root Cause

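model_tools.py runs discover_mcp_tools() at import time, and the gateway first imports it (through run_agent) from inside the asyncio event loop thread while handling the first user message. The discovery path ends in a blocking future.result(timeout=120) in _run_on_mcp_loop, so a slow or unreachable MCP server stalls the loop, and with it the platform heartbeat, for up to 120 seconds. The nested-call deadlock tracked in #10138 is a separate problem; fixing it does not remove this stall.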

Fix Action

Workaround

Remove the slow/unreachable server from mcp_servers in config.yaml. Discovery completes in ~2s and the import-time call returns fast enough not to trip the heartbeat watchdog. This is what we did locally.

PR fix notes

PR #16877: fix(gateway): defer MCP discovery to executor when imported inside event loop (#16856)

Description (problem / solution / changelog)

Summary

Fixes #16856 — Lazy import of model_tools blocks the asyncio event loop on the first gateway message when an MCP server is slow/unreachable.

Root Cause

model_tools.py calls discover_mcp_tools() as a module-level side effect (line 143). The gateway lazy-imports run_agent (which transitively imports model_tools) inside the asyncio event loop thread. discover_mcp_tools() → _run_on_mcp_loop() → future.result(timeout=120) blocks the loop for up to 120s when an MCP server is unreachable, killing Discord shard heartbeats and Telegram polling.

Fix

Detect whether an asyncio event loop is running at import time:

  • Running loop (gateway): Schedule discovery via loop.run_in_executor() so the event loop stays responsive
  • No running loop (CLI/TUI startup): Run discovery inline as before

discover_mcp_tools() is already idempotent, so deferred execution is safe. MCP tools may not be available for the very first message in a gateway session, but they'll be ready by the second one — far better than freezing the entire platform for 120s.
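
The detection described above can be sketched roughly as follows. This is not the code from the PR: run_discovery_without_blocking_loop is a hypothetical helper name, and the discover_mcp_tools defined here is only a placeholder for the real function. The sketch relies on asyncio.get_running_loop() raising RuntimeError when no loop is running in the current thread.

```python
import asyncio


def discover_mcp_tools() -> list:
    """Placeholder for the real discovery call, assumed to be idempotent."""
    return []


def run_discovery_without_blocking_loop(discover=discover_mcp_tools):
    """Run `discover` inline, or on the default executor if a loop is running.

    Returns the discovery result when run inline, or an asyncio future
    when the call was deferred to a worker thread.
    """
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop in this thread (CLI/TUI startup): run inline.
        return discover()
    # Running loop (gateway): hand the blocking call to a worker thread
    # so the event loop stays free to service heartbeats.
    return loop.run_in_executor(None, discover)
```

Because the deferred branch returns immediately, a caller that does not await the returned future may proceed before discovery has finished, which matches the first-message caveat noted above.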

Changes

  • model_tools.py: Wrap module-level discover_mcp_tools() with event loop detection
  • tests/test_model_tools_async_bridge.py: Added TestMcpDiscoveryDeferral with 2 test cases

Testing

python3 -m pytest tests/test_model_tools_async_bridge.py -x -q -o "addopts="
# 13 passed (11 existing + 2 new)

Test coverage:

| Test | Verifies |
| test_mcp_discovery_offloaded_when_loop_running | Discovery scheduled via executor, not inline |
| test_mcp_discovery_runs_inline_without_loop | CLI/TUI path unchanged — discovery runs synchronously |

How to verify manually

  1. Configure an unreachable MCP server in config.yaml
  2. Start the gateway
  3. Send a message via Discord/Telegram
  4. Before: First message hangs for ~120s, heartbeat warnings flood logs
  5. After: First message responds promptly, MCP tools become available shortly after

Changed files

  • model_tools.py (modified, +31/-1)
  • tests/test_model_tools_async_bridge.py (modified, +91/-0)

PR #16899: fix(mcp): move discovery out of model_tools import side effect (#16856)

Description (problem / solution / changelog)

Summary

Removes the module-level discover_mcp_tools() call in model_tools.py so lazy-importing this module from inside an asyncio event loop no longer blocks the loop for up to 120s when an MCP server is slow or unreachable. Closes #16856. Supersedes #16877 (same intent, cleaner fix per suggestion #1 in the issue — credit @Bartok9).

Root cause

model_tools.py ran discover_mcp_tools() as an import-time side effect. The gateway lazy-imports run_agent (→ model_tools) on the first user message, which executes inside the asyncio event loop thread. discover_mcp_tools() uses future.result(timeout=120) internally, freezing Discord shard heartbeats and Telegram polling for up to 120s whenever a configured MCP server was unreachable.

Changes

  • model_tools.py: remove module-level discover_mcp_tools() call (keeps the symbol importable for explicit callers).
  • gateway/run.py: run discovery via loop.run_in_executor(None, discover_mcp_tools) in start_gateway() before runner.start() — loop stays responsive, tools are ready before the first message arrives.
  • hermes_cli/main.py: inline discovery in the agent-command startup path (no event loop running).
  • tui_gateway/entry.py: inline discovery in main() (sync stdin loop).
  • acp_adapter/entry.py: inline discovery before asyncio.run() (sync context).
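
The gateway change in the list above might look roughly like this. It is a sketch under assumptions, not the PR's code: GatewayRunner is a hypothetical stand-in for the real runner, discover_mcp_tools is a placeholder, and awaiting the executor future before runner.start() is inferred from the statement that tools are ready before the first message arrives.

```python
import asyncio
import time


def discover_mcp_tools() -> list:
    """Placeholder for the blocking discovery call."""
    time.sleep(0.1)  # simulate a slow MCP server
    return ["example_tool"]


class GatewayRunner:
    """Hypothetical stand-in for the gateway runner."""

    def __init__(self) -> None:
        self.started = False

    async def start(self) -> None:
        self.started = True


async def start_gateway(runner, discover=discover_mcp_tools) -> list:
    loop = asyncio.get_running_loop()
    # The blocking discovery runs on a worker thread, so the loop can keep
    # servicing other tasks while it waits.
    tools = await loop.run_in_executor(None, discover)
    # Platforms start accepting traffic only after discovery has finished.
    await runner.start()
    return tools
```

The sync entry points (CLI, TUI, ACP adapter) need none of this, since no event loop is running when they call discovery inline.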

Why this instead of PR #16877

#16877 added event-loop detection inside model_tools.py and offloaded discovery to an executor when a loop was running. That worked, but the module-level call made the first gateway message race against discovery (first message could hit the model before MCP tool schemas were registered). Moving the call into each entry point's startup sequence avoids both problems: no import-time side effect, and gateway discovery completes before platforms accept traffic.

Validation

| | Before | After |
| import model_tools triggers MCP discovery | yes | no (verified via subprocess probe) |
| Gateway first-message delay w/ dead MCP server | up to 120s | ~0s (runs at startup, not first message) |
| tests/test_model_tools_async_bridge.py | 11 passed | 11 passed |
| tests/tools/test_mcp_tool.py | 177 passed | 177 passed |
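
The PR does not show its subprocess probe, so the following is only one way such a check could be written. The names are hypothetical: tools_stub stands in for the module that owns the discovery function, and import_triggers_discovery is an illustrative helper. A fresh interpreter is used so that a module already cached in sys.modules cannot hide an import-time call.

```python
import subprocess
import sys
import textwrap

# Script run in the child interpreter: replace the discovery function with a
# recorder, import the module under test, then print how often it was called.
PROBE = textwrap.dedent("""
    import tools_stub
    calls = []
    tools_stub.discover = lambda: calls.append(1)
    import {module}
    print(len(calls))
""")


def import_triggers_discovery(module: str, search_path: str) -> bool:
    """Import `module` in a fresh interpreter; report whether discover() ran."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE.format(module=module)],
        cwd=search_path,  # with -c, the working directory is on sys.path
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() != "0"
```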

How to verify manually

  1. Configure an unreachable MCP server in config.yaml.
  2. Start the gateway — discovery happens in the executor during startup, logs show "(1 failed)" after the retry window.
  3. Send the first Discord/Telegram message.
  4. Before this PR: first message hangs for ~120s, heartbeat warnings flood logs.
  5. After this PR: first message responds promptly, no heartbeat warnings.

Changed files

  • acp_adapter/entry.py (modified, +11/-0)
  • gateway/run.py (modified, +13/-0)
  • hermes_cli/main.py (modified, +11/-0)
  • model_tools.py (modified, +12/-6)
  • tui_gateway/entry.py (modified, +11/-0)

Code Example

mcp_servers:
     unreachable:
       url: http://10.99.99.99:9999/mcp


Reproduction

  1. Add an unreachable MCP server URL to config.yaml:
    mcp_servers:
      unreachable:
        url: http://10.99.99.99:9999/mcp
  2. Start the gateway. Discovery succeeds at startup (logs MCP: registered N tool(s) from M server(s) (1 failed) after a short retry window).
  3. Send the first Discord/Telegram message after gateway start.
  4. Within ~10s, the platform logs Shard ID None heartbeat blocked for more than 10 seconds. Heartbeat-block warnings escalate every 10s. The first message hangs for ~120s before either responding or the shard reconnects.

A subsequent message in the same gateway process is fine — model_tools is now imported and the side-effect doesn't re-run.

Stack trace (Hermes 0.11.0 / v2026.4.23, Python 3.11.15)

2026-04-28 05:54:59 WARNING discord.gateway: Shard ID None heartbeat blocked for more than 40 seconds.
Loop thread traceback (most recent call last):
  ...
  File "gateway/platforms/base.py", line 2072, in _process_message_background
    response = await self._message_handler(event)
  File "gateway/run.py", line 3871, in _handle_message
    return await self._handle_message_with_agent(...)
  File "gateway/run.py", line 4516, in _handle_message_with_agent
    agent_result = await self._run_agent(...)
  File "gateway/run.py", line 9334, in _run_agent
    from run_agent import AIAgent              # lazy import
  File "run_agent.py", line 67, in <module>
    from model_tools import (...)              # transitive
  File "model_tools.py", line 143, in <module>
    discover_mcp_tools()                       # module-level side effect
  File "tools/mcp_tool.py", line 2455, in discover_mcp_tools
    tool_names = register_mcp_servers(servers)
  File "tools/mcp_tool.py", line 2408, in register_mcp_servers
    _run_on_mcp_loop(_discover_all(), timeout=120)
  File "tools/mcp_tool.py", line 1577, in _run_on_mcp_loop
    return future.result(timeout=wait_timeout)  # BLOCKS asyncio loop
  File ".../concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)

Why it manifests now

In a clean dev session, MCP discovery has already happened at gateway startup, so the lazy import on first message is cheap. The bug surfaces when:

  • An MCP server is configured but unreachable (network timeout, dead host, wrong port, etc.) — startup discovery records "(1 failed)" but doesn't blacklist it, and
  • The lazy import path re-invokes discover_mcp_tools which retries the failed server with the full 120s budget.

I'd guess most users haven't hit this because their MCP servers are local/reachable.

Suggested fixes

Either of these resolves the symptom; ideally both:

  1. Remove the module-level call. model_tools.py:143 calling discover_mcp_tools() at import is a side effect that's unsafe from any async context. Discovery already runs at gateway startup; a second invocation from within a message handler shouldn't be needed. If a re-discovery hook is wanted, expose it as an explicit function and call it from a non-async lifecycle event.

  2. Make _run_on_mcp_loop async-aware. When called from an event loop, schedule the coroutine and await the future via asyncio.wrap_future rather than future.result(timeout=...). Today's blocking-wait pattern silently freezes whatever loop happens to be running.

Environment

  • Hermes Agent v0.11.0 (v2026.4.23)
  • Python 3.11.15 on Linux (Debian/LXC)
  • Gateway: hermes-gateway systemd user service
  • Platform: Discord (discord.py); the same blocking-wait pattern would affect any platform whose handler runs in the asyncio loop

extent analysis

TL;DR

To fix the issue, remove the module-level call to discover_mcp_tools() in model_tools.py or make _run_on_mcp_loop async-aware to prevent blocking the asyncio event loop.

Guidance

  • Identify and remove any unreachable MCP servers from the config.yaml file to prevent the discovery process from freezing the asyncio event loop.
  • Consider exposing an explicit function for re-discovery and call it from a non-async lifecycle event to avoid invoking discover_mcp_tools() from within an async context.
  • To make _run_on_mcp_loop async-aware, use asyncio.wrap_future to schedule the coroutine and await the future instead of using future.result(timeout=...).
  • Verify that the fix worked by testing the gateway with an unreachable MCP server and checking for heartbeat block warnings.

Example

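# Make _run_on_mcp_loop async-aware (sketch; mcp_loop is assumed to be the
# dedicated loop that already runs MCP coroutines on its own thread)
import asyncio

async def _run_on_mcp_loop(coroutine, mcp_loop, timeout=120):
    # run_coroutine_threadsafe returns a concurrent.futures.Future
    cf_future = asyncio.run_coroutine_threadsafe(coroutine, mcp_loop)
    # wrap_future takes that Future (not a coroutine) and makes it awaitable
    return await asyncio.wait_for(asyncio.wrap_future(cf_future), timeout=timeout)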

Notes

The provided fix assumes that the issue is caused by the blocking call to future.result(timeout=...) in _run_on_mcp_loop. If the issue persists after applying the fix, further investigation may be necessary to identify the root cause.

Recommendation

Apply the workaround by removing unreachable MCP servers from config.yaml and consider implementing one of the suggested fixes to prevent the issue from occurring in the future. This approach ensures that the gateway remains functional while a more permanent solution is developed.
