hermes - ✅(Solved) Fix [Bug] Gateway fails to register stdio MCP servers silently on macOS (launchd) [1 pull request, 1 participant]

GitHub stats

  • Issue: NousResearch/hermes-agent#14113 (fetched 2026-04-23 07:46:44)
  • Comments: 0
  • Participants: 1
  • Timeline events: 6 (labeled ×5, cross-referenced ×1)
  • Reactions: 0

A stdio-based MCP server that works with hermes mcp add / hermes mcp test and with hermes chat -q … (CLI) never becomes reachable from the Telegram-facing gateway. The gateway silently fails to connect — no successful MCP: registered N tool(s) log line, no MCP child processes of the gateway PID — and sessions are built with only the built-in toolsets.

Error Message

  • gateway.error.log: WARNING tools.mcp_tool: Failed to connect to MCP server 'myserver' (command=node): CancelledError

Root Cause

Per PR #14173, the most plausible cause is a stale per-thread interrupt flag. _run_on_mcp_loop() polls is_interrupted() and cancels the submitted coroutine when the calling thread's tid is in _interrupted_threads. If an earlier agent turn never cleared its flag, the recycled executor worker that later runs discover_mcp_tools() (at gateway boot or on /reload-mcp) has its stdio handshake cancelled, which surfaces as a bare CancelledError. The PR author did not confirm this hypothesis against a live launchd environment.

Fix Action

Workaround

Switching the same server to HTTP transport (StreamableHTTPServerTransport bound to 127.0.0.1:PORT) makes it register immediately and work end-to-end in Telegram sessions. So the bug appears specific to stdio + gateway-under-launchd.

PR fix notes

PR #14173: fix(mcp): don't cancel registration on stale per-thread interrupt flag

Description (problem / solution / changelog)

What does this PR do?

Narrow fix for #14113 where stdio MCP servers silently fail to register in the launchd-managed gateway with a bare CancelledError, while the same config works via hermes mcp test, hermes chat -q, and the gateway's HTTP transport.

What's happening

_run_on_mcp_loop() (tools/mcp_tool.py:1544) polls is_interrupted() in a 0.1s loop and calls future.cancel() on the submitted coroutine if the calling thread's tid is in _interrupted_threads. That cancellation channel exists so a new user message aborts a long-running tool call mid-flight — exactly the right semantics for _make_tool_handler.

The same plumbing was also being used for registration / discovery / probe — call paths that run during gateway startup and /reload-mcp, scheduled onto the default executor's worker pool. run_agent.py's own clear_interrupt() (line 3619–3623) already warns that stale per-tid flags can survive a turn boundary and fire on unrelated work scheduled onto the recycled worker:

"guarantees no stale interrupt can survive a turn boundary and fire on a subsequent, unrelated tool call that happens to get scheduled onto the same recycled worker tid."

When a prior agent turn's cleanup was missed (crash, abandoned task, whatever), the executor worker's tid stays in _interrupted_threads. The next thing to run on that worker — discover_mcp_tools() at gateway boot or /reload-mcp — gets its stdio handshake cancelled before it finishes. asyncio.gather(..., return_exceptions=True) collects the bare CancelledError and the warning in the report is emitted.

The fix

One kwarg on _run_on_mcp_loop:

def _run_on_mcp_loop(coro, timeout: float = 30, *, respect_interrupt: bool = True):
    ...
    if respect_interrupt and is_interrupted():
        future.cancel()
        raise InterruptedError("User sent a new message")
  • Default True — tool-call callers (_make_tool_handler._call_once and its five variants) keep the existing interrupt behavior; user interrupts still abort in-flight tool calls.
  • Explicit False from register_mcp_servers() and probe_mcp_server_tools() — registration / probe cannot be cancelled by a stale flag on a recycled worker.

Also: _format_connect_error() now recognises a naked asyncio.CancelledError (no __cause__, no __context__) and emits a short actionable message instead of the bare class name. A chained CancelledError with a real cause still falls through to the existing flattening path so the underlying reason is preserved.
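The naked-versus-chained distinction can be sketched as follows. The function name, the message wording, and the flattening format are assumptions for illustration; only the test on __cause__ and __context__ comes from the PR description.

```python
import asyncio


def format_connect_error(exc: BaseException) -> str:
    """Hypothetical formatter: special-case a CancelledError with no cause."""
    naked_cancel = (
        isinstance(exc, asyncio.CancelledError)
        and exc.__cause__ is None
        and exc.__context__ is None
    )
    if naked_cancel:
        # Assumed wording; the real message in the PR may differ.
        return (
            "connection attempt was cancelled before the handshake "
            "finished (no underlying cause recorded)"
        )

    # Otherwise flatten the exception chain so the real reason stays visible.
    parts = []
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        text = str(current)
        name = type(current).__name__
        parts.append(f"{name}: {text}" if text else name)
        current = current.__cause__ or current.__context__
    return " <- ".join(parts)


naked = asyncio.CancelledError()
chained = asyncio.CancelledError()
chained.__cause__ = OSError("broken pipe")

naked_message = format_connect_error(naked)
chained_message = format_connect_error(chained)
print(naked_message)
print(chained_message)
```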

Scope limits

  • Does not touch gateway/run.py, run_agent.py, or model_tools.py. The fix lives where the bug lives.
  • Does not change tool-call interrupt semantics.
  • Does not empirically confirm the stale-flag hypothesis against a live launchd environment — I don't have one. The fix is defensible on its own terms (registration shouldn't be cancellable by a stale user-interrupt flag), and removes the most plausible trigger for the reported symptom. If the reporter sees residual cancellations after this lands, the improved _format_connect_error message will point them to the next datapoint (parent-task cancel vs. some other source).

Related Issue

Fixes #14113

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • tools/mcp_tool.py:
    • _run_on_mcp_loop: new respect_interrupt kwarg (keyword-only, default True)
    • register_mcp_servers and probe_mcp_server_tools: pass respect_interrupt=False
    • _format_connect_error: recognize naked CancelledError and emit an actionable hint
  • tests/tools/test_mcp_tool.py:
    • TestRunOnMCPLoopInterrupts.test_respect_interrupt_false_ignores_stale_flag — simulates stale flag, asserts registration proceeds
    • TestRunOnMCPLoopInterrupts.test_respect_interrupt_default_still_cancels — regression guard, tool-call path still raises InterruptedError
    • TestFormatConnectError.test_naked_cancelled_error_has_actionable_message — new CancelledError branch
    • TestFormatConnectError.test_cancelled_error_with_cause_falls_through — chained cause still rendered
    • TestFormatConnectError.test_missing_node_executable_unchanged — existing FileNotFoundError path unaffected

How to Test

source .venv/bin/activate
scripts/run_tests.sh tests/tools/test_mcp_tool.py -q

Expected: 171 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass — only tests/tools/test_mcp_tool.py was run (171 passed); the full suite was not run
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS (Darwin 25.4.0). No live launchd repro — I don't have the reporter's environment.

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — added explanatory docstrings on the new respect_interrupt kwarg
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) — pure Python, no platform-specific code
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

▶ running pytest with 4 workers, hermetic env, in /…/hermes-agent
  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; all credential env vars unset)
........................................................................ [ 42%]
........................................................................ [ 84%]
...........................                                              [100%]
171 passed, 5 warnings in 8.01s

Changed files

  • tests/tools/test_mcp_tool.py (modified, +102/-0)
  • tools/mcp_tool.py (modified, +25/-4)

Code Example

hermes mcp add myserver --command node --args /path/to/server.mjs --env DATABASE_URL=...

---

WARNING tools.mcp_tool: Failed to connect to MCP server 'myserver' (command=node): CancelledError

Environment

  • Hermes Agent v0.10.0 (2026.4.16)
  • macOS 26.3.1 (arm64)
  • Python 3.11.14
  • Node v25.9.0
  • Gateway managed by launchd (ai.hermes.gateway.plist)

Reproduction

  1. Register a local stdio MCP server (Node subprocess):
    hermes mcp add myserver --command node --args /path/to/server.mjs --env DATABASE_URL=...
  2. hermes mcp test myserver → ✓ connected in ~200 ms, 26 tools discovered.
  3. hermes chat -Q -q "call tool X" (CLI) → works.
  4. Restart the gateway (hermes gateway restart), open a Telegram session, /new, then ask anything that should use the MCP.
  5. Observed:
    • pgrep -P <gateway-pid> → no child node process.
    • /reload-mcp on Telegram → No MCP servers connected.
    • gateway.error.log:
      WARNING tools.mcp_tool: Failed to connect to MCP server 'myserver' (command=node): CancelledError
    • Per-session tool list dumped from ~/.hermes/sessions/*.json contains only built-in toolsets (e.g. 29 entries, none prefixed mcp_).
    • hermes tools list --platform telegram still reports myserver all tools enabled — misleading since the server was never actually registered in the gateway process.
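Because the failure is only visible as a single warning line, a small log scan can confirm which servers failed and with what error. This is a sketch that assumes the warning keeps the exact shape quoted above; the function name and the regular expression are illustrative and not part of Hermes.

```python
import re

# Pattern derived from the warning line quoted in the report (assumed stable).
_FAILED_CONNECT = re.compile(
    r"Failed to connect to MCP server '(?P<name>[^']+)' "
    r"\(command=(?P<command>[^)]*)\): (?P<error>.+)$"
)


def find_failed_servers(log_lines):
    """Return (server name, command, error text) for each matching line."""
    failures = []
    for line in log_lines:
        match = _FAILED_CONNECT.search(line.rstrip("\n"))
        if match:
            failures.append(
                (match["name"], match["command"], match["error"].strip())
            )
    return failures


sample = [
    "WARNING tools.mcp_tool: Failed to connect to MCP server 'myserver' "
    "(command=node): CancelledError",
    "INFO gateway: unrelated line",
]
failures = find_failed_servers(sample)
print(failures)
```

To scan a real file, pass an open file object for gateway.error.log in place of the sample list.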

What works vs. what doesn't

| Context | Result |
|---|---|
| hermes mcp test | ✓ connects (~200 ms) |
| hermes chat -q … (CLI one-shot) | ✓ MCP tools callable |
| Gateway (launchd-managed, Telegram), stdio transport | ✗ never registers, CancelledError |
| Gateway (launchd-managed, Telegram), HTTP transport | ✓ works reliably |

Hypothesis

discover_mcp_tools() is invoked via nested executors (loop.run_in_executor(None, discover_mcp_tools) → _run_on_mcp_loop on a background thread), and the stdio client's subprocess is spawned via anyio. The outer call appears to be cancelled before the stdio handshake completes, producing the silent CancelledError. It is worth checking why the cancellation happens in the gateway context but not in a plain Python REPL or hermes chat.
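One reason the failure looks silent is that a cancelled task gathered with return_exceptions=True comes back as a CancelledError instance with an empty message, so a log line built from it shows only the class name. The sketch below demonstrates that standard asyncio behaviour; the coroutine names are hypothetical and unrelated to the Hermes code.

```python
import asyncio


async def slow_handshake() -> str:
    """Stand-in for a stdio handshake that is cancelled before finishing."""
    await asyncio.sleep(10)
    return "registered"


async def quick_ok() -> str:
    """Stand-in for a server that connects without trouble."""
    return "ok"


async def main():
    cancelled_task = asyncio.ensure_future(slow_handshake())
    ok_task = asyncio.ensure_future(quick_ok())
    await asyncio.sleep(0)   # let both tasks start
    cancelled_task.cancel()  # simulate an external cancellation
    # gather collects the cancellation as a result instead of raising it.
    return await asyncio.gather(cancelled_task, ok_task, return_exceptions=True)


results = asyncio.run(main())
for result in results:
    if isinstance(result, BaseException):
        # str(result) is empty here, so only the class name is informative.
        print(f"failed: {type(result).__name__} {str(result)!r}")
    else:
        print(f"succeeded: {result}")
```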

Evidence attached

Happy to provide full gateway.log / gateway.error.log snippets and a minimal repro repo if helpful.

Extended analysis

TL;DR

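PR #14173 proposes a fix that stops a stale per-thread interrupt flag from cancelling MCP registration. Until that fix is in the installed version, the issue can be worked around by switching the MCP server from stdio to HTTP transport.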

Guidance

  • Verify that the CancelledError is indeed caused by the outer call being cancelled before the stdio handshake completes by checking the gateway.log and gateway.error.log files for any relevant error messages.
  • Investigate why the cancellation happens in the gateway context but not in a plain Python REPL or hermes chat by comparing the execution environments and thread management.
  • Consider using a debugger or adding logging statements to the discover_mcp_tools() function to understand the execution flow and identify the point where the cancellation occurs.
  • Test the MCP server with a different transport mechanism, such as HTTP, to confirm that the issue is specific to stdio.

Example

No code snippet is provided as the issue is more related to the environment and configuration rather than a specific code block.

Notes

The issue seems to be specific to the combination of stdio transport and the gateway being managed by launchd. The fact that switching to HTTP transport resolves the issue suggests that the problem lies in the stdio handshake or the way the subprocess is spawned.

Recommendation

Apply the workaround by switching the MCP server to HTTP transport, which the reporter confirmed works reliably in this environment. Keep it in place until the stdio registration fix (PR #14173) is available in the installed version, then retest stdio.
