hermes - ✅(Solved) Fix MCP circuit breaker open state can permanently wedge gateway when stdio subprocess dies [1 pull request, 2 comments, 3 participants]

GitHub stats
NousResearch/hermes-agent#16788 · Fetched 2026-04-29 06:39:07
Comments: 2 · Participants: 3 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×1

When a long-running gateway session (hermes gateway run) trips the MCP circuit breaker for a stdio-transport server and the underlying subprocess has died, the breaker's half-open probe never respawns the subprocess — it just retries through a closed pipe. Result: every probe re-fails → breaker re-opens → loop forever. The MCP server is permanently broken for the lifetime of the gateway process.

Verified on v0.11.0 (2026.4.23), single-host deployment, Python 3.11.

Fix Action

Workaround

systemctl --user restart hermes-gateway.service — fully resets the in-process breaker dicts (_server_error_counts, _server_breaker_opened_at). Brief gateway downtime (~10-30s).

PR fix notes

PR #17016: fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003)

Description (problem / solution / changelog)

Problem

Two related MCP reliability issues affect long-running gateway sessions:

1. Circuit breaker permanently blocks recovery (#16788)

When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.

2. HTTP connections go stale during idle periods (#17003)

_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call fails silently with an empty error message.

Fix

Circuit breaker half-open recovery (#16788)

  • Added _CIRCUIT_BREAKER_COOLDOWN_SEC = 60 — cooldown period before allowing a probe
  • Added _server_breaker_opened_at — tracks when the breaker tripped
  • After cooldown elapses, the handler allows one probe call through (half-open state)
  • If probe succeeds → error count resets, server is usable again
  • If probe fails → breaker re-opens with fresh cooldown
  • Added _bump_server_error() and _reset_server_error() helpers for consistent state management

HTTP keepalive (#17003)

  • _wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
  • On each timeout, sends a lightweight list_tools() keepalive to exercise the connection
  • If keepalive fails → triggers automatic reconnect via _reconnect_event
  • Prevents TCP connections from going stale during long idle periods
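A rough sketch of the keepalive loop described above, under stated assumptions: the function signature, the `shutdown_event` parameter, and the string return values are hypothetical, and `session` is any object with an async `list_tools()` method. Only the 3-minute interval, the use of `asyncio.wait()` with a timeout, and the reconnect-on-failure behaviour come from the PR notes.

```python
import asyncio

KEEPALIVE_INTERVAL_SEC = 180  # 3 minutes, per the PR notes


async def wait_for_lifecycle_event(session, shutdown_event, reconnect_event,
                                   interval=KEEPALIVE_INTERVAL_SEC):
    """Wait for shutdown or reconnect, pinging the server on each timeout.

    Returns "shutdown" or "reconnect" (hypothetical return values).
    """
    while True:
        tasks = [
            asyncio.ensure_future(shutdown_event.wait()),
            asyncio.ensure_future(reconnect_event.wait()),
        ]
        done, pending = await asyncio.wait(
            tasks, timeout=interval, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()  # do not leak waiters across loop iterations
        if done:
            return "shutdown" if shutdown_event.is_set() else "reconnect"
        # Timed out with no lifecycle event: exercise the connection.
        try:
            await session.list_tools()
        except Exception:
            # Keepalive failed: request a reconnect; the next iteration returns.
            reconnect_event.set()
```

The design point is that an idle wait no longer means an idle connection: each timeout generates one cheap request, so a stale connection is noticed within one interval rather than on the next real tool call.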

Testing

  • Verified syntax with ast.parse() on the modified file
  • All existing error count tracking replaced with helper functions for consistency
  • Both fixes are backward-compatible — no config changes required

Fixes #16788 Fixes #17003

Changed files

  • tools/mcp_tool.py (modified, +81/-20)

Reproduction

  1. Configure a stdio MCP server whose subprocess can crash mid-session (e.g. mcr.microsoft.com/playwright/mcp:v0.0.70 via command: docker, where Chrome can OOM under bursty asset loads).
  2. Trigger ≥ _CIRCUIT_BREAKER_THRESHOLD (3) consecutive tool-call failures that take down the subprocess. In our case: a Squid forward proxy returned 403 for a CDN, Chrome went into a 387 req/sec retry storm, MCP tool calls timed out.
  3. Wait ≥ _CIRCUIT_BREAKER_COOLDOWN_SEC (60s) — breaker enters half-open, fires probe.

Expected

Probe failure detects "subprocess pipe dead" and respawns it before calling, OR returns clean error to model so it stops calling.

Actual

  • Probe call goes to the dead pipe → returns empty error
  • ~/.hermes/logs/errors.log shows: MCP tool playwright/browser_navigate call failed: (empty after the colon, repeating every ~60s)
  • docker ps -a --filter ancestor=<image> returns zero entries — gateway never re-spawns the container
  • Gateway-bound chats: agent reports a "browser crashed, recovering ~58s" message forever
  • hermes mcp test playwright from CLI succeeds in <1s — proves the config is correct, the docker run works, and the issue is gateway in-process state

Suggested fix

In tools/mcp_tool.py around the half-open probe path, before reusing the existing session, check whether the underlying transport is still alive. For stdio, a non-None return from poll() on the subprocess handle means the process has exited; in that case, force a fresh session via the existing reconnect path (_reconnect_event).

Refs: _CIRCUIT_BREAKER_THRESHOLD, _CIRCUIT_BREAKER_COOLDOWN_SEC, _bump_server_error, _reset_server_error at tools/mcp_tool.py:1377-1404.
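To make the suggested liveness check concrete, here is a small self-contained sketch using the standard library's subprocess.Popen. The helper names (`ensure_stdio_alive`, `spawn_echo_server`) are hypothetical, and the echo child is only a stand-in for a real stdio MCP server. How Hermes actually holds its subprocess handle is not shown in the issue, so treat the `proc` object as an assumption.

```python
import subprocess
import sys

# Stand-in child: echoes each stdin line back on stdout, like a trivial stdio server.
_ECHO_CHILD = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line)\n"
    "    sys.stdout.flush()\n"
)


def spawn_echo_server():
    """Hypothetical spawn function; a real one would launch the MCP server command."""
    return subprocess.Popen(
        [sys.executable, "-c", _ECHO_CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )


def ensure_stdio_alive(proc, spawn):
    """Return `proc` if it is still running, otherwise a fresh process from `spawn()`.

    Popen.poll() returns None while the child runs and its exit code afterwards.
    """
    if proc is not None and proc.poll() is None:
        return proc  # still running; safe to reuse the existing pipes
    return spawn()  # dead (or never started): respawn before probing
```

Calling `ensure_stdio_alive` just before the half-open probe means the probe goes to a live process rather than a closed pipe, so a success can actually reset the breaker.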

Notes

  • No data loss — once gateway restarts, the persistent profile volume picks up where it left off (logins survive)
  • Worked around locally with a systemd timer that watchdogs the empty-error pattern in errors.log and restarts the gateway. Happy to share if useful.
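The reporter's watchdog is not included in the issue. As a hedged illustration of the idea only, the empty-error pattern could be detected along these lines. The regular expression is inferred from the single log excerpt quoted under Actual, and the function names and threshold are hypothetical.

```python
import re

# Assumed line shape, based on the excerpt in the report:
#   "MCP tool playwright/browser_navigate call failed: " (nothing after the colon)
EMPTY_ERROR = re.compile(r"MCP tool (?P<tool>\S+) call failed:\s*$")


def count_empty_errors(lines):
    """Count log lines whose failure message after the colon is empty."""
    return sum(1 for line in lines if EMPTY_ERROR.search(line))


def should_restart(lines, threshold=3):
    """Hypothetical policy: restart once `threshold` empty errors appear in the window."""
    return count_empty_errors(lines) >= threshold
```

A timer-driven script could feed the most recent lines of ~/.hermes/logs/errors.log to `should_restart` and, when it returns True, run the systemctl restart command given as the workaround.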

extent analysis

TL;DR

The likely fix involves modifying the tools/mcp_tool.py file to check if the underlying transport is alive before reusing the existing session in the half-open probe path.

Guidance

  • Verify that the subprocess is indeed dead by checking the docker ps -a --filter ancestor=<image> output and the ~/.hermes/logs/errors.log for empty error messages.
  • Check the tools/mcp_tool.py file around lines 1377-1404 to understand the current implementation of the circuit breaker and half-open probe path.
  • Consider checking the underlying transport's liveness before reusing the existing session: for stdio, poll() on the subprocess handle returns None while the process is running and its exit code once it has died.
  • Test the suggested fix by triggering the circuit breaker and verifying that the subprocess is respawned correctly.

Example

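# tools/mcp_tool.py (sketch; `proc` is an assumed name for the stdio server's Popen-like handle)
if proc.poll() is not None:  # a non-None return code means the subprocess has exited
    # Force a fresh session via the existing reconnect path.
    # Assumes _reconnect_event is an asyncio.Event-style object; the issue does not show its type.
    _reconnect_event.set()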

Notes

  • The suggested fix assumes that the issue is specific to the stdio-transport server and may not apply to other types of servers.
  • The workaround using systemctl --user restart hermes-gateway.service can be used to temporarily resolve the issue, but it may not be suitable for production environments.

Recommendation

Apply the suggested fix by modifying the tools/mcp_tool.py file to check for the underlying transport's liveness before reusing the existing session. This should prevent the circuit breaker from getting stuck in an infinite loop and allow the subprocess to be respawned correctly.
