hermes - ✅ (Solved) Fix: MCP circuit breaker open state can permanently wedge gateway when stdio subprocess dies [1 pull request, 2 comments, 3 participants]

hermes · 2026-04-28 03:14:35

GitHub stats

NousResearch/hermes-agent#16788 • Fetched 2026-04-29 06:39:07

Timeline: labeled ×4, commented ×2, cross-referenced ×1

When a long-running gateway session (hermes gateway run) trips the MCP circuit breaker for a stdio-transport server and the underlying subprocess has died, the breaker's half-open probe never respawns the subprocess — it just retries through a closed pipe. Result: every probe re-fails → breaker re-opens → loop forever. The MCP server is permanently broken for the lifetime of the gateway process.

Verified on v0.11.0 (2026.4.23), single-host deployment, Python 3.11.

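Error Message

~/.hermes/logs/errors.log shows the following line, empty after the colon and repeating roughly every 60s:

MCP tool playwright/browser_navigate call failed:

Root Cause

After the cooldown, the half-open probe reuses the existing session instead of respawning the dead stdio subprocess. The probe call goes through a closed pipe, fails with an empty error, and the breaker re-opens.

Fix Action

Suggested in the issue: in the half-open probe path, check whether the underlying transport is still alive before reusing the existing session, and force a fresh session via the existing reconnect path (_reconnect_event) if it is not.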

Workaround

systemctl --user restart hermes-gateway.service — fully resets the in-process breaker dicts (_server_error_counts, _server_breaker_opened_at). Brief gateway downtime (~10-30s).

PR fix notes

PR #17016: fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003)

Repository: NousResearch/hermes-agent
Author: vominh1919
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17016

Description (problem / solution / changelog)

Problem

Two related MCP reliability issues affect long-running gateway sessions:

1. Circuit breaker permanently blocks recovery (#16788)

When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.

2. HTTP connections go stale during idle periods (#17003)

_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call fails silently with an empty error message.

Fix

Circuit breaker half-open recovery (#16788)

Added _CIRCUIT_BREAKER_COOLDOWN_SEC = 60 — cooldown period before allowing a probe
Added _server_breaker_opened_at — tracks when the breaker tripped
After cooldown elapses, the handler allows one probe call through (half-open state)
If probe succeeds → error count resets, server is usable again
If probe fails → breaker re-opens with fresh cooldown
Added _bump_server_error() and _reset_server_error() helpers for consistent state management

HTTP keepalive (#17003)

_wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
On each timeout, sends a lightweight list_tools() keepalive to exercise the connection
If keepalive fails → triggers automatic reconnect via _reconnect_event
Prevents TCP connections from going stale during long idle periods
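
A rough sketch of the keepalive loop described above, under stated assumptions: the function signature, the two `asyncio.Event` parameters, and the `list_tools` callable are hypothetical stand-ins, and the real `_wait_for_lifecycle_event()` in tools/mcp_tool.py may be structured differently. Only the 3-minute interval and the use of `asyncio.wait()` with a timeout come from the PR description.

```python
import asyncio

KEEPALIVE_INTERVAL_SEC = 180  # 3 minutes, per the PR description


async def wait_for_lifecycle_event(shutdown_event, reconnect_event, list_tools,
                                   interval=KEEPALIVE_INTERVAL_SEC):
    """Block until shutdown or reconnect is requested, pinging while idle.

    Returns "shutdown" or "reconnect". `list_tools` is an async callable
    standing in for a lightweight request on the MCP session.
    """
    while True:
        waiters = {
            asyncio.create_task(shutdown_event.wait()): "shutdown",
            asyncio.create_task(reconnect_event.wait()): "reconnect",
        }
        done, pending = await asyncio.wait(
            waiters, timeout=interval, return_when=asyncio.FIRST_COMPLETED
        )
        # Clean up the waiters that did not fire before looping or returning.
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)

        if done:
            fired = {waiters[task] for task in done}
            return "shutdown" if "shutdown" in fired else "reconnect"

        # Timeout with no event: exercise the connection so it does not go stale.
        try:
            await list_tools()
        except Exception:
            # Keepalive failed: request a reconnect; the next iteration returns it.
            reconnect_event.set()
```

The design choice here is that a failed keepalive does not raise. It only sets the reconnect event, so the caller sees the same "reconnect" outcome whether the request came from elsewhere or from the keepalive.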

Testing

Verified syntax with ast.parse() on the modified file
All existing error count tracking replaced with helper functions for consistency
Both fixes are backward-compatible — no config changes required

Fixes #16788 Fixes #17003

Changed files

tools/mcp_tool.py (modified, +81/-20)

Summary

Verified on v0.11.0 (2026.4.23), single-host deployment, Python 3.11.

Reproduction

1. Configure a stdio MCP server whose subprocess can crash mid-session (e.g. mcr.microsoft.com/playwright/mcp:v0.0.70 via command: docker, where Chrome can OOM under bursty asset loads).
2. Trigger ≥ _CIRCUIT_BREAKER_THRESHOLD (3) consecutive tool-call failures that take down the subprocess. In our case: a Squid forward proxy returned 403 for a CDN, Chrome went into a 387 req/sec retry storm, MCP tool calls timed out.
3. Wait ≥ _CIRCUIT_BREAKER_COOLDOWN_SEC (60s). The breaker enters half-open and fires a probe.

Expected

Probe failure detects "subprocess pipe dead" and respawns it before calling, OR returns clean error to model so it stops calling.

Actual

Probe call goes to the dead pipe → returns empty error
~/.hermes/logs/errors.log shows: MCP tool playwright/browser_navigate call failed: (empty after the colon, repeating every ~60s)
docker ps -a --filter ancestor=<image> returns zero entries — gateway never re-spawns the container
Gateway-bound chats: agent reports a "browser crashed, recovering ~58s" message forever
hermes mcp test playwright from CLI succeeds in <1s — proves the config is correct, the docker run works, and the issue is gateway in-process state

Workaround

systemctl --user restart hermes-gateway.service — fully resets the in-process breaker dicts (_server_error_counts, _server_breaker_opened_at). Brief gateway downtime (~10-30s).

Suggested fix

In tools/mcp_tool.py around the half-open probe path, before reusing the existing session, check whether the underlying transport is still alive (for stdio, the subprocess has exited once its poll() returns a value other than None). If it is not alive, force a fresh session via the existing reconnect path (_reconnect_event).

Refs: _CIRCUIT_BREAKER_THRESHOLD, _CIRCUIT_BREAKER_COOLDOWN_SEC, _bump_server_error, _reset_server_error at tools/mcp_tool.py:1377-1404.
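
To make the liveness check concrete, here is a small self-contained sketch of the `Popen.poll()` semantics the suggestion relies on. The helper name `stdio_transport_is_dead` and the stand-in child process are assumptions for illustration; how the gateway actually holds its subprocess handle is not shown in the issue.

```python
import subprocess
import sys


def stdio_transport_is_dead(proc: subprocess.Popen) -> bool:
    """True once the child has exited.

    Popen.poll() returns None while the process is running and its exit
    code after it has terminated.
    """
    return proc.poll() is not None


def demo() -> tuple[bool, bool]:
    """Spawn a stand-in for a stdio server, kill it, and report liveness."""
    # The child blocks reading stdin, like an idle stdio server would.
    child = subprocess.Popen(
        [sys.executable, "-c", "import sys; sys.stdin.read()"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    try:
        dead_before = stdio_transport_is_dead(child)
        child.kill()
        child.wait()  # reap the process so poll() reports the exit code
        dead_after = stdio_transport_is_dead(child)
    finally:
        child.stdin.close()
        child.stdout.close()
    return dead_before, dead_after
```

Calling `demo()` is expected to report the child as alive before the kill and dead afterwards. A probe path could run such a check first and take the reconnect path instead of writing to a closed pipe.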

Notes

No data loss — once gateway restarts, the persistent profile volume picks up where it left off (logins survive)
Worked around locally with a systemd timer that watchdogs the empty-error pattern in errors.log and restarts the gateway. Happy to share if useful.
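
The watchdog mentioned above is not included in the issue. As a hedged illustration only, detection of the empty-error pattern could look like the sketch below. The regular expression, the threshold, and the function names are assumptions, and the real errors.log line format (timestamps, log levels) may differ. The restart command is the one quoted in the Workaround section.

```python
import re
import subprocess

# Matches the symptom from the issue: nothing after the trailing colon.
EMPTY_MCP_ERROR = re.compile(r"MCP tool \S+ call failed:\s*$")


def count_empty_errors(lines) -> int:
    """Count log lines where an MCP tool call failed with an empty message."""
    return sum(1 for line in lines if EMPTY_MCP_ERROR.search(line))


def should_restart(lines, threshold: int = 3) -> bool:
    """Hypothetical policy: restart after `threshold` empty-error lines."""
    return count_empty_errors(lines) >= threshold


def restart_gateway() -> None:
    """Run the workaround command from the issue (not invoked automatically)."""
    subprocess.run(
        ["systemctl", "--user", "restart", "hermes-gateway.service"],
        check=False,
    )
```

A timer-driven script would read the recent tail of ~/.hermes/logs/errors.log, pass the lines to `should_restart`, and call `restart_gateway()` when it returns True.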

extent analysis

TL;DR

The likely fix involves modifying the tools/mcp_tool.py file to check if the underlying transport is alive before reusing the existing session in the half-open probe path.

Guidance

Verify that the subprocess is indeed dead by checking the docker ps -a --filter ancestor=<image> output and the ~/.hermes/logs/errors.log for empty error messages.
Check the tools/mcp_tool.py file around lines 1377-1404 to understand the current implementation of the circuit breaker and half-open probe path.
Consider implementing a liveness check for the underlying transport before reusing the existing session. For stdio, the subprocess has exited once its poll() returns a value other than None.
Test the suggested fix by triggering the circuit breaker and verifying that the subprocess is respawned correctly.

Example

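# tools/mcp_tool.py (sketch; `proc` and `reconnect_event` are assumed names)
def ensure_transport_alive(proc, reconnect_event):
    # Popen.poll() returns None while the child is still running
    if proc.poll() is not None:
        # Force a fresh session via the existing reconnect path
        # (assumes the reconnect signal is an Event-like object)
        reconnect_event.set()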

Notes

The suggested fix assumes that the issue is specific to the stdio-transport server and may not apply to other types of servers.
The workaround using systemctl --user restart hermes-gateway.service can be used to temporarily resolve the issue, but it may not be suitable for production environments.

Recommendation

Apply the suggested fix by modifying the tools/mcp_tool.py file to check for the underlying transport's liveness before reusing the existing session. This should prevent the circuit breaker from getting stuck in an infinite loop and allow the subprocess to be respawned correctly.
