hermes: Unify MCP HTTP recovery around connection lifecycle and error classification

GitHub stats
NousResearch/hermes-agent#18165 (fetched 2026-05-01 05:53:35)
Comments: 0 · Participants: 1 · Timeline: 3 (labeled ×3) · Reactions: 0

Summary

Hermes currently handles several MCP HTTP recovery cases as separate fixes:

  • expired server-side MCP sessions
  • terminated HTTP MCP streams
  • stale idle HTTP connections
  • timeout / empty-message failures after long-lived sessions
  • protocol or endpoint mismatches that should not be retried as transport failures

These issues appear related at the user level: MCP tools stop working until Hermes is restarted or the MCP connection is manually refreshed.

I would like to propose an umbrella architecture issue before submitting a staged PR series, so the recovery model can be agreed on before implementation.

Related Issues / PRs

  • #13383 — MCP server session expires during long-running gateway, no auto-reconnect
  • #15125 — merged fix for Invalid or expired session
  • #13795 — open PR for Session terminated transport recovery
  • #17662 — open PR for TimeoutError after stale StreamableHTTP sessions
  • #17003 — open issue for MCP HTTP connections going stale after idle periods
  • #17244 — related but different: SSE discovery / endpoint mismatch, not a normal reconnect case

Problem

The current recovery behavior is spread across MCP operation handlers and specific error-message checks.

That makes each new failure mode require another targeted patch, for example:

  • Invalid or expired session
  • Session terminated
  • TimeoutError
  • empty error messages after idle disconnects
  • stream closed
  • connection reset
  • server disconnected

This works for individual cases, but it does not provide a clear recovery model for long-running MCP HTTP connections.

It also increases the risk of two opposite mistakes:

  1. Not reconnecting when the MCP transport/session is genuinely dead.
  2. Reconnecting when the error is actually an auth failure, tool/business error, or protocol/configuration problem.

Proposed Direction

Introduce a unified MCP runtime recovery layer with clear separation of responsibilities.

Suggested boundaries:

| Component | Responsibility |
| --- | --- |
| Error Classifier | Classify failures as auth, session identity, transport dead, timeout/stale, protocol/config error, or ordinary tool error |
| Operation Executor | Route all MCP operations through one recovery path |
| Connection Manager | Own reconnect, ready waiting, and single-flight reconnect coordination |
| Health Monitor | Detect idle/stale HTTP sessions proactively or before the next operation |
| Tool Registry Adapter | Keep existing Hermes tool registration behavior separate from transport recovery |

All MCP operations should use the same executor:

  • tool calls
  • list resources
  • read resource
  • list prompts
  • get prompt
  • tool discovery / refresh where applicable
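The single recovery path described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `execute_with_recovery`, `operation`, `reconnect`, and `is_recoverable` are hypothetical names, not existing Hermes APIs. Each MCP operation (tool call, list resources, and so on) would be passed in as a zero-argument coroutine function.

```python
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def execute_with_recovery(
    operation: Callable[[], Awaitable[T]],
    reconnect: Callable[[], Awaitable[None]],
    is_recoverable: Callable[[Exception], bool],
) -> T:
    """Run one MCP operation; on a recoverable failure, reconnect and retry once.

    All three callables are hypothetical hooks supplied by the caller.
    """
    try:
        return await operation()
    except Exception as exc:
        # Auth, protocol/config, and tool errors are surfaced unchanged.
        if not is_recoverable(exc):
            raise
    # Recoverable failure: reconnect, then make exactly one more attempt.
    # A second failure propagates to the caller (no unbounded retries).
    await reconnect()
    return await operation()
```

Routing every operation through one function like this keeps the retry budget in a single place, so a new failure mode only requires a change to the classifier rather than to each handler.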

Recovery Model

Suggested behavior:

| Failure class | Expected handling |
| --- | --- |
| OAuth/auth failure | Use existing OAuth recovery / reauth flow |
| Session identity error | Reconnect MCP session and retry once |
| Transport-dead error | Reconnect MCP transport/session and retry once |
| Timeout / stale idle connection | Reconnect and retry once if classified as transport/session failure |
| Protocol or endpoint mismatch | Do not retry as transport recovery; return a clear diagnostic |
| MCP tool/business error | Do not reconnect; surface the tool error normally |

Important: classification should remain allow-list based. A broad rule like if "session" in message: reconnect would be too risky.
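One way to keep the recovery model explicit is to encode it as data, so that every failure class has exactly one handling decision. The sketch below assumes hypothetical names (`FailureClass`, `Handling`, `RECOVERY_POLICY`, `handling_for`); none of them come from the Hermes codebase.

```python
from enum import Enum


class FailureClass(Enum):
    AUTH = "auth"
    SESSION_IDENTITY = "session_identity"
    TRANSPORT_DEAD = "transport_dead"
    TIMEOUT_STALE = "timeout_stale"
    PROTOCOL_CONFIG = "protocol_config"
    TOOL_ERROR = "tool_error"


class Handling(Enum):
    REAUTH = "use existing OAuth recovery / reauth flow"
    RECONNECT_RETRY_ONCE = "reconnect and retry once"
    DIAGNOSE = "return a clear diagnostic, do not retry"
    SURFACE = "surface the tool error normally"


# Mirrors the recovery model proposed above; one entry per failure class.
RECOVERY_POLICY = {
    FailureClass.AUTH: Handling.REAUTH,
    FailureClass.SESSION_IDENTITY: Handling.RECONNECT_RETRY_ONCE,
    FailureClass.TRANSPORT_DEAD: Handling.RECONNECT_RETRY_ONCE,
    FailureClass.TIMEOUT_STALE: Handling.RECONNECT_RETRY_ONCE,
    FailureClass.PROTOCOL_CONFIG: Handling.DIAGNOSE,
    FailureClass.TOOL_ERROR: Handling.SURFACE,
}


def handling_for(failure: FailureClass) -> Handling:
    """Look up the expected handling for a classified failure."""
    return RECOVERY_POLICY[failure]
```

A table-driven policy makes it easy to review in a PR that no class silently falls through to a reconnect.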

Examples of Recoverable Markers

Session identity errors:

Invalid or expired session
expired session
session expired
session not found
unknown session
invalid or missing session id

Transport-dead errors:

Session terminated
stream closed
EndOfStream
connection closed
connection reset
server disconnected

Timeout / stale cases:

TimeoutError
empty error after idle StreamableHTTP disconnect

These should remain separate from auth failures such as 401 Unauthorized.
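The marker lists above suggest an allow-list classifier along the following lines. This is a minimal sketch, not the Hermes implementation: the function name `classify`, the `was_idle` flag, the string return values, and the `AUTH_MARKERS` tuple are all assumptions made for illustration. Protocol/endpoint mismatch detection is omitted because the issue does not list markers for it.

```python
import asyncio

# Allow-listed markers copied from the lists above, lowercased for matching.
SESSION_IDENTITY_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)
TRANSPORT_DEAD_MARKERS = (
    "session terminated",
    "stream closed",
    "endofstream",
    "connection closed",
    "connection reset",
    "server disconnected",
)
# Assumption: real code would more likely inspect the HTTP status or a
# dedicated exception type than match on the message text.
AUTH_MARKERS = ("401 unauthorized",)


def classify(exc: BaseException, was_idle: bool = False) -> str:
    """Classify a failure using explicit markers only (no broad 'session' rule)."""
    message = str(exc).strip().lower()
    if any(marker in message for marker in AUTH_MARKERS):
        return "auth"
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout_stale"
    if any(marker in message for marker in SESSION_IDENTITY_MARKERS):
        return "session_identity"
    if any(marker in message for marker in TRANSPORT_DEAD_MARKERS):
        return "transport_dead"
    # An empty message only counts as stale when the caller knows the
    # connection had been idle; otherwise stay conservative.
    if not message and was_idle:
        return "timeout_stale"
    return "tool_error"
```

Because matching is limited to the listed phrases, a tool error that merely mentions a session, such as "user session quota exceeded", falls through to the default and does not trigger a reconnect.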

Non-goals

This proposal should not:

  • replace the existing OAuth recovery path
  • trigger reauth for normal transport/session failures
  • retry arbitrary tool/business errors
  • hide permanent configuration problems
  • introduce unbounded retries
  • make every error mentioning “session” reconnect automatically

The retry policy should stay conservative: reconnect and retry once, then surface the error if recovery fails.

Suggested PR Series

I propose splitting this into small PRs:

  1. Add centralized MCP error classification without changing behavior.
  2. Add a unified MCP operation executor and route existing MCP operations through it.
  3. Add a connection manager with single-flight reconnect and consistent ready waiting.
  4. Extend recovery coverage for Session terminated, timeout/stale, and transport-dead cases.
  5. Add idle health monitoring / pre-operation stale connection handling.
  6. Add protocol/endpoint diagnostics so cases like SSE discovery mismatch are not treated as reconnectable transport failures.
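For step 5, a pre-operation stale check can be as small as tracking the time of the last successful activity. The sketch below is an assumption-laden illustration: `IdleTracker`, `max_idle_seconds`, and the injectable `clock` are hypothetical, and the right idle threshold would depend on the server and any proxies in between.

```python
import time
from typing import Callable


class IdleTracker:
    """Tracks how long an MCP HTTP connection has gone without activity."""

    def __init__(
        self,
        max_idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_activity = clock()

    def mark_activity(self) -> None:
        """Call after every successful request or response."""
        self._last_activity = self._clock()

    def is_stale(self) -> bool:
        """True when the connection has been idle longer than the threshold."""
        return self._clock() - self._last_activity > self._max_idle
```

An executor could consult `is_stale()` before an operation and reconnect first, instead of waiting for a timeout or an empty error on the next call.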

Acceptance Criteria

  • Invalid or expired session still reconnects and retries once.
  • Session terminated reconnects and retries once.
  • stale/idle HTTP transport failures reconnect and retry once when safely classified.
  • successful retry resets the relevant server failure state.
  • OAuth/auth failures remain separate from transport/session recovery.
  • unrelated MCP tool errors do not trigger reconnect.
  • protocol/endpoint mismatches produce clear diagnostics instead of retry loops.
  • recovery behavior is shared across:
    • tool calls
    • list resources
    • read resource
    • list prompts
    • get prompt
  • concurrent failures on the same server do not trigger multiple overlapping reconnects.
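The last criterion, no overlapping reconnects for the same server, is commonly met with a single-flight pattern: the first caller starts the reconnect and later callers await the same task. The following is a sketch using standard `asyncio` primitives; `ConnectionManager` and its `connect` callback are hypothetical names, not existing Hermes classes.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class ConnectionManager:
    """Coordinates reconnects so concurrent failures share one attempt."""

    def __init__(self, connect: Callable[[], Awaitable[None]]) -> None:
        self._connect = connect  # hypothetical hook that re-establishes the session
        self._inflight: Optional[asyncio.Future] = None

    async def reconnect(self) -> None:
        # Start a new reconnect only if none is currently running.
        if self._inflight is None or self._inflight.done():
            self._inflight = asyncio.ensure_future(self._connect())
        # shield() keeps the shared reconnect alive if one waiter is cancelled.
        await asyncio.shield(self._inflight)
```

If the shared reconnect raises, every waiter sees the same exception, which fits the policy of surfacing the error once recovery fails.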

Extent analysis

TL;DR

Implement a unified MCP runtime recovery layer with clear separation of responsibilities to handle various failure modes in a consistent and reliable manner.

Guidance

  • Introduce an Error Classifier to categorize failures into auth, session identity, transport dead, timeout/stale, protocol/config error, or ordinary tool error.
  • Create a unified MCP operation executor to route all MCP operations through one recovery path, ensuring consistent handling of failures.
  • Develop a Connection Manager to own reconnect, ready waiting, and single-flight reconnect coordination, preventing multiple overlapping reconnects.
  • Implement a Health Monitor to detect idle/stale HTTP sessions proactively or before the next operation, allowing for timely recovery.

Example

No code snippet is provided as the issue focuses on architectural changes and high-level design.

Notes

The issue proposes a staged PR series so the change can land incrementally, starting with classification that does not alter behavior. The retry policy should stay conservative: reconnect and retry once, then surface the error if recovery fails.

Recommendation

Adopt the proposed unified MCP runtime recovery layer. This is an architectural change rather than a workaround: it replaces per-message patches with one structured recovery path, which reduces the risk of both missed reconnects and incorrect reconnects.
