hermes: Unify MCP HTTP recovery around connection lifecycle and error classification

hermes, 2026-05-01 02:34:35

NousResearch/hermes-agent#18165 • Fetched 2026-05-01 05:53:35

Author

versun

Summary

Hermes currently handles several MCP HTTP recovery cases as separate fixes:

- expired server-side MCP sessions
- terminated HTTP MCP streams
- stale idle HTTP connections
- timeout / empty-message failures after long-lived sessions
- protocol or endpoint mismatches that should not be retried as transport failures

These issues appear related at the user level: MCP tools stop working until Hermes is restarted or the MCP connection is manually refreshed.

I would like to propose an umbrella architecture issue before submitting a staged PR series, so the recovery model can be agreed on before implementation.

Related Issues / PRs

- #13383: MCP server session expires during long-running gateway, no auto-reconnect
- #15125: merged fix for `Invalid or expired session`
- #13795: open PR for `Session terminated` transport recovery
- #17662: open PR for `TimeoutError` after stale StreamableHTTP sessions
- #17003: open issue for MCP HTTP connections going stale after idle periods
- #17244: related but different (SSE discovery / endpoint mismatch, not a normal reconnect case)

Problem

The current recovery behavior is spread across MCP operation handlers and specific error-message checks.

That makes each new failure mode require another targeted patch, for example:

- Invalid or expired session
- Session terminated
- TimeoutError
- empty error messages after idle disconnects
- stream closed
- connection reset
- server disconnected

This works for individual cases, but it does not provide a clear recovery model for long-running MCP HTTP connections.

It also increases the risk of two opposite mistakes:

- Not reconnecting when the MCP transport/session is genuinely dead.
- Reconnecting when the error is actually an auth failure, tool/business error, or protocol/configuration problem.

Proposed Direction

Introduce a unified MCP runtime recovery layer with clear separation of responsibilities.

Suggested boundaries:

| Component | Responsibility |
| --- | --- |
| Error Classifier | Classify failures as auth, session identity, transport dead, timeout/stale, protocol/config error, or ordinary tool error |
| Operation Executor | Route all MCP operations through one recovery path |
| Connection Manager | Own reconnect, ready waiting, and single-flight reconnect coordination |
| Health Monitor | Detect idle/stale HTTP sessions proactively or before the next operation |
| Tool Registry Adapter | Keep existing Hermes tool registration behavior separate from transport recovery |

All MCP operations should use the same executor:

- tool calls
- list resources
- read resource
- list prompts
- get prompt
- tool discovery / refresh where applicable
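A minimal sketch of what such a shared executor could look like, assuming each MCP operation can be wrapped as a zero-argument coroutine function. The names `execute_with_recovery`, `is_recoverable`, and `reconnect` are hypothetical and are not existing Hermes APIs; the classifier and the reconnect routine are passed in so the executor stays independent of both.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def execute_with_recovery(
    operation: Callable[[], Awaitable[T]],
    is_recoverable: Callable[[BaseException], bool],
    reconnect: Callable[[], Awaitable[None]],
) -> T:
    """Run an MCP operation; on a recoverable failure, reconnect and retry once.

    All three parameters are hypothetical hooks for illustration.
    """
    try:
        return await operation()
    except Exception as exc:
        # Auth, protocol/config, and tool errors are re-raised untouched.
        if not is_recoverable(exc):
            raise
    # Exactly one reconnect and one retry; a second failure propagates.
    await reconnect()
    return await operation()
```

Tool calls, resource reads, and prompt lookups would each pass their own `operation` closure, so the retry rule lives in one place instead of in every handler.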

Recovery Model

Suggested behavior:

| Failure class | Expected handling |
| --- | --- |
| OAuth/auth failure | Use existing OAuth recovery / reauth flow |
| Session identity error | Reconnect MCP session and retry once |
| Transport-dead error | Reconnect MCP transport/session and retry once |
| Timeout / stale idle connection | Reconnect and retry once if classified as transport/session failure |
| Protocol or endpoint mismatch | Do not retry as transport recovery; return a clear diagnostic |
| MCP tool/business error | Do not reconnect; surface the tool error normally |
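The table above could be encoded as a plain lookup so the policy is data rather than branching logic. This is only a sketch: the `FailureClass` and `Action` enums and the `action_for` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class FailureClass(Enum):
    AUTH = "auth"
    SESSION_IDENTITY = "session_identity"
    TRANSPORT_DEAD = "transport_dead"
    TIMEOUT_STALE = "timeout_stale"
    PROTOCOL_CONFIG = "protocol_config"
    TOOL_ERROR = "tool_error"


class Action(Enum):
    REAUTH = "use existing OAuth recovery"
    RECONNECT_RETRY_ONCE = "reconnect and retry once"
    DIAGNOSE = "return a clear diagnostic, no retry"
    SURFACE = "surface the tool error, no reconnect"


# One entry per row of the recovery table (hypothetical encoding).
RECOVERY_POLICY = {
    FailureClass.AUTH: Action.REAUTH,
    FailureClass.SESSION_IDENTITY: Action.RECONNECT_RETRY_ONCE,
    FailureClass.TRANSPORT_DEAD: Action.RECONNECT_RETRY_ONCE,
    FailureClass.TIMEOUT_STALE: Action.RECONNECT_RETRY_ONCE,
    FailureClass.PROTOCOL_CONFIG: Action.DIAGNOSE,
    FailureClass.TOOL_ERROR: Action.SURFACE,
}


def action_for(failure: FailureClass) -> Action:
    """Look up the handling for a classified failure."""
    return RECOVERY_POLICY[failure]
```

Keeping the mapping exhaustive means a new failure class cannot be added without deciding how it is handled.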

Important: classification should remain allow-list based. A broad rule like `if "session" in message: reconnect` would be too risky.

Examples of Recoverable Markers

Session identity errors:

- Invalid or expired session
- expired session
- session expired
- session not found
- unknown session
- invalid or missing session id

Transport-dead errors:

- Session terminated
- stream closed
- EndOfStream
- connection closed
- connection reset
- server disconnected

Timeout / stale cases:

- TimeoutError
- empty error after idle StreamableHTTP disconnect

These should remain separate from auth failures such as `401 Unauthorized`.
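An allow-list classifier over these markers might look like the sketch below. It is illustrative only: `FailureClass`, `classify`, and the `AUTH_MARKERS` tuple are hypothetical, the auth check is an assumption based on the single `401 Unauthorized` example, and the protocol/config class and the empty-error case are left out because no markers are listed for them here.

```python
import asyncio
from enum import Enum


class FailureClass(Enum):
    AUTH = "auth"
    SESSION_IDENTITY = "session_identity"
    TRANSPORT_DEAD = "transport_dead"
    TIMEOUT_STALE = "timeout_stale"
    TOOL_ERROR = "tool_error"


# Markers copied from the lists above, lower-cased for matching.
SESSION_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)
TRANSPORT_MARKERS = (
    "session terminated",
    "stream closed",
    "endofstream",
    "connection closed",
    "connection reset",
    "server disconnected",
)
# Assumption: auth failures are recognizable by this marker.
AUTH_MARKERS = ("unauthorized",)


def classify(exc: BaseException) -> FailureClass:
    """Classify by exact allow-listed phrases, never by a bare 'session' match."""
    # Include the type name so exceptions with an empty message still match.
    text = f"{type(exc).__name__}: {exc}".lower()
    if any(marker in text for marker in AUTH_MARKERS):
        return FailureClass.AUTH
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return FailureClass.TIMEOUT_STALE
    if any(marker in text for marker in SESSION_MARKERS):
        return FailureClass.SESSION_IDENTITY
    if any(marker in text for marker in TRANSPORT_MARKERS):
        return FailureClass.TRANSPORT_DEAD
    # Anything not on an allow-list is treated as an ordinary tool error.
    return FailureClass.TOOL_ERROR
```

With this shape, a tool error such as "user session quota exceeded" falls through to `TOOL_ERROR` and does not trigger a reconnect, which is the behavior a broad substring rule would get wrong.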

Non-goals

This proposal should not:

- replace the existing OAuth recovery path
- trigger reauth for normal transport/session failures
- retry arbitrary tool/business errors
- hide permanent configuration problems
- introduce unbounded retries
- make every error mentioning “session” reconnect automatically

The retry policy should stay conservative: reconnect and retry once, then surface the error if recovery fails.

Suggested PR Series

I propose splitting this into small PRs:

1. Add centralized MCP error classification without changing behavior.
2. Add a unified MCP operation executor and route existing MCP operations through it.
3. Add a connection manager with single-flight reconnect and consistent ready waiting.
4. Extend recovery coverage for `Session terminated`, timeout/stale, and transport-dead cases.
5. Add idle health monitoring / pre-operation stale connection handling.
6. Add protocol/endpoint diagnostics so cases like SSE discovery mismatch are not treated as reconnectable transport failures.

Acceptance Criteria

- `Invalid or expired session` still reconnects and retries once.
- `Session terminated` reconnects and retries once.
- Stale/idle HTTP transport failures reconnect and retry once when safely classified.
- A successful retry resets the relevant server failure state.
- OAuth/auth failures remain separate from transport/session recovery.
- Unrelated MCP tool errors do not trigger reconnect.
- Protocol/endpoint mismatches produce clear diagnostics instead of retry loops.
- Recovery behavior is shared across:
  - tool calls
  - list resources
  - read resource
  - list prompts
  - get prompt
- Concurrent failures on the same server do not trigger multiple overlapping reconnects.
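The last criterion (no overlapping reconnects) is commonly met with a single-flight pattern. The sketch below uses an `asyncio.Lock` plus a generation counter; the `ConnectionManager` class shown here and its `connect` callback are hypothetical and only illustrate the coordination, not how Hermes manages connections.

```python
import asyncio
from typing import Awaitable, Callable


class ConnectionManager:
    """Hypothetical single-flight reconnect coordinator for one MCP server."""

    def __init__(self, connect: Callable[[], Awaitable[None]]) -> None:
        self._connect = connect
        self._lock = asyncio.Lock()
        # Incremented after every successful reconnect.
        self.generation = 0

    async def reconnect(self, seen_generation: int) -> None:
        """Reconnect unless another caller already replaced the connection.

        `seen_generation` is the generation the caller was using when its
        operation failed.
        """
        async with self._lock:
            if self.generation != seen_generation:
                # Someone else reconnected while we waited for the lock.
                return
            await self._connect()
            self.generation += 1
```

A caller would record `manager.generation` before running an operation and pass that value to `reconnect` after a recoverable failure. If `connect` raises, the generation is not advanced, so the next waiter makes its own attempt; a fuller version would probably share the failure with all waiters instead.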

extent analysis

TL;DR

Implement a unified MCP runtime recovery layer with clear separation of responsibilities to handle various failure modes in a consistent and reliable manner.

Guidance

- Introduce an Error Classifier to categorize failures into auth, session identity, transport dead, timeout/stale, protocol/config error, or ordinary tool error.
- Create a unified MCP operation executor to route all MCP operations through one recovery path, ensuring consistent handling of failures.
- Develop a Connection Manager to own reconnect, ready waiting, and single-flight reconnect coordination, preventing multiple overlapping reconnects.
- Implement a Health Monitor to detect idle/stale HTTP sessions proactively or before the next operation, allowing for timely recovery.
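For the Health Monitor item, a pre-operation staleness check can be as small as tracking the time of the last successful use. This is a sketch under assumptions: `IdleTracker` and the `max_idle_seconds` threshold are hypothetical, and the issue does not specify what idle period should count as stale.

```python
import time
from typing import Callable


class IdleTracker:
    """Hypothetical idle tracker for one MCP HTTP connection."""

    def __init__(
        self,
        max_idle_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._last_used = clock()

    def mark_used(self) -> None:
        """Record a successful operation on the connection."""
        self._last_used = self._clock()

    def is_stale(self) -> bool:
        """True when the connection has been idle longer than the threshold."""
        return self._clock() - self._last_used > self._max_idle
```

An executor could call `is_stale()` before an operation and reconnect proactively, rather than waiting for a timeout or an empty error from a dead stream.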

Example

No code snippet is provided as the issue focuses on architectural changes and high-level design.

Notes

The proposed solution requires a staged PR series to ensure a smooth transition and minimize disruptions. It's essential to maintain a conservative retry policy, reconnecting and retrying once before surfacing errors if recovery fails.

Recommendation

Adopt the proposed direction and introduce a unified MCP runtime recovery layer. It gives failure handling a clear structure, reduces the risk of incorrect reconnects, and should improve reliability for long-running MCP HTTP connections.
