hermes - ✅(Solved) Fix [Bug]: MCP server session expires during long-running gateway — no auto-reconnect [3 pull requests, 1 participants]

hermes · 2026-04-21 07:12:21

GitHub stats

NousResearch/hermes-agent#13383 • Fetched 2026-04-22 08:06:48

Author

Alex-wuhu

Participants

Alex-wuhu

Timeline (top)

labeled ×4, cross-referenced ×3

Error Message

ERROR tools.mcp_tool: MCP tool wpcom-mcp/wpcom-mcp-content-authoring call failed: Invalid params: Invalid or expired session
WARNING tools.mcp_tool: Failed to connect to MCP server 'wpcom-mcp': Client error '401 Unauthorized'

Root Cause

The MCP client in tools/mcp_tool.py does not treat "Invalid or expired session" as a reconnect trigger. The 3-attempt retry logic runs only at gateway startup, not on mid-session failures. Session expiry during normal operation falls through as a plain tool error with no recovery path.

Fix Action

Fixed

Fixed by PR: fix(mcp): auto-reconnect on expired session during tool call (https://github.com/NousResearch/hermes-agent/pull/13402)
Fixed by PR: fix(mcp): auto-reconnect + retry once when the transport session expires (#13383) (https://github.com/NousResearch/hermes-agent/pull/13406)

PR fix notes

PR #13402: fix(mcp): auto-reconnect on expired session during tool call

Repository: NousResearch/hermes-agent
Author: HiddenPuppy
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13402

Description (problem / solution / changelog)

Problem

When the Hermes gateway runs for an extended period, MCP servers using the Streamable HTTP transport lose their server-side session. Subsequent tool calls fail with:

Invalid params: Invalid or expired session

The MCP client does not detect this condition and re-establish the session automatically. The only recovery is a full gateway restart.

Changes

_make_tool_handler: Catch session expiry errors and trigger automatic reconnection before retrying the call once
_reconnect_server: New helper that shuts down the old connection, cleans up the tool registry, and reinitializes the server
_SESSION_EXPIRED_PATTERNS: Detect session expiry error messages from MCP servers
Tests: 3 new tests covering reconnect success, reconnect failure, and non-session errors
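
The reconnect helper described above (shut down the old connection, clean up the tool registry, reinitialize the server) might look roughly like the sketch below. The PR's actual `_reconnect_server` body is not shown on this page, so everything here is an assumption for illustration: the `ServerRegistry` class, the `Connection` protocol, and the `server/tool` key format are all hypothetical.

```python
from typing import Callable, Dict, List, Protocol


class Connection(Protocol):
    """Hypothetical minimal interface for a live MCP server connection."""

    def list_tools(self) -> List[str]: ...
    def close(self) -> None: ...


class ServerRegistry:
    """Hypothetical holder for live connections and their registered tools."""

    def __init__(self, connect: Callable[[str], Connection]) -> None:
        self._connect = connect
        self.connections: Dict[str, Connection] = {}
        self.tools: Dict[str, Connection] = {}

    def reconnect_server(self, name: str) -> Connection:
        # 1. Shut down the old connection, if there is one.
        old = self.connections.pop(name, None)
        if old is not None:
            old.close()
        # 2. Remove stale registry entries that belonged to this server.
        prefix = f"{name}/"
        for key in [k for k in self.tools if k.startswith(prefix)]:
            del self.tools[key]
        # 3. Reinitialize the server and re-register its tools.
        conn = self._connect(name)
        self.connections[name] = conn
        for tool in conn.list_tools():
            self.tools[prefix + tool] = conn
        return conn
```

Cleaning the registry before re-registering keeps a tool that disappeared on the server side from lingering as a dead entry after the reconnect.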

Verification

133 existing tests pass
3 new tests added and passing

Notes

The OAuth token remains valid during reconnection; only the transport-layer session needs re-establishment
Reconnection is attempted exactly once per failed call to avoid infinite loops
The fix applies to both stdio and HTTP transports

Fixes #13383

Changed files

scripts/release.py (modified, +1/-0)
tests/tools/test_mcp_tool.py (modified, +104/-0)
tools/mcp_tool.py (modified, +91/-0)

PR #13406: fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)

Repository: NousResearch/hermes-agent
Author: briandevans
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13406

Description (problem / solution / changelog)

Fixes #13383.

TL;DR

Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path. Every subsequent call on the affected server failed until the gateway was manually restarted.

The existing _handle_auth_error_and_retry only fires on 401s, which this class of failure never triggers (token is still valid).

Fix: add a sibling _handle_session_expired_and_retry that detects the session-expiry error pattern and drives the existing transport-reconnect mechanism (MCPServerTask._reconnect_event), then retries the call once.

Root cause

In tools/mcp_tool.py, all 5 handler branches follow this pattern:

except Exception as exc:
    recovered = _handle_auth_error_and_retry(...)  # 401-only
    if recovered is not None: return recovered
    # ← session-expired falls through here with no recovery
    return generic_error_json(...)

_is_auth_error only catches:

OAuthFlowError / OAuthTokenError / OAuthNonInteractiveError
httpx.HTTPStatusError with status_code == 401

When the server returns "Invalid params: Invalid or expired session" as a JSON-RPC error (reporter's exact wpcom-mcp log), it's wrapped in an mcp.McpError. The token is still valid (direct API calls return 200), so 401-based detection never fires.

Fix

New narrow detection + reconnect helper, wired into all 5 handlers:

_SESSION_EXPIRED_MARKERS: tuple = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
)


def _is_session_expired_error(exc: BaseException) -> bool:
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    if not msg:
        return False
    return any(marker in msg for marker in _SESSION_EXPIRED_MARKERS)


def _handle_session_expired_and_retry(server_name, exc, retry_call, op_description):
    # Unlike _handle_auth_error_and_retry, no handle_401 — token is valid.
    # Just trigger the transport reconnect + retry once.
    if not _is_session_expired_error(exc): return None
    # ... set _reconnect_event, wait for ready, retry once ...

Each handler gets one additional 4-line block:

recovered = _handle_auth_error_and_retry(...)          # unchanged
if recovered is not None: return recovered
recovered = _handle_session_expired_and_retry(...)     # new
if recovered is not None: return recovered
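
The call-site pattern above rests on a simple contract: each recovery helper returns None when it declines an error, and the first non-None result wins. A minimal sketch of that chain follows. The names `run_with_recovery`, `recoveries`, and `generic_error` are hypothetical, and the real handlers inline these blocks rather than looping over a list.

```python
from typing import Callable, Optional, Sequence

# A recovery helper returns a result if it handled the error, else None.
Recovery = Callable[[BaseException], Optional[str]]


def run_with_recovery(
    call: Callable[[], str],
    recoveries: Sequence[Recovery],
    generic_error: Callable[[BaseException], str],
) -> str:
    """Run `call`; on failure, offer the error to each recovery helper in order."""
    try:
        return call()
    except Exception as exc:
        for recover in recoveries:  # order matters: auth first, then session expiry
            recovered = recover(exc)
            if recovered is not None:
                return recovered
        return generic_error(exc)  # no helper claimed the error
```

Because the session-expired helper is consulted only after the auth helper declines, the existing 401 flow keeps priority and stays untouched.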

Behaviour matrix

Exception surfaced by MCP SDK	Before	After
401 Unauthorized	OAuth recovery → reconnect → retry	unchanged
`McpError("Invalid or expired session")`	generic tool error → stuck until gateway restart	transport reconnect → retry once → success
`McpError("Session expired")`	generic error	reconnect + retry
`RuntimeError("Tool execution failed")`	generic error	generic error (unchanged — narrow scope)
`InterruptedError`	user-cancel path (unchanged)	user-cancel path (unchanged — explicitly excluded)
Empty-string exception	generic error	generic error

Narrow scope — explicitly not changed

Detection is string-based on a 5-entry allow-list. MCP SDK exception types vary across versions; message-substring matching is the durable path. Kept narrow to avoid false positives (pinned by test_is_session_expired_rejects_unrelated_errors).
Existing 401 recovery flow. Untouched. The new path is consulted only after the auth path declines.
Retry count stays at 1. If reconnect+retry also fails, we don't loop — the error surfaces so the model sees the failure rather than a hang.
InterruptedError is explicitly excluded from session-expiry detection. User-cancel signals short-circuit identically to before (pinned by dedicated test).
Reconnect mechanism itself. Uses the existing _reconnect_event that the 401 path already drives — no new transport code.

Regression coverage

tests/tools/test_mcp_tool_session_expired.py — 16 new test cases:

7 unit tests for _is_session_expired_error:

Reporter's exact wpcom text ("Invalid params: Invalid or expired session")
"Session expired" / "expired session" variants
Server GC variants ("session not found", "unknown session")
Case-insensitive match
Narrow-scope canaries: rejects unrelated RuntimeError / ValueError, rejects 401, rejects InterruptedError, rejects empty-message exceptions.

5 integration tests for handler plumbing:

Reporter's full repro end-to-end via _make_tool_handler.
Preserved-behaviour canary: non-session-expired errors fall through.
Defensive: returns None without event loop, without server record, when retry also fails.

4 parametrised tests across all non-tools/call handlers (list_resources, read_resource, list_prompts, get_prompt) confirming they share the recovery pattern.

15 of 16 tests fail on clean origin/main (6fb69229) with:

ImportError: cannot import name '_is_session_expired_error' from 'tools.mcp_tool'

The 1 that passes is an xdist-ordering artefact of worker collection.

Validation

source venv/bin/activate
python -m pytest tests/tools/test_mcp_tool_session_expired.py -q
# 16 passed

Broader MCP suite (5 files):

python -m pytest \
  tests/tools/test_mcp_tool.py \
  tests/tools/test_mcp_tool_401_handling.py \
  tests/tools/test_mcp_tool_session_expired.py \
  tests/tools/test_mcp_reconnect_signal.py \
  tests/tools/test_mcp_oauth.py -q
# 230 passed, 0 regressions

Pre-empted review questions

Q. Why substring detection instead of an exception-type check? MCP SDK exception types (McpError, RpcError, etc.) vary across mcp package versions and are sometimes wrapped in transport-level exceptions (httpx.ReadError, anyio.EndOfStream) when the server closes the connection. Message-substring matching on a narrow allow-list survives SDK upgrades and covers the same error class across MCP server implementations.

Q. Could this cause a retry storm if the server is permanently broken? No — retry count is exactly 1 per tool call. If the reconnect+retry also fails, the error surfaces and the circuit-breaker (_server_error_counts) starts counting failures. The model sees an error and stops retrying.

Q. Why not extend _handle_auth_error_and_retry to handle both cases? They have different recovery semantics — the 401 path calls handle_401 which may refresh tokens or prompt for re-auth; the session-expired path skips all of that because the token is valid. Keeping them as sibling functions makes the intent explicit at each handler call site.

Q. What if InterruptedError carries a session-expired message in args? The isinstance(exc, InterruptedError) short-circuit returns False before any message matching runs. User-cancel signals always route through _interrupted_call_result() first (via the handlers' own except InterruptedError branch) — this check is defense-in-depth for the rare case where the exception chain gets reordered.

Q. What happens to in-flight tool calls during the reconnect? Only the specific tool call that triggered the session-expired error is affected. Other in-flight calls against the same server will also see session-expired on their next operation and trigger the same recovery independently (idempotent: _reconnect_event.set() on an already-set event is a no-op; the reconnect drains + rebuilds once regardless of how many callers signalled it).
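
The idempotency claim in the last answer can be illustrated with a plain `asyncio.Event`. This is a sketch under the assumption that `_reconnect_event` behaves like an `asyncio.Event` and that a single server task consumes it. The `server_task` function and the `reconnects` counter are hypothetical stand-ins for the real drain-and-rebuild logic.

```python
import asyncio


async def demo() -> int:
    """Return how many reconnects ran after three callers signalled."""
    reconnect_event = asyncio.Event()
    reconnects = 0

    async def server_task() -> None:
        nonlocal reconnects
        await reconnect_event.wait()
        reconnect_event.clear()
        reconnects += 1  # stand-in for drain + rebuild of the transport

    task = asyncio.create_task(server_task())
    # Three callers hit session-expired and signal before the server task runs.
    for _ in range(3):
        reconnect_event.set()  # setting an already-set event is a no-op
    await task
    return reconnects
```

Running `asyncio.run(demo())` returns 1: three signals collapse into a single reconnect.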

Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.

Changed files

tests/tools/test_mcp_tool_session_expired.py (added, +359/-0)
tools/mcp_tool.py (modified, +153/-0)

PR #13795: fix(mcp): reconnect on session identity and transport errors

Repository: NousResearch/hermes-agent
Author: qwertysc
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13795

Description (problem / solution / changelog)

Summary

This PR improves MCP recovery by explicitly separating reconnectable non-auth failures into two categories:

session identity errors
transport-dead errors

Both categories now trigger the same reconnect flow and retry once.

Auth failures remain a separate path.

Background

The intended MCP recovery behavior is more specific than “if an error mentions session, reconnect”.

There are actually three different failure classes we care about:

Auth failures
- token / OAuth issues
- should go through auth recovery / reauth handling
Session identity errors
- the server-side MCP session id is expired, missing, or unknown
- should reconnect and retry once
Transport-dead errors
- the MCP transport / client stream is already dead
- should also reconnect and retry once

Before this change, the reconnectable non-auth cases were not modeled precisely enough.

In particular, errors like:

Session terminated
stream closed
connection reset
server disconnected

should be treated as recoverable transport failures, not auth failures and not ordinary tool/business errors.

What changed

1. Added explicit recoverable error classification

Introduced:

_classify_recoverable_error(exc)

It now returns:

"auth"
"session_identity_error"
"transport_dead"
None

This makes the control flow reflect the real recovery model directly.

2. Split reconnectable non-auth failures into two explicit buckets

Added:

_is_session_identity_error(exc)
_is_transport_dead_error(exc)

Session identity error markers

invalid or expired session
expired session
session expired
session not found
unknown session
invalid or missing session id

Transport-dead error markers

session terminated
stream closed
endofstream
connection closed
connection reset
server disconnected
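
A minimal sketch of how the two marker lists could drive classification is shown below. It covers only the two reconnectable buckets. The `"auth"` result of `_classify_recoverable_error` is left out because it depends on OAuth exception types not shown here. The function name `classify_reconnectable` is hypothetical, and excluding `InterruptedError` is an assumption carried over from the approach in PR #13406.

```python
from typing import Optional

SESSION_IDENTITY_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)

TRANSPORT_DEAD_MARKERS = (
    "session terminated",
    "stream closed",
    "endofstream",
    "connection closed",
    "connection reset",
    "server disconnected",
)


def classify_reconnectable(exc: BaseException) -> Optional[str]:
    """Return the reconnectable category for `exc`, or None if it is not one."""
    if isinstance(exc, InterruptedError):
        return None  # assumption: user-cancel signals are never reconnect triggers
    msg = str(exc).lower()
    if any(marker in msg for marker in SESSION_IDENTITY_MARKERS):
        return "session_identity_error"
    if any(marker in msg for marker in TRANSPORT_DEAD_MARKERS):
        return "transport_dead"
    return None
```

For example, `classify_reconnectable(RuntimeError("Session terminated"))` yields `"transport_dead"`, while an unrelated `RuntimeError("Tool execution failed")` yields `None` and falls through to normal error handling.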

3. Unified reconnect behavior for both reconnectable categories

Added:

_reconnect_and_retry_once(...)
_handle_reconnectable_error_and_retry(...)

Both reconnectable categories now share the same recovery flow:

signal _reconnect_event
wait for readiness
retry once
reset server error count on successful recovery

This avoids duplicating reconnect logic while still keeping the classification explicit.
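
The four-step flow can be sketched as a single coroutine. This is an illustration under stated assumptions, not the PR's code: the parameter names, the separate `ready_event`, the `timeout` default, and the plain dict used for error counts are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def reconnect_and_retry_once(
    server_name: str,
    reconnect_event: asyncio.Event,
    ready_event: asyncio.Event,
    retry_call: Callable[[], Awaitable[str]],
    error_counts: Dict[str, int],
    timeout: float = 10.0,
) -> Optional[str]:
    """Signal a reconnect, wait until ready, retry once; None means 'not recovered'."""
    ready_event.clear()
    reconnect_event.set()  # 1. signal the reconnect
    try:
        await asyncio.wait_for(ready_event.wait(), timeout)  # 2. wait for readiness
        result = await retry_call()  # 3. retry exactly once
    except Exception:
        return None  # timeout or second failure: let the caller surface the error
    error_counts[server_name] = 0  # 4. reset the error count after recovery
    return result
```

Returning None on any failure lets the caller fall through to its normal error path, so a broken server produces one visible error rather than a retry loop.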

4. Kept auth handling separate

Auth failures still go through:

_handle_auth_error_and_retry(...)

That keeps credential/token recovery distinct from reconnectable transport/session failures.

5. Routed all MCP handlers through the same recoverable dispatcher

Updated all five MCP handler entry points to use:

_handle_recoverable_error_and_retry(...)

Applied to:

tool calls
list resources
read resource
list prompts
get prompt

This ensures consistent behavior across MCP tools, resources, and prompts.

Why this approach

This PR intentionally does not use a broad heuristic like:

if "session" in msg:
    reconnect

That would be too loose and could reconnect on unrelated business/tool errors.

Instead, this change keeps the logic:

explicit
whitelist-based
category-aware

That makes the behavior more predictable and easier to maintain.
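
To make the false-positive risk concrete, the sketch below compares the broad check with an allow-list check on a business error that merely mentions a session. The error message and both function names are hypothetical; the marker tuple is the session identity list from this PR.

```python
SESSION_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)


def broad_match(msg: str) -> bool:
    """The rejected heuristic: any mention of 'session' triggers a reconnect."""
    return "session" in msg.lower()


def allowlist_match(msg: str) -> bool:
    """The allow-list approach: only known session-expiry phrases match."""
    lowered = msg.lower()
    return any(marker in lowered for marker in SESSION_MARKERS)


# Hypothetical business error that has nothing to do with the MCP transport.
business_error = "Cannot publish: editing session is locked by another user"

assert broad_match(business_error)          # would reconnect needlessly
assert not allowlist_match(business_error)  # correctly left alone
```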

Tests

Added tests/tools/test_mcp_tool_session_expired.py covering:

Classification

session identity markers
transport-dead markers
unrelated errors
interrupted errors

Dispatcher / recovery behavior

classification into:
- session_identity_error
- transport_dead
- None
reconnectable recovery retries successfully
graceful fallthrough when:
- MCP loop is unavailable
- server record is missing
- retry fails

Integration

tool handler reconnects on:
- Invalid or expired session
- Session terminated
non-tool handlers also reconnect on both reconnectable categories:
- list resources
- read resource
- list prompts
- get prompt
unrelated failures still surface normal MCP errors

Validation

Ran:

pytest -q tests/tools/test_mcp_tool_session_expired.py tests/tools/test_mcp_tool.py

Result:

193 passed

Reviewer notes

Main things worth checking:

whether the session-identity vs transport-dead split matches the intended MCP recovery model
whether the marker lists are narrow enough to avoid false positives
whether both reconnectable categories should indeed share the same _reconnect_event flow
whether the auth vs reconnectable distinction is now clearer in code structure

Changed files

tests/tools/test_mcp_tool_session_expired.py (added, +359/-0)
tools/mcp_tool.py (modified, +151/-47)

Bug Description

When the Hermes gateway runs for an extended period, MCP servers using the Streamable HTTP transport lose their server-side session. Subsequent tool calls fail with:

Invalid params: Invalid or expired session

The MCP client does not detect this condition and re-establish the session automatically. The only recovery is a full gateway restart, which interrupts all connected messaging platforms.

Note: This is NOT an OAuth token expiry issue — the access token remains valid (direct API calls return HTTP 200). The failure is at the MCP transport session layer.

Steps to Reproduce

Configure a Streamable HTTP MCP server (e.g. WordPress.com MCP)
Run hermes gateway run and leave it running for several days
Invoke any tool on that MCP server
Observe the error

Expected Behavior

When a tool call returns "Invalid or expired session", the MCP client should automatically re-establish the session using the still-valid credentials and retry the call transparently.

Actual Behavior

Every subsequent tool call on the affected MCP server fails. The server remains broken until the gateway is manually restarted.

Relevant log entries:

ERROR tools.mcp_tool: MCP tool wpcom-mcp/wpcom-mcp-content-authoring call failed: Invalid params: Invalid or expired session
WARNING tools.mcp_tool: Failed to connect to MCP server 'wpcom-mcp': Client error '401 Unauthorized'
WARNING tools.mcp_oauth: MCP OAuth for 'wpcom-mcp': non-interactive environment and no cached tokens found.

Affected Component

Tools (MCP client)
Agent Core (gateway long-running stability)

Environment

OS: Linux x86_64
Hermes Version: 0.10.0 (2026.4.16)
Python: 3.11.13
MCP transport: Streamable HTTP

Root Cause Analysis

Proposed Fix

In the MCP tool call handler, catch "Invalid or expired session" errors, tear down and re-initialize the MCP client for that server, then retry the original call once. The OAuth token remains valid — only the transport-layer session needs to be re-established.

Willing to submit a PR?

Not at this time, but the fix scope is well-defined above.

extent analysis

TL;DR

Catch "Invalid or expired session" errors in the MCP tool call handler and re-establish the transport-layer session by tearing down and re-initializing the MCP client for that server.

Guidance

Modify tools/mcp_tool.py to catch "Invalid or expired session" errors and trigger a reconnect.
Implement a retry mechanism to re-establish the session and retry the original call once.
Check that the OAuth token is still valid before re-establishing the session; in this scenario it is, so the reconnect can reuse it without re-authentication.
Consider adding logging to track session re-establishment attempts and successes for monitoring and debugging purposes.

Example

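def call_with_session_recovery(call_tool, reconnect):
    # call_tool and reconnect are placeholder callables supplied by the caller
    try:
        return call_tool()  # Original MCP tool call
    except Exception as e:
        if "invalid or expired session" not in str(e).lower():
            raise  # unrelated errors surface unchanged
        # Tear down and re-initialize the MCP client for that server,
        # re-establishing the transport-layer session
        reconnect()
        # Retry the original call once; a second failure propagates
        return call_tool()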

Notes

The proposed fix assumes that the OAuth token remains valid and only the transport-layer session needs to be re-established. This solution may not apply if the OAuth token expiry is also a concern.

Recommendation

Apply workaround: Implement the proposed fix in the MCP tool call handler to catch "Invalid or expired session" errors and re-establish the transport-layer session. This approach directly addresses the identified root cause and provides a clear recovery path without requiring a full gateway restart.
