hermes - ✅ (Solved) Fix [Bug]: MCP server session expires during long-running gateway — no auto-reconnect [3 pull requests, 1 participant]

GitHub stats

NousResearch/hermes-agent#13383 (fetched 2026-04-22 08:06:48)
Comments: 0 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×3

Error Message

ERROR tools.mcp_tool: MCP tool wpcom-mcp/wpcom-mcp-content-authoring call failed: Invalid params: Invalid or expired session
WARNING tools.mcp_tool: Failed to connect to MCP server 'wpcom-mcp': Client error '401 Unauthorized'

Root Cause

The MCP client in tools/mcp_tool.py does not treat "Invalid or expired session" as a reconnect trigger. The 3-attempt retry logic runs only at gateway startup, not on mid-session failures. Session expiry during normal operation falls through as a plain tool error with no recovery path.

Fix Action

Fixed

PR fix notes

PR #13402: fix(mcp): auto-reconnect on expired session during tool call

Description (problem / solution / changelog)

Problem

When the Hermes gateway runs for an extended period, MCP servers using the Streamable HTTP transport lose their server-side session. Subsequent tool calls fail with:

Invalid params: Invalid or expired session

The MCP client does not detect this condition and re-establish the session automatically. The only recovery is a full gateway restart.

Changes

  • _make_tool_handler: Catch session expiry errors and trigger automatic reconnection before retrying the call once
  • _reconnect_server: New helper that shuts down the old connection, cleans up the tool registry, and reinitializes the server
  • _SESSION_EXPIRED_PATTERNS: Detect session expiry error messages from MCP servers
  • Tests: 3 new tests covering reconnect success, reconnect failure, and non-session errors

Verification

  • 133 existing tests pass
  • 3 new tests added and passing (reconnect success, reconnect failure, non-session errors)

Notes

  • The OAuth token remains valid during reconnection; only the transport-layer session needs re-establishment
  • Reconnection is attempted exactly once per failed call to avoid infinite loops
  • The fix applies to both stdio and HTTP transports

Fixes #13383

Changed files

  • scripts/release.py (modified, +1/-0)
  • tests/tools/test_mcp_tool.py (modified, +104/-0)
  • tools/mcp_tool.py (modified, +91/-0)

PR #13406: fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)

Description (problem / solution / changelog)

Fixes #13383.

TL;DR

Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path. Every subsequent call on the affected server failed until the gateway was manually restarted.

The existing _handle_auth_error_and_retry only fires on 401s, which this class of failure never triggers (token is still valid).

Fix: add a sibling _handle_session_expired_and_retry that detects the session-expiry error pattern and drives the existing transport-reconnect mechanism (MCPServerTask._reconnect_event), then retries the call once.

Root cause

In tools/mcp_tool.py, all 5 handler branches follow this pattern:

except Exception as exc:
    recovered = _handle_auth_error_and_retry(...)  # 401-only
    if recovered is not None: return recovered
    # ← session-expired falls through here with no recovery
    return generic_error_json(...)

_is_auth_error only catches:

  • OAuthFlowError / OAuthTokenError / OAuthNonInteractiveError
  • httpx.HTTPStatusError with status_code == 401

When the server returns "Invalid params: Invalid or expired session" as a JSON-RPC error (reporter's exact wpcom-mcp log), it's wrapped in an mcp.McpError. The token is still valid (direct API calls return 200), so 401-based detection never fires.

Fix

New narrow detection + reconnect helper, wired into all 5 handlers:

_SESSION_EXPIRED_MARKERS: tuple = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
)


def _is_session_expired_error(exc: BaseException) -> bool:
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    if not msg:
        return False
    return any(marker in msg for marker in _SESSION_EXPIRED_MARKERS)


def _handle_session_expired_and_retry(server_name, exc, retry_call, op_description):
    # Unlike _handle_auth_error_and_retry, no handle_401 — token is valid.
    # Just trigger the transport reconnect + retry once.
    if not _is_session_expired_error(exc): return None
    # ... set _reconnect_event, wait for ready, retry once ...

Each handler gets one additional 4-line block:

recovered = _handle_auth_error_and_retry(...)          # unchanged
if recovered is not None: return recovered
recovered = _handle_session_expired_and_retry(...)     # new
if recovered is not None: return recovered

Behaviour matrix

| Exception surfaced by MCP SDK | Before | After |
|---|---|---|
| 401 Unauthorized | OAuth recovery → reconnect → retry | unchanged |
| McpError("Invalid or expired session") | generic tool error → stuck until gateway restart | transport reconnect → retry once → success |
| McpError("Session expired") | generic error | reconnect + retry |
| RuntimeError("Tool execution failed") | generic error | generic error (unchanged — narrow scope) |
| InterruptedError | user-cancel path (unchanged) | user-cancel path (unchanged — explicitly excluded) |
| Empty-string exception | generic error | generic error |

Narrow scope — explicitly not changed

  • Detection is string-based on a 5-entry allow-list. MCP SDK exception types vary across versions; message-substring matching is the durable path. Kept narrow to avoid false positives (pinned by test_is_session_expired_rejects_unrelated_errors).
  • Existing 401 recovery flow. Untouched. The new path is consulted only after the auth path declines.
  • Retry count stays at 1. If reconnect+retry also fails, we don't loop — the error surfaces so the model sees the failure rather than a hang.
  • InterruptedError is explicitly excluded from session-expiry detection. User-cancel signals short-circuit identically to before (pinned by dedicated test).
  • Reconnect mechanism itself. Uses the existing _reconnect_event that the 401 path already drives — no new transport code.

Regression coverage

tests/tools/test_mcp_tool_session_expired.py (16 new test cases):

7 unit tests for _is_session_expired_error:

  • Reporter's exact wpcom text ("Invalid params: Invalid or expired session")
  • "Session expired" / "expired session" variants
  • Server GC variants ("session not found", "unknown session")
  • Case-insensitive match
  • Narrow-scope canaries: rejects unrelated RuntimeError / ValueError, rejects 401, rejects InterruptedError, rejects empty-message exceptions.
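
The unit cases listed above can be sketched as two small case tables run against the detection function. This is an illustrative sketch, not the PR's actual test file: the function body is copied from the PR description so the block runs standalone, while `SHOULD_MATCH`, `SHOULD_NOT_MATCH`, and `run_cases` are hypothetical names, and plain `RuntimeError` stands in for `mcp.McpError`.

```python
_SESSION_EXPIRED_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
)


def _is_session_expired_error(exc: BaseException) -> bool:
    # Copied from the PR description so this sketch is self-contained.
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    if not msg:
        return False
    return any(marker in msg for marker in _SESSION_EXPIRED_MARKERS)


# Hypothetical case tables; the real tests may be organised differently.
SHOULD_MATCH = [
    RuntimeError("Invalid params: Invalid or expired session"),  # reporter's text
    RuntimeError("Session expired"),
    RuntimeError("expired session"),
    RuntimeError("session not found"),
    RuntimeError("unknown session"),
    RuntimeError("SESSION EXPIRED"),  # case-insensitive match
]
SHOULD_NOT_MATCH = [
    RuntimeError("Tool execution failed"),
    ValueError("bad argument"),
    RuntimeError("Client error '401 Unauthorized'"),
    InterruptedError("session expired"),  # user-cancel is always excluded
    RuntimeError(""),  # empty message
]


def run_cases() -> dict:
    """Evaluate both tables and report per-case pass/fail flags."""
    return {
        "matched": [_is_session_expired_error(e) for e in SHOULD_MATCH],
        "rejected": [not _is_session_expired_error(e) for e in SHOULD_NOT_MATCH],
    }
```

Every flag in both lists should be `True`; a `False` in `rejected` would indicate the allow-list has become too broad.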

5 integration tests for handler plumbing:

  • Reporter's full repro end-to-end via _make_tool_handler.
  • Preserved-behaviour canary: non-session-expired errors fall through.
  • Defensive: returns None without event loop, without server record, when retry also fails.

4 parametrised tests across all non-tools/call handlers (list_resources, read_resource, list_prompts, get_prompt) confirming they share the recovery pattern.

15 of 16 tests fail on clean origin/main (6fb69229) with:

ImportError: cannot import name '_is_session_expired_error' from 'tools.mcp_tool'

The 1 that passes is an xdist-ordering artefact of worker collection.

Validation

source venv/bin/activate
python -m pytest tests/tools/test_mcp_tool_session_expired.py -q
# 16 passed

Broader MCP suite (5 files):

python -m pytest \
  tests/tools/test_mcp_tool.py \
  tests/tools/test_mcp_tool_401_handling.py \
  tests/tools/test_mcp_tool_session_expired.py \
  tests/tools/test_mcp_reconnect_signal.py \
  tests/tools/test_mcp_oauth.py -q
# 230 passed, 0 regressions

Pre-empted review questions

Q. Why substring detection instead of an exception-type check? MCP SDK exception types (McpError, RpcError, etc.) vary across mcp package versions and are sometimes wrapped in transport-level exceptions (httpx.ReadError, anyio.EndOfStream) when the server closes the connection. Message-substring matching on a narrow allow-list survives SDK upgrades and covers the same error class across MCP server implementations.

Q. Could this cause a retry storm if the server is permanently broken? No — retry count is exactly 1 per tool call. If the reconnect+retry also fails, the error surfaces and the circuit-breaker (_server_error_counts) starts counting failures. The model sees an error and stops retrying.

Q. Why not extend _handle_auth_error_and_retry to handle both cases? They have different recovery semantics — the 401 path calls handle_401 which may refresh tokens or prompt for re-auth; the session-expired path skips all of that because the token is valid. Keeping them as sibling functions makes the intent explicit at each handler call site.

Q. What if InterruptedError carries a session-expired message in args? The isinstance(exc, InterruptedError) short-circuit returns False before any message matching runs. User-cancel signals always route through _interrupted_call_result() first (via the handlers' own except InterruptedError branch) — this check is defense-in-depth for the rare case where the exception chain gets reordered.

Q. What happens to in-flight tool calls during the reconnect? Only the specific tool call that triggered the session-expired error is affected. Other in-flight calls against the same server will also see session-expired on their next operation and trigger the same recovery independently (idempotent: _reconnect_event.set() on an already-set event is a no-op; the reconnect drains + rebuilds once regardless of how many callers signalled it).
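
The idempotency claim in the last answer can be illustrated with plain `asyncio.Event` objects. This is a minimal sketch under assumptions: `reconnect_event` stands in for `_reconnect_event`, while `ready_event`, `server_task`, and `caller` are hypothetical names, and the real server task is likely more involved.

```python
import asyncio


async def demo(num_callers: int = 3) -> int:
    """Several callers signal a reconnect; count how many rebuilds happen."""
    reconnect_event = asyncio.Event()  # stand-in for _reconnect_event
    ready_event = asyncio.Event()      # hypothetical "session usable again" signal
    rebuild_count = 0

    async def server_task() -> None:
        nonlocal rebuild_count
        await reconnect_event.wait()
        reconnect_event.clear()
        await asyncio.sleep(0.01)      # pretend to drain and rebuild the session
        rebuild_count += 1
        ready_event.set()

    async def caller() -> None:
        reconnect_event.set()          # no-op if another caller already set it
        await asyncio.wait_for(ready_event.wait(), timeout=1.0)

    task = asyncio.create_task(server_task())
    await asyncio.gather(*(caller() for _ in range(num_callers)))
    await task
    return rebuild_count
```

`asyncio.run(demo())` returns 1 regardless of how many callers signal, because setting an already-set event changes nothing.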


Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.

Changed files

  • tests/tools/test_mcp_tool_session_expired.py (added, +359/-0)
  • tools/mcp_tool.py (modified, +153/-0)

PR #13795: fix(mcp): reconnect on session identity and transport errors

Description (problem / solution / changelog)

Summary

This PR improves MCP recovery by explicitly separating reconnectable non-auth failures into two categories:

  • session identity errors
  • transport-dead errors

Both categories now trigger the same reconnect flow and retry once.

Auth failures remain a separate path.

Background

The intended MCP recovery behavior is more specific than “if an error mentions session, reconnect”.

There are actually three different failure classes we care about:

  1. Auth failures

    • token / OAuth issues
    • should go through auth recovery / reauth handling
  2. Session identity errors

    • the server-side MCP session id is expired, missing, or unknown
    • should reconnect and retry once
  3. Transport-dead errors

    • the MCP transport / client stream is already dead
    • should also reconnect and retry once

Before this change, the reconnectable non-auth cases were not modeled precisely enough.

In particular, errors like:

  • Session terminated
  • stream closed
  • connection reset
  • server disconnected

should be treated as recoverable transport failures, not auth failures and not ordinary tool/business errors.

What changed

1. Added explicit recoverable error classification

Introduced:

  • _classify_recoverable_error(exc)

It now returns:

  • "auth"
  • "session_identity_error"
  • "transport_dead"
  • None

This makes the control flow reflect the real recovery model directly.

2. Split reconnectable non-auth failures into two explicit buckets

Added:

  • _is_session_identity_error(exc)
  • _is_transport_dead_error(exc)

Session identity error markers

  • invalid or expired session
  • expired session
  • session expired
  • session not found
  • unknown session
  • invalid or missing session id

Transport-dead error markers

  • session terminated
  • stream closed
  • endofstream
  • connection closed
  • connection reset
  • server disconnected
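
A classifier over the two marker lists above might look roughly like the following. This is a sketch, not the PR's code: `FakeAuthError` is a hypothetical stand-in for the OAuth and HTTP 401 exception types, the helper `_matches` is invented for illustration, and the order in which the categories are checked is an assumption.

```python
from typing import Optional, Tuple

_SESSION_IDENTITY_MARKERS: Tuple[str, ...] = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)
_TRANSPORT_DEAD_MARKERS: Tuple[str, ...] = (
    "session terminated",
    "stream closed",
    "endofstream",
    "connection closed",
    "connection reset",
    "server disconnected",
)


class FakeAuthError(Exception):
    """Hypothetical stand-in for the OAuth / HTTP 401 exceptions."""


def _matches(exc: BaseException, markers: Tuple[str, ...]) -> bool:
    # User-cancel signals are never reconnectable.
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    return bool(msg) and any(marker in msg for marker in markers)


def classify_recoverable_error(exc: BaseException) -> Optional[str]:
    """Return "auth", "session_identity_error", "transport_dead", or None."""
    if isinstance(exc, FakeAuthError):  # assumed to be checked first
        return "auth"
    if _matches(exc, _SESSION_IDENTITY_MARKERS):
        return "session_identity_error"
    if _matches(exc, _TRANSPORT_DEAD_MARKERS):
        return "transport_dead"
    return None
```

Anything that returns `None` is treated as an ordinary tool error and is not retried.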

3. Unified reconnect behavior for both reconnectable categories

Added:

  • _reconnect_and_retry_once(...)
  • _handle_reconnectable_error_and_retry(...)

Both reconnectable categories now share the same recovery flow:

  1. signal _reconnect_event
  2. wait for readiness
  3. retry once
  4. reset server error count on successful recovery

This avoids duplicating reconnect logic while still keeping the classification explicit.
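
The four steps above can be sketched as a single coroutine. This is a minimal illustration under stated assumptions: `FakeServer`, `ready_event`, the `error_counts` dict, and the `timeout` parameter are hypothetical, and the real `_reconnect_and_retry_once` may bridge between threads and the MCP event loop rather than run as a plain coroutine.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict, Optional


@dataclass
class FakeServer:
    """Hypothetical server record; the real MCPServerTask holds more state."""
    reconnect_event: asyncio.Event = field(default_factory=asyncio.Event)
    ready_event: asyncio.Event = field(default_factory=asyncio.Event)


async def reconnect_and_retry_once(
    server: FakeServer,
    retry_call: Callable[[], Awaitable[Any]],
    error_counts: Dict[str, int],
    name: str,
    timeout: float = 1.0,
) -> Optional[Any]:
    server.ready_event.clear()
    server.reconnect_event.set()  # 1. signal the reconnect
    try:
        # 2. wait for readiness (bounded, so a dead server cannot hang us)
        await asyncio.wait_for(server.ready_event.wait(), timeout)
        result = await retry_call()  # 3. retry exactly once
    except Exception:
        return None  # no loop: the caller surfaces the original error
    error_counts[name] = 0  # 4. reset the error count on success
    return result


async def demo() -> tuple:
    server = FakeServer()
    counts = {"wpcom-mcp": 2}

    async def lifecycle() -> None:
        await server.reconnect_event.wait()
        server.reconnect_event.clear()
        server.ready_event.set()  # pretend the session was rebuilt

    async def retry_call() -> str:
        return "ok"

    task = asyncio.create_task(lifecycle())
    result = await reconnect_and_retry_once(server, retry_call, counts, "wpcom-mcp")
    await task
    return result, counts["wpcom-mcp"]
```

Returning `None` on any failure mirrors the fallthrough behaviour described in the PR: the handler then reports the original error instead of retrying again.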

4. Kept auth handling separate

Auth failures still go through:

  • _handle_auth_error_and_retry(...)

That keeps credential/token recovery distinct from reconnectable transport/session failures.

5. Routed all MCP handlers through the same recoverable dispatcher

Updated all five MCP handler entry points to use:

  • _handle_recoverable_error_and_retry(...)

Applied to:

  • tool calls
  • list resources
  • read resource
  • list prompts
  • get prompt

This ensures consistent behavior across MCP tools, resources, and prompts.

Why this approach

This PR intentionally does not use a broad heuristic like:

if "session" in msg:
    reconnect

That would be too loose and could reconnect on unrelated business/tool errors.

Instead, this change keeps the logic:

  • explicit
  • whitelist-based
  • category-aware

That makes the behavior more predictable and easier to maintain.
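
To make the false-positive risk concrete, the sketch below compares the broad heuristic with an allow-list check on a made-up business error. The function names and the sample error strings are hypothetical; only the marker list comes from the PR description.

```python
SESSION_IDENTITY_MARKERS = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
    "invalid or missing session id",
)


def broad_match(msg: str) -> bool:
    """The rejected heuristic: any mention of "session" triggers a reconnect."""
    return "session" in msg.lower()


def allow_list_match(msg: str) -> bool:
    """The whitelist approach: only known expiry phrasings trigger a reconnect."""
    lowered = msg.lower()
    return any(marker in lowered for marker in SESSION_IDENTITY_MARKERS)


# Hypothetical tool-level error that merely mentions the word "session".
BUSINESS_ERROR = "Cannot book training session: slot already taken"
EXPIRY_ERROR = "Invalid params: Invalid or expired session"
```

The broad check flags `BUSINESS_ERROR` and would tear down a healthy connection, while the allow-list check flags only `EXPIRY_ERROR`.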

Tests

Added tests/tools/test_mcp_tool_session_expired.py covering:

Classification

  • session identity markers
  • transport-dead markers
  • unrelated errors
  • interrupted errors

Dispatcher / recovery behavior

  • classification into:
    • session_identity_error
    • transport_dead
    • None
  • reconnectable recovery retries successfully
  • graceful fallthrough when:
    • MCP loop is unavailable
    • server record is missing
    • retry fails

Integration

  • tool handler reconnects on:
    • Invalid or expired session
    • Session terminated
  • non-tool handlers also reconnect on both reconnectable categories:
    • list resources
    • read resource
    • list prompts
    • get prompt
  • unrelated failures still surface normal MCP errors

Validation

Ran:

pytest -q tests/tools/test_mcp_tool_session_expired.py tests/tools/test_mcp_tool.py

Result:

193 passed

Reviewer notes

Main things worth checking:

  • whether the session-identity vs transport-dead split matches the intended MCP recovery model
  • whether the marker lists are narrow enough to avoid false positives
  • whether both reconnectable categories should indeed share the same _reconnect_event flow
  • whether the auth vs reconnectable distinction is now clearer in code structure

Changed files

  • tests/tools/test_mcp_tool_session_expired.py (added, +359/-0)
  • tools/mcp_tool.py (modified, +151/-47)

Code Example

Invalid params: Invalid or expired session

---

ERROR tools.mcp_tool: MCP tool wpcom-mcp/wpcom-mcp-content-authoring call failed: Invalid params: Invalid or expired session
WARNING tools.mcp_tool: Failed to connect to MCP server 'wpcom-mcp': Client error '401 Unauthorized'
WARNING tools.mcp_oauth: MCP OAuth for 'wpcom-mcp': non-interactive environment and no cached tokens found.

Bug Description

When the Hermes gateway runs for an extended period, MCP servers using the Streamable HTTP transport lose their server-side session. Subsequent tool calls fail with:

Invalid params: Invalid or expired session

The MCP client does not detect this condition and re-establish the session automatically. The only recovery is a full gateway restart, which interrupts all connected messaging platforms.

Note: This is NOT an OAuth token expiry issue — the access token remains valid (direct API calls return HTTP 200). The failure is at the MCP transport session layer.

Steps to Reproduce

  1. Configure a Streamable HTTP MCP server (e.g. WordPress.com MCP)
  2. Run hermes gateway run and leave it running for several days
  3. Invoke any tool on that MCP server
  4. Observe the error

Expected Behavior

When a tool call returns "Invalid or expired session", the MCP client should automatically re-establish the session using the still-valid credentials and retry the call transparently.

Actual Behavior

Every subsequent tool call on the affected MCP server fails. The server remains broken until the gateway is manually restarted.

Relevant log entries:

ERROR tools.mcp_tool: MCP tool wpcom-mcp/wpcom-mcp-content-authoring call failed: Invalid params: Invalid or expired session
WARNING tools.mcp_tool: Failed to connect to MCP server 'wpcom-mcp': Client error '401 Unauthorized'
WARNING tools.mcp_oauth: MCP OAuth for 'wpcom-mcp': non-interactive environment and no cached tokens found.

Affected Component

  • Tools (MCP client)
  • Agent Core (gateway long-running stability)

Environment

  • OS: Linux x86_64
  • Hermes Version: 0.10.0 (2026.4.16)
  • Python: 3.11.13
  • MCP transport: Streamable HTTP

Root Cause Analysis

The MCP client in tools/mcp_tool.py does not treat "Invalid or expired session" as a reconnect trigger. The 3-attempt retry logic runs only at gateway startup, not on mid-session failures. Session expiry during normal operation falls through as a plain tool error with no recovery path.

Proposed Fix

In the MCP tool call handler, catch "Invalid or expired session" errors, tear down and re-initialize the MCP client for that server, then retry the original call once. The OAuth token remains valid — only the transport-layer session needs to be re-established.

Willing to submit a PR?

Not at this time, but the fix scope is well-defined above.

extent analysis

TL;DR

Catch "Invalid or expired session" errors in the MCP tool call handler and re-establish the transport-layer session by tearing down and re-initializing the MCP client for that server.

Guidance

  • Modify tools/mcp_tool.py to catch "Invalid or expired session" errors and trigger a reconnect.
  • Re-establish the session, then retry the original call exactly once.
  • Confirm that the OAuth token is still valid before re-establishing the session; in this scenario it is, so no token refresh is needed.
  • Consider adding logging to track session re-establishment attempts and successes for monitoring and debugging purposes.

Example

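# call_tool and reconnect are hypothetical callables supplied by the caller.
def call_with_reconnect(call_tool, reconnect):
    try:
        # Original MCP tool call
        return call_tool()
    except Exception as e:
        if "invalid or expired session" not in str(e).lower():
            raise  # unrelated errors propagate unchanged
        # Tear down and re-initialize the MCP client for that server,
        # which re-establishes the transport-layer session
        reconnect()
        # Retry the original call once; a second failure propagates
        return call_tool()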

Notes

The proposed fix assumes that the OAuth token remains valid and that only the transport-layer session needs to be re-established. It does not cover the case where the OAuth token has also expired.

Recommendation

Apply workaround: Implement the proposed fix in the MCP tool call handler to catch "Invalid or expired session" errors and re-establish the transport-layer session. This approach directly addresses the identified root cause and provides a clear recovery path without requiring a full gateway restart.
