hermes - ✅(Solved) Fix MCP HTTP connections go stale after extended idle periods [2 pull requests, 2 comments, 3 participants]

GitHub stats
NousResearch/hermes-agent#17003 · Fetched 2026-04-29 06:37:55
Comments: 2 · Participants: 3 · Timeline: 9 · Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×2, referenced ×1

Long-lived MCP HTTP sessions become stale after extended idle periods (observed ~12h) because _wait_for_lifecycle_event() blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails silently with an empty error message.

Error Message

ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed:


PR fix notes

PR #17016: fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003)

Description (problem / solution / changelog)

Problem

Two related MCP reliability issues affect long-running gateway sessions:

1. Circuit breaker permanently blocks recovery (#16788)

When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.

2. HTTP connections go stale during idle periods (#17003)

_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call fails silently with an empty error message.

Fix

Circuit breaker half-open recovery (#16788)

  • Added _CIRCUIT_BREAKER_COOLDOWN_SEC = 60 — cooldown period before allowing a probe
  • Added _server_breaker_opened_at — tracks when the breaker tripped
  • After cooldown elapses, the handler allows one probe call through (half-open state)
  • If probe succeeds → error count resets, server is usable again
  • If probe fails → breaker re-opens with fresh cooldown
  • Added _bump_server_error() and _reset_server_error() helpers for consistent state management
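
The half-open behaviour listed above can be sketched as a small state machine. This is a minimal illustration, not the code from the PR: the class name HalfOpenBreaker, its method names, and the injectable clock are hypothetical, and only the 3-failure threshold and the 60-second cooldown come from the description.

```python
import time
from typing import Callable, Optional

FAILURE_THRESHOLD = 3  # consecutive failures before the breaker opens (per the PR text)
COOLDOWN_SEC = 60.0    # mirrors _CIRCUIT_BREAKER_COOLDOWN_SEC


class HalfOpenBreaker:
    """Hypothetical stand-in for the per-server breaker state."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.error_count = 0
        # Plays the role of _server_breaker_opened_at; None means "closed".
        self.opened_at: Optional[float] = None

    def allow_call(self) -> bool:
        """Return True when a call may proceed."""
        if self.opened_at is None:
            return True
        now = self._clock()
        if now - self.opened_at < COOLDOWN_SEC:
            return False
        # Half-open: admit this one probe and restart the timer so that
        # other callers stay blocked until the probe result is recorded.
        self.opened_at = now
        return True

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.error_count += 1
        if self.error_count >= FAILURE_THRESHOLD:
            self.opened_at = self._clock()

    def record_success(self) -> None:
        """A successful call closes the breaker and resets the count."""
        self.error_count = 0
        self.opened_at = None
```

Restarting the timer when the probe is admitted is one simple way to let exactly one call through; the PR may track the half-open state differently.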

HTTP keepalive (#17003)

  • _wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
  • On each timeout, sends a lightweight list_tools() keepalive to exercise the connection
  • If keepalive fails → triggers automatic reconnect via _reconnect_event
  • Prevents TCP connections from going stale during long idle periods
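
A self-contained sketch of the wait-with-timeout pattern described above, runnable without Hermes or the MCP SDK. The function name wait_with_keepalive and its parameters are hypothetical; the probe argument stands in for session.list_tools(). Unlike the PR description, this sketch returns "reconnect" directly on a failed probe instead of setting the reconnect event, which is an assumption made to keep the example small.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_with_keepalive(
    shutdown: asyncio.Event,
    reconnect: asyncio.Event,
    probe: Callable[[], Awaitable[object]],
    interval: float = 180.0,
    probe_timeout: float = 30.0,
) -> str:
    """Return "shutdown" or "reconnect", probing the connection when idle."""
    tasks = {
        asyncio.create_task(shutdown.wait()),
        asyncio.create_task(reconnect.wait()),
    }
    try:
        while True:
            done, _ = await asyncio.wait(
                tasks, timeout=interval, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                break  # a lifecycle event fired
            # Timed out with no event: exercise the connection.
            try:
                await asyncio.wait_for(probe(), timeout=probe_timeout)
            except Exception:
                return "reconnect"
    finally:
        # Cancel whichever waiters are still pending and let them finish.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    if shutdown.is_set():
        return "shutdown"
    reconnect.clear()
    return "reconnect"
```

With a probe that raises, the call returns "reconnect" after one interval; with a healthy probe it keeps looping until one of the two events is set.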

Testing

  • Verified syntax with ast.parse() on the modified file
  • All existing error count tracking replaced with helper functions for consistency
  • Both fixes are backward-compatible — no config changes required

Fixes #16788
Fixes #17003

Changed files

  • tools/mcp_tool.py (modified, +81/-20)

PR #17060: fix: resolve 7 identified issues [automated]

Description (problem / solution / changelog)

Summary

This PR resolves 7 issues identified in the Hermes Agent repository.


Issues Resolved

1. #17048: Docker tmpfs size override

Files: tools/environments/docker.py

Problem: spaCy and other tools that download large models fail with ENOSPC on the Docker backend because the default 512MB /tmp limit is insufficient.

Fix: Added the tmp_tmp_size, var_tmp_tmp_size, and run_tmp_size parameters to the DockerEnvironment constructor, with corresponding environment variables (HERMES_DOCKER_TMP_TMP_SIZE, etc.), to allow fine-tuning of the tmpfs limits.


2. #17003: MCP HTTP keepalive

Files: tools/mcp_tool.py

Problem: Long-lived MCP HTTP sessions can become orphaned after ~12h of inactivity when TCP keepalives expire at the OS/LB level, causing the next tool call to fail silently.

Fix: Added a periodic list_tools() probe every 180 seconds inside _wait_for_lifecycle_event. If the probe fails, it triggers a reconnect.


3. #17034: image_edit not exposed in the toolset

Files: tools/image_generation_tool.py, toolsets.py, agent/display.py, hermes_cli/tools_config.py

Problem: The image_edit tool was not registered in the toolset system, so it appeared neither in the tool listing nor in the configurator.

Fix: Implemented the image_edit_tool() function using the FAL image-to-image/edit endpoint and added it to the image_gen toolset, with the corresponding schema, handler, and registry entry.


4. #16964: DingTalk file content crash

Files: gateway/platforms/dingtalk.py

Problem: When DingTalk delivers file content via callback, the message contains a data field that is a string of escaped XML, not a dict. The old code called json.loads(data) expecting a dict, causing a crash.

Fix: Check isinstance(data, str) before parsing; attempt to parse as JSON first, falling back to raw text.
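
The parse-then-fall-back logic in this fix can be sketched as follows. The function name parse_callback_data and the "text" key used for the fallback are hypothetical; the actual adapter may shape its result differently.

```python
import json
from typing import Any, Dict, Union


def parse_callback_data(data: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Normalize a callback payload to a dict without raising on non-JSON strings."""
    if isinstance(data, dict):
        return data
    if not isinstance(data, str):
        raise TypeError(f"unsupported payload type: {type(data).__name__}")
    try:
        parsed = json.loads(data)
    except json.JSONDecodeError:
        # Not JSON (for example escaped XML): keep the raw text.
        return {"text": data}
    # Valid JSON that is not an object (a bare string or number) is
    # also treated as raw text in this sketch.
    return parsed if isinstance(parsed, dict) else {"text": data}
```

A JSON object string is decoded to a dict, while any other string is wrapped unchanged under the hypothetical "text" key.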


5. #17013: QQBot duplicate session entries

Files: gateway/platforms/qqbot/adapter.py

Problem: When the Tencent server resends a message (retry), the old code called self.session.update() on every retry, creating duplicate entries in the history.

Fix: Added a check that skips session.update() when the message ID matches the last one processed.
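
A minimal sketch of the retry deduplication described above. SessionHistory and its fields are hypothetical stand-ins for the adapter's session object, not the real QQBot adapter API.

```python
from typing import List, Optional


class SessionHistory:
    """Hypothetical history container that ignores immediate retries."""

    def __init__(self) -> None:
        self.entries: List[str] = []
        self._last_message_id: Optional[str] = None

    def update(self, message_id: str, text: str) -> bool:
        """Append text unless message_id repeats the last processed ID."""
        if message_id == self._last_message_id:
            return False  # retry of the message just handled: skip
        self.entries.append(text)
        self._last_message_id = message_id
        return True
```

Comparing only against the last processed ID catches back-to-back retries; a retry that arrives after a different message would need a set of recently seen IDs instead.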


6. #16974: Termux shebang/env fix

Files: setup-hermes.sh

Problem: #!/usr/bin/env bash does not work on Termux (bash lives at /data/data/com.termux/files/usr/bin/bash); getprop may not exist, leaving ANDROID_API_LEVEL empty.

Fix: Added set -euo pipefail to the script header; ANDROID_API_LEVEL now uses ${VAR:-$(cmd || echo "29")} to guarantee a fallback.


7. #16938: API server session continuity after compression

Files: gateway/platforms/api_server.py

Problem: When the agent compresses context, it creates a child session ID, but the parent ID was returned in the X-Hermes-Session-Id header, causing clients to send subsequent messages to the wrong session.

Fix: Call db.get_compression_tip() before loading history, and extract agent.session_id from the result to return the correct ID in the header.


Files Modified

  • tools/environments/docker.py: +55 lines, configurable tmpfs
  • tools/mcp_tool.py: +39/-4, keepalive probe
  • tools/image_generation_tool.py: +151, complete image_edit tool
  • toolsets.py: +4, image_edit in the image_gen toolset
  • agent/display.py: +4, image_edit rendering
  • hermes_cli/tools_config.py: +1, image_edit listing
  • gateway/platforms/dingtalk.py: +22, text-type fallback
  • gateway/platforms/qqbot/adapter.py: +12/-7, retry dedup
  • setup-hermes.sh: +3/-2, set -euo pipefail + ANDROID_API_LEVEL
  • gateway/platforms/api_server.py: +10/-1, compression tip + session_id

Branches: Sldark23:fix-7-issues-v2 -> NousResearch/hermes-agent:main

Changed files

  • REPORT-fix-7-issues-2026-04-28.md (added, +178/-0)
  • agent/display.py (modified, +3/-1)
  • agent/file_safety.py (modified, +83/-1)
  • cli.py (modified, +6/-2)
  • gateway/platforms/api_server.py (modified, +10/-1)
  • gateway/platforms/dingtalk.py (modified, +22/-0)
  • gateway/platforms/discord.py (modified, +165/-6)
  • gateway/platforms/qqbot/adapter.py (modified, +12/-7)
  • gateway/run.py (modified, +22/-2)
  • hermes_cli/tools_config.py (modified, +1/-1)
  • run_agent.py (modified, +2/-1)
  • setup-hermes.sh (modified, +3/-2)
  • tools/environments/docker.py (modified, +76/-4)
  • tools/image_generation_tool.py (modified, +151/-0)
  • tools/mcp_tool.py (modified, +39/-4)
  • toolsets.py (modified, +2/-2)

Raw issue text

Summary

Long-lived MCP HTTP sessions become stale after extended idle periods (observed ~12h) because _wait_for_lifecycle_event() blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails silently with an empty error message.

Environment

  • Hermes version: v2026.4.23
  • MCP SDK version: >= 1.24.0 (new HTTP API)
  • Transport: HTTP/StreamableHTTP via streamable_http_client
  • Deployment: Kubernetes with MCP sidecar (supergateway wrapping stdio server)

Observed Behavior

  1. MCP server connects successfully, tools discovered at 20:51
  2. No MCP tool calls for ~12 hours
  3. First tool call at 09:33 fails with empty error:
    ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed:
  4. Subsequent calls also fail until pod restart

Root Cause Analysis

In tools/mcp_tool.py, the _run_http() method:

async with httpx.AsyncClient(**client_kwargs) as http_client:
    async with streamable_http_client(url, http_client=http_client) as (...):
        async with ClientSession(read_stream, write_stream, ...) as session:
            await session.initialize()
            self.session = session
            await self._discover_tools()
            self._ready.set()
            reason = await self._wait_for_lifecycle_event()  # ← blocks forever

The _wait_for_lifecycle_event() method blocks indefinitely waiting for shutdown/reconnect signals. During this time:

  • No reads/writes occur on the httpx connection
  • The read=300.0 timeout only applies to active reads, not idle connections
  • TCP keepalives at the OS/LB level eventually time out (~2h default; on Linux, tcp_keepalive_time is 7200 seconds)
  • The socket becomes stale, but Hermes doesn't detect it

When the next tool call arrives, httpx attempts to use the dead socket and fails at the connection level (before any HTTP exchange), producing an empty error.
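
One plausible explanation for the blank text after "call failed:" is that the exception carried no message, since many connection-level exceptions stringify to an empty string. This is an assumption about how the log line is formatted, not something the issue confirms. A small hypothetical helper, describe_exception, shows how including the exception type keeps such failures visible:

```python
def describe_exception(exc: BaseException) -> str:
    """Return "TypeName: message", or just "TypeName" when the message is empty."""
    message = str(exc)
    name = type(exc).__name__
    return f"{name}: {message}" if message else name


# Exceptions raised without arguments have an empty str().
assert str(TimeoutError()) == ""
assert describe_exception(TimeoutError()) == "TimeoutError"
assert describe_exception(ValueError("bad input")) == "ValueError: bad input"
```

Logging with a helper like this (or with repr(exc)) would at least name the failure class when the message itself is empty.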

Proposed Fix

Add a periodic health check inside _wait_for_lifecycle_event() to exercise the connection:

async def _wait_for_lifecycle_event(self) -> str:
    """Block until shutdown, reconnect, or keepalive interval."""
    KEEPALIVE_INTERVAL = 180  # 3 minutes
    
    shutdown_task = asyncio.create_task(self._shutdown_event.wait())
    reconnect_task = asyncio.create_task(self._reconnect_event.wait())
    
    try:
        while True:
            done, pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=KEEPALIVE_INTERVAL,
                return_when=asyncio.FIRST_COMPLETED,
            )
            
            if done:
                break
                
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
                except Exception as exc:
                    logger.warning(
                        "MCP server '%s' keepalive failed, triggering reconnect: %s",
                        self.name, exc
                    )
                    self._reconnect_event.set()
                    return "reconnect"
    finally:
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
                try:
                    await t
                except (asyncio.CancelledError, Exception):
                    pass

    if self._shutdown_event.is_set():
        return "shutdown"
    self._reconnect_event.clear()
    return "reconnect"

Alternative: Config-driven keepalive

Add a keepalive_interval config option per MCP server:

mcp_servers:
  my_server:
    url: "http://localhost:3001/mcp"
    keepalive_interval: 180  # seconds, 0 to disable
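
A sketch of how such an option might be read, assuming the server entry is available as a plain mapping. The helper name keepalive_timeout and the 180-second default are illustrative; only the keepalive_interval key and the "0 to disable" convention come from the proposal above.

```python
from typing import Any, Mapping, Optional

DEFAULT_KEEPALIVE_INTERVAL = 180.0  # seconds, matching the proposed default


def keepalive_timeout(server_cfg: Mapping[str, Any]) -> Optional[float]:
    """Map keepalive_interval to a timeout; None means keepalive is disabled."""
    interval = float(server_cfg.get("keepalive_interval", DEFAULT_KEEPALIVE_INTERVAL))
    if interval < 0:
        raise ValueError("keepalive_interval must be >= 0")
    # asyncio.wait(..., timeout=None) waits with no timeout, so returning
    # None for 0 restores the original block-until-event behaviour.
    return interval if interval > 0 else None
```

The returned value can be passed straight to the timeout argument of asyncio.wait().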

Workaround

Until this is fixed, users can work around it with a cron job that periodically calls an MCP tool:

cron:
  mcp-keepalive:
    schedule: "*/3 * * * *"
    prompt: "Call <mcp_tool> to verify connection. Only respond if error."
    silent: true

Impact

  • Severity: Medium — MCP tools become unavailable after idle periods
  • Frequency: Affects any deployment with HTTP MCP servers and gaps > ~2h between tool calls
  • Recovery: Automatic reconnect logic exists but isn't triggered (no exception thrown until tool call)

Related

  • Circuit breaker logic in _bump_server_error() / _reset_server_error() handles repeated failures but doesn't prevent the initial stale connection
  • Reconnect logic in run() handles exceptions properly, just needs a trigger

Root cause analysis assisted by GitHub Copilot CLI

extent analysis

TL;DR

Implement a periodic health check inside _wait_for_lifecycle_event() to exercise the connection and prevent it from becoming stale.

Guidance

  • Modify the _wait_for_lifecycle_event() method to include a keepalive interval, as proposed in the fix, to periodically exercise the connection and prevent staleness.
  • Consider adding a keepalive_interval config option per MCP server to make the keepalive interval configurable.
  • As a temporary workaround, set up a cron job to periodically call an MCP tool and keep the connection active.
  • Review the circuit breaker logic in _bump_server_error() and _reset_server_error() to ensure it handles repeated failures correctly.

Example

The proposed fix includes an example implementation of the modified _wait_for_lifecycle_event() method:

async def _wait_for_lifecycle_event(self) -> str:
    # ...
    try:
        while True:
            # ...
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
                except Exception as exc:
                    # ...

Notes

The proposed fix assumes that the list_tools() method is a suitable keepalive operation. If this is not the case, an alternative keepalive operation should be used.

Recommendation

Apply the proposed fix by modifying the _wait_for_lifecycle_event() method to include a keepalive interval, as this will prevent the connection from becoming stale and fix the issue.
