hermes - ✅(Solved) Fix MCP HTTP connections go stale after extended idle periods [2 pull requests, 2 comments, 3 participants]

hermes • 2026-04-28 12:42:36

GitHub stats

NousResearch/hermes-agent#17003 • Fetched 2026-04-29 06:37:55

Timeline (top): labeled ×4, commented ×2, cross-referenced ×2, referenced ×1

Long-lived MCP HTTP sessions become stale after extended idle periods (observed ~12h) because _wait_for_lifecycle_event() blocks indefinitely without generating any keepalive traffic. The next tool call after the idle period fails silently with an empty error message.

Error Message

The first tool call after the idle period logs an error with nothing after the colon:

ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed:

Root Cause

In tools/mcp_tool.py, the _run_http() method:

async with httpx.AsyncClient(**client_kwargs) as http_client:
    async with streamable_http_client(url, http_client=http_client) as (...):
        async with ClientSession(read_stream, write_stream, ...) as session:
            await session.initialize()
            self.session = session
            await self._discover_tools()
            self._ready.set()
            reason = await self._wait_for_lifecycle_event()  # ← blocks forever

The _wait_for_lifecycle_event() method blocks indefinitely waiting for shutdown/reconnect signals. During this time:

No reads or writes occur on the httpx connection
The read=300.0 timeout only applies to active reads, not to idle connections
TCP keepalives at the OS/LB level eventually time out (~2h default)
The socket becomes stale, but Hermes doesn't detect it

When the next tool call arrives, httpx attempts to use the dead socket and fails at the connection level (before any HTTP exchange), producing an empty error.
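One plausible reason the logged message is empty (an assumption, since the issue does not show the logging call) is that the exception is formatted with str(), and exceptions raised without arguments stringify to an empty string. The sketch below shows the effect and a hypothetical describe_exception helper, which is not part of Hermes, that falls back to the exception type name.

```python
import asyncio


def describe_exception(exc: BaseException) -> str:
    """Return a non-empty description even when str(exc) is empty.

    Hypothetical helper for illustration; not part of Hermes.
    """
    text = str(exc)
    name = type(exc).__name__
    return f"{name}: {text}" if text else name


# An exception raised without arguments stringifies to "".
empty = asyncio.TimeoutError()
print(repr(str(empty)))            # what a "%s" log placeholder would receive
print(describe_exception(empty))   # falls back to the exception type name
print(describe_exception(ConnectionResetError("peer closed the socket")))
```

Logging the type name alongside the message keeps connection-level failures distinguishable even when the exception carries no text.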

Fix Action

Workaround

Until this is fixed, users can work around it with a cron job that periodically calls an MCP tool:

cron:
  mcp-keepalive:
    schedule: "*/3 * * * *"
    prompt: "Call <mcp_tool> to verify connection. Only respond if error."
    silent: true

PR fix notes

PR #17016: fix: MCP circuit breaker recovery and HTTP keepalive (#16788, #17003)

Repository: NousResearch/hermes-agent
Author: vominh1919
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17016

Description (problem / solution / changelog)

Problem

Two related MCP reliability issues affect long-running gateway sessions:

1. Circuit breaker permanently blocks recovery (#16788)

When the MCP circuit breaker trips (3 consecutive failures), it blocks the server permanently with no recovery mechanism. If the underlying subprocess dies and later becomes available again, the breaker never allows a probe call through. The gateway must be restarted to recover.

2. HTTP connections go stale during idle periods (#17003)

_wait_for_lifecycle_event() blocks indefinitely without generating any traffic. After extended idle periods (~12h), TCP connections become stale. The next tool call fails silently with an empty error message.

Fix

Circuit breaker half-open recovery (#16788)

Added _CIRCUIT_BREAKER_COOLDOWN_SEC = 60 — cooldown period before allowing a probe
Added _server_breaker_opened_at — tracks when the breaker tripped
After cooldown elapses, the handler allows one probe call through (half-open state)
If probe succeeds → error count resets, server is usable again
If probe fails → breaker re-opens with fresh cooldown
Added _bump_server_error() and _reset_server_error() helpers for consistent state management
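The half-open behavior listed above can be sketched as a small state machine. This is a minimal illustration, not the PR's code: the class name HalfOpenBreaker, its methods, and the injectable clock are all assumptions; only the threshold of 3 failures and the 60-second cooldown come from the PR notes.

```python
import time
from typing import Callable, Optional

FAILURE_THRESHOLD = 3   # consecutive failures before the breaker opens
COOLDOWN_SEC = 60.0     # mirrors _CIRCUIT_BREAKER_COOLDOWN_SEC in the PR notes


class HalfOpenBreaker:
    """Illustrative breaker for one server; not the Hermes implementation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.error_count = 0
        self.opened_at: Optional[float] = None
        self._probe_in_flight = False

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True                     # closed: calls pass through
        if self._probe_in_flight:
            return False                    # only one probe at a time
        if self._clock() - self.opened_at >= COOLDOWN_SEC:
            self._probe_in_flight = True    # half-open: let one probe through
            return True
        return False                        # open: still cooling down

    def record_success(self) -> None:
        self.error_count = 0
        self.opened_at = None
        self._probe_in_flight = False

    def record_failure(self) -> None:
        self.error_count += 1
        self._probe_in_flight = False
        if self.error_count >= FAILURE_THRESHOLD:
            self.opened_at = self._clock()  # open, or re-open with a fresh cooldown


# Usage with a fake clock so the cooldown can be stepped manually.
now = [0.0]
breaker = HalfOpenBreaker(clock=lambda: now[0])
for _ in range(FAILURE_THRESHOLD):
    breaker.record_failure()
print(breaker.allow_call())   # breaker is open, call is blocked
now[0] = 61.0
print(breaker.allow_call())   # cooldown elapsed, one probe is allowed
breaker.record_failure()
print(breaker.allow_call())   # failed probe re-opened the breaker
```

Injecting the clock keeps the cooldown testable without sleeping; a real implementation would also need to decide how concurrent callers share the single probe slot.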

HTTP keepalive (#17003)

_wait_for_lifecycle_event() now uses asyncio.wait() with a 3-minute timeout
On each timeout, sends a lightweight list_tools() keepalive to exercise the connection
If keepalive fails → triggers automatic reconnect via _reconnect_event
Prevents TCP connections from going stale during long idle periods

Testing

Verified syntax with ast.parse() on the modified file
All existing error count tracking replaced with helper functions for consistency
Both fixes are backward-compatible — no config changes required

Fixes #16788 Fixes #17003

Changed files

tools/mcp_tool.py (modified, +81/-20)

PR #17060: fix: resolve 7 identified issues [automated]

Repository: NousResearch/hermes-agent
Author: Sldark23
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17060

Description (problem / solution / changelog)

Summary

This PR resolves 7 issues identified in the Hermes Agent repository.

Issues Resolved

1. #17048 — Docker tmpfs size override

Files: tools/environments/docker.py

Problem: spaCy and other tools that download large models fail with ENOSPC on the Docker backend because the default 512MB limit on /tmp is insufficient.

Fix: Added tmp_tmp_size, var_tmp_tmp_size, and run_tmp_size parameters to the DockerEnvironment constructor, plus corresponding environment variables (HERMES_DOCKER_TMP_TMP_SIZE, etc.), to allow fine-tuning of the tmpfs limits.

2. #17003 — MCP HTTP keepalive

Files: tools/mcp_tool.py

Problem: Long-lived MCP HTTP sessions can become orphaned after ~12h of inactivity when TCP keepalives expire at the OS/LB level, causing a silent failure on the next tool call.

Fix: Added a periodic list_tools() probe every 180 seconds inside _wait_for_lifecycle_event. If the probe fails, it triggers a reconnect.

3. #17034 — image_edit not exposed in the toolset

Files: tools/image_generation_tool.py, toolsets.py, agent/display.py, hermes_cli/tools_config.py

Problem: The image_edit tool was not registered in the toolsets system, so it appeared neither in the tool listing nor in the configurator.

Fix: Implemented the image_edit_tool() function using the FAL image-to-image/edit endpoint and added it to the image_gen toolset, with the corresponding schema, handler, and registry entry.

4. #16964 — DingTalk file content crash

Files: gateway/platforms/dingtalk.py

Problem: When DingTalk delivers file content via callback, the message contains a data field that is a string of escaped XML, not a dict. The old code called json.loads(data) expecting a dict, which caused a crash.

Fix: Check isinstance(data, str) before parsing; attempt to parse as JSON first, falling back to raw text.
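The parsing order described in this fix (type check, then JSON, then raw text) can be sketched as below. The function name parse_callback_data is hypothetical, and the adapter's real code is not shown in the PR notes, so treat this as an illustration of the fallback order only.

```python
import json
from typing import Any


def parse_callback_data(data: Any) -> Any:
    """Hypothetical helper: accept a dict, a JSON string, or raw text."""
    if not isinstance(data, str):
        return data                  # already structured (for example a dict)
    try:
        return json.loads(data)      # JSON-encoded payload
    except json.JSONDecodeError:
        return data                  # escaped XML or other plain text


print(parse_callback_data({"fileName": "a.txt"}))
print(parse_callback_data('{"fileName": "a.txt"}'))
print(parse_callback_data("&lt;file&gt;a.txt&lt;/file&gt;"))
```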

5. #17013 — QQBot duplicate session entries

Files: gateway/platforms/qqbot/adapter.py

Problem: When the Tencent server resends a message (retry), the old code called self.session.update() on every retry, creating duplicate entries in the history.

Fix: Added a check that skips session.update() when the message ID matches the last one processed.
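A minimal sketch of this deduplication check follows. SessionHistory and update_once are stand-in names invented for illustration; the adapter's actual session object and method signatures are not shown in the PR notes.

```python
from typing import List, Optional


class SessionHistory:
    """Stand-in for the adapter's session object (hypothetical)."""

    def __init__(self) -> None:
        self.entries: List[str] = []
        self.last_message_id: Optional[str] = None

    def update_once(self, message_id: str, text: str) -> bool:
        """Record a message unless it repeats the last processed ID."""
        if message_id == self.last_message_id:
            return False             # retry of the same delivery: skip
        self.entries.append(text)
        self.last_message_id = message_id
        return True


history = SessionHistory()
history.update_once("m1", "hello")
history.update_once("m1", "hello")   # server retry, ignored
history.update_once("m2", "world")
print(history.entries)
```

Comparing against only the last ID handles immediate retries; retries that arrive after a different message would need a small set of recent IDs instead.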

6. #16974 — Termux shebang/env fix

Files: setup-hermes.sh

Problem: #!/usr/bin/env bash does not work on Termux (bash lives at /data/data/com.termux/files/usr/bin/bash); getprop may not exist, leaving ANDROID_API_LEVEL empty.

Fix: Added set -euo pipefail to the script header; ANDROID_API_LEVEL now uses ${VAR:-$(cmd || echo "29")} to guarantee a fallback.

7. #16938 — API server session continuity after compression

Files: gateway/platforms/api_server.py

Problem: When the agent compresses context, it creates a child session ID, but the parent ID was returned in the X-Hermes-Session-Id header, causing clients to resend messages to the wrong session.

Fix: Call db.get_compression_tip() before loading history and extract agent.session_id from the result so the correct ID is returned in the header.

Modified Files

File	Changes
`tools/environments/docker.py`	+55 lines: configurable tmpfs
`tools/mcp_tool.py`	+39/-4: keepalive probe
`tools/image_generation_tool.py`	+151: complete image_edit tool
`toolsets.py`	+4: image_edit in the image_gen toolset
`agent/display.py`	+4: image_edit rendering
`hermes_cli/tools_config.py`	+1: image_edit listing
`gateway/platforms/dingtalk.py`	+22: text-type fallback
`gateway/platforms/qqbot/adapter.py`	+12/-7: retry dedup
`setup-hermes.sh`	+3/-2: set -euo pipefail + ANDROID_API_LEVEL
`gateway/platforms/api_server.py`	+10/-1: compression tip + session_id

Branches: Sldark23:fix-7-issues-v2 -> NousResearch/hermes-agent:main

Changed files

REPORT-fix-7-issues-2026-04-28.md (added, +178/-0)
agent/display.py (modified, +3/-1)
agent/file_safety.py (modified, +83/-1)
cli.py (modified, +6/-2)
gateway/platforms/api_server.py (modified, +10/-1)
gateway/platforms/dingtalk.py (modified, +22/-0)
gateway/platforms/discord.py (modified, +165/-6)
gateway/platforms/qqbot/adapter.py (modified, +12/-7)
gateway/run.py (modified, +22/-2)
hermes_cli/tools_config.py (modified, +1/-1)
run_agent.py (modified, +2/-1)
setup-hermes.sh (modified, +3/-2)
tools/environments/docker.py (modified, +76/-4)
tools/image_generation_tool.py (modified, +151/-0)
tools/mcp_tool.py (modified, +39/-4)
toolsets.py (modified, +2/-2)

Environment

Hermes version: v2026.4.23
MCP SDK version: >= 1.24.0 (new HTTP API)
Transport: HTTP/StreamableHTTP via streamable_http_client
Deployment: Kubernetes with MCP sidecar (supergateway wrapping stdio server)

Observed Behavior

MCP server connects successfully, tools discovered at 20:51
No MCP tool calls for ~12 hours

First tool call at 09:33 fails with empty error:

ERROR tools.mcp_tool: MCP tool canny/canny_get_post call failed:

Subsequent calls also fail until pod restart

Proposed Fix

Add a periodic health check inside _wait_for_lifecycle_event() to exercise the connection:

async def _wait_for_lifecycle_event(self) -> str:
    """Block until shutdown, reconnect, or keepalive interval."""
    KEEPALIVE_INTERVAL = 180  # 3 minutes
    
    shutdown_task = asyncio.create_task(self._shutdown_event.wait())
    reconnect_task = asyncio.create_task(self._reconnect_event.wait())
    
    try:
        while True:
            done, pending = await asyncio.wait(
                {shutdown_task, reconnect_task},
                timeout=KEEPALIVE_INTERVAL,
                return_when=asyncio.FIRST_COMPLETED,
            )
            
            if done:
                break
                
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
                except Exception as exc:
                    logger.warning(
                        "MCP server '%s' keepalive failed, triggering reconnect: %s",
                        self.name, exc
                    )
                    self._reconnect_event.set()
                    return "reconnect"
    finally:
        for t in (shutdown_task, reconnect_task):
            if not t.done():
                t.cancel()
                try:
                    await t
                except (asyncio.CancelledError, Exception):
                    pass

    if self._shutdown_event.is_set():
        return "shutdown"
    self._reconnect_event.clear()
    return "reconnect"
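The proposed method depends on the surrounding server class, so it cannot be run on its own. The sketch below reproduces the same wait-with-timeout loop in a self-contained form so the behavior can be observed. FakeSession and KeepaliveWaiter are hypothetical stand-ins, the intervals are shortened for the demo, and the failure path returns "reconnect" directly instead of also setting the reconnect event. That last choice avoids leaving the event set for the next call; whether it matters in Hermes depends on how run() consumes the event, which the issue does not show.

```python
import asyncio


class FakeSession:
    """Stand-in for an MCP ClientSession; fails after `ok_calls` probes."""

    def __init__(self, ok_calls: int) -> None:
        self.ok_calls = ok_calls
        self.calls = 0

    async def list_tools(self) -> list:
        self.calls += 1
        if self.calls > self.ok_calls:
            raise ConnectionError("stale socket")
        return []


class KeepaliveWaiter:
    """Hypothetical holder for the events the proposed method relies on."""

    def __init__(self, session: FakeSession, interval: float) -> None:
        self.session = session
        self.interval = interval
        self._shutdown_event = asyncio.Event()
        self._reconnect_event = asyncio.Event()

    async def wait_for_lifecycle_event(self) -> str:
        shutdown = asyncio.create_task(self._shutdown_event.wait())
        reconnect = asyncio.create_task(self._reconnect_event.wait())
        try:
            while True:
                done, _ = await asyncio.wait(
                    {shutdown, reconnect},
                    timeout=self.interval,
                    return_when=asyncio.FIRST_COMPLETED,
                )
                if done:
                    break
                try:
                    # Timeout tick: exercise the connection with a cheap call.
                    await asyncio.wait_for(self.session.list_tools(), timeout=1.0)
                except Exception:
                    return "reconnect"
        finally:
            # Cancel the helper tasks and wait for them to finish.
            for task in (shutdown, reconnect):
                task.cancel()
            await asyncio.gather(shutdown, reconnect, return_exceptions=True)
        if self._shutdown_event.is_set():
            return "shutdown"
        self._reconnect_event.clear()
        return "reconnect"


async def demo() -> tuple:
    session = FakeSession(ok_calls=2)
    waiter = KeepaliveWaiter(session, interval=0.01)
    reason = await waiter.wait_for_lifecycle_event()
    return reason, session.calls


print(asyncio.run(demo()))
```

In this demo the first two probes succeed and the third raises, so the loop exits with a reconnect request after three list_tools() calls.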

Alternative: Config-driven keepalive

Add a keepalive_interval config option per MCP server:

mcp_servers:
  my_server:
    url: "http://localhost:3001/mcp"
    keepalive_interval: 180  # seconds, 0 to disable
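One way such an option could feed the wait loop is to map it to the timeout passed to asyncio.wait(), where a timeout of None means block until an event fires. The helper below is a sketch under assumptions: the name keepalive_timeout, the default of 180 seconds when the key is absent, and the rejection of negative values are not specified in the issue.

```python
from typing import Any, Mapping, Optional

DEFAULT_KEEPALIVE_INTERVAL = 180.0  # seconds, matching the proposed constant


def keepalive_timeout(server_config: Mapping[str, Any]) -> Optional[float]:
    """Map a server's config entry to an asyncio.wait() timeout.

    Returns None when keepalive is disabled (interval of 0), which makes
    asyncio.wait() block until an event fires, as in the current behavior.
    """
    raw = server_config.get("keepalive_interval", DEFAULT_KEEPALIVE_INTERVAL)
    interval = float(raw)
    if interval < 0:
        raise ValueError("keepalive_interval must be >= 0")
    return interval or None


print(keepalive_timeout({"url": "http://localhost:3001/mcp", "keepalive_interval": 180}))
print(keepalive_timeout({"url": "http://localhost:3001/mcp", "keepalive_interval": 0}))
```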

Impact

Severity: Medium — MCP tools become unavailable after idle periods
Frequency: Affects any deployment with HTTP MCP servers and gaps > ~2h between tool calls
Recovery: Automatic reconnect logic exists but isn't triggered (no exception thrown until tool call)

Related

Circuit breaker logic in _bump_server_error() / _reset_server_error() handles repeated failures but doesn't prevent the initial stale connection
Reconnect logic in run() handles exceptions properly, just needs a trigger

Root cause analysis assisted by GitHub Copilot CLI

extent analysis

TL;DR

Implement a periodic health check inside _wait_for_lifecycle_event() to exercise the connection and prevent it from becoming stale.

Guidance

Modify the _wait_for_lifecycle_event() method to include a keepalive interval, as proposed in the fix, to periodically exercise the connection and prevent staleness.
Consider adding a keepalive_interval config option per MCP server to make the keepalive interval configurable.
As a temporary workaround, set up a cron job to periodically call an MCP tool and keep the connection active.
Review the circuit breaker logic in _bump_server_error() and _reset_server_error() to ensure it handles repeated failures correctly.

Example

The proposed fix includes an example implementation of the modified _wait_for_lifecycle_event() method:

async def _wait_for_lifecycle_event(self) -> str:
    # ...
    try:
        while True:
            # ...
            # Keepalive: exercise the connection
            if self.session:
                try:
                    await asyncio.wait_for(
                        self.session.list_tools(),
                        timeout=30.0
                    )
                except Exception as exc:
                    ...  # log the failure and trigger a reconnect

Notes

The proposed fix assumes that the list_tools() method is a suitable keepalive operation. If this is not the case, an alternative keepalive operation should be used.

Recommendation

Apply the proposed fix by modifying the _wait_for_lifecycle_event() method to include a keepalive interval, as this will prevent the connection from becoming stale and fix the issue.
