hermes - ✅(Solved) Fix Gateway run.py: PID file race condition and httpx connection leak on cache eviction [2 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#14598 · Fetched 2026-04-24 06:16:06
Comments: 0
Participants: 1
Timeline: 5
Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2

Two related reliability issues in gateway/run.py can cause the gateway to fail to start or leak httpx connections during cache churn.


Error Message

except FileExistsError:
    release_gateway_runtime_lock()
    logger.error("PID file race lost to another gateway instance. Exiting.")
    return False

Root Cause

Problem: When two gateway processes try to start simultaneously (e.g., after a crash or during auto-restart), both hit FileExistsError on the PID file and immediately exit, even though no gateway is actually running. This happens because the previous gateway may have been killed ungracefully (SIGTERM/SIGKILL) before its atexit handler could remove the PID file.

Fix Action

Fixed

PR fix notes

PR #14609: fix(gateway): recover stale pid files and release evicted clients

Description (problem / solution / changelog)

Summary

  • recover startup when a stale gateway.pid is still present after the runtime lock is acquired
  • release provider clients when _evict_cached_agent() drops a cached agent
  • add regression coverage for both recovery paths

Testing

  • python3 -m pytest -o addopts='' -q tests/gateway/test_agent_cache.py -k 'evict' tests/gateway/test_runner_startup_failures.py -k 'stale_pid or start_gateway'
  • python3 -m pytest -o addopts='' -q tests/tools/test_zombie_process_cleanup.py -k 'evict_does_not_call_close' tests/gateway/test_session_model_reset.py -k 'evict'

Closes #14598

Changed files

  • gateway/run.py (modified, +34/-6)
  • gateway/status.py (modified, +13/-0)
  • tests/gateway/test_agent_cache.py (modified, +26/-0)
  • tests/gateway/test_runner_startup_failures.py (modified, +38/-0)

PR #14710: fix(bedrock): evict cached boto3 client on stale-connection errors

Description (problem / solution / changelog)

What does this PR do?

Fixes a recurring class of failures on the Bedrock Converse code path where a pooled HTTPS connection to bedrock-runtime.<region>.amazonaws.com goes stale (NAT/VPN idle timeout, server-side TCP RST, proxy idle cull), the next Converse call fails with a library-internal exception, the agent's retry loop reuses the same cached boto3 client, and every retry hits the same dead connection pool — so the only recovery is a full process restart.

This is observable in production as repeating banners like:

⚠️  API call failed (attempt N/3): AssertionError
   🔌 Provider: bedrock  Model: <inference-profile>
   🌐 Endpoint: https://bedrock-runtime.us-east-1.amazonaws.com
   📝 Error:

(The empty Error: line is because str(AssertionError()) is an empty string — a separate minor issue I'll address in a follow-up.)

Root cause

agent/bedrock_adapter.py keeps a per-region module-level cache (_bedrock_runtime_client_cache) and there is no code path that invalidates a single entry on transient connection failure. The existing reset_client_cache() nukes the whole dict, which is global-blast-radius and never wired into the retry loop anyway.

When the underlying socket is dead, botocore / urllib3 surface it as one of:

  • botocore.exceptions.ConnectionClosedError / ReadTimeoutError / EndpointConnectionError / ConnectTimeoutError (all inherit HTTPClientError → BotoCoreError)
  • urllib3.exceptions.ProtocolError / NewConnectionError
  • A bare AssertionError raised from inside urllib3 or botocore (internal connection-pool invariant check — no message attached)

In all three cases the cached boto3 client's connection pool is poisoned, and subsequent retries against the same cached client fail identically.

Fix

Add two helpers to agent/bedrock_adapter.py:

  • is_stale_connection_error(exc) — classifies exceptions that indicate dead-client/dead-socket state. Matches the botocore ConnectionError + HTTPClientError subtrees, urllib3 ProtocolError / NewConnectionError, and AssertionError raised from a frame whose module name starts with urllib3., botocore., or boto3.. Application-level AssertionError is intentionally excluded (the predicate walks the traceback to the innermost frame and checks the module).

  • invalidate_runtime_client(region) — per-region counterpart to the existing reset_client_cache(). Evicts a single cached client so the next call rebuilds it (and its connection pool). Returns True if a client was evicted, False otherwise.
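
The traceback check described for is_stale_connection_error can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: the names innermost_module, is_library_assertion and LIBRARY_PREFIXES are assumptions made for the example, and the botocore/urllib3 exception-class checks are left out so the block runs without those packages installed.

```python
from typing import Tuple

# Module-name prefixes named in the PR description.
LIBRARY_PREFIXES: Tuple[str, ...] = ("urllib3.", "botocore.", "boto3.")


def innermost_module(exc: BaseException) -> str:
    """Return the module name of the frame that actually raised `exc`."""
    tb = exc.__traceback__
    if tb is None:  # never raised, so there is no frame to inspect
        return ""
    while tb.tb_next is not None:  # walk down to the innermost traceback entry
        tb = tb.tb_next
    return tb.tb_frame.f_globals.get("__name__", "")


def is_library_assertion(exc: BaseException) -> bool:
    """True only for an AssertionError raised from inside a listed library module."""
    if not isinstance(exc, AssertionError):
        return False
    return innermost_module(exc).startswith(LIBRARY_PREFIXES)
```

Checking the innermost frame rather than the outermost one is what keeps an application-level assert that merely passes through library code from being classified as a stale connection.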

Wire both into the Converse call sites:

  • call_converse() / call_converse_stream() in agent/bedrock_adapter.py (defense-in-depth for any future caller).
  • The two direct client.converse(**kwargs) / client.converse_stream(**kwargs) call sites in run_agent.py (the paths the agent loop actually uses today).

On a stale-connection exception the client is evicted and the exception is re-raised unchanged — the agent's existing retry loop then builds a fresh client on the next attempt and recovers without requiring a process restart. Non-stale exceptions (validation errors, throttling, etc.) are unaffected.

Related Issue

No existing tracking issue for this specific failure mode. Loosely related:

  • #14598 — gateway-side httpx connection leak on cache eviction. Same family of "cached client outlives its pool" problem, different code path (gateway vs Bedrock adapter). This PR does not fix #14598.
  • #14680 — rebuilds the Anthropic client after Ctrl-C interrupt. Same "rebuild the client to recover" pattern, different trigger (user interrupt vs stale socket). No file overlap.

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • agent/bedrock_adapter.py
    • Add is_stale_connection_error(exc) -> bool — classifies botocore/urllib3 transient-connection exceptions and library-internal bare AssertionError.
    • Add invalidate_runtime_client(region) -> bool — per-region client-cache eviction.
    • Wrap call_converse() and call_converse_stream() in a try/except that invalidates the cached client on stale-connection exceptions before re-raising.
  • run_agent.py
    • Wrap the two direct client.converse(**kwargs) / client.converse_stream(**kwargs) invocations with the same invalidation-on-stale pattern. The region is looked up from the active agent's configuration so the correct cache entry is evicted.
  • tests/agent/test_bedrock_adapter.py — three new test classes, 14 tests total (see below).

How to Test

Automated

pytest tests/agent/test_bedrock_adapter.py -v

All 116 tests pass. The new coverage:

  • TestInvalidateRuntimeClient — per-region eviction correctness, non-cached region returns False, untouched regions are preserved.
  • TestIsStaleConnectionError — positive cases (botocore ConnectionClosedError, EndpointConnectionError, ReadTimeoutError; urllib3 ProtocolError; library-internal AssertionError from urllib3.* and botocore.* frames) and negative cases (application-level AssertionError, unrelated ValueError/KeyError).
  • TestCallConverseInvalidatesOnStaleError — end-to-end: stale error evicts the cached client, non-stale ValidationException leaves it alone, successful call leaves it cached.

Manual (on Bedrock)

  1. Configure hermes with provider: bedrock and any Bedrock model.
  2. Let the session idle for longer than the upstream idle timeout (on most networks 5–15 min is enough to trigger NAT TCP cull).
  3. Send any message. The first request after the idle will hit the stale socket.
  4. Expected: the retry completes successfully on attempt 2/3 with a freshly-built client, no restart required.
  5. Prior behavior: all 3 retries fail with identical ConnectionClosedError / AssertionError, and the session is unusable until hermes is restarted.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(bedrock):)
  • I searched for existing PRs to make sure this isn't a duplicate (checked all open bedrock PRs; none touch the client-cache-invalidation code path)
  • My PR contains only changes related to this fix
  • I've run pytest tests/agent/test_bedrock_adapter.py -q and all tests pass
  • I've added tests for my changes
  • I've tested on my platform: Linux (Amazon Linux, Python 3.11)

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (no user-visible API change)
  • I've updated cli-config.yaml.example — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md — N/A
  • I've considered cross-platform impact — the fix is pure Python touching only botocore/urllib3 exception types; no OS-specific code
  • I've updated tool descriptions/schemas — N/A

Screenshots / Logs

Repro (pre-fix) from a real session where the NAT timeout dropped the pooled connection:

⚠️  API call failed (attempt 1/3): AssertionError
   🔌 Provider: bedrock
   🌐 Endpoint: https://bedrock-runtime.us-east-1.amazonaws.com
   📝 Error:

⚠️  API call failed (attempt 2/3): AssertionError
   (... same cached client, same dead pool ...)

⚠️  API call failed (attempt 3/3): AssertionError
   (... same cached client, same dead pool ...)

# Session dead — restart required.

Post-fix: attempt 1 still fails (the stale socket is unavoidable), the cache is evicted, and attempt 2 builds a fresh client and completes normally.

Changed files

  • agent/bedrock_adapter.py (modified, +130/-2)
  • run_agent.py (modified, +20/-2)
  • tests/agent/test_bedrock_adapter.py (modified, +207/-0)

Code Example

except FileExistsError:
    release_gateway_runtime_lock()
    logger.error("PID file race lost to another gateway instance. Exiting.")
    return False

---

def _evict_cached_agent(self, session_key: str) -> None:
    _lock = getattr(self, "_agent_cache_lock", None)
    if _lock:
        with _lock:
            self._agent_cache.pop(session_key, None)

Summary

Two related reliability issues in gateway/run.py can cause the gateway to fail to start or leak httpx connections during cache churn.


Issue 1: PID file race condition causes spurious gateway exit

Location: start_gateway() around line 11144

Problem: When two gateway processes try to start simultaneously (e.g., after a crash or during auto-restart), both hit FileExistsError on the PID file and immediately exit, even though no gateway is actually running. This happens because the previous gateway may have been killed ungracefully (SIGTERM/SIGKILL) before its atexit handler could remove the PID file.

Current behavior:

except FileExistsError:
    release_gateway_runtime_lock()
    logger.error("PID file race lost to another gateway instance. Exiting.")
    return False

Fix: On FileExistsError, check whether a real gateway process is still alive. If the PID file is stale (previous gateway crashed), remove it and retry once. Only exit if a live gateway is confirmed.
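
A minimal sketch of the stale-PID recovery described above, assuming a POSIX host and a PID file that contains only the owner's process ID. The helper names pid_is_alive and claim_pid_file are hypothetical and are not taken from gateway/run.py or gateway/status.py. Two caveats apply: os.kill(pid, 0) must not be used this way on Windows, and a recycled PID can make a stale file look live.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without delivering a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def claim_pid_file(path: Path) -> bool:
    """Create `path` exclusively; recover once if it names a dead process."""
    for _attempt in range(2):  # initial try plus a single retry
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                owner = int(path.read_text().strip())
            except (ValueError, FileNotFoundError):
                owner = -1  # unreadable or vanished file is treated as stale
            if pid_is_alive(owner):
                return False  # a live gateway holds the file
            path.unlink(missing_ok=True)  # stale: remove it and retry
            continue
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        return True
    return False
```

On its own this check is not race-free between two starters; PR #14609 describes performing the recovery after the runtime lock is acquired, which is what would serialize competing processes.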


Issue 2: httpx connection leak during agent cache eviction

Location: _evict_cached_agent() around line 8810

Problem: When _evict_cached_agent() removes an AIAgent from the cache (on /new, /model switch, etc.), it simply pops the entry and lets the object be garbage collected. However, the AIAgent holds httpx AsyncClient connections for LLM providers. These are not closed before GC, causing connection pool leaks during cache churn.

Current behavior:

def _evict_cached_agent(self, session_key: str) -> None:
    _lock = getattr(self, "_agent_cache_lock", None)
    if _lock:
        with _lock:
            self._agent_cache.pop(session_key, None)

Fix: After popping the agent from the cache, call _release_evicted_agent_soft(agent) on a daemon thread to cleanly release httpx client resources before GC. _release_evicted_agent_soft already exists and calls agent.release_clients().
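
The eviction fix can be sketched in isolation as below. AgentCache is a hypothetical stand-in for the gateway runner, and the body of _release_evicted_agent_soft is an assumption based only on the statement that it calls agent.release_clients(). The thread is returned purely so a caller or test can join it.

```python
import threading
from typing import Dict, Optional


class AgentCache:
    """Hypothetical stand-in for the gateway object that owns the agent cache."""

    def __init__(self) -> None:
        self._agent_cache: Dict[str, object] = {}
        self._agent_cache_lock = threading.Lock()

    def _release_evicted_agent_soft(self, agent) -> None:
        """Best-effort release; a failing agent must not break eviction."""
        try:
            agent.release_clients()
        except Exception:
            pass

    def _evict_cached_agent(self, session_key: str) -> Optional[threading.Thread]:
        # Hold the lock only for the dictionary mutation.
        with self._agent_cache_lock:
            agent = self._agent_cache.pop(session_key, None)
        if agent is None:
            return None
        # Release clients on a daemon thread so eviction never blocks the caller.
        worker = threading.Thread(
            target=self._release_evicted_agent_soft, args=(agent,), daemon=True
        )
        worker.start()
        return worker
```

Starting the thread after the lock is released keeps a slow release_clients() call from delaying other cache operations.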


Related: Discord token failure causes reconnect storms

Location: gateway/platforms/discord.py

The Discord adapter repeatedly attempts to reconnect with an invalid/expired token (LoginFailure: Improper token has been passed), creating "Unclosed client session" aiohttp errors and filling logs. This is a separate but contributing issue — the reconnect loop can interfere with gateway stability during restarts. Fixing the Discord token will stop the reconnect storms.


Status

Fixes have been applied to the running instance. The PID file race fix is confirmed working — gateway successfully restarted after SIGKILL of the previous instance.

Files affected: gateway/run.py, gateway/platforms/discord.py

extent analysis

TL;DR

To resolve the reliability issues in the gateway, implement a check for a live gateway process on FileExistsError and ensure httpx connections are closed when evicting cached agents.

Guidance

  • Modify the start_gateway() function to check if a gateway process is running when a FileExistsError occurs, and only exit if a live process is confirmed.
  • Update the _evict_cached_agent() function to call _release_evicted_agent_soft(agent) on a daemon thread after popping the agent from the cache to release httpx client resources.
  • Verify the fixes by testing the gateway restart after a SIGKILL and monitoring for connection leaks during cache churn.
  • Address the Discord token issue to prevent reconnect storms that may interfere with gateway stability.

Example

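# Sketch only: is_gateway_process_running() and remove_pid_file() are placeholder helpers.
def start_gateway(_retried: bool = False) -> bool:
    try:
        ...  # existing startup logic that creates the PID file
    except FileExistsError:
        if _retried or is_gateway_process_running():
            # A live gateway owns the PID file, or the single retry already failed.
            release_gateway_runtime_lock()
            logger.error("PID file race lost to another gateway instance. Exiting.")
            return False
        # Stale PID file left by a crashed gateway: remove it and retry once.
        remove_pid_file()
        return start_gateway(_retried=True)
    return True

def _evict_cached_agent(self, session_key: str) -> None:
    _lock = getattr(self, "_agent_cache_lock", None)
    if not _lock:
        return
    with _lock:
        agent = self._agent_cache.pop(session_key, None)
    if agent is not None:
        # Release httpx clients on a daemon thread, outside the cache lock.
        threading.Thread(target=_release_evicted_agent_soft, args=(agent,), daemon=True).start()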

Notes

The provided fixes assume that the _release_evicted_agent_soft function correctly releases httpx client resources. Additionally, the Discord token issue should be addressed to prevent reconnect storms.

Recommendation

Apply the workaround by implementing the suggested changes to the start_gateway() and _evict_cached_agent() functions to resolve the reliability issues in the gateway.
