hermes - ✅ (Solved) fix(gateway): Gateway shutdown hangs causing 'PID file race lost' on restart [1 pull request, 1 comment, 2 participants]

GitHub stats

NousResearch/hermes-agent#14128 (fetched 2026-04-23 07:46:39)

  • Comments: 1
  • Participants: 2
  • Timeline: 7
  • Reactions: 0
  • Timeline (top): labeled ×4, cross-referenced ×2, commented ×1

Error Message

_adapter_disconnect_timeout = 15.0  # seconds per adapter
for platform, adapter in list(self.adapters.items()):
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
        logger.info("✓ %s disconnected", platform.value)
    except asyncio.TimeoutError:
        logger.warning(
            "✗ %s disconnect timed out after %.1fs - forcing continue",
            platform.value, _adapter_disconnect_timeout
        )

Root Cause

The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.

When systemd sends SIGKILL after timeout, Python's atexit handlers don't run, so the PID file (~/.hermes/gateway.pid) is never cleaned up. The new instance sees the stale PID file and exits with "PID file race lost".
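To make the stale-file condition concrete, here is a minimal sketch of how a leftover PID file could be told apart from one owned by a live process. It is not taken from gateway/run.py: the helper names `pid_is_running` and `pid_file_is_stale` are hypothetical, and the check assumes a POSIX system where `os.kill(pid, 0)` probes for a process without sending a signal.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs existence/permission checks
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def pid_file_is_stale(path: str) -> bool:
    """Hypothetical helper: True if the PID file does not name a live process."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except FileNotFoundError:
        return False  # no file at all, so nothing is stale
    except ValueError:
        return True  # unparseable content cannot belong to a live process
    return not pid_is_running(pid)
```

A check like this is only a heuristic, because the operating system can reuse a PID for an unrelated process after the original one has been killed.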

Fix Action

Fixed

PR fix notes

PR #14130: fix(gateway): add timeout to adapter.disconnect() during shutdown

Description (problem / solution / changelog)

Problem

When restarting the Hermes gateway via systemctl restart hermes-gateway, the gateway process sometimes hangs during shutdown and gets SIGKILL'd by systemd after TimeoutStopSec (default 60s). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."

Root Cause

The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.

When systemd sends SIGKILL after timeout, Python's atexit handlers don't run, so the PID file is never cleaned up. The new instance sees the stale PID file and exits.

Solution

Add a timeout wrapper (asyncio.wait_for) around adapter.disconnect() with a 15-second timeout per adapter. On timeout, log a warning and continue with the shutdown sequence instead of hanging indefinitely.
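The behaviour described above can be tried outside the gateway with a small self-contained sketch. `FakeAdapter` and `disconnect_all` are hypothetical stand-ins, not Hermes classes; the only real API used is `asyncio.wait_for`, which cancels the awaited coroutine and raises a timeout error once the deadline passes.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.demo")


class FakeAdapter:
    """Hypothetical stand-in for a platform adapter."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    async def disconnect(self) -> None:
        await asyncio.sleep(self.delay)  # simulates a slow or hung disconnect


async def disconnect_all(adapters: dict, timeout: float = 15.0) -> list:
    """Disconnect adapters one by one; return the names that timed out."""
    timed_out = []
    for name, adapter in list(adapters.items()):
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
            logger.info("%s disconnected", name)
        except asyncio.TimeoutError:
            # Give up on this adapter and move on to the next one.
            logger.warning("%s disconnect timed out after %.1fs", name, timeout)
            timed_out.append(name)
    return timed_out
```

Calling `asyncio.run(disconnect_all({"fast": FakeAdapter(0), "hung": FakeAdapter(3600)}, timeout=0.1))` returns after roughly the timeout instead of waiting on the hung adapter.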

Changes

  • Wrap adapter.disconnect() in asyncio.wait_for() with 15s timeout
  • Add asyncio.TimeoutError handler to log warning and continue
  • Ensures PID file cleanup always runs even if adapter cleanup fails
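One way to make the last point hold even when an adapter raises something other than a timeout is to put the cleanup in a `finally` block. The sketch below is an assumption about how that could look, not the code in gateway/run.py: the `shutdown` function, its `pid_path` parameter, and the broad `except Exception` are all illustrative, and the snippet shown on this page handles only `asyncio.TimeoutError`.

```python
import asyncio
import os


async def shutdown(adapters: dict, pid_path: str, timeout: float = 15.0) -> None:
    """Disconnect adapters, then remove the PID file no matter what happened."""
    try:
        for name, adapter in list(adapters.items()):
            try:
                await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
            except asyncio.TimeoutError:
                print(f"{name}: disconnect timed out after {timeout:.1f}s")
            except Exception as exc:  # a failing adapter must not abort shutdown
                print(f"{name}: disconnect failed: {exc!r}")
    finally:
        try:
            os.remove(pid_path)  # hypothetical cleanup step
        except FileNotFoundError:
            pass  # already gone, nothing to clean up
```

Note that no in-process cleanup of this kind can run if the process is terminated with SIGKILL, which is why bounding the shutdown time matters in the first place.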

Testing

Manually tested by triggering gateway restart while Feishu WebSocket was active. The gateway now shuts down cleanly within the timeout and restarts successfully without "PID file race lost" errors.

Fixes #14128

Changed files

  • gateway/run.py (modified, +12/-1)

Code Example

Apr 23 03:21:57 python[1782979]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: State 'stop-sigterm' timed out. Killing.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Killing process 1782979 (python) with signal SIGKILL.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Failed with result 'timeout'.
Apr 23 03:22:58 python[1783144]: ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

---

_adapter_disconnect_timeout = 15.0  # seconds per adapter
for platform, adapter in list(self.adapters.items()):
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
        logger.info("✓ %s disconnected", platform.value)
    except asyncio.TimeoutError:
        logger.warning(
            "✗ %s disconnect timed out after %.1fs - forcing continue",
            platform.value, _adapter_disconnect_timeout
        )

Bug Description

When restarting the Hermes gateway via systemctl restart hermes-gateway, the gateway process sometimes hangs during shutdown and gets SIGKILL'd by systemd after TimeoutStopSec (default 60s). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."

Steps to Reproduce

  1. Configure Hermes gateway with Feishu/Lark platform adapter
  2. Run systemctl --user restart hermes-gateway
  3. If the Feishu WebSocket thread happens to be blocked (e.g., waiting for network I/O), the gateway hangs during shutdown
  4. After 60 seconds, systemd sends SIGKILL
  5. New instance starts but fails with "PID file race lost" error
  6. Gateway enters restart loop until manually fixed

Root Cause

The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.

When systemd sends SIGKILL after timeout, Python's atexit handlers don't run, so the PID file (~/.hermes/gateway.pid) is never cleaned up. The new instance sees the stale PID file and exits with "PID file race lost".

Relevant Logs

Apr 23 03:21:57 python[1782979]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: State 'stop-sigterm' timed out. Killing.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Killing process 1782979 (python) with signal SIGKILL.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Failed with result 'timeout'.
Apr 23 03:22:58 python[1783144]: ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Proposed Fix

Add a timeout wrapper around adapter.disconnect() in the shutdown sequence:

_adapter_disconnect_timeout = 15.0  # seconds per adapter
for platform, adapter in list(self.adapters.items()):
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
        logger.info("✓ %s disconnected", platform.value)
    except asyncio.TimeoutError:
        logger.warning(
            "✗ %s disconnect timed out after %.1fs - forcing continue",
            platform.value, _adapter_disconnect_timeout
        )

This ensures the shutdown sequence always completes within a reasonable time, allowing PID file cleanup to run properly.
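Because the proposed loop handles adapters one at a time, the per-adapter timeouts add up when several adapters hang. A possible variation, which is not part of the proposed fix, is to disconnect all adapters concurrently so the whole step is bounded by a single timeout. The helper names below are hypothetical.

```python
import asyncio


async def disconnect_with_timeout(adapter, timeout: float) -> bool:
    """Return True if the adapter disconnected in time, False on timeout."""
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


async def disconnect_concurrently(adapters: dict, timeout: float = 15.0) -> dict:
    """Disconnect every adapter in parallel; map each name to its outcome."""
    names = list(adapters)
    results = await asyncio.gather(
        *(disconnect_with_timeout(adapters[name], timeout) for name in names)
    )
    return dict(zip(names, results))
```

The trade-off is that adapters which share state or must be torn down in a specific order may not tolerate parallel disconnects, so the sequential loop remains the more conservative choice.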

Environment

  • Hermes Agent version: latest main branch
  • Platform: Feishu/Lark
  • OS: Linux (systemd user service)

Extent analysis

TL;DR

Implement a timeout wrapper around adapter.disconnect() in the shutdown sequence to prevent the gateway process from hanging during shutdown.

Guidance

  • Review the proposed fix and consider adding a timeout wrapper around adapter.disconnect() to ensure the shutdown sequence completes within a reasonable time.
  • Verify that the TimeoutStopSec value in the systemd configuration is set to a suitable value to allow for a clean shutdown.
  • Test the updated shutdown sequence with the Feishu/Lark platform adapter to ensure it can handle blocked WebSocket threads without hanging.
  • Consider adding logging to track the shutdown process and identify any potential issues.
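For the TimeoutStopSec check in the guidance above, a quick calculation shows whether the worst case of the sequential loop still fits inside the stop timeout. This is a sketch under stated assumptions: the function names are hypothetical, the 15-second and 60-second defaults are the figures quoted in this issue, and `other_cleanup` is a placeholder for any remaining shutdown work.

```python
def worst_case_shutdown_seconds(n_adapters: int, per_adapter_timeout: float = 15.0) -> float:
    """Upper bound for the sequential disconnect loop if every adapter hangs."""
    return n_adapters * per_adapter_timeout


def fits_stop_timeout(
    n_adapters: int,
    per_adapter_timeout: float = 15.0,
    stop_timeout: float = 60.0,
    other_cleanup: float = 0.0,
) -> bool:
    """True if the worst case finishes strictly before systemd would SIGKILL."""
    worst = worst_case_shutdown_seconds(n_adapters, per_adapter_timeout)
    return worst + other_cleanup < stop_timeout
```

With these figures, three hung adapters need at most 45 seconds and fit, while four need 60 seconds and no longer leave any margin before the kill.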

Example

The proposed fix provides an example of how to implement a timeout wrapper around adapter.disconnect():

_adapter_disconnect_timeout = 15.0  # seconds per adapter
for platform, adapter in list(self.adapters.items()):
    try:
        await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
        logger.info("✓ %s disconnected", platform.value)
    except asyncio.TimeoutError:
        logger.warning(
            "✗ %s disconnect timed out after %.1fs - forcing continue",
            platform.value, _adapter_disconnect_timeout
        )

Notes

The proposed fix assumes that the adapter.disconnect() method is asynchronous and can be wrapped with a timeout using asyncio.wait_for(). If this is not the case, an alternative approach may be needed.
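If an adapter exposed only a synchronous, blocking disconnect(), one alternative would be to run it in a worker thread and apply the timeout to the awaiting side. The sketch below assumes Python 3.9 or later for `asyncio.to_thread`; `disconnect_blocking` is a hypothetical helper, not an existing Hermes function.

```python
import asyncio


async def disconnect_blocking(adapter, timeout: float = 15.0) -> bool:
    """Run a synchronous disconnect() in a worker thread with a timeout.

    Returns True if it finished in time, False if the wait was abandoned.
    """
    try:
        await asyncio.wait_for(asyncio.to_thread(adapter.disconnect), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # The event loop stops waiting, but the thread itself keeps running.
        return False
```

The caveat is significant: a thread cannot be cancelled, so a blocked call continues in the background, and the interpreter may still wait for that thread when the event loop or process shuts down.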

Recommendation

Apply the proposed fix by adding a timeout wrapper around adapter.disconnect() so the gateway process cannot hang indefinitely during shutdown. This should allow a clean shutdown and prevent the "PID file race lost" error.
