hermes - ✅(Solved) Fix Bug: Stale gateway.pid causes gateway restart loop after crash/SIGKILL [1 pull request, 1 comment, 2 participants]

GitHub stats
NousResearch/hermes-agent#13655 (fetched 2026-04-22 08:05:01)
Comments: 1 · Participants: 2 · Timeline events: 3 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, subscribed ×1

Error Message

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Root Cause

In gateway/run.py, the startup sequence:

  1. Reads gateway.pid to check for an existing gateway
  2. Writes gateway.pid, which fails with FileExistsError if a stale file left by a dead process is still on disk
  3. Registers atexit.register(remove_pid_file) to delete the PID file on clean exit

The problem: if the process is killed with SIGKILL or crashes, the atexit handler never fires and gateway.pid is left behind. The next startup sees the stale file, treats it as a conflict, and exits without attempting to validate whether the PID is actually alive.
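
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the gateway's actual code: write_pid_file and start_after_crash are hypothetical stand-ins, the PID 999999 is an arbitrary placeholder for a dead process, and the exclusive O_CREAT | O_EXCL open is assumed from the call chain quoted in the PR notes.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file(pid_path: Path) -> None:
    """Create the PID file exclusively; raises FileExistsError if it exists."""
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def start_after_crash(pid_path: Path) -> str:
    """Simulate a startup that finds a PID file left behind by a killed process."""
    try:
        write_pid_file(pid_path)
    except FileExistsError:
        # Nothing removed the stale file, so the exclusive create fails.
        return "PID file race lost"
    return "started"


with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    pid_path.write_text("999999")  # placeholder for a dead process's PID
    outcome = start_after_crash(pid_path)
```

Every restart repeats the same outcome until the file is removed, which is why the service loops.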

PR fix notes

PR #13709: fix(gateway): clear stale PID file from crashed gateway on startup (closes #13655)

Description (problem / solution / changelog)

Summary

Closes #13655

After a gateway process is killed with SIGKILL (or dies from OOM, a systemd stop-timeout, etc.), the atexit handler never fires and gateway.pid is left on disk containing the dead process's PID. On the next startup this stale file causes an unrecoverable restart loop.

Root Cause

The call chain on startup was:

get_running_pid()
  → os.kill(dead_pid, 0)          # ProcessLookupError — process is gone
  → _cleanup_invalid_pid_path()
      → remove_pid_file()          # BUG: checks file_pid != os.getpid()
                                   #      dead_pid ≠ new_pid → does nothing
  → returns None

write_pid_file()                   # O_CREAT | O_EXCL
  → FileExistsError                # stale file still present
  → 'PID file race lost. Exiting.'

remove_pid_file() intentionally guards against removing a file belonging to another live process (the --replace handoff race), but after a crash the guarded process is dead — the guard is wrong here.
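
To illustrate why the guard turns cleanup into a no-op, here is a hedged sketch of an ownership-guarded removal. The function remove_pid_file_guarded is a hypothetical reconstruction based only on the comment in the call chain (file_pid != os.getpid()); the real remove_pid_file() may differ in detail.

```python
import os
import tempfile
from pathlib import Path


def remove_pid_file_guarded(pid_path: Path) -> bool:
    """Remove the PID file only if it records this process's own PID."""
    try:
        file_pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    if file_pid != os.getpid():
        return False  # file records some other PID: leave it alone
    pid_path.unlink(missing_ok=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    dead_pid = os.getpid() + 1  # any value other than our own PID
    pid_path.write_text(str(dead_pid))
    removed = remove_pid_file_guarded(pid_path)
    still_there = pid_path.exists()
```

The new process can never match the dead process's PID, so the guarded call returns without deleting anything and the stale file survives.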

Fix

In _cleanup_invalid_pid_path(), replace the call to remove_pid_file() with a direct force-unlink (pid_path.unlink(missing_ok=True)). By the time this helper is called we have already confirmed the recorded PID is dead or invalid, so skipping the ownership guard is correct.
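
A minimal sketch of the change, assuming the caller has already established that the recorded PID is dead. The names cleanup_invalid_pid_path and write_pid_file are illustrative stand-ins rather than the project's actual helpers, and 999999 is a placeholder PID.

```python
import os
import tempfile
from pathlib import Path


def cleanup_invalid_pid_path(pid_path: Path) -> None:
    """Force-remove a PID file whose recorded PID is already known to be dead."""
    pid_path.unlink(missing_ok=True)  # no ownership check, no error if absent


def write_pid_file(pid_path: Path) -> None:
    """Create the PID file exclusively and record the current PID."""
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    pid_path.write_text("999999")  # stale file from a crashed process
    cleanup_invalid_pid_path(pid_path)
    cleanup_invalid_pid_path(pid_path)  # second call is a harmless no-op
    write_pid_file(pid_path)  # the exclusive create now succeeds
    recorded = int(pid_path.read_text())
```

Path.unlink(missing_ok=True) requires Python 3.8 or later.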

The concurrent-write race (--replace with two competing starters) is unaffected: both processes see the stale file, both unlink it (idempotent), then their write_pid_file() O_CREAT|O_EXCL calls race as intended — exactly one wins.
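
The interleaving described here can be walked through deterministically. This sketch fixes the order so that both unlinks happen before either create; it does not explore other interleavings, and try_claim and the starter labels are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def try_claim(pid_path: Path, label: str) -> bool:
    """Attempt an exclusive create; return True only for the winner."""
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(label)
    return True


starters = ("starter-a", "starter-b")

with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    pid_path.write_text("999999")  # stale file both starters observe

    # Both starters unlink the stale file; the second unlink is a no-op.
    for _ in starters:
        pid_path.unlink(missing_ok=True)

    # Both then race on the exclusive create; only the first can succeed.
    results = {name: try_claim(pid_path, name) for name in starters}
    winners = [name for name, won in results.items() if won]
```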

Changes

File: Change
  • gateway/status.py: _cleanup_invalid_pid_path(): direct unlink() instead of remove_pid_file()
  • tests/gateway/test_status.py: Two new regression tests covering the crash→restart scenario

Regression Tests Added

All 29 existing tests in tests/gateway/test_status.py continue to pass.

Before / After

Checklist

  • Bug confirmed reproduced in the field by two independent users (#13655)
  • Root cause identified with full call-chain analysis
  • Minimal, targeted fix (9 lines changed in production code)
  • Regression tests covering exact failure sequence
  • All existing tests pass
  • No behaviour change for the normal --replace race path

Changed files

  • gateway/status.py (modified, +9/-4)
  • tests/gateway/test_status.py (modified, +74/-0)

Code Example

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

---

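# In gateway/run.py, before writing the PID file:
stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        os.kill(stale_pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        # Stale PID: the file exists but the process is dead, safe to remove.
        # Per PR #13709, remove_pid_file() skips files recorded by another PID,
        # so the merged fix unlinks the path directly instead.
        remove_pid_file()
    except PermissionError:
        # The PID exists but belongs to another user: treat it as live.
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    else:
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False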

---

[Service]
Type=simple
ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid
ExecStart=...gateway run --replace

---

rm -f ~/.hermes/gateway.pid && hermes gateway start
Original issue text

Severity

Medium — causes complete gateway service outage requiring manual intervention

Affected Versions

Current stable as of 2026-04-21

Problem Description

The Hermes Gateway service enters a restart loop when the Python gateway process is killed unexpectedly (SIGKILL, OOM, crash). On restart, the gateway.pid file still exists with the dead process's PID. The gateway startup logic treats this as a live process conflict and exits immediately with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Because the service is configured with Restart=on-failure and RestartSec=30, systemd re-attempts every 30 seconds and fails repeatedly, eventually hitting StartLimitBurst=5 and rate-limiting itself. The service becomes unreachable until an operator manually deletes ~/.hermes/gateway.pid.

Root Cause

In gateway/run.py, the startup sequence:

  1. Reads gateway.pid to check for an existing gateway
  2. Writes gateway.pid, which fails with FileExistsError if a stale file left by a dead process is still on disk
  3. Registers atexit.register(remove_pid_file) to delete the PID file on clean exit

The problem: if the process is killed with SIGKILL or crashes, the atexit handler never fires and gateway.pid is left behind. The next startup sees the stale file, treats it as a conflict, and exits without attempting to validate whether the PID is actually alive.

Recommended Fix (In Gateway Code)

The gateway startup should validate whether a PID in gateway.pid is actually alive before treating it as a conflict:

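# In gateway/run.py, before writing the PID file:
stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        os.kill(stale_pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        # Stale PID: the file exists but the process is dead, safe to remove.
        # Per PR #13709, remove_pid_file() skips files recorded by another PID,
        # so the merged fix unlinks the path directly instead.
        remove_pid_file()
    except PermissionError:
        # The PID exists but belongs to another user: treat it as live.
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    else:
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False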

This makes the gateway resilient to crashes and eliminates the need for the workaround below.
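
The liveness probe at the heart of this check can be isolated into a small helper. This is a sketch under POSIX assumptions; pid_is_alive is a hypothetical name, not a function from the Hermes codebase. It treats PermissionError as "alive", since that error means the process exists but is owned by another user.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX semantics)."""
    if pid <= 0:
        return False  # 0 and negatives address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only
    except ProcessLookupError:
        return False  # no such process: a PID file naming it is stale
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


# A child that has exited and been reaped yields a PID that is very likely unused.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()

alive_self = pid_is_alive(os.getpid())
alive_child = pid_is_alive(child.pid)
```

A probe like this cannot rule out PID reuse: an unrelated process may later receive the recorded PID, so a positive result only shows that some process holds it.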

Workaround (Service Definition)

Added ExecStartPre to the systemd service template in hermes_cli/gateway.py:

[Service]
Type=simple
ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid
ExecStart=...gateway run --replace

This clears any stale PID file before every service start, breaking the restart loop.

Quick Fix (One-Liner)

When this happens, clear the lock and restart:

rm -f ~/.hermes/gateway.pid && hermes gateway start

extent analysis

TL;DR

To fix the Hermes Gateway service restart loop, validate whether a PID in gateway.pid is alive before treating it as a conflict, or use the provided workaround to clear the stale PID file before service start.

Guidance

  • Implement the recommended fix in gateway/run.py to validate the PID in gateway.pid before treating it as a conflict, ensuring the gateway can recover from crashes and SIGKILL.
  • As a temporary workaround, add ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid to the systemd service template to clear any stale PID file before service start.
  • Verify the fix by simulating a crash or SIGKILL and checking if the service can recover without manual intervention.
  • Consider applying the quick fix rm -f ~/.hermes/gateway.pid && hermes gateway start when the issue occurs, but this should be replaced with a permanent solution.
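
For the verification step, the stale-file precondition can be reproduced without the gateway itself. The sketch below is POSIX-only and uses a throwaway child script, not Hermes code: the child writes a PID file, registers an atexit cleanup, and is then killed with SIGKILL. Starting the real gateway against such a file would be the actual test of the fix.

```python
import signal
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Hypothetical child: writes its PID, registers cleanup, then idles.
CHILD = """
import atexit, os, sys, time
pid_path = sys.argv[1]
with open(pid_path, "w") as handle:
    handle.write(str(os.getpid()))
atexit.register(os.remove, pid_path)
time.sleep(60)
"""

with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    child = subprocess.Popen([sys.executable, "-c", CHILD, str(pid_path)])

    # Wait (up to 10 s) until the child has written its PID.
    deadline = time.monotonic() + 10
    while time.monotonic() < deadline:
        if pid_path.exists() and pid_path.read_text().strip():
            break
        time.sleep(0.05)

    child.send_signal(signal.SIGKILL)  # cannot be caught; atexit never runs
    child.wait()

    stale_file_left = pid_path.exists()
    killed_by_sigkill = child.returncode == -signal.SIGKILL
```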

Notes

The recommended fix requires changes to the gateway code, while the workaround modifies the systemd service template. Both approaches should resolve the restart loop issue.

Recommendation

Apply the code-level fix rather than relying on the workarounds: it is more robust and makes them unnecessary. PR #13709 implements it in gateway/status.py by force-unlinking the stale file in _cleanup_invalid_pid_path().
