hermes - ✅ (Solved) Fix Bug: Stale gateway.pid causes gateway restart loop after crash/SIGKILL [1 pull request, 1 comment, 2 participants]

hermes · 2026-04-21 19:18:19


GitHub stats

NousResearch/hermes-agent#13655 • Fetched 2026-04-22 08:05:01


Author

ObiJuanDeanobi

Participants

ObiJuanDeanobi

qaqcvc

Timeline (top)

commented ×1, cross-referenced ×1, subscribed ×1

Error Message

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Root Cause

In gateway/run.py, the startup sequence:

1. Reads gateway.pid to check for an existing gateway
2. Writes gateway.pid (which fails with FileExistsError if the file already exists and the old PID is gone but the file wasn't cleaned up)
3. Registers atexit.register(remove_pid_file) to delete the PID file on clean exit

The problem: if the process is killed with SIGKILL or crashes, the atexit handler never fires and gateway.pid is left behind. The next startup sees the stale file, treats it as a conflict, and exits without attempting to validate whether the PID is actually alive.
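The failure mode described above can be reproduced in isolation. The following is a minimal sketch, assuming a Unix-like system; the child script and the function name `pid_file_survives_sigkill` are illustrative and are not taken from the Hermes code base. It starts a child that writes a PID file and registers an atexit cleanup, kills the child with SIGKILL, and reports whether the file is still on disk.

```python
import signal
import subprocess
import sys
import tempfile
import time
from pathlib import Path

# Illustrative child process: writes its PID, registers an atexit cleanup,
# then sleeps so the parent has time to kill it.
CHILD = """
import atexit, os, sys, time
from pathlib import Path
pid_path = Path(sys.argv[1])
pid_path.write_text(str(os.getpid()))
atexit.register(lambda: pid_path.unlink(missing_ok=True))
time.sleep(60)
"""


def pid_file_survives_sigkill() -> bool:
    """Return True if the PID file is still present after the child is SIGKILLed."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        child = subprocess.Popen([sys.executable, "-c", CHILD, str(pid_path)])

        # Wait (up to 10 s) for the child to create its PID file.
        deadline = time.monotonic() + 10
        while not pid_path.exists() and time.monotonic() < deadline:
            time.sleep(0.05)

        # SIGKILL cannot be caught, so the interpreter never runs atexit handlers.
        child.send_signal(signal.SIGKILL)
        child.wait()
        return pid_path.exists()


if __name__ == "__main__":
    print("stale PID file left behind:", pid_file_survives_sigkill())
```

Replacing SIGKILL with a normal interpreter exit in this sketch should let the handler run and remove the file, which is the contrast the root-cause description relies on.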

PR fix notes

PR #13709: fix(gateway): clear stale PID file from crashed gateway on startup (closes #13655)

Repository: NousResearch/hermes-agent
Author: RhythrosaLabs
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/13709

Description (problem / solution / changelog)

Summary

Closes #13655

After a gateway process is killed with SIGKILL (or dies from OOM, systemd stop-timeout, etc.), the atexit handler never fires and gateway.pid is left on disk containing the dead process's PID. On the next startup this stale file causes an unrecoverable restart loop.

Root Cause

The call chain on startup was:

get_running_pid()
  → os.kill(dead_pid, 0)          # ProcessLookupError — process is gone
  → _cleanup_invalid_pid_path()
      → remove_pid_file()          # BUG: checks file_pid != os.getpid()
                                   #      dead_pid ≠ new_pid → does nothing
  → returns None

write_pid_file()                   # O_CREAT | O_EXCL
  → FileExistsError                # stale file still present
  → 'PID file race lost. Exiting.'

remove_pid_file() intentionally guards against removing a file belonging to another live process (the --replace handoff race), but after a crash the guarded process is dead — the guard is wrong here.
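To make the effect of that ownership guard concrete, here is a small sketch. The function `remove_pid_file_guarded` is a hypothetical stand-in written for illustration; the real `remove_pid_file()` in gateway/status.py may differ in detail. The only behaviour it assumes is the one stated in the call chain: the file is removed only when the recorded PID equals `os.getpid()`.

```python
import os
import tempfile
from pathlib import Path


def remove_pid_file_guarded(pid_path: Path) -> bool:
    """Hypothetical stand-in: unlink the PID file only if it records our own PID."""
    try:
        file_pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # nothing usable to remove
    if file_pid != os.getpid():
        return False  # file belongs to some other PID, so leave it alone
    pid_path.unlink(missing_ok=True)
    return True


def demo_guard_blocks_cleanup() -> tuple[bool, bool]:
    """Return (removed, still_exists) for a file that records a foreign PID."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        # Any PID other than ours stands in for the crashed gateway's PID.
        pid_path.write_text(str(os.getpid() + 1))
        removed = remove_pid_file_guarded(pid_path)
        return removed, pid_path.exists()


if __name__ == "__main__":
    print(demo_guard_blocks_cleanup())
```

Under these assumptions the guard returns without unlinking, so the stale file is still present when the exclusive write is attempted.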

Fix

In _cleanup_invalid_pid_path(), replace the call to remove_pid_file() with a direct force-unlink (pid_path.unlink(missing_ok=True)). By the time this helper is called we have already confirmed the recorded PID is dead or invalid, so skipping the ownership guard is correct.

The concurrent-write race (--replace with two competing starters) is unaffected: both processes see the stale file, both unlink it (idempotent), then their write_pid_file() O_CREAT|O_EXCL calls race as intended — exactly one wins.
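The combination of an idempotent force-unlink and an exclusive create can be sketched as follows. This is an illustration under stated assumptions, not the code from PR #13709: the function names mirror those in the description, but their bodies, the `pid_path` parameter, and the placeholder PIDs are invented here, and the two starters are simulated sequentially rather than as real concurrent processes.

```python
import os
import tempfile
from pathlib import Path


def cleanup_invalid_pid_path(pid_path: Path) -> None:
    """Force-unlink. Assumes the caller already found the recorded PID dead or invalid."""
    pid_path.unlink(missing_ok=True)  # idempotent: a second call is a no-op


def write_pid_file(pid_path: Path, pid: int) -> bool:
    """Create the PID file exclusively; return False if another starter got there first."""
    try:
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(pid))
    return True


def demo_two_starters() -> list[bool]:
    """Simulate two starters that both find the same stale file."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("999999")  # placeholder for a crashed gateway's PID

        # Both starters unlink the stale file; the second unlink finds nothing.
        cleanup_invalid_pid_path(pid_path)
        cleanup_invalid_pid_path(pid_path)

        # Both then attempt the exclusive create; only the first can succeed.
        return [write_pid_file(pid_path, 1001), write_pid_file(pid_path, 1002)]


if __name__ == "__main__":
    print(demo_two_starters())
```

The exclusive-create flag is what arbitrates between starters, which is why dropping the ownership guard in the cleanup helper does not change who wins.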

Changes

File	Change
`gateway/status.py`	`_cleanup_invalid_pid_path()`: direct `unlink()` instead of `remove_pid_file()`
`tests/gateway/test_status.py`	Two new regression tests covering the crash→restart scenario

Regression Tests Added

All 29 existing tests in tests/gateway/test_status.py continue to pass.

Before / After

Checklist

Bug confirmed reproduced in the field by two independent users (#13655)
Root cause identified with full call-chain analysis
Minimal, targeted fix (9 lines changed in production code)
Regression tests covering exact failure sequence
All existing tests pass
No behaviour change for the normal --replace race path

Changed files

gateway/status.py (modified, +9/-4)
tests/gateway/test_status.py (modified, +74/-0)



Severity

Medium — causes complete gateway service outage requiring manual intervention

Affected Versions

Current stable as of 2026-04-21

Problem Description

The Hermes Gateway service enters a restart loop when the Python gateway process is killed unexpectedly (SIGKILL, OOM, crash). On restart, the gateway.pid file still exists with the dead process's PID. The gateway startup logic treats this as a live process conflict and exits immediately with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Because the service is configured with Restart=on-failure and RestartSec=30, systemd re-attempts every 30 seconds and fails repeatedly, eventually hitting StartLimitBurst=5 and rate-limiting itself. The service becomes unreachable until an operator manually deletes ~/.hermes/gateway.pid.

Root Cause

In gateway/run.py, the startup sequence:

1. Reads gateway.pid to check for an existing gateway
2. Writes gateway.pid (which fails with FileExistsError if the file already exists and the old PID is gone but the file wasn't cleaned up)
3. Registers atexit.register(remove_pid_file) to delete the PID file on clean exit

Recommended Fix (In Gateway Code)

The gateway startup should validate whether a PID in gateway.pid is actually alive before treating it as a conflict:

# In gateway/run.py, before writing the PID file:
stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        # Signal 0 delivers nothing; it only checks that the PID exists
        # and that we are allowed to signal it.
        os.kill(stale_pid, 0)
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    except (ProcessLookupError, PermissionError):
        # ProcessLookupError: no such process.
        # PermissionError: the PID exists but belongs to another user's process.
        # Either way the file is treated as stale and safe to overwrite.
        remove_pid_file()

This makes the gateway resilient to crashes and eliminates the need for the workaround below.
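The liveness probe at the centre of this proposal can be isolated into a small helper. The sketch below is illustrative only: `pid_is_alive` is a hypothetical name, not a function from the Hermes code base. It also makes one choice that differs from the snippet above: a PermissionError is reported as "alive", because the PID does exist even though it belongs to another user. Which reading is appropriate depends on whether the gateway can ever run under a different account.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Best-effort check that a process with this PID currently exists (Unix)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


if __name__ == "__main__":
    print("self alive:", pid_is_alive(os.getpid()))

    # A child that has exited and been reaped no longer occupies its PID.
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    print("reaped child alive:", pid_is_alive(child.pid))
```

A PID check alone cannot tell a live gateway from an unrelated process that has reused the number, so it is a necessary test rather than a complete one.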

Workaround (Service Definition)

Added ExecStartPre to the systemd service template in hermes_cli/gateway.py:

[Service]
Type=simple
ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid
ExecStart=...gateway run --replace

This clears any stale PID file before every service start, breaking the restart loop.

Quick Fix (One-Liner)

When this happens, clear the lock and restart:

rm -f ~/.hermes/gateway.pid && hermes gateway start

extent analysis

TL;DR

To fix the Hermes Gateway service restart loop, validate whether a PID in gateway.pid is alive before treating it as a conflict, or use the provided workaround to clear the stale PID file before service start.

Guidance

Implement the recommended fix in gateway/run.py to validate the PID in gateway.pid before treating it as a conflict, ensuring the gateway can recover from crashes and SIGKILL.
As a temporary workaround, add ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid to the systemd service template to clear any stale PID file before service start.
Verify the fix by simulating a crash or SIGKILL and checking if the service can recover without manual intervention.
Consider applying the quick fix rm -f ~/.hermes/gateway.pid && hermes gateway start when the issue occurs, but this should be replaced with a permanent solution.

Example

The provided code snippet in gateway/run.py demonstrates how to validate the PID:

stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        os.kill(stale_pid, 0)  # PID exists and we can signal it
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    except (ProcessLookupError, PermissionError):
        # Stale PID: file exists but process is dead, safe to overwrite
        remove_pid_file()

Notes

The recommended fix requires changes to the gateway code, while the workaround modifies the systemd service template. Both approaches should resolve the restart loop issue.

Recommendation

Apply the recommended fix in gateway/run.py to validate the PID in gateway.pid, as it provides a more robust solution and eliminates the need for workarounds.
