hermes - ✅(Solved) Fix Gateway fails to start with --replace when previous instance PID is already dead (stale gateway.pid) [1 pull requests, 3 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#14203, fetched 2026-04-23 07:46:07
Comments: 3, Participants: 2, Timeline events: 8, Reactions: 0
Timeline (top): commented ×3, labeled ×3, closed ×1, cross-referenced ×1

PR fix notes

PR #14388: fix: recover from stale PID file on gateway startup

Description (problem / solution / changelog)

When killed by SIGKILL (systemd timeout, OOM), atexit handlers don't run and gateway.pid persists. On next startup, write_pid_file() hits FileExistsError and logs 'PID file race lost', causing systemd to exhaust restart attempts.

Fix: on FileExistsError, check if the recorded PID is alive via get_running_pid(cleanup_stale=True). If dead, clean up the stale file and retry write_pid_file() once. Genuine races (live competing instance) still fail immediately.
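The recovery described in the PR notes can be sketched roughly as follows. This is a minimal illustration and not the code from gateway/run.py: the helper names `pid_is_alive`, `write_pid_exclusive`, and `claim_pid_file` are hypothetical, the PID file is assumed to hold a plain integer (the real file format is not shown in this issue), and the liveness check is POSIX-only.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def write_pid_exclusive(pid_path: Path) -> None:
    """Create the PID file atomically; raises FileExistsError if it exists."""
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def claim_pid_file(pid_path: Path) -> bool:
    """Claim the PID file, recovering at most once from a stale one."""
    try:
        write_pid_exclusive(pid_path)
        return True
    except FileExistsError:
        pass

    # Assumption: the file holds a bare integer PID. Unreadable or
    # unparseable content is treated as stale in this sketch.
    try:
        recorded = int(pid_path.read_text().strip())
    except (OSError, ValueError):
        recorded = None

    if recorded is not None and pid_is_alive(recorded):
        return False  # genuine race: a live instance owns the file

    pid_path.unlink(missing_ok=True)  # stale file: remove it and retry once
    try:
        write_pid_exclusive(pid_path)
        return True
    except FileExistsError:
        return False  # another instance won the retry
```

Retrying only once keeps the genuine-race case fast: if a competing instance recreates the file between the unlink and the second attempt, the caller still gives up instead of looping.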

Changed files

  • gateway/run.py (modified, +25/-4)


When the gateway crashes or is killed and its PID is no longer alive, a subsequent gateway run --replace fails immediately with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

The gateway then exits with code 1 and systemd's Restart=on-failure creates a flap loop that cannot self-recover without manually deleting gateway.pid.

Root Cause

In gateway/run.py, the --replace startup flow is:

  1. existing_pid = get_running_pid() (line ~10799)
  2. If existing_pid is not None: terminate old process, then unlink gateway.pid (lines ~10801-10857)
  3. write_pid_file() with O_CREAT|O_EXCL (line ~11012)

When the old gateway is already dead, get_running_pid() correctly returns None, and its internal _cleanup_invalid_pid_path call is expected to unlink the stale gateway.pid. Because the condition in step 2 is existing_pid is not None, that step, including its own unlink, is skipped entirely. Cleanup then depends solely on get_running_pid, and there is a race: systemd may restart the gateway process before the file is deleted, or the file may persist if cleanup_stale doesn't run in time.

More critically: if --replace is specified, the startup should always ensure the PID file is gone before attempting write_pid_file(), regardless of whether the old process is alive or dead. The current code only cleans up inside the if existing_pid is not None block.
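To see why a leftover file is fatal regardless of whether its PID is alive, the following standalone sketch reproduces the O_CREAT|O_EXCL behaviour in a temporary directory. The helper `try_exclusive_create` and the PID value written to the file are illustrative only and do not come from the Hermes codebase.

```python
import os
import tempfile
from pathlib import Path


def try_exclusive_create(path: Path) -> bool:
    """Return True if the file was created, False if it already existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


with tempfile.TemporaryDirectory() as tmp:
    pid_path = Path(tmp) / "gateway.pid"
    pid_path.write_text("999999")  # simulate a file left behind by a killed process

    # O_EXCL only checks that the path exists; it never inspects the contents,
    # so a stale file blocks creation exactly like a live one.
    first = try_exclusive_create(pid_path)

    pid_path.unlink(missing_ok=True)  # the cleanup step that --replace should guarantee
    second = try_exclusive_create(pid_path)
```

Here `first` is False and `second` is True: the exclusive create succeeds only once the leftover file has been removed.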

Reproduction

  1. Start the gateway via systemd (hermes-gateway.service with Restart=on-failure)
  2. Kill the gateway process (SIGKILL, OOM, etc.)
  3. Note that ~/.hermes/gateway.pid still exists with the dead PID
  4. Systemd restarts the gateway → get_running_pid() returns None → stale PID file not cleaned by the replace block → write_pid_file() raises FileExistsError → gateway exits 1 → systemd restarts again → flap loop

Fix

After the if existing_pid is not None block and before write_pid_file(), add cleanup when --replace is set:

        # (after the existing_pid block, around line 10870)
        if replace:
            # Ensure stale PID file is removed even when the old process
            # is already dead (get_running_pid returned None).  Without this,
            # write_pid_file()'s O_CREAT|O_EXCL races with a leftover file.
            try:
                (get_hermes_home() / "gateway.pid").unlink(missing_ok=True)
            except Exception:
                pass

This matches the existing force-unlink on line ~10855 but covers the case where the replace logic was skipped because the old process was already gone.
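The snippet above swallows every exception silently. One possible variant, shown here only as a sketch, narrows the handler to OSError and reports the failure so that a permissions problem on the PID file is visible in the logs. The function name `remove_pid_file` is hypothetical, and printing to stderr stands in for whatever logger the gateway actually uses.

```python
import sys
from pathlib import Path


def remove_pid_file(pid_path: Path) -> bool:
    """Remove the PID file if present; return False only if removal failed."""
    try:
        # missing_ok (Python 3.8+) makes this a no-op when the file is already gone.
        pid_path.unlink(missing_ok=True)
    except OSError as exc:
        print(f"could not remove {pid_path}: {exc}", file=sys.stderr)
        return False
    return True
```

Because the call is idempotent, it is safe to run it unconditionally under --replace, whether or not the earlier block already unlinked the file.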

Workaround

sudo systemctl stop hermes-gateway
rm -f ~/.hermes/gateway.pid
sudo systemctl reset-failed hermes-gateway
sudo systemctl start hermes-gateway

Environment

  • Hermes Agent: latest (post-v2.47)
  • OS: Ubuntu inside Proxmox LXC
  • Systemd: system-level service (not user systemd)

extent analysis

TL;DR

The most likely fix is to add a cleanup step after the if existing_pid is not None block to ensure the stale PID file is removed when --replace is set.

Guidance

  • The issue is caused by a race condition where the PID file is not cleaned up properly when the old process is already dead, leading to a FileExistsError when trying to write the new PID file.
  • To fix this, add a try-except block to remove the stale PID file when --replace is set, as shown in the provided code snippet.
  • Verify that the fix works by reproducing the issue and checking that the gateway no longer exits with a FileExistsError.
  • As a temporary workaround, manually stop the service, remove the PID file, reset the failed service, and start it again, as shown in the provided bash commands.

Example

The provided code snippet shows the necessary cleanup step:

if replace:
    try:
        (get_hermes_home() / "gateway.pid").unlink(missing_ok=True)
    except Exception:
        pass

This ensures that the stale PID file is removed even when the old process is already dead.

Notes

The fix assumes that the get_hermes_home() function returns the correct path to the Hermes home directory, and that the gateway.pid file is located in that directory.

Recommendation

Apply the fix by adding the cleanup step to the code, as it directly addresses the root cause of the issue and prevents the FileExistsError from occurring. Until the fix is deployed, the manual workaround above clears the flap loop.
