hermes - ✅(Solved) Fix [Bug]: Gateway hang on clean exit / restart race with stale PID [1 pull requests, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#14176 (fetched 2026-04-23 07:46:28)
Comments: 2
Participants: 2
Timeline events: 7
Reactions: 0
Timeline (top): labeled ×3, commented ×2, cross-referenced ×1, referenced ×1

Error Message

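#!/usr/bin/env python3
"""Remove the gateway PID file if the process it names is no longer running."""
import json, os, sys

PID_FILE = "/home/ramit/.hermes/gateway.pid"

def main():
    if not os.path.exists(PID_FILE):
        sys.exit(0)
    try:
        with open(PID_FILE, "r") as f:
            data = json.load(f)
        pid = data.get("pid")
    except (json.JSONDecodeError, OSError):
        # Unreadable or corrupt PID file: nothing trustworthy to keep.
        os.remove(PID_FILE)
        sys.exit(0)
    exists = False
    if pid is not None:
        try:
            os.kill(pid, 0)  # signal 0 only probes for existence
            exists = True
        except (ProcessLookupError, PermissionError):
            # No such process, or the PID now belongs to another user and so
            # cannot be this user's gateway: either way the file is stale.
            exists = False
    if not exists:
        os.remove(PID_FILE)

if __name__ == "__main__":
    main()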

Root Cause

  1. Restart policy too narrow: Restart=on-failure misses clean exits
  2. No PID cleanup on stop: Stale PID file causes race condition on restart

Fix Action

Fix / Workaround

2. Patched systemd unit (~/.config/systemd/user/hermes-gateway.service)

[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network.target
StartLimitIntervalSec=600
StartLimitBurst=5

PR fix notes

PR #14332: fix(gateway): treat recycled PID with unreadable start_time as stale (#14176)

Description (problem / solution / changelog)

What does this PR do?

gateway/status.py::find_gateway_pids() iterates over the PIDs recorded in ~/.hermes/gateway.lock to decide whether the gateway is "still running". For each candidate it:

  1. Checks the PID is alive (os.kill(pid, 0)).
  2. Compares the recorded start_time against the live process's start_time to detect PID recycling.
  3. Falls back to _looks_like_gateway_process(pid) / _record_looks_like_gateway(record) heuristics.

When the recycled PID is owned by a different UID (typical on Linux when /proc/<pid>/stat is owned by another user, or under rootless container setups), _get_process_start_time returns None. The recorded-vs-live mismatch check then can't fire (current_start is None), and _looks_like_gateway_process can give a false positive on any long-lived python or hermes-related process the user happens to own. Result: the gateway thinks it's still running, refuses to start, and the user has to manually rm ~/.hermes/gateway.pid to recover.
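To make the "start time is unreadable" case concrete, here is a minimal sketch of how a process start time could be read from /proc/<pid>/stat on Linux. It is not the project's _get_process_start_time; the names parse_start_time and read_start_time are hypothetical, and the sketch only assumes the documented stat layout, where starttime is field 22 and the command name in field 2 is wrapped in parentheses.

```python
from typing import Optional


def parse_start_time(stat_line: str) -> Optional[int]:
    """Return starttime (field 22) from a /proc/<pid>/stat line, or None."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' in the line.
    _, sep, rest = stat_line.rpartition(")")
    if not sep:
        return None
    fields = rest.split()
    # fields[0] is field 3 (state), so field 22 sits at index 19.
    try:
        return int(fields[19])
    except (IndexError, ValueError):
        return None


def read_start_time(pid: int) -> Optional[int]:
    """Return the start time of `pid`, or None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/stat", "r") as f:
            return parse_start_time(f.read())
    except OSError:
        # Missing /proc entry, no permission, or a platform without /proc.
        return None
```

A None result here cannot distinguish "process gone" from "process exists but is not introspectable", which is exactly the ambiguity the PR has to resolve.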

Reporter (#14176) sees this in production with a systemd user service that restarts the gateway nightly — every few weeks the next PID up the queue lands on a recycled foreign PID, the lock file goes stale, and hermes gateway start fails with "Gateway already running".

Fix: be conservative. When the PID record carries a recorded_start but we can't read the candidate's current_start, skip the candidate (treat as stale) instead of falling through to the heuristic. Outside /proc-readable territory we don't have enough information to confirm this is the same gateway process, so prefer "no" over "maybe".

Related Issue

Fixes #14176

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/status.py (+10 / −0): in the find_gateway_pids() candidate loop, skip any PID whose recorded start_time exists but whose live start_time is unreadable. Same code path as the existing recorded-vs-live mismatch case, just covering the unreadable variant.
  • tests/gateway/test_status.py (+40 / −0): one new regression case under TestGatewayPidState, test_get_running_pid_treats_recycled_pid_with_unreadable_start_time_as_stale. Monkeypatches _get_process_start_time to return None and _looks_like_gateway_process to return True (the strongest stress for the false-positive path) and asserts the PID file is cleaned and get_running_pid() returns None.

Core diff:

         recorded_start = record.get("start_time")
         current_start = _get_process_start_time(pid)
         if recorded_start is not None and current_start is not None and current_start != recorded_start:
             continue
+        # If the PID record carries a recorded start_time but we can't read
+        # the current process's start_time, the PID may have been recycled by
+        # the OS to a process the current user can't introspect (typical on
+        # Linux when /proc/<pid>/stat is owned by another UID). The downstream
+        # _looks_like_gateway_process heuristic can give a false positive in
+        # that situation — e.g. another long-lived python process — leaving
+        # a stale PID file that blocks future starts. Be conservative and
+        # skip this candidate. See #14176.
+        if recorded_start is not None and current_start is None:
+            continue

         if _looks_like_gateway_process(pid) or _record_looks_like_gateway(record):
             return pid
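The two start_time checks in the diff can be restated as a single predicate. This is a standalone sketch for illustration only; candidate_is_stale is a hypothetical name, not a function in gateway/status.py.

```python
from typing import Optional


def candidate_is_stale(recorded_start: Optional[int],
                       current_start: Optional[int]) -> bool:
    """Mirror the two `continue` branches from the diff above."""
    if recorded_start is None:
        # Nothing recorded to compare against: the caller falls through
        # to the "looks like a gateway" heuristics.
        return False
    if current_start is None:
        # New branch: identity cannot be verified, so prefer "no".
        return True
    # Pre-existing branch: a differing start time means the PID was recycled.
    return current_start != recorded_start
```

Only candidates for which this returns False would reach the heuristic checks; before the patch, the (recorded, None) case also reached them.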

How to Test

Reporter-style repro on a Linux host:

  1. Run the gateway, kill -9 the parent process to leave ~/.hermes/gateway.pid and ~/.hermes/gateway.lock populated with the dead PID.
  2. Start a long-lived python process under a different UID (e.g. another hermes daemon under another account) that the test user can see via ps but NOT via /proc/<pid>/stat. Note its PID.
  3. Edit the lock file to point at that recycled PID, keeping the original start_time field intact.
  4. Run hermes gateway start.

Before: refuses to start with "Gateway already running". After: detects the start_time mismatch is unverifiable, treats the entry as stale, cleans the lock file, and starts a fresh gateway.

Automated regression suite:

pytest tests/gateway/test_status.py::TestGatewayPidState -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've run pytest tests/gateway/test_status.py::TestGatewayPidState -q and all tests pass (11/11)
  • I've added tests for my changes
  • I've tested on my platform: macOS 26.5 (arm64), Python 3.11.14 via uv

Documentation & Housekeeping

  • Documentation updates — N/A (internal helper, no user-visible API change beyond bug fix)
  • cli-config.yaml.example — N/A (no new config)
  • CONTRIBUTING.md / AGENTS.md — N/A
  • Cross-platform impact considered — change is conservative on every platform; the false positive fix matters most on Linux but doesn't regress macOS/Windows behavior
  • Tool descriptions/schemas — N/A

Not in scope

  • The reporter's bash script idea (a dedicated hermes gateway clean-pid command) — that's nice-to-have but a separate UX surface; the in-band fix here is the higher-impact change since it stops the bad state from forming.
  • Auditing gateway.target / Restart= semantics in the example systemd unit (the issue's secondary note) — that's a docs change for docs/deploy/ that deserves its own PR.
  • Hardening atexit-vs-SIGKILL paths so a kill -9 of the gateway doesn't leave a PID file behind — a real concern but out of scope for the reported bug, which is about PID-file interpretation, not creation.
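For context on the last item, the sketch below shows one common way a daemon might remove its own PID file on exit. It is an illustration under assumptions, not the gateway's code: the helper names are hypothetical, and the JSON layout with a "pid" key is inferred from the reporter's cleanup script.

```python
import atexit
import json
import os
import signal
import sys


def write_pid_file(path: str) -> None:
    """Record the current process ID as JSON (layout is an assumption)."""
    with open(path, "w") as f:
        json.dump({"pid": os.getpid()}, f)


def remove_pid_file(path: str) -> None:
    """Delete the PID file; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def install_cleanup(path: str) -> None:
    """Remove the PID file on normal interpreter exit and on SIGTERM."""
    atexit.register(remove_pid_file, path)
    # atexit handlers do not run when the default SIGTERM action kills the
    # process, so turn SIGTERM into a normal exit first.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
```

SIGKILL cannot be caught, so no in-process handler can cover kill -9. That is why the PID-file reader still has to tolerate stale files.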

Screenshots / Logs

Verification

$ python3 -m py_compile gateway/status.py tests/gateway/test_status.py
OK

$ uv run --no-project --with pytest --with pytest-xdist --with pyyaml \
       --with python-dotenv --with prompt_toolkit --with rich --with httpx \
       --with fastapi --with pydantic python -m pytest \
       tests/gateway/test_status.py::TestGatewayPidState -q
...........                                                              [100%]
11 passed in 0.51s

(11 = 10 existing + 1 new regression case, all green.)

Changed files

  • gateway/status.py (modified, +10/-0)
  • tests/gateway/test_status.py (modified, +40/-0)


Bug Report: Gateway Hang on Clean Exit / Restart Race with Stale PID

Observed Behavior

  • Gateway Telegram bot stops responding to messages
  • systemctl --user restart hermes-gateway times out (60s)
  • Process exits cleanly after SIGTERM drain timeout ("Gateway stopped" with exit code 0)
  • Systemd (Restart=on-failure) does not restart because exit 0 = success
  • Stale ~/.hermes/gateway.pid blocks any future start ("Gateway already...")
  • Gateway stays dead until manual kill -9 + service restart

Root Cause

  1. Restart policy too narrow: Restart=on-failure misses clean exits
  2. No PID cleanup on stop: Stale PID file causes race condition on restart
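The first root cause can be illustrated with a deliberately simplified model of the Restart= setting. This sketch considers only the exit code of a process that ends on its own; real systemd also takes signals, timeouts, the watchdog, SuccessExitStatus= and RestartForceExitStatus= into account, and does not restart a unit that was stopped explicitly. The function name should_restart is hypothetical.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Simplified model: does systemd restart after this exit code?"""
    if policy == "always":
        return True
    if policy == "on-failure":
        return exit_code != 0  # a clean exit (0) counts as success
    if policy == "on-success":
        return exit_code == 0
    if policy == "no":
        return False
    raise ValueError(f"policy not covered by this sketch: {policy}")


# The gateway drained and exited with code 0, so on-failure left it stopped.
print(should_restart("on-failure", 0))  # False
print(should_restart("always", 0))      # True
```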

Environment

  • hermes-agent commit: (current main)
  • OS: Debian 13 (trixie) aarch64
  • Runtime: systemd user service

Fix Applied

1. PID cleanup script (~/scripts/hermes-gateway-pid-cleanup.sh)

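#!/usr/bin/env python3
"""Remove the gateway PID file if the process it names is no longer running."""
import json, os, sys

PID_FILE = "/home/ramit/.hermes/gateway.pid"

def main():
    if not os.path.exists(PID_FILE):
        sys.exit(0)
    try:
        with open(PID_FILE, "r") as f:
            data = json.load(f)
        pid = data.get("pid")
    except (json.JSONDecodeError, OSError):
        # Unreadable or corrupt PID file: nothing trustworthy to keep.
        os.remove(PID_FILE)
        sys.exit(0)
    exists = False
    if pid is not None:
        try:
            os.kill(pid, 0)  # signal 0 only probes for existence
            exists = True
        except (ProcessLookupError, PermissionError):
            # No such process, or the PID now belongs to another user and so
            # cannot be this user's gateway: either way the file is stale.
            exists = False
    if not exists:
        os.remove(PID_FILE)

if __name__ == "__main__":
    main()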

2. Patched systemd unit (~/.config/systemd/user/hermes-gateway.service)

[Unit]
Description=Hermes Agent Gateway - Messaging Platform Integration
After=network.target
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Type=simple
ExecStart=/home/ramit/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main gateway run --replace
ExecStartPre=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
WorkingDirectory=/home/ramit/.hermes/hermes-agent
Environment="PATH=/home/ramit/.hermes/hermes-agent/venv/bin:/home/ramit/.hermes/hermes-agent/node_modules/.bin:/home/ramit/.nvm/versions/node/v24.14.0/bin:/home/ramit/.local/bin:/home/ramit/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="VIRTUAL_ENV=/home/ramit/.hermes/hermes-agent/venv"
Environment="HERMES_HOME=/home/ramit/.hermes"
Restart=always
RestartSec=30
RestartForceExitStatus=75
ExecStopPost=/home/ramit/.hermes/hermes-agent/venv/bin/python /home/ramit/scripts/hermes-gateway-pid-cleanup.sh
KillMode=mixed
KillSignal=SIGTERM
ExecReload=/bin/kill -USR1 $MAINPID
TimeoutStopSec=60
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=default.target

Key Changes

Directive     Before      After           Purpose
Restart       on-failure  always          Restart even after clean exit (exit 0)
ExecStartPre  (none)      cleanup script  Remove stale PID before start
ExecStopPost  (none)      cleanup script  Remove stale PID after any stop

Verification

  • daemon-reload + restart: service active, Telegram reconnected
  • ExecStartPre exits 0/SUCCESS
  • No stale PID race observed

Suggested Upstream Action

  1. Ship scripts/hermes-gateway-pid-cleanup.py in repo
  2. Update sample systemd unit in docs/install.md with Restart=always + ExecStartPre/ExecStopPost

extent analysis

TL;DR

To fix the gateway hang issue, update the systemd unit to use Restart=always and add a PID cleanup script as ExecStartPre and ExecStopPost to remove stale PIDs.

Guidance

  • The root cause of the issue is the narrow restart policy and lack of PID cleanup, which can be addressed by updating the systemd unit and adding a cleanup script.
  • To verify the fix, reload the systemd daemon and restart the service, then check that the service is active and Telegram is reconnected.
  • The provided PID cleanup script can be used as ExecStartPre and ExecStopPost to remove stale PIDs before and after service stops.
  • Consider shipping the PID cleanup script in the repository and updating the sample systemd unit in the documentation.

Example

The provided PID cleanup script (hermes-gateway-pid-cleanup.sh) can be used as a template:

#!/usr/bin/env python3
import json, os, sys

PID_FILE = "/home/ramit/.hermes/gateway.pid"

def main():
    # ... (rest of the script remains the same)

Note that this script should be adapted to the specific environment and PID file location.

Notes

The fix assumes that the systemd unit is configured correctly and that the PID cleanup script is working as expected. Additional testing and verification may be necessary to ensure the fix is working in all scenarios.

Recommendation

Apply the workaround by updating the systemd unit to use Restart=always and adding the PID cleanup script as ExecStartPre and ExecStopPost. This will ensure that the service restarts even after clean exits and removes stale PIDs to prevent race conditions.
