hermes - ✅ (Solved) Fix: `start_gateway` should verify lock-holder PID is alive before treating stale lock as "another instance" [1 pull request, 1 comment, 2 participants]

NousResearch/hermes-agent#28561 (fetched 2026-05-20 04:03:21)
Comments: 1, Participants: 2, Reactions: 0
Timeline: 5 events (labeled ×3, commented ×1, cross-referenced ×1)

PR fix notes

PR #28603: fix(gateway): explain stale runtime lock failures (#28561)

Description (problem / solution / changelog)

What does this PR do?

This PR improves gateway startup diagnostics around runtime-lock acquisition.

When Hermes cannot acquire gateway.lock and also cannot identify a live gateway PID, it now reports the failure as a likely stale-lock case and points the user to the supported recovery path (hermes gateway run --replace) instead of implying that another live gateway instance definitely exists.

Solution Sketch

  • fix the root cause in the touched subsystem instead of layering a broad workaround around the symptom
  • keep surrounding behavior stable and avoid unrelated refactors while the area is under review
  • prove the change with focused checks on the exact path that regressed

Related Issue

Fixes #28561

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • re-checked for a live gateway PID after runtime-lock acquisition failed
  • kept the existing message when a live competing PID is present
  • emitted a more specific stale-lock message when no live PID can be found
  • included the concrete gateway.lock and gateway.pid paths plus the supported recovery path
  • added regression coverage for the stale-lock startup branch

How to Test

  1. Run python -m pytest tests/gateway/test_runner_startup_failures.py::test_start_gateway_reports_stale_runtime_lock_guidance tests/gateway/test_runner_startup_failures.py::test_start_gateway_replace_clears_marker_on_permission_denied tests/gateway/test_runner_startup_failures.py::test_start_gateway_verbosity_imports_redacting_formatter -q -n 4.
  2. Confirm the stale-lock path now gives targeted recovery guidance.
  3. Confirm the live-PID path still keeps the existing competing-instance message.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • N/A (CLI/gateway diagnostics change). If needed, attach the startup log for the stale-lock path.

Changed files

  • gateway/run.py (modified, +12/-3)
  • tests/gateway/test_runner_startup_failures.py (modified, +29/-0)


Issue: start_gateway should verify lock-holder PID is alive before treating stale lock as "another instance"

Bug Report

Component: gateway/status.py, gateway/run.py
Severity: High (causes complete gateway startup failure after a crash, requiring manual file cleanup)
Platform: Primarily Windows (reproducible on any platform where acquire_gateway_runtime_lock returns False for a stale lock)


Summary

When the gateway process crashes or is force-killed, the runtime lock file may remain "locked" in a way that acquire_gateway_runtime_lock() returns False on the next startup. The startup logic in gateway/run.py then exits with:

ERROR: Gateway runtime lock is already held by another instance. Exiting.

However, the existing get_running_pid() function already contains robust stale-PID detection (it checks os.kill(pid, 0), process start time, and cmdline matching, and auto-cleans PID files). The startup flow calls get_running_pid() only before acquire_gateway_runtime_lock() and does not call it again when the acquisition fails, so this stale-lock cleanup logic is completely bypassed on the failure path.


Reproduction Steps

  1. Start gateway normally: hermes gateway run
  2. Force-kill the gateway process (e.g., taskkill /F /PID <pid> on Windows, or kill -9 on POSIX)
  3. Attempt to restart gateway: hermes gateway run
  4. Observed: Gateway immediately exits with "runtime lock is already held by another instance"
  5. Workaround: Manually delete ~/.hermes/gateway.lock or ~/.hermes/gateway.pid, then restart succeeds

Root Cause Analysis

Current startup flow (simplified)

# gateway/run.py ~L15330
current_pid = get_running_pid()          # checks PID file + lock validity
if current_pid is not None and current_pid != os.getpid():
    logger.error("Another gateway instance started during our startup. Exiting.")
    return False

if not acquire_gateway_runtime_lock():   # ← ONLY checks file lock; NO PID validation
    logger.error("Gateway runtime lock is already held by another instance. Exiting.")
    return False

The gap

acquire_gateway_runtime_lock() calls _try_acquire_file_lock(), which attempts to grab the OS-level file lock. If the lock is still held (e.g., on Windows the msvcrt.locking lock may not be released after taskkill /F), it returns False immediately. It never asks: "who holds this lock, and are they still alive?"

Meanwhile, get_running_pid() already does exactly this validation:

# gateway/status.py ~L802
for record in (primary_record, fallback_record):
    pid = _pid_from_record(record)
    if pid is None:
        continue
    try:
        os.kill(pid, 0)  # existence check
    except ProcessLookupError:
        continue  # process is dead → stale
    # ... also checks start_time and cmdline

But run.py calls get_running_pid() only before acquire_gateway_runtime_lock(), and only for the "another instance started during our startup" branch. If get_running_pid() returns None (because the PID file was already cleaned) while the lock file itself is still locked by a dead process, the code proceeds to acquire_gateway_runtime_lock(), which returns False, and startup exits.
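To make the liveness check concrete, here is a minimal sketch of the os.kill(pid, 0) probe described above. The helper name pid_is_alive is hypothetical and is not part of Hermes; per the issue, the real get_running_pid() additionally compares process start time and cmdline. The sketch assumes POSIX semantics: Python's documentation states that on Windows os.kill() with a signal other than CTRL_C_EVENT or CTRL_BREAK_EVENT terminates the target process, so a Windows implementation would need a different probe.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID appears to exist (POSIX only).

    Hypothetical helper for illustration; not the Hermes implementation.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the checks but delivers nothing
    except ProcessLookupError:
        return False  # no such process: the record is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Treating PermissionError as "alive" is the conservative choice: a lock should only be declared stale when the holder is positively known to be gone. PID reuse is why an existence check alone is not sufficient, which is presumably why the existing code also matches start time and cmdline.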


Proposed Fix

Option A (Recommended): Reuse get_running_pid() as a lock-validity gate

In gateway/run.py, before calling acquire_gateway_runtime_lock(), read the PID from the lock file and validate it with get_running_pid()'s logic. If the recorded PID is dead, forcibly break the stale lock by closing and reopening the lock file (or tell the user to run with --replace).

Option B: Make acquire_gateway_runtime_lock() smarter

Add a cleanup_stale: bool = True parameter to acquire_gateway_runtime_lock(). When the initial lock attempt fails:

  1. Read the PID record from the lock file (_read_gateway_lock_record())
  2. If the recorded PID is dead (os.kill(pid, 0) raises ProcessLookupError or OSError)
  3. Close the current handle, truncate/reopen the lock file, and retry the lock acquisition
  4. Log a warning: Recovered stale runtime lock from dead process PID {pid}

This mirrors the pattern already used in acquire_scoped_lock(), which does replace stale records:

# test_status.py references this behavior:
# test_acquire_scoped_lock_replaces_stale_record
# test_acquire_scoped_lock_recovers_empty_lock_file
# test_acquire_scoped_lock_recovers_corrupt_lock_file
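
The Option B steps could be sketched roughly as follows. This is an illustration under stated assumptions, not the Hermes code: the function name and all four callables are hypothetical stand-ins (try_lock for the file-lock attempt, read_holder_pid for reading the lock record, reset_lock_file for the close/truncate/reopen step, pid_is_alive for the liveness check). Only the cleanup_stale parameter and the warning text come from the proposal above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("gateway.lock_sketch")


def acquire_with_stale_recovery(
    try_lock: Callable[[], bool],
    read_holder_pid: Callable[[], Optional[int]],
    reset_lock_file: Callable[[], None],
    pid_is_alive: Callable[[int], bool],
    cleanup_stale: bool = True,
) -> bool:
    """Try the lock once; if it fails and the recorded holder is dead, retry once."""
    if try_lock():
        return True
    if not cleanup_stale:
        return False

    pid = read_holder_pid()
    if pid is None or pid_is_alive(pid):
        # Unknown or live holder: never break the lock.
        return False

    reset_lock_file()  # close the handle, truncate/reopen the lock file
    if try_lock():
        logger.warning("Recovered stale runtime lock from dead process PID %s", pid)
        return True
    return False
```

Passing the operations in as callables keeps the decision logic testable without real lock files. A missing PID record is treated as "do not recover" here; whether that is the right default for the gateway is a design decision the sketch does not settle.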

Option C: Startup script auto-detect

In hermes gateway run CLI, add a pre-flight check: if acquire_gateway_runtime_lock() fails, call get_running_pid(). If get_running_pid() returns None, print a helpful error:

Gateway lock file appears stale (no running process holds it).
Run `hermes gateway run --replace` to force-start, or manually remove:
  <lock_path>
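
The message selection in Option C can be sketched as a small pure function. The function name and parameters are hypothetical; the stale-lock wording is taken from the proposal above, and the exact text emitted by the merged PR may differ. The PID-file path is included because the PR notes say both the gateway.lock and gateway.pid paths are reported.

```python
from typing import Optional


def lock_failure_message(live_pid: Optional[int], lock_path: str, pid_path: str) -> str:
    """Pick the startup error after the runtime lock could not be acquired.

    live_pid is the result of re-checking for a running gateway
    (None means no live process was found).
    """
    if live_pid is not None:
        # A live competing gateway exists: keep the existing message.
        return "Gateway runtime lock is already held by another instance. Exiting."
    # No live holder could be identified: likely a stale lock.
    return (
        "Gateway lock file appears stale (no running process holds it).\n"
        "Run `hermes gateway run --replace` to force-start, or manually remove:\n"
        f"  {lock_path}\n"
        f"  {pid_path}"
    )
```

In the startup flow, this would be called only after acquire_gateway_runtime_lock() returns False, with live_pid set to the result of a second get_running_pid() call.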

Related Code

| File | Lines | Description |
| --- | --- | --- |
| gateway/run.py | 15330-15350 | Startup lock acquisition + PID file race logic |
| gateway/status.py | 313-331 | acquire_gateway_runtime_lock(): only checks file lock |
| gateway/status.py | 348-368 | is_gateway_runtime_lock_active(): lock existence check |
| gateway/status.py | 802-852 | get_running_pid(): already has stale-PID cleanup |
| tests/gateway/test_status.py | 55-76 | Test: test_get_running_pid_cleans_stale_record_from_dead_process |
| tests/gateway/test_status.py | 421-466 | Tests for acquire_scoped_lock stale-lock recovery |

Environment

  • OS: Windows 10/11 (also reproducible on Linux if lock mechanism doesn't auto-release)
  • Hermes version: v0.5.25+
  • Python: 3.11+
  • Lock mechanism: msvcrt.locking on Windows, fcntl.flock on POSIX
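
For readers unfamiliar with the two lock mechanisms listed above, a non-blocking acquisition attempt might look like the sketch below. The function name try_lock_file is hypothetical, and this is not the body of Hermes's _try_acquire_file_lock(); it only shows the standard-library calls involved.

```python
import os
import sys


def try_lock_file(fd: int) -> bool:
    """Try to take an exclusive, non-blocking lock on an open file descriptor."""
    try:
        if sys.platform == "win32":
            import msvcrt

            os.lseek(fd, 0, os.SEEK_SET)
            # Lock one byte starting at the current file position.
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
        else:
            import fcntl

            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        # Lock is held elsewhere (or the call failed): report "not acquired".
        return False
    return True
```

On POSIX, an flock() lock is released once every descriptor referring to the open file is closed, which the kernel does when the owning process exits. That is consistent with the report being primarily a Windows problem, and it is also why a bare False from this kind of call cannot distinguish a live holder from a stale one.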

Impact

This issue causes complete service unavailability after any ungraceful gateway shutdown (crash, kill -9, Windows force-kill, power loss). Users without knowledge of the internal lock file location cannot recover without manual intervention. It also breaks automated restart loops (systemd Restart=always, scheduled health-check restarts, etc.).


Workaround (for users hitting this now)

# Remove stale lock and PID files
rm ~/.hermes/gateway.lock ~/.hermes/gateway.pid

# Or on Windows:
del %USERPROFILE%\.hermes\gateway.lock %USERPROFILE%\.hermes\gateway.pid

# Then restart
hermes gateway run
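
For automated restart loops, the same cleanup can be done portably from Python. This is a sketch only: the function name is hypothetical, it assumes the default ~/.hermes location used in the commands above, and it must only be run once you have confirmed that no gateway process is still alive.

```python
from pathlib import Path
from typing import List, Optional


def remove_stale_gateway_files(hermes_dir: Optional[Path] = None) -> List[Path]:
    """Delete gateway.lock and gateway.pid if present; return what was removed."""
    base = hermes_dir if hermes_dir is not None else Path.home() / ".hermes"
    removed: List[Path] = []
    for name in ("gateway.lock", "gateway.pid"):
        target = base / name
        try:
            target.unlink()
        except FileNotFoundError:
            continue  # already gone, nothing to do
        removed.append(target)
    return removed
```

Other errors are deliberately not swallowed. For example, a PermissionError on Windows can mean some process still has the file open, which is worth surfacing rather than hiding.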
