hermes - ✅(Solved) Fix # Issue: `start_gateway` should verify lock-holder PID is alive before treating stale lock as "another instance" [1 pull requests, 1 comments, 2 participants]

goddog2024 · 2026-05-19T06:42:14Z

When the gateway process crashes or is force-killed, the runtime lock file may remain "locked" in a way that `acquire_gateway_runtime_lock()` returns `False` on the next startup. The startup logic in `gateway/run.py` then exits with:

```
ERROR: Gateway runtime lock is already held by another instance. Exiting.
```

**However**, the existing `get_running_pid()` function already contains robust stale-PID detection (checks `os.kill(pid, 0)`, process start time, cmdline matching, and auto-cleans PID files). The startup flow does **not** consult `get_running_pid()` again when `acquire_gateway_runtime_lock()` fails, so this stale-lock cleanup logic is bypassed on that path.

---

# PR #28603: fix(gateway): explain stale runtime lock failures (#28561)

- Repository: NousResearch/hermes-agent
- Author: wesleysimplicio
- State: open | merged: False
- Link: https://github.com/NousResearch/hermes-agent/pull/28603

## Description (problem / solution / changelog)

## What does this PR do?

This PR improves gateway startup diagnostics around runtime-lock acquisition. When Hermes cannot acquire `gateway.lock` and also cannot identify a live gateway PID, it now reports the failure as a likely stale-lock case and points the user to the supported recovery path (`hermes gateway run --replace`) instead of implying that another live gateway instance definitely exists.

## Solution Sketch

- fix the root cause in the touched subsystem instead of layering a broad workaround around the symptom
- keep surrounding behavior stable and avoid unrelated refactors while the area is under review
- prove the change with focused checks on the exact path that regressed

## Related Issue

Fixes #28561

## Type of Change

- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 🔒 Security fix
- [ ] 📝 Documentation update
- [x] ✅ Tests (adding or improving test coverage)
- [ ] ♻️ Refactor (no behavior change)
- [ ] 🎯 New skill (bundled or hub)

## Changes Made

- re-checked for a live gateway PID after runtime-lock acquisition failed
- kept the existing message when a live competing PID is present
- emitted a more specific stale-lock message when no live PID can be found
- included the concrete `gateway.lock` and `gateway.pid` paths plus the supported recovery path
- added regression coverage for the stale-lock startup branch

## How to Test

1. Run `python -m pytest tests/gateway/test_runner_startup_failures.py::test_start_gateway_reports_stale_runtime_lock_guidance tests/gateway/test_runner_startup_failures.py::test_start_gateway_replace_clears_marker_on_permission_denied tests/gateway/test_runner_startup_failures.py::test_start_gateway_verbosity_imports_redacting_formatter -q -n 4`.
2. Confirm the stale-lock path now gives targeted recovery guidance.
3. Confirm the live-PID path still keeps the existing competing-instance message.

## Checklist

### Code

- [ ] I've read the [Contributing Guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md)
- [ ] My commit messages follow [Conventional Commits](https://www.conventionalcommits.org/) (`fix(scope):`, `feat(scope):`, etc.)
- [ ] I searched for [existing PRs](https://github.com/NousResearch/hermes-agent/pulls) to make sure this isn't a duplicate
- [ ] My PR contains **only** changes related to this fix/feature (no unrelated commits)
- [ ] I've run `pytest tests/ -q` and all tests pass
- [ ] I've added tests for my changes (required for bug fixes, strongly encouraged for features)
- [ ] I've tested on my platform:

### Documentation & Housekeeping

- [ ] I've updated relevant documentation (README, `docs/`, docstrings) (or N/A)
- [ ] I've updated `cli-config.yaml.example` if I added/changed config keys (or N/A)
- [ ] I've updated `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows (or N/A)
- [ ] I've considered cross-platform impact (Windows, macOS) per the [compatibility guide](https://github.com/NousResearch/hermes-agent/blob/main/CONTRIBUTING.md#cross-platform-compatibility) (or N/A)
- [ ] I've updated tool descriptions/schemas if I changed tool behavior (or N/A)

## Screenshots / Logs

- N/A (CLI/gateway diagnostics change). If needed, attach the startup log for the stale-lock path.

## Changed files

- `gateway/run.py` (modified, +12/-3)
- `tests/gateway/test_runner_startup_failures.py` (modified, +29/-0)

## Fix / Workaround

1. Start gateway normally: `hermes gateway run`
2. Force-kill the gateway process (e.g., `taskkill /F /PID <pid>` on Windows, or `kill -9` on POSIX)
3. Attempt to restart gateway: `hermes gateway run`
4. **Observed:** Gateway immediately exits with "runtime lock is already held by another instance"
5. **Workaround:** Manually delete `~/.hermes/gateway.lock` or `~/.hermes/gateway.pid`


NousResearch/hermes-agent#28561•Fetched 2026-05-20 04:03:21

Author: goddog2024

Participants: goddog2024, wesleysimplicio

Timeline (top): labeled ×3, commented ×1, cross-referenced ×1


Issue: `start_gateway` should verify lock-holder PID is alive before treating stale lock as "another instance"

Bug Report

Component: `gateway/status.py`, `gateway/run.py`

Severity: High. Causes complete gateway startup failure after a crash, requiring manual file cleanup.

Platform: Primarily Windows (reproducible on any platform where `acquire_gateway_runtime_lock` returns `False` for a stale lock)

Summary

After a crash or force-kill, the next `hermes gateway run` exits immediately with:

```
ERROR: Gateway runtime lock is already held by another instance. Exiting.
```

Reproduction Steps

1. Start gateway normally: `hermes gateway run`
2. Force-kill the gateway process (e.g., `taskkill /F /PID <pid>` on Windows, or `kill -9` on POSIX)
3. Attempt to restart gateway: `hermes gateway run`
4. Observed: Gateway immediately exits with "runtime lock is already held by another instance"
5. Workaround: Manually delete `~/.hermes/gateway.lock` or `~/.hermes/gateway.pid`; the restart then succeeds

Root Cause Analysis

Current startup flow (simplified)

```python
# gateway/run.py ~L15330
current_pid = get_running_pid()          # checks PID file + lock validity
if current_pid is not None and current_pid != os.getpid():
    logger.error("Another gateway instance started during our startup. Exiting.")
    return False

if not acquire_gateway_runtime_lock():   # ← ONLY checks file lock; NO PID validation
    logger.error("Gateway runtime lock is already held by another instance. Exiting.")
    return False
```

The gap

`acquire_gateway_runtime_lock()` calls `_try_acquire_file_lock()`, which attempts to grab the OS-level file lock. If the lock is still held (e.g., Windows `msvcrt.locking` may not auto-release after `taskkill /F`), it returns `False` immediately. It never asks: "who holds this lock, and are they still alive?"
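To make the gap concrete, here is a minimal sketch of a non-blocking lock attempt. The function name `try_acquire_file_lock` and the one-byte lock region are assumptions for illustration; this is not the actual `_try_acquire_file_lock()` from `gateway/status.py`. The point is that a failed attempt only yields a yes/no answer: neither `fcntl.flock` nor `msvcrt.locking` reports which process holds the lock.

```python
import os
import sys

if sys.platform == "win32":
    import msvcrt

    def _lock_fd(fd: int) -> None:
        # Non-blocking lock on one byte at the current file position.
        msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)
else:
    import fcntl

    def _lock_fd(fd: int) -> None:
        # Non-blocking exclusive advisory lock on the whole file.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)


def try_acquire_file_lock(path: str):
    """Hypothetical helper: return an fd holding the lock, or None if it is taken."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        _lock_fd(fd)
    except OSError:
        # The lock is held elsewhere; the OS does not say by whom.
        os.close(fd)
        return None
    return fd
```

The caller has to keep the returned descriptor open for as long as it wants to hold the lock, because closing it releases the lock.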

Meanwhile, get_running_pid() already does exactly this validation:

```python
# gateway/status.py ~L802
for record in (primary_record, fallback_record):
    pid = _pid_from_record(record)
    if pid is None:
        continue
    try:
        os.kill(pid, 0)  # existence check
    except ProcessLookupError:
        continue  # process is dead → stale
    # ... also checks start_time and cmdline
```

But `run.py` calls `get_running_pid()` before `acquire_gateway_runtime_lock()`, and only for the "another instance started during our startup" branch. If `get_running_pid()` returns `None` (because the PID file was already cleaned) while the lock file itself is still locked by a dead process, the code proceeds to `acquire_gateway_runtime_lock()`, which returns `False`, and the gateway exits.
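The existence check at the heart of that validation can be sketched as follows. This is an illustrative POSIX-only helper (the name `pid_is_alive` is an assumption), and it deliberately omits the start-time and cmdline comparisons the issue mentions, so a recycled PID would still count as alive. It should not be used as-is on Windows: per the Python documentation, `os.kill` there terminates the target process for any signal value other than `CTRL_C_EVENT` and `CTRL_BREAK_EVENT`.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Hypothetical POSIX-only probe: signal 0 checks existence without delivering a signal."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process, so the record is stale
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True
```

Treating `PermissionError` as "alive" matters here: it is an `OSError` subclass, so a handler that maps every `OSError` to "dead" could break the lock of a live gateway owned by a different user.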

Proposed Fix

Option A (Recommended): Reuse `get_running_pid()` as a lock-validity gate

In `gateway/run.py`, before calling `acquire_gateway_runtime_lock()`, attempt to read the PID from the lock file and validate it with `get_running_pid()`'s logic. If the recorded PID is dead, forcibly break the stale lock by closing and reopening the lock file (or, at minimum, document that the user should run with `--replace`).

Option B: Make `acquire_gateway_runtime_lock()` smarter

Add a `cleanup_stale: bool = True` parameter to `acquire_gateway_runtime_lock()`. When the initial lock attempt fails:

1. Read the PID record from the lock file (`_read_gateway_lock_record()`).
2. Check whether the recorded PID is dead (`os.kill(pid, 0)` raises `ProcessLookupError` or `OSError`).
3. If it is, close the current handle, truncate/reopen the lock file, and retry the lock acquisition.
4. Log a warning: `Recovered stale runtime lock from dead process PID {pid}`.

This mirrors the pattern already used in acquire_scoped_lock(), which does replace stale records:

```python
# test_status.py references this behavior:
# test_acquire_scoped_lock_replaces_stale_record
# test_acquire_scoped_lock_recovers_empty_lock_file
# test_acquire_scoped_lock_recovers_corrupt_lock_file
```
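The control flow of Option B could look roughly like the sketch below. Everything here is hypothetical: the function name and the four injected callables stand in for the real lock attempt, `_read_gateway_lock_record()`, the liveness check, and the truncate/reopen step, whose actual signatures are not shown in this issue. The sketch also assumes that a missing or unreadable PID record should not be treated as proof of staleness.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def acquire_with_stale_recovery(
    try_lock: Callable[[], bool],
    read_lock_pid: Callable[[], Optional[int]],
    pid_is_alive: Callable[[int], bool],
    reset_lock_file: Callable[[], None],
) -> bool:
    """Try the lock once; if it fails and the recorded holder is dead, reset and retry once."""
    if try_lock():
        return True

    pid = read_lock_pid()
    if pid is None or pid_is_alive(pid):
        # Either we cannot identify the holder or it is a live competing instance.
        return False

    reset_lock_file()  # e.g. close the handle and truncate/reopen the lock file
    if try_lock():
        logger.warning("Recovered stale runtime lock from dead process PID %s", pid)
        return True
    return False
```

Passing the steps in as callables keeps the decision logic testable without touching real lock files, and limiting the recovery to a single retry avoids looping if the reset does not actually free the lock.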

Option C: Startup script auto-detect

In hermes gateway run CLI, add a pre-flight check: if acquire_gateway_runtime_lock() fails, call get_running_pid(). If get_running_pid() returns None, print a helpful error:

```
Gateway lock file appears stale (no running process holds it).
Run `hermes gateway run --replace` to force-start, or manually remove:
  <lock_path>
```
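A sketch of how the two cases might be told apart once the lock attempt has failed. The helper name `describe_lock_failure` is an assumption, and the caller is expected to pass in whatever `get_running_pid()` returned; the message strings are the ones quoted in this issue.

```python
from pathlib import Path
from typing import Optional


def describe_lock_failure(running_pid: Optional[int], lock_path: Path) -> str:
    """Hypothetical helper: choose the error text after lock acquisition has failed."""
    if running_pid is not None:
        # A live gateway was found, so the original message is accurate.
        return "Gateway runtime lock is already held by another instance. Exiting."
    # No live PID could be identified: report a likely stale lock instead.
    return (
        "Gateway lock file appears stale (no running process holds it).\n"
        "Run `hermes gateway run --replace` to force-start, or manually remove:\n"
        f"  {lock_path}"
    )
```

Keeping the message selection in a pure function makes both branches easy to cover in a regression test without starting a gateway.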

Related Code

| File | Lines | Description |
| --- | --- | --- |
| `gateway/run.py` | 15330-15350 | Startup lock acquisition + PID file race logic |
| `gateway/status.py` | 313-331 | `acquire_gateway_runtime_lock()`: only checks file lock |
| `gateway/status.py` | 348-368 | `is_gateway_runtime_lock_active()`: lock existence check |
| `gateway/status.py` | 802-852 | `get_running_pid()`: already has stale-PID cleanup |
| `tests/gateway/test_status.py` | 55-76 | Test: `test_get_running_pid_cleans_stale_record_from_dead_process` |
| `tests/gateway/test_status.py` | 421-466 | Tests for `acquire_scoped_lock` stale-lock recovery |

Environment

- OS: Windows 10/11 (also reproducible on Linux if lock mechanism doesn't auto-release)
- Hermes version: v0.5.25+
- Python: 3.11+
- Lock mechanism: `msvcrt.locking` on Windows, `fcntl.flock` on POSIX

Impact

This issue causes complete service unavailability after any ungraceful gateway shutdown (crash, kill -9, Windows force-kill, power loss). Users without knowledge of the internal lock file location cannot recover without manual intervention. It also breaks automated restart loops (systemd Restart=always, scheduled health-check restarts, etc.).

Workaround (for users hitting this now)

```shell
# Remove stale lock and PID files
rm ~/.hermes/gateway.lock ~/.hermes/gateway.pid

# Or on Windows:
del %USERPROFILE%\.hermes\gateway.lock %USERPROFILE%\.hermes\gateway.pid

# Then restart
hermes gateway run
```
