hermes - ✅(Solved) Fix Stale gateway lock files never cleared on macOS — process-start-time check is Linux-only [1 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#16586 (fetched 2026-04-28 06:52:21)
Comments: 0 · Participants: 1 · Timeline events: 4 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×1

On macOS, acquire_scoped_lock() in gateway/status.py cannot detect stale lock files when a previous gateway crashed and the PID was recycled to an unrelated OS process. The lock then persists indefinitely, blocking the platform forever (Slack, Feishu, Discord, etc.) until the user manually rms the lock file.

Error Message

The gateway reports: "Slack app token already in use (PID 560). Stop the other gateway first." The PID it names belongs to an unrelated process that inherited the recycled PID, so the message sends users hunting for a phantom gateway. The platform (Slack, Feishu, Discord, etc.) stays broken until they find the lock file path and rm it.

Root Cause

_get_process_start_time() in gateway/status.py reads /proc/<pid>/stat, which exists only on Linux. On macOS it returns None, so the start-time comparison that acquire_scoped_lock() uses to detect a recycled PID is skipped. Because the recycled PID is alive, the PID-liveness check alone concludes that another gateway still holds the lock, and the stale lock file is never cleared.

Fix Action

Fixed

PR fix notes

PR #16655: fix(gateway): detect stale lock files on macOS via ps -o lstart=

Description (problem / solution / changelog)

Summary

  • _get_process_start_time() was Linux-only (/proc/<pid>/stat), so on macOS PID-recycle staleness detection was silently disabled across acquire_scoped_lock, get_running_pid, the takeover marker, and release_all_scoped_locks.
  • Add a ps -o lstart= fallback for sys.platform == "darwin", parsed under forced C locale into a Unix-epoch int.
  • Regression coverage in tests/gateway/test_status.py::TestGetProcessStartTime (6 tests; 3 fail without the production change).

The bug

When a gateway crashes hard on macOS (kill -9, OOM, panic), the scoped lock file at ~/.local/state/hermes/gateway-locks/<scope>-<hash>.lock keeps the dead gateway's PID. macOS later recycles that PID to an unrelated OS process (FamilyControlsAgent in the reporter's case for #16586). The next gateway start runs acquire_scoped_lock:

  1. os.kill(existing_pid, 0) succeeds — the recycled PID is alive, just not Hermes.
  2. The strong stale check needs a start_time mismatch. _get_process_start_time(existing_pid) reads /proc/<pid>/stat, which doesn't exist on macOS, so it returns None. The current_start is not None guard short-circuits and stale stays False.
  3. The secondary "stopped process" check at gateway/status.py:520-528 also reads /proc/<pid>/status and silently no-ops on macOS.
  4. acquire_scoped_lock returns (False, existing) → the platform reports slack-app-token_lock / feishu_app_lock / etc. and never recovers without a manual rm of the lock file.
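
The steps above can be modelled with a small decision function. This is a simplified sketch, not the code in gateway/status.py: the name looks_stale and its parameters are hypothetical, and the "dead PID means stale" branch is an assumption about the surrounding logic. It only illustrates why a None start time silently disables the strong check.

```python
from typing import Optional


def looks_stale(pid_alive: bool,
                recorded_start: Optional[int],
                current_start: Optional[int]) -> bool:
    """Simplified model of the stale-lock decision (not the real Hermes code)."""
    if not pid_alive:
        # Assumption: a dead PID means the lock can be reclaimed.
        return True
    if recorded_start is not None and current_start is not None:
        # Strong check: same PID, different start time, so the PID was recycled.
        return recorded_start != current_start
    # Start time unknown (the macOS case before the fix): the check is skipped.
    return False


# Linux: /proc yields a start time, so the recycled PID is detected.
print(looks_stale(True, recorded_start=1000, current_start=2000))  # True
# macOS before the fix: the helper returned None, so the lock looked valid.
print(looks_stale(True, recorded_start=1000, current_start=None))  # False
```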

The same Linux-only assumption affects get_running_pid (line 784), write_takeover_marker (line 656), consume_takeover_marker_for_self (line 716), and release_all_scoped_locks — every code path that wants to defend against PID reuse on macOS.

The fix

Extend _get_process_start_time() so that when the /proc read fails and we're on darwin, it falls back to ps -o lstart= -p <pid>. lstart is the absolute wall-clock start timestamp; recycling a PID to a new process changes it. The helper parses the timestamp under a forced LC_ALL=C / LANG=C environment so %a / %b are deterministic, then converts to a Unix-epoch int.
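
A minimal sketch of what such a fallback might look like follows. It is not the code from PR #16655: the function names parse_lstart and start_time_via_ps, the 5-second timeout, and the assumed output shape ("Thu Apr 23 09:15:02 2026") are illustrative assumptions.

```python
import os
import subprocess
from datetime import datetime
from typing import Optional


def parse_lstart(text: str) -> Optional[int]:
    """Convert lstart text such as 'Thu Apr 23 09:15:02 2026' to an epoch int."""
    # Collapse runs of whitespace, since ps pads single-digit days with a space.
    normalized = " ".join(text.split())
    try:
        parsed = datetime.strptime(normalized, "%a %b %d %H:%M:%S %Y")
    except ValueError:
        return None  # empty or garbled output
    # The naive datetime is interpreted as local time, matching what ps printed.
    return int(parsed.timestamp())


def start_time_via_ps(pid: int) -> Optional[int]:
    """Hypothetical helper: ask ps for a PID's start time, or None if unknown."""
    # Force the C locale so weekday and month names are always English.
    env = dict(os.environ, LC_ALL="C", LANG="C")
    try:
        result = subprocess.run(
            ["ps", "-o", "lstart=", "-p", str(pid)],
            capture_output=True, text=True, env=env, timeout=5, check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ps missing from PATH, or it timed out
    if result.returncode != 0:
        return None  # no such PID
    return parse_lstart(result.stdout)
```

A caller would compare start_time_via_ps(existing_pid) with the value stored in the lock file and treat any mismatch as a recycled PID, while a None result still means "unknown" and leaves the existing guard in place.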

Why it's safe to mix units across platforms:

  • Lock and PID-file metadata are written and read on the same machine, so the value never crosses an OS boundary.
  • Every consumer does equality comparison only — the contract is "the same live process produces the same int; a recycled PID produces a different int". Linux clock-ticks-since-boot and macOS lstart-as-epoch both satisfy that.

Diff: 1 production file (~50 lines, including docstrings), 1 test file. Linux fast path is untouched; Windows still returns None exactly as before.

Test plan

  • Focused: tests/gateway/test_status.py::TestGetProcessStartTime — 6 tests covering happy-path parsing, ps non-zero return, garbled output, FileNotFoundError (ps not on PATH), distinct-int contract, and end-to-end acquire_scoped_lock PID-recycle reclaim.
  • Adjacent: full tests/gateway/test_status.py (40 tests) and tests/gateway/test_runner_startup_failures.py — 48 passed locally.
  • Regression guard: reverted gateway/status.py, reran the new test class — 3 of 6 fail (the ones exercising the Darwin path); restored fix and all 6 pass.

Contract Protected

_get_process_start_time(pid) returns a value v such that, while a process holds a given PID, v stays equal across calls; once the kernel hands that PID to an unrelated process, v changes. The helper is allowed to return None when the value can't be determined (existing contract; every caller guards is not None). The fix extends this contract from "Linux only" to "Linux + macOS".

Known-bad inputs covered by tests:

  • ps returns non-zero (PID gone) → None.
  • ps returns empty / non-lstart-shaped text → None.
  • ps executable missing → None (caught as OSError).
  • Two different lstart values → two different ints (PID-recycle detection).

Related

  • Closes #16586.
  • Adjacent merged history in this area: #8995 (stale lock recovery on Linux), #11909 (legacy hermes.service detection), #14504 (Windows OSError handling). All assumed Linux/Windows; this PR fills the macOS gap.

Changed files

  • gateway/status.py (modified, +48/-2)
  • tests/gateway/test_status.py (modified, +122/-0)

Summary

On macOS, acquire_scoped_lock() in gateway/status.py cannot detect stale lock files when a previous gateway crashed and the PID was recycled to an unrelated OS process. The lock then persists indefinitely, blocking the platform forever (Slack, Feishu, Discord, etc.) until the user manually rms the lock file.

Repro

  1. macOS host running ai.hermes.gateway via launchd
  2. Gateway crashes ungracefully (kill -9, OOM, panic)
  3. The orphaned lock file at ~/.local/state/hermes/gateway-locks/<scope>-<hash>.lock retains the dead gateway's PID
  4. macOS recycles that PID to a different process (e.g. FamilyControlsAgent)
  5. New gateway starts, calls acquire_scoped_lock() → existing record found
  6. Stale check at gateway/status.py:500-529:
    • os.kill(existing_pid, 0) succeeds (the recycled PID is alive, just unrelated)
    • start_time mismatch is checked next, but _get_process_start_time(pid) reads /proc/<pid>/stat (Linux-only) → returns None on macOS → comparison short-circuits
    • Stopped-process check at line 519-528 also reads /proc/<pid>/status → no-op on macOS
  7. stale = False → return False, existing → platform marks as slack-app-token_lock / feishu_app_lock / etc., never recovers

In my case: the Slack lock file's pid: 560 was written 2026-04-23. Days later, PID 560 had been recycled to FamilyControlsAgent. Hermes saw the alive PID and concluded another gateway held the token, even after dozens of restarts. Manually rming the lock file was the only fix.

Affected code

gateway/status.py:

  • _get_process_start_time() (lines 106-113) — only works on Linux
  • acquire_scoped_lock() stale-detection block (lines 500-529) — relies entirely on /proc/* files for the strong check; PID-existence-only is insufficient on macOS

Suggested fixes (any one, or layered)

  1. Use psutil (already a transitive dep via discord.py/aiohttp ecosystems? if not, add it). psutil.Process(pid).create_time() works cross-platform and gives epoch seconds. Compare against the lock file's stored start_time.

  2. Use proc_pidinfo via ctypes if avoiding psutil. macOS exposes proc_pidinfo(pid, PROC_PIDTBSDINFO, ...) returning pbi_start_tvsec — equivalent of Linux's start_time field.

  3. Add a max-age fallback: if the lock file's updated_at is older than e.g. 24h, treat as stale regardless of PID liveness. Cheap, no syscalls, defensive.

  4. Compare cmdline: read /proc/<pid>/cmdline on Linux or ps -p <pid> -o command= on macOS — if it doesn't contain hermes / gateway / etc., treat as stale.

(3) alone would unblock most users. (1) is the right architectural fix.
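
Suggestion (3) might be sketched as below. The function name, the 24-hour constant, and the assumption that updated_at is stored as Unix epoch seconds are all hypothetical; the issue does not specify the lock file's timestamp format.

```python
import time
from typing import Optional

# Hypothetical threshold taken from the "e.g. 24h" in the suggestion above.
MAX_LOCK_AGE_SECONDS = 24 * 60 * 60


def lock_is_too_old(updated_at: float,
                    now: Optional[float] = None,
                    max_age: float = MAX_LOCK_AGE_SECONDS) -> bool:
    """Return True when the lock's last update is older than max_age seconds."""
    if now is None:
        now = time.time()
    return (now - updated_at) > max_age


# A lock last touched 25 hours ago is treated as stale; one touched
# an hour ago is not.
now = 1_000_000.0
print(lock_is_too_old(now - 25 * 3600, now=now))  # True
print(lock_is_too_old(now - 1 * 3600, now=now))   # False
```

One design caveat: this is only safe if a healthy long-running gateway refreshes updated_at periodically, otherwise a live lock older than the threshold would be reclaimed by mistake.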

Operational impact today

Without this fix, every macOS user who experiences a single ungraceful shutdown of hermes is one PID-recycle away from a permanently broken Slack/Feishu/Discord platform until they figure out the lock file path and rm it. The error message blames a PID that has nothing to do with the actual issue ("Slack app token already in use (PID 560). Stop the other gateway first.") which sends users hunting for a phantom process.

Environment

  • macOS 26.4.1 (Darwin 25.4.0, arm64)
  • hermes-agent commit 283c8fd6 (Apr 2026)
  • Python 3.11
  • launchd-managed via ~/Library/LaunchAgents/ai.hermes.gateway.plist

Related

The same root-cause produces:

  • slack-app-token_lock errors blaming dead/recycled PIDs
  • feishu_app_lock errors with the same pattern
  • General "platform fatal" states that don't recover across restarts

extent analysis

TL;DR

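The merged fix (PR #16655) adds a ps -o lstart= fallback to _get_process_start_time() on macOS. Using psutil for cross-platform process start time checking in acquire_scoped_lock() was an alternative suggested in the issue.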

Guidance

  • If adding a dependency is acceptable, use psutil.Process(pid).create_time() to get the process creation time in a cross-platform way, replacing the Linux-specific /proc/* files approach.
  • Consider adding a max-age fallback to treat locks older than a certain threshold (e.g., 24 hours) as stale, regardless of PID liveness.
  • Review the suggested fixes and choose the most suitable one, considering factors like dependencies and maintainability.
  • Verify the fix by testing the acquire_scoped_lock() function with different scenarios, including PID recycling and lock file persistence.

Example

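import psutil

def _get_process_start_time(pid):
    """Return the process start time as whole epoch seconds, or None if unknown."""
    try:
        # create_time() returns float seconds since the epoch on every platform.
        return int(psutil.Process(pid).create_time())
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        # PID is gone or cannot be inspected: report "unknown" to the caller.
        return None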

Notes

The provided suggestions assume that psutil is available or can be added as a dependency. If psutil is not an option, alternative approaches like using ctypes or comparing command lines may be explored.

Recommendation

The psutil-based approach is a straightforward cross-platform option, but the issue only speculates that psutil is already a transitive dependency, so confirm that before relying on it. The merged PR #16655 avoids a new dependency by using ps -o lstart= on macOS and leaves the Linux /proc path untouched.
