hermes: Stale platform-lock detection broken on macOS: PID reuse causes permanent startup failure

Summary

Stale platform-lock detection in gateway/status.py is broken on macOS (and any non-Linux platform). When the PID recorded in a lock file gets reused by an unrelated process after a crash/forced exit, the gateway permanently refuses to start, even with --replace.

Reproduction

  1. Run hermes gateway run --replace on macOS.
  2. Crash or hard-kill the process (e.g. SIGKILL, machine reboot mid-run) so it doesn't release the platform locks under ~/.local/state/hermes/gateway-locks/.
  3. Wait until the OS reuses that PID for an unrelated process (on macOS this happens quickly — in my case PID 889 was eventually held by TipsWidgetExtension).
  4. Try to start the gateway again. It exits cleanly with:
ERROR gateway.platforms.base: [Slack] Slack app token already in use (PID 889). Stop the other gateway first.
ERROR gateway.run: Gateway hit a non-retryable startup conflict: slack: Slack app token already in use (PID 889). Stop the other gateway first.
ERROR gateway.run: Gateway exiting cleanly: slack: Slack app token already in use (PID 889). Stop the other gateway first.

Even with --replace, the gateway never recovers. Under a supervisor that restarts it on exit (launchd with KeepAlive=true, or the systemd equivalent), this becomes an infinite restart-and-fail loop.

Root cause

_get_process_start_time(pid) in gateway/status.py (line 106) only reads /proc/<pid>/stat:

def _get_process_start_time(pid: int) -> Optional[int]:
    """Return the kernel start time for a process when available."""
    stat_path = Path(f"/proc/{pid}/stat")
    try:
        # Field 22 in /proc/<pid>/stat is process start time (clock ticks).
        return int(stat_path.read_text().split()[21])
    except (FileNotFoundError, IndexError, PermissionError, ValueError, OSError):
        return None

/proc is Linux-only. On macOS the file doesn't exist, so the function always returns None. Lock files are therefore written with "start_time": null, e.g. observed in ~/.local/state/hermes/gateway-locks/slack-app-token-<hash>.lock:

{"pid": 889, "kind": "hermes-gateway", ..., "start_time": null, ...}

The stale-lock check around lines 506-513 then never triggers:

current_start = _get_process_start_time(existing_pid)
if (
    existing.get("start_time") is not None     # ← always false on macOS
    and current_start is not None              # ← always false on macOS
    and current_start != existing.get("start_time")
):
    stale = True

Both existing.get("start_time") and current_start are None on macOS, so the start-time discriminator never fires. As long as any process holds that PID — even a totally unrelated one — the lock is treated as live, and --replace doesn't help.
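To make the failure mode concrete, here is a minimal sketch that isolates the start-time condition quoted above into a standalone function. The name start_time_says_stale is hypothetical and exists only for illustration; it is not part of gateway/status.py.

```python
from typing import Optional


def start_time_says_stale(recorded: Optional[int], current: Optional[int]) -> bool:
    """Mirror of the start-time condition from the stale-lock check (hypothetical name)."""
    return recorded is not None and current is not None and current != recorded


# Linux: both values exist, so a reused PID is detected.
print(start_time_says_stale(123456, 987654))  # True
# macOS today: both values are None, so the check can never fire.
print(start_time_says_stale(None, None))      # False
```

With both inputs None the function returns False regardless of which process now holds the PID, which matches the behaviour described in this issue.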

Suggested fix

Make _get_process_start_time cross-platform. psutil is the cleanest option:

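def _get_process_start_time(pid: int) -> Optional[int]:
    """Return the process start time in microseconds since the epoch, if available."""
    try:
        import psutil  # lazy import: a missing psutil degrades to None
        return int(psutil.Process(pid).create_time() * 1_000_000)  # microseconds
    except Exception:  # NoSuchProcess, AccessDenied, ImportError, ...
        return None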

If avoiding a psutil dependency, on macOS one can shell out to ps -p <pid> -o lstart= or use libproc via ctypes. The Linux /proc/<pid>/stat path can be kept as a fast path. Whatever the implementation, the goal is: on every supported platform, the function must return a non-None value when the PID exists, so that PID reuse can be detected.
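The ps-based alternative could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name start_time_via_ps is hypothetical, the output of ps -o lstart= is assumed to have the shape "Mon Jan  6 10:15:30 2025" when LC_ALL=C is set, and the result is in whole seconds rather than the clock ticks that /proc/<pid>/stat reports.

```python
import os
import subprocess
from datetime import datetime
from typing import Optional


def start_time_via_ps(pid: int) -> Optional[int]:
    """Return the process start time as epoch seconds, or None if unavailable.

    Hypothetical helper, not part of hermes. It parses `ps -p <pid> -o lstart=`.
    """
    env = dict(os.environ, LC_ALL="C")  # force English day and month names
    try:
        out = subprocess.run(
            ["ps", "-p", str(pid), "-o", "lstart="],
            capture_output=True, text=True, env=env, timeout=5, check=True,
        ).stdout.strip()
        # Assumed output shape: "Mon Jan  6 10:15:30 2025" (local time)
        return int(datetime.strptime(out, "%a %b %d %H:%M:%S %Y").timestamp())
    except (OSError, subprocess.SubprocessError, ValueError):
        # ps missing, PID not found, timeout, or unexpected output format
        return None
```

Because the units differ between sources (ticks from /proc, seconds here, microseconds with psutil), a value is only meaningful when compared against one produced by the same method. Lock files written before such a change would need to be treated accordingly.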

A safety-net second fix: when start_time cannot be obtained, also compare process command line / argv against the lock's recorded argv (already stored in the lock file) before declaring a lock live.
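One possible shape for that safety net is sketched below. The function name lock_looks_stale and the "argv" key are assumptions for illustration: the issue states that argv is stored in the lock file but does not show the exact field name. The caller is expected to have already confirmed that some process holds the PID and to supply that process's start time and argv, if obtainable.

```python
from typing import Any, Mapping, Optional, Sequence


def lock_looks_stale(
    lock: Mapping[str, Any],
    current_start: Optional[int],
    current_argv: Optional[Sequence[str]],
) -> bool:
    """Decide staleness for a lock whose PID is currently alive (hypothetical helper)."""
    recorded_start = lock.get("start_time")
    if recorded_start is not None and current_start is not None:
        # Preferred discriminator: start times differ, so the PID was reused.
        return current_start != recorded_start
    recorded_argv = lock.get("argv")  # assumed key name
    if recorded_argv and current_argv is not None:
        # Fallback: a different command line means a different process.
        return list(current_argv) != list(recorded_argv)
    return False  # nothing to compare, keep the current conservative behaviour


lock = {"pid": 889, "start_time": None, "argv": ["hermes", "gateway", "run"]}
print(lock_looks_stale(lock, None, ["TipsWidgetExtension"]))  # True
```

An exact argv match may be stricter than necessary in practice, so a real implementation might compare only the executable or a normalized form.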

Workaround

Manually delete the stale lock:

rm ~/.local/state/hermes/gateway-locks/<scope>-<hash>.lock
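Before deleting anything, it can help to see which lock files exist and which PIDs they point at. The following is a small POSIX-only sketch, not a hermes tool: pid_alive and scan_locks are hypothetical names, and only the "pid" field shown in the lock file excerpt is relied upon.

```python
import json
import os
from pathlib import Path
from typing import Iterator, Tuple


def pid_alive(pid: int) -> bool:
    """True if some process currently holds this PID (not necessarily hermes)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions (POSIX)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def scan_locks(lock_dir: Path) -> Iterator[Tuple[Path, int, bool]]:
    """Yield (path, pid, pid_alive) for each readable lock file."""
    for path in sorted(lock_dir.glob("*.lock")):
        try:
            pid = int(json.loads(path.read_text())["pid"])
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or malformed lock file, skip it
        yield path, pid, pid_alive(pid)


LOCK_DIR = Path.home() / ".local/state/hermes/gateway-locks"
if LOCK_DIR.is_dir():
    for path, pid, alive in scan_locks(LOCK_DIR):
        print(f"{path.name}: pid={pid} alive={alive}")
```

Note that alive=True does not mean the lock is valid, which is exactly the bug described here. Check what actually holds the PID (for example with ps -p <pid>) before deciding whether to remove the file.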

Environment

  • macOS Darwin 25.4.0 (arm64)
  • Python in ~/.hermes/hermes-agent/venv
  • launchd LaunchAgent, KeepAlive=true
  • Affected scopes (any platform that calls _acquire_platform_lock): slack, discord, telegram, signal, whatsapp, qqbot
