hermes - ✅(Solved) Fix [Bug]: macOS stale Telegram token lock can block gateway after PID reuse [2 pull requests, 2 comments, 3 participants]

GitHub stats
NousResearch/hermes-agent#16376
Fetched 2026-04-28 06:53:47
Comments: 2
Participants: 3
Timeline: 9
Reactions: 0
Timeline (top): labeled ×4, commented ×2, cross-referenced ×2, unsubscribed ×1

Error Message

Expected: the Telegram adapter should connect normally instead of failing with a token-in-use error.

Observed: ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 574). Stop the other gateway first.

Root Cause

Hermes treated the stale lock as active because os.kill(pid, 0) succeeded for the reused PID.

Fix / Workaround

I tested a local patch that:

  • adds a ps fallback for process command line detection on macOS
  • factors gateway command-line matching into a helper
  • treats a Hermes scoped lock as stale when the live PID belongs to a non-Hermes process and start time is unavailable
  • adds a regression test for PID reuse by an unrelated process

PR fix notes

PR #16423: fix(gateway): detect stale scoped locks via cmdline when start_time is absent

Description (problem / solution / changelog)

Problem

On macOS, _get_process_start_time() always returns None because /proc doesn't exist. When a scoped lock record also has start_time=None (legacy locks or locks written on macOS), the staleness check in acquire_scoped_lock() is inconclusive — it cannot determine whether the PID was reused by an unrelated process.

From #16376: The recorded PID in the Telegram scoped lock had been reused by an unrelated macOS application (UURemote.app). Since the PID was alive and the start_time check was inconclusive, Hermes refused to start the Telegram adapter.

Fix

After the start_time comparison, add a fallback that checks whether the live PID still looks like a Hermes gateway via _looks_like_gateway_process(). Two cases:

  1. Lock record has gateway metadata (kind == "gateway") but the live process is NOT a gateway → stale (PID reused)
  2. Lock record lacks metadata (legacy) AND the live process is NOT a gateway → stale (safe assumption — if it were a gateway, cmdline would match)

If the live process IS a gateway (cmdline matches), the lock is valid regardless of metadata presence — this handles legacy locks from older versions.
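Both cases above lead to the same outcome, so the decision can be sketched without branching on the lock's metadata. This is a minimal Python sketch, not the code from the PR: `looks_like_gateway_cmdline` and `lock_is_stale` are hypothetical names, the matching heuristic is an assumption, and the real `_looks_like_gateway_process()` may inspect the process differently. Treating an unreadable command line as inconclusive (keep the lock) is also an assumption made here for safety.

```python
from typing import Optional


def looks_like_gateway_cmdline(cmdline: Optional[str]) -> bool:
    """Heuristic (assumed): mentions "hermes" and has a "gateway" argument."""
    if not cmdline:
        return False
    return "hermes" in cmdline.lower() and "gateway" in cmdline.split()


def lock_is_stale(
    stored_start: Optional[float],
    live_start: Optional[float],
    live_cmdline: Optional[str],
) -> bool:
    """Decide whether a lock whose PID is still alive should be replaced."""
    # Primary check: compare start times when both sides are known.
    if stored_start is not None and live_start is not None:
        return stored_start != live_start
    # Fallback: start times are inconclusive, so inspect the live command line.
    if live_cmdline is None:
        return False  # assumption: cannot tell, so keep the lock
    return not looks_like_gateway_cmdline(live_cmdline)
```

With the values from #16376 (both start times `None`, live command line `/Applications/UURemote.app/Contents/MacOS/UURemote -startup`), `lock_is_stale` returns `True`, so the lock would be replaced.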

Fixes #16376

Changed files

  • gateway/status.py (modified, +14/-0)

PR #16432: fix(gateway): detect stale scoped lock when PID is reused on macOS (#16376)

Description (problem / solution / changelog)

What does this PR do?

Fixes #16376 — stale Telegram bot token lock blocks gateway startup on macOS when the recorded PID is reused by an unrelated process.

Problem

On macOS, the scoped lock system could not verify process identity because:

  1. _get_process_start_time() only reads /proc/<pid>/stat — unavailable on macOS → returns None
  2. _read_process_cmdline() only reads /proc/<pid>/cmdline — unavailable on macOS → returns None
  3. The stopped-process check (/proc/<pid>/status for State: T) also fails on macOS

This means:

  • Lock files written on macOS have start_time: null
  • When checking a lock, _get_process_start_time(existing_pid) returns None
  • The start_time comparison existing["start_time"] != current_start is skipped when both are None
  • If macOS reuses the PID for an unrelated app (e.g. UURemote.app), os.kill(pid, 0) succeeds
  • Gateway treats the lock as active and refuses to start

Fix (three parts)

  1. _get_process_start_time() — falls back to ps -o lstart= -p <pid> on macOS/BSD, parsing the human-readable timestamp to epoch seconds. Makes start_time available on all platforms.

  2. _read_process_cmdline() — falls back to ps -o command= -p <pid> on macOS/BSD. Makes command-line inspection available on all platforms.

  3. Stale lock detection for null start_time — when both the stored and live start_time are None (edge case where ps also fails), the code now checks the live process command line. If it does not contain "hermes", the lock is treated as stale. This prevents unrelated macOS processes from blocking the Telegram adapter.
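Part 1 of the fix depends on turning the `lstart` timestamp into epoch seconds. The following is a rough Python sketch of that step, not the PR's implementation: `parse_lstart` and `start_time_via_ps` are hypothetical names, and the format string assumes `ps` prints English day and month names as in the `Mon Apr 27 12:09:00 2026` sample from the issue.

```python
import subprocess
from datetime import datetime
from typing import Optional

# Matches e.g. "Mon Apr 27 12:09:00 2026" (assumes English names from ps).
LSTART_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_lstart(text: str) -> Optional[datetime]:
    """Parse one `ps -o lstart=` timestamp; return None if it does not match."""
    normalized = " ".join(text.split())  # collapse padding around single-digit days
    try:
        return datetime.strptime(normalized, LSTART_FORMAT)
    except ValueError:
        return None


def start_time_via_ps(pid: int) -> Optional[float]:
    """Return the start time of `pid` as epoch seconds, or None if unavailable."""
    try:
        result = subprocess.run(
            ["ps", "-o", "lstart=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # ps is missing or hung; caller falls back to cmdline checks
    started = parse_lstart(result.stdout)
    # A naive datetime is interpreted as local time, which is what ps prints.
    return started.timestamp() if started is not None else None
```

Returning `None` on any failure keeps the caller on the same code path as before, which is where the command-line fallback in part 3 takes over.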

Tests

All 35 existing tests pass without modification.

2 new tests added in TestScopedLocks:

  • test_acquire_scoped_lock_stale_when_pid_reused_null_start_time — simulates the exact macOS scenario from the issue: PID 574 reused by UURemote.app, both start times None, cmdline is not hermes → lock stale, acquisition succeeds
  • test_acquire_scoped_lock_keeps_lock_when_pid_reused_is_hermes — when the reused PID IS a hermes process (rare), the lock is correctly preserved

Labels: type/bug, comp/gateway, platform/telegram, P2

Changed files

  • gateway/status.py (modified, +62/-8)
  • tests/gateway/test_status.py (modified, +62/-0)


Bug Description

Hermes Gateway can incorrectly refuse to start the Telegram adapter with:

Telegram bot token already in use (PID 574). Stop the other gateway first.

In this case, the Telegram bot token was configured correctly and there was no second Hermes Gateway using it. The recorded PID in the Telegram scoped lock had been reused by an unrelated macOS application:

/Applications/UURemote.app/Contents/MacOS/UURemote -startup

The stale lock file was:

~/.local/state/hermes/gateway-locks/telegram-bot-token-<hash>.lock

It contained an old Hermes Gateway record with "pid": 574 and "start_time": null. Since PID 574 was alive, Hermes treated the lock as active and failed to connect Telegram.

Steps to Reproduce

  1. Run Hermes Gateway on macOS with Telegram enabled.
  2. Let Hermes create a scoped Telegram bot token lock under ~/.local/state/hermes/gateway-locks/.
  3. Leave behind a stale lock record whose pid points to a previous Hermes Gateway and whose start_time is null.
  4. Later, allow macOS to reuse that PID for an unrelated process.
  5. Restart Hermes Gateway.

Observed example:

ps -p 574 -o pid,ppid,user,stat,lstart,command
PID  PPID USER     STAT STARTED                      COMMAND
574     1 <user>   S    Mon Apr 27 12:09:00 2026     /Applications/UURemote.app/Contents/MacOS/UURemote -startup

Expected Behavior

Hermes should treat the scoped lock as stale when the recorded PID is alive but the live process is not actually a Hermes Gateway process.

The Telegram adapter should connect normally instead of failing with a token-in-use error.

Actual Behavior

Hermes treated the stale lock as active because os.kill(pid, 0) succeeded for the reused PID.

Gateway logs repeatedly showed:

ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 574). Stop the other gateway first.
WARNING gateway.run: ✗ telegram failed to connect

After manually fixing the stale-lock detection locally and restarting Gateway, the lock file was replaced with the current Gateway PID and gateway_state.json showed Telegram as connected.

Affected Component

Gateway (Telegram/Discord/Slack/WhatsApp)

Messaging Platform (if gateway-related)

Telegram

Debug Report

Not uploaded via `hermes debug share` because the relevant minimal diagnostics are included below and full local debug output may contain unrelated config/log data.

Local environment:


Hermes Agent v0.11.0 (2026.4.23)
Python: 3.11.15
OpenAI SDK: 2.32.0
OS: macOS 26.4.1 (Build 25E253)
Gateway managed by launchd


Current workaround/fix verification after local patch:


tests/gateway/test_status.py::TestScopedLocks
tests/gateway/test_telegram_conflict.py
14 passed

Operating System

macOS 26.4.1 (Build 25E253)

Python Version

3.11.15

Hermes Version

Hermes Agent v0.11.0 (2026.4.23)

Additional Logs / Traceback (optional)

Stale lock record:


{
  "pid": 574,
  "kind": "hermes-gateway",
  "argv": [
    "/Users/<user>/.hermes/hermes-agent/hermes_cli/main.py",
    "gateway",
    "run",
    "--replace"
  ],
  "start_time": null,
  "scope": "telegram-bot-token",
  "identity_hash": "<telegram-bot-token-hash>",
  "metadata": {
    "platform": "telegram"
  }
}


Live process occupying the reused PID:


PID  PPID USER     STAT STARTED                      COMMAND
574     1 <user>   S    Mon Apr 27 12:09:00 2026     /Applications/UURemote.app/Contents/MacOS/UURemote -startup

Root Cause Analysis (optional)

The issue appears to be in gateway.status.acquire_scoped_lock().

The scoped lock code checks whether the recorded PID is alive with os.kill(existing_pid, 0). It can detect stale records by comparing process start times when both the stored start_time and current process start time are available.

On macOS, _get_process_start_time() can return None, so scoped lock records can contain "start_time": null. In that case Hermes cannot distinguish the original Gateway process from an unrelated process that later reused the same PID.

When a stale lock record points to a PID now owned by another process, os.kill(pid, 0) still succeeds, so Hermes incorrectly returns the existing lock as active.
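To illustrate why the liveness probe alone is not enough, here is a small Python sketch of the signal-0 check. The name `pid_is_alive` is hypothetical; only `os.kill(pid, 0)` comes from the report.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, only the checks run."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no process currently has this PID
    except PermissionError:
        return True  # a process exists but is owned by another user
    return True
```

The result says only that some process holds the PID. It carries no information about which program that is, so a PID reused by an unrelated application looks the same as the original gateway.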

Proposed Fix (optional)

When a scoped lock record has kind == "hermes-gateway" but start time comparison is unavailable, fall back to checking the live process command line.

On platforms without /proc/<pid>/cmdline, use:

ps -p <pid> -o command=

If the command line does not look like Hermes Gateway, treat the scoped lock as stale and replace it.
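One way to read the command line on every platform is to try `/proc` first and fall back to `ps`. The sketch below is an illustration under assumptions, not the patch itself: `decode_proc_cmdline` and `read_cmdline` are hypothetical names, and the 5-second timeout is an arbitrary choice.

```python
import subprocess
from pathlib import Path
from typing import Optional


def decode_proc_cmdline(raw: bytes) -> str:
    """Turn the NUL-separated bytes of /proc/<pid>/cmdline into one string."""
    parts = [part for part in raw.split(b"\x00") if part]
    return " ".join(part.decode("utf-8", errors="replace") for part in parts)


def read_cmdline(pid: int) -> Optional[str]:
    """Return the command line of `pid`, or None when it cannot be read."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
        if raw:
            return decode_proc_cmdline(raw)
    except OSError:
        pass  # no /proc (macOS/BSD) or the process has exited
    try:
        result = subprocess.run(
            ["ps", "-p", str(pid), "-o", "command="],
            capture_output=True, text=True, timeout=5, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # ps is missing or did not finish in time
    return result.stdout.strip() or None
```

A `None` result means the command line is unknown, which a caller should treat differently from a command line that is known and does not match.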

I tested a local patch that:

  • adds a ps fallback for process command line detection on macOS
  • factors gateway command-line matching into a helper
  • treats a Hermes scoped lock as stale when the live PID belongs to a non-Hermes process and start time is unavailable
  • adds a regression test for PID reuse by an unrelated process

Relevant local test result:

14 passed

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

extent analysis

TL;DR

The issue can be fixed by modifying the scoped lock code to check the live process command line when the start time comparison is unavailable, treating the lock as stale if the process is not a Hermes Gateway.

Guidance

  • Modify the gateway.status.acquire_scoped_lock() function to fall back to checking the live process command line when the start time comparison is unavailable.
  • Use ps -p <pid> -o command= to get the command line of the process on platforms without /proc/<pid>/cmdline.
  • Treat a Hermes scoped lock as stale when the live PID belongs to a non-Hermes process and start time is unavailable.
  • Add a regression test for PID reuse by an unrelated process to ensure the fix works correctly.

Example

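import subprocess

def lock_pid_is_foreign(pid, start_time):
    """Helper for acquire_scoped_lock(): True when the lock looks stale."""
    if start_time is not None:
        return False  # start-time comparison is available; use that path
    # Fall back to checking the live process command line
    result = subprocess.run(['ps', '-p', str(pid), '-o', 'command='],
                            capture_output=True, text=True, check=False)
    command = result.stdout.strip()
    # Substring match: the interpreter path precedes hermes_cli/main.py,
    # so startswith() would never match a real gateway command line.
    # An empty result (ps failed) is inconclusive and keeps the lock.
    return bool(command) and 'hermes_cli/main.py' not in command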

Notes

The proposed fix assumes that the hermes_cli/main.py command line is unique to the Hermes Gateway process. If this is not the case, additional checks may be needed to ensure correct identification of the Hermes Gateway process.

Recommendation

Apply the workaround by modifying the gateway.status.acquire_scoped_lock() function to check the live process command line when the start time comparison is unavailable. This fix should resolve the issue without requiring an upgrade to a fixed version.
