hermes - ✅(Solved) Fix gateway: scoped lock PID-reuse guard is a no-op on macOS/Windows — stale lockfiles permanently block startup [1 pull requests, 1 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#18778 (fetched 2026-05-03 04:54:22)
Comments: 1, Participants: 2, Timeline events: 6, Reactions: 0
Timeline (top): labeled ×3, commented ×1, cross-referenced ×1, referenced ×1

On macOS (and Windows), gateway/status.py::acquire_scoped_lock() can refuse to start the gateway forever after an unclean shutdown, because its PID-reuse guard silently degrades to a bare os.kill(pid, 0) check. Once that PID gets recycled by any other process, the lock is treated as "still held" until a human deletes the file.

Symptom in the wild (macOS 26, launchd-managed gateway):

ERROR gateway.platforms.base: [Telegram] Telegram bot token already in use (PID 450). Stop the other gateway first.
ERROR gateway.run: Gateway hit a non-retryable startup conflict: telegram: Telegram bot token already in use (PID 450). Stop the other gateway first.

…where PID 450 was actually /usr/libexec/intelligentroutingd, recycled long after the original gateway had died. The lock file at ~/.local/state/hermes/gateway-locks/telegram-bot-token-<hash>.lock looked like:

{"pid": 450, "kind": "hermes-gateway", "start_time": null, "scope": "telegram-bot-token", ...}

KeepAlive then loops forever: launchd restarts the gateway, the gateway sees the "live" lock and exits with the non-retryable error, and the cycle repeats. The Telegram bot stays unreachable until someone manually removes the lockfile.


Root Cause

In gateway/status.py:

  1. _get_process_start_time(pid) only reads /proc/<pid>/stat (Linux-only). On macOS/Windows it always returns None.
  2. Because of (1), every lockfile written on macOS has "start_time": null.
  3. The PID-reuse guard inside acquire_scoped_lock() requires both the stored start_time AND the live one to be non-null:
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        stale = True
    On macOS both are None, so the guard is silently skipped.
  4. The fallback "is the process stopped (Ctrl+Z)?" check also reads /proc/<pid>/status — Linux-only, so it's a no-op on macOS too.
  5. Net effect: as soon as the recorded PID is reused by anything alive, os.kill(pid, 0) succeeds and the lock is treated as held — permanently.
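To make the failure mode in steps 1 to 5 concrete, here is a minimal sketch of the decision as described above. It is not the code in gateway/status.py: the names pid_alive and lock_is_stale are hypothetical, and the sketch assumes a POSIX system, where signal 0 is a pure existence probe.

```python
import os


def pid_alive(pid):
    """Bare liveness probe (POSIX): signal 0 only checks that some process owns the PID."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The PID exists but belongs to another user, so it is still alive.
        return True
    return True


def lock_is_stale(existing, current_start, alive):
    """Hypothetical condensation of the staleness decision described above."""
    if not alive:
        return True
    # The PID-reuse guard quoted in step 3: needs BOTH start times to be known.
    if (
        existing.get("start_time") is not None
        and current_start is not None
        and current_start != existing.get("start_time")
    ):
        return True
    return False


# Linux-style record: differing start times expose the recycled PID.
linux_lock = {"pid": 450, "start_time": 1000}
# macOS-style record: start_time was never populated.
macos_lock = {"pid": 450, "start_time": None}

print(lock_is_stale(linux_lock, current_start=2000, alive=True))  # True
print(lock_is_stale(macos_lock, current_start=None, alive=True))  # False
```

With the macOS-style record the guard never fires, so the only remaining signal is whether anything at all is running under the recorded PID.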

For comparison, the runtime lock path uses _looks_like_gateway_process(pid) (which reads cmdline patterns) as a defense — but _read_process_cmdline() is /proc/<pid>/cmdline-only, also Linux-only. The scoped lock path doesn't even call that helper.

What made the lock stale in the first place was an unclean shutdown: on the way down the gateway logged "Gateway drain timed out after 180.0s with 1 active agent(s); interrupting remaining work", so release_scoped_lock() never ran (likely a launchd SIGKILL after its grace window). That's the trigger; the bug above is what makes it stick forever.

Fix Action

Workaround

Until the fix lands, wrap the gateway launch with a prestart hook that scans $HERMES_GATEWAY_LOCK_DIR (or ~/.local/state/hermes/gateway-locks/) and removes any *.lock whose recorded PID is either dead or alive-but-not-a-gateway. On a launchd-managed setup, point the LaunchAgent's ProgramArguments at a small wrapper script that runs the cleanup, then execs the real gateway.
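The cleanup pass described above could look roughly like the sketch below. This is an illustration under assumptions, not a script from the project: clean_stale_locks, default_lock_dir and the looks_like_gateway callback are hypothetical names, the lock file is assumed to be a JSON object with a "pid" field (as in the sample shown earlier), and the liveness probe assumes POSIX semantics for os.kill(pid, 0).

```python
import json
import os
from pathlib import Path


def pid_alive(pid):
    """POSIX liveness probe; says nothing about WHICH process owns the PID."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def default_lock_dir():
    """Lock directory: $HERMES_GATEWAY_LOCK_DIR, else the default path from the report."""
    override = os.environ.get("HERMES_GATEWAY_LOCK_DIR")
    return Path(override) if override else Path.home() / ".local/state/hermes/gateway-locks"


def clean_stale_locks(lock_dir, looks_like_gateway):
    """Remove *.lock files whose recorded PID is dead or is not a gateway.

    looks_like_gateway is a caller-supplied callable (pid -> bool).
    Returns the list of removed paths.
    """
    removed = []
    for path in sorted(Path(lock_dir).glob("*.lock")):
        try:
            pid = int(json.loads(path.read_text())["pid"])
        except (OSError, ValueError, KeyError, TypeError):
            continue  # unreadable or malformed: leave it for a human to inspect
        if pid <= 0:
            continue  # not a real PID; never probe process groups
        if not pid_alive(pid) or not looks_like_gateway(pid):
            path.unlink(missing_ok=True)
            removed.append(path)
    return removed
```

A wrapper script would call clean_stale_locks(default_lock_dir(), some_identity_check) and then exec the real gateway. The identity check is left to the caller because a portable command-line reader is exactly what the report says is missing on macOS and Windows.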

(Reporting this from a macOS deployment where the bot was silently down for hours before the stale-lock root cause was identified.)

PR fix notes

PR #17246: fix: resolve 7 identified issues [automated]

Description (problem / solution / changelog)

Summary

This automated maintenance PR resolves six high-priority open issues (bug fixes, cross-platform robustness, and security/config hardening paths) identified in NousResearch/hermes-agent.

Note: The job target was 7 issues. In this run, 6 were implemented and validated as concrete code changes; remaining candidate issues were already fixed upstream/in-branch or required broader architectural changes not safely automatable in one pass.

Issues resolved

  1. #18757 - resolve_api_key_provider_credentials() misses ~/.hermes/.env for base_url_env_var

    • Replaced os.getenv(...) with get_env_value(...) in API-key provider credential resolution.
    • Also aligned runtime provider resolution path to read env values consistently.
  2. #18705 - load_hermes_dotenv() overrides runtime env vars (override=True)

    • Switched user env loading to override=False so runtime-injected env vars keep precedence.
    • Updated function docstring behavior notes accordingly.
  3. #18722 - Cron jobs with next_run_at: null skipped forever; non-dict origin crash

    • Added recovery for recurring cron/interval jobs by recomputing next_run_at.
    • Hardened _resolve_origin() to tolerate non-dict origin payloads.
  4. #18742 - Kimi/Moonshot via aggregators misses reasoning-mode detection

    • _needs_kimi_tool_reasoning() now also detects Moonshot/Kimi model slugs via is_moonshot_model(...).
  5. #18744 - constraints_path dead config (not loaded)

    • Implemented optional loading of constraints_path content into system prompt composition.
  6. #18778 - Gateway scoped lock stale detection no-op on macOS/Windows

    • Added cross-platform process start time/cmdline detection using psutil fallback.
    • Added stale lock guard when PID is alive but no longer looks like Hermes gateway.

Files modified

  • hermes_cli/auth.py
  • hermes_cli/runtime_provider.py
  • hermes_cli/env_loader.py
  • cron/jobs.py
  • cron/scheduler.py
  • run_agent.py
  • gateway/status.py

Commit list

  • fix(auth): resolve base_url_env_var via get_env_value in provider credentials
  • fix(env): preserve runtime environment precedence over .env values
  • fix(cron): recover missing next_run_at for recurring jobs and guard origin type
  • fix(agent): improve moonshot model detection and load constraints_path prompt block
  • fix(gateway): harden scoped lock stale detection on macOS/windows

Changed files

  • Dockerfile (modified, +3/-2)
  • acp_adapter/session.py (modified, +12/-0)
  • agent/auxiliary_client.py (modified, +280/-28)
  • agent/context_compressor.py (modified, +496/-52)
  • agent/title_generator.py (modified, +2/-2)
  • agent/transports/chat_completions.py (modified, +14/-0)
  • agent/usage_pricing.py (modified, +4/-0)
  • cli-config.yaml.example (modified, +5/-0)
  • cli.py (modified, +27/-3)
  • cron/jobs.py (modified, +10/-2)
  • cron/scheduler.py (modified, +14/-4)
  • docker/entrypoint.sh (modified, +9/-1)
  • gateway/channel_directory.py (modified, +14/-4)
  • gateway/platforms/discord.py (modified, +33/-7)
  • gateway/platforms/email.py (modified, +12/-2)
  • gateway/platforms/feishu.py (modified, +34/-1)
  • gateway/platforms/qqbot/adapter.py (modified, +8/-2)
  • gateway/platforms/telegram_network.py (modified, +7/-2)
  • gateway/platforms/weixin.py (modified, +10/-1)
  • gateway/run.py (modified, +129/-32)
  • gateway/status.py (modified, +37/-2)
  • hermes_cli/auth.py (modified, +4/-4)
  • hermes_cli/commands.py (modified, +1/-1)
  • hermes_cli/config.py (modified, +271/-40)
  • hermes_cli/copilot_auth.py (modified, +1/-1)
  • hermes_cli/doctor.py (modified, +6/-1)
  • hermes_cli/env_loader.py (modified, +5/-4)
  • hermes_cli/gateway.py (modified, +16/-13)
  • hermes_cli/main.py (modified, +69/-3)
  • hermes_cli/memory_setup.py (modified, +1/-1)
  • hermes_cli/model_switch.py (modified, +6/-1)
  • hermes_cli/models.py (modified, +60/-2)
  • hermes_cli/profiles.py (modified, +16/-3)
  • hermes_cli/runtime_provider.py (modified, +17/-14)
  • hermes_cli/setup.py (modified, +8/-2)
  • hermes_cli/slack_cli.py (modified, +1/-2)
  • hermes_cli/status.py (modified, +17/-2)
  • hermes_cli/web_server.py (modified, +1/-1)
  • hermes_constants.py (modified, +16/-3)
  • model_tools.py (modified, +44/-13)
  • run_agent.py (modified, +413/-82)
  • setup-hermes.sh (modified, +23/-12)
  • skills/red-teaming/godmode/scripts/load_godmode.py (modified, +9/-8)
  • tests/agent/test_context_compressor.py (modified, +389/-0)
  • tests/agent/transports/test_chat_completions.py (modified, +11/-0)
  • tests/gateway/test_compress_command.py (modified, +49/-0)
  • tests/hermes_cli/test_api_key_providers.py (modified, +5/-5)
  • tests/hermes_cli/test_config.py (modified, +17/-0)
  • tests/run_agent/test_413_compression.py (modified, +81/-1)
  • tests/run_agent/test_compression_boundary_hook.py (modified, +42/-0)
  • tests/run_agent/test_run_agent.py (modified, +100/-13)
  • tests/tools/test_skill_manager_tool.py (modified, +270/-0)
  • tools/approval.py (modified, +1/-1)
  • tools/delegate_tool.py (modified, +4/-1)
  • tools/environments/docker.py (modified, +36/-5)
  • tools/environments/local.py (modified, +8/-1)
  • tools/file_operations.py (modified, +70/-67)
  • tools/file_tools.py (modified, +13/-2)
  • tools/send_message_tool.py (modified, +72/-2)
  • tools/session_search_tool.py (modified, +2/-2)
  • tools/skill_manager_tool.py (modified, +82/-21)
  • tools/skills_tool.py (modified, +13/-1)
  • tools/terminal_tool.py (modified, +6/-0)
  • tools/tool_backend_helpers.py (modified, +15/-5)
  • tools/tts_tool.py (modified, +27/-16)
  • tools/voice_mode.py (modified, +23/-10)
  • toolsets.py (modified, +14/-1)
  • tui_gateway/server.py (modified, +5/-3)
  • ui-tui/src/app/turnController.ts (modified, +1/-1)
  • ui-tui/src/app/useInputHandlers.ts (modified, +8/-3)
  • ui-tui/src/app/useSessionLifecycle.ts (modified, +1/-1)
  • ui-tui/src/gatewayTypes.ts (modified, +1/-0)
  • utils.py (modified, +9/-0)
  • uv.lock (modified, +161/-2)
  • website/docs/reference/environment-variables.md (modified, +1/-1)



Reproduction

On macOS:

# 1. Start the gateway, get its PID
launchctl print gui/$(id -u)/ai.hermes.gateway | grep pid

# 2. SIGKILL it so locks aren't released
kill -9 <pid>

# 3. Inspect the leftover lock — note start_time: null
cat ~/.local/state/hermes/gateway-locks/telegram-bot-token-*.lock

# 4. Wait for the macOS PID space to wrap (or just spawn enough processes to reach <pid>)
#    On a busy laptop this happens within minutes.

# 5. Try to start the gateway again — fails with "already in use".
hermes gateway run --replace

Proposed fix

Three layers, in order of impact:

1. Cross-platform _get_process_start_time. Replace the /proc-only reader with psutil.Process(pid).create_time(). psutil is already widely available; if adding it as a hard dep is undesirable, fall back to sysctl KERN_PROC_PID on Darwin (subprocess to /usr/sbin/sysctl -n kern.proc.pid.<pid> works without new deps) and GetProcessTimes on Windows. Once start_time is populated on every OS, the existing guard at lines 513–518 of gateway/status.py does its job.

2. Identity check inside the scoped-lock staleness path. Even without start_time, _looks_like_gateway_process(pid) (line 139) plus a cross-platform cmdline reader (psutil.Process(pid).cmdline() or ps -o command= -p <pid>) would catch this case. The runtime lock path already uses the equivalent idea via _record_looks_like_gateway. Add the same to acquire_scoped_lock():

if not stale and existing.get("kind") == _GATEWAY_KIND \
        and not _looks_like_gateway_process(existing_pid):
    stale = True

3. Cleaner shutdown. Make sure release_scoped_lock() runs even when the agent drain times out — either bump the drain timeout's hard kill so the release path always fires, or register an atexit/signal-handler that releases scoped locks unconditionally. Reduces how often stale lockfiles appear in the first place.
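As a rough illustration of layers 1 and 2 without a new dependency, the sketch below shells out to ps. Several things here are assumptions rather than the project's code: it uses ps -o lstart= for the start time instead of the sysctl call mentioned above, the helper names are hypothetical, and GATEWAY_PATTERNS is a made-up pattern list (the real patterns used by _looks_like_gateway_process are not shown in this report). It covers macOS and Linux only; on Windows ps is normally absent, so every helper returns None and a GetProcessTimes or psutil path would still be needed.

```python
import subprocess


def _ps_field(pid, field):
    """Return one ps output field for pid, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["ps", "-o", f"{field}=", "-p", str(pid)],
            capture_output=True, text=True, timeout=5, check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # ps is missing (e.g. Windows) or did not answer in time
    value = result.stdout.strip()
    return value if result.returncode == 0 and value else None


def get_process_start_time(pid):
    """Start time as an opaque string (ps 'lstart'); compare only for equality."""
    return _ps_field(pid, "lstart")


def read_process_cmdline(pid):
    """Full command line of pid as reported by ps, or None."""
    return _ps_field(pid, "command")


# Hypothetical pattern list, for illustration only.
GATEWAY_PATTERNS = ("hermes gateway",)


def cmdline_looks_like_gateway(cmdline, patterns=GATEWAY_PATTERNS):
    """True if the command line contains any of the gateway patterns."""
    return cmdline is not None and any(p in cmdline for p in patterns)
```

Because lstart is a formatted date string, its exact text can vary with locale, so the value is only meaningful when the writer and the reader of the lock run under the same settings. A numeric timestamp, such as the one psutil returns, avoids that caveat.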

Optional UX improvement: a hermes gateway unlock (or hermes doctor --fix-locks) command so end users don't need to know where ~/.local/state/hermes/gateway-locks/ lives.


Environment

  • macOS 26 (arm64)
  • Gateway managed by launchd via ~/Library/LaunchAgents/ai.hermes.gateway.plist
  • Telegram platform (likely affects every scoped-lock-using platform on macOS/Windows)

extent analysis

TL;DR

The most likely fix for the issue is to implement a cross-platform _get_process_start_time function and add an identity check inside the scoped-lock staleness path to prevent the gateway from treating a recycled PID as a live lock.

Guidance

  • Implement a cross-platform _get_process_start_time function using psutil.Process(pid).create_time() to populate the start_time field in the lock file.
  • Add an identity check inside the scoped-lock staleness path using _looks_like_gateway_process(pid) and a cross-platform cmdline reader to catch cases where the PID has been recycled.
  • Consider implementing a cleaner shutdown mechanism to ensure release_scoped_lock() runs even when the agent drain times out.
  • As a temporary workaround, create a prestart hook to scan and remove stale lock files before launching the gateway.

Example

import psutil

def _get_process_start_time(pid):
    """Return the process creation time (seconds since the epoch), or None."""
    try:
        return psutil.Process(pid).create_time()
    except psutil.Error:
        # Process is gone or inaccessible: report the start time as unknown.
        return None

Notes

The proposed fix requires modifications to the gateway/status.py file and may need additional error handling and testing to ensure cross-platform compatibility.

Recommendation

Apply the proposed fix, starting with the implementation of a cross-platform _get_process_start_time function, to prevent the gateway from treating a recycled PID as a live lock. This will ensure that the gateway can start correctly even after an unclean shutdown.
