hermes: --replace singleton pidfile resolver ignores HERMES_PROFILE, causing cross-gateway SIGKILL on multi-persona hosts



Summary

On hosts running two or more Hermes telegram gateways with different HERMES_PROFILE values but a shared $HOME, --replace SIGKILLs sibling gateways. Each new gateway with --replace resolves to the SAME pidfile path regardless of HERMES_PROFILE, so it kills whichever gateway last claimed the slot.

Reproduction

Minimal scenario on a single host (Linux, single user, Hermes v0.14.0):

  1. Start a Hermes telegram gateway as profile alice:
    HERMES_PROFILE=alice hermes telegram --replace &
  2. Start a second telegram gateway as profile bob, same shell user:
    HERMES_PROFILE=bob hermes telegram --replace &
  3. Observe: alice's process gets SIGKILLed. bob's --replace read the PID recorded in ~/.hermes/profiles/<resolved>/gateway.pid (alice's), used it as the kill target, and then overwrote the pidfile with its own PID.

The two gateways have distinct HERMES_PROFILE values but resolve to the same pidfile, so --replace cannot distinguish them.
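The sequence above can be modelled without Hermes at all. This is a minimal sketch, assuming --replace reads the recorded PID, signals it, and then writes its own PID; the function name start_with_replace and the PIDs 1001 and 1002 are made up for illustration, and no real signal is sent.

```python
import tempfile
from pathlib import Path


def start_with_replace(pidfile: Path, new_pid: int, killed: list) -> None:
    """Hypothetical model of a --replace start: evict the recorded PID, then claim the slot."""
    if pidfile.exists():
        # Stand-in for sending SIGKILL to whoever last claimed the pidfile.
        killed.append(int(pidfile.read_text()))
    pidfile.write_text(str(new_pid))


with tempfile.TemporaryDirectory() as tmp:
    # Both profiles resolve to this one path, which is the reported bug.
    shared_pidfile = Path(tmp) / "gateway.pid"
    killed = []
    start_with_replace(shared_pidfile, new_pid=1001, killed=killed)  # alice
    start_with_replace(shared_pidfile, new_pid=1002, killed=killed)  # bob
    survivor = int(shared_pidfile.read_text())

print(killed, survivor)
```

In this model alice's PID ends up in the killed list and only bob's PID remains in the pidfile, which matches the "only the most recent gateway survives" observation.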

Observed behavior

  • journalctl (or equivalent) shows the killed gateway exiting with Telegram API 409 Conflict BEFORE the surviving gateway has acquired the long-polling slot. This initially LOOKS like a bot-token rotation issue but is actually a pidfile-collision symptom: the kill happens, the killed process drops the long-poll, the new process picks it up, and 409 is the API noticing two consumers briefly.
  • Only the most-recently-started gateway with --replace survives.

Root cause analysis

The pidfile path resolver inside hermes/core/gateway.py (the function that computes pidfile_path before --replace reads it) honors the active_profile config value but ignores the HERMES_PROFILE environment variable, whether HERMES_PROFILE is set alongside active_profile or on its own. As a result, every gateway sharing a $HOME resolves to the same ~/.hermes/profiles/<active_profile>/gateway.pid regardless of the environment variable it was launched with.
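To make the collision concrete, here is a small model of the behavior described above. It is not the actual Hermes code: resolve_pidfile_as_reported, the "default" profile name, and the /home/user path are assumptions chosen for illustration.

```python
from pathlib import Path


def resolve_pidfile_as_reported(hermes_home: Path, active_profile: str, env: dict) -> Path:
    """Hypothetical model of the reported resolver: env is accepted but never read."""
    return hermes_home / "profiles" / active_profile / "gateway.pid"


shared_home = Path("/home/user/.hermes")  # assumed shared location under one $HOME

alice = resolve_pidfile_as_reported(shared_home, "default", {"HERMES_PROFILE": "alice"})
bob = resolve_pidfile_as_reported(shared_home, "default", {"HERMES_PROFILE": "bob"})

# Different HERMES_PROFILE values, same pidfile path.
print(alice == bob)
```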

Production workaround (currently deployed)

Per-process isolation of HERMES_HOME:

  • Set HERMES_HOME=~/.hermes-<slug>/ per gateway (one slug per profile).
  • Drop a profile-specific active_profile file at $HERMES_HOME/active_profile.
  • Symlink $HERMES_HOME/profiles/<slug> to a single shared plugin source so plugin code stays single-sourced.
  • Set the env in the service unit (Environment=HERMES_HOME=~/.hermes-<slug>).

Result: each gateway resolves to a distinct pidfile (~/.hermes-<slug>/profiles/<slug>/gateway.pid). Six gateways co-exist with stable PIDs across a 30-minute observation window, with no cross-kill.
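A rough sketch of the per-gateway layout, for anyone scripting the workaround. The helper name prepare_gateway_home is hypothetical, and writing the slug as the content of the active_profile file is an assumption about its format. The plugin symlink step is left out because its exact target layout is not spelled out here.

```python
import tempfile
from pathlib import Path


def prepare_gateway_home(base: Path, slug: str) -> dict:
    """Create an isolated HERMES_HOME for one gateway (hypothetical helper)."""
    home = base / f".hermes-{slug}"
    (home / "profiles").mkdir(parents=True, exist_ok=True)
    # Assumption: the active_profile file simply contains the profile name.
    (home / "active_profile").write_text(slug + "\n", encoding="utf-8")
    return {
        "env": {"HERMES_HOME": str(home)},  # value to put in the service unit
        "pidfile": home / "profiles" / slug / "gateway.pid",  # expected pidfile
    }


with tempfile.TemporaryDirectory() as tmp:
    # A temporary directory stands in for $HOME so the sketch has no side effects.
    layouts = {slug: prepare_gateway_home(Path(tmp), slug) for slug in ("alice", "bob")}
    alice_home = Path(layouts["alice"]["env"]["HERMES_HOME"])
    marker = (alice_home / "active_profile").read_text(encoding="utf-8").strip()

distinct = layouts["alice"]["pidfile"] != layouts["bob"]["pidfile"]
print(distinct, marker)
```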

This works but is brittle — anyone running multiple gateways from the same $HOME hits the original bug before discovering the workaround.

Proposed upstream fix

In the pidfile path resolver, honor HERMES_PROFILE if it is set, and fall back to active_profile only when the environment variable is unset or empty:

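# sketch: config and hermes_home come from the surrounding resolver
import os
from pathlib import Path

profile = os.environ.get("HERMES_PROFILE") or config.active_profile  # env var wins; unset or empty falls back
pidfile_path = Path(hermes_home) / "profiles" / profile / "gateway.pid"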

Plus a unit test that asserts two gateway invocations with different HERMES_PROFILE values (same HERMES_HOME) produce different pidfile_path outputs.
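The test could look roughly like the sketch below. resolve_pidfile_path is a hypothetical stand-in for the real resolver (whose name and signature are not given here), implemented with the proposed precedence so the test is runnable on its own; the "default" profile and /tmp/hermes-home path are placeholders.

```python
import os
import unittest
from pathlib import Path
from unittest import mock


def resolve_pidfile_path(hermes_home: str, active_profile: str) -> Path:
    """Hypothetical stand-in for the resolver, using the proposed precedence."""
    profile = os.environ.get("HERMES_PROFILE") or active_profile
    return Path(hermes_home) / "profiles" / profile / "gateway.pid"


class PidfileResolverTest(unittest.TestCase):
    def _resolve_with_env(self, profile: str) -> Path:
        # patch.dict restores os.environ when the block exits.
        with mock.patch.dict(os.environ, {"HERMES_PROFILE": profile}):
            return resolve_pidfile_path("/tmp/hermes-home", "default")

    def test_distinct_profiles_get_distinct_pidfiles(self):
        self.assertNotEqual(self._resolve_with_env("alice"), self._resolve_with_env("bob"))

    def test_falls_back_to_active_profile_when_env_unset(self):
        with mock.patch.dict(os.environ, {}, clear=True):
            path = resolve_pidfile_path("/tmp/hermes-home", "default")
        self.assertEqual(path.parent.name, "default")


if __name__ == "__main__":
    unittest.main()
```

The second test pins the fallback so that hosts relying only on active_profile keep their current pidfile location.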

Happy to send a PR if maintainers prefer that over a self-fix.

Affected versions

  • Observed on v0.14.0.
  • The pidfile resolver pattern is unchanged in v0.13.x based on a quick read of release notes, so the bug likely applies there too.

Severity

Medium-to-high in production. The kill is silent from Hermes's perspective (the surviving gateway looks healthy in isolation) — only external verification (checking that ALL N gateways are still up after starting a new one) catches it. A multi-persona host with --replace in its restart loop can mask a complete outage of N-1 personas until a customer reports it.
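For the external verification, a supervisor can record each gateway's PID at launch and re-check all of them after any new start. The sketch below assumes a POSIX host; missing_gateways and the gateway names are illustrative. It checks recorded PIDs rather than pidfiles, since with this bug the shared pidfile always points at a live process.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def missing_gateways(expected_pids: dict) -> list:
    """Names of gateways whose recorded PID is no longer running."""
    return [name for name, pid in expected_pids.items() if not pid_is_alive(pid)]


# Usage: this process is alive; a child that has exited and been reaped is not.
child = subprocess.Popen([sys.executable, "-c", "pass"])
child.wait()
report = missing_gateways({"self": os.getpid(), "exited-child": child.pid})
print(report)
```

A PID check like this can be fooled by PID reuse over long intervals, so it suits a check run shortly after each gateway start rather than long-term monitoring.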
