hermes: systemd TimeoutStopSec mismatch causes silent crash loop; suggested fix command is wrong

Problem

The default systemd unit generated by hermes gateway install sets TimeoutStopSec=90s, but the gateway's default restart_drain_timeout is 180s. When systemd stops the gateway, it sends SIGTERM, waits 90s, then sends SIGKILL, while the gateway may need up to 180s to drain active agents. The SIGKILL lands mid-drain and leaves stale lock files behind. On the next startup, the gateway detects the old PID/lock and exits immediately with "Another gateway instance is already running". Combined with systemd Restart=on-failure, this produces an infinite crash loop: startup → detect stale lock → exit → systemd restarts → repeat.
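The timing race described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Hermes code: the function name stop_outcome is hypothetical, and it assumes the worst case where the drain takes its full timeout.

```python
def stop_outcome(timeout_stop_sec: int, drain_needed_sec: int) -> str:
    """Describe what happens when systemd stops a service that needs time to drain.

    systemd sends SIGTERM at t=0 and SIGKILL once TimeoutStopSec has elapsed.
    """
    if drain_needed_sec < timeout_stop_sec:
        return "clean exit at t=%ds" % drain_needed_sec
    # Drain is still running (or finishing at the same instant) when SIGKILL arrives.
    remaining = drain_needed_sec - timeout_stop_sec
    return "SIGKILL at t=%ds, %ds of drain left" % (timeout_stop_sec, remaining)


# Values from this report: the generated unit (90s) vs the regenerated unit (210s).
print(stop_outcome(90, 180))
print(stop_outcome(210, 180))
```

With the values from this report, the 90s unit is killed halfway through a 180s drain, while the 210s unit leaves 30s of headroom.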

Evidence

The gateway-exit-diag.log grew to 381,469 lines (63 MB) over 13 days of crash looping. Last entries:

{"tag": "gateway.start", "pid": 32252, "replace": false}
{"tag": "asyncio.run.returned", "pid": 32252, "success": false}
{"tag": "gateway.exit_nonzero", "pid": 32252}
{"tag": "atexit.hook", "pid": 32252, "sys_exc": "(None, None, None)"}

Note: sys_exc is (None, None, None) — no Python traceback, just a silent exit. The user sees the gateway go down but gets no actionable error message.
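To gauge how long a loop like this has been running, the diagnostic log can be summarized with a short script. The sketch below assumes the log is JSON Lines with the tag, pid, and sys_exc fields shown above; the function name summarize_diag_log is hypothetical, and the location of gateway-exit-diag.log on disk is not stated in this report.

```python
import json
from collections import Counter


def summarize_diag_log(lines):
    """Count restarts and silent exits in JSON-lines diagnostic output."""
    tags = Counter()
    start_pids = set()
    silent_exits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if not isinstance(record, dict):
            continue
        tag = record.get("tag")
        tags[tag] += 1
        if tag == "gateway.start":
            start_pids.add(record.get("pid"))
        # An atexit hook with an empty exception triple means no traceback was raised.
        if tag == "atexit.hook" and record.get("sys_exc") == "(None, None, None)":
            silent_exits += 1
    return {
        "starts": len(start_pids),
        "nonzero_exits": tags["gateway.exit_nonzero"],
        "silent_exits": silent_exits,
    }


sample = [
    '{"tag": "gateway.start", "pid": 32252, "replace": false}',
    '{"tag": "asyncio.run.returned", "pid": 32252, "success": false}',
    '{"tag": "gateway.exit_nonzero", "pid": 32252}',
    '{"tag": "atexit.hook", "pid": 32252, "sys_exc": "(None, None, None)"}',
]
print(summarize_diag_log(sample))
```

For a real file, pass an open file handle instead of the sample list; the function reads it line by line, so a 63 MB log does not need to fit in memory at once.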

Gateway's Own Warning

The gateway detects this mismatch and logs:

Stale systemd unit detected: hermes.service has TimeoutStopSec=90s but drain_timeout=180s (expected >=210s). systemd may SIGKILL the gateway mid-drain. Run hermes gateway service install --replace to regenerate the unit.

Two Bugs in the Warning

  1. The suggested command is wrong. hermes gateway service install --replace does not work. The correct command is hermes gateway install --force. (The --replace flag is only valid with hermes gateway run.)

  2. The warning is buried in logs. Users in a crash loop can't see it — the gateway dies before they can read logs. The warning should be surfaced more aggressively (e.g., on startup, refuse to start until fixed).

Steps to Reproduce

  1. Fresh install of Hermes via systemd user service
  2. Have an active agent session (so drain is needed)
  3. systemctl --user restart hermes-gateway
  4. Gateway enters crash loop: every ~10 seconds, starts → detects stale state → exits → systemd restarts

Fix Applied

hermes gateway stop
hermes gateway install --force   # regenerates unit with TimeoutStopSec=210
hermes gateway start

After this, TimeoutStopSec=210 > drain_timeout=180 and the crash loop stops.

Proposed Fixes

  1. hermes gateway install should auto-align TimeoutStopSec with restart_drain_timeout + 30s by default
  2. Fix the suggested command in the warning message (--replace → --force)
  3. On gateway startup, if the mismatch is detected, refuse to start (or auto-fix) rather than just logging a warning
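A startup guard along the lines of fixes 1 to 3 might look like the following sketch. It is not the Hermes implementation: the names parse_timeout_stop_sec and unit_problem are hypothetical, the 30s margin is taken from the proposal above, and the parser is a deliberate simplification that handles only whole numbers with an optional s, sec, or min suffix rather than the full systemd time-span syntax.

```python
import re

DRAIN_MARGIN_SEC = 30  # headroom from the proposal: restart_drain_timeout + 30s

_TIMEOUT_RE = re.compile(
    r"^[ \t]*TimeoutStopSec[ \t]*=[ \t]*(\d+)[ \t]*(s|sec|min)?[ \t]*$",
    re.MULTILINE,
)


def parse_timeout_stop_sec(unit_text):
    """Return TimeoutStopSec in seconds, or None if absent or in an unsupported format."""
    matches = _TIMEOUT_RE.findall(unit_text)
    if not matches:
        return None
    number, suffix = matches[-1]  # the last assignment in a unit file wins
    seconds = int(number)
    return seconds * 60 if suffix == "min" else seconds


def unit_problem(unit_text, drain_timeout_sec):
    """Return an actionable message if the unit can SIGKILL mid-drain, else None."""
    required = drain_timeout_sec + DRAIN_MARGIN_SEC
    actual = parse_timeout_stop_sec(unit_text)
    if actual is None:
        return "Could not read TimeoutStopSec from the unit; expected >=%ds." % required
    if actual >= required:
        return None
    return (
        "TimeoutStopSec=%ds is below the required %ds (drain_timeout=%ds + %ds). "
        "Run: hermes gateway install --force"
        % (actual, required, drain_timeout_sec, DRAIN_MARGIN_SEC)
    )


stale_unit = "[Service]\nTimeoutStopSec=90s\nRestart=on-failure\n"
print(unit_problem(stale_unit, 180))
```

A gateway using a check like this could print the message to stderr and exit before acquiring any lock, so the corrected command is the first thing the user sees instead of a line buried in the logs.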

Related

  • #11258: Gateway self-restart can exit cleanly into draining state and stay dead under systemd Restart=on-failure (related but different failure mode)

Environment

  • OS: Ubuntu 24.04 (WSL2)
  • Hermes version: latest
  • systemd user service (systemctl --user)
