hermes: Avoid systemd gateway restart/takeover loops from Restart=always + --replace

Problem

Systemd-managed Hermes gateway units can combine:

  • ExecStart=... gateway run --replace
  • Restart=always

That combination is unsafe when another gateway runner for the same profile appears: with Restart=always, systemd restarts the service even when it exits cleanly after a planned --replace SIGTERM. The old supervisor can then race the replacement and create a restart/takeover loop.
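
To make the interaction concrete, here is a minimal Python sketch of how systemd's Restart= setting reacts to different exits. It is a simplified model based on the behaviour described in systemd.service(5), not Hermes code: it covers only three policies, ignores timeouts, watchdog events and SuccessExitStatus=, and the function names are made up for illustration.

```python
# Simplified model of systemd's Restart= decision for a long-running service.
# Only three policies are modelled; see systemd.service(5) for the full table.

# Signals that systemd counts as a clean termination for non-oneshot services.
CLEAN_SIGNALS = {"SIGHUP", "SIGINT", "SIGTERM", "SIGPIPE"}


def is_clean_exit(exit_code=None, signal=None):
    """Return True if this exit would count as clean.

    Assumption of this sketch: SuccessExitStatus= is not configured.
    """
    if signal is not None:
        return signal in CLEAN_SIGNALS
    return exit_code == 0


def would_restart(policy, exit_code=None, signal=None):
    """Return True if the given Restart= policy restarts after this exit."""
    clean = is_clean_exit(exit_code=exit_code, signal=signal)
    if policy == "always":
        return True  # restarts regardless of how the process ended
    if policy == "on-failure":
        return not clean  # clean exits are left alone
    if policy == "no":
        return False
    raise ValueError(f"policy not modelled in this sketch: {policy}")


if __name__ == "__main__":
    for policy in ("always", "on-failure"):
        for sig in ("SIGTERM", "SIGKILL"):
            print(f"Restart={policy}, killed by {sig}: "
                  f"restart={would_restart(policy, signal=sig)}")
```

In this model a clean SIGTERM exit is restarted only under Restart=always. An exit by SIGKILL is restarted under both policies, which is relevant to the status=9/KILL exits listed under Incident evidence.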

Incident evidence

Observed during the Clavain/Volyova gateway incident on 2026-05-26:

  • Logs showed Shutdown context: signal=SIGTERM under_systemd=yes parent_pid=1 parent_name=systemd.
  • Some exits were clean; others escalated to status=9/KILL when the old process or its children did not exit promptly.
  • No direct OOM, port bind failure, or unhandled traceback was visible in the checked window.
  • Current code comments around takeover assume that a clean takeover avoids Restart=on-failure loops, but the installed units used Restart=always, which restarts the service even after a clean exit.

The immediate local trigger was stale duplicate user services fighting the canonical system services, but this template/policy combination amplified the failure mode.
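
One way to spot the duplicate-service trigger is to compare the unit file names in the system and user unit directories. The sketch below is an assumption-laden illustration rather than part of Hermes: the helper name is hypothetical, it only catches duplicates that share a file name, and it checks just the two conventional directories shown in the usage block.

```python
import os
from pathlib import Path


def find_duplicate_units(system_dir, user_dir, suffix=".service"):
    """Return sorted unit file names that exist in both directories."""
    system_dir, user_dir = Path(system_dir), Path(user_dir)
    # A missing directory simply means there is nothing to compare.
    if not system_dir.is_dir() or not user_dir.is_dir():
        return []
    system_units = {p.name for p in system_dir.glob(f"*{suffix}")}
    user_units = {p.name for p in user_dir.glob(f"*{suffix}")}
    return sorted(system_units & user_units)


if __name__ == "__main__":
    # Conventional locations for admin-installed system units and per-user units.
    duplicates = find_duplicate_units(
        "/etc/systemd/system",
        os.path.expanduser("~/.config/systemd/user"),
    )
    for name in duplicates:
        print(f"defined as both a system and a user unit: {name}")
```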

Desired outcome

Make the systemd service behavior safe by default for supervised gateways.

Candidate approaches

  • Use Restart=on-failure rather than Restart=always for gateway services that run with --replace.
  • Or remove --replace from ExecStart for systemd-managed units and make replacement an explicit control action.
  • Or make --replace takeover systemd-aware so planned clean exits do not cause the old supervisor lane to resurrect itself.
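
The first two candidate approaches could be sketched roughly as follows. This is not the Hermes installer: the function names, the /usr/bin/hermes path in the usage block, and the rule that commands without --replace keep Restart=always are all assumptions made for illustration. The sketch also uses shell-style splitting, which only approximates systemd's own ExecStart= quoting rules.

```python
import shlex


def choose_restart_policy(exec_start):
    """Pick a Restart= value for a gateway command line (first approach).

    Hypothetical rule: a command that carries --replace gets on-failure,
    so a clean exit after a planned takeover is not restarted.
    """
    return "on-failure" if "--replace" in shlex.split(exec_start) else "always"


def strip_replace(exec_start):
    """Return the command line without --replace (second approach)."""
    return shlex.join(arg for arg in shlex.split(exec_start) if arg != "--replace")


def render_service_section(exec_start):
    """Render a minimal [Service] section with the chosen policy."""
    lines = [
        "[Service]",
        f"ExecStart={exec_start}",
        f"Restart={choose_restart_policy(exec_start)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder command line; the real ExecStart path is not given in this issue.
    command = "/usr/bin/hermes gateway run --replace"
    print(render_service_section(command))
    print(render_service_section(strip_replace(command)))
```

The third approach, making the takeover itself systemd-aware, depends on gateway internals that this issue does not describe, so it is not sketched here.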

Acceptance criteria

  • Installing/replacing a gateway service no longer generates a unit that can repeatedly resurrect an intentionally replaced process.
  • Tests or docs cover the interaction between gateway run --replace and systemd Restart= policy.
  • Existing gateway restart/install flows still handle stale PID files and already-running gateways cleanly.
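
A regression test for the first two criteria might check generated unit text for the unsafe combination. The following is a minimal sketch under stated assumptions: the parser is deliberately simplified (no line continuations, no drop-in overrides), the helper names are hypothetical, and the sample ExecStart path is a placeholder.

```python
def parse_service_settings(unit_text):
    """Collect key/value settings from the [Service] section of a unit file.

    Simplified: later assignments of the same key overwrite earlier ones.
    """
    settings = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Service" and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings


def has_unsafe_replace_restart(unit_text):
    """Return True if the unit combines --replace with Restart=always."""
    settings = parse_service_settings(unit_text)
    uses_replace = "--replace" in settings.get("ExecStart", "").split()
    return bool(uses_replace and settings.get("Restart") == "always")


if __name__ == "__main__":
    # Placeholder unit text; the path is illustrative only.
    sample = (
        "[Service]\n"
        "ExecStart=/usr/bin/hermes gateway run --replace\n"
        "Restart=always\n"
    )
    print("unsafe combination:", has_unsafe_replace_restart(sample))
```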
