hermes: [systemd] Incomplete process cleanup during restart causes port conflict and infinite restart loop



Bug Description

On systemd-managed deployments, systemctl restart hermes-gateway.service can trigger an infinite restart loop when the old gateway process has not fully terminated before the new process starts. The new process finds the old instance through its PID file and exits with code 1. Because the unit uses Restart=always, systemd records the exit as a failure and restarts the service immediately, creating a restart loop.

This is distinct from platform-connection failures (#21831) and launchd double-spawn (#21549). The trigger is incomplete process cleanup during systemd restart.

Environment

  • OS: WSL2 Ubuntu 24.04
  • Hermes Agent: installed via standard installer
  • Service: /etc/systemd/system/hermes-gateway.service with Restart=always
  • Command: gateway run --replace
  • Profiles: 6 (default + 5 agents), all using --replace

Steps to Reproduce

  1. Start hermes-gateway.service normally
  2. Run sudo systemctl restart hermes-gateway.service
  3. If the old process hasn't fully exited when the new one starts, the new process finds the old PID still recorded and exits with code 1
  4. systemd restarts the service automatically → back to step 3

The race condition is more likely when:

  • The gateway is under load (active agent sessions during restart)
  • WSL2 has been resumed from Windows hibernation
  • Multiple restart commands are issued in quick succession

Expected Behavior

systemctl restart should reliably produce a single running gateway instance. If a stale instance is detected, the new process should exit cleanly (exit 0) rather than triggering systemd's restart policy.

Actual Behavior

May 08 21:25:19 systemd[1]: hermes-gateway.service: Scheduled restart job, restart counter is at 75.
May 08 21:25:20 python[34383]: ❌ Gateway already running (PID 30623).
May 08 21:25:20 systemd[1]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
May 08 21:25:20 systemd[1]: hermes-gateway.service: Failed with result 'exit-code'.

NRestarts reached 76 and was still increasing.
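
The restart counter quoted above can be read from systemd directly. The following is a minimal sketch, assuming systemctl is on the PATH and the unit name is the one from this report; parse_nrestarts and read_nrestarts are illustrative helper names, not part of Hermes.

```python
import subprocess

UNIT = "hermes-gateway.service"  # unit name taken from the report


def parse_nrestarts(show_output: str) -> int:
    """Extract the counter from `systemctl show -p NRestarts` output.

    Accepts both "NRestarts=76" and a bare "76" (as printed with --value).
    """
    for line in show_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # rpartition keeps everything after the last "=", or the whole
        # line when there is no "=" at all.
        _, _, value = line.rpartition("=")
        return int(value)
    raise ValueError("no NRestarts value in output")


def read_nrestarts(unit: str = UNIT) -> int:
    """Ask systemd how many automatic restarts it has performed for `unit`."""
    result = subprocess.run(
        ["systemctl", "show", unit, "-p", "NRestarts"],
        capture_output=True, text=True, check=True,
    )
    return parse_nrestarts(result.stdout)
```

Calling read_nrestarts() twice a few seconds apart and seeing the value rise suggests the loop is still active.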

Root Cause

Two compounding factors:

  1. Race condition in systemd restart: SIGTERM → new process starts before old process fully exits → port/PID file still held
  2. Exit code 1 on duplicate detection: The new process treats "another instance running" as a failure (exit 1) rather than a clean exit (exit 0). This is the same root cause as #21549 / PR #21555.
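
The first factor comes down to whether the PID recorded by the old instance still refers to a live process when the new one starts. Below is a minimal sketch of such a check, assuming a PID file that contains a single integer; pid_alive and read_pid_file are hypothetical names, and the actual Hermes check may differ.

```python
import os
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is sent
    except ProcessLookupError:
        return False     # no such process: a PID file naming it is stale
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def read_pid_file(path: Path) -> Optional[int]:
    """Return the PID stored in `path`, or None if it is missing or unparsable."""
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

During the race window the old gateway is still shutting down, so a check like pid_alive returns True for it, which is the situation in which the new process reports "Gateway already running".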

Proposed Solutions

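Upstream fix (preferred): Merge PR #21555, which changes duplicate-instance detection from return False (exit 1) to return True (exit 0). This makes the "already running" case a clean exit, which systemd does not treat as a failure. Note that Restart=always restarts a service even after a clean exit, so the exit-code change stops the loop only under a policy such as Restart=on-failure.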
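
The shape of that change can be sketched as follows. This is an illustration only: start_gateway and to_exit_status are hypothetical names, and the real change in PR #21555 may be structured differently.

```python
from typing import Optional


def start_gateway(existing_pid: Optional[int]) -> bool:
    """Hypothetical stand-in for the gateway entry point.

    Returns True when the process should exit with status 0.
    """
    if existing_pid is not None:
        print(f"Gateway already running (PID {existing_pid}).")
        # Proposed behaviour; per the report this branch used to return False.
        return True
    # ... the real startup work would happen here ...
    return True


def to_exit_status(ok: bool) -> int:
    """Map the boolean result to a process exit status (0 = clean exit)."""
    return 0 if ok else 1
```

With this mapping, a duplicate launch ends with status 0 instead of 1.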

Deployment-level workaround: Add an ExecStartPre guard to the systemd unit:

[Service]
ExecStartPre=/bin/bash -c 'fuser -k 8649/tcp 2>/dev/null; sleep 1'

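This kills any process still holding TCP port 8649 (fuser -k sends SIGKILL by default) and then waits one second, so the port is free before the new process starts. Verified: after applying this, 4 consecutive restarts produced NRestarts=0.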
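
The fixed sleep 1 in the workaround is an estimate of how long the port takes to be released. A polling variant can be sketched as follows, assuming the gateway listens on TCP port 8649 as in the unit above; port_is_free and wait_for_port_free are hypothetical helpers, not part of Hermes or systemd, and the bind address is a guess.

```python
import socket
import time


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # Mirror what most servers set, so lingering TIME_WAIT connections
        # do not make the port look busy; a live listener still blocks bind.
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def wait_for_port_free(port: int, timeout: float = 10.0,
                       interval: float = 0.2, host: str = "0.0.0.0") -> bool:
    """Poll until the port can be bound; return False if `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(port, host):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A script run from ExecStartPre could call wait_for_port_free(8649) and exit non-zero on timeout, so the gateway is not started while the old listener is still bound.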

Documentation: Add a systemd deployment guide to the official docs covering:

  • Recommended unit file configuration (ExecStartPre, RestartSec)
  • Multi-profile gateway setup considerations
  • WSL2-specific notes (hibernation, process cleanup)

Impact

  • Gateway becomes unavailable during the restart loop
  • Cron ticker is repeatedly killed, preventing all cron jobs from firing
  • High CPU usage from rapid restart cycles
  • Issue is silent — no error notification, systemd just keeps restarting

Related Issues

  • #21549 — launchd double-spawn (macOS equivalent)
  • #21555 — PR fixing exit code for duplicate detection
  • #21831 — platform auth failure causing similar restart loop
