hermes: [systemd] Incomplete process cleanup during restart causes port conflict and infinite restart loop

hermes, 2026-05-08 14:49:22



Bug Description

On systemd-managed deployments, systemctl restart hermes-gateway.service can trigger an infinite restart loop when the old gateway process is not fully terminated before the new process starts. The new process detects the stale instance via PID file and exits with code 1. With Restart=always, systemd interprets this as a failure and immediately restarts — creating a death loop.

This is distinct from platform-connection failures (#21831) and launchd double-spawn (#21549). The trigger is incomplete process cleanup during systemd restart.
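The PID-file check described above can be illustrated with a minimal sketch. It is not the gateway's actual code: the function names and the three state labels are hypothetical, and it assumes a POSIX system where the PID file holds a single integer. In the restart window this issue describes, the old process would still be in the "alive" state.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid_file(path: str):
    """Return the integer PID stored in path, or None if missing or unparsable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def instance_state(path: str) -> str:
    """Classify a PID file as 'absent', 'alive', or 'dead' (hypothetical labels)."""
    pid = read_pid_file(path)
    if pid is None:
        return "absent"
    return "alive" if pid_is_alive(pid) else "dead"
```

A check like this cannot tell whether an "alive" process is already shutting down, which is why the ordering of stop and start matters during a restart.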

Environment

OS: WSL2 Ubuntu 24.04
Hermes Agent: installed via standard installer
Service: /etc/systemd/system/hermes-gateway.service with Restart=always
Command: gateway run --replace
Profiles: 6 (default + 5 agents), all using --replace

Steps to Reproduce

1. Start hermes-gateway.service normally
2. Run sudo systemctl restart hermes-gateway.service
3. If the old process hasn't fully exited when the new one starts, the new process detects the stale PID and exits with code 1
4. systemd restarts the service, the new process hits the same check, and the cycle repeats from step 3

The race condition is more likely when:

The gateway is under load (active agent sessions during restart)
WSL2 has been resumed from Windows hibernation
Multiple restart commands are issued in quick succession

Expected Behavior

systemctl restart should reliably produce a single running gateway instance. If a stale instance is detected, the new process should exit cleanly (exit 0) rather than triggering systemd's restart policy.

Actual Behavior

May 08 21:25:19 systemd[1]: hermes-gateway.service: Scheduled restart job, restart counter is at 75.
May 08 21:25:20 python[34383]: ❌ Gateway already running (PID 30623).
May 08 21:25:20 systemd[1]: hermes-gateway.service: Main process exited, code=exited, status=1/FAILURE
May 08 21:25:20 systemd[1]: hermes-gateway.service: Failed with result 'exit-code'.

NRestarts reached 76 and was still increasing.
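Because the loop produces no notification, the restart counter is the most direct signal. Below is a small sketch of reading it from a script, assuming `systemctl` is on the PATH. The helper names are hypothetical, and what counts as "too many" restarts is left to the caller.

```python
import subprocess


def parse_nrestarts(show_output: str) -> int:
    """Parse the 'NRestarts=<n>' line printed by `systemctl show --property=NRestarts`."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "NRestarts":
            return int(value.strip())
    raise ValueError("NRestarts not found in systemctl output")


def read_nrestarts(unit: str) -> int:
    """Ask systemd for the unit's automatic-restart counter."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_nrestarts(result.stdout)
```

For example, `read_nrestarts("hermes-gateway.service")` could be polled from a cron job or health check, alerting when the value keeps climbing between polls.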

Root Cause

Two compounding factors:

1. Race condition in systemd restart: the old process receives SIGTERM, but the new process starts before the old one has fully exited, so the port and PID file are still held.
2. Exit code 1 on duplicate detection: the new process treats "another instance running" as a failure (exit 1) rather than a clean exit (exit 0). This is the same root cause as #21549 / PR #21555.

Proposed Solutions

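Upstream fix (preferred): Merge PR #21555, which changes duplicate-instance detection from return False (exit 1) to return True (exit 0). This makes the "already running" case a clean exit, which systemd does not record as a failure. Note that Restart=always restarts the unit even after a clean exit, so the loop only stops if the unit also uses a policy such as Restart=on-failure.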
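The shape of that change can be sketched as follows. This is an illustration only, not the code in PR #21555: `start_gateway` and `exit_code` are hypothetical names, and the real startup logic is elided.

```python
def start_gateway(already_running: bool) -> bool:
    """Hypothetical start routine; the boolean result maps to the process exit code."""
    if already_running:
        print("Gateway already running; nothing to do.")
        # Before the change this branch returned False, which became exit code 1.
        return True
    # ... real gateway startup would happen here ...
    return True


def exit_code(ok: bool) -> int:
    """Map the start result to a process exit status: 0 for success, 1 for failure."""
    return 0 if ok else 1
```

An entry point would then call something like `sys.exit(exit_code(start_gateway(...)))`, so the duplicate case reports status 0 to systemd.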

Deployment-level workaround: Add an ExecStartPre guard to the systemd unit:

[Service]
ExecStartPre=/bin/bash -c 'fuser -k 8649/tcp 2>/dev/null; sleep 1'

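This kills any process still holding TCP port 8649 (fuser -k sends SIGKILL by default) and then waits one second, so the port is free before the new process starts. Verified: after applying this, 4 consecutive restarts produced NRestarts=0.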
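A fixed one-second sleep may be too short or longer than needed. One possible variation, shown here only as a sketch, is to poll until the port stops accepting connections. It assumes the gateway listens on 127.0.0.1 (the issue does not state the bind address), and the function names and timeout values are illustrative.

```python
import socket
import time


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. when something is still listening.
        return s.connect_ex((host, port)) != 0


def wait_for_port_free(port: int, timeout: float = 10.0,
                       interval: float = 0.2, host: str = "127.0.0.1") -> bool:
    """Poll until the port is free or the timeout elapses; return the final state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if port_is_free(port, host):
            return True
        time.sleep(interval)
    return port_is_free(port, host)
```

A wrapper script could call `wait_for_port_free(8649)` and exit non-zero on timeout. An ExecStartPre command that fails prevents the start attempt, so the unit would not launch while the port is still occupied.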

Documentation: Add a systemd deployment guide to the official docs covering:

Recommended unit file configuration (ExecStartPre, RestartSec)
Multi-profile gateway setup considerations
WSL2-specific notes (hibernation, process cleanup)

Impact

Gateway becomes unavailable during the restart loop
Cron ticker is repeatedly killed, preventing all cron jobs from firing
High CPU usage from rapid restart cycles
Issue is silent — no error notification, systemd just keeps restarting

Related Issues

#21549 — launchd double-spawn (macOS equivalent)
#21555 — PR fixing exit code for duplicate detection
#21831 — platform auth failure causing similar restart loop
