hermes - ✅ (Solved) Fix Gateway restart loop: hardcoded --replace + systemd ExecStopPost causes infinite restart cycle [3 pull requests, 1 participant]

GitHub stats

NousResearch/hermes-agent#23272 (fetched 2026-05-11 03:30:14)

  • Comments: 0
  • Participants: 1
  • Timeline: 8 (top events: labeled ×5, cross-referenced ×3)
  • Reactions: 0

PR fix notes

PR #23281: fix(gateway): stop default replace in service runs

Description (problem / solution / changelog)

What does this PR do?

Stops service-generated gateway runs from hardcoding --replace into their startup command. Service units, launchd plists, and detached profile restart helpers should start a fresh gateway process, not implicitly request a destructive in-place replacement.

This keeps manual hermes gateway run --replace available for operators, while making the default managed-service path safer and less likely to self-trigger restart loops.

Related Issue

Fixes #23272

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Removed hardcoded --replace from _gateway_run_args_for_profile()
  • Removed hardcoded --replace from generated systemd unit ExecStart lines
  • Removed hardcoded --replace from generated launchd ProgramArguments
  • Added regression coverage to ensure managed gateway launch paths omit --replace

How to Test

  1. Run uv run --frozen pytest -q -o addopts='' tests/hermes_cli/test_gateway_service.py -k 'launchd_install_repairs_outdated_plist_without_force or systemd_unit_includes_profile or launchd_plist_includes_profile or gateway_run_args_for_profile_omit_replace'
  2. Generate a systemd unit or launchd plist and confirm it uses gateway run without --replace
  3. Confirm manual CLI usage can still explicitly pass --replace when desired
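
Step 2 above can also be checked mechanically. The following is a minimal sketch, not code from the PR: the helper names (exec_start_commands, unit_uses_replace) and the sample unit text are made up for illustration, and shlex.split only approximates systemd's own command-line quoting rules.

```python
import shlex


def exec_start_commands(unit_text: str) -> list[list[str]]:
    """Return the tokenized command of every non-empty ExecStart= line."""
    commands = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("ExecStart="):
            value = line[len("ExecStart="):]
            if value:  # an empty ExecStart= only resets the list in systemd
                commands.append(shlex.split(value))
    return commands


def unit_uses_replace(unit_text: str) -> bool:
    """True if any ExecStart command passes --replace as its own argument."""
    return any("--replace" in argv for argv in exec_start_commands(unit_text))


# Hypothetical unit text; the real generated unit will differ.
SAMPLE_UNIT = """\
[Unit]
Description=Hermes gateway (sample)

[Service]
ExecStart=/usr/bin/env hermes gateway run
Restart=always
"""

if __name__ == "__main__":
    print("uses --replace:", unit_uses_replace(SAMPLE_UNIT))
```

Tokenizing the command rather than searching the raw text avoids false positives from a path or description that merely contains the substring "--replace".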

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15 / local CLI service test slice

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

  • uv run --frozen pytest -q -o addopts='' tests/hermes_cli/test_gateway_service.py -k 'launchd_install_repairs_outdated_plist_without_force or systemd_unit_includes_profile or launchd_plist_includes_profile or gateway_run_args_for_profile_omit_replace'
  • uv run --frozen ruff check hermes_cli/gateway.py tests/hermes_cli/test_gateway_service.py

Changed files

  • hermes_cli/gateway.py (modified, +3/-4)
  • tests/hermes_cli/test_gateway_service.py (modified, +24/-2)

PR #23330: fix(gateway): remove --replace from systemd/launchd service definitions (fixes #23272)

Description (problem / solution / changelog)

Summary

Removes the --replace flag from all service-installed gateway launch paths (systemd, launchd, Windows Scheduled Task, Startup folder). The --replace flag remains available for explicit manual use via hermes gateway run --replace, but services no longer auto-kill sibling instances, which prevents the infinite restart cycle described in issue #23272.

What changed

| File | Change |
| --- | --- |
| hermes_cli/gateway.py | generate_systemd_unit(): drop --replace from ExecStart (both system & user). generate_launchd_plist(): drop <string>--replace</string>. _gateway_run_args_for_profile(): drop --replace from detached-restart argv. |
| hermes_cli/gateway_windows.py | _build_gateway_cmd_script() and _build_gateway_argv(): drop --replace from Windows service/Scheduled Task argv. Updated docstring references. |
| tests/ | Updated assertions that verify plist/unit content to expect --replace absence. Legacy-unit detection tests now use the new ExecStart signature. |

Why

Issue #23272 describes a gateway restart-loop when:

  1. ExecStopPost=/bin/kill -9 $MAINPID (a systemd drop-in) fires on every service stop.
  2. The service is configured with gateway run --replace, so a clean restart SIGTERMs the existing instance.
  3. The ExecStopPost SIGKILLs the newly spawned instance before it stabilizes.
  4. systemd Restart=always respawns → loop.

By removing --replace from service definitions, the gateway starts without killing an already-running sibling. When a real restart is needed, hermes gateway restart (or systemd native restart) handles orderly handoff via takeover markers, avoiding the ExecStopPost trap.

Backwards compatibility

  • Manual hermes gateway run --replace still works; the flag is only removed from auto-generated service files.
  • Existing installed services will pick up the new unit definition on next hermes gateway install or auto-refresh at boot time.
  • No CLI option was removed; --replace is still present in argparse.

Fixes #23272

Changed files

  • hermes_cli/gateway.py (modified, +3/-4)
  • hermes_cli/gateway_windows.py (modified, +5/-5)
  • tests/hermes_cli/test_gateway_service.py (modified, +7/-6)
  • tests/hermes_cli/test_update_gateway_restart.py (modified, +6/-10)

PR #23350: fix(gateway): skip --replace when running under systemd

Description (problem / solution / changelog)

Ignores the --replace flag when the gateway detects it is running under systemd (via INVOCATION_ID or SYSTEMD_EXEC_PID environment variables). This prevents infinite restart loops caused by systemd's Restart=always policy colliding with the gateway's own process replacement logic, especially when combined with custom ExecStopPost hooks.
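
The detection described here could look roughly like the sketch below. It is an illustration only, not the code from the PR: the function names running_under_systemd and effective_replace are assumptions, and only the two environment variable names come from the description above. Child processes started by a service inherit these variables, so the check is a heuristic rather than proof that systemd is the direct parent.

```python
import os
from typing import Mapping, Optional

# Environment variables named in the PR description.
SYSTEMD_ENV_MARKERS = ("INVOCATION_ID", "SYSTEMD_EXEC_PID")


def running_under_systemd(env: Optional[Mapping[str, str]] = None) -> bool:
    """Heuristic: True if any systemd marker variable is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in SYSTEMD_ENV_MARKERS)


def effective_replace(requested: bool, env: Optional[Mapping[str, str]] = None) -> bool:
    """Honor --replace only when the process does not look systemd-managed."""
    return requested and not running_under_systemd(env)


if __name__ == "__main__":
    # Manual shell: the flag is honored.
    print(effective_replace(True, {}))
    # Under a systemd service: the flag is ignored.
    print(effective_replace(True, {"INVOCATION_ID": "0123abcd"}))
```

Passing the environment in as a parameter keeps the check testable without modifying os.environ.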

Closes #23272

Changed files

  • hermes_cli/gateway.py (modified, +7/-0)
  • scripts/release.py (modified, +1/-0)

Bug Description

Hermes Gateway enters an infinite restart loop when certain conditions are met, causing:

  • Flood of Telegram messages (100+) during restart cycle
  • Gateway unresponsive between restarts
  • Memory pressure from repeated process spawning

Root Causes (two required in combination)

1. Hardcoded --replace in gateway.py (primary)

In hermes_cli/gateway.py, run_gateway() and its internal callers hardcode replace=False, but the CLI entry point at lines 4997-4998 still reads --replace from args and passes it through:

replace = getattr(args, 'replace', False)
run_gateway(verbose, quiet=quiet, replace=replace)

Because the _disable_replace suppression on line 3219 is only a local patch (not upstream), any CLI invocation of hermes gateway run (cron health checks, manual commands) can still SIGTERM an existing instance. The upstream fix needs to live in the CLI layer, not just in the internal call.

2. Systemd ExecStopPost override (trigger)

A systemd drop-in file with ExecStopPost=/bin/kill -9 $MAINPID fires on every service stop — including clean restarts. When the gateway SIGTERMs itself via --replace, the ExecStopPost immediately SIGKILLs the newly spawned instance before it stabilizes.

Result: ~87 second restart cycle → ExecStopPost kills new instance → systemd respawns → repeat.
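
To check whether a host carries the triggering drop-in, something like the following sketch may help. It is a hypothetical helper, not part of Hermes: the name risky_exec_stop_post and the regular expression are assumptions, and it only recognizes the common kill -9, -KILL, and -SIGKILL spellings.

```python
import re

# Matches a kill command that sends SIGKILL, e.g. "/bin/kill -9 $MAINPID".
KILL_9 = re.compile(r"\bkill\b.*\s-(?:9|KILL|SIGKILL)\b")


def risky_exec_stop_post(unit_text: str) -> list[str]:
    """Return ExecStopPost= lines that SIGKILL something on every stop."""
    hits = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("ExecStopPost=") and KILL_9.search(line):
            hits.append(line)
    return hits


# Drop-in content as quoted in the issue.
DROP_IN = """\
[Service]
ExecStopPost=/bin/kill -9 $MAINPID
"""

if __name__ == "__main__":
    for line in risky_exec_stop_post(DROP_IN):
        print("risky:", line)
```

The text to scan could come from the drop-in file itself or from the merged unit definition; how that text is collected is left out of the sketch.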

Current Fix (non-durable)

A local patch to gateway.py forces replace=False internally, but this lives in the local installation and is overwritten on every Hermes Agent update.

Proposed Fix (upstream)

  1. CLI layer: Stop passing --replace through from args by default. Make it truly opt-in.
  2. Documentation: Document that --replace is destructive and should only be used for manual maintenance.
  3. Alternative: Add a confirm flag when --replace is detected, or require HERMES_FORCE_REPLACE=1 env var to activate it.
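
The env-var variant of item 3 could be sketched as follows. This is an illustration of the proposal, not code from the PRs described on this page: the parser, the prog name, and resolve_replace are hypothetical, and only the --replace flag and the HERMES_FORCE_REPLACE=1 variable come from the proposal itself.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence

FORCE_ENV = "HERMES_FORCE_REPLACE"  # variable name taken from the proposal


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser; the real Hermes CLI has many more options."""
    parser = argparse.ArgumentParser(prog="hermes-gateway-sketch")
    parser.add_argument(
        "--replace",
        action="store_true",
        help="destructive: terminate a running gateway before starting",
    )
    return parser


def resolve_replace(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> bool:
    """--replace takes effect only when HERMES_FORCE_REPLACE=1 is also set."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    return args.replace and env.get(FORCE_ENV) == "1"


if __name__ == "__main__":
    print(resolve_replace(["--replace"], {}))                # flag alone is ignored
    print(resolve_replace(["--replace"], {FORCE_ENV: "1"}))  # flag plus env var
```

Requiring both signals means a stray --replace in a cron job or generated unit is inert unless the operator has also opted in through the environment.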

Environment

  • Hermes Agent (latest as of May 2026)
  • Linux (systemd-managed)
  • Telegram platform

Additional Context

RCA from a production incident (May 10 2026): the gateway flooded Telegram with 100+ restart messages over ~90 minutes before the root cause was identified. Two independent bugs were working in concert.
