hermes: Generated Linux gateway service can flap forever because ExecStart uses gateway run --replace under Type=simple


Hermes Gateway Service Loop Bug Report Draft

Title

Generated Linux systemd gateway unit for profile-specific Hermes services can flap forever because steady-state ExecStart uses gateway run --replace under Type=simple

Summary

On WSL Ubuntu, the generated user unit for the Friday profile repeatedly restarted roughly every five minutes (consistent with the unit's RestartMaxDelaySec=300) even though the gateway itself was healthy and connected to Slack. Each restart was treated as a planned --replace takeover of the already-running gateway, so the gateway kept disconnecting and reconnecting cleanly instead of remaining as one stable systemd-managed process.

This looks like a service contract bug rather than a Slack, token, or profile-config problem.

Environment

  • Hermes profile: friday
  • Linux host: WSL Ubuntu
  • Service installed via: friday gateway install --force, then friday gateway start
  • Generated unit path: ~/.config/systemd/user/hermes-gateway-friday.service
  • Generated steady-state ExecStart and related [Service] settings:
ExecStart=/home/brandon/.hermes/hermes-agent/venv/bin/python -m hermes_cli.main --profile friday gateway run --replace
Type=simple
Restart=always
RestartSec=60
RestartMaxDelaySec=300
RestartSteps=5

Observed Behavior

  • systemctl --user status hermes-gateway-friday.service showed the service in activating (auto-restart) rather than active (running).
  • The systemd Main PID exited with status=0/SUCCESS.
  • friday gateway status reported that the user gateway service was stopped or restart-pending while also showing a live gateway PID for the profile.
  • gateway.log showed the gateway connecting to Slack successfully, running normally, then later receiving:
Received SIGTERM as a planned --replace takeover - exiting cleanly
  • Immediately after that clean exit, a new PID started, connected to Slack again, and the same cycle repeated.
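The symptoms above can be checked programmatically. The following is a minimal sketch, assuming Python 3.9+ and a systemd user session; the helper names and the heuristic are illustrative and are not part of Hermes. It reads a few standard properties from systemctl show and flags the combination seen in this incident: a unit waiting in auto-restart although its last main process exited with status 0.

```python
import subprocess

UNIT = "hermes-gateway-friday.service"  # unit name taken from the report
PROPS = "ActiveState,SubState,ExecMainStatus,NRestarts,MainPID"


def parse_show_output(text: str) -> dict[str, str]:
    """Turn the KEY=VALUE lines printed by `systemctl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            props[key.strip()] = value.strip()
    return props


def looks_like_clean_exit_loop(props: dict[str, str]) -> bool:
    """Heuristic (an assumption for illustration): the unit is pending an
    automatic restart even though the last main process exited with 0."""
    return (
        props.get("ActiveState") == "activating"
        and props.get("SubState") == "auto-restart"
        and props.get("ExecMainStatus") == "0"
    )


def read_unit_props(unit: str = UNIT) -> dict[str, str]:
    """Query the user service manager; needs a running systemd user session."""
    result = subprocess.run(
        ["systemctl", "--user", "show", unit, f"--property={PROPS}"],
        capture_output=True, text=True, check=True,
    )
    return parse_show_output(result.stdout)


# Hand-written sample shaped like `systemctl show` output (not captured data)
SAMPLE = """\
ActiveState=activating
SubState=auto-restart
ExecMainStatus=0
NRestarts=12
MainPID=0
"""
print(looks_like_clean_exit_loop(parse_show_output(SAMPLE)))  # True
```

On an affected host, looks_like_clean_exit_loop(read_unit_props()) performs the same check against the live unit. A steadily growing NRestarts value alongside ExecMainStatus=0 points at a lifecycle problem rather than a crash.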

Expected Behavior

The generated systemd unit should produce one stable, foreground, service-managed main process.

  • systemctl --user status should remain active (running).
  • The gateway should stay connected until a real stop, restart, or failure occurs.
  • A healthy service should not repeatedly perform planned takeovers of itself.

Actual Behavior

The generated systemd unit repeatedly launched a fresh gateway run --replace process.

Each new process then treated the existing live gateway as something to replace, sent a planned takeover signal, and restarted the session even though the prior gateway was healthy.

Likely Root Cause

The generated Linux service definition appears to be using the wrong runtime mode for steady-state service ownership.

Facts from the incident:

  • the generated unit used gateway run --replace
  • the generated unit used Type=simple
  • the systemd-tracked main process did not remain the long-lived gateway process
  • the long-lived gateway process continued running long enough to be seen by the next launch and then got cleanly replaced

That means one of two closely related things is happening:

  1. gateway run --replace is not safe as the normal steady-state ExecStart for a service because it is takeover-oriented by design, or
  2. the current command path is spawning or handing off to a child process in a way that violates Type=simple expectations for the service manager.

Either way, the steady-state Linux service command should not be the replace-oriented startup path.
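A simplified model may help explain why a main process that exits with status=0/SUCCESS still keeps the loop going. The sketch below is an illustration only and deliberately considers exit codes alone. Real systemd also takes signals, timeouts, watchdog events, and settings such as SuccessExitStatus= into account.

```python
def should_restart(restart_policy: str, exit_code: int) -> bool:
    """Simplified model of the Restart= decision based on exit code only.

    Real systemd also considers signals, timeouts and watchdog events.
    """
    if restart_policy == "always":
        return True  # restarts even after a clean exit
    if restart_policy == "on-failure":
        return exit_code != 0  # a clean exit ends the service
    if restart_policy == "no":
        return False
    raise ValueError(f"policy not modelled here: {restart_policy}")


# Generated unit: Restart=always, main process exited with status 0
print(should_restart("always", 0))      # True
# Same clean exit under Restart=on-failure
print(should_restart("on-failure", 0))  # False
# Nonzero exit under Restart=on-failure
print(should_restart("on-failure", 1))  # True
```

Under Restart=always every clean exit schedules another launch, and each launch in the generated unit runs the takeover-oriented gateway run --replace again.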

Local Workaround That Stabilized Friday

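The following user-unit override stopped the restart loop locally. The empty ExecStart= line clears the generated command so that the second ExecStart= replaces it instead of being added to it: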

[Service]
ExecStart=
ExecStart=/home/brandon/.hermes/hermes-agent/venv/bin/python /home/brandon/.hermes/hermes-agent/scripts/hermes-gateway run
Restart=on-failure
RestartSec=30

After reloading systemd and restarting the service:

  • hermes-gateway-friday.service became active (running)
  • systemd tracked one stable main PID
  • the gateway log showed normal Slack connection and steady runtime
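The report does not say how the override was installed. One common approach is a drop-in file next to the generated unit, followed by a daemon reload and a restart. The sketch below assumes that layout; the override.conf file name follows the systemctl edit convention, and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

UNIT = "hermes-gateway-friday.service"  # unit name taken from the report

# Override body copied from the report; the empty ExecStart= clears the
# generated command before the replacement is set.
OVERRIDE = """\
[Service]
ExecStart=
ExecStart=/home/brandon/.hermes/hermes-agent/venv/bin/python /home/brandon/.hermes/hermes-agent/scripts/hermes-gateway run
Restart=on-failure
RestartSec=30
"""


def write_override(user_unit_dir: Path, unit: str = UNIT, body: str = OVERRIDE) -> Path:
    """Write <unit>.d/override.conf under the given user unit directory."""
    dropin_dir = user_unit_dir / f"{unit}.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / "override.conf"
    path.write_text(body)
    return path


def reload_and_restart(unit: str = UNIT) -> None:
    """Make systemd re-read unit files, then restart the service."""
    subprocess.run(["systemctl", "--user", "daemon-reload"], check=True)
    subprocess.run(["systemctl", "--user", "restart", unit], check=True)
```

Typical use would be write_override(Path.home() / ".config/systemd/user") followed by reload_and_restart(). Paths in the override are specific to the reporter's machine and would need adjusting elsewhere.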

This workaround is useful operationally, but it should not be the final upstream answer: the standalone script currently appears not to propagate start_gateway() failure as a nonzero exit code, so under Restart=on-failure a failed start could look like a clean exit and never be retried.
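For Restart=on-failure to be meaningful, the process has to end with a nonzero status when startup fails. The sketch below shows one way an entry point could do that. It is not the actual hermes-gateway script: the real start_gateway() signature and return value are not shown in the report, so the stub here is a hypothetical stand-in that returns True on a clean run.

```python
import sys
from typing import Callable


def start_gateway() -> bool:
    """Hypothetical stand-in for the real start_gateway().

    Assumption: True means the gateway ran and stopped cleanly.
    """
    return False  # simulate a startup failure


def main(start: Callable[[], bool] = start_gateway) -> int:
    """Map the gateway result onto a process exit status."""
    try:
        ok = start()
    except Exception as exc:  # an unexpected crash is also a failure
        print(f"gateway crashed: {exc}", file=sys.stderr)
        return 1
    return 0 if ok else 1
```

A script would end with sys.exit(main()) so that the service manager sees the failure and applies its restart policy.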

Proposed Upstream Direction

  • Add a dedicated service-safe gateway run mode for Linux systemd use.
  • Generate Linux units against that service-safe mode, not gateway run --replace.
  • Reserve replace/takeover behavior for explicit restart operations and manual-process replacement flows.

Notes

This bug is easy to misdiagnose because the gateway looks healthy from the platform side. Slack auth, Socket Mode, and channel directory setup all succeeded repeatedly. The failure was in service ownership and lifecycle semantics, not platform connectivity.
