hermes - 💡(How to fix) Fix Bug: Gateway zombie process after /restart under systemd


Bug Report: Gateway zombie process after /restart under systemd

Summary

When the gateway runs under systemd and receives a /restart command from a messaging channel, the process hangs in a zombie-like state: the gateway completes its teardown, but the Python process never exits. (It is not a zombie in the kernel sense of a defunct, unreaped process; the PID stays alive with sleeping threads.) All messaging platform connections are lost, and the gateway silently stops processing messages.

Environment

  • OS: CentOS 7 (glibc 2.17)
  • Python: 3.11.15
  • Hermes version: 71772ac7d (latest from upstream)
  • Service manager: systemd 219
  • Installation: hermes gateway service install (auto-generated unit file)

Steps to Reproduce

  1. Install gateway as systemd service (hermes gateway service install)
  2. Start the service (systemctl start hermes-gateway)
  3. Wait for all platforms to connect
  4. Send /restart from any messaging channel (e.g., Feishu, QQBot)

Expected Behavior

  • Gateway process exits with code 75
  • systemd restarts the gateway automatically
  • New process connects all messaging platforms
  • User receives "Gateway online" notification

Actual Behavior

  1. Gateway logs "Gateway stopped (total teardown Xs)"
  2. Gateway logs "Cron ticker stopped"
  3. Gateway logs "Periodic memory monitoring stopped"
  4. Process does NOT exit — stays alive with 4 threads, all in sleeping state
  5. Main thread stuck in ep_poll (asyncio event loop still running)
  6. systemd sees PID is alive → does NOT restart
  7. All messaging platform connections are lost
  8. Gateway becomes completely non-functional (no messages sent/received)
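
The thread states and wait channel mentioned in items 4 and 5 can be read from /proc on Linux. The following is a minimal sketch and not part of Hermes: the helper names are made up for illustration, and it assumes the wchan file is readable, which may require elevated privileges on some kernels.

```python
from pathlib import Path


def parse_stat_state(stat_text: str) -> str:
    """Return the one-letter scheduler state from a /proc/.../stat line."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split after the LAST ')' before taking the state field.
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def thread_states(pid: int) -> dict[int, tuple[str, str]]:
    """Map each thread ID of `pid` to (state, wait channel). Linux only."""
    states = {}
    for task in sorted(Path(f"/proc/{pid}/task").iterdir()):
        state = parse_stat_state((task / "stat").read_text())
        try:
            wchan = (task / "wchan").read_text().strip() or "-"
        except OSError:
            wchan = "?"  # not readable with the current privileges
        states[int(task.name)] = (state, wchan)
    return states
```

Calling thread_states() with the PID of the hung gateway would be expected to report state S for every thread, with a wait channel such as ep_poll on the main thread, if the process is in the condition described above.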

Root Cause Analysis

Two issues combine to cause this:

Issue 1: --replace flag in systemd ExecStart

The auto-generated systemd unit file uses:

ExecStart=... gateway run --replace

The --replace flag (added in PR #576 to fix restart loops) is redundant under systemd because systemd's Restart=always and RestartForceExitStatus=75 settings already manage the process lifecycle. The flag's kill-and-replace logic conflicts with systemd's own restart mechanism.
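
To check whether an installed unit file combines --replace with systemd's own restart directives, a small parser is enough. This is a sketch with a hypothetical function name (audit_unit); it reads plain unit-file text line by line and does not handle line continuations or drop-in overrides.

```python
def audit_unit(unit_text: str) -> dict[str, object]:
    """Summarise the directives relevant to this issue from unit-file text."""
    found = {"uses_replace": False, "Restart": None, "RestartForceExitStatus": None}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";", "[")):
            continue  # skip blank lines, comments, and section headers
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value directive
        key, value = key.strip(), value.strip()
        if key == "ExecStart" and "--replace" in value.split():
            found["uses_replace"] = True
        elif key in ("Restart", "RestartForceExitStatus"):
            found[key] = value
    return found
```

A result with uses_replace set to True together with Restart set to "always" is the combination this report identifies as problematic.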

Issue 2: SystemExit(75) does not exit the process

When /restart is triggered:

  1. request_restart(via_service=True) is called (correct: systemd is detected via /run/systemd/system)
  2. _exit_code is set to GATEWAY_SERVICE_RESTART_EXIT_CODE (75) (correct)
  3. SystemExit(75) is raised (correct)
  4. The process does not exit; the asyncio event loop appears to prevent a clean shutdown

The event loop appears to have pending callbacks or tasks that catch or block SystemExit propagation. The main thread remains in ep_poll, which indicates that the loop is still running even though the shutdown sequence has completed.
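
One mechanism that would produce this symptom, offered as a hypothesis rather than a confirmed diagnosis of Hermes: asyncio.run() cancels all remaining tasks during its cleanup phase and waits for them to finish, so a task that swallows CancelledError holds the process in the event loop for as long as it keeps running. The sketch below uses hypothetical coroutine names and a bounded delay so that it terminates.

```python
import asyncio
import time


async def stubborn(delay: float) -> None:
    """Background task that ignores cancellation for `delay` seconds."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # Swallowing the cancellation keeps the task alive; asyncio.run()
        # cannot finish its cleanup until this coroutine returns.
        await asyncio.sleep(delay)


async def main(delay: float) -> None:
    task = asyncio.create_task(stubborn(delay))  # keep a reference to the task
    await asyncio.sleep(0)  # yield once so the background task starts
    raise SystemExit(75)


def run_and_time(delay: float = 0.5) -> tuple[int, float]:
    """Return the exit code and how long asyncio.run() took to unwind."""
    start = time.monotonic()
    try:
        asyncio.run(main(delay))
    except SystemExit as exc:
        return exc.code, time.monotonic() - start
    raise AssertionError("SystemExit was expected")
```

In this sketch the exit code still arrives, but only after the stubborn task finishes. A task that never returns after being cancelled would keep the main thread polling indefinitely.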

Impact

  • Severity: High — gateway becomes completely non-functional
  • Reproducibility: 100% (every /restart under systemd)
  • Workaround: Remove --replace from systemd ExecStart, use systemctl restart instead of /restart

Suggested Fix

Short-term: Remove --replace from the systemd unit template in hermes_cli/gateway.py (lines 2212 and 2250):

# Before:
ExecStart={python_path} -m hermes_cli.main gateway run --replace

# After:
ExecStart={python_path} -m hermes_cli.main gateway run

This prevents the zombie issue because systemd's Restart=always handles process lifecycle. The --replace flag remains available for manual CLI usage (hermes gateway run --replace).

Long-term: Investigate why SystemExit(75) does not exit the process. The asyncio event loop should propagate the exception cleanly. This may require:

  • Checking for non-daemon threads blocking process exit
  • Reviewing task cancellation in _stop_impl()
  • Ensuring asyncio.run() cleanup phase completes
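
The first item can be illustrated in isolation. In CPython, interpreter shutdown waits for every non-daemon thread, so SystemExit raised in the main thread does not end the process until those threads return. The sketch below is self-contained and unrelated to the Hermes code base; it starts a child interpreter with one non-daemon thread and measures how long the exit takes.

```python
import subprocess
import sys
import time

# Child program: one non-daemon thread sleeps for a second while the main
# thread raises SystemExit(75) immediately.
CHILD = r"""
import threading, time
threading.Thread(target=time.sleep, args=(1.0,)).start()  # non-daemon by default
raise SystemExit(75)
"""


def run_child() -> tuple[int, float]:
    """Run the child and return (exit code, seconds until it exited)."""
    start = time.monotonic()
    proc = subprocess.run([sys.executable, "-c", CHILD])
    return proc.returncode, time.monotonic() - start
```

The child exits with the requested code only once the thread has finished. If the thread blocked forever, the process would stay alive, which is why listing live non-daemon threads (for example with threading.enumerate()) just before the exit is a reasonable first check.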

Related

  • PR #576 (ee5daba06) — introduced --replace to fix systemd restart loops
  • The change in PR #576 resolved the restart loops but introduced the zombie issue under /restart

Workaround (Applied)

We have applied a local fix on our deployment:

  1. Edited the systemd unit file (/etc/systemd/system/hermes-gateway.service), changing ExecStart from gateway run --replace to gateway run
  2. Force-killed the zombie process (PID 12551)
  3. Restarted the service; all 4 platforms connected successfully

This workaround resolves the issue for our deployment. The fix only affects the auto-generated systemd unit file; hermes gateway run --replace still works for manual CLI usage.
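
For deployments that prefer to script the unit-file change, the edit can be expressed as a small text transformation. This is a sketch with a hypothetical function name (strip_replace_flag); it operates on the unit file's text only and leaves reading and writing the file to the caller.

```python
def strip_replace_flag(unit_text: str) -> str:
    """Return unit-file text with --replace removed from ExecStart lines."""
    out = []
    for line in unit_text.splitlines(keepends=True):
        if line.lstrip().startswith("ExecStart="):
            body = line.rstrip("\r\n")
            ending = line[len(body):]  # preserve the original line ending
            key, _, cmd = body.partition("=")
            # Drop only the exact --replace token; keep all other arguments.
            args = [arg for arg in cmd.split(" ") if arg != "--replace"]
            line = key + "=" + " ".join(args) + ending
        out.append(line)
    return "".join(out)
```

After writing the modified text back to the unit file, systemd has to re-read it (systemctl daemon-reload) before the service is restarted. The function is idempotent, so running it against an already-fixed unit leaves the text unchanged.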

Additional Context

  • The gateway logs to ~/.hermes/profiles/fengfeng/logs/gateway.log (profile-specific path)
  • journald is configured with Storage=none on this system, so journalctl shows no gateway output
  • After removing --replace, all 4 platforms (Home Assistant, Feishu, Weixin, QQBot) connected successfully
