hermes - ✅(Solved) Fix [Bug]: /restart can leave zombie process — _stop_impl cancels the _run_restart background task mid-execution [2 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#12875 · Fetched 2026-04-20 12:16:28
Comments: 0
Participants: 1
Timeline: 3
Reactions: 0
Timeline (top): cross-referenced ×1, referenced ×1, renamed ×1

Root Cause

The bug is a race condition in the restart → stop → shutdown code path:

  1. request_restart() (line 1880-1881) creates _run_restart as a background task and adds it to self._background_tasks:

    task = asyncio.create_task(_run_restart())
    self._background_tasks.add(task)
    task.add_done_callback(self._background_tasks.discard)
  2. _run_restart (line 1876-1878) calls stop(), which creates _stop_impl as _stop_task and awaits it:

    self._stop_task = asyncio.create_task(_stop_impl())
    await self._stop_task
  3. _stop_impl (lines 2570-2574) cancels ALL tasks in _background_tasks except _stop_task:

    for _task in list(self._background_tasks):
        if _task is self._stop_task:
            continue
        _task.cancel()
    self._background_tasks.clear()
  4. The _run_restart task IS in _background_tasks and IS cancelled. At that moment it is suspended on await self._stop_task, and asyncio forwards a cancellation request to the task being awaited, so the CancelledError is delivered to _stop_task and interrupts _stop_impl mid-cleanup at its next await.

  5. If _stop_impl is cancelled before reaching self._shutdown_event.set() (line 2583), wait_for_shutdown() blocks forever and the process hangs as a zombie.

  6. Even if _stop_impl finishes before the cancellation takes effect, _run_restart is still cancelled, so the _exit_code = 75 assignment (line 2649) may not execute if the cancellation propagates first. The gateway then exits with code 0, which matches the symptom in #11258.
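
The zombie path in steps 1-5 can be reproduced with plain asyncio. What follows is a toy model, not the gateway code: the local names (background_tasks, state, stop_impl, run_restart) are stand-ins chosen for illustration, and the single asyncio.sleep(0) stands in for whatever asynchronous cleanup the real _stop_impl performs after the sweep.

```python
import asyncio


async def reproduce_zombie():
    # Toy stand-ins for the gateway attributes named in the issue.
    background_tasks = set()
    state = {"stop_task": None, "shutdown_set": False, "exit_code": 0}

    async def stop_impl():
        # Same shape as the sweep in the issue: only the stop task is skipped.
        for task in list(background_tasks):
            if task is state["stop_task"]:
                continue
            task.cancel()  # cancels the restart task, which forwards to us
        background_tasks.clear()
        await asyncio.sleep(0)  # stand-in for the remaining async cleanup
        state["shutdown_set"] = True  # not reached in this run
        state["exit_code"] = 75  # not reached in this run

    async def run_restart():
        state["stop_task"] = asyncio.create_task(stop_impl())
        await state["stop_task"]

    restart = asyncio.create_task(run_restart())
    background_tasks.add(restart)
    restart.add_done_callback(background_tasks.discard)

    # return_exceptions=True captures the cancellation instead of re-raising it.
    await asyncio.gather(restart, return_exceptions=True)
    return (
        restart.cancelled(),
        state["stop_task"].cancelled(),
        state["shutdown_set"],
        state["exit_code"],
    )


result = asyncio.run(reproduce_zombie())
print(result)  # (True, True, False, 0)
```

Both tasks end up cancelled and the shutdown flag is never set, which is the condition under which a waiter on the shutdown event would block indefinitely.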

Fix Action

Fixed

PR fix notes

PR #12946: fix(gateway): prevent /restart zombie by excluding _restart_task from stop cancellation

Description (problem / solution / changelog)

Summary

Fixes #12875.

When /restart fires under systemd, the gateway can hang as a zombie process or exit with code 0 — either way preventing systemd from restarting the service.

Root Cause

request_restart() creates _run_restart as a background task and adds it to _background_tasks. _run_restart then calls stop(), which creates _stop_impl. _stop_impl cancels everything in _background_tasks except _stop_task itself, but the _run_restart task is also in that set, so it gets cancelled mid-execution:

  • Zombie: _stop_impl is cancelled before reaching _shutdown_event.set() → wait_for_shutdown() blocks forever
  • Silent exit 0: cancellation lands after _stop_impl finishes but before _exit_code = 75 is assigned → systemd Restart=on-failure doesn't trigger

Fix

Two minimal changes:

1. request_restart() — store the task reference in self._restart_task (cleared via done-callback after completion):

task = asyncio.create_task(_run_restart())
self._restart_task = task                                          # ← new
self._background_tasks.add(task)
task.add_done_callback(self._background_tasks.discard)
task.add_done_callback(lambda _: setattr(self, "_restart_task", None))  # ← new

2. _stop_impl() cancellation loop — exclude _restart_task, symmetric with the existing _stop_task exclusion:

# before
if _task is self._stop_task:
    continue
# after
if _task is self._stop_task or _task is self._restart_task:
    continue
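
To see the two changes working together, here is a minimal sketch under stated assumptions. MiniGateway is a hypothetical, stripped-down class and not the real gateway; the idempotency guard on _restart_task, the placement of the exit-code assignment, and the sleeping worker task are all illustrative choices. Only the task-tracking lines and the sweep condition mirror the snippets above.

```python
import asyncio

RESTART_EXIT_CODE = 75  # value quoted in the issue text


class MiniGateway:
    """Hypothetical, stripped-down stand-in for the real gateway class."""

    def __init__(self):
        self._background_tasks = set()
        self._stop_task = None
        self._restart_task = None
        self._shutdown_event = asyncio.Event()
        self._exit_code = 0

    def request_restart(self):
        if self._restart_task is not None:
            return False  # assumed guard: a second call is a no-op
        task = asyncio.create_task(self._run_restart())
        self._restart_task = task
        self._background_tasks.add(task)
        task.add_done_callback(self._background_tasks.discard)
        task.add_done_callback(lambda _: setattr(self, "_restart_task", None))
        return True

    async def _run_restart(self):
        self._stop_task = asyncio.create_task(self._stop_impl())
        await self._stop_task

    async def _stop_impl(self):
        for task in list(self._background_tasks):
            # Skip both the stop task and the restart task that awaits it.
            if task is self._stop_task or task is self._restart_task:
                continue
            task.cancel()
        self._background_tasks.clear()
        await asyncio.sleep(0)  # stand-in for async cleanup
        self._shutdown_event.set()
        self._exit_code = RESTART_EXIT_CODE


async def main():
    gw = MiniGateway()
    worker = asyncio.create_task(asyncio.sleep(3600))  # unrelated background work
    gw._background_tasks.add(worker)
    first = gw.request_restart()
    second = gw.request_restart()
    await asyncio.wait_for(gw._shutdown_event.wait(), timeout=1)
    await asyncio.sleep(0)
    await asyncio.gather(worker, return_exceptions=True)
    return first, second, gw._exit_code, worker.cancelled(), gw._restart_task


outcome = asyncio.run(main())
print(outcome)  # (True, False, 75, True, None)
```

In this model the unrelated worker is still cancelled by the sweep, the shutdown event is set, the exit code reaches 75, and _restart_task returns to None once the restart task completes.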

Changes

File | Change
gateway/run.py | Add _restart_task class field; store task ref in request_restart(); exclude from _stop_impl sweep
tests/gateway/restart_test_helpers.py | Initialise _restart_task = None in test helper
tests/gateway/test_restart_zombie.py | 6 new tests covering the regression

Test plan

  • pytest tests/gateway/test_restart_zombie.py — 6 passed
  • pytest tests/gateway/test_restart_drain.py tests/gateway/test_restart_notification.py — 22 passed, no regressions
  • _restart_task is None before restart, populated after request_restart(), cleared after completion
  • _restart_task is NOT cancelled by the _stop_impl sweep
  • Second request_restart() call remains a no-op (idempotent guard unchanged)

Changed files

  • gateway/run.py (modified, +4/-1)
  • tests/gateway/restart_test_helpers.py (modified, +1/-0)
  • tests/gateway/test_restart_zombie.py (added, +156/-0)

Related: #11258 (same symptom), #12438, PR #8150 (different root causes)

Bug Description

When /restart is triggered under systemd (via_service=True), the gateway can hang as a zombie process — alive but disconnected from all platforms, never reaching SystemExit(75). This prevents systemd from restarting the service. In other timing scenarios, the process exits cleanly with code 0, which also prevents restart under Restart=on-failure (matching the symptom in #11258).

Steps to Reproduce

  1. Run the gateway under systemd (hermes gateway install with Restart=on-failure)
  2. Trigger a /restart from a live messaging session (Discord or Telegram)
  3. Observe the gateway logs: all platforms disconnect, "Gateway stopped" is logged
  4. The process either hangs as a zombie (alive but disconnected) or exits with code 0

Expected Behavior

/restart under systemd should reliably exit with code 75 (GATEWAY_SERVICE_RESTART_EXIT_CODE) so systemd restarts the service.

Actual Behavior

Two failure modes depending on timing:

  • Zombie: Process hangs alive but disconnected. wait_for_shutdown() blocks forever because _shutdown_event.set() is never called (cancelled before reaching it).
  • Clean exit: Process exits with code 0 instead of 75. systemd does not restart (Restart=on-failure only restarts on non-zero exit). This matches #11258.
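
The two failure modes differ only in which piece of state is missing when the main function waits for shutdown. The sketch below is a hypothetical model of that wait, not the gateway's main function: observe_exit, the timeout used to detect a hang, and the dictionary holding the exit code are assumptions made for illustration.

```python
import asyncio

GATEWAY_SERVICE_RESTART_EXIT_CODE = 75


async def observe_exit(shutdown_event, exit_code_holder, timeout=0.05):
    """Return the exit status the process would report, or "hang"."""
    try:
        await asyncio.wait_for(shutdown_event.wait(), timeout)
    except asyncio.TimeoutError:
        return "hang"  # the real wait has no timeout and would block forever
    return exit_code_holder["code"]


async def scenarios():
    results = {}

    # Zombie: the shutdown event is never set.
    never_set = asyncio.Event()
    results["zombie"] = await observe_exit(never_set, {"code": 0})

    # Clean exit: the event is set, but the exit code was never updated.
    set_only = asyncio.Event()
    set_only.set()
    results["clean_exit"] = await observe_exit(set_only, {"code": 0})

    # Expected: the event is set and the restart exit code was assigned.
    set_with_code = asyncio.Event()
    set_with_code.set()
    results["expected"] = await observe_exit(
        set_with_code, {"code": GATEWAY_SERVICE_RESTART_EXIT_CODE}
    )
    return results


modes = asyncio.run(scenarios())
print(modes)
```

Only the third scenario produces a non-zero status, which is what Restart=on-failure needs in order to restart the service.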

Proposed Fix

Exclude _run_restart from the cancellation loop in _stop_impl, the same way _stop_task is already excluded. Two options:

Option A — Don't add _run_restart to _background_tasks at all (it self-terminates):

# request_restart() — remove lines 1881-1882
task = asyncio.create_task(_run_restart())
# self._background_tasks.add(task)      # REMOVE
# task.add_done_callback(self._background_tasks.discard)  # REMOVE

Option B — Skip it in the cancel loop:

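for _task in list(self._background_tasks):
    if _task is self._stop_task:
        continue
    # _run_restart names the coroutine function, which is never identical to a
    # Task object, so compare against the stored task instead. self._restart_task
    # is the attribute used by PR #12946 and must be set in request_restart().
    if _task is self._restart_task:
        continue
    _task.cancel()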

Option A is cleaner — _run_restart is a self-terminating orchestration task that doesn't need lifecycle management by the shutdown machinery.
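
One caveat applies to Option A as written: the asyncio documentation notes that the event loop keeps only weak references to tasks, so a task that is not stored anywhere can be garbage-collected before it finishes. A minimal sketch of Option A that keeps a strong reference without putting the task in _background_tasks might look like the following. RestartHolder, _forget, and the run_restart argument are hypothetical names used only for this illustration.

```python
import asyncio


class RestartHolder:
    """Hypothetical owner object; only the reference-keeping pattern matters."""

    def __init__(self):
        self._background_tasks = set()
        self._restart_task = None  # assumed attribute name

    def request_restart(self, run_restart):
        task = asyncio.create_task(run_restart())
        # Option A: not added to _background_tasks, so a sweep over that set
        # never sees this task.
        self._restart_task = task  # strong reference keeps the task alive
        task.add_done_callback(self._forget)
        return task

    def _forget(self, task):
        # Drop the reference once the task has completed.
        if self._restart_task is task:
            self._restart_task = None


async def demo():
    holder = RestartHolder()
    ran = []

    async def run_restart():
        await asyncio.sleep(0)
        ran.append("restart finished")

    task = holder.request_restart(run_restart)
    tracked = task in holder._background_tasks
    await task
    await asyncio.sleep(0)  # give the done-callback a turn
    return tracked, ran, holder._restart_task


option_a = asyncio.run(demo())
print(option_a)  # (False, ['restart finished'], None)
```

The task stays out of the sweep entirely, yet something still owns it for as long as it runs.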

Code References

  • gateway/run.py:1868-1883 - request_restart() and _run_restart
  • gateway/run.py:2473-2657 - stop() and _stop_impl
  • gateway/run.py:2570-2574 - background task cancellation loop
  • gateway/run.py:2583 - _shutdown_event.set() (may be skipped)
  • gateway/run.py:2649 - _exit_code = GATEWAY_SERVICE_RESTART_EXIT_CODE (may be skipped)
  • gateway/run.py:10896-10917 - main function: wait_for_shutdown() → SystemExit(75)

Affected Component

  • Gateway (Telegram/Discord/Slack/WhatsApp)

Messaging Platform

  • Discord
  • Telegram

Debug Report

Report: https://paste.rs/fR2iW
agent.log: https://paste.rs/8ewKw
gateway.log: https://paste.rs/mXHp4

Operating System

Linux 5.15.185-tegra aarch64 (Jetson Orin Nano, Ubuntu 22.04)

Python Version

3.11.15

Hermes Version

v0.10.0 (2026.4.16)

extent analysis

TL;DR

The most likely fix is to exclude _run_restart from the cancellation loop in _stop_impl to prevent it from being cancelled and causing the gateway to hang or exit with code 0.

Guidance

  • Identify the lines of code responsible for cancelling tasks in _stop_impl and modify them to exclude _run_restart, as proposed in Option A or Option B.
  • Verify that _run_restart is not added to _background_tasks or is skipped in the cancellation loop to prevent its cancellation.
  • Test the fix by triggering a /restart from a live messaging session and observing the gateway logs to ensure it exits with code 75 and systemd restarts the service.
  • Review the provided code references to understand the affected components and functions.

Notes

The proposed fix assumes that excluding _run_restart from the cancellation loop will resolve the issue. However, additional testing and verification may be necessary to ensure the fix works as expected in all scenarios.

Recommendation

Apply Option A as the proposed fix, as it is considered cleaner and more straightforward. This option removes the need to manage the lifecycle of _run_restart by the shutdown machinery.
