Fix bug: hermes gateway restart kills gateway ~50% of the time in non-systemd environments (new process spawned as child of old) [1 pull request]

Fix Action

Fixed

Summary

In non-systemd environments (e.g. a Docker container without systemd), hermes gateway restart kills the gateway on roughly half of all invocations. The failure is systematic and reproducible: in those cases the old gateway dies and no replacement comes up.

Root Cause

In hermes_cli/gateway.py, the manual (non-systemd) restart path calls run_gateway() in-process, as a blocking call:

# gateway.py ~line 5602
if stop_profile_gateway():
    print("✓ Stopped gateway for this profile")

_wait_for_gateway_exit(timeout=10.0, force_after=5.0)

print("Starting gateway...")
run_gateway(verbose=0)  # <-- BLOCKING, runs as child of the invoking process

When hermes gateway restart is invoked from inside the running gateway (e.g. via an agent terminal tool call), the restart process is a child of the running gateway. When stop_profile_gateway() sends SIGTERM to the old gateway, the child restart process goes down with it, along with its pending run_gateway() call, so the gateway dies with no replacement.
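To make the parent/child relationship concrete, here is a minimal Python sketch that walks a PID-to-parent map and reports whether one process sits below another. The function name is_descendant and the PARENTS map are illustrative only and are not part of Hermes; the PIDs are the ones from the process tree reported in this issue.

```python
from typing import Mapping


def is_descendant(pid: int, ancestor: int, parent_of: Mapping[int, int]) -> bool:
    """Return True if `ancestor` appears on the parent chain of `pid`."""
    seen = set()  # guards against a malformed map that contains a cycle
    current = parent_of.get(pid)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


# PID -> parent PID, copied from the process tree in this issue.
# The parent of PID 10 is not recorded there, so it is left out.
PARENTS = {39215: 39032, 39032: 10}

print(is_descendant(39032, 10, PARENTS))  # restart command vs. old gateway
print(is_descendant(39215, 10, PARENTS))  # new gateway vs. old gateway
print(is_descendant(10, 39032, PARENTS))  # old gateway vs. restart command
```

A check of this kind could in principle tell the restart command that it is running inside the tree it is about to terminate; whether Hermes performs such a check is not stated in this issue.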

The hermes update path already solves this correctly using launch_detached_profile_gateway_restart(), which spawns a detached watcher that waits for the old PID to exit and respawns the gateway in a fully independent process tree. The manual restart path does not use this mechanism.
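The general POSIX pattern behind a detached watcher can be sketched as follows. This is not the implementation of launch_detached_profile_gateway_restart(), which is not shown in this issue; the helper names pid_alive, wait_for_exit, and spawn_detached are hypothetical, and the sketch assumes a POSIX system.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Best-effort liveness check using signal 0 (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_exit(pid: int, timeout: float = 10.0, poll: float = 0.1) -> bool:
    """Poll until `pid` is gone; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while pid_alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


def spawn_detached(argv: list[str]) -> subprocess.Popen:
    """Start `argv` in a new session, outside the caller's process group."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # the child calls setsid() before exec
        close_fds=True,
    )
```

A watcher built on these helpers would call wait_for_exit(old_pid) and then spawn_detached(["hermes", "gateway", "run"]). Note that pid_alive() also reports True for an exited process that its parent has not yet reaped, so the watcher should not be the parent of the process it waits on.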

Evidence from Logs

gateway-shutdown-diag.log captures the process tree at SIGTERM time — the new gateway is a grandchild of the old one:

hermes   10    ...  /opt/hermes/.venv/bin/hermes gateway run          # old gateway (SIGTERM'd)
hermes   39032 ...    \_ /opt/hermes/.venv/bin/hermes gateway restart  # restart call from agent
hermes   39215 ...        \_ /opt/hermes/.venv/bin/hermes gateway run  # new gateway — dies with parent

gateway.log confirms the failure sequence after the old process exits:

2026-05-29 23:17:20 INFO  Starting Hermes Gateway...
2026-05-29 23:17:21 ERROR [Telegram] Telegram bot token already in use (PID 10). Stop the other gateway first.
2026-05-29 23:17:21 ERROR Gateway hit a non-retryable startup conflict — exiting cleanly.

The "token already in use" error fires because the new gateway starts too quickly, sees the old PID still in the lock file, and aborts rather than waiting.
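One way to wait instead of aborting is to poll the lock until the recorded PID is gone. The sketch below assumes a lock file that contains only a bare PID; the real lock format used by Hermes is not shown in this issue, and read_lock_pid, lock_is_held, and wait_for_lock are hypothetical names.

```python
import os
import time
from pathlib import Path
from typing import Optional


def read_lock_pid(lock_path: Path) -> Optional[int]:
    """Return the PID stored in the lock file, or None if missing or invalid."""
    try:
        pid = int(lock_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    return pid if pid > 0 else None


def lock_is_held(lock_path: Path) -> bool:
    """True if the lock names a process that still exists (POSIX only)."""
    pid = read_lock_pid(lock_path)
    if pid is None:
        return False
    try:
        os.kill(pid, 0)  # existence check only, no signal is delivered
    except ProcessLookupError:
        return False  # stale lock: the recorded PID is gone
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_lock(lock_path: Path, timeout: float = 10.0, poll: float = 0.2) -> bool:
    """Wait until the lock is free; return False if it is still held at timeout."""
    deadline = time.monotonic() + timeout
    while lock_is_held(lock_path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

With a helper like this, a starting gateway could treat "lock held by a process that is already shutting down" as a retryable condition for a bounded time rather than exiting immediately.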

Environment

  • Hermes Agent: v0.15.0 (nousresearch/hermes-agent Docker image)
  • Platform: Linux container, no systemd (ttyd is PID 1 via tini)
  • Gateway mode: manual (non-systemd), started from /hermes.sh via nohup hermes gateway run
  • Trigger: hermes gateway restart called from within an agent terminal tool call
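For readers who want to confirm which kind of environment they are in, the following sketch mirrors the check documented for sd_booted(3): systemd creates the directory /run/systemd/system when it is the running init. The function name systemd_is_running is hypothetical, and how Hermes itself chooses between its systemd and manual paths is not shown in this issue.

```python
from pathlib import Path


def systemd_is_running(marker: Path = Path("/run/systemd/system")) -> bool:
    """Return True if the systemd runtime directory exists on this host."""
    return marker.is_dir()


if systemd_is_running():
    print("systemd detected")
else:
    print("no systemd: the manual (non-systemd) path described here applies")
```

In a container like the one described above, where tini is PID 1, this check is expected to report that systemd is absent.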

Expected Behavior

hermes gateway restart should reliably replace the running gateway regardless of whether it is invoked from within the gateway's own process tree.

Proposed Fix

The manual (non-systemd) restart path should use launch_detached_profile_gateway_restart() — the same detached watcher pattern the update path already uses — so the new gateway process is spawned outside the old process tree before the old gateway is killed:

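# In gateway.py manual (non-systemd) restart path.
# Needs os and signal imported; current_profile and drain_budget are assumed
# to be defined by the surrounding code.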
pid = get_running_pid()
if pid and launch_detached_profile_gateway_restart(current_profile, pid):
    # Detached watcher will respawn gateway once old PID exits
    drained = _graceful_restart_via_sigusr1(pid, drain_timeout=drain_budget)
    if not drained:
        os.kill(pid, signal.SIGTERM)
    print("↻ Gateway restart initiated (detached watcher will respawn)")
else:
    # Fallback: safe only if not called from within the gateway tree
    stop_profile_gateway()
    _wait_for_gateway_exit(timeout=10.0, force_after=5.0)
    run_gateway(verbose=0)
