hermes - Fix bug: hermes gateway restart kills gateway ~50% of the time in non-systemd environments (new process spawned as child of old) [1 pull request]

Error Message

2026-05-29 23:17:21 ERROR [Telegram] Telegram bot token already in use (PID 10). Stop the other gateway first.
2026-05-29 23:17:21 ERROR Gateway hit a non-retryable startup conflict — exiting cleanly.

The "token already in use" error fires because the new gateway starts too quickly, sees the old PID still in the lock file, and aborts rather than waiting.

Root Cause

In hermes_cli/gateway.py, the manual (non-systemd) restart path calls run_gateway() in-process / blocking:

# gateway.py ~line 5602
if stop_profile_gateway():
    print("✓ Stopped gateway for this profile")

_wait_for_gateway_exit(timeout=10.0, force_after=5.0)

print("Starting gateway...")
run_gateway(verbose=0)  # <-- BLOCKING, runs as child of the invoking process

When hermes gateway restart is invoked from inside the running gateway (e.g. via an agent terminal tool call), the restart process is a child of the running gateway. When stop_profile_gateway() sends SIGTERM to the old gateway, the child restart process is taken down with it, along with its pending run_gateway() call, so the gateway dies with no replacement.

The hermes update path already solves this correctly using launch_detached_profile_gateway_restart(), which spawns a detached watcher that waits for the old PID to exit and respawns the gateway in a fully independent process tree. The manual restart path does not use this mechanism.
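To illustrate what "fully independent process tree" can mean on POSIX, here is a minimal sketch of spawning a process in its own session. It is not the implementation of launch_detached_profile_gateway_restart(), which is not shown in this report; spawn_detached is a hypothetical helper name, and the command passed to it would be whatever the watcher needs to run.

```python
import subprocess


def spawn_detached(argv):
    """Start argv in a new session so it is not tied to the caller's process group.

    Hypothetical helper for illustration only; not part of hermes.
    """
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,   # do not share the caller's terminal or pipes
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,     # runs setsid() in the child (POSIX only)
        close_fds=True,
    )
```

A process started this way becomes its own session and process-group leader, so signals sent to the old gateway's process group do not reach it. Whether that alone is enough depends on how the old gateway tears down its children, which this report does not detail.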

Summary

In non-systemd environments (e.g. a Docker container without systemd), hermes gateway restart kills the gateway on roughly every other invocation. The failure is reproducible: about 50% of restarts leave the gateway dead with no replacement.

Evidence from Logs

gateway-shutdown-diag.log captures the process tree at SIGTERM time. The new gateway is a grandchild of the old one:

hermes   10    ...  /opt/hermes/.venv/bin/hermes gateway run          # old gateway (SIGTERM'd)
hermes   39032 ...    \_ /opt/hermes/.venv/bin/hermes gateway restart  # restart call from agent
hermes   39215 ...        \_ /opt/hermes/.venv/bin/hermes gateway run  # new gateway — dies with parent

gateway.log confirms the failure sequence after the old process exits:

2026-05-29 23:17:20 INFO  Starting Hermes Gateway...
2026-05-29 23:17:21 ERROR [Telegram] Telegram bot token already in use (PID 10). Stop the other gateway first.
2026-05-29 23:17:21 ERROR Gateway hit a non-retryable startup conflict — exiting cleanly.

The "token already in use" error fires because the new gateway starts too quickly, sees the old PID still in the lock file, and aborts rather than waiting.
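One way to avoid the "starts too quickly" race is to wait until the PID recorded in the lock file is gone before starting. The sketch below shows that idea with a POSIX signal-0 probe. pid_alive and wait_for_pid_exit are hypothetical names, and this is not how _wait_for_gateway_exit() is implemented in hermes.

```python
import os
import time


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX signal-0 probe)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_for_pid_exit(pid, timeout=10.0, poll_interval=0.1):
    """Poll until pid disappears; return True if it exited within timeout."""
    deadline = time.monotonic() + timeout
    while pid_alive(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Note that an unreaped zombie still counts as alive for this probe, so a caller that is the parent of the old process needs to reap it first.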

Environment

Hermes Agent: v0.15.0 (nousresearch/hermes-agent Docker image)
Platform: Linux container, no systemd (ttyd is PID 1 via tini)
Gateway mode: manual (non-systemd), started from /hermes.sh via nohup hermes gateway run
Trigger: hermes gateway restart called from within an agent terminal tool call

Expected Behavior

hermes gateway restart should reliably replace the running gateway regardless of whether it is invoked from within the gateway's own process tree.

Proposed Fix

The manual (non-systemd) restart path should use launch_detached_profile_gateway_restart() — the same detached watcher pattern the update path already uses — so the new gateway process is spawned outside the old process tree before the old gateway is killed:

# In gateway.py manual (non-systemd) restart path
pid = get_running_pid()
if pid and launch_detached_profile_gateway_restart(current_profile, pid):
    # Detached watcher will respawn gateway once old PID exits
    drained = _graceful_restart_via_sigusr1(pid, drain_timeout=drain_budget)
    if not drained:
        os.kill(pid, signal.SIGTERM)
    print("↻ Gateway restart initiated (detached watcher will respawn)")
else:
    # Fallback: safe only if not called from within the gateway tree
    stop_profile_gateway()
    _wait_for_gateway_exit(timeout=10.0, force_after=5.0)
    run_gateway(verbose=0)
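The fallback branch above is described as safe only when the caller is outside the gateway's process tree. A minimal sketch of how that condition could be checked on Linux follows, by walking parent PIDs through /proc. parent_pid and is_descendant_of are hypothetical helpers, not hermes functions, and the approach assumes /proc is mounted.

```python
import os


def parent_pid(pid):
    """Return the parent PID of pid from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat", "rb") as fh:
            data = fh.read()
    except OSError:
        return None
    # The command name is wrapped in parentheses and may itself contain spaces
    # or parentheses, so parse only what follows the last ')'.
    fields = data[data.rfind(b")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] is the ppid


def is_descendant_of(pid, ancestor_pid):
    """Return True if ancestor_pid appears anywhere in pid's parent chain."""
    seen = set()
    current = parent_pid(pid)
    while current and current not in seen:  # stops at ppid 0 or on a cycle
        if current == ancestor_pid:
            return True
        seen.add(current)
        current = parent_pid(current)
    return False
```

With a check like is_descendant_of(os.getpid(), pid), the restart command could refuse the in-process fallback when it is running underneath the gateway it is about to stop.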
