hermes - ✅(Solved) Fix Gateway service: exit code mismatch causes StartLimitBurst=5 exhaustion on rapid restart [3 pull requests, 2 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#14051 · Fetched 2026-04-23 07:47:09
Comments: 2 · Participants: 2 · Timeline events: 9 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×3, commented ×2

Error Message

Apr 22 23:10:48 XIAOXIN systemd[195]: hermes-gateway.service: Consumed 2.409s CPU time, 56.5M memory peak.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Scheduled restart job, restart counter is at 5.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Start request repeated too quickly.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Failed with result 'exit-code'.

Filed via Hermes Agent

Root Cause

Two issues combine to cause this: an exit code mismatch in run_gateway() and a zombie-process edge case in the --replace logic.

Fix Action

Fixed

PR fix notes

PR #14069: fix(gateway): exit with RestartForceExitStatus code and detect zombie PIDs in --replace

Description (problem / solution / changelog)

Fix two bugs that combine to exhaust systemd's StartLimitBurst=5 on rapid restarts, permanently failing the gateway service.

What changed and why

Fix 1 — exit code mismatch (hermes_cli/gateway.py)

run_gateway() was exiting with code 1 on startup failure. The systemd service file declares RestartForceExitStatus=75, meaning only exit code 75 bypasses StartLimitBurst. Exit code 1 was counted against the burst limit, so 5 fast failures (e.g. PID file races) permanently failed the service. Now exits with GATEWAY_SERVICE_RESTART_EXIT_CODE (75), which was already imported but unused in this path.

Fix 2 — zombie process detection (gateway/run.py)

The --replace wait loop used os.kill(pid, 0) to check if the old gateway was alive. This call succeeds for zombie processes (crashed but not yet reaped by their parent), so a zombie gateway would stall the loop for 10 s, then receive SIGKILL — which has no effect on zombies. Both the old zombie and the new instance would then race to write the PID file, causing FileExistsError → exit with old code 1 → burst counter hit.

Added _is_zombie(pid) that reads /proc/{pid}/status on Linux and checks for State: Z. The wait loop now breaks immediately when the old PID is a zombie, treating it as already gone before proceeding to force-unlink the PID file.

How to test

  1. Start gateway: hermes gateway start
  2. Kill it with SIGKILL: kill -9 <pid>
  3. Verify systemctl --user status hermes-gateway.service shows restart attempts, not a permanent failed state
  4. Run new unit tests: pytest tests/gateway/test_zombie_detection.py tests/hermes_cli/test_gateway_service.py::TestRunGatewayExitCode -v

What platforms tested on

  • macOS on darwin-arm64 (local, zombie detection path returns False and is skipped — systemd path tested via unit tests)
  • Linux/systemd path covered by unit tests with mocked /proc/{pid}/status
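The mocked-/proc approach mentioned above might be sketched as follows. This is a minimal illustration, not the PR's actual test: the `proc_root` parameter, `make_fake_proc`, and `check_fake_proc` are hypothetical names introduced here so the check can run against a temporary directory instead of the real /proc.

```python
import tempfile
from pathlib import Path


def is_zombie(pid: int, proc_root: str = "/proc") -> bool:
    """Return True when <proc_root>/<pid>/status reports state Z (zombie)."""
    try:
        text = Path(proc_root, str(pid), "status").read_text()
    except OSError:
        # No status file: process is gone, or this platform has no /proc.
        return False
    for line in text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return len(fields) > 1 and fields[1] == "Z"
    return False


def make_fake_proc(root: str, pid: int, state_line: str) -> None:
    """Write a minimal fake status file (hypothetical test helper)."""
    pid_dir = Path(root, str(pid))
    pid_dir.mkdir(parents=True)
    (pid_dir / "status").write_text(f"Name:\thermes\n{state_line}\nPid:\t{pid}\n")


def check_fake_proc() -> dict:
    """Probe a zombie, a sleeping process, and a missing PID in a fake /proc."""
    with tempfile.TemporaryDirectory() as root:
        make_fake_proc(root, 101, "State:\tZ (zombie)")
        make_fake_proc(root, 102, "State:\tS (sleeping)")
        return {
            "zombie": is_zombie(101, root),
            "sleeping": is_zombie(102, root),
            "missing": is_zombie(103, root),
        }
```

Passing the /proc root as a parameter is only one way to make the check testable; patching `open` in the unit test would work as well.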

Fixes #14051


Changed files

  • gateway/run.py (modified, +29/-1)
  • hermes_cli/gateway.py (modified, +3/-3)
  • tests/gateway/test_zombie_detection.py (added, +45/-0)
  • tests/hermes_cli/test_gateway_service.py (modified, +25/-0)

PR #14080: fix(gateway): use managed restart exit code on startup failure

Description (problem / solution / changelog)

Summary

  • make hermes gateway run exit with the managed gateway restart code on startup failure
  • align foreground gateway failures with the generated systemd/launchd service units
  • add a regression test that proves run_gateway() now exits 75 instead of 1

Problem

On current main, run_gateway() exits with 1 when startup fails, even though the generated service units set RestartForceExitStatus=75. A minimal repro with start_gateway() returning False currently raises SystemExit(1), which means rapid failures count against StartLimitBurst instead of being treated as managed restarts.

Closes #14051.

Testing

  • pytest -o addopts= tests/hermes_cli/test_gateway_service.py
  • manual repro: monkeypatched start_gateway(...)=False now exits 75 instead of 1
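A regression check along these lines could look like the sketch below. It is a self-contained stand-in, not the repository's test: `run_gateway` here is a simplified copy of the snippet from the issue, `failing_start_gateway` and `exit_code_of` are hypothetical helpers, and the constant is redefined locally with the value 75 stated in the issue rather than imported from gateway.restart.

```python
import asyncio
import sys
from typing import Callable, Optional

# Assumption: mirrors the value the issue gives for the real constant.
GATEWAY_SERVICE_RESTART_EXIT_CODE = 75


async def failing_start_gateway(**kwargs) -> bool:
    """Hypothetical stand-in for start_gateway() that always fails."""
    return False


async def working_start_gateway(**kwargs) -> bool:
    """Hypothetical stand-in for start_gateway() that always succeeds."""
    return True


def run_gateway(start_gateway=failing_start_gateway, replace=False, verbosity=0) -> None:
    """Simplified stand-in for run_gateway(): exit 75 when startup fails."""
    success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
    if not success:
        sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)


def exit_code_of(func: Callable[[], None]) -> Optional[int]:
    """Run func and return the SystemExit code, or None if it returned normally."""
    try:
        func()
    except SystemExit as exc:
        return exc.code
    return None
```

Catching SystemExit directly keeps the sketch free of test-framework dependencies; in a pytest suite, `pytest.raises(SystemExit)` would express the same check.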

Changed files

  • hermes_cli/gateway.py (modified, +4/-3)
  • tests/hermes_cli/test_gateway_service.py (modified, +15/-0)

PR #14260: fix(gateway): use GATEWAY_SERVICE_RESTART_EXIT_CODE instead of hardcoded 1

Description (problem / solution / changelog)

Closes #14051

Changed files

  • gateway/run.py (modified, +1/-1)


Bug Description

When the gateway crashes or is killed unexpectedly, the systemd service enters a restart loop and exhausts StartLimitBurst=5 within seconds, permanently failing even though the actual root cause may be transient.

Root Cause Analysis

Two issues combine to cause this:

1. Exit code mismatch in run_gateway()

In hermes_cli/gateway.py, the run_gateway() function exits with code 1 on failure:

success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
if not success:
    sys.exit(1)  # ← Always exits with 1

But the systemd service file sets:

RestartForceExitStatus=75

This means exit code 1 is counted toward StartLimitBurst=5, while only exit code 75 would bypass the burst limit. When the gateway fails fast (e.g., "PID file race lost" which exits immediately without waiting), systemd cycles through all 5 restarts within seconds.
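To see why fast failures exhaust the limit within seconds, here is a simplified fixed-window model of a start limiter. It is an illustration only and not systemd's implementation; `StartLimiter` and `starts_until_refused` are hypothetical names, and `interval_s` is an assumed value because the issue does not state the unit's StartLimitIntervalSec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartLimiter:
    burst: int = 5            # StartLimitBurst, as given in the issue
    interval_s: float = 60.0  # assumption: not stated in the issue
    window_start: Optional[float] = None
    count: int = 0

    def allow(self, now: float) -> bool:
        """Return True if a start at time `now` is within the limit."""
        if self.window_start is None or now - self.window_start > self.interval_s:
            # First start, or the previous window has expired: open a new window.
            self.window_start = now
            self.count = 1
            return True
        if self.count < self.burst:
            self.count += 1
            return True
        return False  # corresponds to "Start request repeated too quickly"


def starts_until_refused(limiter: StartLimiter, gap_s: float,
                         max_attempts: int = 100) -> Optional[int]:
    """Return the 1-based attempt number that is refused, or None if none is."""
    for attempt in range(1, max_attempts + 1):
        if not limiter.allow((attempt - 1) * gap_s):
            return attempt
    return None
```

Under these assumed values, starts one second apart fill the window after five attempts and the sixth is refused, whereas starts spaced widely enough for the window to reset are never refused.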

2. Zombie process edge case in --replace logic

In gateway/run.py, the --replace wait loop uses os.kill(pid, 0) to check if the old process is still alive:

for _ in range(20):
    try:
        os.kill(existing_pid, 0)  # Signal 0 = existence check
        time.sleep(0.5)
    except (ProcessLookupError, PermissionError):
        break
else:
    # Still alive after 10s — force kill
    terminate_pid(existing_pid, force=True)

os.kill(pid, 0) succeeds for zombie processes (processes that have exited but not yet been wait()ed by their parent). A zombie process:

  • Is still in the process table → os.kill(zombie, 0) succeeds
  • Cannot respond to any signal → SIGTERM/SIGKILL have no effect
  • Can only be reaped by its parent calling wait()

If the old gateway crashes and becomes a zombie, --replace misjudges it as still alive, waits 10 seconds, sends SIGKILL (which has no effect on a zombie), and proceeds, but the zombie PID is still in the kernel process table. Both the old and new instance then race to write the PID file, causing FileExistsError → "PID file race lost".
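The behaviour described above can be reproduced on a POSIX system with a short experiment. The sketch below forks a child that exits immediately, leaves it unreaped, and probes it with signal 0; the helper names (`read_proc_state`, `signal_zero_succeeds`, `demo_zombie_probe`) are hypothetical, and the /proc read assumes Linux (elsewhere the state is reported as None).

```python
import os
import time
from typing import Optional, Tuple


def read_proc_state(pid: int) -> Optional[str]:
    """Return the one-letter state from /proc/<pid>/status, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split()[1]
    except OSError:
        pass
    return None


def signal_zero_succeeds(pid: int) -> bool:
    """Probe the PID with signal 0, as the --replace loop does."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def demo_zombie_probe() -> Tuple[bool, Optional[str], bool]:
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits at once and stays a zombie until the parent waits

    # Give the child up to 5 s to reach the zombie state without reaping it.
    deadline = time.monotonic() + 5
    while read_proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)

    probe_before_reap = signal_zero_succeeds(pid)
    state = read_proc_state(pid)
    os.waitpid(pid, 0)  # reap the child; its process-table entry is released
    probe_after_reap = signal_zero_succeeds(pid)
    return probe_before_reap, state, probe_after_reap
```

Only the parent can reap here, which is why the probe keeps succeeding until `os.waitpid()` runs; a gateway started by systemd would be reaped by the service manager instead.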

Expected Behavior

  1. Gateway should exit with code 75 (or whatever GATEWAY_SERVICE_RESTART_EXIT_CODE is) on startup failure, not 1, so systemd's RestartForceExitStatus=75 correctly bypasses the burst limit for transient failures.
  2. The --replace zombie process check should use os.wait4() or check /proc/{pid}/stat to properly detect zombie state, not rely solely on os.kill(pid, 0).

Environment

  • Hermes Agent: NousResearch/hermes-agent
  • OS: Debian (WSL) / systemd user service
  • Python: 3.x via venv

Reproduction Steps

  1. Start gateway normally: hermes gateway start
  2. Kill gateway with SIGKILL (simulating crash): kill -9 <gateway_pid>
  3. Observe: systemctl --user status hermes-gateway.service shows rapid restart attempts until burst limit is hit
  4. Gateway service enters failed state permanently

Suggested Fix

Fix 1 — In hermes_cli/gateway.py, change run_gateway() to use the correct exit code:

from gateway.restart import GATEWAY_SERVICE_RESTART_EXIT_CODE

success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
if not success:
    sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)  # 75

Fix 2 — In gateway/run.py, add zombie detection in the --replace wait loop:

import os
import time

def is_zombie(pid: int) -> bool:
    # Linux only: a zombie shows "State:  Z (zombie)" in /proc/<pid>/status.
    # Checking that /proc/<pid> exists is not enough, since it exists for
    # every live process, not just zombies.
    try:
        with open(f"/proc/{pid}/status") as f:
            return any(line.startswith("State:") and line.split()[1] == "Z" for line in f)
    except OSError:
        return False  # process already gone, or no /proc on this platform

# In the replace loop:
for _ in range(20):
    try:
        os.kill(existing_pid, 0)
        if is_zombie(existing_pid):
            break  # Zombie won't exit, stop waiting
        time.sleep(0.5)
    except (ProcessLookupError, PermissionError):
        break
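The loop above drops the force-kill fallback from the original code. A sketch that keeps both the zombie short-circuit and the fallback is shown below. `wait_for_old_gateway` and its parameters are hypothetical; the zombie check, the force-kill action, the probe, and the sleep are injected as callables so the control flow can be exercised without real processes.

```python
import os
import time
from typing import Callable


def wait_for_old_gateway(
    pid: int,
    is_zombie: Callable[[int], bool],
    force_kill: Callable[[int], None],
    attempts: int = 20,
    delay_s: float = 0.5,
    probe: Callable[[int, int], None] = os.kill,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Wait for the old process to go away; return 'gone', 'zombie', or 'killed'."""
    for _ in range(attempts):
        try:
            probe(pid, 0)  # signal 0: existence check only
        except (ProcessLookupError, PermissionError):
            return "gone"
        if is_zombie(pid):
            # A zombie never exits on its own and ignores SIGKILL,
            # so stop waiting and skip the force kill.
            return "zombie"
        sleep(delay_s)
    # Still running after attempts * delay_s seconds: fall back to a force kill.
    force_kill(pid)
    return "killed"
```

In the real code the force-kill callable would presumably wrap `terminate_pid(existing_pid, force=True)` from the original loop; returning a status string is a design choice made here so the caller can log which path was taken.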

Logs

Apr 22 23:10:48 XIAOXIN systemd[195]: hermes-gateway.service: Consumed 2.409s CPU time, 56.5M memory peak.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Scheduled restart job, restart counter is at 5.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Start request repeated too quickly.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Failed with result 'exit-code'.

Filed via Hermes Agent

extent analysis

TL;DR

Make run_gateway() exit with GATEWAY_SERVICE_RESTART_EXIT_CODE (75) instead of 1 on startup failure, and make the --replace wait loop detect zombie processes instead of relying solely on os.kill(pid, 0).

Guidance

  • Update the hermes_cli/gateway.py file to use the correct exit code (GATEWAY_SERVICE_RESTART_EXIT_CODE) in the run_gateway() function.
  • Modify the gateway/run.py file to add zombie detection in the --replace wait loop using the is_zombie() function.
  • Verify the changes by testing the gateway startup and restart behavior.
  • Check the systemd service logs to ensure the restart loop is no longer triggered unnecessarily.

Example

The suggested fix provides example code snippets for the necessary changes:

# In hermes_cli/gateway.py
sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)  # 75

# In gateway/run.py
def is_zombie(pid: int) -> bool:
    # Linux only: a zombie shows "State:  Z (zombie)" in /proc/<pid>/status
    try:
        with open(f"/proc/{pid}/status") as f:
            return any(line.startswith("State:") and line.split()[1] == "Z" for line in f)
    except OSError:
        return False  # process already gone, or no /proc on this platform

Notes

The provided fixes assume that GATEWAY_SERVICE_RESTART_EXIT_CODE is defined and set to 75. The is_zombie() check relies on /proc, so it only works on Linux; PR #14069 notes that on macOS the zombie detection path returns False and is skipped.

Recommendation

Apply the suggested fixes to update the run_gateway() function and modify the --replace wait loop to properly detect zombie processes. This should resolve the issue with the systemd service entering a restart loop and exhausting the StartLimitBurst limit.
