hermes - ✅(Solved) Fix Gateway service: exit code mismatch causes StartLimitBurst=5 exhaustion on rapid restart [3 pull requests, 2 comments, 2 participants]

hermes, 2026-04-22 15:54:53

GitHub stats

NousResearch/hermes-agent#14051 • Fetched 2026-04-23 07:47:09

Author

BerryGB9

Participants

alt-glitch

BerryGB9

Timeline (top)

labeled ×4, cross-referenced ×3, commented ×2

Error Message

Apr 22 23:10:48 XIAOXIN systemd[195]: hermes-gateway.service: Consumed 2.409s CPU time, 56.5M memory peak.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Scheduled restart job, restart counter is at 5.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Start request repeated too quickly.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Failed with result 'exit-code'.

Filed via Hermes Agent

Root Cause

Two issues combine to cause this: an exit code mismatch in run_gateway() and a zombie process edge case in the --replace logic.

Fix Action

Fixed

Fixed by PR: fix(gateway): exit with RestartForceExitStatus code and detect zombie PIDs in --replace (https://github.com/NousResearch/hermes-agent/pull/14069)
Fixed by PR: fix(gateway): use managed restart exit code on startup failure (https://github.com/NousResearch/hermes-agent/pull/14080)
Fixed by PR: fix(gateway): use GATEWAY_SERVICE_RESTART_EXIT_CODE instead of hardcoded 1 (https://github.com/NousResearch/hermes-agent/pull/14260)

PR fix notes

PR #14069: fix(gateway): exit with RestartForceExitStatus code and detect zombie PIDs in --replace

Repository: NousResearch/hermes-agent
Author: konsisumer
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/14069

Description (problem / solution / changelog)

Fix two bugs that combine to exhaust systemd's StartLimitBurst=5 on rapid restarts, permanently failing the gateway service.

What changed and why

Fix 1 — exit code mismatch (hermes_cli/gateway.py)

run_gateway() was exiting with code 1 on startup failure. The systemd service file declares RestartForceExitStatus=75, meaning only exit code 75 bypasses StartLimitBurst. Exit code 1 was counted against the burst limit, so 5 fast failures (e.g. PID file races) permanently failed the service. Now exits with GATEWAY_SERVICE_RESTART_EXIT_CODE (75), which was already imported but unused in this path.

Fix 2 — zombie process detection (gateway/run.py)

The --replace wait loop used os.kill(pid, 0) to check if the old gateway was alive. This call succeeds for zombie processes (crashed but not yet reaped by their parent), so a zombie gateway would stall the loop for 10 s, then receive SIGKILL — which has no effect on zombies. Both the old zombie and the new instance would then race to write the PID file, causing FileExistsError → exit with old code 1 → burst counter hit.

Added _is_zombie(pid) that reads /proc/{pid}/status on Linux and checks for State: Z. The wait loop now breaks immediately when the old PID is a zombie, treating it as already gone before proceeding to force-unlink the PID file.
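The check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names status_reports_zombie and is_zombie_sketch are hypothetical, and the parsing is split out so it can be exercised without a real zombie process. It assumes the Linux /proc/{pid}/status format, where the State: line starts with a one-letter state code.

```python
from pathlib import Path


def status_reports_zombie(status_text: str) -> bool:
    """Return True if a /proc/<pid>/status body has a State: line with code Z."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # Example line: "State:\tZ (zombie)"
            return line.split()[1:2] == ["Z"]
    return False


def is_zombie_sketch(pid: int) -> bool:
    """Hypothetical helper: best-effort zombie check, False when /proc is unavailable."""
    try:
        return status_reports_zombie(Path(f"/proc/{pid}/status").read_text())
    except OSError:
        # Process already gone, or no /proc on this platform (e.g. macOS).
        return False
```

Returning False on any OSError keeps the helper safe to call on non-Linux platforms, which matches the PR note that the zombie path is skipped on macOS.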

How to test

1. Start gateway: hermes gateway start
2. Kill it with SIGKILL: kill -9 <pid>
3. Verify systemctl --user status hermes-gateway.service shows restart attempts, not a permanent failed state
4. Run new unit tests: pytest tests/gateway/test_zombie_detection.py tests/hermes_cli/test_gateway_service.py::TestRunGatewayExitCode -v

What platforms tested on

macOS on darwin-arm64 (local, zombie detection path returns False and is skipped — systemd path tested via unit tests)
Linux/systemd path covered by unit tests with mocked /proc/{pid}/status

Fixes #14051

Changed files

gateway/run.py (modified, +29/-1)
hermes_cli/gateway.py (modified, +3/-3)
tests/gateway/test_zombie_detection.py (added, +45/-0)
tests/hermes_cli/test_gateway_service.py (modified, +25/-0)

PR #14080: fix(gateway): use managed restart exit code on startup failure

Repository: NousResearch/hermes-agent
Author: LeonSGP43
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/14080

Description (problem / solution / changelog)

Summary

make hermes gateway run exit with the managed gateway restart code on startup failure
align foreground gateway failures with the generated systemd/launchd service units
add a regression test that proves run_gateway() now exits 75 instead of 1

Problem

On current main, run_gateway() exits with 1 when startup fails, even though the generated service units set RestartForceExitStatus=75. A minimal repro with start_gateway() returning False currently raises SystemExit(1), which means rapid failures count against StartLimitBurst instead of being treated as managed restarts.

Closes #14051.

Testing

pytest -o addopts= tests/hermes_cli/test_gateway_service.py
manual repro: monkeypatched start_gateway(...)=False now exits 75 instead of 1
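The manual repro can be approximated in isolation as below. This is a sketch under stated assumptions: run_gateway_sketch and failing_start_gateway are hypothetical stand-ins for the real run_gateway() and start_gateway(), and the constant is hardcoded to 75 here only because that is the value quoted in the issue.

```python
import asyncio
import sys

# Assumption: mirrors the value the issue quotes for RestartForceExitStatus.
GATEWAY_SERVICE_RESTART_EXIT_CODE = 75


async def failing_start_gateway() -> bool:
    """Hypothetical stub that simulates a startup failure."""
    return False


def run_gateway_sketch() -> None:
    """Hypothetical stand-in for run_gateway() with the fix applied."""
    success = asyncio.run(failing_start_gateway())
    if not success:
        sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)


def exit_code_on_startup_failure() -> int:
    """Run the sketch and report the exit code carried by SystemExit."""
    try:
        run_gateway_sketch()
    except SystemExit as exc:
        return exc.code
    return 0
```

Catching SystemExit is how a regression test would typically observe the code without terminating the test runner.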

Changed files

hermes_cli/gateway.py (modified, +4/-3)
tests/hermes_cli/test_gateway_service.py (modified, +15/-0)

PR #14260: fix(gateway): use GATEWAY_SERVICE_RESTART_EXIT_CODE instead of hardcoded 1

Repository: NousResearch/hermes-agent
Author: ms-alan
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/14260

Description (problem / solution / changelog)

Closes #14051

Changed files

gateway/run.py (modified, +1/-1)

Bug Description

When the gateway crashes or is killed unexpectedly, the systemd service enters a restart loop and exhausts StartLimitBurst=5 within seconds, permanently failing even though the actual root cause may be transient.

Root Cause Analysis

Two issues combine to cause this:

1. Exit code mismatch in `run_gateway()`

In hermes_cli/gateway.py, the run_gateway() function exits with code 1 on failure:

success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
if not success:
    sys.exit(1)  # ← Always exits with 1

But the systemd service file sets:

RestartForceExitStatus=75

This means exit code 1 is counted toward StartLimitBurst=5, while only exit code 75 would bypass the burst limit. When the gateway fails fast (e.g., "PID file race lost" which exits immediately without waiting), systemd cycles through all 5 restarts within seconds.

2. Zombie process edge case in `--replace` logic

In gateway/run.py, the --replace wait loop uses os.kill(pid, 0) to check if the old process is still alive:

for _ in range(20):
    try:
        os.kill(existing_pid, 0)  # Signal 0 = existence check
        time.sleep(0.5)
    except (ProcessLookupError, PermissionError):
        break
else:
    # Still alive after 10s — force kill
    terminate_pid(existing_pid, force=True)

os.kill(pid, 0) succeeds for zombie processes (processes that have exited but not yet been wait()ed by their parent). A zombie process:

Is still in the process table → os.kill(zombie, 0) succeeds
Cannot respond to any signal → SIGTERM/SIGKILL have no effect
Can only be reaped by its parent calling wait()

If the old gateway crashes and becomes a zombie, --replace will misjudge it as still alive, wait 10 seconds, then SIGKILL (which has no effect on zombie), and proceed — but the zombie PID is still in the kernel process table. Both the old and new instance then race to write the PID file, causing FileExistsError → "PID file race lost".
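The claim that signal 0 succeeds against a zombie can be checked with a small POSIX-only experiment. This is an illustrative sketch, not project code: the function name is hypothetical, and it relies on os.fork(), so it does not run on Windows.

```python
import os
import time


def signal_zero_succeeds_on_zombie() -> bool:
    """Fork a child that exits at once, leave it unreaped, then probe it with signal 0."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately; it remains a zombie until reaped
    time.sleep(0.2)  # parent: give the child time to exit (it is not reaped yet)
    try:
        os.kill(pid, 0)  # signal 0 performs only the existence/permission check
        return True
    except ProcessLookupError:
        return False
    finally:
        os.waitpid(pid, 0)  # reap the child so the zombie does not linger
```

The function returns True because the unreaped child still occupies a process-table entry, which is why an os.kill(pid, 0) loop alone cannot tell a zombie from a live gateway.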

Expected Behavior

Gateway should exit with code 75 (or whatever GATEWAY_SERVICE_RESTART_EXIT_CODE is) on startup failure, not 1, so systemd's RestartForceExitStatus=75 correctly bypasses the burst limit for transient failures.
The --replace zombie process check should use os.wait4() or check /proc/{pid}/stat to properly detect zombie state, not rely solely on os.kill(pid, 0).
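For the /proc/{pid}/stat option, one detail worth illustrating is that the second field (the command name) is wrapped in parentheses and may itself contain spaces or parentheses, so a naive split() can misplace the state field. A minimal parsing sketch, with the hypothetical name stat_state and assuming the Linux stat layout, might look like this:

```python
def stat_state(stat_text: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat body, or "" if malformed."""
    # Split on the LAST ')' so a command name such as "my (odd) name" is skipped whole.
    _, sep, rest = stat_text.rpartition(")")
    if not sep:
        return ""
    fields = rest.split()
    return fields[0] if fields else ""
```

A caller would treat a result of "Z" as a zombie. Reading the file and handling OSError for a vanished process or a missing /proc is left out of the sketch.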

Environment

Hermes Agent: NousResearch/hermes-agent
OS: Debian (WSL) / systemd user service
Python: 3.x via venv

Reproduction Steps

1. Start gateway normally: hermes gateway start
2. Kill gateway with SIGKILL (simulating crash): kill -9 <gateway_pid>
3. Observe: systemctl --user status hermes-gateway.service shows rapid restart attempts until burst limit is hit
4. Gateway service enters failed state permanently

Suggested Fix

Fix 1 — In hermes_cli/gateway.py, change run_gateway() to use the correct exit code:

from gateway.restart import GATEWAY_SERVICE_RESTART_EXIT_CODE

success = asyncio.run(start_gateway(replace=replace, verbosity=verbosity))
if not success:
    sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)  # 75

Fix 2 — In gateway/run.py, add zombie detection in the --replace wait loop:

import os
import time

def is_zombie(pid: int) -> bool:
    """Return True if /proc/{pid}/status reports state Z (Linux only)."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    # Example line: "State:\tZ (zombie)"
                    return line.split()[1:2] == ["Z"]
    except OSError:
        pass  # process already gone, or no /proc on this platform
    return False

# In the replace loop:
for _ in range(20):
    try:
        os.kill(existing_pid, 0)
        if is_zombie(existing_pid):
            break  # Zombie won't exit, stop waiting
        time.sleep(0.5)
    except (ProcessLookupError, PermissionError):
        break

Logs

Apr 22 23:10:48 XIAOXIN systemd[195]: hermes-gateway.service: Consumed 2.409s CPU time, 56.5M memory peak.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Scheduled restart job, restart counter is at 5.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Start request repeated too quickly.
Apr 22 23:11:18 XIAOXIN systemd[195]: hermes-gateway.service: Failed with result 'exit-code'.

Filed via Hermes Agent

extent analysis

TL;DR

To fix the issue, update the run_gateway() function to exit with the correct code and modify the --replace wait loop to properly detect zombie processes.

Guidance

Update the hermes_cli/gateway.py file to use the correct exit code (GATEWAY_SERVICE_RESTART_EXIT_CODE) in the run_gateway() function.
Modify the gateway/run.py file to add zombie detection in the --replace wait loop using the is_zombie() function.
Verify the changes by testing the gateway startup and restart behavior.
Check the systemd service logs to ensure the restart loop is no longer triggered unnecessarily.

Example

The suggested fix provides example code snippets for the necessary changes:

# In hermes_cli/gateway.py
sys.exit(GATEWAY_SERVICE_RESTART_EXIT_CODE)  # 75

# In gateway/run.py
def is_zombie(pid: int) -> bool:
    """Return True if /proc/{pid}/status reports state Z (Linux only)."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    return line.split()[1:2] == ["Z"]
    except OSError:
        pass  # process already gone, or no /proc on this platform
    return False

Notes

The provided fixes assume that GATEWAY_SERVICE_RESTART_EXIT_CODE is defined and equals 75, matching RestartForceExitStatus=75 in the service file. Additionally, is_zombie() depends on /proc, so it can only detect anything on Linux; on platforms without /proc (such as macOS) it returns False.

Recommendation

Apply the suggested fixes to update the run_gateway() function and modify the --replace wait loop to properly detect zombie processes. This should resolve the issue with the systemd service entering a restart loop and exhausting the StartLimitBurst limit.
