hermes: fix(gateway): launchd double-spawn triggers infinite restart death spiral (exit code 1 on instance detection) [1 pull request]

StepCodex · 2026-05-07T22:40:34Z

Error Message

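"Another gateway instance (PID %d) started during our startup. Exiting to avoid double-running." (emitted via logger.error; 299 occurrences in gateway.error.log)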

Root Cause

Commit cbe29db77 added a second PID check during startup (gateway/run.py:14850-14859) that exits with return False when another instance is detected. This was intended to prevent double-running, but the exit code (1) combined with launchd's SuccessfulExit=false creates a death spiral.

The cascade:

1. macOS display wake → launchd enters "on-demand-only mode" → kills existing gateway
2. KeepAlive triggers restart → launchd spawns two instances simultaneously (intermittent race)
3. Instance B detects Instance A via PID file at line 14855 → return False → sys.exit(1)
4. launchd SuccessfulExit=false → auto-restart → detect other instance → exit 1 → loop
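
The loop in the cascade can be modeled in a few lines. This is a simplified sketch, not launchd's implementation: `launchd_would_restart` and `simulate` are hypothetical names, and the model ignores launchd's restart throttling and the timing of the race. It only encodes the `KeepAlive.SuccessfulExit` rule (restart when the exit status does not match the configured condition) and shows why exit code 1 never settles while exit code 0 does.

```python
def launchd_would_restart(exit_code: int, successful_exit: bool = False) -> bool:
    """Simplified model of KeepAlive.SuccessfulExit.

    With SuccessfulExit=false the job is restarted after a non-zero exit;
    with SuccessfulExit=true it is restarted after a zero exit.
    """
    exited_successfully = exit_code == 0
    return exited_successfully == successful_exit


def simulate(exit_code_on_duplicate: int, max_restarts: int = 5) -> int:
    """Count how often the duplicate instance is respawned (capped for the demo).

    Assumes the healthy instance stays up, so every respawned duplicate
    detects it again and exits with the same code.
    """
    restarts = 0
    while restarts < max_restarts:
        if not launchd_would_restart(exit_code_on_duplicate):
            break  # launchd treats the exit as final; the loop ends
        restarts += 1
    return restarts


if __name__ == "__main__":
    print("exit 1 restarts:", simulate(1))  # hits the cap: the death spiral
    print("exit 0 restarts:", simulate(0))  # no restart at all
```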

Fix Action

Fixed

Fixed by PR: fix(gateway): exit cleanly when another instance detected to avoid launchd restart loop (https://github.com/NousResearch/hermes-agent/pull/21555)

Bug Summary

When launchd (macOS) spawns two gateway instances simultaneously (intermittent race on display wake, KeepAlive restart, or update), the second instance detects the first via the PID file at gateway/run.py:14855 and exits with code 1 (return False → sys.exit(1)). Because the plist has KeepAlive.SuccessfulExit=false, launchd interprets exit code 1 as a failure and immediately restarts — creating an infinite loop that spams logs and drains CPU until manually stopped.

Evidence

3 confirmed incidents after May 6 update: May 6 01:37, May 6 02:08, May 7 23:19
196 occurrences of "Another gateway instance started during our startup" in gateway.log
299 occurrences in gateway.error.log
Duplicate log lines in agent.log at 23:19:50 confirm two instances started at the same millisecond
macOS log show confirms display wake triggered the cascade at 23:19:33
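
Counts like the ones above can be reproduced with a short script. This is a minimal sketch: `count_occurrences` is a hypothetical helper, and the log location is whatever your installation uses. Because the logged message includes the PID, the pattern matches `(PID <digits>)` rather than the literal quoted phrase.

```python
import re
from pathlib import Path

# The logged line embeds the other instance's PID, so match it with a pattern.
DUPLICATE_PATTERN = re.compile(
    r"Another gateway instance \(PID \d+\) started during our startup"
)


def count_occurrences(log_path, pattern=DUPLICATE_PATTERN):
    """Return the number of lines in log_path that match pattern (0 if missing)."""
    path = Path(log_path)
    if not path.exists():
        return 0
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if pattern.search(line))
```

Calling `count_occurrences("gateway.log")` and `count_occurrences("gateway.error.log")` from the log directory gives the two figures to compare.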

Reproduction

Hard to reproduce reliably (depends on launchd timing), but on macOS with multiple gateway profile plists:

1. Have 2+ launchd plists with KeepAlive.SuccessfulExit=false
2. Trigger a display wake while gateway is running
3. If launchd double-spawns, the death spiral starts
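
For reference, the relevant plist shape can be generated with Python's standard `plistlib`. This is an illustrative sketch only: the `Label` and the program path are placeholders, not the values Hermes installs. Only `KeepAlive.SuccessfulExit=false` and the `--replace` flag come from this report.

```python
import plistlib

# Placeholder job definition; Label and the program path are hypothetical.
job = {
    "Label": "com.example.gateway-profile-a",
    "ProgramArguments": ["/path/to/gateway-launcher", "--replace"],
    # Restart the job whenever it exits with a non-zero status.
    "KeepAlive": {"SuccessfulExit": False},
}

# plistlib.dumps returns the XML plist as bytes.
plist_bytes = plistlib.dumps(job)

if __name__ == "__main__":
    print(plist_bytes.decode("utf-8"))
```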

Current Code (v0.13.0 / v2026.5.7)

# gateway/run.py:14850-14859
_current_pid = get_running_pid()
if _current_pid is not None and _current_pid != os.getpid():
    logger.error(
        "Another gateway instance (PID %d) started during our startup. "
        "Exiting to avoid double-running.", _current_pid
    )
    return False  # → sys.exit(1) in main() — triggers KeepAlive restart loop

# gateway/run.py:14967-14969
success = asyncio.run(start_gateway(config))
if not success:
    sys.exit(1)

Proposed Fix

Option A (minimal): Change return False to return True (exit 0) when another instance is detected. This tells launchd "clean exit, no restart needed" — the surviving instance continues normally. This is the correct semantics: detecting another healthy instance is not a failure.
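
A sketch of the exit-code semantics Option A is after. It is not the code from the linked PR: `StartupOutcome` and `exit_code_for` are hypothetical names, and the three-state enum is an assumption used here to make explicit that "duplicate detected" and "genuine failure" map to different exit codes.

```python
import enum
import sys


class StartupOutcome(enum.Enum):
    """Hypothetical startup results; the real gateway returns a bool."""
    STARTED = "started"
    DUPLICATE_INSTANCE = "duplicate"  # another healthy instance already runs
    FAILED = "failed"                 # e.g. no platforms connected


def exit_code_for(outcome: StartupOutcome) -> int:
    """Only genuine failures exit non-zero, so launchd restarts only those."""
    return 1 if outcome is StartupOutcome.FAILED else 0


def main(outcome: StartupOutcome) -> None:
    # With KeepAlive.SuccessfulExit=false, exit code 0 ends the restart cycle.
    sys.exit(exit_code_for(outcome))
```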

Option B (robust): Add a startup lock (flock on the PID file or lock file) with jitter backoff. Only one instance acquires the lock; others wait briefly then exit 0 if another instance won.
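
Option B could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's implementation: `acquire_startup_lock`, the lock path, and the retry parameters are all hypothetical. It uses the standard `fcntl.flock` call in non-blocking mode with a small randomized backoff.

```python
import fcntl
import os
import random
import time


def acquire_startup_lock(lock_path, attempts=3, base_delay=0.05):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success (keep it open for the
    lifetime of the process), or None if another instance holds the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    for attempt in range(attempts):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            # Exponential backoff plus jitter so racing instances desynchronize.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    os.close(fd)
    return None
```

An instance that gets `None` back would log the situation and exit with code 0, leaving the lock holder as the only running gateway.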

Option A is sufficient and matches the intent of SuccessfulExit=false — only exit 1 for genuine failures (no platforms connected, unhandled errors), not for "another instance is already handling this."

Environment

macOS (launchd service manager)
Hermes Agent v0.13.0 (v2026.5.7)
Multiple gateway profile plists with KeepAlive.SuccessfulExit=false
The --replace flag is set in the plists, but it only takes effect at the first PID check (line 14626), not the second (line 14855)
