hermes - fix(gateway): launchd double-spawn triggers infinite restart death spiral (exit code 1 on instance detection)

Fix Action

Fixed


Bug Summary

When launchd (macOS) spawns two gateway instances simultaneously (an intermittent race on display wake, KeepAlive restart, or update), the second instance detects the first via the PID file at gateway/run.py:14855 and exits with code 1 (return False → sys.exit(1)). Because the plist sets KeepAlive.SuccessfulExit=false, launchd treats exit code 1 as a failure and restarts the job immediately, creating an infinite loop that spams the logs and burns CPU until it is stopped manually.

Root Cause

Commit cbe29db77 added a second PID check during startup (gateway/run.py:14850-14859) that returns False when another instance is detected, which main() turns into sys.exit(1). This was intended to prevent double-running, but exit code 1 combined with launchd's SuccessfulExit=false creates a death spiral.

The cascade:

  1. macOS display wake → launchd enters "on-demand-only mode" → kills existing gateway
  2. KeepAlive triggers restart → launchd spawns two instances simultaneously (intermittent race)
  3. Instance B detects Instance A via PID file at line 14855 → return False → sys.exit(1)
  4. launchd SuccessfulExit=false → auto-restart → detect other instance → exit 1 → loop
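
The cascade above can be modelled in a few lines. This is only a toy sketch: launchd_restarts and count_spawns are hypothetical names, not launchd or Hermes APIs, and the model assumes the first instance stays healthy the whole time. It illustrates why exit code 1 keeps the loop alive while exit code 0 ends it.

```python
def launchd_restarts(exit_code: int, successful_exit: bool = False) -> bool:
    """Toy model of KeepAlive.SuccessfulExit: with False, restart only after a non-zero exit."""
    exited_cleanly = exit_code == 0
    return exited_cleanly if successful_exit else not exited_cleanly


def count_spawns(exit_code_when_duplicate: int, max_spawns: int = 10) -> int:
    """Count spawns of the duplicate instance while the first instance keeps running."""
    spawns = 1  # the initial double-spawned duplicate
    while spawns < max_spawns and launchd_restarts(exit_code_when_duplicate):
        spawns += 1  # the respawned duplicate finds the PID file again and exits the same way
    return spawns


print(count_spawns(exit_code_when_duplicate=1))  # reaches the cap; nothing else stops the loop
print(count_spawns(exit_code_when_duplicate=0))  # one spawn, then launchd leaves the job alone
```

The max_spawns cap exists only so the simulation terminates; the real loop has no such bound.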

Evidence

  • 3 confirmed incidents after May 6 update: May 6 01:37, May 6 02:08, May 7 23:19
  • 196 occurrences of "Another gateway instance started during our startup" in gateway.log
  • 299 occurrences in gateway.error.log
  • Duplicate log lines in agent.log at 23:19:50 confirm two instances started at the same millisecond
  • macOS log show confirms display wake triggered the cascade at 23:19:33
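
Counts like the ones above could be gathered with a short script. This is a sketch under assumptions: log paths are passed on the command line, and the search string is the fixed tail of the logged message, because the real log line carries a PID between "instance" and "started".

```python
import sys
from pathlib import Path

# Fixed part of the logged message; the PID varies per line, so it is left out.
MESSAGE = "started during our startup"


def count_occurrences(log_path: str, needle: str = MESSAGE) -> int:
    """Count the lines in log_path that contain needle."""
    with Path(log_path).open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


if __name__ == "__main__":
    # Usage: python count_spiral.py <path to gateway.log> <path to gateway.error.log>
    for path in sys.argv[1:]:
        print(path, count_occurrences(path))
```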

Reproduction

Hard to reproduce reliably because it depends on launchd timing. On macOS with multiple gateway profile plists:

  1. Have 2+ launchd plists with KeepAlive.SuccessfulExit=false
  2. Trigger a display wake while gateway is running
  3. If launchd double-spawns, the death spiral starts

Current Code (v0.13.0 / v2026.5.7)

# gateway/run.py:14850-14859
_current_pid = get_running_pid()
if _current_pid is not None and _current_pid != os.getpid():
    logger.error(
        "Another gateway instance (PID %d) started during our startup. "
        "Exiting to avoid double-running.", _current_pid
    )
    return False  # → sys.exit(1) in main() — triggers KeepAlive restart loop

# gateway/run.py:14967-14969
success = asyncio.run(start_gateway(config))
if not success:
    sys.exit(1)

Proposed Fix

Option A (minimal): Change return False to return True (exit 0) when another instance is detected. This tells launchd "clean exit, no restart needed" — the surviving instance continues normally. This is the correct semantics: detecting another healthy instance is not a failure.
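
One way to make Option A's intent explicit is to separate "duplicate instance" from "genuine failure" before choosing the exit code. The sketch below is illustrative only: StartupOutcome and exit_code_for are hypothetical names, not part of the Hermes codebase, and the actual change proposed in Option A is just the one-line return value swap.

```python
from enum import Enum


class StartupOutcome(Enum):
    """Hypothetical startup outcomes; not taken from the Hermes codebase."""
    OK = "ok"
    DUPLICATE = "duplicate"  # another healthy instance already owns the PID file
    FAILED = "failed"        # genuine failure, e.g. no platforms connected


def exit_code_for(outcome: StartupOutcome) -> int:
    """Map only genuine failures to a non-zero exit, so launchd restarts only those."""
    return 1 if outcome is StartupOutcome.FAILED else 0


for outcome in StartupOutcome:
    print(outcome.name, exit_code_for(outcome))
```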

Option B (robust): Add a startup lock (flock on the PID file or lock file) with jitter backoff. Only one instance acquires the lock; others wait briefly then exit 0 if another instance won.
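
A minimal sketch of what Option B could look like, assuming a dedicated lock file and Python's fcntl.flock (available on macOS and Linux). The function name, lock path, attempt count, and jitter bound are all hypothetical choices, not values from the Hermes codebase.

```python
import fcntl
import os
import random
import time


def acquire_startup_lock(lock_path: str, attempts: int = 3, max_jitter: float = 0.2):
    """Try to take an exclusive non-blocking flock.

    Returns the open file descriptor on success, or None if another holder keeps the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    for _ in range(attempts):
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd  # keep it open for the life of the process; closing it drops the lock
        except BlockingIOError:
            # Random jitter so simultaneously spawned instances do not retry in lockstep.
            time.sleep(random.uniform(0, max_jitter))
    os.close(fd)
    return None
```

A caller would exit 0 when the function returns None, since losing the race to a healthy instance is not a failure. The kernel releases the lock when the holding process exits, so a crashed instance cannot leave a stale lock behind the way a stale PID file can.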

Option A is sufficient and matches the intent of SuccessfulExit=false — only exit 1 for genuine failures (no platforms connected, unhandled errors), not for "another instance is already handling this."

Environment

  • macOS (launchd service manager)
  • Hermes Agent v0.13.0 (v2026.5.7)
  • Multiple gateway profile plists with KeepAlive.SuccessfulExit=false
  • --replace flag is set in plists but only runs at the first PID check (line 14626), not the second (line 14855)
