hermes - Windows: Gateway crashes ~400ms after boot (stale planned-stop marker triggers false UNKNOWN exit)

Bug Description

When a new Gateway instance starts, the _run_planned_stop_watcher thread (introduced in PR #33798 for #33778) fires the exit handler as soon as .gateway-planned-stop.json exists — without checking whether the marker actually targets the current process. If the marker was written for a PREVIOUS Gateway instance (different PID), the new Gateway terminates within ~400ms, logging "Received UNKNOWN — initiating shutdown."

This is a regression from PR #33798, which added the watcher thread but omitted PID validation.

Steps to Reproduce

  1. On Windows, start Gateway: hermes gateway run (let it boot fully, note PID)
  2. Stop it via hermes gateway stop (writes marker with target PID)
  3. Immediately restart: hermes gateway run
  4. Observe: Gateway boots, connects platforms, then dies with "Received UNKNOWN" within ~500ms

Expected Behavior

The watcher should only fire exit when the marker targets the CURRENT process (matching target_pid). Stale markers should be silently cleaned up.

Actual Behavior

Gateway crashes immediately after boot:

14:33:19,423  Gateway running with 2 platform(s)
14:33:19,815  Received UNKNOWN — initiating shutdown     ← 392ms later

14:44:54,999  ✓ telegram connected
14:44:55,295  Received UNKNOWN — initiating shutdown     ← 296ms later

The exit context shows signal=UNKNOWN, which is what _signal_name() returns when the signal is None, and the parent_pid values differ across crashes.
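The delays quoted in the log excerpts can be checked with a few lines of Python. This is only a small sketch: the helper name ms_between is made up for illustration, and it assumes both timestamps fall on the same day and use the HH:MM:SS,mmm format shown above.

```python
from datetime import datetime, timedelta


def ms_between(start: str, end: str) -> int:
    """Whole milliseconds between two 'HH:MM:SS,mmm' log timestamps."""
    fmt = "%H:%M:%S,%f"  # %f accepts the 3-digit millisecond field
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta // timedelta(milliseconds=1)


first_crash_ms = ms_between("14:33:19,423", "14:33:19,815")
second_crash_ms = ms_between("14:44:54,999", "14:44:55,295")
```

Both gaps are well under half a second, which is consistent with a watcher that reacts to a marker already on disk rather than to a signal sent after boot.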

Root Cause

In gateway/run.py:_run_planned_stop_watcher() (~line 18374), the watcher fires the handler whenever marker_path.exists() is True, without reading the marker to check if target_pid == os.getpid(). The handler then calls consume_planned_stop_marker_for_self(), which correctly identifies the PID mismatch and returns False — but by then the process is already stopping, and the handler treats it as an unexpected exit.

Key code path:

  • gateway/run.py:_run_planned_stop_watcher() — fires on ANY marker (the bug)
  • gateway/shutdown_forensics.py:_signal_name(None) returns "UNKNOWN"
  • gateway/status.py:_consume_pid_marker_for_self() — deletes marker even on mismatch, returns False
  • gateway/status.py:_PLANNED_STOP_MARKER_FILENAME = ".gateway-planned-stop.json"
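The interaction between the two steps can be modelled in isolation. The sketch below is not the Hermes code: watcher_fires and consume_marker_for_self are simplified, hypothetical stand-ins for the behaviour described above, and the marker is assumed to be a JSON object with only a target_pid field.

```python
import json
import os
import tempfile
from pathlib import Path


def watcher_fires(marker_path: Path) -> bool:
    # Reported behaviour: the marker's existence alone triggers the exit handler.
    return marker_path.exists()


def consume_marker_for_self(marker_path: Path, pid: int) -> bool:
    # Stand-in for the consume step: True only when the marker targets `pid`.
    # The marker is removed either way, as the report describes.
    try:
        record = json.loads(marker_path.read_text(encoding="utf-8"))
        matches = int(record["target_pid"]) == pid
    except (OSError, ValueError, TypeError, KeyError):
        matches = False
    marker_path.unlink(missing_ok=True)
    return matches


def simulate_boot_with_stale_marker() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / ".gateway-planned-stop.json"
        stale_pid = os.getpid() + 1  # any PID other than this process
        marker.write_text(json.dumps({"target_pid": stale_pid}), encoding="utf-8")
        fired = watcher_fires(marker)  # shutdown starts here
        planned = consume_marker_for_self(marker, os.getpid())  # too late to matter
        return {
            "watcher_fired": fired,
            "planned_stop": planned,
            "marker_left": marker.exists(),
        }
```

In this model the watcher fires, the consume step reports that the stop was not planned for this process, and the marker is gone afterwards. That matches the reported symptom: a shutdown that has already begun but is classified as unexpected.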

Impact

  • Gateway cannot stay up on Windows if a stale marker exists
  • Creates an unrecoverable crash loop (watchdog restarts → same crash)
  • Telegram polling sessions are not properly closed, which leads to storms of "Conflict: terminated by other getUpdates" errors
  • 21 Gateway starts logged in a single day on the affected host

Proposed Fix

Add PID validation in the watcher before calling the exit handler:

# In _run_planned_stop_watcher, before loop.call_soon_threadsafe:
record = _read_json_file(marker_path)
target_matches = False
if isinstance(record, dict):
    try:
        target_pid = int(record.get("target_pid", -1))
        target_matches = (target_pid == os.getpid())
    except (TypeError, ValueError):
        pass  # malformed target_pid: treat the marker as not ours
if not target_matches:
    # Stale marker (written for another PID): delete it and keep running
    marker_path.unlink(missing_ok=True)
    logger.debug("Planned-stop watcher: ignoring stale marker")
else:
    # Marker targets this process: hand off to the exit handler
    loop.call_soon_threadsafe(exit_handler, None)
    break

(This also requires adding from gateway.status import _read_json_file to the imports in gateway/run.py.)
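For testing the idea outside the Gateway, the same check can be written as a self-contained polling loop. This is a minimal sketch under stated assumptions, not the actual watcher: the names marker_targets_pid and watch_for_planned_stop, the on_stop callback, the stop_event parameter, and the poll interval are all hypothetical, and the marker is read with the standard json module instead of _read_json_file.

```python
import json
import os
import threading
from pathlib import Path


def marker_targets_pid(marker_path: Path, pid: int) -> bool:
    """True only if the marker is readable JSON whose target_pid equals pid."""
    try:
        record = json.loads(marker_path.read_text(encoding="utf-8"))
        return int(record["target_pid"]) == pid
    except (OSError, ValueError, TypeError, KeyError):
        return False


def watch_for_planned_stop(
    marker_path: Path,
    on_stop,
    stop_event: threading.Event,
    poll_interval: float = 0.05,
) -> None:
    """Poll for the marker; call on_stop only when it targets this process."""
    while not stop_event.wait(poll_interval):
        if not marker_path.exists():
            continue
        if marker_targets_pid(marker_path, os.getpid()):
            on_stop()
            return
        # Stale or unreadable marker: remove it and keep running.
        try:
            marker_path.unlink(missing_ok=True)
        except OSError:
            pass  # e.g. the file is briefly locked on Windows; retry next poll
```

One question the sketch leaves open is whether an unreadable marker should be deleted or retried on the next poll. A marker caught halfway through being written could belong to the current process, so writing it atomically (temporary file, then os.replace) would avoid that case.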

Environment

  • OS: Windows 10 (build 26200)
  • Hermes Agent version: latest (commit includes PR #33798 fix)
  • Python: 3.11.15

Related

  • #33778 — original issue about planned_stop_marker on Windows (this watcher was the fix)
  • PR #33798 — introduced the watcher thread without PID validation

Workaround

Delete the stale marker before starting Gateway:

rm ~/.hermes/.gateway-planned-stop.json
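Where rm or ~ expansion is not available (for example in cmd.exe), the same cleanup can be done from Python. This is a sketch, assuming the marker lives in .hermes under the user's home directory as in the command above; the function name remove_planned_stop_marker is made up for illustration, and installations that use a different directory would need to pass it explicitly.

```python
from pathlib import Path
from typing import Optional


def remove_planned_stop_marker(hermes_home: Optional[Path] = None) -> bool:
    """Delete the planned-stop marker; return True if a file was removed."""
    # Assumption: the default location mirrors ~/.hermes from the workaround.
    home = hermes_home if hermes_home is not None else Path.home() / ".hermes"
    marker = home / ".gateway-planned-stop.json"
    try:
        marker.unlink()
    except FileNotFoundError:
        return False  # nothing to clean up
    return True
```

Calling remove_planned_stop_marker() before hermes gateway run has the same effect as the rm command, and it is safe to call when no marker exists.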
