gateway: `gateway_state.json` heartbeat tick missing; WebUI cross-container liveness check fails for idle gateways


Summary

gateway_state.json is documented and consumed by downstream tools as a periodic heartbeat, but the gateway only writes it on state transitions. After a healthy idle period of >2 minutes, cross-container WebUI deployments mark the gateway as down despite the process being alive and serving traffic.

Reproduction

  1. Run hermes-agent gateway with no platforms enabled (or with platforms enabled but stable, no state changes).
  2. Run nesquena/hermes-webui in a separate Docker container, sharing ~/.hermes as a volume.
  3. Wait ~3 minutes after the gateway last logged a state transition.
  4. WebUI dashboard shows the "Hermes agent is not responding. Gateway heartbeat failed." banner.
  5. The gateway process is fully responsive — agents spawn, tools execute, platform messages flow. Only the WebUI's liveness signal is stale.

Verified on hermes-agent v0.14.0 with hermes-webui v0.51.103.

Root cause

gateway/status.py:write_runtime_status() is the only writer of gateway_state.json. Every caller in gateway/run.py is event-driven (state transitions, platform connect/disconnect, startup, degraded, shutdown). There is no periodic tick, so updated_at only advances when one of those events fires.

$ grep -nB1 -A3 "write_runtime_status\b" gateway/run.py | grep -E "write_runtime_status\(" -A1
2429: write_runtime_status(gateway_state=..., exit_reason=..., ...)
2449: write_runtime_status(platform=..., platform_state=..., ...)
3650: write_runtime_status(gateway_state="starting", exit_reason=None)
3951: write_runtime_status(gateway_state="startup_failed", exit_reason=reason)
3976: write_runtime_status(gateway_state="degraded", exit_reason=None)

gateway/memory_monitor.py runs every 300s but only logs [MEMORY] ... — it does not touch the state file.

Why this surfaces now

nesquena/hermes-webui#1879 (closed) implemented a cross-container liveness check that explicitly relies on heartbeat semantics. From the issue:

"The gateway already writes an updated_at timestamp to gateway_state.json on every tick. A recent timestamp with gateway_state: 'running' is a reliable cross-container liveness signal..."

The companion GATEWAY_FRESHNESS_THRESHOLD_S = 120.0 in api/agent_health.py is set assuming a tick exists. So the design contract is: WebUI expects a heartbeat; gateway never had one.

Once get_running_pid() returns None (the normal cross-container case — fcntl locks and os.kill don't cross PID namespaces), the freshness check is the only liveness signal left. With no heartbeat, that check fails 2 minutes after the last state change, even if the gateway is healthy.
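The freshness check described above can be sketched roughly as follows. This is not the WebUI's actual code: the function name `is_gateway_fresh` is hypothetical, and the sketch assumes `updated_at` is stored as an ISO 8601 string, which the issue does not confirm. Only the 120-second threshold and the `gateway_state: 'running'` condition come from the text.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

GATEWAY_FRESHNESS_THRESHOLD_S = 120.0  # value quoted from api/agent_health.py


def is_gateway_fresh(state_path, now=None, threshold_s=GATEWAY_FRESHNESS_THRESHOLD_S):
    """Return True if the state file says 'running' and updated_at is recent.

    Assumption: updated_at is an ISO 8601 string (naive values are treated as UTC).
    """
    try:
        state = json.loads(Path(state_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # a missing or unreadable file counts as "not alive"
    if state.get("gateway_state") != "running":
        return False
    try:
        updated_at = datetime.fromisoformat(state["updated_at"])
    except (KeyError, TypeError, ValueError):
        return False
    if updated_at.tzinfo is None:
        updated_at = updated_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    # Without a periodic writer, this age grows unbounded while the gateway idles.
    return (now - updated_at).total_seconds() <= threshold_s
```

Under these assumptions, an idle but healthy gateway starts failing the check as soon as the last state transition is more than 120 seconds old, which matches the banner seen in the reproduction.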

Proposed fix

Add a periodic heartbeat task that calls write_runtime_status() with no kwargs. The function already refreshes updated_at, pid, argv, kind, and start_time while leaving all _UNSET fields untouched (gateway/status.py:504-547) — so a no-arg call is exactly the noop heartbeat needed.
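To illustrate why a no-arg call is safe, here is a minimal sketch of the sentinel pattern the paragraph describes. `merge_status` is a hypothetical, simplified stand-in for `write_runtime_status()` that works on an in-memory dict; the real function's signature, field set, and timestamp format may differ.

```python
import os
import time

_UNSET = object()  # sentinel meaning "the caller did not pass this field"


def merge_status(current, gateway_state=_UNSET, exit_reason=_UNSET, now=None):
    """Return a new status dict (hypothetical stand-in for write_runtime_status)."""
    updated = dict(current)
    # Fields refreshed on every call (the issue also lists argv, kind, start_time).
    updated["updated_at"] = time.time() if now is None else now
    updated["pid"] = os.getpid()
    # Only overwrite fields that were passed explicitly. None is a real value,
    # so it cannot double as the "not passed" marker.
    if gateway_state is not _UNSET:
        updated["gateway_state"] = gateway_state
    if exit_reason is not _UNSET:
        updated["exit_reason"] = exit_reason
    return updated


before = {"gateway_state": "running", "exit_reason": None, "updated_at": 100.0}
after = merge_status(before, now=160.0)  # heartbeat-style call with no state kwargs
print(after["gateway_state"], after["updated_at"])
```

The heartbeat-style call advances `updated_at` while `gateway_state` and `exit_reason` keep their previous values, which is the behavior the proposed tick relies on.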

Mirroring the memory_monitor pattern would keep this lightweight:

# gateway/status_heartbeat.py (new)
import logging
import threading

from gateway.status import write_runtime_status

logger = logging.getLogger(__name__)

_thread = None
_stop = None
DEFAULT_INTERVAL_SECONDS = 60.0  # well below WebUI's 120s freshness window

def _loop(stop_event, interval):
    # wait() returns False on timeout (time to tick) and True once the event is set (exit).
    while not stop_event.wait(interval):
        try:
            # No-arg call: refreshes updated_at and leaves unset fields untouched.
            write_runtime_status()
        except Exception as exc:
            # A failed tick must not kill the heartbeat thread.
            logger.debug("status heartbeat tick failed: %s", exc)

def start_status_heartbeat(interval_seconds=DEFAULT_INTERVAL_SECONDS):
    """Start the heartbeat thread; return False if it is already running."""
    global _thread, _stop
    if _thread is not None and _thread.is_alive():
        return False
    _stop = threading.Event()
    _thread = threading.Thread(
        target=_loop, args=(_stop, float(interval_seconds)),
        daemon=True, name="gateway-status-heartbeat",
    )
    _thread.start()
    return True

Wire it up alongside memory_monitor.start_memory_monitoring() in gateway/run.py start_gateway() (~L17910).
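The proposed module creates a stop event but does not yet expose a way to set it. A possible stop counterpart could look like the sketch below. The `Heartbeat` class and its `tick` callable are hypothetical and stand in for the module-level functions and `write_runtime_status()`, so the example runs without the gateway package; whether memory_monitor offers a similar stop hook is not stated in the issue.

```python
import threading


class Heartbeat:
    """Hypothetical start/stop wrapper; `tick` stands in for write_runtime_status."""

    def __init__(self, tick, interval_seconds=60.0):
        self._tick = tick
        self._interval = float(interval_seconds)
        self._stop = threading.Event()
        self._thread = None

    def _loop(self):
        # Event.wait() returns False on timeout (tick) and True once stop is set (exit).
        while not self._stop.wait(self._interval):
            try:
                self._tick()
            except Exception:
                pass  # a failed tick must not kill the heartbeat thread

    def start(self):
        if self._thread is not None and self._thread.is_alive():
            return False  # already running
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._loop, daemon=True, name="gateway-status-heartbeat"
        )
        self._thread.start()
        return True

    def stop(self, timeout=5.0):
        self._stop.set()  # wakes the pending wait() immediately
        if self._thread is not None:
            self._thread.join(timeout)
```

Because the loop blocks in `Event.wait()` rather than `time.sleep()`, calling `stop()` during shutdown returns promptly instead of waiting out the remainder of a 60-second interval.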

Expose a config knob such as logging.status_heartbeat.interval_seconds (or similar), defaulting to 60s, plus an enabled: false switch so users who don't run the WebUI cross-container can turn the heartbeat off.
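One way the config lookup could be read is sketched below. The key path `logging.status_heartbeat` is only the name suggested in this issue, not an existing option, and `heartbeat_settings` is a hypothetical helper operating on an already-parsed config dict.

```python
DEFAULT_INTERVAL_SECONDS = 60.0


def heartbeat_settings(config):
    """Return (enabled, interval_seconds) from a nested config dict.

    The logging.status_heartbeat key path is the name proposed in the issue,
    not an option that exists today.
    """
    section = (config.get("logging") or {}).get("status_heartbeat") or {}
    enabled = bool(section.get("enabled", True))
    try:
        interval = float(section.get("interval_seconds", DEFAULT_INTERVAL_SECONDS))
    except (TypeError, ValueError):
        interval = DEFAULT_INTERVAL_SECONDS  # fall back on malformed values
    if interval <= 0:
        interval = DEFAULT_INTERVAL_SECONDS  # a non-positive interval makes no sense
    return enabled, interval


print(heartbeat_settings({"logging": {"status_heartbeat": {"interval_seconds": 30}}}))
```

Any configured interval should stay comfortably below the WebUI's 120-second freshness window, otherwise the false-positive banner can still appear between ticks.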

Workaround applied locally

Until the heartbeat lands upstream, I'm running a derived hermes-webui image that bumps GATEWAY_FRESHNESS_THRESHOLD_S = 120.0 → 86400.0 in api/agent_health.py. This trades the false-positive banner for a 24h detection lag on real crashes. Documented for anyone hitting the same symptom while this is open.

Environment

  • hermes-agent v0.14.0 (commit pinned in NousResearch/hermes-agent main as of 2026-05-26)
  • hermes-webui v0.51.103 (ghcr.io/nesquena/hermes-webui@sha256:d568b23...)
  • Linux host, two-container split (gateway on host via systemd-user, webui in Docker)
  • Profiles affected: both default and a named profile (rogelio)

Happy to send a PR if the maintainers like this direction — wanted to confirm the design choice (gateway-side heartbeat vs. WebUI-side check change) before writing tests and wiring config.
