hermes: WebUI health check shows gateway not running in multi-container Docker setup (lock file not visible across containers)


Bug Description

In a 3-container Docker Compose deployment (hermes-webui, hermes-gateway, hermes-agent), the WebUI Dashboard shows "gateway not running" even though the gateway is alive and actively processing messages (confirmed via WeChat/Telegram replies).

Architecture

Three containers share a Docker volume containing ~/.hermes/:

  • hermes-webui — runs python server.py (WebUI/Dashboard)
  • hermes-gateway — runs hermes gateway run
  • hermes-agent — runs the agent loop

Root Cause

The WebUI health check (/api/gateway/status endpoint) relies on gateway.status.get_running_pid() to determine if the gateway is alive.

get_running_pid() checks two things:

  1. gateway.lock: an OS-level flock on ~/.hermes/gateway.lock. The gateway process takes this lock with fcntl.flock() and holds it while it runs; when the gateway process exits, the OS releases the lock automatically.
  2. gateway_state.json: runtime metadata written by the gateway.
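As a rough illustration of the first check, an flock probe can be sketched as follows. This is not the Hermes implementation of get_running_pid(); the function name lock_is_held and its three-way return value are hypothetical and only show why a lock file that is missing from the caller's path looks the same as "no gateway".

```python
import fcntl
import os


def lock_is_held(lock_path):
    """Probe an flock-style lock file (hypothetical helper, not Hermes code).

    Returns True if another open file description holds the lock,
    False if the file exists but nobody holds the lock,
    None if the file does not exist at this path.
    """
    try:
        fd = os.open(lock_path, os.O_RDWR)
    except FileNotFoundError:
        # This is the case the WebUI container hits: no file at its own path.
        return None
    try:
        # Non-blocking attempt: fails immediately if the lock is already held.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return True
    else:
        # We got the lock, so nobody else had it; release it straight away.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

A probe like this only gives a meaningful answer when the caller opens the same file the gateway locked. Note also that it briefly takes the lock itself, so a real implementation would need to consider races with gateway startup.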

Problem: In a 3-container setup, get_hermes_home() can resolve to a different directory in each container. For example, the gateway container expands ~/.hermes/ under user hermes while the WebUI container expands it under user hermeswebui. The gateway.lock file may then be written to a path the WebUI container cannot see, even though gateway_state.json (written via atomic_json_write) is visible on the shared volume.
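The path divergence can be illustrated with a small sketch. The real get_hermes_home() is not shown in this issue, so resolve_hermes_home below is a hypothetical stand-in; the home directories /home/hermes and /home/hermeswebui, the HERMES_HOME environment variable, and the /data/hermes mount point are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def resolve_hermes_home(env):
    """Hypothetical stand-in for get_hermes_home().

    Assumes an optional HERMES_HOME override, else $HOME/.hermes.
    """
    override = env.get("HERMES_HOME")
    if override:
        return PurePosixPath(override)
    return PurePosixPath(env.get("HOME", "/")) / ".hermes"


def lock_path(env):
    """Where this container would look for (or create) gateway.lock."""
    return resolve_hermes_home(env) / "gateway.lock"


# Assumed per-container environments: different users, different $HOME.
gateway_env = {"HOME": "/home/hermes"}
webui_env = {"HOME": "/home/hermeswebui"}

print(lock_path(gateway_env))  # /home/hermes/.hermes/gateway.lock
print(lock_path(webui_env))    # /home/hermeswebui/.hermes/gateway.lock

# With one explicit location in both containers, the paths agree.
shared = {"HERMES_HOME": "/data/hermes"}
print(lock_path({**gateway_env, **shared}) == lock_path({**webui_env, **shared}))  # True
```

If the deployment behaves like this sketch, any file written relative to the home directory rather than to the shared mount would be invisible to the other container.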

Evidence

Symptom: Users can message the gateway and receive replies, but Dashboard shows "stopped"

gateway_state.json (visible to WebUI, shows "running"):

{"pid":7,"kind":"hermes-gateway","gateway_state":"running","platforms":{"weixin":{"state":"connected"}}}

gateway.lock / gateway.pid (NOT visible to WebUI, both missing from the WebUI container's HERMES_HOME path):

$ ls -la ~/.hermes/gateway.lock
ls: cannot access '~/.hermes/gateway.lock': No such file or directory
$ ls -la ~/.hermes/gateway.pid
ls: cannot access '~/.hermes/gateway.pid': No such file or directory

The gateway log confirms active message handling during the same period, with no crash or restart events.

Health Check Code Flow

In app/api/agent_health.py, build_agent_health_payload():

runtime_status = read_runtime_status()  # reads gateway_state.json -> exists, shows "running"
running_pid = get_running_pid()         # checks flock on gateway.lock -> NOT FOUND

if running_pid:                                         # None
    return {"alive": True}
elif isinstance(runtime_status, dict):                   # True
    return {"alive": False}  # <-- This is the false negative
else:
    return {"alive": None}

Expected Behavior

The health check should detect that the gateway is alive even when gateway.lock is not accessible from the WebUI container. Possible approaches:

  1. Use gateway_state.json's updated_at timestamp as a fallback — if it's recent and the file says "running", consider the gateway alive
  2. Provide a HERMES_GATEWAY_LOCK_DIR override or path config so multi-container setups can point to a shared lock path
  3. Add a lightweight HTTP health endpoint to the gateway process so the WebUI can ping it directly
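The first approach could look roughly like the sketch below. It is not a proposed patch: the function name gateway_alive, the 120-second freshness window, and the assumption that updated_at is an ISO 8601 string are all illustrative, since the sample gateway_state.json in this report does not show that field.

```python
from datetime import datetime, timedelta, timezone

# Assumed freshness window; the right value depends on how often the
# gateway rewrites gateway_state.json, which this issue does not state.
STALE_AFTER = timedelta(seconds=120)


def gateway_alive(running_pid, runtime_status, now=None):
    """Return True, False, or None (unknown), with a timestamp fallback."""
    if running_pid:
        return True
    if not isinstance(runtime_status, dict):
        return None
    if runtime_status.get("gateway_state") != "running":
        return False

    updated_at = runtime_status.get("updated_at")
    if not isinstance(updated_at, str):
        return False  # no timestamp: keep the current behaviour
    try:
        # Accept a trailing "Z" on Python versions where fromisoformat does not.
        stamp = datetime.fromisoformat(updated_at.replace("Z", "+00:00"))
    except ValueError:
        return False
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)  # assume UTC if naive

    now = now or datetime.now(timezone.utc)
    return now - stamp <= STALE_AFTER
```

The trade-off is that a gateway which crashes without updating its state file is still reported as alive until the window expires, so the window should be short relative to how quickly operators need to notice an outage.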

Environment

  • Deployment: 3-container Docker Compose (from the official WebUI docker-compose files)
  • Hermes Agent version: v2.1.0
  • Platform: WeChat (weixin gateway) confirmed affected; likely affects all gateway platforms

Workaround

None required: there is no functional impact and the gateway works correctly. Only the Dashboard status display is wrong.
