hermes - 💡(How to fix) Fix Dashboard hangs: tui_gateway.slash_worker subprocesses leak on PTY chat disconnect (524 via reverse proxy)



Dashboard hangs under PTY chat after stacked tui_gateway.slash_worker subprocess leak

Summary

When running hermes dashboard --tui behind a reverse proxy (Cloudflare Tunnel in my case), repeated open/close of the in-browser Chat tab leaks tui_gateway.slash_worker Python subprocesses. They stack inside the dashboard's systemd cgroup until the process saturates memory and stops responding — the dashboard hangs with dozens of pending connections, and the proxy returns 524 (Cloudflare) / 502 (nginx). SIGTERM does not reap; only SIGKILL on the whole cgroup recovers.

Filed at the maintainers' request after a live incident on vm-hermes (Hermes Agent v0.14.0, commit cea87d913).

Environment

  • Hermes Agent v0.14.0, main @ cea87d913
  • Linux (Azure Ubuntu 24.04, kernel 6.17.0-1015-azure)
  • Python from the bundled venv
  • Launched via systemd --user unit:
    ExecStart=/.../hermes dashboard --port 9119 --host 127.0.0.1 --no-open --tui --skip-build
  • Reverse proxy: Cloudflare Tunnel → 127.0.0.1:9119
  • Model: claude-opus-4.7 via Copilot OAuth (reasoning medium)

Reproduction

  1. Start hermes dashboard --tui bound to loopback.
  2. Open the dashboard in a browser, click Chat (spawns a PTY + tui_gateway.slash_worker).
  3. Close the tab / browser without typing /quit, OR let the websocket drop due to upstream proxy timeout / page refresh.
  4. Re-open Chat. Repeat 3–5 times.
  5. pgrep -af tui_gateway.slash_worker — workers accumulate, one per session, never reaped.
  6. After about 5 stacked workers, the dashboard service's memory reaches roughly 800 MB (systemd reported 781.4M, peak 1.1G, in my incident), the event loop starves, all HTTP requests stall (51 pending in my incident), and the proxy returns 524.
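
The worker count in step 5 can also be tracked from a script, which additionally shows each worker's parent PID and so makes reparented orphans easy to spot. This is a minimal sketch, assuming a Linux /proc filesystem; the helper names `parse_ppid` and `find_workers` are hypothetical and not part of Hermes.

```python
import os

WORKER_MARKER = "tui_gateway.slash_worker"  # module name taken from the report


def parse_ppid(stat_text: str) -> int:
    """Return the parent PID from the contents of /proc/<pid>/stat.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the LAST ')'.
    """
    after_comm = stat_text.rsplit(")", 1)[1].split()
    # after_comm[0] is the process state, after_comm[1] is the parent PID.
    return int(after_comm[1])


def find_workers(proc_root: str = "/proc") -> list[tuple[int, int]]:
    """Return sorted (pid, ppid) pairs for processes whose command line names the worker."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no /proc here (for example on macOS)
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                ppid = parse_ppid(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while it was being read
        if WORKER_MARKER in cmdline:
            found.append((int(entry), ppid))
    return sorted(found)
```

Calling `find_workers()` after each open/close cycle should show the list growing by one entry per leaked session. A worker whose parent PID is no longer the agent that spawned it has been orphaned.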

Observed behaviour

$ pgrep -af tui_gateway.slash_worker | wc -l
5

$ systemctl --user status hermes-dashboard.service
   Memory: 781.4M (peak: 1.1G)
   Tasks: 38

$ journalctl --user -u hermes-dashboard | grep -i "Incoming request ended abruptly: context canceled" | wc -l
51

SIGTERM to the unit was ignored. Recovery:

systemctl --user kill -s SIGKILL hermes-dashboard.service
systemctl --user reset-failed hermes-dashboard.service
systemctl --user start hermes-dashboard.service

Root cause (suspected)

Two related leaks in the PTY-bridge / slash-worker lifecycle:

  1. The /api/pty WebSocketDisconnect path closes the PtyBridge correctly, but the hermes --tui child it spawned holds an open _SlashWorker (a subprocess.Popen of tui_gateway.slash_worker). The slash worker is a grandchild of the dashboard: it is spawned by the in-PTY agent process, not by the dashboard itself. When pty_bridge.PtyBridge.close() takes the agent down via its SIGHUP→SIGTERM→SIGKILL escalation and the agent exits non-cleanly, the TTY hangup does not always propagate to the worker. Result: an orphaned tui_gateway.slash_worker that is reparented (to PID 1 or the nearest subreaper) but stays in the dashboard's user-unit cgroup, because cgroup membership is inherited at fork and does not change on reparenting.
  2. _SlashWorker registers no atexit / signal handler and no PR_SET_PDEATHSIG in tui_gateway/server.py (lines ~183–264). close() is only called when the in-agent code reaches _restart_slash_worker or session shutdown — neither runs on abrupt websocket disconnect.

So:

  • Every browser refresh / Cloudflare upstream timeout that drops the /api/pty WS leaves an orphan worker.
  • Because the workers live in the same systemd user-cgroup, they count against the dashboard service's memory and Tasks=. In my incident, SIGTERM to the unit did not clear them; only a SIGKILL across the whole control group (KillMode=control-group) did. A process with no SIGTERM handler normally terminates on that signal, so the reason the workers survived it is still unconfirmed.
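
The orphaning itself is ordinary POSIX behaviour rather than anything specific to Hermes: killing a process does not kill its children. The sketch below reproduces that with stand-in processes. The "agent" and "worker" here are plain `sleep` scripts, and the names `AGENT_SRC`, `pid_alive` and `orphan_demo` are hypothetical. It assumes a Unix-like system.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in for the in-PTY agent: it starts a long-lived child (the "worker"),
# reports the child's PID on stdout, then waits to be killed.
AGENT_SRC = """
import subprocess, sys, time
worker = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(30)"],
    stdout=subprocess.DEVNULL,
)
print(worker.pid, flush=True)
time.sleep(30)
"""


def pid_alive(pid: int) -> bool:
    """True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def orphan_demo() -> bool:
    """SIGKILL the stand-in agent and report whether its worker survived."""
    agent = subprocess.Popen(
        [sys.executable, "-c", AGENT_SRC], stdout=subprocess.PIPE, text=True
    )
    try:
        worker_pid = int(agent.stdout.readline())
        agent.kill()  # SIGKILL: the agent gets no chance to clean up
        agent.wait()
    finally:
        agent.stdout.close()
    time.sleep(0.2)  # give the kernel time to reparent the worker
    survived = pid_alive(worker_pid)
    if survived:
        os.kill(worker_pid, signal.SIGKILL)  # do not leak the demo's own worker
    return survived
```

`orphan_demo()` is expected to return `True`: without a parent-death signal or an explicit sweep, the worker outlives the agent, which matches the accumulation seen in the incident.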

Suggested fix

Two independent guards, each cheap and useful on its own:

  1. In _SlashWorker.__init__, set PR_SET_PDEATHSIG to SIGTERM on the child via a preexec_fn so the worker dies the moment its parent agent exits — even if the parent crashes or is SIGKILLed.

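    import ctypes
    import signal
    import subprocess
    import sys

    PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>

    def _set_pdeathsig():
        # Runs in the child between fork() and exec(); it must never raise.
        try:
            libc = ctypes.CDLL("libc.so.6", use_errno=True)
            libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM), 0, 0, 0)
        except Exception:
            pass

    self.proc = subprocess.Popen(
        argv,
        # ... existing Popen arguments unchanged ...
        # prctl() is Linux-only, so skip the hook on other platforms.
        preexec_fn=_set_pdeathsig if sys.platform.startswith("linux") else None,
    )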
  2. In /api/pty's finally block (hermes_cli/web_server.py ~3582–3588), also sweep for any tui_gateway.slash_worker whose parent PID is the just-closed bridge's PID (that is, the in-PTY agent), and SIGTERM it after bridge.close(). Collect those PIDs before calling bridge.close(): once the agent exits, its workers are reparented and a parent-PID match no longer finds them. This is defensive, since guard (1) makes it redundant on Linux, but it is necessary on macOS, where PR_SET_PDEATHSIG doesn't exist (use proc_track / kqueue NOTE_EXIT if you want symmetry, or just accept the explicit sweep).
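
The sweep in guard (2) could look roughly like the sketch below. It assumes a Linux /proc filesystem (a macOS version would need a different process listing), and the names `snapshot_workers`, `terminate_all` and `agent_pid` are hypothetical. How the agent's PID is obtained from the bridge is not shown in the report and is left open here.

```python
import os
import signal

WORKER_MARKER = b"tui_gateway.slash_worker"  # module name taken from the report


def _read(path: str) -> bytes:
    """Read a /proc file, returning b'' if the process has already gone."""
    try:
        with open(path, "rb") as fh:
            return fh.read()
    except OSError:
        return b""


def snapshot_workers(agent_pid: int, proc_root: str = "/proc") -> list[int]:
    """Return PIDs of slash workers that are direct children of agent_pid.

    Call this BEFORE closing the bridge: once the agent exits, its children
    are reparented and the parent-PID match no longer works.
    """
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []
    workers = []
    for entry in entries:
        if not entry.isdigit():
            continue
        stat = _read(os.path.join(proc_root, entry, "stat"))
        # After the last ')' the fields are: state, ppid, ...
        fields = stat.rsplit(b")", 1)[-1].split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if int(fields[1]) != agent_pid:
            continue
        if WORKER_MARKER in _read(os.path.join(proc_root, entry, "cmdline")):
            workers.append(int(entry))
    return sorted(workers)


def terminate_all(pids: list[int], sig: int = signal.SIGTERM) -> list[int]:
    """Signal each PID and return the ones that were still present."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # already exited, nothing to reap
        except PermissionError:
            continue  # not ours (for example after PID reuse); leave it alone
        signalled.append(pid)
    return signalled


# Intended call order in the cleanup path (variable names hypothetical):
#   pending = snapshot_workers(agent_pid)
#   bridge.close()
#   terminate_all(pending)
```

Matching on both the parent PID and the command line keeps the sweep narrow. PID reuse between the snapshot and the signal is still theoretically possible, so the window between the two calls should stay short.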

Optional third: document KillMode=control-group for the systemd unit (it is the default, but worth a note in docs/deployment.md) so operators don't override it to process and lose the cgroup-wide SIGKILL recovery path.
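
Two caveats apply to guard (1). First, there is a small race: if the parent dies after fork() but before prctl() runs in the child, no signal is ever delivered, so the child should compare getppid() against the PID it expected. Second, Python's documentation warns that preexec_fn is not safe when the parent process has multiple threads, so anything the hook needs is best prepared before the fork. A minimal sketch under those assumptions follows; `make_pdeathsig_hook` and `spawn_worker` are hypothetical names, not existing Hermes functions.

```python
import ctypes
import os
import signal
import subprocess
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def make_pdeathsig_hook(sig: int = signal.SIGTERM):
    """Build a preexec_fn that arms PR_SET_PDEATHSIG and closes the fork race.

    Returns None on non-Linux platforms, which Popen treats as "no hook".
    """
    if not sys.platform.startswith("linux"):
        return None
    parent_pid = os.getpid()                  # captured in the parent, before fork()
    libc = ctypes.CDLL(None, use_errno=True)  # resolved before fork, not in the child

    def hook():
        # Runs in the child between fork() and exec().
        rc = libc.prctl(PR_SET_PDEATHSIG, ctypes.c_ulong(int(sig)), 0, 0, 0)
        if rc != 0:
            return  # could not arm the signal; rely on explicit cleanup instead
        if os.getppid() != parent_pid:
            # The parent died before prctl() took effect, so no signal is coming.
            os._exit(1)

    return hook


def spawn_worker(argv: list[str]) -> subprocess.Popen:
    """Start argv so that it receives SIGTERM when the spawning process dies."""
    return subprocess.Popen(argv, preexec_fn=make_pdeathsig_hook())
```

Note that the kernel ties the parent-death signal to the thread that created the child, not to the whole process, so a worker spawned from a short-lived helper thread would be signalled when that thread exits.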

Reproducibility

Reliably reproduces on my box: 3 open/close cycles on a flaky upstream (I forced this by toggling Cloudflare cache rules), 5 stacked workers, hang within 2 minutes. Happy to provide systemd journal excerpts or strace if useful.

Related

  • I separately patched a Host: header issue affecting the same proxy setup in #32362 (DNS-rebinding allowlist via env var). That patch is what surfaced this leak — without it the dashboard wasn't reachable over the tunnel at all, so the leak never had a chance to accumulate.
