Dashboard hangs under PTY chat after stacked `tui_gateway.slash_worker` subprocess leak

StepCodex · 2026-05-26T03:01:31Z

Summary

When running hermes dashboard --tui behind a reverse proxy (Cloudflare Tunnel in my case), repeatedly opening and closing the in-browser Chat tab leaks tui_gateway.slash_worker Python subprocesses. They stack inside the dashboard's systemd cgroup until the process saturates memory and stops responding: the dashboard hangs with dozens of pending connections, and the proxy returns 524 (Cloudflare) / 502 (nginx). SIGTERM does not reap the workers; only SIGKILL on the whole cgroup recovers the service.

Filed at the maintainers' request after a live incident on vm-hermes (Hermes Agent v0.14.0, commit cea87d913).

Environment

Hermes Agent v0.14.0, main @ cea87d913
Linux (Azure Ubuntu 24.04, kernel 6.17.0-1015-azure)
Python from the bundled venv

Launched via systemd --user unit:

ExecStart=/.../hermes dashboard --port 9119 --host 127.0.0.1 --no-open --tui --skip-build

Reverse proxy: Cloudflare Tunnel → 127.0.0.1:9119
Model: claude-opus-4.7 via Copilot OAuth (reasoning medium)

Reproduction

1. Start hermes dashboard --tui bound to loopback.
2. Open the dashboard in a browser and click Chat (this spawns a PTY + tui_gateway.slash_worker).
3. Close the tab / browser without typing /quit, OR let the websocket drop due to an upstream proxy timeout or a page refresh.
4. Re-open Chat. Repeat 3–5 times.
5. Run pgrep -af tui_gateway.slash_worker: workers accumulate, one per session, and are never reaped.
6. After ~5 stacked workers the dashboard process climbs past ~800 MB RSS, the event loop starves, all HTTP requests stall (51 pending in my incident), and the proxy returns 524.

Observed behaviour

$ pgrep -af tui_gateway.slash_worker | wc -l
5

$ systemctl --user status hermes-dashboard.service
   Memory: 781.4M (peak: 1.1G)
   Tasks: 38

$ journalctl --user -u hermes-dashboard | grep -i "Incoming request ended abruptly: context canceled" | wc -l
51

SIGTERM to the unit was ignored. Recovery:

systemctl --user kill -s SIGKILL hermes-dashboard.service
systemctl --user reset-failed hermes-dashboard.service
systemctl --user start hermes-dashboard.service

Root cause (suspected)

Two related leaks in the PTY-bridge / slash-worker lifecycle:

1. The /api/pty WebSocketDisconnect path closes the PtyBridge correctly, but the spawned hermes --tui child holds an open _SlashWorker (a subprocess.Popen of tui_gateway.slash_worker). The slash worker is a grandchild spawned by the in-PTY agent process, not by the dashboard. When the agent dies via the SIGHUP→SIGTERM→SIGKILL escalation in pty_bridge.PtyBridge.close(), the worker does not always see the TTY hangup if the agent exits non-cleanly. Result: an orphan tui_gateway.slash_worker is reparented to PID 1 while remaining in the dashboard cgroup, since it was launched via Popen from inside an agent that started under the dashboard's user-unit cgroup.
2. _SlashWorker registers no atexit / signal handler and no PR_SET_PDEATHSIG in tui_gateway/server.py (lines ~183–264). close() is only called when the in-agent code reaches _restart_slash_worker or session shutdown, and neither runs on an abrupt websocket disconnect.

So:

Every browser refresh / Cloudflare upstream timeout that drops the /api/pty WS leaves an orphan worker.
Because the workers live in the same systemd user cgroup, they count against the dashboard service's memory and Tasks= accounting. In my incident systemctl --user stop did not clear them: the workers did not exit on SIGTERM and only died on a cgroup-wide SIGKILL (KillMode=control-group).
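
The orphaning described above, and what PR_SET_PDEATHSIG changes about it, can be reproduced outside Hermes with a small Linux-only experiment. This is a sketch under stated assumptions, not Hermes code: the names PARENT_SRC, is_running and worker_survives_parent_kill are hypothetical, a plain sleep stands in for tui_gateway.slash_worker, and ctypes.CDLL(None) is used to resolve prctl from the already-loaded C library.

```python
import os
import signal
import subprocess
import sys
import time

# Source for a throwaway "agent" process. It spawns a sleeping "worker",
# optionally arming PR_SET_PDEATHSIG in the child, prints the worker pid,
# then idles until it is killed.
PARENT_SRC = r"""
import ctypes, signal, subprocess, sys, time

def arm_pdeathsig():
    PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM, 0, 0, 0)

hook = arm_pdeathsig if sys.argv[1] == "1" else None
worker = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
    stdout=subprocess.DEVNULL,
    preexec_fn=hook,
)
print(worker.pid, flush=True)
time.sleep(60)
"""


def is_running(pid: int) -> bool:
    """True if pid exists and is not a zombie (reads /proc, Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            # The state letter is the first field after the "(comm)" field.
            state = f.read().rsplit(")", 1)[1].split()[0]
    except (FileNotFoundError, ProcessLookupError):
        return False
    return state not in ("Z", "X")


def worker_survives_parent_kill(use_pdeathsig: bool, wait_s: float = 2.0) -> bool:
    """SIGKILL the stand-in agent and report whether its worker outlives it."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SRC, "1" if use_pdeathsig else "0"],
        stdout=subprocess.PIPE,
        text=True,
    )
    worker_pid = int(parent.stdout.readline())
    parent.kill()  # SIGKILL: the parent gets no chance to clean up
    parent.wait()
    parent.stdout.close()

    deadline = time.monotonic() + wait_s
    while time.monotonic() < deadline and is_running(worker_pid):
        time.sleep(0.05)

    survived = is_running(worker_pid)
    if survived:
        try:
            os.kill(worker_pid, signal.SIGKILL)  # clean up the orphan we made
        except ProcessLookupError:
            pass
    return survived


if __name__ == "__main__":
    print("worker survives without PDEATHSIG:", worker_survives_parent_kill(False))
    print("worker survives with PDEATHSIG:   ", worker_survives_parent_kill(True))
```

On Linux the first call should report a surviving worker and the second should not, which mirrors the leak: nothing ties the worker's lifetime to the agent's unless the kernel is asked to.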

Suggested fix

Two independent guards, each cheap and useful on its own:

1. In _SlashWorker.__init__, set PR_SET_PDEATHSIG to SIGTERM on the child via a preexec_fn so the worker dies the moment its parent agent exits, even if the parent crashes or is SIGKILLed.

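import subprocess
import sys

def _set_pdeathsig():
    # Runs in the child between fork() and exec(): ask the kernel to send
    # this process SIGTERM as soon as the parent that spawned it dies.
    try:
        import ctypes, signal as _sig
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>
        libc.prctl(PR_SET_PDEATHSIG, _sig.SIGTERM, 0, 0, 0)
    except Exception:
        pass  # best effort: never stop the worker from starting

self.proc = subprocess.Popen(
    argv,
    ...,
    # prctl() is Linux-only, so install the hook only there
    preexec_fn=_set_pdeathsig if sys.platform.startswith("linux") else None,
)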

2. In /api/pty's finally block (hermes_cli/web_server.py ~3582–3588), after bridge.close(), also walk the process table and SIGTERM any tui_gateway.slash_worker whose parent pid is the just-closed bridge's pid. This is defensive, since guard (1) makes it redundant on Linux, but it is necessary on macOS, where PR_SET_PDEATHSIG doesn't exist (use proc_track / kqueue NOTE_EXIT if you want symmetry, or just accept the explicit sweep).

Optional third: document KillMode=control-group for systemd units (it is the default, but worth a note in docs/deployment.md) so operators don't override it to process and lose the cgroup-wide SIGKILL recovery path.
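
The explicit sweep from the second guard could look roughly like the following on Linux. This is a minimal sketch, not the Hermes implementation: parse_ppid, find_workers and sweep are hypothetical helper names, matching on the tui_gateway.slash_worker string in the command line is an assumption, and reading /proc is itself Linux-only, so macOS would need a different process-listing mechanism.

```python
import os
import signal

WORKER_MARKER = "tui_gateway.slash_worker"  # module name taken from the issue


def parse_ppid(stat_text: str) -> int:
    """Return the parent pid from the contents of /proc/<pid>/stat."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' instead of splitting on whitespace.
    after_comm = stat_text.rsplit(")", 1)[1].split()
    return int(after_comm[1])  # fields after comm: state, ppid, ...


def find_workers(parent_pid: int, proc_root: str = "/proc") -> list[int]:
    """List pids of slash workers whose parent is parent_pid."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                ppid = parse_ppid(f.read())
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated in /proc/<pid>/cmdline.
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except (OSError, IndexError, ValueError):
            continue  # the process exited while we were scanning
        if ppid == parent_pid and WORKER_MARKER in cmdline:
            found.append(int(entry))
    return found


def sweep(worker_pids: list[int]) -> None:
    """Send SIGTERM to each collected worker, ignoring ones already gone."""
    for pid in worker_pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass
```

One caveat with this approach: once the agent process has exited, the kernel reparents its children, so a worker's parent pid no longer matches the bridge's pid. The worker pids would therefore need to be collected with find_workers before bridge.close() runs and passed to sweep afterwards.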

Reproducibility

Reliably reproduces on my box: 3 open/close cycles on a flaky upstream (I forced this by toggling Cloudflare cache rules), 5 stacked workers, hang within 2 minutes. Happy to provide systemd journal excerpts or strace if useful.

Related

I separately patched a Host: header issue affecting the same proxy setup in #32362 (DNS-rebinding allowlist via env var). That patch is what surfaced this leak: without it the dashboard wasn't reachable over the tunnel at all, so the leak never had a chance to accumulate.
