hermes - 💡(How to fix) Fix tui_gateway.slash_worker subprocesses leak under dashboard usage (swap-pinning a 7.8GB box at 128 workers)

Root Cause

The same hermes dashboard --tui setup is running on 4 other VPSes in our fleet (Finn, Finn2, Jason, Sam; same release). All 4 currently show zero tui_gateway.slash_worker accumulation, because their dashboards are open but barely used by humans. The code path is therefore the same everywhere; only Hermie, the affected box, has enough multi-user traffic to expose the bug. This is consistent with "every slash invocation spawns a fresh worker, and the orphans only become visible once the session disconnects."

Bug: `tui_gateway.slash_worker` subprocesses leak under dashboard usage

hermes dashboard --tui is documented to use a persistent _SlashWorker subprocess per session — singular, persistent across slash invocations (per AGENTS.md § "Slash Command Flow"). Observed behavior contradicts this: each slash.exec call appears to spawn a fresh tui_gateway.slash_worker subprocess and orphan it.

Workers from sessions that ended hours/days ago are still running. Each holds ~95 MB resident. On a busy multi-user dashboard box this accumulates fast enough to swap-pin the host.

Reproducer

hermes dashboard --insecure --no-open --host 0.0.0.0 --port 9119 --tui
Multiple long-lived browser dashboard chat sessions (different users)
Heavy use of slash commands via the embedded TUI in browser

Observed accumulation (production, 7.8 GB box)

128 stale tui_gateway.slash_worker subprocesses across 5 dashboard chat sessions over ~48 h:

Session 20260505_104123_c8063a — 62 stale workers (oldest from 5/5 ~10:41 AM)
Session 20260506_072606_f58b62 — 30
Session 20260506_082737_2a52e2 — 29
Session 20260506_114839_5a88dd —  4
Session 20260505_163042_6b92fb —  3
                                ───
                         total 128

128 × ~95 MB ≈ 12 GB of resident demand on a 7.8 GB host. Result:

Before SIGTERM:   RAM free  187 MB,  swap used 3.9 GB / 4.0 GB,  slash_workers 128
After  SIGTERM:   RAM free  4.4 GB,  swap used  81 MB,           slash_workers  19

kswapd0 was active, so every keystroke through the websocket → PTY bridge had to wait on kernel paging. The symptom for end users: the dashboard text field is laggy or unresponsive on a box that otherwise has no load.
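
The memory arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it treats the ~95 MB per-worker figure as fully private memory (shared pages would lower the real total) and converts MB to GB with a factor of 1024, both of which are assumptions rather than measurements.

```python
# Back-of-the-envelope check of the figures reported in this issue.
WORKERS = 128             # stale slash_worker processes counted on the box
RSS_MB_PER_WORKER = 95    # approximate resident size per worker
HOST_RAM_GB = 7.8         # physical RAM on the host
SWAP_GB = 4.0             # configured swap

# Total resident demand if every worker really needs its ~95 MB.
demand_gb = WORKERS * RSS_MB_PER_WORKER / 1024

# How many workers of that size fit in RAM alone, ignoring everything else.
fit_in_ram = int(HOST_RAM_GB * 1024 // RSS_MB_PER_WORKER)

print(f"resident demand: {demand_gb:.1f} GB")
print(f"RAM + swap:      {HOST_RAM_GB + SWAP_GB:.1f} GB")
print(f"workers that fit in RAM alone: {fit_in_ram}")
```

Under these assumptions the demand is roughly equal to RAM plus swap combined, which matches swap being almost completely full before the SIGTERM.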

Re-accumulation rate

Cleared 128 → 19 at ~10:30 AM ET. Re-checked ~4 hours later: back to 46. None of the workers are reused across slash invocations within the same session; they accumulate rather than being deduplicated. This confirms that the documented single persistent worker per session is not what actually happens.

Session 20260506_082737_2a52e2 — 15 workers
Session 20260506_072606_f58b62 — 14
Session 20260505_104123_c8063a — 14
Session 20260507_142806_ea1f49 —  1   (active session)
                                ───
                          total 44 (+ 2 in active turn)

Each worker is invoked as:

python3 -m tui_gateway.slash_worker --session-key <session_id> --model claude-opus-4-7
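
A per-session breakdown like the ones above can be produced by grouping the process table on the --session-key argument. The following is a minimal sketch, not the tooling used for the numbers in this issue: the function names are made up for illustration, and it assumes Linux procps `ps` and the command-line shape shown above.

```python
import re
import subprocess
from collections import Counter

# Matches the worker invocation and captures the value passed to --session-key.
WORKER_RE = re.compile(r"tui_gateway\.slash_worker\b.*?--session-key[ =](\S+)")


def count_workers_by_session(ps_output: str) -> Counter:
    """Count slash_worker processes per session key from `ps -eo pid,etimes,cmd` text."""
    counts: Counter = Counter()
    for line in ps_output.splitlines():
        match = WORKER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def snapshot() -> Counter:
    """Run ps on the local box and count the workers that are currently alive."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etimes,cmd"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return count_workers_by_session(out)
```

Calling `snapshot().most_common()` on an affected box lists sessions from most to fewest workers. With a single persistent worker per session, every count should be 1.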

Where to look (per `AGENTS.md` TUI architecture section)

tui_gateway/server.py — slash worker lifecycle / spawn path
tui_gateway/__main__.py (or wherever _SlashWorker is constructed)
hermes_cli/pty_bridge.py + hermes_cli/web_server.py /api/pty — dashboard side
ui-tui/src/* — slash.exec dispatch path

The most likely causes, and so where the fix belongs:

the _SlashWorker is constructed per call instead of being looked up from a session-scoped registry, and/or
an existing registry doesn't reap workers when the session disconnects (no graceful close on websocket teardown)
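
To make the two hypotheses above concrete, here is a minimal sketch of what a session-scoped registry with reaping on disconnect could look like. It is not taken from the hermes codebase: `SlashWorkerRegistry`, `get`, `close`, and `demo_cmd` are hypothetical names, and the real `_SlashWorker` may manage its subprocess quite differently.

```python
import subprocess
import sys
import threading
from typing import Callable, Dict, List


class SlashWorkerRegistry:
    """Hypothetical registry that keeps at most one live worker per session."""

    def __init__(self, build_cmd: Callable[[str], List[str]]):
        self._build_cmd = build_cmd  # maps a session id to the worker argv
        self._workers: Dict[str, subprocess.Popen] = {}
        self._lock = threading.Lock()

    def get(self, session_id: str) -> subprocess.Popen:
        """Return the session's worker, spawning one only if none is alive."""
        with self._lock:
            proc = self._workers.get(session_id)
            if proc is None or proc.poll() is not None:
                proc = subprocess.Popen(self._build_cmd(session_id))
                self._workers[session_id] = proc
            return proc

    def close(self, session_id: str, timeout: float = 5.0) -> None:
        """Reap the session's worker; meant to be called on session teardown."""
        with self._lock:
            proc = self._workers.pop(session_id, None)
        if proc is None or proc.poll() is not None:
            return  # nothing registered, or the worker already exited
        proc.terminate()  # SIGTERM first, so the worker can exit cleanly
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the worker ignores SIGTERM
            proc.wait()


def demo_cmd(session_id: str) -> List[str]:
    """Stand-in child process for illustration; not the real worker command."""
    return [sys.executable, "-c", "import time; time.sleep(60)"]
```

With a registry like this, repeated `get(session_id)` calls from the slash.exec path return the same process, and a `close(session_id)` call in the websocket teardown handler would leave no orphan behind.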

Workaround in production

Stopgap kill cron deployed across all 5 of our boxes:

# Kill any tui_gateway.slash_worker process older than 60 minutes (gracefully).
ps -eo pid,etimes,cmd \
  | awk '/tui_gateway\.slash_worker/ && !/awk/ && $2+0 > 3600 {print $1}' \
  | xargs -r kill -TERM

The job runs every 30 minutes. Workers exit cleanly on SIGTERM, and in our testing this had no impact on live sessions.

I'd be happy to clear out our production state more often and capture additional snapshots if useful (process tables, /proc/<pid>/status, etc.). We will keep applying the stopgap until an upstream fix is available.

Environment

hermes-agent build: 2026.4.30 + 188 commits (v2026.4.30-188-g5d3be898a)
Ubuntu 24.04 LTS
Python 3.12, all default
Anthropic API mode, model claude-opus-4-7
5 production VPSes, 7.8 GB RAM each
