hermes: tui_gateway.slash_worker subprocesses leak under dashboard usage (swap-pinning a 7.8 GB box at 128 workers)

Bug: tui_gateway.slash_worker subprocesses leak under dashboard usage

hermes dashboard --tui is documented to use a persistent _SlashWorker subprocess per session — singular, persistent across slash invocations (per AGENTS.md § "Slash Command Flow"). Observed behavior contradicts this: each slash.exec call appears to spawn a fresh tui_gateway.slash_worker subprocess and orphan it.

Workers from sessions that ended hours/days ago are still running. Each holds ~95 MB resident. On a busy multi-user dashboard box this accumulates fast enough to swap-pin the host.

Reproducer

  • hermes dashboard --insecure --no-open --host 0.0.0.0 --port 9119 --tui
  • Multiple long-lived browser dashboard chat sessions (different users)
  • Heavy use of slash commands via the embedded TUI in browser

Observed accumulation (production, 7.8 GB box)

128 stale tui_gateway.slash_worker subprocesses across 5 dashboard chat sessions over ~48 h:

Session 20260505_104123_c8063a — 62 stale workers (oldest from 5/5 ~10:41 AM)
Session 20260506_072606_f58b62 — 30
Session 20260506_082737_2a52e2 — 29
Session 20260506_114839_5a88dd —  4
Session 20260505_163042_6b92fb —  3
                                ───
                         total 128

128 × ~95 MB ≈ 12 GB of resident demand on a 7.8 GB host. Result:

Before SIGTERM:   RAM free  187 MB,  swap used 3.9 GB / 4.0 GB,  slash_workers 128
After  SIGTERM:   RAM free  4.4 GB,  swap used  81 MB,           slash_workers  19

kswapd0 was active. Every keystroke through the websocket → PTY bridge had to wait on kernel paging. Symptom for end users: dashboard text field laggy / unresponsive on a box that otherwise has no load.

Re-accumulation rate

Cleared 128 → 19 at ~10:30 AM ET. Re-checked ~4 hours later: back to 46 (44 idle plus 2 in the active turn). No worker is reused across slash invocations within the same session; they accumulate instead of being deduplicated. This confirms that the documented single persistent worker per session is not what actually happens.

Session 20260506_082737_2a52e2 — 15 workers
Session 20260506_072606_f58b62 — 14
Session 20260505_104123_c8063a — 14
Session 20260507_142806_ea1f49 —  1   (active session)
                                ───
                          total 44 (+ 2 in active turn)
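As a rough back-of-the-envelope check, the two counts above give an approximate growth rate. The sketch below only restates the figures from this report (19 workers after the clear, 46 about 4 hours later, ~95 MB resident each); the function name is illustrative, and the result is a single-interval estimate under that day's traffic, not a measured rate.

```python
def growth_estimate(before: int, after: int, hours: float, mb_per_worker: float) -> tuple[float, float]:
    """Return (workers per hour, MB of resident memory added per hour)."""
    workers_per_hour = (after - before) / hours
    return workers_per_hour, workers_per_hour * mb_per_worker


# Figures taken from the report: 19 -> 46 workers in ~4 h, ~95 MB each.
rate, mb_per_hour = growth_estimate(19, 46, 4.0, 95.0)
print(f"{rate:.2f} workers/h, ~{mb_per_hour:.0f} MB/h")  # 6.75 workers/h, ~641 MB/h
```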

Each worker is invoked as:

python3 -m tui_gateway.slash_worker --session-key <session_id> --model claude-opus-4-7
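Per-session counts like the tables above can be collected by grouping `ps` output on the `--session-key` argument. The following is a minimal sketch, assuming the worker command line has the shape shown above; the helper names `count_workers_by_session` and `snapshot` are illustrative and not part of hermes.

```python
import re
import subprocess
from collections import Counter

# Assumes the command-line shape quoted in the report:
#   python3 -m tui_gateway.slash_worker --session-key <session_id> --model ...
_WORKER_RE = re.compile(r"tui_gateway\.slash_worker\b.*?--session-key[ =](\S+)")


def count_workers_by_session(ps_output: str) -> Counter:
    """Count slash_worker processes per session key in `ps -eo pid,etimes,cmd` output."""
    counts: Counter = Counter()
    for line in ps_output.splitlines():
        match = _WORKER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def snapshot() -> Counter:
    """Run ps once and return the current per-session worker counts."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,cmd"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_workers_by_session(result.stdout)
```

Calling `snapshot().most_common()` lists sessions from most to fewest workers; comparing two snapshots taken a few hours apart shows whether any session's count ever goes down.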

Where to look (per AGENTS.md TUI architecture section)

  • tui_gateway/server.py — slash worker lifecycle / spawn path
  • tui_gateway/__main__.py (or wherever _SlashWorker is constructed)
  • hermes_cli/pty_bridge.py + hermes_cli/web_server.py /api/pty — dashboard side
  • ui-tui/src/*slash.exec dispatch path

The cause is most likely:

  • the _SlashWorker is being constructed per-call instead of looked up from a session-scoped registry, and/or
  • an existing registry isn't reaping workers when the session disconnects (no graceful close on websocket teardown)
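The two hypotheses above suggest the general shape of a fix: look workers up in a session-scoped registry instead of constructing them per call, and reap them on session teardown. The sketch below illustrates that pattern only. `SlashWorkerRegistry`, `get_or_spawn`, `reap`, and `worker_command` are hypothetical names, not the actual hermes API, and how the gateway talks to the worker (pipes, protocol) is deliberately not modelled.

```python
import subprocess
import sys
import threading
from typing import Callable, Dict, List


def worker_command(session_key: str, model: str = "claude-opus-4-7") -> List[str]:
    """Build a command line mirroring the one observed in the report."""
    return [sys.executable, "-m", "tui_gateway.slash_worker",
            "--session-key", session_key, "--model", model]


class SlashWorkerRegistry:
    """Hypothetical registry holding at most one live worker process per session."""

    def __init__(self, command_for: Callable[[str], List[str]] = worker_command) -> None:
        self._command_for = command_for
        self._workers: Dict[str, subprocess.Popen] = {}
        self._lock = threading.Lock()

    def get_or_spawn(self, session_key: str) -> subprocess.Popen:
        """Return the session's live worker, spawning one only if none is running."""
        with self._lock:
            worker = self._workers.get(session_key)
            if worker is None or worker.poll() is not None:
                worker = subprocess.Popen(self._command_for(session_key))
                self._workers[session_key] = worker
            return worker

    def reap(self, session_key: str, timeout: float = 5.0) -> None:
        """Terminate the session's worker; intended to run on websocket teardown."""
        with self._lock:
            worker = self._workers.pop(session_key, None)
        if worker is None or worker.poll() is not None:
            return  # nothing registered, or the worker already exited
        worker.terminate()  # SIGTERM first, as in the production workaround
        try:
            worker.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            worker.kill()  # escalate only if SIGTERM was ignored
            worker.wait()
```

With this shape, every slash.exec handler would call `get_or_spawn(session_id)` and the disconnect handler would call `reap(session_id)`, so the number of workers is bounded by the number of open sessions. Whether the real code already has such a registry that is simply bypassed or never reaped is exactly what needs checking in the files listed above.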

Cross-fleet observation

The same hermes dashboard --tui build is running on 4 other VPSes in our fleet (Finn, Finn2, Jason, Sam; same release). All 4 currently show zero tui_gateway.slash_worker accumulation because their dashboards are open but barely used by humans. So the bug is in the same code path everywhere; only Hermie, the affected box, has the multi-user traffic to expose it. This is consistent with "every slash invocation spawns a fresh worker, and only when the session disconnects do orphans become visible."

Workaround in production

Stopgap kill cron deployed across all 5 of our boxes:

# Kill any tui_gateway.slash_worker process older than 60 minutes (gracefully).
ps -eo pid,etimes,cmd \
  | awk '/tui_gateway\.slash_worker/ && !/awk/ && $2+0 > 3600 {print $1}' \
  | xargs -r kill -TERM

Runs every 30 min. Workers exit cleanly on SIGTERM; in testing there was no observed impact on live sessions.

I'd be happy to clear out our production state more often and capture additional snapshots if useful (process tables, /proc/<pid>/status, etc.). Will keep applying the stopgap until upstream fix is available.

Environment

  • hermes-agent build: 2026.4.30 + 188 commits (v2026.4.30-188-g5d3be898a)
  • Ubuntu 24.04 LTS
  • Python 3.12, all default
  • Anthropic API mode, model claude-opus-4-7
  • 5 production VPSes, 7.8 GB RAM each
