hermes - ✅(Solved) Fix tui_gateway dispatcher is single-threaded — slow RPC calls freeze approval.respond / session.interrupt [1 pull requests, 2 comments, 2 participants]

NousResearch/hermes-agent#12546 · Fetched 2026-04-20 12:18:21

Every RPC handler in tui_gateway/server.py runs synchronously on the single stdin-read loop in tui_gateway/entry.py. A slow handler (slash.exec, cli.exec, shell.exec, session.resume, session.branch) blocks the dispatcher for up to 45–600 s, during which ANY inbound RPC from Ink — including approval.respond and session.interrupt — sits unread in the pipe.

User-visible symptom: click Allow Once on an approval prompt, UI appears to hang. Hit Ctrl+C to interrupt, nothing happens. Both become responsive again once the slow handler times out.

Outbound streaming (message.delta, tool.start, etc.) is unaffected — worker threads emit those via write_json (guarded by _stdout_lock). Only inbound dispatch is the bottleneck.

Concrete repro

T+0.0s  user types /tokens
        → slash.exec → _SlashWorker blocks on stdout_queue.get(timeout=45s)
T+0.5s  Agent worker thread (from an earlier prompt.submit) emits approval.request
        → UI renders Allow/Deny buttons
T+0.8s  User clicks Allow Once
        → Ink sends approval.respond into stdin
        → request sits in the pipe buffer, UNREAD
        → agent thread is blocked at ev.wait() inside _block()
T+45s   slash worker times out, dispatcher unblocks
        dispatcher reads approval.respond, sets the Event, agent resumes

44 seconds of UI-frozen staring. Same pattern for session.interrupt — Ctrl+C buffers until the slow handler releases.
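
The timeline above can be reproduced in miniature. What follows is a self-contained simulation, not the gateway's code: a queue stands in for the stdin pipe, `slow_handler` stands in for slash.exec, and the 45 s timeout is scaled down to 0.5 s. Every name in it is hypothetical.

```python
import queue
import threading
import time

inbox = queue.Queue()          # stands in for the stdin pipe
approved = threading.Event()   # stands in for the Event the agent thread waits on
SLOW_S = 0.5                   # scaled-down stand-in for the 45 s worker timeout


def slow_handler():
    time.sleep(SLOW_S)         # e.g. slash.exec waiting on its worker


def dispatcher():
    # Single-threaded loop: each handler runs to completion before the
    # next request is read from the inbox.
    while True:
        method = inbox.get()
        if method == "stop":
            return
        if method == "slash.exec":
            slow_handler()
        elif method == "approval.respond":
            approved.set()


def measure_approval_latency():
    approved.clear()
    loop = threading.Thread(target=dispatcher)
    loop.start()
    inbox.put("slash.exec")
    time.sleep(0.05)                   # the user clicks shortly afterwards
    clicked = time.monotonic()
    inbox.put("approval.respond")      # queued, but unread while the handler runs
    approved.wait()                    # the "agent thread" blocked in ev.wait()
    latency = time.monotonic() - clicked
    inbox.put("stop")
    loop.join()
    return latency


if __name__ == "__main__":
    print(f"approval latency: {measure_approval_latency():.2f}s")
```

The measured latency tracks `SLOW_S` rather than the cost of handling approval.respond, which is the same shape as the 44-second stall described above.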

PR fix notes

PR #12560: fix(tui-gateway): dispatch slow RPC handlers on a thread pool (#12546)

Description (problem / solution / changelog)

What does this PR do?

Fixes the single-threaded dispatcher freeze described in #12546. The for raw in sys.stdin loop in tui_gateway/entry.py calls handle_request() inline, so any handler that blocks for seconds to minutes — slash.exec (45s), cli.exec (up to 600s), shell.exec (30s), session.resume / session.branch (synchronous _make_agent()) — freezes the dispatcher. While one is running, inbound RPCs including approval.respond and session.interrupt sit unread in the stdin pipe buffer and only land after the slow handler returns.

This PR routes only those five handlers onto a small ThreadPoolExecutor; every other handler stays on the main thread. That's Option 2 from the issue — it gives us the user-visible responsiveness win without opening up the ordering / session-state race concerns that a full pool-everything refactor would.

Related Issue

Fixes #12546

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • tui_gateway/server.py
    • Add _LONG_HANDLERS = frozenset({"cli.exec", "session.branch", "session.resume", "shell.exec", "slash.exec"}).
    • Add ThreadPoolExecutor(max_workers=4, thread_name_prefix="tui-rpc") with atexit-registered shutdown. Worker count is overridable via HERMES_TUI_RPC_POOL_WORKERS (min 2).
    • New dispatch(req) function: routes long methods to the pool and returns None; everything else goes through handle_request() inline and returns the response.
    • New _run_and_emit(req) helper: runs the handler on a pool worker, catches any unexpected exception so a bug in a handler still surfaces as a JSON-RPC -32000 error instead of dying silently, and writes the response via the already-thread-safe write_json().
  • tui_gateway/entry.py
    • Swap handle_request → dispatch. When dispatch returns None the pool worker has taken responsibility for writing the response.
  • tests/tui_gateway/test_protocol.py — 5 new tests
    • Non-long handlers still return synchronously from dispatch().
    • Long handlers return None from dispatch() and emit their response via write_json().
    • A blocked long handler does not delay a concurrent fast handler (key regression guard).
    • A handler that raises still produces a structured error response on stdout.
    • Methods not in _LONG_HANDLERS always take the sync path.

Why this scope is safe

  • write_json is already _stdout_lock-guarded (server.py:33, 133), so concurrent response writes from pool workers are already safe.
  • The five pool-routed handlers all return a rendered payload and don't mutate session state that fast handlers depend on — they only touch session["slash_worker"] (in slash.exec) and create new _sessions[sid] entries (in session.resume / session.branch, where sid is a fresh uuid.uuid4().hex[:8]).
  • The ordering hazard the issue raises around session.close → anything is a non-issue here because session.close stays on the main thread; a concurrent pool-dispatched call on the closing session would just see _sess_nowait return _err(rid, 4001, "session not found"), which is already the expected error response for a race like that.
  • session.create is already async via its own agent_ready: threading.Event (server.py:1060, 1118), and prompt.submit already spawns its own worker thread (server.py:1480). Neither changes.
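
The first point, that a lock-guarded writer keeps concurrent responses from interleaving, can be checked with a small experiment. This is a sketch only: the real write_json presumably writes to stdout, while the `stream` parameter and the `hammer` helper here are hypothetical additions that make the behavior observable.

```python
import io
import json
import threading

_stdout_lock = threading.Lock()


def write_json(obj, stream):
    # Serialize outside the lock, then write and flush one whole line under it.
    line = json.dumps(obj) + "\n"
    with _stdout_lock:
        stream.write(line)
        stream.flush()


def hammer(stream, workers=8, per_worker=200):
    # Many threads emit concurrently; return the lines that reached the stream.
    def emit(worker_id):
        for seq in range(per_worker):
            write_json({"worker": worker_id, "seq": seq, "pad": "x" * 64}, stream)

    threads = [threading.Thread(target=emit, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stream.getvalue().splitlines()


if __name__ == "__main__":
    lines = hammer(io.StringIO())
    parsed = [json.loads(line) for line in lines]   # raises if any line is torn
    print(len(parsed), "well-formed lines")
```

Each line parses as a complete JSON object, and each worker's own messages stay in order, which is all the pool-routed handlers need from the writer.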

How to Test

cd hermes-agent
uv run --python 3.12 --with pytest --with pytest-xdist pytest tests/tui_gateway/ -q
# 46 passed

Manual repro (same script from the issue comment), before and after:

python - <<'PY'
import subprocess, sys, json, time
p = subprocess.Popen([sys.executable, '-m', 'tui_gateway.entry'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     text=True, bufsize=1)
def send(m): p.stdin.write(json.dumps(m) + '\n'); p.stdin.flush()
def recv(): return json.loads(p.stdout.readline())

recv()  # gateway.ready
send({'jsonrpc':'2.0','id':1,'method':'session.create','params':{'cols':80}})
sid = recv()['result']['session_id']

t0 = time.time()
send({'jsonrpc':'2.0','id':2,'method':'shell.exec','params':{'command':'sleep 3'}})
send({'jsonrpc':'2.0','id':3,'method':'terminal.resize',
      'params':{'session_id':sid,'cols':120}})

seen = set()
while seen < {2, 3}:
    r = recv()
    if (rid := r.get('id')) in {2, 3}:
        print(f'[{time.time()-t0:5.2f}s] response id={rid}')
        seen.add(rid)
p.terminate()
PY

Before this PR:

[ 3.0xs] response id=2
[ 3.0xs] response id=3   ← terminal.resize was queued behind shell.exec

After this PR:

[ 0.00s] response id=3   ← fast handler returns immediately
[ 3.01s] response id=2   ← shell.exec continues on the pool

Inside hermes chat, the equivalent UX win: !sleep 10 no longer freezes Ctrl+C / approval prompts.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs — none covering this
  • PR contains only this one logical change
  • pytest tests/tui_gateway/ -q passes (46/46)
  • I've added tests for my changes
  • Tested on Ubuntu 24.04 (WSL2), Python 3.12

Documentation & Housekeeping

  • Docs — N/A (one new env var, HERMES_TUI_RPC_POOL_WORKERS, documented in the code comment alongside _RPC_POOL_WORKERS)
  • cli-config.yaml.example — N/A (env var, not config key)
  • CONTRIBUTING.md / AGENTS.md — N/A
  • Cross-platform — concurrent.futures.ThreadPoolExecutor is stdlib, works on Windows/macOS/Linux
  • Tool descriptions/schemas — N/A

Changed files

  • tests/tui_gateway/test_protocol.py (modified, +79/-0)
  • tui_gateway/entry.py (modified, +2/-2)
  • tui_gateway/server.py (modified, +41/-0)
  • ui-tui/src/__tests__/providers.test.ts (modified, +6/-3)
  • ui-tui/src/app/createGatewayEventHandler.ts (modified, +0/-11)
  • ui-tui/src/app/interfaces.ts (modified, +0/-5)
  • ui-tui/src/app/useInputHandlers.ts (modified, +0/-1)
  • ui-tui/src/app/useMainApp.ts (modified, +6/-9)
  • ui-tui/src/components/modelPicker.tsx (modified, +8/-2)
  • ui-tui/src/components/prompts.tsx (modified, +8/-20)
  • ui-tui/src/config/env.ts (modified, +1/-3)
  • ui-tui/src/domain/paths.ts (modified, +1/-2)
  • ui-tui/src/domain/providers.ts (modified, +3/-9)

@OutThisLife — flagging this for you since you own the TUI. Not urgent; no fix attached yet, this is a design-decision issue more than a bug report.

Architecture today

# tui_gateway/entry.py (38 lines)
for raw in sys.stdin:            # blocks on stdin read
    ...
    req = json.loads(line)
    resp = handle_request(req)   # runs to completion, BLOCKING
    if resp is not None:
        write_json(resp)

Handlers that can block the loop

Handler                            Worst-case block
slash.exec                         45 s (_SLASH_WORKER_TIMEOUT_S)
cli.exec                           600 s (subprocess.run(timeout=600))
shell.exec                         30 s
session.resume / session.branch    seconds (synchronous _make_agent())

prompt.submit itself is fine — it spawns a threading.Thread for agent.run_conversation and returns immediately.
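
For contrast with the blocking handlers, the spawn-and-return pattern can be sketched as follows. This is an illustration, not the gateway's implementation: `run_conversation` here is a stand-in that just sleeps, and the returned Event exists only so the example can observe completion.

```python
import threading
import time


def run_conversation(text, done):
    # Stand-in for agent.run_conversation: pretend to work, then signal completion.
    time.sleep(0.3)
    done.set()


def prompt_submit(params):
    # Start the long-running work on its own thread and return immediately,
    # so the read loop is free to pick up the next request.
    done = threading.Event()
    worker = threading.Thread(
        target=run_conversation, args=(params["text"], done), daemon=True
    )
    worker.start()
    return {"status": "streaming"}, done


if __name__ == "__main__":
    t0 = time.monotonic()
    resp, done = prompt_submit({"text": "hello"})
    print(f"handler returned after {time.monotonic() - t0:.3f}s: {resp}")
    done.wait()
    print(f"conversation finished after {time.monotonic() - t0:.3f}s")
```

The handler's return time is independent of how long the conversation runs, which is why prompt.submit does not appear in the table above.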

Fix options

  1. Full thread-pool dispatch. ThreadPoolExecutor(max_workers=8), submit each request. write_json is already thread-safe. _pending/_answers already use threading.Event (since _block is called from agent threads). Risks:
    • Ordering: requests can complete out of order. Most pairs are order-independent but a couple — session.create → prompt.submit on the new sid, session.close → anything — would need care.
    • Session-state races: handlers that do read-then-write on session["..."] outside session["history_lock"] need an audit.
    • Shutdown drain for the executor (not critical, but polite).
  2. Incremental — only the known-slow handlers. _LONG_HANDLERS = {"slash.exec", "cli.exec", "shell.exec", "session.resume", "session.branch"} dispatch on a pool, everything else stays on the main thread. ~80% of the user-visible benefit, much smaller audit surface, zero ordering surprises for fast handlers.

I'd lean (2) as a first step. The slow handlers don't have inter-dependencies with other RPC state; they mostly just return a rendered payload.
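
The ordering risk in option 1 can be made concrete with a toy model. The handlers below are hypothetical stand-ins (a dict for session state, a sleep for agent construction) and say nothing about how the real session.create behaves; they only show how a dependent request can overtake its prerequisite once every request is pooled.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run(pool_everything):
    sessions = {}

    def session_create(sid):
        time.sleep(0.3)                    # stand-in for slow agent construction
        sessions[sid] = {"history": []}
        return "created"

    def prompt_submit(sid):
        return "ok" if sid in sessions else "session not found"

    if pool_everything:
        # Option 1: both requests are submitted as soon as they are read.
        with ThreadPoolExecutor(max_workers=8) as pool:
            first = pool.submit(session_create, "abc")
            second = pool.submit(prompt_submit, "abc")
            return first.result(), second.result()
    # Inline dispatch: the second request is not read until the first returns.
    return session_create("abc"), prompt_submit("abc")


if __name__ == "__main__":
    print("inline:", run(pool_everything=False))
    print("pooled:", run(pool_everything=True))
```

Inline dispatch preserves the order for free. Under full pooling the second request normally runs while the first is still sleeping, so it fails unless something else enforces the dependency.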

Context

Came up during a gateway race-condition audit (see #12371, #12441, #12416, #12444 for recent concurrency fixes). The base adapter and Discord side are now in good shape; this is the last HIGH on my list.

Happy to PR whichever direction you prefer — or if you'd rather own it, pass. Pinging you because you know the TUI's session-state invariants better than anyone.

extent analysis

TL;DR

Implementing a thread-pool dispatch for slow handlers, such as slash.exec, cli.exec, shell.exec, session.resume, and session.branch, can help prevent the UI from freezing due to blocking RPC handlers.

Guidance

  • Identify the slow handlers that block the dispatcher and consider using a ThreadPoolExecutor to run them asynchronously.
  • Audit session-state access in handlers that read and write to session["..."] outside of session["history_lock"] to prevent races.
  • Consider implementing an incremental approach, where only known-slow handlers are dispatched on a pool, to minimize the audit surface and ordering surprises.
  • Review the ordering of requests and ensure that any order-dependent pairs, such as session.create and prompt.submit, are handled correctly.

Example

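from concurrent.futures import ThreadPoolExecutor

_LONG_HANDLERS = frozenset({"slash.exec", "cli.exec", "shell.exec", "session.resume", "session.branch"})
# Create the pool once. A per-request `with` block would wait for the handler to finish.
_pool = ThreadPoolExecutor(max_workers=8)

def dispatch(req):
    # handle_request and write_json are the existing functions in tui_gateway/server.py.
    if req.get("method") in _LONG_HANDLERS:
        _pool.submit(lambda: write_json(handle_request(req)))  # worker writes the response
        return None
    return handle_request(req)  # fast handlers stay inline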

Notes

The chosen solution should balance the trade-offs between complexity, ordering, and session-state races. The incremental approach may be a good starting point, as it provides most of the user-visible benefits with a smaller audit surface.

Recommendation

Apply the incremental workaround, where only known-slow handlers are dispatched on a pool, as it provides a good balance between benefits and complexity. This approach allows for a more targeted solution with fewer ordering surprises and a smaller audit surface.
