hermes - ✅(Solved) Fix [Feature]: Server-side SSE token batching to fix Open WebUI streaming lag [2 pull requests, 1 participants]

GitHub stats
NousResearch/hermes-agent#17537 (fetched 2026-04-30 06:46:54)
Comments: 0 · Participants: 1 · Timeline: 5 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2

When Hermes is connected to Open WebUI via the API server (/v1/responses), long streaming responses cause severe UI lag. The browser freezes, scrolling becomes choppy, and the entire page becomes unresponsive.

Root Cause

api_server.py:_write_sse_responses() sends one SSE event per token (one for every response.output_text.delta). Open WebUI re-renders the full markdown on every event. For a typical 20-second response, this means ~500 SSE events -> ~500 full markdown re-parses, including expensive KaTeX regex scanning.

The Open WebUI project has acknowledged the problem (issues #20878, #18743, #13787), but the proposed fixes (virtual scrolling, batched rendering, deferred KaTeX) are not yet implemented.

Fix Action

Fix / Workaround

# In _dispatch():
elif isinstance(it, str):
    _batch_buf.append(it)
    if _batch_timer is None:
        _batch_timer = asyncio.create_task(_batch_flush_after(0.05))

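Also needs nonlocal _batch_timer fix:

The _dispatch() nested function is also missing a nonlocal _batch_timer declaration. Because _dispatch() assigns to _batch_timer, Python treats the name as local to _dispatch(), so the "if _batch_timer is None" check raises UnboundLocalError. Add nonlocal _batch_timer at the top of _dispatch().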
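The UnboundLocalError described above can be reproduced in isolation. This is a minimal sketch, not the Hermes code: make_broken and make_fixed are hypothetical names, and a plain string stands in for the asyncio task.

```python
def make_broken():
    timer = None

    def dispatch():
        # The assignment below makes `timer` local to dispatch(),
        # so this read happens before any local value exists.
        if timer is None:
            timer = "started"
        return timer

    return dispatch


def make_fixed():
    timer = None

    def dispatch():
        nonlocal timer  # rebind the enclosing function's variable instead
        if timer is None:
            timer = "started"
        return timer

    return dispatch


try:
    make_broken()()
    broken_error = None
except UnboundLocalError as exc:
    broken_error = exc

fixed_result = make_fixed()()
print(type(broken_error).__name__, fixed_result)
```

Only names that are assigned inside the nested function need the declaration. A list that is mutated in place (append, clear) but never rebound does not.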

PR fix notes

PR #17541: fix(api_server): SSE token batching + response trimming for Open WebUI performance

Description (problem / solution / changelog)

Summary

Fixes severe UI lag when Hermes connects to Open WebUI via the API server. Long streaming responses with tool calls cause the browser to freeze due to Open WebUI re-rendering markdown on every single SSE token event (~500 events per 20s response).

Closes #17537

Changes

1. SSE Token Batching (50ms buffer) — Core Fix

Instead of emitting one SSE event per token, buffer consecutive text deltas and flush as a single event every 50ms:

  • ~500 SSE events → ~20 events per response
  • Open WebUI re-renders drop by 95%
  • 50ms is below human perception threshold — streaming still feels real-time
  • Flush triggered before tool events, EOS sentinel, and result processing
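The batching behaviour listed above can be sketched as a small helper. This is an illustration under assumptions, not the code from the PR: DeltaBatcher, add, flush, and the emit callback are hypothetical names, and emit stands in for whatever writes one SSE event to the client.

```python
import asyncio


class DeltaBatcher:
    """Hypothetical helper: buffers text deltas and flushes them on a timer."""

    def __init__(self, emit, interval=0.05):
        self._emit = emit          # async callable that writes one SSE event
        self._interval = interval  # flush window in seconds (50 ms in the PR notes)
        self._buf = []
        self._timer = None

    def add(self, delta):
        self._buf.append(delta)
        if self._timer is None:
            # Start one timer per window; later deltas join the same batch.
            self._timer = asyncio.create_task(self._flush_after())

    async def _flush_after(self):
        await asyncio.sleep(self._interval)
        self._timer = None
        await self.flush()

    async def flush(self):
        """Emit everything buffered so far as a single event."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        if self._buf:
            text = "".join(self._buf)
            self._buf.clear()  # clear before awaiting so new deltas are not lost
            await self._emit(text)


async def demo():
    events = []

    async def emit(text):
        events.append(text)

    batcher = DeltaBatcher(emit, interval=0.05)
    for token in ["Hel", "lo", ", ", "world"]:
        batcher.add(token)    # all four arrive within one window
    await asyncio.sleep(0.1)  # let the timer fire
    batcher.add("!")
    await batcher.flush()     # explicit flush, e.g. before a tool event
    return events


events = asyncio.run(demo())
print(events)
```

The buffer is joined and cleared before the await so that deltas arriving while the event is being written start a fresh batch instead of being dropped.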

2. nonlocal _batch_timer fix

Added missing nonlocal _batch_timer declaration in _dispatch() nested function. Previously caused UnboundLocalError when batching was attempted.

3. response.completed Content Trimming

Trims large tool call arguments (>500 chars) and function call outputs (>1000 chars) in the response.completed SSE event. Prevents silent hangs when single SSE lines exceed 400-848KB, which Open WebUI's parser cannot handle.
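A rough sketch of this trimming step follows. The payload shape (a "response" object whose "output" list holds "function_call" and "function_call_output" items) is an assumption modelled on the Responses API item types, and trim_completed_event is a hypothetical name. Only the 500 and 1000 character thresholds come from the PR notes.

```python
import copy

MAX_ARGS_CHARS = 500     # threshold for tool call arguments, from the PR notes
MAX_OUTPUT_CHARS = 1000  # threshold for function call outputs, from the PR notes


def _trim(text, limit):
    """Cut text to `limit` characters and note how much was dropped."""
    if not isinstance(text, str) or len(text) <= limit:
        return text
    return text[:limit] + f"... [trimmed {len(text) - limit} chars]"


def trim_completed_event(event):
    """Return a copy of a response.completed payload with large fields trimmed."""
    trimmed = copy.deepcopy(event)  # leave the caller's payload untouched
    for item in trimmed.get("response", {}).get("output", []):
        if item.get("type") == "function_call" and "arguments" in item:
            item["arguments"] = _trim(item["arguments"], MAX_ARGS_CHARS)
        elif item.get("type") == "function_call_output" and "output" in item:
            item["output"] = _trim(item["output"], MAX_OUTPUT_CHARS)
    return trimmed


event = {
    "type": "response.completed",
    "response": {
        "output": [
            {"type": "function_call", "call_id": "c1", "name": "write_file",
             "arguments": "x" * 24000},
            {"type": "function_call_output", "call_id": "c1", "output": "ok"},
        ]
    },
}
small = trim_completed_event(event)
print(len(small["response"]["output"][0]["arguments"]))
```

One caveat of this approach: trimmed arguments are no longer valid JSON, so it only suits clients that display the final event and do not re-parse the arguments.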

4. Catch-All Exception Handlers (Both SSE Methods)

Added except Exception handlers to both _write_sse_responses() and _write_sse_chat_completion() to emit proper error events and [DONE] terminators. Prevents TransferEncodingError from incomplete chunked encoding when model API errors occur mid-stream (e.g., BadRequestError, AuthenticationError, rate limits).
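The idea can be sketched with an async generator that always ends the stream cleanly. The event shapes here are illustrative assumptions, not the payloads Hermes emits, and sse_stream and failing_model are hypothetical names.

```python
import asyncio
import json


async def sse_stream(deltas):
    """Yield SSE lines for `deltas`, ending with a [DONE] terminator even on error."""
    try:
        async for delta in deltas:
            body = json.dumps({"delta": delta})
            yield f"data: {body}\n\n"
    except Exception as exc:  # e.g. an upstream model API error mid-stream
        body = json.dumps({"error": {"type": type(exc).__name__, "message": str(exc)}})
        yield f"data: {body}\n\n"
    # Reached after normal completion and after a handled error.
    yield "data: [DONE]\n\n"


async def failing_model():
    yield "Hel"
    yield "lo"
    raise RuntimeError("rate limit exceeded")


async def collect():
    return [line async for line in sse_stream(failing_model())]


lines = asyncio.run(collect())
for line in lines:
    print(line, end="")
```

Catching Exception rather than BaseException lets task cancellation and generator shutdown propagate normally, so a client disconnect is not turned into an error event.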

5. Request Body Size Limits

  • Raised MAX_REQUEST_BYTES from 1MB to 10MB for long conversations
  • Passed client_max_size=MAX_REQUEST_BYTES to aiohttp.Application to prevent silent 400 errors from truncated request bodies

Related Issues

  • Open WebUI: open-webui/open-webui#20878 (UI freezes during streaming)
  • Open WebUI: open-webui/open-webui#18743 (tool call JSON rendering)
  • Hermes Agent: #17537 (the issue this PR closes)

Testing

  • Tested with Open WebUI v0.9.2 (pip-installed on Windows) connected to Hermes API server via Responses mode
  • 20-second multi-tool response: previously ~500 SSE events causing UI freeze → now ~20 events, UI stays responsive
  • response.completed payload reduced from 848KB to ~8KB

Changed files

  • gateway/platforms/api_server.py (modified, +117/-8)

PR #17552: docs: Open WebUI Filter Function + quantified performance analysis

Description (problem / solution / changelog)

Summary

Adds a production-ready Open WebUI Filter Function that eliminates UI lag when Hermes connects to Open WebUI via the API Server. Includes detailed performance analysis with quantified before/after metrics.

Background

When Hermes streams long responses with tool calls through Open WebUI, the browser freezes due to three compounding issues:

  1. SSE event storm — each token = 1 SSE event → ~500 re-renders per 20s response
  2. DOM bloat — tool call arguments (24KB+ JSON) create ~300+ DOM nodes per card
  3. Giant response.completed — 400-848KB single-line SSE silently hangs parser

Server-side batching (PR #17541) solves issue 1. This Filter Function solves issues 2 and 3.

Changes

contrib/openwebui-filter/filter-function-v3.py

Complete Open WebUI Filter with:

  • Emitter beautify: 15+ tool emoji summaries (💾 path (24.5 KB) instead of raw JSON)
  • Output summaries: JSON → one-liner (🔍 5 results instead of {"data":{"web":[...]}})
  • call_id → name tracking: accurate tool name resolution across SSE event pairs
  • Multi-part output: processes all output parts, not just output[0]
  • response.completed trimming: 848KB → ~8KB (largest single performance win)
  • Inline-output hint: encourages Hermes to output content inline
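Two of the features above, call_id → name tracking and one-line output summaries, can be sketched as follows. This does not reproduce filter-function-v3.py or the Open WebUI Filter interface. ToolNameTracker and summarize_output are hypothetical names, and the item fields are assumed to follow the function_call / function_call_output pairing.

```python
import json


class ToolNameTracker:
    """Remembers which tool name belongs to each call_id."""

    def __init__(self):
        self._names = {}

    def observe(self, item):
        # A function_call item carries both the call_id and the tool name.
        if item.get("type") == "function_call" and "call_id" in item:
            self._names[item["call_id"]] = item.get("name", "unknown")

    def name_for(self, item):
        # A function_call_output item carries only the call_id.
        return self._names.get(item.get("call_id"), "unknown")


def summarize_output(name, output):
    """Collapse a tool output string into a one-line summary."""
    try:
        data = json.loads(output)
    except (TypeError, ValueError):
        return f"{name}: {len(str(output))} chars"
    web = None
    if isinstance(data, dict) and isinstance(data.get("data"), dict):
        web = data["data"].get("web")
    if isinstance(web, list):
        return f"🔍 {len(web)} results"
    return f"{name}: {len(output)} chars"


tracker = ToolNameTracker()
tracker.observe({"type": "function_call", "call_id": "c1", "name": "web_search"})
result_item = {
    "type": "function_call_output",
    "call_id": "c1",
    "output": json.dumps({"data": {"web": [{"url": f"https://example.com/{i}"} for i in range(5)]}}),
}
tool_name = tracker.name_for(result_item)
summary = summarize_output(tool_name, result_item["output"])
print(tool_name, summary)
```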

contrib/openwebui-filter/README.md

Comprehensive deployment guide with both persistent (file-based) and quick (API) methods.

contrib/openwebui-filter/perf-analysis.md

Full root cause analysis with per-layer metrics and architectural diagrams.

Performance Impact (Batching + Filter combined)

Metric | Before | After | Improvement
SSE events per 20s response | ~500 | ~20 | -96%
DOM nodes per tool card | ~300+ | ~5 | -98%
Frame render time | 600ms | ~80ms | -87%
response.completed payload | 848 KB | ~8 KB | -99%
CPU during streaming | 100% (frozen) | <20% | solved
UI freezing | yes | none | solved
TransferEncodingError | occasional | eliminated | solved
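The percentage column can be checked from the before and after values. The short calculation below treats the approximate figures (~500, ~300+, ~80ms and so on) as exact, which is an assumption made only for this check.

```python
def reduction_pct(before, after):
    """Percentage reduction from `before` to `after`."""
    return (before - after) / before * 100


metrics = {
    "SSE events per 20s response": (500, 20),
    "DOM nodes per tool card": (300, 5),
    "Frame render time (ms)": (600, 80),
    "response.completed payload (KB)": (848, 8),
}

reductions = {
    name: round(reduction_pct(before, after), 1)
    for name, (before, after) in metrics.items()
}
for name, pct in reductions.items():
    print(f"{name}: -{pct}%")
```

Rounded to whole percentages, the results match the Improvement column.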

Data Sources

  • Open WebUI issue #20878 — Safari profiling (v0.7.2 → v0.8.9)
  • Hermes PR #17541 — server-side batching measurements
  • Filter Function DOM inspection via browser console diagnostics

Related Issues

  • Closes: #17537 (Feature Request: SSE batching)
  • Related: #17541 (Server-side batching PR)
  • Related: open-webui/open-webui#20878, open-webui/open-webui#21884

Changed files

  • contrib/openwebui-filter/README.md (added, +90/-0)
  • contrib/openwebui-filter/filter-function-v3.py (added, +349/-0)
  • contrib/openwebui-filter/perf-analysis.md (added, +133/-0)
  • gateway/platforms/api_server.py (modified, +117/-8)

Proposed Fix: Server-Side Token Batching

Add a 50ms token buffer in _write_sse_responses(). Instead of immediately emitting every text delta, buffer consecutive text tokens and flush as a single SSE event every 50ms:

# In _dispatch():
elif isinstance(it, str):
    _batch_buf.append(it)
    if _batch_timer is None:
        _batch_timer = asyncio.create_task(_batch_flush_after(0.05))

Impact: ~500 SSE events -> ~20 events. Open WebUI re-renders drop by 95%. 50ms is below human perception threshold so streaming still feels real-time.

Flush triggers needed:

  • Before tool events (__tool_started__, __tool_completed__) to maintain ordering
  • Before EOS sentinel to flush final tokens
  • Before agent result processing
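The ordering requirement behind these flush triggers can be shown without any timer. In this sketch, tool events are assumed to be tuples whose first element is the marker, and the EOS sentinel is assumed to be None. Both representations are assumptions for illustration, and dispatch_all is a hypothetical name.

```python
def dispatch_all(items):
    """Turn a mixed stream of text tokens and tool markers into ordered events."""
    events = []
    buf = []

    def flush():
        if buf:
            events.append(("text", "".join(buf)))
            buf.clear()

    for it in items:
        if isinstance(it, str):
            buf.append(it)         # text tokens are buffered, not emitted
        elif it is None:           # EOS sentinel (assumed to be None here)
            flush()                # final tokens go out before the terminator
            events.append(("done", None))
        else:                      # tool event, e.g. ("__tool_started__", name)
            flush()                # buffered text must precede the tool event
            events.append(it)
    flush()                        # covers a stream that ends without a sentinel
    return events


stream = [
    "Let me ", "check.",
    ("__tool_started__", "web_search"),
    ("__tool_completed__", "web_search"),
    "Done", ".",
    None,
]
ordered = dispatch_all(stream)
for event in ordered:
    print(event)
```

Without the flush before each tool event, text produced before a tool call could reach the client after the tool card and appear out of order.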

Additional Context

  • Open WebUI issue tracking same problem: open-webui/open-webui#20878
  • The response.completed event also needs content trimming (848KB+ single SSE lines cause silent hangs)
  • Same batching applies to Chat Completions endpoint (_write_sse_chat_completion())
  • Hermes skill already documents the fix in detail; this issue tracks implementation in main repo

Environment

  • Hermes Agent: latest
  • Open WebUI: v0.9.x (pip-installed on Windows)
  • Connection: WSL2 Hermes -> Windows Open WebUI via API server (Responses mode)
  • Browser: Chrome/Firefox both affected

extent analysis

TL;DR

Implement server-side token batching in _write_sse_responses() to reduce the number of SSE events and alleviate UI lag.

Guidance

  • Introduce a 50ms token buffer in _write_sse_responses() to batch consecutive text tokens and flush as a single SSE event.
  • Add flush triggers before tool events, EOS sentinel, and agent result processing to maintain ordering.
  • Declare _batch_timer as nonlocal in the _dispatch() nested function to fix the UnboundLocalError.
  • Consider applying the same batching to the Chat Completions endpoint (_write_sse_chat_completion()).

Example

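import asyncio

async def _write_sse_responses(items, emit):
    # Sketch only: `items` (async iterator) and `emit` (async SSE writer) are stand-ins.
    _batch_buf = []
    _batch_timer = None

    async def _batch_flush_after(delay):
        nonlocal _batch_timer
        await asyncio.sleep(delay)
        _batch_timer = None
        if _batch_buf:
            text = "".join(_batch_buf)
            _batch_buf.clear()
            await emit(text)  # one SSE event for the whole batch

    def _dispatch(it):
        nonlocal _batch_timer  # without this, the check below raises UnboundLocalError
        if isinstance(it, str):
            _batch_buf.append(it)
            if _batch_timer is None:
                _batch_timer = asyncio.create_task(_batch_flush_after(0.05))
        # ... tool events and the EOS sentinel would flush the buffer first

    async for it in items:
        _dispatch(it)
    if _batch_timer is not None:
        await _batch_timer  # flush the final tokens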

Notes

The proposed fix assumes that the 50ms buffering delay is below the human perception threshold, ensuring real-time streaming. However, this value may need to be adjusted based on specific use cases.

Recommendation

Apply the workaround by implementing server-side token batching, as it directly addresses the root cause of the UI lag issue.
