hermes - ✅(Solved) Fix [Feature]: Server-side SSE token batching to fix Open WebUI streaming lag [2 pull requests, 1 participant]

bogerman1 · 2026-04-29T15:51:59Z



GitHub stats

NousResearch/hermes-agent#17537•Fetched 2026-04-30 06:46:54


When Hermes is connected to Open WebUI via the API server (/v1/responses), long streaming responses cause severe UI lag. The browser freezes, scrolling becomes choppy, and the entire page becomes unresponsive.

Root Cause

_write_sse_responses() in api_server.py sends one SSE event per token, that is, one for every response.output_text.delta. Open WebUI re-renders the full markdown on every single event. For a typical 20-second response, this means ~500 SSE events -> ~500 full markdown re-parses, each including expensive KaTeX regex scanning.

The Open WebUI side has acknowledged this (issues #20878, #18743, #13787), but the fixes discussed there (virtual scrolling, batched rendering, deferred KaTeX) are not yet implemented.

Fix Action

Fix / Workaround

# In _dispatch():
elif isinstance(it, str):
    _batch_buf.append(it)
    if _batch_timer is None:
        _batch_timer = asyncio.create_task(_batch_flush_after(0.05))

Also needs `nonlocal _batch_timer` fix:

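The _dispatch() nested function is also missing a nonlocal _batch_timer declaration. Because _dispatch() assigns to _batch_timer, Python treats the name as local to _dispatch(), so the preceding `_batch_timer is None` check raises UnboundLocalError. Adding nonlocal _batch_timer to _dispatch() fixes this.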
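The failure mode can be reproduced in isolation. The following is a minimal sketch with hypothetical names (`make_broken`, `make_fixed`, `raises_unbound_local`); it is not the actual `api_server.py` code, only the same closure pattern reduced to its essentials.

```python
def make_broken():
    timer = None

    def dispatch():
        # The assignment below makes `timer` local to dispatch(),
        # so this read happens before any local value exists.
        if timer is None:
            timer = "started"
        return timer

    return dispatch


def make_fixed():
    timer = None

    def dispatch():
        nonlocal timer  # rebind the enclosing variable instead of creating a local
        if timer is None:
            timer = "started"
        return timer

    return dispatch


def raises_unbound_local(fn):
    """Return True if calling fn() raises UnboundLocalError."""
    try:
        fn()
    except UnboundLocalError:
        return True
    return False
```

Note that `_batch_buf.append(...)` needs no such declaration, because it mutates the list without rebinding the name.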

PR fix notes

PR #17541: fix(api_server): SSE token batching + response trimming for Open WebUI performance

Repository: NousResearch/hermes-agent
Author: bogerman1
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17541

Description (problem / solution / changelog)

Summary

Fixes severe UI lag when Hermes connects to Open WebUI via the API server. Long streaming responses with tool calls cause the browser to freeze due to Open WebUI re-rendering markdown on every single SSE token event (~500 events per 20s response).

Closes #17537

Changes

1. SSE Token Batching (50ms buffer) — Core Fix

Instead of emitting one SSE event per token, buffer consecutive text deltas and flush as a single event every 50ms:

~500 SSE events → ~20 events per response
Open WebUI re-renders drop by roughly 96% (500 → 20)
The 50ms delay is intended to stay below the threshold at which users notice lag, so streaming still feels real-time
Flush triggered before tool events, EOS sentinel, and result processing
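The batching behaviour listed above can be sketched as a small helper. This is an illustration under stated assumptions, not the code from the PR: the class name `TokenBatcher`, the `emit` callback (standing in for "write one SSE event"), and the `demo()` driver are all hypothetical.

```python
import asyncio
from typing import Callable, List, Optional


class TokenBatcher:
    """Coalesce text deltas and emit them at most once per `window` seconds."""

    def __init__(self, emit: Callable[[str], None], window: float = 0.05) -> None:
        self._emit = emit          # hypothetical sink that writes one SSE event
        self._window = window
        self._buf: List[str] = []
        self._timer: Optional[asyncio.Task] = None

    def push(self, delta: str) -> None:
        """Buffer a delta; start the flush timer if none is pending."""
        self._buf.append(delta)
        if self._timer is None:
            self._timer = asyncio.create_task(self._flush_after())

    async def _flush_after(self) -> None:
        await asyncio.sleep(self._window)
        self._timer = None
        self._drain()

    def flush(self) -> None:
        """Flush immediately, e.g. before a tool event or at end of stream."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        self._drain()

    def _drain(self) -> None:
        if self._buf:
            self._emit("".join(self._buf))
            self._buf.clear()


async def demo() -> List[str]:
    events: List[str] = []
    batcher = TokenBatcher(events.append, window=0.05)
    for tok in ["Hel", "lo", ", ", "world"]:
        batcher.push(tok)          # all four arrive within one window
    await asyncio.sleep(0.2)       # let the timer fire once
    batcher.push("!")
    batcher.flush()                # explicit flush, as before a tool event
    return events
```

Keeping the timer state on an object, rather than in closure variables, is one way to sidestep the `nonlocal` pitfall entirely; whether that fits the structure of `_write_sse_responses()` is a design question for the PR.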

2. `nonlocal _batch_timer` fix

Added missing nonlocal _batch_timer declaration in _dispatch() nested function. Previously caused UnboundLocalError when batching was attempted.

3. `response.completed` Content Trimming

Trims large tool call arguments (>500 chars) and function call outputs (>1000 chars) in the response.completed SSE event. Prevents silent hangs when single SSE lines exceed 400-848KB, which Open WebUI's parser cannot handle.
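A trimming pass along these lines might look like the sketch below. It assumes the `response.completed` payload carries a `response.output` list whose items have `type` `function_call` (with an `arguments` string) or `function_call_output` (with an `output` string); the function name `trim_completed` and the truncation marker are hypothetical, and only the 500 and 1000 character thresholds come from the PR text.

```python
import copy
import json
from typing import Any, Dict

MAX_ARGS_CHARS = 500      # threshold for tool call arguments, per the PR text
MAX_OUTPUT_CHARS = 1000   # threshold for function call outputs, per the PR text


def _clip(text: str, limit: int) -> str:
    """Return text unchanged if short enough, else a prefix plus a marker."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"... [trimmed {len(text) - limit} chars]"


def trim_completed(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a response.completed payload with large fields clipped."""
    trimmed = copy.deepcopy(event)  # leave the caller's payload untouched
    for item in trimmed.get("response", {}).get("output", []):
        kind = item.get("type")
        if kind == "function_call" and isinstance(item.get("arguments"), str):
            item["arguments"] = _clip(item["arguments"], MAX_ARGS_CHARS)
        elif kind == "function_call_output" and isinstance(item.get("output"), str):
            item["output"] = _clip(item["output"], MAX_OUTPUT_CHARS)
    return trimmed
```

Working on a deep copy keeps the untrimmed data available for anything server-side that still needs the full arguments or outputs.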

4. Catch-All Exception Handlers (Both SSE Methods)

Added except Exception handlers to both _write_sse_responses() and _write_sse_chat_completion() to emit proper error events and [DONE] terminators. Prevents TransferEncodingError from incomplete chunked encoding when model API errors occur mid-stream (e.g., BadRequestError, AuthenticationError, rate limits).
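The pattern described here, always terminating the stream cleanly even when the upstream fails, can be sketched with a plain async generator. The names `safe_sse`, `failing_model`, and `collect`, as well as the exact shape of the error frame, are assumptions for illustration; the real handlers write to an aiohttp response rather than yielding strings.

```python
import asyncio
import json
from typing import AsyncIterator, List


async def safe_sse(source: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield SSE frames from `source`, always ending with a [DONE] frame."""
    try:
        async for chunk in source:
            yield f"data: {json.dumps({'delta': chunk})}\n\n"
    except Exception as exc:  # catch-all for upstream errors raised mid-stream
        payload = {"error": {"type": type(exc).__name__, "message": str(exc)}}
        yield f"event: error\ndata: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"


async def failing_model() -> AsyncIterator[str]:
    """Hypothetical upstream that fails after producing one chunk."""
    yield "partial"
    raise RuntimeError("rate limited")


async def collect() -> List[str]:
    return [frame async for frame in safe_sse(failing_model())]
```

Because `asyncio.CancelledError` derives from `BaseException`, an `except Exception` clause like this one does not swallow client disconnects or task cancellation.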

5. Request Body Size Limits

Raised MAX_REQUEST_BYTES from 1MB to 10MB for long conversations
Passed client_max_size=MAX_REQUEST_BYTES to aiohttp.Application to prevent silent 400 errors from truncated request bodies

Related Issues

Open WebUI: open-webui/open-webui#20878 (UI freezes during streaming)
Open WebUI: open-webui/open-webui#18743 (tool call JSON rendering)
Hermes Agent: #17537 (the issue this PR closes)

Testing

Tested with Open WebUI v0.9.2 (pip-installed on Windows) connected to Hermes API server via Responses mode
20-second multi-tool response: previously ~500 SSE events causing UI freeze → now ~20 events, UI stays responsive
response.completed payload reduced from 848KB to ~8KB

Changed files

gateway/platforms/api_server.py (modified, +117/-8)

PR #17552: docs: Open WebUI Filter Function + quantified performance analysis

Repository: NousResearch/hermes-agent
Author: bogerman1
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17552

Description (problem / solution / changelog)

Summary

Adds a production-ready Open WebUI Filter Function that eliminates UI lag when Hermes connects to Open WebUI via the API Server. Includes detailed performance analysis with quantified before/after metrics.

Background

When Hermes streams long responses with tool calls through Open WebUI, the browser freezes due to three compounding issues:

1. SSE event storm — each token = 1 SSE event → ~500 re-renders per 20s response
2. DOM bloat — tool call arguments (24KB+ JSON) create ~300+ DOM nodes per card
3. Giant response.completed — 400-848KB single-line SSE silently hangs parser

Server-side batching (PR #17541) solves issue 1. This Filter Function solves issues 2 and 3.

Changes

`contrib/openwebui-filter/filter-function-v3.py`

Complete Open WebUI Filter with:

Emitter beautify: 15+ tool emoji summaries (💾 path (24.5 KB) instead of raw JSON)
Output summaries: JSON → one-liner (🔍 5 results instead of {"data":{"web":[...]}})
call_id → name tracking: accurate tool name resolution across SSE event pairs
Multi-part output: processes all output parts, not just output[0]
response.completed trimming: 848KB → ~8KB (largest single performance win)
Inline-output hint: encourages Hermes to output content inline
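Two of the ideas in this list, call_id → name tracking and one-line output summaries, can be illustrated independently of Open WebUI. The sketch below is not taken from `filter-function-v3.py`: the class `ToolNameTracker`, the function `summarize_output`, and the summary wording are hypothetical, and the emoji prefixes from the PR are left out.

```python
import json
from typing import Dict, Optional


class ToolNameTracker:
    """Remember which tool each call_id belongs to, across event pairs."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}

    def on_call(self, call_id: str, name: str) -> None:
        """Record the tool name when the call event is seen."""
        self._names[call_id] = name

    def name_for(self, call_id: str) -> Optional[str]:
        """Look up the tool name when the matching output event arrives."""
        return self._names.get(call_id)


def summarize_output(raw: str) -> str:
    """Collapse a tool output into a one-line summary (hypothetical format)."""
    size_kb = len(raw.encode("utf-8")) / 1024
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return f"text ({size_kb:.1f} KB)"
    if isinstance(data, list):
        return f"{len(data)} results ({size_kb:.1f} KB)"
    if isinstance(data, dict):
        return f"{len(data)} keys ({size_kb:.1f} KB)"
    return f"value ({size_kb:.1f} KB)"
```

Rendering one short string per tool card, rather than the full JSON, is what keeps the DOM node count per card small.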

`contrib/openwebui-filter/README.md`

Comprehensive deployment guide with both persistent (file-based) and quick (API) methods.

`contrib/openwebui-filter/perf-analysis.md`

Full root cause analysis with per-layer metrics and architectural diagrams.

Performance Impact (Batching + Filter combined)

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| SSE events per 20s response | ~500 | ~20 | -96% |
| DOM nodes per tool card | ~300+ | ~5 | -98% |
| Frame render time | 600ms | ~80ms | -87% |
| response.completed payload | 848 KB | ~8 KB | -99% |
| CPU during streaming | 100% (frozen) | <20% | solved |
| UI freezing | yes | none | solved |
| TransferEncodingError | occasional | eliminated | solved |
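The percentages in the Improvement column follow directly from the Before and After values. A short check, using a hypothetical helper `pct_change` and treating the approximate figures ("~500", "~300+") as exact:

```python
def pct_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (before, after) pairs copied from the numeric rows of the table
rows = {
    "SSE events per 20s response": (500, 20),
    "DOM nodes per tool card": (300, 5),
    "Frame render time (ms)": (600, 80),
    "response.completed payload (KB)": (848, 8),
}

changes = {name: pct_change(b, a) for name, (b, a) in rows.items()}
```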

Data Sources

Open WebUI issue #20878 — Safari profiling (v0.7.2 → v0.8.9)
Hermes PR #17541 — server-side batching measurements
Filter Function DOM inspection via browser console diagnostics

Related Issues

Closes: #17537 (Feature Request: SSE batching)
Related: #17541 (Server-side batching PR)
Related: open-webui/open-webui#20878, open-webui/open-webui#21884

Changed files

contrib/openwebui-filter/README.md (added, +90/-0)
contrib/openwebui-filter/filter-function-v3.py (added, +349/-0)
contrib/openwebui-filter/perf-analysis.md (added, +133/-0)
gateway/platforms/api_server.py (modified, +117/-8)


Description


Proposed Fix: Server-Side Token Batching

Add a 50ms token buffer in _write_sse_responses(). Instead of immediately emitting every text delta, buffer consecutive text tokens and flush as a single SSE event every 50ms:

# In _dispatch():
elif isinstance(it, str):
    _batch_buf.append(it)
    if _batch_timer is None:
        _batch_timer = asyncio.create_task(_batch_flush_after(0.05))

Impact: ~500 SSE events -> ~20 events, so Open WebUI re-renders drop by roughly 96%. The 50ms delay is intended to stay below the threshold at which users notice lag, so streaming still feels real-time.

Flush triggers needed:

Before tool events (__tool_started__, __tool_completed__) to maintain ordering
Before EOS sentinel to flush final tokens
Before agent result processing
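The ordering requirement behind these flush triggers can be shown without any timer at all. The sketch below models the stream as a list of items and coalesces text only between boundaries; `coalesce`, the `Item` type, and the `__eos__` sentinel value are assumptions for illustration, while `__tool_started__` and `__tool_completed__` are the event names mentioned above.

```python
from typing import Iterable, List, Tuple, Union

Item = Union[str, Tuple[str, str]]   # a text delta, or (event_kind, payload)
EOS = ("__eos__", "")                # hypothetical end-of-stream sentinel


def coalesce(items: Iterable[Item]) -> List[Tuple[str, str]]:
    """Merge consecutive text deltas; flush before any non-text item."""
    out: List[Tuple[str, str]] = []
    buf: List[str] = []

    def flush() -> None:
        if buf:
            out.append(("text", "".join(buf)))
            buf.clear()

    for it in items:
        if isinstance(it, str):
            buf.append(it)
        else:
            flush()          # keep buffered text ahead of the tool event / EOS
            out.append(it)
    flush()                  # safety net if the stream ends without EOS
    return out
```

For example, `["Let ", "me ", "check.", ("__tool_started__", "search"), "Done", ".", EOS]` yields one text event before the tool event and one after it, so the text never appears out of order relative to the tool card.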


Additional Context

Open WebUI issue tracking same problem: open-webui/open-webui#20878
The response.completed event also needs content trimming (848KB+ single SSE lines cause silent hangs)
Same batching applies to Chat Completions endpoint (_write_sse_chat_completion())
Hermes skill already documents the fix in detail; this issue tracks implementation in main repo

Environment

Hermes Agent: latest
Open WebUI: v0.9.x (pip-installed on Windows)
Connection: WSL2 Hermes -> Windows Open WebUI via API server (Responses mode)
Browser: Chrome/Firefox both affected

extent analysis

TL;DR

Implement server-side token batching in _write_sse_responses() to reduce the number of SSE events and alleviate UI lag.

Guidance

Introduce a 50ms token buffer in _write_sse_responses() to batch consecutive text tokens and flush as a single SSE event.
Add flush triggers before tool events, EOS sentinel, and agent result processing to maintain ordering.
Declare _batch_timer as nonlocal in the _dispatch() nested function to fix the UnboundLocalError.
Consider applying the same batching to the Chat Completions endpoint (_write_sse_chat_completion()).

Example

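import asyncio

def make_dispatcher(emit):  # emit: hypothetical callback that writes one SSE event
    _batch_buf = []
    _batch_timer = None

    async def _batch_flush_after(delay):
        nonlocal _batch_timer
        await asyncio.sleep(delay)
        if _batch_buf:
            emit("".join(_batch_buf))  # one event for everything buffered
            _batch_buf.clear()
        _batch_timer = None  # allow the next token to start a new timer

    def _dispatch(it):
        # Without this declaration the assignment below makes _batch_timer
        # local to _dispatch() and the `is None` check raises UnboundLocalError.
        nonlocal _batch_timer
        if isinstance(it, str):
            _batch_buf.append(it)
            if _batch_timer is None:
                # Must be called while an event loop is running.
                _batch_timer = asyncio.create_task(_batch_flush_after(0.05))

    return _dispatch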

Notes

The proposed fix assumes that the 50ms buffering delay is below the human perception threshold, ensuring real-time streaming. However, this value may need to be adjusted based on specific use cases.
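How many events a given window produces depends on how tokens arrive, which is worth checking before tuning the value. The sketch below counts flushes for a timer that starts on the first buffered token, matching the proposed scheme; the function `count_flushes` and both arrival patterns are hypothetical and are not measurements from the PR.

```python
from typing import List


def count_flushes(arrivals_ms: List[int], window_ms: int) -> int:
    """Count SSE events when a timer of length `window_ms` starts on the
    first buffered token and everything buffered is flushed when it fires."""
    flushes = 0
    deadline = None
    for t in sorted(arrivals_ms):
        if deadline is None or t >= deadline:
            # The previous batch (if any) has flushed; this token opens a new one.
            flushes += 1
            deadline = t + window_ms
    return flushes


# 500 tokens spread evenly over 20 seconds (one every 40 ms)
uniform = list(range(0, 20000, 40))

# 500 tokens in 20 bursts of 25 tokens, 1 ms apart, one burst per second
bursty = [b * 1000 + i for b in range(20) for i in range(25)]
```

With a 50 ms window, the evenly spaced pattern gives 250 events while the bursty one gives 20, so the reduction seen in practice will vary with the model's output rate and with how much of the response time is spent in tool calls.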

Recommendation

Apply the workaround by implementing server-side token batching, as it directly addresses the root cause of the UI lag issue.
