ollama - 💡(How to fix) Fix Report on Issues with UI Interaction with Ollama [4 comments, 3 participants]

ollama/ollama#15329, fetched 2026-04-08 02:44:04
Comments: 4, Participants: 3, Timeline events: 5 (commented ×4, labeled ×1), Reactions: 1

What is the issue?

Executive Summary

During testing, systematic failures were observed when using Ollama through UI clients (Chatbox, OpenWebUI). With identical model parameters and identical prompts, some UI clients do not receive a response, despite the fact that:

  • Ollama successfully performs inference
  • GPU utilization reaches 100%
  • CPU shows a typical compute load pattern
  • responses via CLI are consistently returned without delay

This indicates a problem at the API interaction layer between UI clients and Ollama, rather than an issue with the models or hardware. The problem is reproducible across multiple models, which rules out issues related to specific quantizations or architectures.

Additionally, it was observed that behavior becomes unstable even with a 32k context window, while previous-generation models handle significantly larger context windows reliably. This may indicate issues related to streaming response handling or context management.

Test Conditions

Parameters:

  • identical prompt
  • identical model settings
  • context window = 32k
  • no system configuration changes between runs
  • identical hardware
  • execution via:
      o CLI (ollama run)
      o Ollama API
      o Chatbox
      o OpenWebUI

Test prompt: Explain how a quantum computer works
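To keep the API run comparable with the UI clients, the request can be built once with fixed settings. The following is a minimal sketch, assuming the default local address `http://localhost:11434`, the native `/api/generate` endpoint, and the `num_ctx` option for the context window; the helper name `build_generate_request` is hypothetical, and the model tag is one of those listed in the test results.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int = 32768,
                           stream: bool = True) -> urllib.request.Request:
    """Build a POST request for /api/generate with fixed, repeatable settings."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": stream,                  # the UI clients in the report stream tokens
        "options": {"num_ctx": num_ctx},   # 32k context window, as in the report
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("qwen3.5:27b-q4_K_M",
                             "Explain how a quantum computer works")
# Sending the request needs a running server, for example:
# with urllib.request.urlopen(req) as resp:
#     for line in resp:
#         print(line)
```

Reusing one builder for every run removes accidental differences in payload between tests, so any remaining difference in behavior is more likely to come from the client.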

Observed Anomaly

In multiple cases:

  • GPU reaches 100% utilization
  • CPU initially shows high load, then decreases
  • inference is clearly performed by Ollama
  • UI does not receive token stream
  • UI continues waiting until GPU utilization drops to zero
  • no response is displayed

At the same time, CLI works correctly. This is a typical symptom of:

  • streaming connection interruption
  • chunked response processing errors
  • keep-alive connection issues
  • incorrect handling of SSE (server-sent events)
  • client waiting indefinitely for final token
  • incorrect handling of eval_duration / prompt_eval_duration
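One way to tell "no tokens arrive" apart from "tokens arrive but the client ignores them" is to read the stream directly and time the first non-empty token. The sketch below assumes the native `/api/generate` endpoint, which streams one JSON object per line with a `response` field and a final object carrying `"done": true`; the function names are hypothetical.

```python
import json
import time
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a streamed body."""
    for raw in lines:
        raw = raw.strip()
        if raw:
            yield json.loads(raw)


def first_token_delay(lines: Iterable[bytes], clock=time.monotonic):
    """Return (seconds until first non-empty 'response', full text, saw_done)."""
    start = clock()
    delay = None
    parts = []
    saw_done = False
    for chunk in iter_ndjson(lines):
        text = chunk.get("response", "")
        if text and delay is None:
            delay = clock() - start   # time to first visible token
        parts.append(text)
        if chunk.get("done"):
            saw_done = True           # the server signalled stream completion
            break
    return delay, "".join(parts), saw_done
```

The `lines` argument can be any iterable of byte lines, such as the response object returned by `urllib.request.urlopen`. A delay of `None` together with `saw_done` set to `True` would mean the stream completed without any visible text, which points away from a transport problem.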

Test Results

GLM-4.7 q6 flash

  Interface     Behavior
  CLI           generation starts immediately
  Ollama API    generation starts immediately
  Chatbox       GPU 100%, no response
  OpenWebUI     delayed start of generation

gemma4:31b-it-q4_K_M

  Interface     Behavior
  CLI           generation starts immediately
  Ollama API    ~1 second delay
  Chatbox       CPU 70% → 15-30%, no response
  OpenWebUI     CPU 70% → 15-30%, no response

(result consistently reproducible)

Qwen3.5-9b q8

  Interface     Behavior
  CLI           high CPU usage, no response
  Ollama API    generation starts immediately
  Chatbox       ~5 second delay, high CPU usage
  OpenWebUI     generation starts immediately

qwen3.5:35b-a3b-q4_K_M

  Interface     Behavior
  CLI           generation starts immediately
  Ollama API    generation starts immediately
  Chatbox       GPU 100%, no response
  OpenWebUI     ~5 second delay, high CPU usage

qwen3.5:27b-q4_K_M

  Interface     Behavior
  CLI           generation starts immediately
  Ollama API    generation starts immediately
  Chatbox       no response
  OpenWebUI     ~2 second delay

Conclusion

Recurring issues observed:

  1. UI clients do not receive token streams despite successful inference in Ollama
  2. some clients remain waiting until inference is fully completed
  3. problem reproduces across different models
  4. problem reproduces across different quantizations
  5. CLI operates correctly
  6. Ollama API operates correctly
  7. failures occur only when using UI clients

This indicates a likely issue related to:

  • Ollama streaming API
  • chunked transfer encoding handling
  • token streaming with long context windows
  • reasoning token handling
  • connection timeouts
  • incorrect stop sequence handling
  • incorrect handling of stream=true parameter
  • differences in handling reasoning models
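The reasoning-token hypothesis can be checked by separating reasoning text from answer text in a captured stream. This is a sketch under assumptions: it expects `/api/chat`-style chunks in which a reasoning model places its reasoning in `message.thinking` and the visible answer in `message.content`. Whether the models and clients in this report behave that way is not confirmed by the issue, and the function names are hypothetical.

```python
from typing import Iterable


def split_reasoning(chunks: Iterable[dict]) -> tuple[str, str]:
    """Return (reasoning text, answer text) accumulated over streamed chunks."""
    thinking, answer = [], []
    for chunk in chunks:
        msg = chunk.get("message", {})
        thinking.append(msg.get("thinking") or "")  # assumed reasoning field
        answer.append(msg.get("content") or "")     # text a UI would display
    return "".join(thinking), "".join(answer)


def chunks_before_first_content(chunks: Iterable[dict]) -> int:
    """Count chunks that arrive before any displayable content appears."""
    count = 0
    for chunk in chunks:
        if chunk.get("message", {}).get("content"):
            break
        count += 1
    return count
```

If a capture shows many chunks before the first `content`, a client that renders only `content` would appear idle while the GPU is busy, which would match the reported symptom without any transport failure.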

Items Recommended for Investigation

API layer:

  • correctness of SSE streaming implementation
  • stream completion handling for long responses
  • consistency between CLI and HTTP API behavior
  • correctness of Content-Length / Transfer-Encoding handling
  • buffer flushing behavior
  • keep-alive connection stability

Client layer:

  • correct handling of partial tokens
  • reasoning token handling
  • behavior when model emits reasoning tokens before final answer
  • handling of stream completion events
  • response timeout handling

Parameters:

  • impact of context window = 32k
  • impact of eval_duration
  • impact of prompt_eval_duration
  • behavior of reasoning models (Qwen3.5 family)
  • KV cache size impact
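For the SSE and stream-completion items, a captured response body can be replayed through a small parser to see whether a completion marker ever arrives. The sketch below assumes the OpenAI-compatible streaming format (`data: {...}` lines ending with `data: [DONE]`), which is what a UI client would typically see if it talks to Ollama through the `/v1/chat/completions` endpoint. It is simplified to one `data:` line per event, and `collect_sse_text` is a hypothetical name.

```python
import json
from typing import Iterable


def collect_sse_text(lines: Iterable[bytes]) -> tuple[str, bool]:
    """Return (concatenated delta content, whether '[DONE]' was seen)."""
    parts = []
    finished = False
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            finished = True  # explicit end-of-stream marker
            break
        event = json.loads(data)
        for choice in event.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), finished
```

A result with text but `finished` set to `False` would suggest the stream was cut off or the completion event was never delivered, which corresponds to the "client waiting indefinitely" symptom.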

Why This Matters

In the current state, using Ollama through UI clients:

  • is unstable
  • is unpredictable
  • creates the impression that models freeze
  • complicates integration into enterprise interfaces
  • slows adoption of local LLM infrastructure

CLI operation remains stable, confirming that the inference pipeline itself functions correctly.

Relevant log output

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.20.0

extent analysis

TL;DR

Investigate and fix the Ollama streaming API, focusing on SSE streaming implementation, chunked transfer encoding handling, and token streaming with long context windows.

Guidance

  • Verify the correctness of the streaming implementation in the Ollama API layer, ensuring proper handling of long responses and stream completion.
  • Investigate how the client layer handles partial tokens, reasoning tokens, and stream completion events.
  • Test the impact of context window size on the stability of the UI clients, and compare the eval_duration and prompt_eval_duration values reported in the final response across clients.
  • Check the consistency between CLI and HTTP API behavior, particularly with regard to buffer flushing and keep-alive connection stability.

Example

The issue itself includes no code example or log output; confirming the cause requires a deeper investigation into the Ollama API and the client implementations.

Notes

The problem appears to lie in the interaction between the UI clients and the Ollama API rather than in the models or hardware. The CLI operated correctly in most reported runs (the Qwen3.5-9b q8 table lists a CLI run with no response), which suggests that the inference pipeline itself is functioning properly.

Recommendation

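As a workaround, try reducing the context window size (the num_ctx option) and check whether the UI clients become more stable. Note that eval_duration and prompt_eval_duration are timing values reported in the final API response rather than adjustable parameters; comparing them across clients may help localize where the delay occurs. This may help mitigate the issue until a permanent fix is implemented.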
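To compare runs at different context sizes, the timing fields from the final response object can be converted into readable numbers. This is a minimal sketch, assuming the final object contains `prompt_eval_duration` and `eval_duration` in nanoseconds and `eval_count` as the number of generated tokens; `summarize_timings` is a hypothetical helper.

```python
def summarize_timings(final_chunk: dict) -> dict:
    """Convert nanosecond timing fields of a final response into seconds."""
    ns_per_s = 1e9
    prompt_s = final_chunk.get("prompt_eval_duration", 0) / ns_per_s
    eval_s = final_chunk.get("eval_duration", 0) / ns_per_s
    eval_count = final_chunk.get("eval_count", 0)
    return {
        "prompt_eval_s": prompt_s,   # time spent processing the prompt
        "eval_s": eval_s,            # time spent generating tokens
        "tokens_per_s": eval_count / eval_s if eval_s > 0 else None,
    }
```

If these values are similar for the direct API call and for a UI client that shows no output, the delay is more likely in how the client consumes the stream than in inference.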
