ollama: [v0.30.0-rc] Memory Degradation: 25GB KV Cache stuck on single-concurrency RAG stack unless upstream TCP connection is physically reset

What is the issue?

In the new llama.cpp native runner architecture, VRAM is not reclaimed reliably in a single-user, long-context scenario (128K context window).

The ~25GB KV Cache is not released when a session ends or a "New Conversation" is triggered in the upstream app (Dify/RAGFlow). Instead, it stays allocated in VRAM, leaving total usage stuck at ~80GB.

Crucially, the moment the upstream Reranker service (managed by Xinference) is restarted, which forces the TCP connection to close, Ollama immediately and cleanly flushes the ~25GB KV Cache, and VRAM drops back to the level of the model's static weights (~55GB).

This suggests that the new llama.cpp runner depends on full TCP connection teardown to flush the context cache, so VRAM stays occupied whenever upstream API gateways or microservices keep the HTTP connection alive.
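One way to probe this hypothesis without restarting any container is to send a request to Ollama directly and ask for the connection to be closed afterwards. The sketch below is a diagnostic under stated assumptions, not a confirmed fix: it assumes Ollama listens on its default port 11434, uses the documented `/api/generate` endpoint, and takes a placeholder model tag that you would replace with your own. The helper names `build_request` and `generate_once` are hypothetical.

```python
import http.client
import json


def build_request(model: str, prompt: str) -> tuple[dict, bytes]:
    """Build headers and body for a non-streaming /api/generate call.

    The "Connection: close" header asks the server to tear the TCP
    connection down after the response instead of keeping it alive.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Connection": "close",
    }
    return headers, body


def generate_once(host: str, port: int, model: str, prompt: str) -> str:
    """Send one request on a fresh connection and always close it."""
    headers, body = build_request(model, prompt)
    conn = http.client.HTTPConnection(host, port, timeout=600)
    try:
        conn.request("POST", "/api/generate", body=body, headers=headers)
        payload = json.loads(conn.getresponse().read())
        return payload.get("response", "")
    finally:
        # Explicit client-side close, independent of the server's behavior.
        conn.close()


# Example call (assumed host, port, and placeholder model tag):
# generate_once("localhost", 11434, "your-model-tag", "ping")
```

If VRAM drops after a call like this but not after a session that goes through Dify/RAGFlow, that would be consistent with the keep-alive explanation, although it would not prove it.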

Steps to reproduce

  1. Spin up a full RAG stack via Docker: Dify + RAGFlow + Xinference (Reranker) + Ollama (v0.30.0 Pre-release).
  2. Load Qwen3.6:27B-BF16 and set the context window to 128K.
  3. Initiate a massive RAG query that uses the full 128K context. Total VRAM allocation reaches ~80GB (54GB static weights + 25GB KV Cache) on the Blackwell GPU.
  4. Click "New Conversation" in Dify/RAGFlow to end the session. The 25GB KV Cache is unexpectedly retained in VRAM and is still not released after several minutes.
  5. Manually run docker restart <xinference-reranker-container>.
  6. Observation: The moment the connection resets, Ollama's VRAM drops instantly back to ~55GB.
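To capture the VRAM drop in steps 4 to 6 with timestamps rather than by eye, the steps above can be accompanied by a small polling script. This is a minimal sketch that assumes `nvidia-smi` is on the PATH of the environment where it runs; the function names and the 2-second interval are illustrative choices, not part of the original report.

```python
import subprocess
import time


def parse_used_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB used) per GPU line, skipping blank lines."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Query used memory for every visible GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_used_mib(out)


def watch(interval_s: float = 2.0, samples: int = 150) -> None:
    """Print a timestamped reading every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_used_mib())
        time.sleep(interval_s)
```

Start `watch()` before clicking "New Conversation" and leave it running through the `docker restart`, so the log shows exactly when the allocation changes.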

Expected behavior

Ollama should have an internal timeout or explicit API endpoint to flush the KV Cache of an inactive session, rather than permanently locking up to 25GB of VRAM just because an upstream proxy/reranker keeps a TCP connection open or lingering.
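Until such a mechanism exists, a blunter workaround that may be worth trying is to unload the whole model through the documented `keep_alive` parameter and then confirm the result with `/api/ps`. This is a sketch under assumptions: it frees the weights as well as the KV Cache (so the next request pays the full reload cost), it has not been verified against the v0.30.0 rc runner described here, and the base URL and helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def unload_payload(model: str) -> bytes:
    """Body for /api/generate that requests an immediate unload."""
    return json.dumps(
        {"model": model, "keep_alive": 0, "stream": False}
    ).encode("utf-8")


def unload_model(model: str, base_url: str = OLLAMA_URL) -> dict:
    """Ask Ollama to unload the model (weights and KV Cache together)."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=unload_payload(model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())


def vram_by_model(ps_response: dict) -> dict[str, int]:
    """Map each loaded model name to its reported VRAM size in bytes."""
    return {m["name"]: m.get("size_vram", 0) for m in ps_response.get("models", [])}


def list_loaded(base_url: str = OLLAMA_URL) -> dict[str, int]:
    """Fetch /api/ps and summarize what is still resident."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=10) as resp:
        return vram_by_model(json.loads(resp.read()))
```

Calling `unload_model("your-model-tag")` followed by `list_loaded()` should return an empty mapping if the unload succeeded; if the model is still listed, the runner is holding it for some other reason.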

Environment

  • OS: Windows 11 (Docker Desktop via WSL2 backend)
  • GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7)
  • Ollama Version: v0.30.0-rcX (exact rc tag not specified)
  • Model: Qwen3.6:27B-BF16
  • Upstream components: Dify, RAGFlow, Xinference (hosting the Reranker)
