ollama: [v0.30.0-rc] Memory Degradation: 25GB KV Cache stuck on single-concurrency RAG stack unless upstream TCP connection is physically reset

ollama · 2026-05-28 08:16:42


Expected behavior

Ollama should provide an internal timeout or an explicit API endpoint that flushes the KV Cache of an inactive session, rather than holding up to 25GB of VRAM indefinitely just because an upstream proxy or reranker keeps a TCP connection open or lingering.
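
As a stopgap while no such endpoint exists, one coarse option is Ollama's documented `keep_alive` field: a `/api/generate` request with no prompt and `keep_alive` set to 0 asks the server to unload the model. The sketch below builds that request in Python. It assumes the default address `http://localhost:11434`, and the helper names are illustrative. Note that this unloads the weights together with the KV Cache, so the model must be reloaded on the next request, and whether the v0.30.0-rc runner honors it while an upstream connection is still open has not been verified here.

```python
import json
import urllib.request

# Assumption: Ollama is reachable on its default address.
OLLAMA_URL = "http://localhost:11434"


def build_unload_request(model, base_url=OLLAMA_URL):
    """Build a POST /api/generate request that carries no prompt and keep_alive=0."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def unload_model(model, base_url=OLLAMA_URL, timeout=30.0):
    """Send the unload request and return the HTTP status code."""
    req = build_unload_request(model, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `unload_model("<your-model-tag>")` after ending a conversation would free all VRAM held for that model, at the cost of a full reload later.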


What is the issue?

In the new llama.cpp native runner architecture, there is an issue with memory/VRAM reclamation during single-user, long-context scenarios (128K window).

The ~25GB KV Cache is not released when a session ends or a "New Conversation" is triggered in the upstream app (Dify/RAGFlow). Instead, it remains locked in VRAM (stuck at ~80GB total allocation).

Crucially, the moment the upstream Reranker service (managed by Xinference) is restarted, which forces the TCP connection to close, Ollama immediately and cleanly flushes the 25GB KV Cache, dropping VRAM back to the level of the model's static weights (~55GB).

This suggests that the new llama.cpp runner relies on a full TCP connection teardown to flush the context cache, so VRAM stays allocated whenever upstream API gateways or microservices keep the HTTP connection alive.
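
One way to probe this hypothesis without restarting any container is to send a request straight to Ollama over a connection that is closed as soon as the response has been read, then check whether VRAM falls back. A minimal Python sketch, assuming Ollama listens on `localhost:11434`; `one_shot_headers` and `post_once` are hypothetical helper names, not part of any Ollama client:

```python
import http.client
import json


def one_shot_headers():
    """Headers that ask the server not to keep the connection alive."""
    return {"Content-Type": "application/json", "Connection": "close"}


def post_once(host, port, path, payload, timeout=300.0):
    """POST a JSON payload, read the full response, then close the socket."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("POST", path, body=json.dumps(payload), headers=one_shot_headers())
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        # Explicit teardown, so no idle keep-alive connection is left behind.
        conn.close()
```

For example, `post_once("localhost", 11434, "/api/generate", {"model": "<your-model-tag>", "prompt": "ping", "stream": False})` issues one non-streaming generation. If VRAM is released after such calls but not after requests routed through the gateway, that would point at connection reuse in the upstream stack.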

Steps to reproduce

1. Spin up a full RAG stack via Docker: Dify + RAGFlow + Xinference (Reranker) + Ollama (v0.30.0 Pre-release).
2. Load Qwen3.6:27B-BF16 and set the context window to 128K.
3. Initiate a massive RAG query that uses the full 128K context. Total VRAM allocation reaches ~80GB (54GB static weights + 25GB KV Cache) on the Blackwell GPU.
4. Click "New Conversation" in Dify/RAGFlow to end the session. The 25GB KV Cache stays in VRAM and is still not released after several minutes.
5. Manually run docker restart <xinference-reranker-container>.
6. Observation: The moment the connection resets, Ollama's VRAM drops instantly back to ~55GB.
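
The VRAM readings in these steps can be captured with `nvidia-smi`'s CSV query mode rather than read off by eye. A minimal Python sketch, assuming `nvidia-smi` is on the PATH of the shell where it runs; the helper names are illustrative:

```python
import subprocess


def parse_memory_used(csv_text):
    """Parse `--query-gpu=memory.used --format=csv,noheader,nounits` output into MiB per GPU."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def read_memory_used_mib():
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_used(result.stdout)


def drop_mib(before, after):
    """Per-GPU amount of memory freed between two readings, in MiB."""
    return [b - a for b, a in zip(before, after)]
```

Taking one reading before the container restart and one after, then passing both to `drop_mib`, gives the amount released on each GPU, which can be compared against the expected KV Cache size.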

Environment

OS: Windows 11 (Docker Desktop via WSL2 backend)
GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7)
Ollama Version: v0.30.0-rcX (Please fill in your exact rc version tag here)
Model: Qwen3.6:27B-BF16
Upstream components: Dify, RAGFlow, Xinference (hosting the Reranker)

