ollama - ✅(Solved) Fix KV cache completely non-functional on CPU backend: every /api/chat request re-evaluates all tokens from scratch [1 pull requests, 1 comments, 2 participants]

ollama · 2026-03-11 09:00:52

GitHub stats

ollama/ollama#14780 • Fetched 2026-04-08 00:31:45

Author: zhener562

Participants: Lightspace260, zhener562

Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, subscribed ×1

Fix Action

Fix proposed (PR open, not merged at fetch time)

Proposed fix PR: fix: enable KV cache for CPU mode (https://github.com/ollama/ollama/pull/14791)

PR fix notes

PR #14791: fix: enable KV cache for CPU mode

Repository: ollama/ollama
Author: Lightspace260
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/14791

Description (problem / solution / changelog)

Description

This PR fixes an issue where the KV cache was not being properly initialized when running in CPU-only mode. Previously, the cache was only enabled if a GPU was detected, leading to significant performance degradation for CPU users.

Key Changes

Forced KV cache initialization even when no GPU is present.
Added a fallback in-memory cache for CPU mode.
Ensured enabled flag is correctly set for CPU backend.

Performance

Token generation latency improved from ~2s to ~0.5s on local CPU tests.

/claim #14780

Changed files

llm/server.go (modified, +6/-0)
runner/ollamarunner/cache.go (modified, +33/-24)
runner/ollamarunner/runner.go (modified, +3/-2)


Title: KV cache completely non-functional on CPU backend: every /api/chat request re-evaluates all tokens from scratch

Body:

Summary

With the current --ollama-engine (v0.17.1), /api/chat performs a full prompt re-evaluation on every turn when running on CPU. Identical requests sent consecutively take the same time: the KV cache prefix match is not being used at all.

This causes prompt evaluation time to grow linearly with conversation length, making multi-turn chat increasingly slow.

Environment

Ollama version: 0.17.1
OS: Ubuntu 24.04.2 LTS
CPU: AMD Ryzen 5 8600G (6 cores)
GPU: None (CPU-only mode, offloaded 0/33 layers to GPU)
Runner: --ollama-engine
Model: qwen3.5:9b

Reproduction

Send the same single-turn request 3 times:

import json, urllib.request                               
                                                                                                                                                                                      
def chat(msgs):
    body = json.dumps({"model": "qwen3.5:9b", "messages": msgs, "stream": False,                                                                                                      
                       "options": {"num_predict": 5}}).encode()                                                                                                                       
    req = urllib.request.Request("http://localhost:11434/api/chat",                                                                                                                   
                                 data=body, headers={"Content-Type": "application/json"})                                                                                             
    with urllib.request.urlopen(req) as r:                                                                                                                                            
        return json.loads(r.read())                                                                                                                                                   
                                                                                                                                                                                      
msgs = [{"role": "user", "content": "What is 2+2?"}]                                                                                                                                  
for i in range(3):                                                                                                                                                                    
    r = chat(msgs)                                                                                                                                                                    
    print(f"Request {i+1}: pe_count={r['prompt_eval_count']}, "
          f"pe_ms={r['prompt_eval_duration']/1e6:.1f}ms")

Expected: 2nd and 3rd requests are nearly instant (cache hit).

Actual:

Request 1: pe_count=17, pe_ms=583.7ms                                                                                                                                                 
Request 2: pe_count=17, pe_ms=627.5ms  ← no speedup                                                                                                                                   
Request 3: pe_count=17, pe_ms=583.4ms  ← no speedup
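
The three timings above can be turned into a simple pass/fail signal by comparing each repeat against the first request. This is a minimal sketch: the helper name looks_cached and the 50% threshold are assumptions made for illustration, not part of the issue or of Ollama.

```python
def looks_cached(pe_ms, threshold=0.5):
    """Flag each repeated request whose prompt-eval time fell below
    threshold * first-request time (hypothetical heuristic, not an Ollama API)."""
    first = pe_ms[0]
    return [t < threshold * first for t in pe_ms[1:]]

# Timings reported in the issue for three identical CPU requests
cpu_pe_ms = [583.7, 627.5, 583.4]
print(looks_cached(cpu_pe_ms))  # [False, False]
```

Both repeats come back False, which matches the "no speedup" annotations in the output above.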

Multi-turn evidence: pe_count is cumulative total, not new tokens

Turn  pe_count  pe_ms   ratio_vs_turn1                    
   1        11   423ms   1.00x                                                                                                                                                        
   2        22   703ms   1.66x                                                                                                                                                        
   3        33   987ms   2.33x                                                                                                                                                        
   4        44  1256ms   2.97x                                                                                                                                                        
   5        55  1542ms   3.64x

pe_count grows as the cumulative total of all tokens — all tokens are re-evaluated every turn.
pe_ms grows by roughly 270 to 285 ms for every 11 tokens added (about 25 ms/token), so evaluation time scales with the full prompt length, confirming the full prompt is processed each time.
If cache were working, pe_count would reflect only the newly added tokens (~10 per turn).
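
The linear growth can be checked directly from the table. The sketch below computes the extra milliseconds paid per extra token between consecutive turns; the function name marginal_ms_per_token is an assumption for illustration, and the numbers are copied from the table above.

```python
# (pe_count, pe_ms) pairs copied from the multi-turn table
turns = [(11, 423), (22, 703), (33, 987), (44, 1256), (55, 1542)]

def marginal_ms_per_token(rows):
    """Milliseconds added per token added between consecutive turns."""
    return [(ms2 - ms1) / (n2 - n1)
            for (n1, ms1), (n2, ms2) in zip(rows, rows[1:])]

per_token = marginal_ms_per_token(turns)
print([round(x, 1) for x in per_token])  # [25.5, 25.8, 24.5, 26.0]
```

A near-constant marginal cost across turns is what full re-evaluation looks like: every turn pays again for the whole history instead of only for its new tokens.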

Contrast: /api/generate with context reuse works correctly

# Pass context tokens back each turn                      
r = post("/api/generate", {"model": ..., "prompt": ..., "context": ctx})                                                                                                              
ctx = r["context"]

Turn 1→2: pe_ms 539ms → 598ms  (+59ms for 19 new tokens = cache HIT ✓)
Turn 2→3: pe_ms 598ms → 1076ms (+478ms for 19 new tokens ≈ 25ms/tok ✓)

The /api/generate + context path correctly reuses the KV cache. Only /api/chat is broken on CPU.
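
For readers who want to repeat the /api/generate comparison, here is a fuller sketch of the context-passing loop. The post helper, the generate_turns name, the injectable send parameter, and the choice of model and num_predict (taken from the /api/chat reproduction) are assumptions; the issue only shows the abbreviated snippet above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # local address used in the issue

def post(path, payload, url=OLLAMA_URL):
    """POST a JSON payload to the Ollama server and decode the JSON reply."""
    req = urllib.request.Request(url + path, data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

def generate_turns(prompts, model="qwen3.5:9b", send=post):
    """Run prompts in order, feeding each reply's context into the next request.
    Returns the prompt-eval time of every turn in milliseconds."""
    ctx, timings = [], []
    for prompt in prompts:
        payload = {"model": model, "prompt": prompt, "stream": False,
                   "options": {"num_predict": 5}}
        if ctx:
            payload["context"] = ctx  # reuse the tokens returned by the last turn
        r = send("/api/generate", payload)
        ctx = r.get("context", [])
        timings.append(r["prompt_eval_duration"] / 1e6)  # ns -> ms
    return timings
```

Against a running server, generate_turns(["What is 2+2?", "And 3+3?"]) returns one timing per turn; the send parameter exists only so the loop can be exercised without a server.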

GPU comparison (ROCm backend — cache works)

After enabling ROCm (HSA_OVERRIDE_GFX_VERSION=11.0.0):

Turn  pe_count  ms/token                                  
   1        17    19.2   ← baseline                                                                                                                                                   
   2        42     8.2   ← DECREASING = cache hit, only new tokens evaluated                                                                                                          
   3        67     7.2                                                                                                                                                                
  10       252     5.2   ← still decreasing

ms/token decreasing over turns proves the GPU backend correctly uses prefix-match KV cache.
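
Multiplying the two columns of the GPU table gives an approximate total prompt-eval time per turn, which makes the cache hit easier to see. This is a back-of-the-envelope sketch; the totals are derived from the rounded ms/token figures above and are not measurements reported in the issue.

```python
# (turn, pe_count, ms_per_token) rows copied from the GPU table
gpu_rows = [(1, 17, 19.2), (2, 42, 8.2), (3, 67, 7.2), (10, 252, 5.2)]

# Approximate total prompt-eval time per turn, in ms
totals = [round(count * ms_per_tok, 1) for _, count, ms_per_tok in gpu_rows]
print(totals)  # [326.4, 344.4, 482.4, 1310.4]
```

Going from 17 to 42 tokens costs only about 18 ms more in total, far less than re-evaluating 42 tokens at the turn-1 rate would.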

Conclusion

Backend                          KV cache working?
CPU (--ollama-engine)            No: full re-eval every request
GPU/ROCm (--ollama-engine)       Yes: prefix-match working
/api/generate + context (CPU)    Yes: explicit context reuse works

The CPU backend of the new engine appears to not maintain the KV cache between /api/chat requests. The prefix-match logic in cache.go (findLongestCacheSlot / countCommonPrefix) does not appear to be triggered, or the cache is cleared between requests.
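
To make the prefix-match idea concrete, here is a conceptual Python sketch. Ollama's runner is written in Go, and this is not its code: the two function names are borrowed from the issue, while their bodies, signatures, and the slot representation (plain lists of token ids) are assumptions for illustration.

```python
def count_common_prefix(cached, prompt):
    """Length of the leading run of token ids shared by both sequences."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def find_longest_cache_slot(slots, prompt):
    """Return (slot_index, reusable_token_count) for the best-matching slot,
    or (None, 0) when no slot shares a prefix with the prompt."""
    best_idx, best_len = None, 0
    for i, cached in enumerate(slots):
        n = count_common_prefix(cached, prompt)
        if n > best_len:
            best_idx, best_len = i, n
    return best_idx, best_len

slots = [[1, 2, 3, 4], [1, 2, 9]]   # token ids already held in two cache slots
prompt = [1, 2, 3, 4, 5, 6]         # previous turn plus two new tokens
idx, reused = find_longest_cache_slot(slots, prompt)
print(idx, reused, len(prompt) - reused)  # 0 4 2
```

With a working cache only the last two tokens would need evaluation; the symptom described in this issue corresponds to reused being 0 on every request.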

Related issues

#5303 — random full re-evaluation (different trigger, same symptom)
#12504 — new engine prompt eval much slower (closed as duplicate of #12037)

extent analysis

Fix Plan

To fix the KV cache not working on the CPU backend for /api/chat requests, the cache logic must keep its entries between requests and look them up again on the next request.

Steps:

Modify cache.go so that findLongestCacheSlot and countCommonPrefix are actually triggered for /api/chat requests.
Check the cache implementation in cache.go to confirm that the cache is not cleared between requests.
Update the cache key generation logic to include the conversation context, so that later requests in the same conversation can hit the cache.

Example code changes (an illustrative Python sketch only; Ollama's runner is written in Go, and the names below are hypothetical):

# Hypothetical in-memory store standing in for the runner's cache
cache = {}

# Build a cache key that includes the conversation context
def generate_cache_key(conversation_context, request_data):
    return f"{conversation_context}:{request_data}"

# Retrieve the cache entry for a key; returns None on a miss
def get_cache_entry(cache_key):
    return cache.get(cache_key)

# Store the cache entry under the generated key
def store_cache_entry(cache_key, cache_entry):
    cache[cache_key] = cache_entry

Verification

To verify that the fix worked, send the same single-turn request multiple times and check if the pe_count and pe_ms values reflect a cache hit. The pe_count should only reflect the newly added tokens, and the pe_ms should be significantly lower for subsequent requests.

Example verification code:

import json, urllib.request

def chat(msgs):
    body = json.dumps({"model": "qwen3.5:9b", "messages": msgs, "stream": False, "options": {"num_predict": 5}}).encode()
    req = urllib.request.Request("http://localhost:11434/api/chat", data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

msgs = [{"role": "user", "content": "What is 2+2?"}]
for i in range(3):
    r = chat(msgs)
    print(f"Request {i+1}: pe_count={r['prompt_eval_count']}, pe_ms={r['prompt_eval_duration']/1e6:.1f}ms")

The output should show a significant decrease in pe_ms for subsequent requests, indicating a cache hit.
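
The single-turn check can be extended to a growing conversation, which is the case the multi-turn table in the issue measures. The sketch below is an assumption about how such a loop could look: multi_turn and its injectable send parameter are hypothetical names, while the request and response fields are the ones already used in the reproduction script.

```python
import json
import urllib.request

def chat(msgs, model="qwen3.5:9b"):
    """Send one non-streaming /api/chat request, as in the reproduction script."""
    body = json.dumps({"model": model, "messages": msgs, "stream": False,
                       "options": {"num_predict": 5}}).encode()
    req = urllib.request.Request("http://localhost:11434/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())

def multi_turn(questions, send=chat):
    """Grow one conversation turn by turn and collect (pe_count, pe_ms) per turn."""
    msgs, rows = [], []
    for q in questions:
        msgs.append({"role": "user", "content": q})
        r = send(msgs)
        msgs.append(r["message"])  # keep the assistant reply in the history
        rows.append((r["prompt_eval_count"], r["prompt_eval_duration"] / 1e6))
    return rows
```

If the cache is reused, pe_ms should stay roughly flat from turn to turn instead of climbing with the length of the history.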

Extra Tips

Ensure that the cache implementation is thread-safe to avoid issues with concurrent requests.
Consider adding cache expiration logic to prevent stale cache entries from affecting performance.
Monitor cache hit rates and adjust the cache key generation logic as needed to optimize performance.
