ollama: Inverted VRAM / KV Cache Estimation Logic for Qwen 3.6 and Gemma 4 Architectures on Multi-GPU Platform




What is the issue?

1. Executive Summary

An anomaly in VRAM allocation and static footprint estimation has been observed in Ollama versions v0.24.0 and v0.23.2. When configuring large context windows (num_ctx of 256k) across a multi-GPU AMD ROCm setup, Ollama exhibits an inverted memory allocation pattern: it reserves significantly less static VRAM for larger, denser models than it does for smaller ones.

**The Paradox:**

  • **Devstral 24B FP16** (~48 GB base weights) scales and reserves 94 GB VRAM
  • **Qwen 3.6 27B BF16** (~54 GB base weights) allocates 86 GB VRAM
  • **Gemma 4 31B BF16** (~62 GB base weights) allocates 92 GB VRAM
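The comparison above can be made explicit by subtracting the approximate weight size from the reserved VRAM. This is only a minimal sketch using the rounded figures quoted in the list; the function name `headroom_gb` is illustrative, and the "headroom" lumps together KV cache, compute buffers, and any other overhead without separating them.

```python
# Figures copied from the list above: (approx. weight size in GB, VRAM reserved in GB).
models = {
    "Devstral 24B FP16": (48, 94),
    "Qwen 3.6 27B BF16": (54, 86),
    "Gemma 4 31B BF16": (62, 92),
}


def headroom_gb(weights_gb: int, reserved_gb: int) -> int:
    """VRAM reserved beyond the weights (KV cache, compute buffers, other overhead)."""
    return reserved_gb - weights_gb


for name, (weights, reserved) in models.items():
    print(f"{name}: {headroom_gb(weights, reserved)} GB reserved beyond weights")
```

With these numbers the smallest model gets the largest headroom (46 GB), while the two larger models get 32 GB and 30 GB respectively.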

This (theoretical) underestimation leads to a deterministic, repeatable failure during recursive RAG operations (specifically LlamaIndex 'tree_summarize' mode):

  • Qwen 3.6 27B reliably crashes with an API Error 500 (runner unexpectedly stopped) once the payload exceeds 75k prompt tokens.
  • Gemma 4 31B has an even lower stability threshold, failing with immediate out-of-memory states or early execution aborts.
  • Conversely, Devstral 24B, having been granted a 94 GB reservation, processes a 150k+ prompt-token payload stably on every run.

The payload is defined as 'top_k * chunk_size' (e.g. 256 fragments [cosine similarity > 0.815] × 384 tokens per chunk ≈ 100k tokens).
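As a quick check of that arithmetic, the sketch below computes the payload and compares it with the configured context window. The helper names are illustrative, and the count covers only the retrieved chunks; any prompt template or system text would add to it.

```python
def prompt_payload_tokens(top_k: int, chunk_size: int) -> int:
    """Retrieved-context payload in tokens: number of chunks times tokens per chunk."""
    return top_k * chunk_size


def fits_in_context(payload: int, num_predict: int, num_ctx: int) -> bool:
    """True if the prompt plus the requested generation length fits in num_ctx."""
    return payload + num_predict <= num_ctx


payload = prompt_payload_tokens(top_k=256, chunk_size=384)
print(payload)  # 98304
print(fits_in_context(payload, num_predict=32_000, num_ctx=262_144))  # True
```

On paper, then, the roughly 100k-token payload plus a 32,000-token generation budget sits well inside the configured 262,144-token window, so the crash is not a simple context overflow.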

2. Hardware Platform & Software Environment

  • Ollama Versions: v0.24.0 (Stable) and v0.23.2

  • Host OS: Fedora Linux 43 (fully updated)

  • Compute Topology:

    • **Motherboard:** ASRock SWRX80
    • **CPU:** AMD Threadripper Pro 3955WX
  • GPU Subsystem (Heterogeneous ROCm Pool - 96GB Total VRAM):

    • 2x AMD Radeon Pro VII (16GB HBM2 ECC)
    • 2x AMD Instinct MI60 (32GB HBM2 ECC)
  • Graphics/Compute Stack: ROCm Hub / Vulkan compute runtimes with hardware-level RAS monitoring.

3. Empirical Test Matrix & Verification Data

All tests were executed under the same stress conditions using a structural RAG pipeline.

| Model | Precision | Configured num_ctx | Configured num_predict | Total VRAM Reserved | Prompt Payload | Result |
|---|---|---|---|---|---|---|
| Devstral-small-2:24b-instruct | FP16 | 230,000 | 32,768 | 88 GB | 150,000+ tokens | **SUCCESS** (Stable) |
| Qwen3.6:27b-it | BF16 | 262,144 | 32,000 | 86 GB | 75,000 tokens | **SUCCESS** (Stable) |
| Qwen3.6:27b-it | BF16 | 262,144 | 32,000 | 86 GB | 100,000 tokens | **CRASH** (OOM / Runner Stopped) |
| Gemma4:31b-it | BF16 | 262,144 | 32,000 | 92 GB | 75,000 tokens | **CRASH** (OOM / Runner Stopped) |
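For anyone trying to reproduce a row of this matrix without the full RAG pipeline, the sketch below builds a request for Ollama's `/api/generate` endpoint with `num_ctx` and `num_predict` passed through `options`. It is an assumption-laden stand-in, not the original test harness: the host is Ollama's default local address, the function names are illustrative, and the prompt is a placeholder where a real test would substitute about 100k tokens of retrieved text.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str, num_ctx: int, num_predict: int,
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST request for /api/generate."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request; an HTTP 500 surfaces as urllib.error.HTTPError."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Settings taken from the Qwen rows of the table; the prompt is a placeholder.
req = build_generate_request("qwen3.6:27b-bf16", "<long prompt here>",
                             num_ctx=262_144, num_predict=32_000)
print(req.full_url)
```

Calling `send(req)` against a running server would issue the request; it is left uncalled here so the snippet runs without a server.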

4. Step-by-Step Execution Logs and System Dumps

A. The Baseline: Correct Allocation & Scaling (Devstral 24B FP16)

When Devstral is loaded, Ollama correctly calculates the KV Cache scaling for large context windows, utilizing the hardware boundaries up to ~91% per device.

$ ollama ps

NAME                                        ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
jeffh/intfloat-multilingual-e5-large:f32    d398628108a4    2.3 GB    100% GPU     512        Forever    
devstral-small-2:24b-instruct-2512-fp16     15c77ff5438a    88 GB     100% GPU     230000     Forever    

$ rocm-smi

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                       
==========================================================================================================================
0       3     0x66a1,   52718  39.0°C  216.0W    N/A, N/A, 0         1654Mhz  1000Mhz  67.06%  high  190.0W  80%    100%  
1       4     0x66a1,   4670   38.0°C  30.0W     N/A, N/A, 0         1800Mhz  1000Mhz  41.57%  high  225.0W  89%    0%    
2       2     0x66a1,   39396  41.0°C  29.0W     N/A, N/A, 0         1800Mhz  1000Mhz  35.29%  high  225.0W  91%    0%    
3       1     0x66a1,   15839  38.0°C  30.0W     N/A, N/A, 0         1700Mhz  1000Mhz  70.98%  high  190.0W  86%    0%    
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================
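To compare dumps like the one above across models, the per-device VRAM% column can be pulled out programmatically. The sketch below assumes the concise `rocm-smi` layout shown in this report, where each device row starts with the device index and ends with the VRAM% and GPU% fields; other `rocm-smi` versions may lay out columns differently. The function name is illustrative.

```python
def parse_vram_percent(smi_output: str) -> dict[int, float]:
    """Map device index -> VRAM% from concise rocm-smi output (layout as shown above)."""
    usage = {}
    for line in smi_output.splitlines():
        fields = line.split()
        # Device rows begin with an integer index; the last two fields are VRAM% and GPU%.
        if len(fields) >= 3 and fields[0].isdigit() and fields[-2].endswith("%"):
            usage[int(fields[0])] = float(fields[-2].rstrip("%"))
    return usage


# Two rows copied from the Devstral dump above, plus a separator line.
sample = """\
==========================================================================
0       3     0x66a1,   52718  39.0°C  216.0W    N/A, N/A, 0         1654Mhz  1000Mhz  67.06%  high  190.0W  80%    100%
1       4     0x66a1,   4670   38.0°C  30.0W     N/A, N/A, 0         1800Mhz  1000Mhz  41.57%  high  225.0W  89%    0%
"""
print(parse_vram_percent(sample))  # {0: 80.0, 1: 89.0}
```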

B. The Baseline: Allocation & Scaling (Qwen3.6 27B BF16)

When Qwen3.6 is loaded, Ollama calculates the KV Cache scaling for large context windows, utilizing the hardware boundaries up to ~84% per device.

$ ollama ps
NAME                                        ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
jeffh/intfloat-multilingual-e5-large:f32    d398628108a4    2.3 GB    100% GPU     512        Forever    
qwen3.6:27b-bf16                            c3a702fca756    85 GB     100% GPU     262144     Forever    

$ rocm-smi

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                       
==========================================================================================================================
0       3     0x66a1,   52718  37.0°C  28.0W     N/A, N/A, 0         1700Mhz  1000Mhz  59.22%  high  190.0W  84%    1%    
1       4     0x66a1,   4670   40.0°C  32.0W     N/A, N/A, 0         1800Mhz  1000Mhz  80.0%   high  225.0W  84%    0%    
2       2     0x66a1,   39396  44.0°C  253.0W    N/A, N/A, 0         1711Mhz  1000Mhz  37.25%  high  225.0W  84%    100%  
3       1     0x66a1,   15839  36.0°C  29.0W     N/A, N/A, 0         1700Mhz  1000Mhz  29.41%  high  190.0W  81%    0%    
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

API Ollama error: 500 - {"error":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"}

C. The Baseline: Allocation & Scaling (Gemma4 31B BF16)

When Gemma4 is loaded, Ollama calculates the KV Cache scaling for large context windows, utilizing the hardware boundaries up to ~93% per device.

$ ollama ps

NAME                                        ID              SIZE      PROCESSOR    CONTEXT    UNTIL   
gemma4:31b-it-bf16                          236d76ae0874    92 GB     100% GPU     262144     Forever    
jeffh/intfloat-multilingual-e5-large:f32    d398628108a4    2.3 GB    100% GPU     512        Forever    

$ rocm-smi

============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power     Partitions          SCLK     MCLK     Fan     Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Socket)  (Mem, Compute, ID)                                                       
==========================================================================================================================
0       3     0x66a1,   52718  37.0°C  29.0W     N/A, N/A, 0         1700Mhz  1000Mhz  61.57%  high  190.0W  87%    0%    
1       4     0x66a1,   4670   35.0°C  29.0W     N/A, N/A, 0         1800Mhz  1000Mhz  72.16%  high  225.0W  93%    0%    
2       2     0x66a1,   39396  41.0°C  253.0W    N/A, N/A, 0         1711Mhz  1000Mhz  33.33%  high  225.0W  91%    100%  
3       1     0x66a1,   15839  36.0°C  30.0W     N/A, N/A, 0         1700Mhz  1000Mhz  63.14%  high  190.0W  88%    0%    
==========================================================================================================================
================================================== End of ROCm SMI Log ===================================================

API Ollama error: 500 - {"error":"model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"}

5. Root Cause Hypothesis & Conclusion

The behavior MAY indicate an underlying bug in Ollama's memory-size estimation for the Qwen and Gemma architectures. While the backend correctly maps the layers across the ROCm runtime, it appears not to statically reserve the full KV cache (which grows linearly with context length, O(N)) at high token counts for these specific families. This would lead to a silent under-allocation at initialization. When the engine is later hit with sustained, heavy recursive context synthesis (tree_summarize), dynamic activation tensors collide with unreserved memory, resulting in a fatal llama.cpp backend abort. The hardware stack is verified clean and fully capable of sustaining a heavy compute envelope (as proven by Devstral), so I request a review of the memory allocation curves for Qwen 3.6 and Gemma 4.
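As a rough yardstick for what a full reservation would look like, the KV cache of a standard full-attention transformer can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below uses that textbook formula; the layer, head, and dimension values are hypothetical placeholders, not the real configurations of Qwen 3.6 or Gemma 4, and it is not Ollama's own estimator. Architectures that use sliding-window or other non-standard attention layers would need less than this.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Full-attention KV cache size: keys + values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical model shape (NOT taken from any real model card), 16-bit cache.
shape = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for n_ctx in (65_536, 131_072, 262_144):
    gib = kv_cache_bytes(n_ctx=n_ctx, **shape) / 1024**3
    print(f"num_ctx={n_ctx}: {gib:.0f} GiB")
```

For this placeholder shape the estimate doubles with each doubling of `num_ctx`, reaching 40 GiB at 262,144 tokens, which shows the scale of reservation at stake even with linear growth.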

Regards, /WS

Relevant log output

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.24.0
