ollama: Resizable BAR causes automatic context size to overflow VRAM, pushing model layers to system RAM

GitHub stats: ollama/ollama#16020 (fetched 2026-05-07 03:31:35)
Comments: 0 · Participants: 1 · Timeline events: 1 (closed ×1) · Reactions: 0

Summary

When Resizable BAR (ReBAR) is enabled on multi-GPU systems with large total VRAM, Ollama's automatic num_ctx calculation sets an excessively large default context window. The resulting KV cache consumes so much VRAM that model weight layers spill into system shared memory (via CUDA sysmem fallback), dropping generation speed from ~18 tok/s to ~0.9 tok/s.

Environment

- OS: Windows 11 Pro
- GPUs: 2× NVIDIA RTX PRO 4500 Blackwell (32 GB GDDR7 each, 64 GB total)
- Ollama version: 0.23.1 (also reproduced on 0.20.2)
- Model: llama3.3:70b (Q4_K_M, ~42 GB)
- PCIe: both GPUs confirmed Gen 5 x16 under load
- ReBAR: enabled, BAR1 = 32,768 MiB per card

What happens

With ReBAR enabled, Ollama sees 63.7 GiB of total VRAM across the two GPUs. Debug logs show:

    vram-based default context total_vram="63.7 GiB" default_num_ctx=262144

Ollama auto-sets num_ctx=262144 (262K tokens). For a 70B model with a Q4_0 KV cache, the KV cache at 262K context is enormous: far larger than the ~22 GB of VRAM remaining after the 42 GB of model weights are loaded. This causes model layers to spill into CUDA shared memory (system RAM mapped as GPU-accessible memory), which is 10–50× slower than dedicated VRAM.
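To get a feel for how the KV cache scales with num_ctx, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are assumptions taken from the published Llama 3 70B configuration, and the per-element sizes are the ggml block sizes for each cache type; none of this is taken from Ollama's own sizing code, and the function names are illustrative only.

```python
# Rough KV-cache size estimate. The model dimensions below are assumptions
# (published Llama 3 70B configuration), not values read from Ollama.
N_LAYERS = 80     # transformer blocks (assumed)
N_KV_HEADS = 8    # grouped-query attention KV heads (assumed)
HEAD_DIM = 128    # 8192 hidden size / 64 attention heads (assumed)

# Bytes per cached element: f16 is 2 bytes; q8_0 packs 32 values into
# 34 bytes; q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

GIB = 1024 ** 3


def kv_bytes_per_token(cache_type: str) -> float:
    """Bytes of K and V stored per token, summed over all layers."""
    elements = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # factor 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type]


def kv_cache_gib(num_ctx: int, cache_type: str) -> float:
    """Approximate KV-cache size in GiB for a given context length."""
    return num_ctx * kv_bytes_per_token(cache_type) / GIB


if __name__ == "__main__":
    for cache_type in BYTES_PER_ELEMENT:
        for num_ctx in (4096, 32768, 262144):
            size = kv_cache_gib(num_ctx, cache_type)
            print(f"{cache_type:>5} num_ctx={num_ctx:>6}: {size:6.2f} GiB")
```

The estimate covers only the cached keys and values. Compute buffers and any per-GPU split overhead come on top, so the real footprint is larger than what this prints.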

Observed behavior

| Condition | VRAM used | CPU RAM | Speed |
| --- | --- | --- | --- |
| ReBAR ON, default context (262K auto) | 27.7 GB GPU + rest in CPU | ~45 GB | 0.9 tok/s 🔴 |
| ReBAR ON, explicit num_ctx=4096 | 57.3 GB GPU, 0 CPU | ~1.2 GB | 18.0 tok/s ✅ |
| ReBAR OFF (before enabling) | ~52 GB GPU | minimal | 18.2 tok/s ✅ |

CUDA itself correctly reports only dedicated VRAM (31.86 GB per card). The issue is in Ollama's context size calculation, not in CUDA's memory reporting.
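One way to check whether a loaded model has spilled off the GPU is to compare its total size with the part reported as VRAM-resident. The sketch below assumes the running-models endpoint (`GET /api/ps` on the default local port) returns a `models` list whose entries carry `name`, `size`, and `size_vram`; verify those field names against your Ollama version. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import List

# Default local endpoint (assumption; adjust if the server listens elsewhere).
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def gpu_fraction(model_entry: dict) -> float:
    """Fraction of a loaded model's bytes reported as resident in VRAM."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def spilled_models(ps_response: dict, threshold: float = 0.999) -> List[str]:
    """Names of loaded models whose GPU-resident fraction is below threshold."""
    return [
        entry.get("name", "?")
        for entry in ps_response.get("models", [])
        if gpu_fraction(entry) < threshold
    ]


def fetch_ps(url: str = OLLAMA_PS_URL) -> dict:
    """Query a running Ollama server. Only call this when one is reachable."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

With a server running, `spilled_models(fetch_ps())` would list any model that is only partly on the GPU, which is the situation in the first table row.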

Root cause

Ollama's automatic num_ctx calculation appears to fill all available VRAM with KV cache without reserving enough space for model weights. The calculation should be:

    max_safe_context = (total_vram - model_weight_size - safety_margin) / kv_cache_bytes_per_token

Instead, it seems to calculate context based on total VRAM alone, which works when VRAM is small (8-24 GB) but fails catastrophically on large multi-GPU setups (64+ GB), where the auto-calculated context creates a KV cache larger than the VRAM remaining after model loading.
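The proposed formula can be sketched as a small function. This is an illustration of the calculation the report asks for, not Ollama's actual code; the optional `model_max_ctx` cap and the clamping to zero are additions of this sketch, and the numbers in the usage lines are hypothetical.

```python
from typing import Optional

GIB = 1024 ** 3


def max_safe_context(total_vram: int, model_weight_size: int,
                     safety_margin: int, kv_cache_bytes_per_token: int,
                     model_max_ctx: Optional[int] = None) -> int:
    """Largest context whose KV cache fits in the VRAM left after weights.

    All sizes are in bytes. Returns 0 when the weights plus margin already
    exceed the available VRAM.
    """
    if kv_cache_bytes_per_token <= 0:
        raise ValueError("kv_cache_bytes_per_token must be positive")
    free = total_vram - model_weight_size - safety_margin
    if free <= 0:
        return 0
    ctx = int(free // kv_cache_bytes_per_token)
    # Optional cap (an assumption of this sketch): never exceed what the
    # model itself supports.
    if model_max_ctx is not None:
        ctx = min(ctx, model_max_ctx)
    return ctx


if __name__ == "__main__":
    # Hypothetical inputs: 64 GiB VRAM, 42 GiB weights, 2 GiB margin,
    # 327,680 bytes of KV cache per token (an assumed f16 figure).
    print(max_safe_context(64 * GIB, 42 * GIB, 2 * GIB, 327_680))
```

Subtracting the weights first is the key design point: the context budget then shrinks as the model grows, instead of growing with total VRAM regardless of what else has to fit.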

Workaround

Setting OLLAMA_CONTEXT_LENGTH=32768 as a system-wide environment variable caps the default context and completely resolves the issue. The model then loads fully to GPU with proper dual-GPU layer splitting.
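The environment variable changes the server-wide default. The context can also be pinned per request, which is one way to get the explicit num_ctx=4096 behavior from the table. The sketch below builds such a request for the `/api/generate` endpoint using only the standard library; the default local URL is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

# Default local endpoint (assumption; adjust to your server address).
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int,
                           url: str = OLLAMA_GENERATE_URL) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # overrides the automatic default
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int) -> str:
    """Send the request to a running server and return the generated text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request, timeout=600) as resp:
        return json.load(resp)["response"]
```

For example, `generate("llama3.3:70b", "Hello", 4096)` would load the model with a 4096-token context for that request, assuming a server is running locally.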

Expected behavior

Ollama should calculate the default context size after accounting for model weight size, ensuring that model layers are never pushed to system RAM due to an oversized KV cache. On this system, a safe default would be ~32K–49K tokens for 70B, not 262K.

Related

This may be related to #6008 (shared memory / ReBAR issues). However, the root cause here is specifically the context auto-sizing, not CUDA's shared memory behavior: CUDA correctly reports only dedicated VRAM. Shared memory only becomes a problem because the inflated KV cache forces layer spillover.

Additional notes

- ReBAR otherwise provides a significant benefit: prompt processing speed improved from 72.7 to 270.3 tok/s (3.7×) for 70B on this setup.
- Smaller models (32B, 8B) that fit on a single GPU are unaffected since they don't trigger the extreme auto-context calculation.
- The OLLAMA_CONTEXT_LENGTH workaround is effective, but users shouldn't need to know about it; the auto-calculation should be safe by default.


