ollama - 💡(How to fix) Fix Resizable BAR causes automatic context size to overflow VRAM, pushing model layers to system RAM [1 participants]

When Resizable BAR (ReBAR) is enabled on multi-GPU systems with large total VRAM, Ollama's automatic num_ctx calculation sets an excessively large default context window. The resulting KV cache consumes so much VRAM that model weight layers spill into system shared memory (via CUDA sysmem fallback), dropping generation speed from ~18 tok/s to ~0.9 tok/s.

Environment

- OS: Windows 11 Pro
- GPUs: 2× NVIDIA RTX PRO 4500 Blackwell (32 GB GDDR7 each, 64 GB total)
- Ollama version: 0.23.1 (also reproduced on 0.20.2)
- Model: llama3.3:70b (Q4_K_M, ~42 GB)
- PCIe: Both GPUs confirmed Gen 5 x16 under load
- ReBAR: Enabled (BAR1 = 32,768 MiB per card)

What happens

With ReBAR enabled, Ollama sees 63.7 GiB of total VRAM across the two GPUs. Debug logs show:

    vram-based default context total_vram="63.7 GiB" default_num_ctx=262144

Ollama then sets num_ctx=262144 (262K tokens) automatically. For a 70B model with a Q4_0 KV cache, the cache at 262K context exceeds the ~22 GB of VRAM remaining after the ~42 GB of model weights are loaded. Model layers therefore spill into CUDA shared memory (system RAM mapped as GPU-accessible memory), which is 10–50× slower than dedicated VRAM.
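To make the size of the KV cache concrete, here is a rough estimate. It is a sketch under stated assumptions, not Ollama's own accounting: the attention shape (80 layers, 8 KV heads, head dimension 128) is the commonly published Llama 3.x 70B configuration and is not given in this report, and the per-element byte costs follow the GGML block layouts for `f16`, `q8_0`, and `q4_0`. The function names are hypothetical.

```python
# Rough KV cache size estimate for a Llama-3.x-70B-shaped model.
# ASSUMPTION: attention shape below is the published Llama 3.x 70B config;
# it is not stated in the report and is not read from Ollama.
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128

# Bytes per cached element. q8_0 stores 32 values in 34 bytes and
# q4_0 stores 32 values in 18 bytes (GGML block layouts).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_bytes_per_token(cache_type: str = "f16") -> float:
    """Bytes of KV cache needed for one token of context."""
    # Factor 2: one key vector and one value vector per layer.
    elements = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
    return elements * BYTES_PER_ELEMENT[cache_type]


def kv_cache_gib(num_ctx: int, cache_type: str = "f16") -> float:
    """Total KV cache size in GiB for a context of num_ctx tokens."""
    return num_ctx * kv_bytes_per_token(cache_type) / 2**30


if __name__ == "__main__":
    for cache_type in ("f16", "q8_0", "q4_0"):
        for num_ctx in (4096, 32768, 262144):
            size = kv_cache_gib(num_ctx, cache_type)
            print(f"{cache_type:>5} num_ctx={num_ctx:>6} -> {size:6.2f} GiB")
```

Under these assumptions, a 262,144-token context needs about 22.5 GiB (roughly 24 GB) at `q4_0` and 80 GiB at `f16`, before any compute buffers are counted. Both figures are at or above the ~22 GB that remains once the weights are loaded.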

Observed behavior

| Condition | VRAM Used | CPU RAM | Speed |
| --- | --- | --- | --- |
| ReBAR ON, default context (262K auto) | 27.7 GB GPU + rest in CPU | ~45 GB | 0.9 tok/s 🔴 |
| ReBAR ON, explicit num_ctx=4096 | 57.3 GB GPU, 0 CPU | ~1.2 GB | 18.0 tok/s ✅ |
| ReBAR OFF (before enabling) | ~52 GB GPU | minimal | 18.2 tok/s ✅ |

CUDA itself correctly reports only dedicated VRAM (31.86 GB per card). The issue is in Ollama's context size calculation, not in CUDA's memory reporting.

Root cause

Ollama's automatic num_ctx calculation appears to fill all available VRAM with KV cache without reserving enough space for model weights. The calculation should be:

    max_safe_context = (total_vram - model_weight_size - safety_margin) / kv_cache_bytes_per_token

Instead, it seems to calculate context based on total VRAM alone. That works when VRAM is small (8–24 GB) but fails catastrophically on large multi-GPU setups (64+ GB), where the auto-calculated context creates a KV cache larger than the VRAM remaining after model loading.
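The proposed formula can be sketched as a small function. This illustrates the calculation the report asks for and is not Ollama's actual code. The function name, the 2 GiB safety margin, and the per-token KV cost (an `f16` cache for a Llama-3.x-70B-shaped model) are all assumptions chosen for the example.

```python
def max_safe_context(
    total_vram_bytes: int,
    model_weight_bytes: int,
    safety_margin_bytes: int,
    kv_bytes_per_token: float,
) -> int:
    """Largest context (in tokens) whose KV cache fits beside the weights.

    Hypothetical helper implementing:
        (total_vram - model_weight_size - safety_margin) / kv_cache_bytes_per_token
    """
    if kv_bytes_per_token <= 0:
        raise ValueError("kv_bytes_per_token must be positive")
    free_bytes = total_vram_bytes - model_weight_bytes - safety_margin_bytes
    if free_bytes <= 0:
        # The weights alone do not fit; no context size is safe.
        return 0
    # Round down so the cache never exceeds the free space.
    return int(free_bytes // kv_bytes_per_token)


if __name__ == "__main__":
    GIB = 2**30
    ctx = max_safe_context(
        total_vram_bytes=int(63.7 * GIB),   # total VRAM from the debug log
        model_weight_bytes=42 * 10**9,      # ~42 GB of Q4_K_M weights
        safety_margin_bytes=2 * GIB,        # ASSUMPTION: illustrative margin
        kv_bytes_per_token=327_680,         # ASSUMPTION: f16 KV, 70B-class shape
    )
    print(f"max safe context: {ctx} tokens")
```

The result is sensitive to the KV cache type and to how much headroom is reserved for compute buffers, so the inputs above should be replaced with measured values before relying on the number.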

Workaround

Setting OLLAMA_CONTEXT_LENGTH=32768 as a system-wide environment variable caps the default context and completely resolves the issue. The model then loads fully to GPU with proper dual-GPU layer splitting.
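The context can also be capped per request, which is how the explicit num_ctx=4096 row in the table above can be reproduced. The sketch below uses Ollama's `/api/generate` endpoint with `options.num_ctx` and assumes the server is listening on the default local address. The helper names are hypothetical, and the interaction between a per-request `num_ctx` and `OLLAMA_CONTEXT_LENGTH` is assumed here (the request value taking precedence) rather than verified against this Ollama version.

```python
import json
import urllib.request

# ASSUMPTION: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Explicit context window instead of the VRAM-based default.
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 4096) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("llama3.3:70b", "Hello", num_ctx=4096)` would request a 4,096-token context for that call only, leaving the server-wide default untouched.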

Expected behavior

Ollama should calculate the default context size after accounting for model weight size, ensuring that model layers are never pushed to system RAM due to an oversized KV cache. On this system, a safe default would be ~32K–49K tokens for 70B, not 262K.

Related

This may be related to #6008 (shared memory / ReBAR issues). However, the root cause here is specifically the context auto-sizing, not CUDA's shared memory behavior: CUDA correctly reports only dedicated VRAM. The shared memory only becomes a problem because the inflated KV cache forces layer spillover.

Additional notes

- ReBAR otherwise provides a significant benefit: prompt processing speed improved from 72.7 → 270.3 tok/s (3.7×) for 70B on this setup.
- Smaller models (32B, 8B) that fit on a single GPU are unaffected, since they don't trigger the extreme auto-context calculation.
- The OLLAMA_CONTEXT_LENGTH workaround is effective, but users shouldn't need to know about it; the auto-calculation should be safe by default.
