ollama - ✅ (Solved) ollama runner's prefill is slower than the llama runner path on `qwen3-coder-next` [1 pull request, 1 comment, 2 participants]

GitHub stats
ollama/ollama#16221 (fetched 2026-05-20 03:39:30)
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

Root Cause

Investigated via GGML_SCHED_DEBUG=2 dump. The op dispatch is correct — all weight matmuls are already scheduled to CUDA via the op_offload path. The slowdown is in the host-to-device transfer path for CPU-resident weights.

ollama runner builds the CPU buffer type list here:

https://github.com/ollama/ollama/blob/56b319f457d6c47a5a69c893110dcfb8290f93bd/ml/backend/ggml/ggml.go#L167-L178

Only ggml_backend_dev_buffer_type() (plain malloc-backed pageable memory) is added to the list; the GPU's host buffer type is not considered.

llama runner (make_cpu_buft_list) does the same accel pass first, then explicitly prepends the first GPU's host buffer type before the plain CPU buffer:

https://github.com/ollama/ollama/blob/56b319f457d6c47a5a69c893110dcfb8290f93bd/llama/llama.cpp/src/llama-model.cpp#L336-L350

ggml_backend_dev_host_buffer_type() returns the CUDA backend's cudaMallocHost-backed pinned buffer type. As noted in the llama.cpp comment above, storing tensors in a host buffer reduces data transfer time when large batches are offloaded to a GPU. With ~28 GiB of CPU-resident weights per prefill batch, this path difference may directly explain the observed 470 ms prefill gap.

PR fix notes

PR #16222: ml/ggml: add OLLAMA_PINNED_HOST_BUFFER opt-in for faster prefill on partial-offload models

Description (problem / solution / changelog)

Summary

Adds OLLAMA_PINNED_HOST_BUFFER=1 opt-in to route CPU-resident model weights through CUDA pinned host memory (cudaMallocHost) instead of plain pageable memory, closing the prefill gap described in issue #16221 .
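As a rough illustration of how such an opt-in flag could be read, here is a minimal sketch. The PR's actual envconfig.PinnedHostBuffer() is not shown on this page, so the helper names parsePinnedHostBuffer and pinnedHostBufferEnabled, and the choice to treat unset or unparsable values as disabled, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePinnedHostBuffer is a hypothetical helper (not the PR's code).
// It interprets the raw value of OLLAMA_PINNED_HOST_BUFFER and treats
// an empty or unparsable value as "disabled", so the default stays off.
func parsePinnedHostBuffer(raw string) bool {
	v := strings.TrimSpace(raw)
	if v == "" {
		return false
	}
	enabled, err := strconv.ParseBool(v)
	if err != nil {
		return false
	}
	return enabled
}

// pinnedHostBufferEnabled is a hypothetical stand-in for
// envconfig.PinnedHostBuffer(); it reads the process environment.
func pinnedHostBufferEnabled() bool {
	return parsePinnedHostBuffer(os.Getenv("OLLAMA_PINNED_HOST_BUFFER"))
}

func main() {
	fmt.Println("pinned host buffer enabled:", pinnedHostBufferEnabled())
}
```

With this sketch, the feature stays off unless the variable is set to a truthy value such as 1 in the environment of the server process.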

Motivation

On partial-offload setups (model size > VRAM), ollama runner's CPU buffer type list is built from ggml_backend_dev_buffer_type() only — plain malloc-backed pageable memory. GPU devices are skipped entirely in that loop, so the pinned host buffer type exposed via ggml_backend_dev_host_buffer_type() is never used for weight allocation.

llama runner's make_cpu_buft_list explicitly prepends the first GPU's host buffer type to the list for exactly this reason (see comment in llama-model.cpp:336-350). This means CPU-resident weights in llama runner land in cudaMallocHost-backed pinned memory, while ollama runner's go through pageable memory with a driver staging copy on every H2D transfer. With ~28 GiB of CPU-resident weights per prefill batch, this is a meaningful per-batch cost.

Implementation

Backend.New() builds the CPU buft list in three explicit passes, matching llama.cpp's make_cpu_buft_list order (ACCEL → GPU host → CPU):

// ACCEL pass (unchanged)
for _, d := range accels {
    bt := C.ggml_backend_dev_buffer_type(d)
    cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, bt)
    btDeviceMemory[bt] = &requiredMemory.CPU
}

// GPU host pass (opt-in)
if envconfig.PinnedHostBuffer() {
    for _, d := range gpus {
        hostBuft := C.ggml_backend_dev_host_buffer_type(d)
        if hostBuft == nil {
            continue
        }
        cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, hostBuft)
        btDeviceMemory[hostBuft] = &requiredMemory.CPU
        break
    }
}

// CPU pass (unchanged)
for _, d := range cpus {
    bt := C.ggml_backend_dev_buffer_type(d)
    cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, bt)
    btDeviceMemory[bt] = &requiredMemory.CPU
}

createTensor already walks bts in order, so CPU-resident weights automatically land in pinned memory when the opt-in is set. No changes to op scheduling, decode path, or KV cache placement.

Results

Test platform:

| Item | Configuration |
| --- | --- |
| CPU | Intel Core Ultra 9 265K (Arrow Lake) |
| RAM | 128 GB DDR5 6400 MT/s |
| GPU | NVIDIA RTX 3090 (24 GB VRAM) |
| OS | Windows 11 |
| Model | qwen3-coder-next 80B Q4_K_M (~52 GB GGUF, 49 layers) |
| Workload | batch_size=1024, prompt=1024, max_tokens=16 |

| Config | prefill_ms | prefill_tps |
| --- | --- | --- |
| ollama runner (default) | 2061 ms | 475 t/s |
| ollama runner + OLLAMA_PINNED_HOST_BUFFER=1 | 1469 ms | 666 t/s |
  • -29% prefill latency / +40% prefill tps vs default ollama runner
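
The rounded percentages can be re-derived from the reported measurements. This small sketch only restates the arithmetic on the four numbers in the results above; the helper name percentChange is chosen for this example.

```go
package main

import "fmt"

// percentChange returns the relative change from base to value, in percent.
func percentChange(base, value float64) float64 {
	return (value - base) / base * 100
}

func main() {
	// Prefill latency: 2061 ms (default) vs 1469 ms (opt-in enabled).
	fmt.Printf("prefill latency: %+.1f%%\n", percentChange(2061, 1469)) // -28.7%
	// Prefill throughput: 475 t/s (default) vs 666 t/s (opt-in enabled).
	fmt.Printf("prefill tps:     %+.1f%%\n", percentChange(475, 666)) // +40.2%
}
```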

Note

Allocation failure fallback: if cudaMallocHost fails, createTensor advances to the next buft in the list — plain CPU buffer. Worst case is identical to the disabled path.

No-op cases:

  • OLLAMA_PINNED_HOST_BUFFER unset (default)
  • No CUDA GPU present (gpus slice is empty)
  • Models that fully fit in VRAM (no CPU-resident weights to transfer)

Changed files

  • envconfig/config.go (modified, +6/-0)
  • ml/backend/ggml/ggml.go (modified, +38/-9)

Problem

On the setup below, where the model does not fully fit in VRAM, ollama runner's prefill is slower than the llama runner path on qwen3-coder-next.

Test platform:

| Item | Configuration |
| --- | --- |
| CPU | Intel Core Ultra 9 265K (Arrow Lake) |
| RAM | 128 GB DDR5 6400 MT/s |
| GPU | NVIDIA RTX 3090 (24 GB VRAM) |
| OS | Windows 11 |
| Model | qwen3-coder-next 80B Q4_K_M (~52 GB GGUF, 49 layers) |
| Workload | batch_size=1024, prompt=1024, max_tokens=16 |

Prefill Results:

| Config | prefill_ms | prefill_tps |
| --- | --- | --- |
| ollama runner (default) | 2061 ms | 475 t/s |
| llama runner | 1591 ms | 616 t/s |
| gap | +470 ms / +30% | -23% |
