ollama - ✅ (Solved) Fix: ollama runner's prefill is slower than llama runner path on `qwen3-coder-next` [1 pull request, 1 comment, 2 participants]

ollama · 2026-05-19 08:06:04

GitHub stats

ollama/ollama#16221 • Fetched 2026-05-20 03:39:30

Author: lingyuncai

Participants: lingyuncai, pdevine

Timeline (top): closed ×1, commented ×1, cross-referenced ×1, labeled ×1

Root Cause

Investigated via GGML_SCHED_DEBUG=2 dump. The op dispatch is correct — all weight matmuls are already scheduled to CUDA via the op_offload path. The slowdown is in the host-to-device transfer path for CPU-resident weights.

ollama runner builds the CPU buffer type list here:

https://github.com/ollama/ollama/blob/56b319f457d6c47a5a69c893110dcfb8290f93bd/ml/backend/ggml/ggml.go#L167-L178

Only ggml_backend_dev_buffer_type() (plain malloc-backed pageable memory) is added to the list; the GPU's host buffer type is not considered.

llama runner (make_cpu_buft_list) does the same accel pass first, then explicitly adds the first GPU's host buffer type ahead of the plain CPU buffer:

https://github.com/ollama/ollama/blob/56b319f457d6c47a5a69c893110dcfb8290f93bd/llama/llama.cpp/src/llama-model.cpp#L336-L350

ggml_backend_dev_host_buffer_type() returns the CUDA backend's cudaMallocHost-backed pinned buffer type. As noted in the llama.cpp comment above, storing tensors in a host buffer reduces data transfer time when large batches are offloaded to a GPU. With ~28 GiB of CPU-resident weights per prefill batch, this path difference may directly explain the observed 470 ms prefill gap.

Fix Action

Fix / Workaround

PR fix notes

PR #16222: ml/ggml: add OLLAMA_PINNED_HOST_BUFFER opt-in for faster prefill on partial-offload models

Repository: ollama/ollama
Author: lingyuncai
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/16222

Description (problem / solution / changelog)

Summary

Adds an OLLAMA_PINNED_HOST_BUFFER=1 opt-in that routes CPU-resident model weights through CUDA pinned host memory (cudaMallocHost) instead of plain pageable memory, closing the prefill gap described in issue #16221.

Motivation

On partial-offload setups (model size > VRAM), ollama runner's CPU buffer type list is built from ggml_backend_dev_buffer_type() only — plain malloc-backed pageable memory. GPU devices are skipped entirely in that loop, so the pinned host buffer type exposed via ggml_backend_dev_host_buffer_type() is never used for weight allocation.

llama runner's make_cpu_buft_list explicitly adds the first GPU's host buffer type to the list, ahead of the plain CPU buffer, for exactly this reason (see comment in llama-model.cpp:336-350). This means CPU-resident weights in llama runner land in cudaMallocHost-backed pinned memory, while those in ollama runner sit in pageable memory and incur a driver staging copy on every H2D transfer. With ~28 GiB of CPU-resident weights per prefill batch, this is a meaningful per-batch cost.

Implementation

Backend.New() builds the CPU buft list in three explicit passes, matching llama.cpp's make_cpu_buft_list order (ACCEL → GPU host → CPU):

// ACCEL pass (unchanged)
for _, d := range accels {
    bt := C.ggml_backend_dev_buffer_type(d)
    cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, bt)
    btDeviceMemory[bt] = &requiredMemory.CPU
}

// GPU host pass (opt-in)
if envconfig.PinnedHostBuffer() {
    for _, d := range gpus {
        hostBuft := C.ggml_backend_dev_host_buffer_type(d)
        if hostBuft == nil {
            continue
        }
        cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, hostBuft)
        btDeviceMemory[hostBuft] = &requiredMemory.CPU
        break
    }
}

// CPU pass (unchanged)
for _, d := range cpus {
    bt := C.ggml_backend_dev_buffer_type(d)
    cpuDeviceBufferType.bts = append(cpuDeviceBufferType.bts, bt)
    btDeviceMemory[bt] = &requiredMemory.CPU
}

createTensor already walks bts in order, so CPU-resident weights automatically land in pinned memory when the opt-in is set. No changes to op scheduling, decode path, or KV cache placement.

Results

Test platform:

Configuration
CPU	Intel Core Ultra 9 265K (Arrow Lake)
RAM	128 GB DDR5 6400 MT/s
GPU	NVIDIA RTX 3090 (24 GB VRAM)
OS	Windows 11
Model	qwen3-coder-next 80B Q4_K_M (~52 GB GGUF, 49 layers)
Workload	batch_size=1024, prompt=1024, max_tokens=16

Config	prefill_ms	prefill_tps
ollama runner (default)	2061 ms	475 t/s
ollama runner + `OLLAMA_PINNED_HOST_BUFFER=1`	1469 ms	666 t/s

-29% prefill latency / +40% prefill tps vs default ollama runner

Note

Allocation failure fallback: if cudaMallocHost fails, createTensor advances to the next buft in the list — plain CPU buffer. Worst case is identical to the disabled path.

No-op cases:

OLLAMA_PINNED_HOST_BUFFER unset (default)
No CUDA GPU present (gpus slice is empty)
Models that fully fit in VRAM (no CPU-resident weights to transfer)

Changed files

envconfig/config.go (modified, +6/-0)
ml/backend/ggml/ggml.go (modified, +38/-9)


Problem

On the setup below, where the model doesn't fully fit in VRAM, ollama runner's prefill is slower than the llama runner path on qwen3-coder-next.

Test platform:

Configuration
CPU	Intel Core Ultra 9 265K (Arrow Lake)
RAM	128 GB DDR5 6400 MT/s
GPU	NVIDIA RTX 3090 (24 GB VRAM)
OS	Windows 11
Model	qwen3-coder-next 80B Q4_K_M (~52 GB GGUF, 49 layers)
Workload	batch_size=1024, prompt=1024, max_tokens=16

Prefill Results:

Config	prefill_ms	prefill_tps
ollama runner (default)	2061 ms	475 t/s
llama runner	1591 ms	616 t/s
gap	+470 ms / +30%	-23%
