MLX bf16 cold prefill is 60–400× slower than warm, even when peak memory fits in physical RAM

ollama · 2026-05-08 20:00:22



What is the issue?

On Apple Silicon (M3 Ultra, 96 GB unified memory, macOS 26.4.1 / build 25E253), the first /api/chat request to a freshly loaded MLX bf16 model is 60–400× slower at the prefill stage than warm requests against the same model. Subsequent requests perform normally. Decode tok/s is unaffected throughout — the regression is purely on cold prefill.

Reproducible across two distinct MLX bf16 models on Ollama 0.23.2:

gemma4:26b-mlx-bf16 (52 GB on disk; 47 GiB peak memory — fits comfortably in 96 GiB physical RAM)
laguna-xs.2:mlx-bf16 (67 GB on disk; 101 GiB peak memory — exceeds physical RAM)

Persists across 0.21.0 → 0.23.2 including the v0.23.1 "MLX and MLX-C with threading fixes" release.

Two findings, captured separately

Finding A — universal: cold prefill is dramatically slow even when memory fits.

gemma4:26b-mlx-bf16 rules out pure OOM-paging as the explanation. With a peak memory footprint of 47.13 GiB on a 96 GiB host, the model fits in unified memory with ~49 GiB free. Yet the cold request for a 17-token prompt took 72 seconds end to end (the Prompt processing progress log lines show a ~65-second gap between "cache miss" and processed=13 of 17). A second cold load (force-unload via keep_alive=0, then reload) with a 26-token prompt took 19.7 seconds end to end: peak memory was 47.08 GiB with no swap pressure, yet the first 22 tokens of prefill took 18 seconds (~1.2 t/s).

For comparison, warm prefill on the same model completes in single-digit milliseconds when the KV cache hits; without cache hits, real warm prefill is in the 500–540 t/s range. The same prompt against the GGUF Q4_K_M sibling on the same hardware completes cold prefill at 1,655 t/s, about three orders of magnitude faster than the ~1.2 t/s cold MLX rate above.
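The per-token rates quoted for Finding A follow directly from the log gaps. A minimal sketch, assuming the whole interval between "cache miss" and the progress line is spent on prefill (the helper name prefill_rate is made up for illustration and is not part of Ollama):

```shell
# prefill_rate TOKENS SECONDS -> tokens per second, two decimals.
# Hypothetical helper; assumes the entire gap is prefill work.
prefill_rate() {
  awk -v t="$1" -v s="$2" 'BEGIN { printf "%.2f\n", t / s }'
}

# First gemma4 cold load: 13 tokens between 14:52:27.142 and 14:53:32.767.
prefill_rate 13 65.625
# Second gemma4 cold load: 22 tokens between 14:54:04.113 and 14:54:22.058.
prefill_rate 22 17.945
```

Both results land well below 2 t/s, against the 500–540 t/s warm range reported for the same model.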

Finding B — laguna-specific: peak memory exceeds bf16 weight size by ~50%.

For laguna-xs.2:mlx-bf16 (67 GB on-disk weights), the MLX engine logs peak memory of 101.35–101.42 GiB. That's ~38 GiB above the weight size and ~5 GiB above physical RAM on this 96 GiB machine. KV cache for a 34-token prompt should not contribute ~38 GiB. The over-allocation forces OS paging, which compounds the Finding-A cold-prefill slowness: wall-clock for the 34-token prompt in the log is 100 seconds, and the 46-token bench prompt takes 160 seconds (vs. 72 s for the comparably sized gemma4 prompt that fits in memory).
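The overhead arithmetic mixes GB (on-disk size) with GiB (logged peak), so it helps to convert before subtracting. A small sketch, assuming the 67 GB figure is decimal gigabytes (10^9 bytes); the report does not say which unit is meant, and overhead_gib is a hypothetical helper name:

```shell
# overhead_gib PEAK_GIB WEIGHTS_GB -> weights in GiB, overhead in GiB, peak/weights ratio.
# Assumption: WEIGHTS_GB is decimal gigabytes (10^9 bytes).
overhead_gib() {
  awk -v peak="$1" -v gb="$2" 'BEGIN {
    weights = gb * 1e9 / 1073741824
    printf "weights=%.2f GiB overhead=%.2f GiB ratio=%.2f\n", weights, peak - weights, peak / weights
  }'
}

# laguna-xs.2:mlx-bf16: 101.35 GiB logged peak, 67 GB on disk.
overhead_gib 101.35 67
```

Under the decimal reading the overhead comes out near 39 GiB, close to the ~38 GiB quoted. The peak is roughly 1.5 to 1.6 times the weight size, depending on whether 67 GB is read as binary or decimal.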

The two findings could be the same bug manifesting differently (a workspace/buffer allocation that scales architecture-dependently) or two related bugs. Both reproduce cleanly.

System info

Hardware: Mac Studio M3 Ultra (28-core CPU, 60-core GPU, 96 GB unified memory)
macOS: 26.4.1 (build 25E253)
Ollama: 0.23.2 (also reproduced on 0.21.0)
MLX runtime version: 0.31.2-7-ge8ebdeb (logged at MLX engine initialized)
Backend env: OLLAMA_MLX=1, OLLAMA_FLASH_ATTENTION=1, OLLAMA_NUM_PARALLEL=1, OLLAMA_KEEP_ALIVE=-1, OLLAMA_MAX_LOADED_MODELS=3
Affected formats: MLX bf16 only. GGUF Q4_K_M on the same hardware is unaffected.

Smoking-gun log lines — gemma4:26b-mlx-bf16 (memory fits, still slow)

14:52:02.307  starting mlx runner subprocess  model=gemma4:26b-mlx-bf16
14:52:02.323  MLX engine initialized          MLX version=0.31.2-7-ge8ebdeb device=gpu
14:52:02.433  Model architecture              arch=Gemma4ForConditionalGeneration
14:52:03.057  Loaded tensors from manifest    count=8633
14:52:27.031  Starting HTTP server                              ← weights load took ~24s
14:52:27.142  cache miss   total=17 matched=0 cached=0 left=17
14:53:32.767  Prompt processing progress  processed=13 total=17 ← +65s, no memory pressure
14:53:34.353  Prompt processing progress  processed=16 total=17
14:53:39.434  ServeHTTP POST /v1/completions  took=1m12.32s
14:53:39.434  peak memory size="47.13 GiB"                     ← well under 96 GiB physical

Force-unload + cold reload, second cold-prefill (26-token prompt):

14:53:42.600  starting mlx runner subprocess  (after keep_alive=0 unload)
14:54:04.064  Starting HTTP server
14:54:04.113  cache miss   total=26 matched=0 cached=0 left=26
14:54:22.058  Prompt processing progress  processed=22 total=26 ← +18s for 22 tokens (~1.2 t/s)
14:54:23.761  ServeHTTP POST /v1/completions  took=19.65s
14:54:23.761  peak memory size="47.08 GiB"
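The gaps annotated in these excerpts can be recomputed from the timestamps alone. A sketch with two hypothetical helpers, assuming both stamps fall on the same calendar day:

```shell
# ts_to_sec HH:MM:SS.mmm -> seconds since midnight.
ts_to_sec() {
  echo "$1" | awk -F: '{ printf "%.3f\n", $1 * 3600 + $2 * 60 + $3 }'
}

# gap_sec START END -> elapsed seconds (same-day timestamps only).
gap_sec() {
  awk -v a="$(ts_to_sec "$1")" -v b="$(ts_to_sec "$2")" 'BEGIN { printf "%.3f\n", b - a }'
}

gap_sec 14:52:27.142 14:53:32.767   # first cold load: cache miss -> processed=13
gap_sec 14:54:04.113 14:54:22.058   # second cold load: cache miss -> processed=22
```

The two calls correspond to the "+65s" and "+18s" annotations in the excerpts.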

Smoking-gun log lines — laguna-xs.2:mlx-bf16 (memory overflows, even slower)

14:13:14.017  cache miss   total=34 matched=0 cached=0 left=34
14:14:52.688  Prompt processing progress  processed=30 total=34  ← +1m38s, paging
14:14:54.080  ServeHTTP POST /v1/completions  took=1m40.05s
14:14:54.080  peak memory size="101.35 GiB"                     ← exceeds 96 GiB physical

Force-unload + cold reload, second cold-prefill (286-token prompt):

14:15:37.960  cache miss   total=286 matched=0 cached=0 left=286
14:17:17.544  Prompt processing progress  processed=282 total=286  ← +1m40s
14:17:23.127  ServeHTTP POST /v1/completions  took=1m45.17s
14:17:23.127  peak memory size="101.42 GiB"

Subsequent warm requests on the now-resident model:

14:17:28.131  ServeHTTP POST /v1/completions  took=4.979s
14:17:33.148  ServeHTTP POST /v1/completions  took=5.000s
14:17:38.323  ServeHTTP POST /v1/completions  took=5.157s

Two-version timing evidence — gemma4:26b-mlx-bf16

492-token prompt, 4 trials per cell, from a prior bench:

Ollama version	Cold prefill	Warm prefill avg	Cold wall
0.21.0	4.2 t/s	~30,000 t/s (cache-hit)	~117 s
0.23.2	6.3 t/s	~30,000 t/s (cache-hit)	~78 s
same machine, GGUF Q4_K_M baseline	1,655 t/s	~30,000 t/s	~0.3 s

Two-prompt timing evidence — laguna-xs.2:mlx-bf16

4 trials per cell, on Ollama 0.23.2:

Prompt size	Cold prefill	Warm prefill avg	Cold wall
46 tokens	0.4 t/s	533 t/s	160 s
286 tokens	2.9 t/s	3,330 t/s	147 s

For comparison, laguna-xs.2:q4_K_M (Q4 sibling, 23 GB, same architecture) on the same hardware:

Prompt size	Cold prefill	Cold wall
46 tokens	113 t/s	3.6 s
286 tokens	1,196 t/s	7.6 s
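To put the two laguna tables side by side, the cold prefill rates can be divided directly. A sketch using only the figures from the tables above; slowdown is a hypothetical helper name:

```shell
# slowdown FAST_TPS SLOW_TPS -> how many times slower the second rate is.
slowdown() {
  awk -v fast="$1" -v slow="$2" 'BEGIN { printf "%.1fx\n", fast / slow }'
}

slowdown 113 0.4     # 46-token prompt: Q4_K_M cold vs bf16 cold
slowdown 1196 2.9    # 286-token prompt: Q4_K_M cold vs bf16 cold
```

Both ratios fall in the hundreds, so on this hardware the Q4 sibling's cold prefill is several hundred times faster than the bf16 cold prefill at either prompt size.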

Reproduction

# Pull an MLX bf16 chat-tuned model — gemma4 reproduces Finding A cleanly
ollama pull gemma4:26b-mlx-bf16     # 52 GB

# Force-unload to ensure clean cold-load state
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-mlx-bf16",
  "messages": [{"role":"user","content":"x"}],
  "stream": false,
  "keep_alive": 0
}'

sleep 3

# Cold request — short prompt, time it
time curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b-mlx-bf16",
  "messages": [{"role":"user","content":"In one short sentence, what is the value of saying nothing?"}],
  "stream": false,
  "options": {"num_predict": 60, "temperature": 0.7},
  "keep_alive": -1,
  "think": false
}' | jq '{prompt_eval_count, prompt_eval_duration, eval_count, eval_duration}'

# Then check the peak-memory line in the server log
tail -30 /tmp/ollama.err | grep -E "peak memory|Prompt processing|cache miss"

The cold response shows prompt_eval_duration in the range of 15–80 seconds for prompts of 17–46 tokens. Send the same payload again immediately and prompt_eval_duration collapses to single-digit milliseconds. The peak-memory log line reports ~47 GiB for gemma4 (under physical RAM) and ~101 GiB for laguna-xs.2 (over physical RAM).
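To compare cold and warm runs as rates rather than raw durations, the two prompt fields can be turned into tokens per second. A sketch, relying on the Ollama API reporting prompt_eval_duration in nanoseconds; prompt_tps is a hypothetical helper and the sample values are illustrative, not measurements from this report:

```shell
# prompt_tps COUNT DURATION_NS -> prompt tokens per second.
# DURATION_NS is prompt_eval_duration, which Ollama reports in nanoseconds.
prompt_tps() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n * 1e9 / ns }'
}

# Illustrative values only:
prompt_tps 26 18000000000   # 26 tokens in 18 s
prompt_tps 26 50000000      # 26 tokens in 50 ms
```

The two arguments map to the prompt_eval_count and prompt_eval_duration fields that the jq filter in the reproduction already extracts.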

Suggested triage / questions

What does the MLX runtime do during the multi-second gap between cache miss and the first Prompt processing progress log line? On gemma4, with peak memory at 47 GiB and 49 GiB of GPU memory still available, there is no swap pressure, yet that gap is 18–65 seconds. Is this some kind of one-time setup that should be cached across requests?
Why does laguna-xs.2:mlx-bf16 peak at 101 GiB for 67 GB of weights? That's ~50% over weight size; KV cache for a 34-token prompt should not account for ~38 GiB. Possibly an architecture-specific workspace allocation (Laguna's mixed-attention pattern: 10 layers global + 30 layers SWA + per-head gating).
Would it be feasible to surface a warning at model-load time when the projected working set exceeds available memory? Today the user just sees a slow-but-not-erroring request.
Is ollama serve --debug a viable way for users to capture more granular MLX-runtime diagnostics, and if so, would maintainers find that output useful for triage?
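Until such a warning exists, the comparison behind the third question can be done by hand from the logged peak. A rough sketch, not how Ollama or MLX decide anything: fits_in_ram is a hypothetical helper, and it ignores memory already used by the OS and other processes.

```shell
# fits_in_ram PEAK_GIB PHYS_BYTES -> prints "fits" or "exceeds".
# Hypothetical check; compares the logged peak against total physical memory only.
fits_in_ram() {
  awk -v peak="$1" -v phys="$2" 'BEGIN {
    phys_gib = phys / 1073741824
    if (peak <= phys_gib) {
      print "fits"
    } else {
      print "exceeds"
    }
  }'
}

# 96 GiB expressed in bytes; on macOS the live value comes from: sysctl -n hw.memsize
PHYS_BYTES=103079215104
fits_in_ram 47.13 "$PHYS_BYTES"    # gemma4 peak from the log
fits_in_ram 101.35 "$PHYS_BYTES"   # laguna-xs.2 peak from the log
```

A real load-time check would need a projected working set before the first request, which is exactly the number this report suggests is hard to predict for laguna-xs.2.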

Happy to provide bench scripts, raw transcripts, additional model tests, or run with --debug if it would help.

Relevant log output

# /tmp/ollama.err on Ollama 0.23.2 (M3 Ultra 96 GiB, macOS 26.4.1)
# Captures TWO cold-load + first-request cycles:
#   (A) gemma4:26b-mlx-bf16 — peak memory 47 GiB, fits in physical RAM
#   (B) laguna-xs.2:mlx-bf16 — peak memory 101 GiB, exceeds physical RAM
# Both show multi-second gaps between "cache miss" and "Prompt processing progress",
# even though only (B) has memory pressure.

# === (A) gemma4:26b-mlx-bf16, fresh cold load ===
time=2026-05-08T14:52:02.307-05:00 source=client.go:359 msg="starting mlx runner subprocess" model=gemma4:26b-mlx-bf16 port=53146
time=2026-05-08T14:52:02.323-05:00 source=server.go:44  msg="MLX engine initialized" "MLX version"=0.31.2-7-ge8ebdeb device=gpu
time=2026-05-08T14:52:02.433-05:00 source=base.go:110   msg="Model architecture" arch=Gemma4ForConditionalGeneration
time=2026-05-08T14:52:03.057-05:00 source=runner.go:159 msg="Loaded tensors from manifest" count=8633
time=2026-05-08T14:52:27.031-05:00 source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=53146
time=2026-05-08T14:52:27.142-05:00 source=cache.go:126  msg="cache miss" total=17 matched=0 cached=0 left=17
time=2026-05-08T14:53:32.767-05:00 source=pipeline.go:135 msg="Prompt processing progress" processed=13 total=17
time=2026-05-08T14:53:34.353-05:00 source=pipeline.go:135 msg="Prompt processing progress" processed=16 total=17
time=2026-05-08T14:53:39.434-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=1m12.316795375s status="200 OK"
time=2026-05-08T14:53:39.434-05:00 source=pipeline.go:71 msg="peak memory" size="47.13 GiB"

# === (A) Same model, force-unloaded then cold-loaded again, 26-token prompt ===
time=2026-05-08T14:53:42.600-05:00 source=client.go:359 msg="starting mlx runner subprocess" model=gemma4:26b-mlx-bf16 port=53398
time=2026-05-08T14:54:04.064-05:00 source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=53398
time=2026-05-08T14:54:04.113-05:00 source=cache.go:126  msg="cache miss" total=26 matched=0 cached=0 left=26
time=2026-05-08T14:54:22.058-05:00 source=pipeline.go:135 msg="Prompt processing progress" processed=22 total=26
time=2026-05-08T14:54:23.761-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=19.654175584s status="200 OK"
time=2026-05-08T14:54:23.761-05:00 source=pipeline.go:71 msg="peak memory" size="47.08 GiB"

# === (B) laguna-xs.2:mlx-bf16, fresh cold load, 34-token prompt ===
time=2026-05-08T14:12:33.579-05:00 source=client.go:359 msg="starting mlx runner subprocess" model=laguna-xs.2:mlx-bf16 port=52137
time=2026-05-08T14:12:33.597-05:00 source=server.go:44  msg="MLX engine initialized" "MLX version"=0.31.2-7-ge8ebdeb device=gpu
time=2026-05-08T14:12:33.741-05:00 source=base.go:110   msg="Model architecture" arch=LagunaForCausalLM
time=2026-05-08T14:12:33.934-05:00 source=runner.go:159 msg="Loaded tensors from manifest" count=30513
time=2026-05-08T14:13:14.004-05:00 source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=52137
time=2026-05-08T14:13:14.017-05:00 source=cache.go:126  msg="cache miss" total=34 matched=0 cached=0 left=34
time=2026-05-08T14:14:52.688-05:00 source=pipeline.go:135 msg="Prompt processing progress" processed=30 total=34
time=2026-05-08T14:14:54.080-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=1m40.050105458s status="200 OK"
time=2026-05-08T14:14:54.080-05:00 source=pipeline.go:71 msg="peak memory" size="101.35 GiB"

# === (B) Same model, force-unloaded then cold-loaded again, 286-token prompt ===
time=2026-05-08T14:14:56.151-05:00 source=client.go:359 msg="starting mlx runner subprocess" model=laguna-xs.2:mlx-bf16 port=52544
time=2026-05-08T14:15:37.902-05:00 source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=52544
time=2026-05-08T14:15:37.960-05:00 source=cache.go:126  msg="cache miss" total=286 matched=0 cached=0 left=286
time=2026-05-08T14:17:17.544-05:00 source=pipeline.go:135 msg="Prompt processing progress" processed=282 total=286
time=2026-05-08T14:17:23.127-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=1m45.167289041s status="200 OK"
time=2026-05-08T14:17:23.127-05:00 source=pipeline.go:71 msg="peak memory" size="101.42 GiB"

# === (B) Subsequent warm requests on the now-resident laguna model: ===
time=2026-05-08T14:17:28.131-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=4.979186084s status="200 OK"
time=2026-05-08T14:17:33.148-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=5.00038275s status="200 OK"
time=2026-05-08T14:17:38.323-05:00 source=server.go:213 msg=ServeHTTP method=POST path=/v1/completions took=5.157451666s status="200 OK"
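The took= values in this log use Go's duration format, which is awkward to compare directly. A sketch that converts them to seconds; go_dur_to_sec is a hypothetical helper that handles only the two shapes seen here ("XmY.Zs" and "Y.Zs"), not hours or millisecond units:

```shell
# go_dur_to_sec DURATION -> seconds with three decimals.
# Handles only "XmY.Zs" and "Y.Zs"; other Go duration shapes are out of scope.
go_dur_to_sec() {
  echo "$1" | awk '{
    d = $0
    sub(/s$/, "", d)          # drop the trailing seconds unit
    min = 0
    if (index(d, "m") > 0) {  # split off the minutes part if present
      split(d, parts, "m")
      min = parts[1]
      d = parts[2]
    }
    printf "%.3f\n", min * 60 + d
  }'
}

go_dur_to_sec 1m40.050105458s   # laguna cold request
go_dur_to_sec 4.979186084s      # laguna warm request
# To convert every request in the log file:
#   grep -o 'took=[^ ]*' /tmp/ollama.err | cut -d= -f2 | while read -r d; do go_dur_to_sec "$d"; done
```

With the values in seconds, the cold and warm laguna requests above differ by about a factor of twenty in wall-clock time.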

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.23.2
