ollama - MLX - nvfp4 models on macOS extremely slow

GitHub stats
ollama/ollama#16030 · Fetched 2026-05-07 03:31:27
Comments: 0 · Participants: 1 · Timeline: 1 (labeled ×1) · Reactions: 0

What is the issue?

Up to version 0.20.0, models such as qwen3.6:27b-nvfp4, qwen3.5:27b-coding-nvfp4, and qwen3.6:35b-a3b-nvfp4 worked fine. A simple prompt took around 2 minutes to produce a result (ollama run MODEL "give me a definition for strategy and for tactics. Provide at least 3 examples to differentiate the concepts.").

Ollama 0.20.1: all models took between 1 minute 50 seconds and 2 minutes 25 seconds to answer (results of between 120 and 250 lines of content). Ollama 0.23.1: these models take 45 [seconds] to write a single line.

The other models, run through the same test, take basically the same time and give the same results as before (no notable changes); only the MLX models seem affected.
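To make the comparison between versions easier to repeat, the manual test above could be wrapped in a small timing script. This is only a sketch: `time_model` and its `run` parameter are hypothetical names introduced here, and it assumes the `ollama` binary is on the PATH. It measures the wall-clock time of one `ollama run MODEL PROMPT` call and counts the lines of output.

```python
import subprocess
import time

# Prompt taken verbatim from the report
PROMPT = (
    "give me a definition for strategy and for tactics. "
    "Provide at least 3 examples to differentiate the concepts."
)


def time_model(model, prompt=PROMPT, run=subprocess.run):
    """Return (elapsed seconds, output line count) for one `ollama run` call.

    `run` defaults to subprocess.run and is injectable only so the helper
    can be exercised without a local Ollama install.
    """
    start = time.perf_counter()
    result = run(["ollama", "run", model, prompt], capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, len(result.stdout.splitlines())
```

Calling `time_model("qwen3.5:27b-coding-nvfp4")` once on each Ollama version, and once with a non-MLX model as a control, would give seconds and line counts that can be compared directly.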

Relevant log output

No visible errors; it seems to work OK, but it is extremely slow.

time=2026-05-06T19:47:02.529-03:00 level=INFO source=client.go:359 msg="starting mlx runner subprocess" model=qwen3.5:27b-coding-nvfp4 port=58467
time=2026-05-06T19:47:02.532-03:00 level=INFO source=sched.go:561 msg="loaded runners" count=1
time=2026-05-06T19:47:02.587-03:00 level=INFO source=server.go:44 msg="MLX engine initialized" "MLX version"=0.31.2 device=gpu
time=2026-05-06T19:47:02.667-03:00 level=INFO source=base.go:110 msg="Model architecture" arch=Qwen3_5ForConditionalGeneration
time=2026-05-06T19:47:02.966-03:00 level=INFO source=runner.go:159 msg="Loaded tensors from manifest" count=1584
time=2026-05-06T19:47:09.859-03:00 level=INFO source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=58467
time=2026-05-06T19:47:09.980-03:00 level=INFO source=server.go:213 msg=ServeHTTP method=GET path=/v1/status took=10.853875ms status="200 OK"
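The timestamps in this log can be used to check how long the runner took to come up. The sketch below parses three of the lines above (copied verbatim) and computes the time between "starting mlx runner subprocess" and "Starting HTTP server". The helper names `parse` and `seconds_between` are hypothetical. The log only covers startup, so this can at most suggest that model loading is not where the time goes; it says nothing about generation speed.

```python
import re
from datetime import datetime

# Three lines copied verbatim from the log above
LOG_LINES = [
    'time=2026-05-06T19:47:02.529-03:00 level=INFO source=client.go:359 msg="starting mlx runner subprocess" model=qwen3.5:27b-coding-nvfp4 port=58467',
    'time=2026-05-06T19:47:02.966-03:00 level=INFO source=runner.go:159 msg="Loaded tensors from manifest" count=1584',
    'time=2026-05-06T19:47:09.859-03:00 level=INFO source=runner.go:194 msg="Starting HTTP server" host=127.0.0.1 port=58467',
]


def parse(line):
    """Return (timestamp, message) for one log line."""
    ts = datetime.fromisoformat(re.search(r"time=(\S+)", line).group(1))
    # msg= is quoted when it contains spaces and bare otherwise
    m = re.search(r'msg=(?:"([^"]*)"|(\S+))', line)
    return ts, m.group(1) or m.group(2)


def seconds_between(lines, first_msg, second_msg):
    """Seconds elapsed between the lines carrying the two messages."""
    stamps = {msg: ts for ts, msg in map(parse, lines)}
    return (stamps[second_msg] - stamps[first_msg]).total_seconds()


startup = seconds_between(
    LOG_LINES, "starting mlx runner subprocess", "Starting HTTP server"
)
print(f"runner startup: {startup:.3f}s")  # prints "runner startup: 7.330s"
```

Most of those 7.3 seconds fall between "Loaded tensors from manifest" and "Starting HTTP server", which `seconds_between` can confirm with the corresponding message pair.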

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.23.1
