ollama: Poor performance of the nvfp4 models on MacBook Pro M3

GitHub stats
ollama/ollama#16127 (fetched 2026-05-14 03:29:04)
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 0
Timeline (top): closed ×1, labeled ×1


What is the issue?

If I'm reading this article correctly, NVFP4 models are supposed to be the shiny new turbo boost for MLX on Apple Silicon - faster inference, happy developers, rainbows and unicorns. Great! Except… my benchmarks are telling a very different story. Not only are these models NOT faster, some of them are actually slower.

This feels like a bug - either in the article, the software, or somewhere deep in the universe where things are supposed to make sense but clearly don't. Because right now, NVFP4 on MLX makes zero practical sense, and that's both confusing and honestly pretty frustrating.

Please tell me I'm missing something obvious here. I would love nothing more than to be wrong about this. Thanks!

Relevant log output

% ollama run gemma4:31b-nvfp4
>>> /set verbose
>>> what can you do?
...
total duration:       2m45.425681083s
load duration:        85.741208ms
prompt eval count:    21 token(s)
prompt eval duration: 2.164352083s
prompt eval rate:     9.70 tokens/s
eval count:           1026 token(s)
eval duration:        2m43.175126458s
eval rate:            6.29 tokens/s


% ollama run gemma4:31b-it-q4_K_M
>>> /set verbose
>>> what can you do?
...
total duration:       2m57.964217083s
load duration:        184.854791ms
prompt eval count:    21 token(s)
prompt eval duration: 479.860375ms
prompt eval rate:     43.76 tokens/s
eval count:           1171 token(s)
eval duration:        2m56.923986858s
eval rate:            6.62 tokens/s
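
To make the comparison easier to read, here is a minimal Python sketch that recomputes the tokens-per-second figures from the counts and durations in the log above and prints the ratio between the two runs. The helper names (`parse_duration`, `rate`) and the `runs` dictionary are hypothetical and only serve this illustration; the durations are assumed to use the Go-style format shown in the verbose output (for example `2m43.175126458s` or `479.860375ms`).

```python
import re

# Seconds per unit for Go-style duration strings such as "2m43.175126458s".
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0,
          "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units are listed first so "ms" is not read as "m" followed by "s".
_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")


def parse_duration(text):
    """Convert a duration like '2m43.17s' or '85.7ms' to seconds."""
    parts = _PART.findall(text.strip())
    if not parts:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(value) * _UNITS[unit] for value, unit in parts)


def rate(tokens, duration):
    """Tokens per second for a token count and a duration string."""
    return tokens / parse_duration(duration)


# Counts and durations copied from the two verbose logs in this report.
runs = {
    "gemma4:31b-nvfp4": {
        "prompt": (21, "2.164352083s"),
        "eval": (1026, "2m43.175126458s"),
    },
    "gemma4:31b-it-q4_K_M": {
        "prompt": (21, "479.860375ms"),
        "eval": (1171, "2m56.923986858s"),
    },
}

rates = {
    model: {phase: rate(*sample) for phase, sample in phases.items()}
    for model, phases in runs.items()
}

for model, phases in rates.items():
    print(f"{model}: prompt {phases['prompt']:.2f} tok/s, "
          f"eval {phases['eval']:.2f} tok/s")

# Ratio of q4_K_M to nvfp4; values above 1 mean nvfp4 is the slower one.
nvfp4, q4 = rates["gemma4:31b-nvfp4"], rates["gemma4:31b-it-q4_K_M"]
for phase in ("prompt", "eval"):
    print(f"{phase}: q4_K_M / nvfp4 = {q4[phase] / nvfp4[phase]:.2f}x")
```

On these numbers, prompt evaluation on the nvfp4 model is roughly 4.5 times slower than on q4_K_M, while generation is about 5% slower. Each figure comes from a single run with a 21-token prompt, so the prompt-eval comparison in particular rests on a very small sample.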

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.23.3
