ollama: MLX/NVFP4 models ignore num_ctx and always load with 256K context [2 comments, 2 participants]

ollama/ollama#16219, fetched 2026-05-20 03:39:33
Comments: 2 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline: commented ×2, closed ×1, labeled ×1


What is the issue?

I found another issue that seems separate from the Codex App catalog metadata issue.

Environment:

  • macOS
  • Mac Studio M2 Ultra, 192GB unified memory
  • Ollama v0.24.0
  • Model: qwen3.6:27b-coding-nvfp4
  • Custom model: qwen36-coding-fast

Modelfile:

FROM qwen3.6:27b-coding-nvfp4

PARAMETER num_ctx 16384
PARAMETER num_predict 512
PARAMETER temperature 0.1
PARAMETER top_p 0.9

ollama show qwen36-coding-fast --modelfile shows that num_ctx is set to 16384.
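
To double-check the configured value outside of Ollama, the Modelfile text can be parsed directly. The following is a minimal sketch, not part of the original report: the helper name `parse_modelfile_parameters` is hypothetical, and it assumes every parameter line has the simple form `PARAMETER <name> <value>` as in the Modelfile above.

```python
def parse_modelfile_parameters(text: str) -> dict:
    """Collect PARAMETER name/value pairs from Modelfile text.

    Hypothetical helper for illustration. If a name appears more than
    once, only the last value is kept.
    """
    params = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        # A parameter line has the form: PARAMETER <name> <value>
        if len(parts) == 3 and parts[0].upper() == "PARAMETER":
            params[parts[1]] = parts[2]
    return params


# The Modelfile from the report, one instruction per line.
modelfile = """\
FROM qwen3.6:27b-coding-nvfp4

PARAMETER num_ctx 16384
PARAMETER num_predict 512
PARAMETER temperature 0.1
PARAMETER top_p 0.9
"""

params = parse_modelfile_parameters(modelfile)
print(params["num_ctx"])
```

In practice the text would come from the output of `ollama show qwen36-coding-fast --modelfile` rather than from a string literal.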

I also tried explicitly passing num_ctx at runtime:

curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen36-coding-fast","prompt":"","keep_alive":"30m","options":{"num_ctx":16384}}' >/dev/null
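
The same request can be built from Python with only the standard library, which makes the payload easier to inspect before sending. This is a sketch under the assumption that Ollama is listening on the default local address used in the curl command; the function names `build_load_request` and `send` are hypothetical.

```python
import json
import urllib.request

# Endpoint taken from the curl command in the report.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_load_request(model: str, num_ctx: int, keep_alive: str = "30m") -> urllib.request.Request:
    """Build the POST request that loads `model` with the requested context size."""
    payload = {
        "model": model,
        "prompt": "",  # empty prompt, as in the curl command
        "keep_alive": keep_alive,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code (needs a running server)."""
    with urllib.request.urlopen(req) as resp:
        return resp.status


req = build_load_request("qwen36-coding-fast", 16384)
print(json.loads(req.data)["options"])
```

Calling `send(req)` performs the actual load; it is left out of the top-level code so the sketch can be run without a server.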

Before testing, I stopped the model with: ollama stop qwen36-coding-fast:latest

However, ollama ps still shows: qwen36-coding-fast:latest ... CONTEXT 262144

For comparison, the same approach works correctly with gpt-oss:20b: ollama ps shows the requested context size. So this appears to be specific to MLX/NVFP4 models, or at least qwen3.6:27b-coding-nvfp4.

Expected: The model should load with CONTEXT 16384.

Actual: The model loads with CONTEXT 262144 regardless of Modelfile num_ctx or runtime options.num_ctx.
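
The expected-versus-actual check can be scripted by reading the CONTEXT column from `ollama ps` output. The sketch below is an illustration only: the function name `loaded_context` is hypothetical, it assumes the output is a table whose columns are separated by at least two spaces, and the sample table uses placeholders for every column except NAME and CONTEXT, which are the only values given in the report.

```python
import re
from typing import Optional


def loaded_context(ps_output: str, model: str) -> Optional[int]:
    """Return the CONTEXT value for `model` from `ollama ps`-style text, or None."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return None
    # Assumption: columns are separated by two or more spaces, so a cell
    # that contains a single space still counts as one column.
    header = re.split(r"\s{2,}", lines[0].strip())
    if "CONTEXT" not in header:
        return None
    idx = header.index("CONTEXT")
    for ln in lines[1:]:
        cols = re.split(r"\s{2,}", ln.strip())
        if cols[0] == model and idx < len(cols):
            return int(cols[idx])
    return None


# Placeholder table: only NAME and CONTEXT are taken from the report.
sample = (
    "NAME                        ID      SIZE      PROCESSOR      CONTEXT    UNTIL\n"
    "qwen36-coding-fast:latest   <id>    <size>    <processor>    262144     <until>\n"
)

requested = 16384
actual = loaded_context(sample, "qwen36-coding-fast:latest")
print(actual, actual == requested)
```

For scale, 262144 / 16384 = 16, so the loaded context window is 16 times the requested one.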

This makes local Codex App usage much slower because the model is loaded with the full 256K context.

