ollama - 💡(How to fix) MLX/NVFP4 models ignore num_ctx and always load with 256K context [2 comments, 2 participants]

ollama · 2026-05-19 06:40:39

GitHub stats

ollama/ollama#16219•Fetched 2026-05-20 03:39:33

Author

kuri54

Participants

kuri54

rick-github

Timeline (top)

commented ×2, closed ×1, labeled ×1

Root Cause

This makes local Codex App usage much slower because the model is loaded with the full 256K context.

What is the issue?

I found another issue that seems separate from the Codex App catalog metadata issue.

Environment:

macOS
Mac Studio M2 Ultra, 192GB unified memory
Ollama v0.24.0
Model: qwen3.6:27b-coding-nvfp4
Custom model: qwen36-coding-fast

Modelfile:

FROM qwen3.6:27b-coding-nvfp4
PARAMETER num_ctx 16384
PARAMETER num_predict 512
PARAMETER temperature 0.1
PARAMETER top_p 0.9

ollama show qwen36-coding-fast --modelfile shows that num_ctx is set to 16384.
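To confirm what the Modelfile actually declares before blaming the runtime, the PARAMETER lines can be extracted programmatically. The following is a minimal sketch, not part of Ollama: `parse_modelfile_parameters` is a hypothetical helper, and it assumes each parameter sits on its own `PARAMETER <name> <value>` line, as in the Modelfile above.

```python
def parse_modelfile_parameters(text: str) -> dict[str, str]:
    """Collect `PARAMETER <name> <value>` lines from Modelfile text.

    Hypothetical helper for illustration; if a parameter appears more
    than once, the last value wins.
    """
    params: dict[str, str] = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].upper() == "PARAMETER":
            params[parts[1]] = parts[2]
    return params


# Sample text mirroring the Modelfile from this report.
SAMPLE_MODELFILE = """FROM qwen3.6:27b-coding-nvfp4
PARAMETER num_ctx 16384
PARAMETER num_predict 512
PARAMETER temperature 0.1
PARAMETER top_p 0.9
"""

declared = parse_modelfile_parameters(SAMPLE_MODELFILE)
print(declared.get("num_ctx"))
```

In practice the input text would come from the output of `ollama show qwen36-coding-fast --modelfile`, for example captured with `subprocess.run(..., capture_output=True, text=True)`.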

I also tried explicitly passing num_ctx at runtime:

curl -s http://localhost:11434/api/generate \
  -d '{"model":"qwen36-coding-fast","prompt":"","keep_alive":"30m","options":{"num_ctx":16384}}' >/dev/null

Before testing, I stopped the model with: ollama stop qwen36-coding-fast:latest

However, ollama ps still shows: qwen36-coding-fast:latest ... CONTEXT 262144

For comparison, the same approach works correctly with gpt-oss:20b: ollama ps shows the requested context size. So this appears to be specific to MLX/NVFP4 models, or at least qwen3.6:27b-coding-nvfp4.

Expected: The model should load with CONTEXT 16384.

Actual: The model loads with CONTEXT 262144 regardless of Modelfile num_ctx or runtime options.num_ctx.

This makes local Codex App usage much slower because the model is loaded with the full 256K context.
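The expected-versus-actual check described above can be scripted so it is easy to repeat across models. This is a rough sketch under stated assumptions: it assumes the `/api/ps` endpoint reports the value shown in the CONTEXT column under a `context_length` key, which may not hold for every Ollama version, and the helper names (`build_generate_payload`, `loaded_context`, `check_context`) are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, as in the report


def build_generate_payload(model: str, num_ctx: int) -> dict:
    """Same request body as the curl command: empty prompt, explicit num_ctx."""
    return {
        "model": model,
        "prompt": "",
        "keep_alive": "30m",
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def with_default_tag(model: str) -> str:
    """`ollama ps` lists models with a tag, so add ':latest' when none is given."""
    return model if ":" in model else model + ":latest"


def loaded_context(ps_response: dict, model: str):
    """Return the context size reported for `model`, or None if not loaded.

    Assumption: each entry carries a `context_length` field.
    """
    wanted = with_default_tag(model)
    for entry in ps_response.get("models", []):
        if wanted in (entry.get("name"), entry.get("model")):
            return entry.get("context_length")
    return None


def check_context(model: str, requested: int):
    """Load the model with `requested` context, then compare against /api/ps."""
    body = json.dumps(build_generate_payload(model, requested)).encode()
    req = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # response body is not needed, only the load side effect
    with urllib.request.urlopen(OLLAMA_URL + "/api/ps") as resp:
        ps = json.load(resp)
    actual = loaded_context(ps, model)
    return actual == requested, actual
```

Calling `check_context("qwen36-coding-fast", 16384)` against a running server returns a pair of (match, reported context). Running it for both the NVFP4 model and gpt-oss:20b would reproduce the comparison made in this report.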

Relevant log output

OS

No response

GPU

No response

CPU

No response

Ollama version

No response
