ollama - 💡(How to fix) Fix Please add qwen3.5:122b-a10b-q8_0 quantization to model registry [1 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15441Fetched 2026-04-09 07:51:09
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Participants

Fix Action

Fix / Workaround

Without a Q8 tag in the registry I can't run a fair head-to-head against my current llama.cpp Q8 baseline. The workarounds (manual GGUF build + Modelfile import) lose the registry auto-update story, which is most of the value of switching to Ollama in the first place.

RAW_BUFFERClick to expand / collapse

Request

Please add a q8_0 quantization tag for qwen3.5:122b-a10b. The current registry only has q4_K_M (81 GB) under that model.

Context

I run a small document-processing business doing structured data extraction from business documents into spreadsheets. The pipeline emits deterministic JSON via structured/constrained generation, where small accuracy regressions show up immediately as wrong numbers in the downstream output. In my testing, Q4 vs Q8 is a measurably non-trivial accuracy gap on this workload — Q4 produces enough errors per document to be unusable in production. Q6 is the floor I can tolerate.

Current setup:

  • M2 Ultra 128 GB → running Qwen3.5-122B-A10B at Q6_K via llama.cpp today (Q8 won't fit alongside the rest of my services)
  • M3 Ultra 256 GB on order, specifically to run multiple concurrent Q8 workers for higher accuracy and parallelism
  • I'd like to evaluate Ollama's MLX backend on the new machine once it arrives

Without a Q8 tag in the registry I can't run a fair head-to-head against my current llama.cpp Q8 baseline. The workarounds (manual GGUF build + Modelfile import) lose the registry auto-update story, which is most of the value of switching to Ollama in the first place.

Why this might be a good signal for Ollama

122B-A10B at Q8 is one of the more demanding sustained-throughput workloads targetable on Apple Silicon at this hardware tier. It stresses MLX's prompt caching, KV cache sizing for hybrid attention, and the recent Qwen3.5 thinking-token fixes from v0.19 all at once. If it runs cleanly at this size point on M3 Ultra, it validates the MLX backend for the whole "single-machine production inference on Mac Studio" use case.

Evaluation plan once the hardware lands

3–5 document benchmark comparing wall-clock time, output JSON byte-equality vs current llama.cpp Q8 baseline, thinking-token handling under structured output, and multi-worker behavior (4 concurrent extractions). Happy to share results here if useful.

Thanks.

extent analysis

TL;DR

Add a q8_0 quantization tag for qwen3.5:122b-a10b to the registry to enable fair comparison with the current llama.cpp Q8 baseline.

Guidance

  • Evaluate the current registry configuration to understand why the q8_0 quantization tag is missing for qwen3.5:122b-a10b.
  • Consider manually building GGUF and importing the model file as a temporary workaround, but note that this will lose the registry auto-update functionality.
  • Once the M3 Ultra 256 GB machine arrives, test the Ollama MLX backend with the q8_0 quantization tag to validate its performance for single-machine production inference on Mac Studio.
  • Plan to compare the performance of Ollama with the current llama.cpp Q8 baseline using a 3-5 document benchmark, evaluating wall-clock time, output JSON byte-equality, thinking-token handling, and multi-worker behavior.

Example

No code snippet is provided as it is not explicitly supported by the issue.

Notes

The addition of the q8_0 quantization tag is crucial for a fair comparison between Ollama and the current llama.cpp Q8 baseline. The evaluation plan should provide valuable insights into the performance of Ollama's MLX backend for production inference on Mac Studio.

Recommendation

Apply workaround: Manually build GGUF and import the model file until the q8_0 quantization tag is added to the registry, as this will allow for some level of comparison and testing, albeit without the registry auto-update functionality.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

ollama - 💡(How to fix) Fix Please add qwen3.5:122b-a10b-q8_0 quantization to model registry [1 participants]