ollama - 💡 (How to fix) Please add qwen3.5:122b-a10b-q8_0 quantization to model registry [1 participant]

ollama · 2026-04-09 02:24:00


GitHub stats

ollama/ollama#15441 • Fetched 2026-04-09 07:51:09


Author

branmacstudio

Participants

branmacstudio

Fix Action

Fix / Workaround

Without a Q8 tag in the registry I can't run a fair head-to-head against my current llama.cpp Q8 baseline. The workarounds (manual GGUF build + Modelfile import) lose the registry auto-update story, which is most of the value of switching to Ollama in the first place.


Request

Please add a q8_0 quantization tag for qwen3.5:122b-a10b. The current registry only has q4_K_M (81 GB) under that model.

Context

I run a small document-processing business doing structured data extraction from business documents into spreadsheets. The pipeline emits deterministic JSON via structured/constrained generation, where small accuracy regressions show up immediately as wrong numbers in the downstream output. In my testing, Q4 vs Q8 is a measurably non-trivial accuracy gap on this workload — Q4 produces enough errors per document to be unusable in production. Q6 is the floor I can tolerate.

Current setup:

M2 Ultra 128 GB → running Qwen3.5-122B-A10B at Q6_K via llama.cpp today (Q8 won't fit alongside the rest of my services)
M3 Ultra 256 GB on order, specifically to run multiple concurrent Q8 workers for higher accuracy and parallelism
I'd like to evaluate Ollama's MLX backend on the new machine once it arrives
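The memory constraint in the setup above can be checked with rough arithmetic. The following is a minimal sketch, not a measurement: it assumes a nominal 122e9 parameters (read off the model name), that every weight is stored in the named format, and the nominal GGUF block costs of 8.5 bits per weight for Q8_0 and 6.5625 for Q6_K. It ignores the KV cache, runtime overhead, and any tensors kept at a different precision, so real file sizes will differ.

```python
# Nominal storage cost per weight for two GGUF quantization formats.
# Q8_0: 32 int8 values plus one fp16 scale per block -> 34 bytes / 32 weights.
# Q6_K: 210 bytes per 256-weight super-block.
BITS_PER_WEIGHT = {"Q6_K": 6.5625, "Q8_0": 8.5}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate weight storage in GB (10^9 bytes) for a uniform quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    n_params = 122e9  # assumption: nominal count taken from the model name
    for quant in ("Q6_K", "Q8_0"):
        print(f"{quant}: ~{estimate_weights_gb(n_params, quant):.1f} GB of weights")
```

Under these assumptions the Q8_0 weights alone come to roughly 130 GB, which exceeds the 128 GB of the M2 Ultra before any other service is loaded, while Q6_K lands near 100 GB.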

Why this might be a good signal for Ollama

122B-A10B at Q8 is one of the more demanding sustained-throughput workloads targetable on Apple Silicon at this hardware tier. It stresses MLX's prompt caching, KV cache sizing for hybrid attention, and the recent Qwen3.5 thinking-token fixes from v0.19 all at once. If it runs cleanly at this size point on M3 Ultra, it validates the MLX backend for the whole "single-machine production inference on Mac Studio" use case.

Evaluation plan once the hardware lands

3–5 document benchmark comparing wall-clock time, output JSON byte-equality vs current llama.cpp Q8 baseline, thinking-token handling under structured output, and multi-worker behavior (4 concurrent extractions). Happy to share results here if useful.
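The plan above can be sketched as a small harness. This is an illustration under stated assumptions, not the author's tooling: `extract` is a hypothetical callable that sends one document to whichever backend is under test and returns the raw JSON bytes, and `baseline` is a hypothetical mapping from document name to the bytes produced by the llama.cpp Q8 run. Only the timing, byte comparison, and four-worker concurrency are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def first_mismatch(a: bytes, b: bytes) -> int:
    """Return the index of the first differing byte, or -1 if a == b."""
    if a == b:
        return -1
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One input is a strict prefix of the other.
    return min(len(a), len(b))


def run_benchmark(
    extract: Callable[[str], bytes],  # hypothetical: document name -> JSON bytes
    docs: List[str],
    baseline: Dict[str, bytes],       # hypothetical: baseline output per document
    workers: int = 4,
) -> List[dict]:
    """Run extractions concurrently and compare each output to the baseline."""

    def one(doc: str) -> dict:
        start = time.perf_counter()
        out = extract(doc)
        elapsed = time.perf_counter() - start
        return {
            "doc": doc,
            "seconds": elapsed,
            "mismatch_at": first_mismatch(out, baseline[doc]),
        }

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of docs in the results.
        return list(pool.map(one, docs))
```

Byte-equality is deliberately strict: a difference in whitespace or key order counts as a mismatch, which matches the "byte-equality" wording of the plan. Per-document times measured under concurrency include queueing inside the server, so they are not comparable to single-request latency.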

Thanks.

extent analysis

TL;DR

Add a q8_0 quantization tag for qwen3.5:122b-a10b to the registry to enable fair comparison with the current llama.cpp Q8 baseline.

Guidance

Evaluate the current registry configuration to understand why the q8_0 quantization tag is missing for qwen3.5:122b-a10b.
Consider manually building a Q8_0 GGUF and importing it through a Modelfile as a temporary workaround, but note that this loses the registry auto-update functionality.
Once the M3 Ultra 256 GB machine arrives, test the Ollama MLX backend with the q8_0 quantization tag to validate its performance for single-machine production inference on Mac Studio.
Plan to compare the performance of Ollama with the current llama.cpp Q8 baseline using a 3-5 document benchmark, evaluating wall-clock time, output JSON byte-equality, thinking-token handling, and multi-worker behavior.

Example

The issue itself includes no code snippet.

Notes

The addition of the q8_0 quantization tag is crucial for a fair comparison between Ollama and the current llama.cpp Q8 baseline. The evaluation plan should provide valuable insights into the performance of Ollama's MLX backend for production inference on Mac Studio.

Recommendation

Apply workaround: Manually build a Q8_0 GGUF and import it through a Modelfile until the q8_0 quantization tag is added to the registry. This allows some comparison and testing, albeit without the registry auto-update functionality.
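The import half of this workaround can be sketched as follows, assuming a Q8_0 GGUF has already been produced by other means (building it is out of scope here). The file path and the local model name are hypothetical placeholders. The sketch relies only on the Modelfile `FROM` instruction and the `ollama create <name> -f <Modelfile>` command, and it returns the command rather than running it. Whether a model imported this way runs on the MLX backend is not something the issue states.

```python
from pathlib import Path
from typing import List


def write_modelfile(gguf_path: str, modelfile_path: str) -> str:
    """Write a one-line Modelfile pointing at a local GGUF and return its text."""
    text = f"FROM {gguf_path}\n"
    Path(modelfile_path).write_text(text, encoding="utf-8")
    return text


def create_command(model_name: str, modelfile_path: str) -> List[str]:
    """Build (but do not run) the ollama command that imports the Modelfile."""
    return ["ollama", "create", model_name, "-f", modelfile_path]


if __name__ == "__main__":
    # Hypothetical names: substitute the real GGUF path and preferred local tag.
    write_modelfile("./qwen3.5-122b-a10b-q8_0.gguf", "Modelfile")
    print(" ".join(create_command("qwen3.5-122b-a10b-q8_0-local", "Modelfile")))
```

A model imported this way exists only under the local name, so it has to be rebuilt by hand whenever the upstream weights change, which is the auto-update cost noted above.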
