ollama: Qwen3.6-35B-A3B MTP GGUF (nextn) fails to load: 'qwen3next: layer 40 missing attn_qkv/attn_gate projections'




What is the issue?

Ollama 0.24.0 fails to load Qwen3.6-35B-A3B GGUFs that include the upstream MTP / nextn prediction layer (e.g. Unsloth's Qwen3.6-35B-A3B-MTP-GGUF). The architecture is read correctly as qwen35moe, but during model init Ollama routes layer 40 (the nextn MTP head) through the regular attention path and aborts:

failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections

The GGUF is well-formed:

  • qwen35moe.block_count = 41
  • qwen35moe.nextn_predict_layers = 1
  • blk.0 through blk.39 carry the normal attention/FFN tensors
  • blk.40 carries only the MTP head tensors:
    • blk.40.nextn.eh_proj.weight
    • blk.40.nextn.enorm.weight
    • blk.40.nextn.hnorm.weight
    • blk.40.nextn.shared_head_norm.weight

The qwen3next (and shared qwen35moe) loader appears to iterate 0..block_count-1 and require attn_qkv/attn_gate on every block, instead of treating the trailing nextn_predict_layers blocks as MTP heads.
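One way to confirm the layout described above is to group tensor names by block index and list the blocks that carry nothing but nextn.* tensors. The following is a minimal Python sketch, not Ollama's code: the helper names are hypothetical, the blk.40 names come from this report, and the body tensor names are placeholders taken from the error message rather than a full dump of the file.

```python
import re
from collections import defaultdict

# Matches per-block tensor names such as "blk.40.nextn.eh_proj.weight".
BLOCK_RE = re.compile(r"^blk\.(\d+)\.(.+)$")


def group_by_block(tensor_names):
    """Map block index -> set of tensor suffixes (the part after 'blk.<i>.')."""
    blocks = defaultdict(set)
    for name in tensor_names:
        match = BLOCK_RE.match(name)
        if match:  # non-block tensors (embeddings, output head) are ignored
            blocks[int(match.group(1))].add(match.group(2))
    return dict(blocks)


def nextn_only_blocks(tensor_names):
    """Return sorted indices of blocks whose tensors all sit under 'nextn.'."""
    blocks = group_by_block(tensor_names)
    return sorted(
        index
        for index, suffixes in blocks.items()
        if all(suffix.startswith("nextn.") for suffix in suffixes)
    )


# Illustrative subset only: the body names below are placeholders.
names = [
    "blk.0.attn_qkv.weight",
    "blk.0.attn_gate.weight",
    "blk.39.attn_qkv.weight",
    "blk.39.attn_gate.weight",
    "blk.40.nextn.eh_proj.weight",
    "blk.40.nextn.enorm.weight",
    "blk.40.nextn.hnorm.weight",
    "blk.40.nextn.shared_head_norm.weight",
]

print(nextn_only_blocks(names))  # [40]
```

Run against the real tensor list, a result of [40] would match qwen35moe.nextn_predict_layers = 1 with block_count = 41.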

Setting OLLAMA_NEW_ENGINE=false does not help — the runner is still launched with --ollama-engine and produces the same error.

Affected models

Any Unsloth Qwen3.6 / Qwen3.5 MTP GGUF, e.g.:

  • hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL
  • hf.co/unsloth/Qwen3.6-27B-MTP-GGUF (likely)
  • hf.co/unsloth/Qwen3.5-9B-MTP-GGUF (likely)

The non-MTP counterparts (e.g. unsloth/Qwen3.6-35B-A3B-GGUF) load fine — they simply lack the trailing nextn block.

OS

macOS 26.5 (build 25F71)

GPU

Apple M5 Pro, 64 GB unified memory (Metal)

CPU

Apple M5 Pro

Ollama version

0.24.0 (Homebrew cask ollama-app)

Reproduction

# direct pull via ollama also fails (separate issue: "Error: EOF" after "pulling manifest"),
# so the GGUF was fetched directly from HF and imported via Modelfile:

curl -L -o Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf"

cat > Modelfile <<'EOF'
FROM ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
TEMPLATE """{{ .Prompt }}"""
EOF

ollama create qwen3.6-mtp-test -f Modelfile   # succeeds
ollama run qwen3.6-mtp-test "hi"
# Error: failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections

Relevant server log

time=2026-05-24T23:01:18.012+08:00 level=INFO source=server.go:792 msg="loading model" "model layers"=42 requested=-1
time=2026-05-24T23:01:18.063+08:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35moe file_type=Q4_K_M name=Qwen3.6-35B-A3B description="" num_tensors=753 num_key_values=56
...
time=2026-05-24T23:01:18.204+08:00 level=INFO source=server.go:1251 msg="llm load error: failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections"
time=2026-05-24T23:01:18.204+08:00 level=INFO source=sched.go:511 msg="Load failed" model=.../sha256-55983c5a...

(Note that the log says "model layers"=42 while block_count=41 — the extra layer is presumably the output layer; the failure is on the nextn layer regardless.)

Suggested fix

In the qwen35moe / qwen3next loader, honour qwen35moe.nextn_predict_layers (or *.nextn_predict_layers generically): treat the trailing N blocks as MTP/nextn heads, looking for blk.<i>.nextn.* tensors instead of attn_qkv/attn_gate. Either:

  1. Load and use them for speculative decoding (the upstream llama.cpp flag is --spec-type draft-mtp --spec-draft-n-max N), or
  2. At minimum, skip them so the base model still loads (matching pre-MTP behaviour of just running the non-MTP body).

Option 2 alone would already unblock everyone running Unsloth's MTP GGUFs on Ollama; option 1 would let users benefit from MTP speedups (the Gemma 4 MTP work in #15980 already wires this up for the MLX runner).
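The index arithmetic behind option 2 can be sketched as follows. This is an illustration in Python, not a patch for Ollama's Go loader: split_blocks is a hypothetical helper, and reading a missing nextn_predict_layers key as 0 is an assumption made so that non-MTP GGUFs keep their current behaviour.

```python
def split_blocks(metadata, arch):
    """Split block indices into (body, nextn) ranges from GGUF metadata.

    A missing '<arch>.nextn_predict_layers' key is treated as 0
    (assumption: non-MTP GGUFs simply omit it).
    """
    block_count = int(metadata[f"{arch}.block_count"])
    n_nextn = int(metadata.get(f"{arch}.nextn_predict_layers", 0))
    if not 0 <= n_nextn < block_count:
        raise ValueError(
            f"nextn_predict_layers={n_nextn} is invalid for block_count={block_count}"
        )
    n_body = block_count - n_nextn
    # Body blocks go through the normal attention/FFN path; the trailing
    # blocks hold only nextn.* tensors and are skipped (option 2) or
    # handed to an MTP head loader (option 1).
    return range(0, n_body), range(n_body, block_count)


# Values reported for the failing GGUF in this issue.
meta = {"qwen35moe.block_count": 41, "qwen35moe.nextn_predict_layers": 1}
body, nextn = split_blocks(meta, "qwen35moe")
print(len(body), list(nextn))  # 40 [40]
```

With this split, the attn_qkv/attn_gate check would run only over the body range, so block 40 would never reach it.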
