Error Message

Error: failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections

What is the issue?

Ollama 0.24.0 fails to load Qwen3.6-35B-A3B GGUFs that include the upstream MTP / nextn prediction layer (e.g. Unsloth's Qwen3.6-35B-A3B-MTP-GGUF). The architecture is read correctly as qwen35moe, but during model init Ollama routes layer 40 (the nextn MTP head) through the regular attention path and aborts:

failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections

The GGUF is well-formed:

- qwen35moe.block_count = 41
- qwen35moe.nextn_predict_layers = 1
- blk.0–blk.39 carry the normal attention/FFN tensors
- blk.40 carries only the MTP head tensors:
  - blk.40.nextn.eh_proj.weight
  - blk.40.nextn.enorm.weight
  - blk.40.nextn.hnorm.weight
  - blk.40.nextn.shared_head_norm.weight

The qwen3next (and shared qwen35moe) loader appears to iterate 0..block_count-1 and require attn_qkv/attn_gate on every block, instead of treating the trailing nextn_predict_layers blocks as MTP heads.
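The suspected behaviour can be illustrated with a small, self-contained sketch. It is not Ollama's loader code: the function names are hypothetical, and the blk.0 to blk.39 tensor names are simplified to a single `attn_qkv.weight` entry per block purely for illustration. It only shows how a check over every block flags blk.40, while a check that subtracts `nextn_predict_layers` does not.

```python
import re

BLOCK_RE = re.compile(r"^blk\.(\d+)\.(.+)$")


def tensors_by_block(tensor_names):
    """Group tensor suffixes by block index, e.g. {40: {"nextn.enorm.weight"}}."""
    blocks = {}
    for name in tensor_names:
        m = BLOCK_RE.match(name)
        if m:
            blocks.setdefault(int(m.group(1)), set()).add(m.group(2))
    return blocks


def blocks_missing_attention(tensor_names, block_count, nextn_predict_layers=0):
    """Return indices of body blocks that carry no attn_* tensor.

    The trailing `nextn_predict_layers` blocks are treated as MTP heads and
    are excluded from the attention check (hypothetical helper).
    """
    blocks = tensors_by_block(tensor_names)
    body_blocks = block_count - nextn_predict_layers
    missing = []
    for i in range(body_blocks):
        suffixes = blocks.get(i, set())
        if not any(s.startswith("attn_") for s in suffixes):
            missing.append(i)
    return missing


# Toy tensor list mirroring the layout described above.
# Assumption: one simplified attention tensor per body block.
names = [f"blk.{i}.attn_qkv.weight" for i in range(40)]
names += [
    f"blk.40.nextn.{t}.weight"
    for t in ("eh_proj", "enorm", "hnorm", "shared_head_norm")
]

# Every block treated as a body block: blk.40 is reported as missing attention.
print(blocks_missing_attention(names, block_count=41))
# Trailing block treated as an MTP head: nothing is reported.
print(blocks_missing_attention(names, block_count=41, nextn_predict_layers=1))
```

With the toy list, the first call reports block 40 and the second reports nothing, which matches the error seen when the metadata key is ignored.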

Setting OLLAMA_NEW_ENGINE=false does not help — the runner is still launched with --ollama-engine and produces the same error.

Affected models

Presumably any Unsloth Qwen3.6 / Qwen3.5 MTP GGUF (only the first entry below was reproduced here; the others are marked as likely), e.g.:

hf.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL
hf.co/unsloth/Qwen3.6-27B-MTP-GGUF (likely)
hf.co/unsloth/Qwen3.5-9B-MTP-GGUF (likely)

The non-MTP counterparts (e.g. unsloth/Qwen3.6-35B-A3B-GGUF) load fine — they simply lack the trailing nextn block.

OS

macOS 26.5 (build 25F71)

GPU

Apple M5 Pro, 64 GB unified memory (Metal)

CPU

Apple M5 Pro

Ollama version

0.24.0 (Homebrew cask ollama-app)

Reproduction

# direct pull via ollama also fails (separate issue: "Error: EOF" after "pulling manifest"),
# so the GGUF was fetched directly from HF and imported via Modelfile:

curl -L -o Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  "https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf"

cat > Modelfile <<'EOF'
FROM ./Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
TEMPLATE """{{ .Prompt }}"""
EOF

ollama create qwen3.6-mtp-test -f Modelfile   # succeeds
ollama run qwen3.6-mtp-test "hi"
# Error: failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections

Relevant server log

time=2026-05-24T23:01:18.012+08:00 level=INFO source=server.go:792 msg="loading model" "model layers"=42 requested=-1
time=2026-05-24T23:01:18.063+08:00 level=INFO source=ggml.go:144 msg="" architecture=qwen35moe file_type=Q4_K_M name=Qwen3.6-35B-A3B description="" num_tensors=753 num_key_values=56
...
time=2026-05-24T23:01:18.204+08:00 level=INFO source=server.go:1251 msg="llm load error: failed to initialize model: qwen3next: layer 40 missing attn_qkv/attn_gate projections"
time=2026-05-24T23:01:18.204+08:00 level=INFO source=sched.go:511 msg="Load failed" model=.../sha256-55983c5a...

(Note that the log says "model layers"=42 while block_count=41 — the extra layer is presumably the output layer; the failure is on the nextn layer regardless.)

Suggested fix

In the qwen35moe / qwen3next loader, honour qwen35moe.nextn_predict_layers (or *.nextn_predict_layers generically): treat the trailing N blocks as MTP/nextn heads, looking for blk.<i>.nextn.* tensors instead of attn_qkv/attn_gate. Either:

1. Load and use them for speculative decoding (the upstream llama.cpp flag is --spec-type draft-mtp --spec-draft-n-max N), or
2. At minimum, skip them so the base model still loads (matching pre-MTP behaviour of just running the non-MTP body).

Option 2 alone would already unblock everyone running Unsloth's MTP GGUFs on Ollama; option 1 would let users benefit from MTP speedups (the Gemma 4 MTP work in #15980 already wires this up for the MLX runner).
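A minimal sketch of the "skip" variant (option 2), assuming the loader has access to the tensor name list plus the two metadata values. The helper `strip_nextn_blocks` is hypothetical and does not correspond to any existing Ollama function; it only shows the intended arithmetic: the body spans `block_count - nextn_predict_layers` blocks, and tensors belonging to later blocks are ignored.

```python
import re

BLOCK_PREFIX_RE = re.compile(r"^blk\.(\d+)\.")


def strip_nextn_blocks(tensor_names, block_count, nextn_predict_layers):
    """Drop tensors of the trailing MTP/nextn blocks (hypothetical helper).

    Returns the tensor names to keep and the number of body blocks that
    should be built as regular layers.
    """
    body_blocks = block_count - nextn_predict_layers
    if body_blocks <= 0:
        raise ValueError("nextn_predict_layers must be smaller than block_count")
    kept = []
    for name in tensor_names:
        m = BLOCK_PREFIX_RE.match(name)
        if m and int(m.group(1)) >= body_blocks:
            continue  # tensor belongs to a trailing MTP head block: skip it
        kept.append(name)
    return kept, body_blocks


# Illustrative names only; blk.39.attn_qkv.weight is a simplified stand-in.
example = [
    "token_embd.weight",
    "blk.39.attn_qkv.weight",
    "blk.40.nextn.eh_proj.weight",
    "output.weight",
]
kept, body_blocks = strip_nextn_blocks(example, block_count=41, nextn_predict_layers=1)
print(body_blocks, kept)
```

Tensors outside any `blk.<i>.` namespace, such as the embedding and output weights, pass through unchanged, so the base model would be built from the first 40 blocks only.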

ollama: Qwen3.6-35B-A3B MTP GGUF (nextn) fails to load: 'qwen3next: layer 40 missing attn_qkv/attn_gate projections'
