ollama - ✅(Solved) Fix qwen35moe architecture missing from vendored llama.cpp -- mmproj/vision loading fails [1 pull requests, 2 comments, 1 participants]

ollama/ollama#15898 · Fetched 2026-05-01 05:33:26

Error Message

llama_model_load: error loading model: error loading model architecture:
unknown model architecture: 'qwen35moe'

ollama create succeeds, but any /api/generate call returns "unable to load model". The server log shows the architecture error above.

Other reports of the same error:

  • #14730 (closed as dup of #14575)
  • #15747 (on Ollama 0.21.0)
  • #15499 (closed as dup of #14575)

Root Cause

PR #14517 added qwen35moe to the Go engine's text runner. But the Go engine does not support split vision models for this architecture -- when projectors are present, it falls back to the C++ llama.cpp runner. Ollama's vendored llama.cpp fork does not have qwen35 or qwen35moe in its architecture table, so the fallback fails.

Upstream ggml-org/llama.cpp already supports both architectures (LLM_ARCH_QWEN35, LLM_ARCH_QWEN35MOE).
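
The fallback path described above can be modeled in a few lines. This is an illustrative sketch only: the set names, `pick_runner`, and `load` are hypothetical and do not mirror Ollama's actual source. The sets are filled in from what the issue and PR #15899 state about the pre-fix situation.

```python
# Illustrative model only. These names are hypothetical, not Ollama's code.
GO_ENGINE_TEXT_ARCHS = {"qwen35moe"}      # added by PR #14517, per the issue
VENDORED_LLAMA_CPP_ARCHS = {"qwen3next"}  # before the fix, per PR #15899


def pick_runner(arch, has_projector):
    """Use the Go engine for text-only models it knows, else fall back."""
    if arch in GO_ENGINE_TEXT_ARCHS and not has_projector:
        return "go"
    return "llama.cpp"


def load(arch, has_projector, llama_cpp_archs=VENDORED_LLAMA_CPP_ARCHS):
    """Return the runner name, or raise if the fallback runner lacks the arch."""
    runner = pick_runner(arch, has_projector)
    if runner == "llama.cpp" and arch not in llama_cpp_archs:
        # Mirrors the wording of the error reported in the server log.
        raise RuntimeError(f"unknown model architecture: '{arch}'")
    return runner
```

In this model, `load("qwen35moe", has_projector=False)` stays on the Go engine, while the same architecture with a projector attached is routed to the fallback runner and fails until `qwen35moe` is added to its table.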

Fix Action

Fixed

PR fix notes

PR #15899: llama: add qwen35/qwen35moe architecture support for community GGUFs

Description (problem / solution / changelog)

Community GGUFs (e.g. bartowski) use upstream llama.cpp's converter, which writes "qwen35moe" as the architecture string. Ollama's vendored llama.cpp only recognizes "qwen3next", causing "unknown model architecture: 'qwen35moe'" errors when loading these files.
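
To see which architecture string a given file carries, the GGUF metadata header can be read directly. The following is a minimal sketch, assuming a little-endian GGUF file of version 2 or later and the `general.architecture` metadata key; the function names are illustrative and not part of any Ollama or llama.cpp tooling.

```python
import io
import struct

# GGUF metadata value types: fixed-size scalars keyed by their type id.
_SCALAR_FMT = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
               6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read and unpack one fixed-size value."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8")


def _read_value(f, vtype):
    """Read one metadata value of the given type id (arrays recurse)."""
    if vtype in _SCALAR_FMT:
        return _read(f, _SCALAR_FMT[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(f):
    """Return the general.architecture string, or None if it is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    if _read(f, "<I") < 2:
        raise ValueError("GGUF v1 (32-bit counts) is not handled here")
    _read(f, "<Q")             # tensor count, unused
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None
```

Typical use is `with open(path, "rb") as f: print(read_architecture(f))`. Per the PR description, a file produced by upstream's converter is expected to report `qwen35moe`.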

This adds full graph-building support for the qwen35 and qwen35moe architectures. The key differences from qwen3next are:

  • Separate attn_qkv (QKV) + attn_gate (Z) projections instead of combined ssm_in (QKVZ)
  • Separate ssm_alpha and ssm_beta tensors instead of combined ssm_beta_alpha
  • IMROPE (ggml_rope_multi with sections) instead of NEOX (ggml_rope_ext)

The delta-net chunked/autoregressive math, conv1d pipeline, gated normalization, and MoE FFN logic are identical to qwen3next.
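
The tensor-level differences listed above suggest a quick way to tell the two layouts apart from a list of tensor names. This is a hedged sketch: the base names come from the PR description, but the `blk.<n>.<name>.weight` naming pattern and the helper names are assumptions made for illustration.

```python
# Base names from the PR description. The "blk.<n>.<name>.weight" pattern
# is an assumption about how they appear inside a GGUF file.
QWEN35_MARKERS = {"attn_qkv", "attn_gate", "ssm_alpha", "ssm_beta"}
QWEN3NEXT_MARKERS = {"ssm_in", "ssm_beta_alpha"}


def base_name(tensor_name):
    """Extract <name> from 'blk.<n>.<name>.<suffix>', else None."""
    parts = tensor_name.split(".")
    if len(parts) >= 3 and parts[0] == "blk":
        return parts[2]
    return None


def classify_layout(tensor_names):
    """Guess the layout from which marker tensors are present."""
    bases = {base_name(name) for name in tensor_names}
    has_qwen35 = bool(bases & QWEN35_MARKERS)
    has_qwen3next = bool(bases & QWEN3NEXT_MARKERS)
    if has_qwen35 and not has_qwen3next:
        return "qwen35-style"
    if has_qwen3next and not has_qwen35:
        return "qwen3next-style"
    return "unknown"
```

Matching is done on the whole name component, so `ssm_beta` is not confused with `ssm_beta_alpha`.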

Fixes #15898

Note: This will be superseded by #15122 when it lands. It is intended as a stopgap for users blocked on this.

Changed files

  • llama/llama.cpp/src/llama-arch.cpp (modified, +67/-0)
  • llama/llama.cpp/src/llama-arch.h (modified, +4/-0)
  • llama/llama.cpp/src/llama-model.cpp (modified, +162/-0)
  • llama/llama.cpp/src/llama-model.h (modified, +5/-0)
  • llama/llama.cpp/src/models/models.h (modified, +51/-0)
  • llama/llama.cpp/src/models/qwen35.cpp (added, +749/-0)

Code Example

llama_model_load: error loading model: error loading model architecture:
unknown model architecture: 'qwen35moe'

---

FROM /path/to/Qwen_Qwen3.6-35B-A3B-IQ2_XS.gguf
FROM /path/to/mmproj-F16.gguf

Bug

Attaching an mmproj (vision projector) GGUF to a qwen35moe model fails with:

llama_model_load: error loading model: error loading model architecture:
unknown model architecture: 'qwen35moe'

This blocks ALL inference (text + vision) when an mmproj is attached via dual-FROM Modelfile.

Reproduction

Reproduced on Kaggle T4x2 (2026-04-30) using:

  • Text GGUF: bartowski/Qwen_Qwen3.6-35B-A3B-GGUF (IQ2_XS)
  • mmproj: Youseff1987/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF-with-mmproj (mmproj-F16)

Modelfile:

FROM /path/to/Qwen_Qwen3.6-35B-A3B-IQ2_XS.gguf
FROM /path/to/mmproj-F16.gguf

ollama create succeeds, but any /api/generate call returns "unable to load model". The server log shows the architecture error above.
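
The failing call can be reproduced from a script. This is a minimal sketch, assuming a local Ollama server on its default address; the model name is hypothetical and stands for whatever name was passed to `ollama create`.

```python
import json
import urllib.error
import urllib.request

# Assumptions: default local Ollama address and a hypothetical model name.
OLLAMA_HOST = "http://localhost:11434"
MODEL_NAME = "qwen36-vision-test"  # hypothetical; use your `ollama create` name


def build_generate_request(model, prompt, host=OLLAMA_HOST):
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def try_generate(model, prompt, host=OLLAMA_HOST):
    """Return (ok, text): the response text, or the error body on failure."""
    req = build_generate_request(model, prompt, host)
    try:
        with urllib.request.urlopen(req, timeout=300) as resp:
            return True, json.loads(resp.read())["response"]
    except urllib.error.HTTPError as err:
        # Non-2xx status: return the server's error body verbatim.
        return False, err.read().decode("utf-8", errors="replace")
    except urllib.error.URLError as err:
        return False, f"server unreachable: {err.reason}"
```

Calling `try_generate(MODEL_NAME, "Say hello.")` against an affected build should return `False` together with the server's error body; the architecture error itself appears in the server log, as noted above.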

Full reproduction notebook: https://github.com/ArkaD171717/Qwen3.6-Compat/blob/main/ollama/test_mmproj_clip_runner.ipynb

Root cause

PR #14517 added qwen35moe to the Go engine's text runner. But the Go engine does not support split vision models for this architecture -- when projectors are present, it falls back to the C++ llama.cpp runner. Ollama's vendored llama.cpp fork does not have qwen35 or qwen35moe in its architecture table, so the fallback fails.

Upstream ggml-org/llama.cpp already supports both architectures (LLM_ARCH_QWEN35, LLM_ARCH_QWEN35MOE).

Proposed fix

Sync qwen35/qwen35moe architecture support from upstream ggml-org/llama.cpp into:

  • llama/llama.cpp/src/llama-arch.h (enum entries)
  • llama/llama.cpp/src/llama-arch.cpp (name map + tensor maps)
  • llama/llama.cpp/src/llama-model.cpp (hparams + graph building)

Related issues

  • #14730 (same error, closed as dup of #14575)
  • #14575 (open, Qwen3.5 loading failures)
  • #15747 (same error on Ollama 0.21.0)
  • #15499 (same error, closed as dup of #14575)
  • #14517 (text runner fix, merged)

extent analysis

TL;DR

Syncing qwen35 and qwen35moe architecture support from upstream ggml-org/llama.cpp into the vendored llama.cpp fork should resolve the model loading issue.

Guidance

  • Confirm that the llama.cpp fork vendored by Ollama lacks the qwen35 and qwen35moe architectures by checking llama-arch.h and llama-arch.cpp.
  • Update the llama-arch.h file to include LLM_ARCH_QWEN35 and LLM_ARCH_QWEN35MOE enum entries.
  • Update the llama-arch.cpp file to include name maps and tensor maps for qwen35 and qwen35moe architectures.
  • Update the llama-model.cpp file to include hparams and graph building support for qwen35 and qwen35moe architectures.
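
The verification step in the list above can be scripted. The sketch below scans the two architecture files for the identifiers named in the issue; that these tokens appear literally in those files is an assumption based on the proposed fix, and the helper names are illustrative.

```python
import re
from pathlib import Path

# Identifiers from the issue text. That they appear literally in these
# files is an assumption based on the proposed fix.
EXPECTED = {
    "llama-arch.h": ["LLM_ARCH_QWEN35", "LLM_ARCH_QWEN35MOE"],
    "llama-arch.cpp": ['"qwen35"', '"qwen35moe"'],
}


def missing_tokens(text, tokens):
    """Return the tokens that do not occur in text as standalone tokens."""
    missing = []
    for token in tokens:
        # Lookarounds stop LLM_ARCH_QWEN35 from matching inside ..._QWEN35MOE.
        pattern = rf"(?<!\w){re.escape(token)}(?!\w)"
        if not re.search(pattern, text):
            missing.append(token)
    return missing


def check_tree(src_dir):
    """Map each expected file to the tokens it is missing."""
    report = {}
    for filename, tokens in EXPECTED.items():
        path = Path(src_dir) / filename
        text = ""
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
        report[filename] = missing_tokens(text, tokens)
    return report
```

Run from the repository root as `check_tree("llama/llama.cpp/src")`. An empty list for each file means the tokens are present; it does not by itself show that graph building in llama-model.cpp is complete.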

Example

No code snippet is provided as the necessary changes are specific to the llama.cpp fork and require careful updates to multiple files.

Notes

The proposed fix assumes that syncing the architecture support from upstream ggml-org/llama.cpp will resolve the issue. However, this may not be the case if there are other underlying problems.

Recommendation

Sync the qwen35 and qwen35moe architecture support from upstream ggml-org/llama.cpp into the vendored llama.cpp fork; this is the most direct fix. PR #15899 does this and describes itself as a stopgap until #15122 lands.
