ollama: First-class embedding & reranker support for Qwen3-Embedding, nomic-embed-text-v2-moe & multimodal embedders (Safetensors → MLX path)

ollama2026-05-10 09:19:23

ollama/ollama#16076•Fetched 2026-05-11 03:13:27

Author

CharafChnioune

Participants

CharafChnioune

Error Message

hf download nomic-ai/nomic-embed-text-v2-moe --local-dir ./nomic-v2
cat > ./nomic-v2/Modelfile <<'EOF'
FROM .
PARAMETER num_ctx 2048
EOF
ollama create nomic-embed-text-v2-moe-mlx -f ./nomic-v2/Modelfile --experimental
# imports successfully, manifest written

curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text-v2-moe-mlx","input":"Hello"}'
# {"error":"mlx runner failed: Error: unsupported architecture: NomicBertModel (exit: exit status 1)"}

Root Cause

The MLX engine has no NomicBertMoE forward path — it errors with unsupported architecture: NomicBertModel even though the standard nomic-embed-text (v1, non-MoE) BERT path exists somewhere in the codebase (the official nomic-embed-text:latest works fine via GGUF on the legacy backend).
Even if the MLX engine did support it, Capabilities() would still not flag this as embedding-capable because pooling_type is not extracted from the available 1_Pooling/config.json (which exists in this repo, unlike the Qwen3-Embedding case where it does not).
The 475.21M parameter count is the active-expert count; the full ~1.5B MoE weights are loaded but not all routed correctly, suggesting that moe_every_n_layers=2, moe_top_k=2, and moe_impl=megablocks are not parsed by the converter.

Fix Action

Fix / Workaround

The string qwen3_embed is present in the Ollama binary (confirmed via strings), and there is internal scaffolding around it (isEmbedTokensWeight, MTPEmbeddingModel, newEmbedModel, IsEmbedding, CapEmbed/CapEmbedder). But there is no dispatch path from a HuggingFace architectures: ["Qwen3ForCausalLM"] value — or any config_sentence_transformers.json content — to qwen3_embed. It appears reachable only via the official GGUF-only releases that ship with qwen3.pooling_type in their KV. This is the gap.

Response: same shape as today (embeddings: [[...]]); for late-interaction models (ColPali / ColQwen2), add a multivector: true field and return embeddings: [[[...], [...], ...]] with one vector per image patch / token.
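
A client receiving the multi-vector shape described above would typically score a query against a document with ColBERT-style MaxSim: for each query vector take the best dot product over the document vectors, then sum. The following is a minimal sketch of that scoring, assuming both sides are already L2-normalized; the function names are illustrative and not part of any Ollama API.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late-interaction score.

    query_vecs: one vector per query token.
    doc_vecs:   one vector per document token or image patch.
    Vectors are assumed to be L2-normalized, so dot products are cosines.
    """
    # Best-matching document vector for each query vector, summed over the query.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]
page = [[1.0, 0.0], [0.6, 0.8]]
score = maxsim(query, page)
```

Because the score is a sum over query vectors, it is only comparable between documents scored against the same query.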

Feature request: first-class embedding & reranker support for Qwen3-Embedding, nomic-embed-text-v2-moe, and multimodal embedders — including the Safetensors → MLX engine path

Filed against ollama/ollama (CLI/runtime). This consolidates and extends #10989, #12368, #12757, #10602, #13054 with concrete implementation guidance and a broader scope (multimodal + reranker + nomic v2-MoE).

TL;DR

Today, the only first-class way to serve a Qwen3-Embedding (or any sentence-transformers / dual-encoder LM) through /api/embed is to publish a GGUF that already has <arch>.pooling_type baked into its KV metadata (e.g. the official qwen3-embedding:8b Q4_K_M). The Safetensors → Ollama import path:

accepts the weights via convert/convert_qwen3.go,
but never inspects config_sentence_transformers.json / modules.json,
therefore never emits qwen3.pooling_type (or the internal qwen3_embed arch),
so Capabilities() in server/images.go does not append model.CapabilityEmbedding,
so /api/embed returns 500.

For users who want to stay on Apple Silicon + MLX end-to-end (no GGUF in the toolchain at all), this is a hard wall. There is no Modelfile escape hatch — capability detection is metadata-driven, and the Modelfile parser exposes no CAPABILITY embedding / PARAMETER pooling_type directive.

This issue requests three coordinated pieces of work so that an mlx-community/<model> checkpoint (Safetensors, optionally pre-quantized as mxfp8 / mxfp4 / nvfp4) can be ollama create'd and immediately served at /api/embed without any GGUF in the loop:

A. Teach the Safetensors converters (convert/convert_qwen3.go, convert/convert_bert.go, generic) to read config_sentence_transformers.json, modules.json, 1_Pooling/config.json, 2_Normalize/, and emit the right <arch>.pooling_type (+ qwen3_embed / bert_embed style internal arch) so existing capability detection lights up automatically.
B. Add a Modelfile escape hatch (PARAMETER pooling_type last|mean|cls, CAPABILITY embedding, optionally PARAMETER normalize true|false, PARAMETER similarity cosine|dot) so users can publish well-understood embedding models even when the source repo lacks ST configs.
C. Extend the MLX engine path so that the embed mode for tied-embedding causal-LM embedders (Qwen3-Embedding) and BERT-MoE embedders (nomic-embed-text-v2-moe) is part of the same code path that already handles existing bge, nomic-embed-text (v1), mxbai-embed-large. Multimodal embedders (Jina-CLIP v2, ColPali / ColQwen2, ImageBind-style) need a small follow-up: a vision.embedding path that reuses the existing vision tower wiring.

The rest of this issue is the long version: which models, why MLX-only matters, where the empty-path stat error originates, and a concrete patch sketch.

1. Models we want to serve via `/api/embed` on MLX

1.1 Text embedders — high priority

Model	HF repo (source)	MLX-ready repo (target)	Quant we want	Why
Qwen3-Embedding-0.6B	`Qwen/Qwen3-Embedding-0.6B`	`mlx-community/Qwen3-Embedding-0.6B-{4bit,bf16}`	4bit / bf16	Fastest, edge use, RAG default
Qwen3-Embedding-4B	`Qwen/Qwen3-Embedding-4B`	`mlx-community/Qwen3-Embedding-4B-{mxfp8,bf16}`	mxfp8	Quality/cost sweet spot
Qwen3-Embedding-8B	`Qwen/Qwen3-Embedding-8B`	`mlx-community/Qwen3-Embedding-8B-mxfp8` ✓ exists	mxfp8	Best Qwen3 embedder, MTEB SOTA
nomic-embed-text-v2-moe	`nomic-ai/nomic-embed-text-v2-moe`	(we will push `mlx-community/nomic-embed-text-v2-moe-mlx`)	bf16 / 8bit	Multilingual MoE encoder, BERT family, Matryoshka
gte-Qwen2-7B-instruct	`Alibaba-NLP/gte-Qwen2-7B-instruct`	`mlx-community/gte-Qwen2-7B-instruct-{4bit,mxfp8}`	mxfp8	Causal-LM embedder, similar shape to Qwen3-Emb
Stella-en-1.5B-v5	`dunzhang/stella_en_1.5B_v5`	`mlx-community/stella_en_1.5B_v5-{4bit,bf16}`	4bit	Top of MTEB English

1.2 Rerankers — high priority

Model	HF repo	Quant	Why
Qwen3-Reranker-0.6B/4B/8B	`Qwen/Qwen3-Reranker-{0.6,4,8}B`	bf16 / mxfp8	Pair with Qwen3-Embedding; needs an `/api/rerank` or score endpoint
bge-reranker-v2-m3	`BAAI/bge-reranker-v2-m3`	bf16	Reference baseline

There is currently no /api/rerank endpoint. Many users (myself included) implement rerank by computing log-likelihoods of yes/no tokens via /api/generate with logprobs, which is awkward and slow. A proper endpoint that returns scalar relevance scores — and is recognized as a model.CapabilityRerank — would close the loop with the embedders.
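
For reference, the yes/no workaround mentioned above usually reduces to a two-way softmax over the log-probabilities of the "yes" and "no" tokens. A minimal sketch, assuming the two log-probabilities have already been obtained from the server (that retrieval step is not shown, and the function name is hypothetical):

```python
import math

def yes_no_score(logprob_yes, logprob_no):
    """Relevance score in [0, 1] from the log-probs of the "yes" and "no" tokens."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)

confident = yes_no_score(-0.1, -5.0)   # model strongly prefers "yes"
undecided = yes_no_score(-1.2, -1.2)   # equal log-probs
```

One generate call per (query, document) pair is needed to get these log-probabilities, which is the cost a dedicated endpoint would remove.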

1.3 Multimodal embedders — medium priority

Model	HF repo	Modalities	Style
Jina-CLIP-v2	`jinaai/jina-clip-v2`	text + image	Two-tower, single embedding space
ColPali / ColQwen2-v1.0	`vidore/colqwen2-v1.0`	image (page) → text	Late-interaction, ColBERT-style multi-vector
NV-CLIP-v1	`nvidia/NV-CLIP-v1`	text + image	CLIP family
ImageBind-Huge	`facebook/imagebind-huge`	text + image + audio	Six-modality joint space
Voyage-multimodal-3	(proprietary)	text + image	If/when open weights drop

For the MoE / two-tower / late-interaction designs, MLX has all the primitives (the runner already does vision towers for Qwen3-VL, Gemma3-VL, Llama4-VL). What is missing is the embedding-mode forward + pooling + (optional) per-token output wiring on the mlx-engine side, plus a request schema in /api/embed that accepts images: [] / audio: [] next to input.

1.4 Concrete consumers in the wild

This repository has a project policy that is strictly MLX-only on Apple Silicon (CLAUDE.md: no GGUF, no llama.cpp, no --quantize q4_K_M / q8_0, no GGUF FROM/ADAPTER). The blog post at https://ollama.com/blog/mlx is the basis for that policy. Without /api/embed working on Safetensors-imported MLX models, the policy cannot be satisfied for any RAG workload.
Same goes for any team standardizing on mlx-community/* as the canonical MLX distribution channel (which the Hugging Face MLX docs implicitly endorse).

2. The exact failure today

2.0 Second reproducer: nomic-embed-text-v2-moe (NomicBertMoE)

hf download nomic-ai/nomic-embed-text-v2-moe --local-dir ./nomic-v2
cat > ./nomic-v2/Modelfile <<'EOF'
FROM .
PARAMETER num_ctx 2048
EOF
ollama create nomic-embed-text-v2-moe-mlx -f ./nomic-v2/Modelfile --experimental
# imports successfully, manifest written

curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text-v2-moe-mlx","input":"Hello"}'
# {"error":"mlx runner failed: Error: unsupported architecture: NomicBertModel (exit: exit status 1)"}

ollama show reports architecture nomic_bert, parameters 475.21M, quantization float32, Capabilities: completion (note: no embedding). The Safetensors importer accepts NomicBertModel and writes a manifest with the correct internal nomic_bert arch, but:

The MLX engine has no NomicBertMoE forward path — it errors with unsupported architecture: NomicBertModel even though the standard nomic-embed-text (v1, non-MoE) BERT path exists somewhere in the codebase (the official nomic-embed-text:latest works fine via GGUF on the legacy backend).
Even if the MLX engine did support it, Capabilities() would still not flag this as embedding-capable because pooling_type is not extracted from the available 1_Pooling/config.json (which exists in this repo, unlike the Qwen3-Embedding case where it does not).
The 475.21M parameter count is the active-expert count; the full ~1.5B MoE weights are loaded but not all routed correctly, suggesting that moe_every_n_layers=2, moe_top_k=2, and moe_impl=megablocks are not parsed by the converter.

2.1 Primary reproducer: Qwen3-Embedding-8B (Qwen3ForCausalLM)

# Apple Silicon, Ollama 0.23.2, mlx_metal_v4 (libmlx.dylib 0.31.2)
hf download mlx-community/Qwen3-Embedding-8B-mxfp8 --local-dir ./qwen3-emb-8b-mlx
cat > ./qwen3-emb-8b-mlx/Modelfile <<'EOF'
FROM .
PARAMETER num_ctx 32768
EOF
ollama create qwen3-embedding-8b-mlx-mxfp8 -f ./qwen3-emb-8b-mlx/Modelfile --experimental
# imports successfully:
#   "MLX engine initialized" "MLX version"=0.31.2-7-ge8ebdeb device=gpu
#   "Model architecture" arch=Qwen3ForCausalLM
#   "Loaded tensors from manifest" count=651
#   "successfully imported qwen3-embedding-8b-mlx-mxfp8 with 405 layers"

curl http://localhost:11434/api/embed \
  -d '{"model":"qwen3-embedding-8b-mlx-mxfp8","input":"Hello world"}'
# HTTP 500
# {"error":"stat : no such file or directory"}    <-- empty path

2.2 Where `stat : no such file or directory` is produced

The empty-path error is os.Stat("") in llm/server.go (around the runner-launch / projector-validation block, ~L735, and again in projectorMemoryRequirements, ~L855). It is reached because server/images.go populates m.ProjectorPaths from the manifest layer set (application/vnd.ollama.image.projector and friends, ~L771–773). Any sentinel or empty entry seeded from modules.json's references to 1_Pooling/ and 2_Normalize/ ends up as "" and is passed to os.Stat unconditionally. Those directories do not physically exist in mlx-community/Qwen3-Embedding-8B-mxfp8, because mlx-embeddings handles pooling and normalization in code.

I tested this hypothesis by removing both modules.json and config_sentence_transformers.json before running ollama create --experimental. The empty-path stat persists, so at least one more code path seeds empty projector entries beyond the ST configs, most likely manifest layer iteration over zero-size config blobs. This needs a separate audit (server/images.go / server/sched.go).
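
Whatever the source of the empty entries turns out to be, the defensive fix is the same: skip empty paths before touching the filesystem. The sketch below shows the shape of that guard in Python for brevity; the real change would be in Go in llm/server.go, and existing_projectors is a hypothetical helper name.

```python
import os

def existing_projectors(paths):
    """Return the non-empty projector paths, verifying that each one exists."""
    kept = []
    for path in paths:
        if not path:
            # An empty string would otherwise reach stat("") and fail with
            # "no such file or directory" before the model is ever loaded.
            continue
        os.stat(path)  # still raises for a genuinely missing file
        kept.append(path)
    return kept

checked = existing_projectors(["", os.curdir])
```

A genuinely missing projector file still raises, so the guard hides only the empty-entry case and not real packaging errors.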

2.3 Why even fixing the `stat` does not give us `/api/embed`

server/images.go Capabilities():

if f.KeyValue("pooling_type").Valid() {
    capabilities = append(capabilities, model.CapabilityEmbedding)
}
// plus the manifest config branch:
if len(m.Config.Capabilities) > 0 { ... }

convert/convert_qwen3.go (current main):

arch := "qwen3"
if q.NumExperts > 0 {
    arch += "moe"
}
// ... 17 KV keys written, none of them pooling_type

So even with the stat fixed, an imported Qwen3 Safetensors model has no pooling_type KV and no embedding capability, and /api/embed will reject it with "this model does not support embeddings" (the closed-without-fix outcome of #12368 and #12757).
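
The effect of the missing key can be modelled in a few lines. This is a simplified sketch of the metadata-driven check, not the actual Go code: the fallback to "completion" is an assumption based on the ollama show output quoted in §2.0, and the KV values are illustrative.

```python
def capabilities(kv, arch):
    """Simplified model of the metadata-driven capability check."""
    if f"{arch}.pooling_type" in kv:
        return ["embedding"]
    # Assumed fallback: models without pooling metadata are treated as completion models.
    return ["completion"]

# An official GGUF release carries the pooling key (value shown is illustrative).
gguf_kv = {"general.architecture": "qwen3", "qwen3.pooling_type": 3}
# The Safetensors converter writes its usual keys but no pooling_type.
converted_kv = {"general.architecture": "qwen3", "qwen3.context_length": 32768}

gguf_caps = capabilities(gguf_kv, "qwen3")
converted_caps = capabilities(converted_kv, "qwen3")
```

Same weights, different metadata: only the first dictionary yields the embedding capability.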

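2.4 The `qwen3_embed` ghost

The string qwen3_embed is present in the Ollama binary (confirmed via strings), and there is internal scaffolding around it (isEmbedTokensWeight, MTPEmbeddingModel, newEmbedModel, IsEmbedding, CapEmbed/CapEmbedder). But there is no dispatch path from a HuggingFace architectures: ["Qwen3ForCausalLM"] value, or from any config_sentence_transformers.json content, to qwen3_embed. It appears reachable only via the official GGUF-only releases that ship with qwen3.pooling_type in their KV. This is the gap.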

3. Proposed implementation

3.1 (A) Safetensors → embedding metadata

In convert/convert_qwen3.go (and analogously convert_bert.go for nomic v2), add detection of sentence-transformers configs and emit:

// Pseudocode — additions inside KV writer
if st := loadSentenceTransformerConfig(modelDir); st != nil {
    arch = "qwen3_embed" // distinct internal arch enables embed-aware capability + forward path

    kv.Set("qwen3.pooling_type", st.PoolingType)        // "last" | "mean" | "cls"
    kv.Set("qwen3.normalize",     st.Normalize)         // bool
    if st.MaxSeqLength > 0 {
        kv.Set("qwen3.context_length", st.MaxSeqLength) // override n_ctx_train if needed
    }
    if st.SimilarityFn != "" {
        kv.Set("qwen3.similarity_fn", st.SimilarityFn)  // "cosine" | "dot"
    }
    // tied-embedding LMs have no lm_head; we should not require one
    skipLmHead = true
}

Files to read (in this order, first present wins):

1. 1_Pooling/config.json — pooling_mode_cls_token / pooling_mode_mean_tokens / pooling_mode_lasttoken. This is the canonical sentence-transformers signal.
2. config_sentence_transformers.json — prompts.{query,document}, default_prompt_name, similarity_fn_name. (These should be exposed via /api/embed request fields too — see §3.5.)
3. modules.json — presence of 2_Normalize ⇒ normalize=true.
4. Fallback heuristic for Qwen3-Embedding family: tied embeddings + no lm_head ⇒ assume pooling_type=last, normalize=true. Document the heuristic.
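
The file-reading part of this list can be sketched as follows. It is written in Python for brevity (the converter itself is Go), the function and key names are hypothetical, and it reads each file for the fields that file owns. The fallback heuristic is left out because it needs access to the weights.

```python
import json
from pathlib import Path

# Flag in 1_Pooling/config.json -> proposed pooling_type value.
POOLING_FLAGS = {
    "pooling_mode_cls_token": "cls",
    "pooling_mode_mean_tokens": "mean",
    "pooling_mode_lasttoken": "last",
}

def load_json(path):
    """Return parsed JSON, or None when the file is absent."""
    return json.loads(path.read_text()) if path.is_file() else None

def detect_embedding_metadata(model_dir):
    """Collect embedding metadata from sentence-transformers config files."""
    root = Path(model_dir)
    meta = {}

    pooling = load_json(root / "1_Pooling" / "config.json")
    if pooling:
        for flag, name in POOLING_FLAGS.items():
            if pooling.get(flag):
                meta["pooling_type"] = name
                break

    st = load_json(root / "config_sentence_transformers.json")
    if st:
        if st.get("similarity_fn_name"):
            meta["similarity_fn"] = st["similarity_fn_name"]
        if st.get("prompts"):
            meta["prompts"] = st["prompts"]

    modules = load_json(root / "modules.json")
    if modules is not None:
        # A module whose path is 2_Normalize means outputs are L2-normalized.
        meta["normalize"] = any(m.get("path") == "2_Normalize" for m in modules)

    return meta
```

An empty result means none of the files were found, which is the case where the fallback heuristic or a Modelfile override would have to take over.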

Important: do not seed empty entries in m.ProjectorPaths from modules.json. The pooler/normalize for these models is a code-side op, not a separate weight blob. Either skip the path entirely or guard os.Stat("") (llm/server.go).

3.2 (B) Modelfile escape hatch

For models whose source repo lacks ST configs (custom checkpoints, fine-tunes, or cases where the HF repo deleted the configs after upload), expose:

# Modelfile
FROM .

PARAMETER pooling_type last
PARAMETER normalize true
CAPABILITY embedding

server/parser/parser.go (Modelfile parser) needs the new keywords; server/images.go Capabilities() should OR the manifest-config / Modelfile-config / KV-derived signals.
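
To make the OR of signals concrete, here is a small sketch that parses the proposed directives and combines them with the KV and manifest signals. CAPABILITY and PARAMETER pooling_type are the directives proposed in this section and do not exist in Ollama today; both function names are hypothetical.

```python
def parse_embedding_directives(modelfile_text):
    """Extract the proposed CAPABILITY / PARAMETER directives from a Modelfile."""
    out = {"capabilities": [], "parameters": {}}
    for raw in modelfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        keyword = parts[0].upper()
        if keyword == "CAPABILITY" and len(parts) >= 2:
            out["capabilities"].append(parts[1].lower())
        elif keyword == "PARAMETER" and len(parts) == 3:
            out["parameters"][parts[1]] = parts[2]
    return out

def supports_embedding(kv, manifest_caps, modelfile):
    """True if any of the three sources declares the model an embedder."""
    return (
        any(key.endswith(".pooling_type") for key in kv)
        or "embedding" in manifest_caps
        or "embedding" in modelfile["capabilities"]
        or "pooling_type" in modelfile["parameters"]
    )

parsed = parse_embedding_directives(
    "# Modelfile\nFROM .\n\nPARAMETER pooling_type last\n"
    "PARAMETER normalize true\nCAPABILITY embedding\n"
)
```

With this shape, a Modelfile override is enough on its own, and models that already carry pooling metadata keep working unchanged.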

3.3 (C) MLX engine: embedding forward path

The MLX engine already loads tied-embedding causal LMs cleanly (we see Loaded tensors from manifest count=651 for Qwen3-Embedding-8B). What is missing is:

1. A Forward(input, mode=Embed) that returns the chosen layer's hidden states (typically last hidden state) instead of logits.
2. Pooling op (last_token / mean_token / cls), respecting attention mask for mean.
3. Optional L2-normalize (PR #13661 already adds mlx_linalg_norm_l2 — would slot in here).
4. Quantized output safe path: tied-embedding models reuse embed_tokens for the output projection, so no lm_head is needed. The recent quantized-embeddings PR (#14884) is the right substrate; we just need a mode=Embed consumer.
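
The pooling and normalization steps are small enough to show in full. This is a reference sketch over plain Python lists for a single sequence, not MLX code; the mode names follow the proposed pooling_type values, and the function names are illustrative.

```python
import math

def pool(hidden, mask, mode):
    """Reduce per-token hidden states to one vector.

    hidden: list of token vectors for one sequence.
    mask:   1 for real tokens, 0 for padding (same length as hidden).
    """
    real = [h for h, m in zip(hidden, mask) if m]
    if not real:
        raise ValueError("attention mask selects no tokens")
    if mode == "cls":
        return list(real[0])    # first non-padding token
    if mode == "last":
        return list(real[-1])   # last non-padding token
    if mode == "mean":
        n = len(real)
        return [sum(col) / n for col in zip(*real)]  # average over real tokens only
    raise ValueError(f"unknown pooling mode: {mode!r}")

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # third position is padding
embedding = l2_normalize(pool(hidden, mask, "last"))
```

The padded position never contributes: mean pooling averages the two real tokens, and last-token pooling picks the second vector rather than the padding.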

For nomic-embed-text-v2-moe (BERT-MoE, not causal): the BERT path needs MoE-aware forward (router + top-k experts) and cls or mean pooling. Existing bert_embed plumbing can be reused; only the MoE block differs.
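
The MoE block itself amounts to a router plus a weighted mix of the selected experts. The sketch below routes one token through its top-k experts; renormalizing the softmax over only the selected experts is an assumption, since models differ on whether the softmax runs before or after selection, and moe_forward is an illustrative name.

```python
import math

def moe_forward(x, router_logits, experts, top_k=2):
    """Route one token vector through its top_k experts and mix the outputs.

    experts: list of callables mapping a vector to a vector of the same size.
    """
    # Indices of the top_k highest-scoring experts.
    ranked = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]

    # Softmax over the selected logits only (assumed convention, see lead-in).
    peak = max(router_logits[i] for i in chosen)
    weights = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(weights)

    out = [0.0] * len(x)
    for i, w in zip(chosen, weights):
        y = experts[i](x)
        for j, value in enumerate(y):
            out[j] += (w / total) * value
    return out

experts = [
    lambda v: [2.0 * a for a in v],   # expert 0 doubles the input
    lambda v: [0.0 for _ in v],       # expert 1 outputs zeros
    lambda v: [a + 1.0 for a in v],   # expert 2 adds one
]
mixed = moe_forward([1.0, 2.0], [0.0, 0.0, -100.0], experts, top_k=2)
```

Only the chosen experts are evaluated, which is why the active parameter count of an MoE model is smaller than its total parameter count.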

3.4 Multimodal `/api/embed`

Schema extension (request):

POST /api/embed
{
  "model":     "jina-clip-v2-mlx",
  "input":     ["a small black cat"],          // optional if images/audio given
  "images":    ["data:image/png;base64,..."],  // base64 or URL; aligns with /api/generate
  "audio":     ["data:audio/wav;base64,..."],  // for ImageBind
  "modality":  "auto"                          // optional: "text" | "image" | "audio" | "auto"
}

The MLX engine already runs the vision towers for Qwen3VLForConditionalGeneration etc. — we just need an embed-mode path that stops at the projection layer instead of going into the LM head.

Response: same shape as today (embeddings: [[...]]); for late-interaction models (ColPali / ColQwen2), add a multivector: true field and return embeddings: [[[...], [...], ...]] with one vector per image patch / token.

3.5 Sentence-transformers prompt prefixes

Qwen3-Embedding ships with config_sentence_transformers.json prompts:

{
  "prompts": {
    "query":    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
    "document": ""
  },
  "similarity_fn_name": "cosine"
}

The /api/embed request should accept task: "query" | "document" | "<custom>" (or prompt_name) and Ollama should automatically prepend the matching prefix. Today this has to be done client-side, which is fragile.
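
Until that exists, the client-side version looks roughly like this. The prompt table is copied from the config above; apply_prompt is a hypothetical helper, and joining prefix and text without a separator follows how sentence-transformers applies prompts.

```python
PROMPTS = {
    "query": "Instruct: Given a web search query, retrieve relevant passages "
             "that answer the query\nQuery:",
    "document": "",
}

def apply_prompt(text, task, prompts=PROMPTS):
    """Prepend the prompt registered for `task` to the text before embedding."""
    if task not in prompts:
        raise KeyError(f"unknown prompt name: {task!r}")
    return prompts[task] + text

query_input = apply_prompt("what is the capital of france", "query")
document_input = apply_prompt("paris is the capital of france", "document")
```

The fragility is that every client has to carry its own copy of the prompt table, and a silent mismatch between query and document prefixes degrades retrieval quality without raising any error.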

4. New endpoints

4.1 `/api/rerank` (proposed)

POST /api/rerank
{
  "model":    "qwen3-reranker-8b-mlx-mxfp8",
  "query":    "what is the capital of france",
  "documents": ["paris is ...", "london is ...", "berlin is ..."],
  "top_k":    10                  // optional, default = len(documents)
}
// → { "results": [{ "index": 0, "score": 0.998 }, ...] }

Capability: model.CapabilityRerank. Detection: <arch>.task = "reranker" KV (Qwen3-Reranker ships this in its config).
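
A minimal sketch of how per-document scores could be shaped into the proposed response. Sorting by descending score is an assumption (the example above only shows the best hit first), and rerank_results is an illustrative name.

```python
def rerank_results(scores, top_k=None):
    """Build the proposed /api/rerank result list from per-document scores.

    scores: one relevance score per input document, in input order.
    top_k:  keep only the best top_k entries; None keeps all documents.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    # "index" refers back to the position in the request's documents array.
    return {"results": [{"index": i, "score": scores[i]} for i in order]}

response = rerank_results([0.998, 0.01, 0.2], top_k=2)
```

Returning the original index rather than the document text keeps the response small and lets the client map results back to its own records.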

4.2 `/api/embed` schema additions (multimodal, prefix, multivector)

See §3.4 + §3.5 above.

5. Non-goals / out of scope for this issue

Training / fine-tuning embedders inside Ollama.
Replacing GGUF embedders that already work — those should keep working as-is.
Vector DB integration (out of scope; /api/embed is the plumbing).

6. Test plan

For each of the models in §1.1–1.3, after the patch:

# Functional
curl http://localhost:11434/api/embed -d '{"model":"qwen3-embedding-8b-mlx-mxfp8","input":"Hello"}'
# → 200, embedding length = 4096

# Cosine sanity vs reference (mlx-embeddings or transformers)
python3 - <<'PY'
import requests, numpy as np
def emb(text):
    r = requests.post("http://localhost:11434/api/embed",
                      json={"model":"qwen3-embedding-8b-mlx-mxfp8","input":text})
    r.raise_for_status()  # surface an HTTP 500 directly instead of a KeyError below
    return np.array(r.json()["embeddings"][0])
a, b, c = emb("a cat"), emb("a kitten"), emb("a freight train")
sim = lambda x,y: float(np.dot(x,y) / (np.linalg.norm(x)*np.linalg.norm(y)))
# Related texts must score at least 0.10 higher than unrelated ones.
assert sim(a,b) > sim(a,c) + 0.10, (sim(a,b), sim(a,c))
PY

# Multimodal (Jina-CLIP-v2)
curl http://localhost:11434/api/embed -d '{"model":"jina-clip-v2-mlx","images":["data:image/png;base64,..."]}'

# Rerank
curl http://localhost:11434/api/rerank -d '{"model":"qwen3-reranker-8b-mlx-mxfp8","query":"...","documents":["..."]}'

CI bumps: add a small fixture (the 0.6B variants — Qwen3-Embedding-0.6B is ~600 MB at 4-bit) so PRs can run the embedding parity test in <60s on a Mac runner.

7. Cross-references

#10989 — Original feature request for qwen3-embedding / qwen3-reranker (open since Jun 2025, ~70 upvotes, no maintainer engagement).
#10602 — unsupported architecture "Qwen3ForCausalLM" — the non-experimental import path. With --experimental import works, but that just exposes the next layer of issues this issue is about.
#12368 — Qwen3-Embedding-0.6B failing since v0.12.0 — closed without resolution, root cause = backend switch dropped pooling_type pickup.
#12757 — Qwen3-Embedding-8B "model does not support embeddings" — closed without resolution, same root cause.
#13054 — embedding crash on macOS.
#14884 (merged Mar 2026) — mlx: quantized embeddings, fast SwiGLU, runtime fixes — adds quantized-embedding layer primitives. Useful substrate for §3.3 but does not by itself enable /api/embed for these models.
#15621 (open) — nil-guard optional embedding components + exact GELU for BERT — overlaps slightly with §3.3 nomic-v2 path.
#13661 (open) — mlx: implement L2Norm using native mlx_linalg_norm_l2 — needed for §3.3 step 3.
#14739 (open) — handle NaN values in embedding responses — orthogonal but related.
#15760 / #15759 (open) / #15744 / #15743 (closed) — x/mlxrunner per-tensor quant overrides for mixed-precision MoE — relevant to nomic-embed-text-v2-moe (MoE).

8. What I am offering

I have already imported mlx-community/Qwen3-Embedding-8B-mxfp8 via ollama create --experimental on 0.23.2 and reproduced the failure. The model is mirrored at charaf/qwen3-embedding-8b-mlx-mxfp8 on the public hub for anyone who wants a known-broken-on-embed reproducer.
I am willing to drive the convert/convert_qwen3.go patch (§3.1 A) as a first PR, gated on a maintainer signing off on the metadata schema (pooling_type / normalize / similarity_fn keys, qwen3_embed arch name).
A second PR for the Modelfile escape hatch (§3.2 B) is small and isolatable.
The MLX engine forward / pooling path (§3.3 C) is the largest piece and probably needs a maintainer driver; happy to write tests + the BERT-MoE variant.

Please confirm:

Whether qwen3_embed is the intended arch name or whether you prefer qwen3 + pooling_type KV alone.
Whether you prefer the Modelfile keyword form (PARAMETER pooling_type last) or a manifest-config form (CAPABILITY embedding).
Whether /api/rerank is in scope here or should be a separate issue.
Whether multimodal /api/embed (§3.4) should be split off.

Happy to split this into 3–4 issues if that is easier to triage.

Thanks!

Environment

macOS 15.x, Apple Silicon (M-series), 128 GiB unified memory
Ollama 0.23.2
MLX 0.31.2-7-ge8ebdeb (mlx_metal_v4/libmlx.dylib)
mlx-community/Qwen3-Embedding-8B-mxfp8 (Qwen3ForCausalLM, mxfp8, ~7.6B params)
nomic-ai/nomic-embed-text-v2-moe (NomicBertMoE, ~475M active / ~1.5B total)
