ollama: First-class embedding & reranker support for Qwen3-Embedding, nomic-embed-text-v2-moe & multimodal embedders (Safetensors → MLX path)

ollama/ollama#16076 · Fetched 2026-05-11 03:13:27

Feature request: first-class embedding & reranker support for Qwen3-Embedding, nomic-embed-text-v2-moe, and multimodal embedders — including the Safetensors → MLX engine path

Filed against ollama/ollama (CLI/runtime). This consolidates and extends #10989, #12368, #12757, #10602, #13054 with concrete implementation guidance and a broader scope (multimodal + reranker + nomic v2-MoE).

TL;DR

Today, the only first-class way to serve a Qwen3-Embedding (or any sentence-transformers / dual-encoder LM) through /api/embed is to publish a GGUF that already has <arch>.pooling_type baked into its KV metadata (e.g. the official qwen3-embedding:8b Q4_K_M). The Safetensors → Ollama import path:

  1. accepts the weights via convert/convert_qwen3.go,
  2. but never inspects config_sentence_transformers.json / modules.json,
  3. therefore never emits qwen3.pooling_type (or the internal qwen3_embed arch),
  4. so Capabilities() in server/images.go does not append model.CapabilityEmbedding,
  5. so /api/embed returns 500.

For users who want to stay on Apple Silicon + MLX end-to-end (no GGUF in the toolchain at all), this is a hard wall. There is no Modelfile escape hatch — capability detection is metadata-driven, and the Modelfile parser exposes no CAPABILITY embedding / PARAMETER pooling_type directive.

This issue requests three coordinated pieces of work so that an mlx-community/<model> checkpoint (Safetensors, optionally pre-quantized as mxfp8 / mxfp4 / nvfp4) can be ollama create'd and immediately served at /api/embed without any GGUF in the loop:

  1. A. Teach the Safetensors converters (convert/convert_qwen3.go, convert/convert_bert.go, generic) to read config_sentence_transformers.json, modules.json, 1_Pooling/config.json, 2_Normalize/, and emit the right <arch>.pooling_type (+ qwen3_embed / bert_embed style internal arch) so existing capability detection lights up automatically.
  2. B. Add a Modelfile escape hatch (PARAMETER pooling_type last|mean|cls, CAPABILITY embedding, optionally PARAMETER normalize true|false, PARAMETER similarity cosine|dot) so users can publish well-understood embedding models even when the source repo lacks ST configs.
  3. C. Extend the MLX engine path so that the embed mode for tied-embedding causal-LM embedders (Qwen3-Embedding) and BERT-MoE embedders (nomic-embed-text-v2-moe) is part of the same code path that already handles existing bge, nomic-embed-text (v1), mxbai-embed-large. Multimodal embedders (Jina-CLIP v2, ColPali / ColQwen2, ImageBind-style) need a small follow-up: a vision.embedding path that reuses the existing vision tower wiring.

The rest of this issue is the long version: which models, why MLX-only matters, where the empty-path stat error originates, and a concrete patch sketch.


1. Models we want to serve via /api/embed on MLX

1.1 Text embedders — high priority

| Model | HF repo (source) | MLX-ready repo (target) | Quant we want | Why |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | Qwen/Qwen3-Embedding-0.6B | mlx-community/Qwen3-Embedding-0.6B-{4bit,bf16} | 4bit / bf16 | Fastest, edge use, RAG default |
| Qwen3-Embedding-4B | Qwen/Qwen3-Embedding-4B | mlx-community/Qwen3-Embedding-4B-{mxfp8,bf16} | mxfp8 | Quality/cost sweet spot |
| Qwen3-Embedding-8B | Qwen/Qwen3-Embedding-8B | mlx-community/Qwen3-Embedding-8B-mxfp8 ✓ exists | mxfp8 | Best Qwen3 embedder, MTEB SOTA |
| nomic-embed-text-v2-moe | nomic-ai/nomic-embed-text-v2-moe | (we will push mlx-community/nomic-embed-text-v2-moe-mlx) | bf16 / 8bit | Multilingual MoE encoder, BERT family, Matryoshka |
| gte-Qwen2-7B-instruct | Alibaba-NLP/gte-Qwen2-7B-instruct | mlx-community/gte-Qwen2-7B-instruct-{4bit,mxfp8} | mxfp8 | Causal-LM embedder, similar shape to Qwen3-Emb |
| Stella-en-1.5B-v5 | dunzhang/stella_en_1.5B_v5 | mlx-community/stella_en_1.5B_v5-{4bit,bf16} | 4bit | Top of MTEB English |

1.2 Rerankers — high priority

| Model | HF repo | Quant | Why |
|---|---|---|---|
| Qwen3-Reranker-0.6B/4B/8B | Qwen/Qwen3-Reranker-{0.6,4,8}B | bf16 / mxfp8 | Pair with Qwen3-Embedding; needs an /api/rerank or score endpoint |
| bge-reranker-v2-m3 | BAAI/bge-reranker-v2-m3 | bf16 | Reference baseline |

There is currently no /api/rerank endpoint. Many users (myself included) implement rerank by computing log-likelihoods of yes/no tokens via /api/generate with logprobs, which is awkward and slow. A proper endpoint that returns scalar relevance scores — and is recognized as a model.CapabilityRerank — would close the loop with the embedders.
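For readers unfamiliar with the yes/no workaround mentioned above, here is a minimal sketch of how two token log-probabilities are typically turned into a scalar relevance score. It assumes the client has already obtained the log-probabilities of the "yes" and "no" tokens; how they are requested is outside this sketch, and the function name `relevance_from_logprobs` is hypothetical.

```python
import math


def relevance_from_logprobs(logprob_yes: float, logprob_no: float) -> float:
    """Two-way softmax over the 'yes' and 'no' token log-probabilities.

    Returns a value in (0, 1); higher means the document is judged more relevant.
    """
    # Subtract the larger value before exponentiating for numerical stability.
    m = max(logprob_yes, logprob_no)
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)


if __name__ == "__main__":
    # Hypothetical log-probabilities for one (query, document) pair.
    print(relevance_from_logprobs(-0.1, -3.0))
```

One such call per document is needed, which is part of why the workaround is slow compared with a dedicated endpoint.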

1.3 Multimodal embedders — medium priority

| Model | HF repo | Modalities | Style |
|---|---|---|---|
| Jina-CLIP-v2 | jinaai/jina-clip-v2 | text + image | Two-tower, single embedding space |
| ColPali / ColQwen2-v1.0 | vidore/colqwen2-v1.0 | image (page) → text | Late-interaction, ColBERT-style multi-vector |
| NV-CLIP-v1 | nvidia/NV-CLIP-v1 | text + image | CLIP family |
| ImageBind-Huge | facebook/imagebind-huge | text + image + audio | Six-modality joint space |
| Voyage-multimodal-3 | (proprietary) | text + image | If/when open weights drop |

For the MoE / two-tower / late-interaction designs, MLX has all the primitives (the runner already does vision towers for Qwen3-VL, Gemma3-VL, Llama4-VL). What is missing is the embedding-mode forward + pooling + (optional) per-token output wiring on the mlx-engine side, plus a request schema in /api/embed that accepts images: [] / audio: [] next to input.

1.4 Concrete consumers in the wild

  • This repository has a project policy that is strictly MLX-only on Apple Silicon (CLAUDE.md: no GGUF, no llama.cpp, no --quantize q4_K_M / q8_0, no GGUF FROM/ADAPTER). The blog post at https://ollama.com/blog/mlx is the basis for that policy. Without /api/embed working on Safetensors-imported MLX models, the policy cannot be satisfied for any RAG workload.
  • Same goes for any team standardizing on mlx-community/* as the canonical MLX distribution channel (which the Hugging Face MLX docs implicitly endorse).

2. The exact failure today

2.0 Second reproducer: nomic-embed-text-v2-moe (NomicBertMoE)

hf download nomic-ai/nomic-embed-text-v2-moe --local-dir ./nomic-v2
cat > ./nomic-v2/Modelfile <<'EOF'
FROM .
PARAMETER num_ctx 2048
EOF
ollama create nomic-embed-text-v2-moe-mlx -f ./nomic-v2/Modelfile --experimental
# imports successfully, manifest written

curl http://localhost:11434/api/embed \
  -d '{"model":"nomic-embed-text-v2-moe-mlx","input":"Hello"}'
# {"error":"mlx runner failed: Error: unsupported architecture: NomicBertModel (exit: exit status 1)"}

ollama show reports architecture nomic_bert, parameters 475.21M, quantization float32, Capabilities: completion (note: no embedding). The Safetensors importer accepts NomicBertModel and writes a manifest with the correct internal nomic_bert arch, but:

  1. The MLX engine has no NomicBertMoE forward path — it errors with unsupported architecture: NomicBertModel even though the standard nomic-embed-text (v1, non-MoE) BERT path exists somewhere in the codebase (the official nomic-embed-text:latest works fine via GGUF on the legacy backend).
  2. Even if the MLX engine did support it, Capabilities() would still not flag this as embedding-capable because pooling_type is not extracted from the available 1_Pooling/config.json (which exists in this repo, unlike the Qwen3-Embedding case where it does not).
  3. The 475.21M parameter count is the active-expert count; the full ~1.5B MoE weights are loaded but not all routed correctly, suggesting moe_every_n_layers=2, moe_top_k=2, moe_impl=megablocks is not parsed by the converter.

2.1 Primary reproducer: Qwen3-Embedding-8B (Qwen3ForCausalLM)

# Apple Silicon, Ollama 0.23.2, mlx_metal_v4 (libmlx.dylib 0.31.2)
hf download mlx-community/Qwen3-Embedding-8B-mxfp8 --local-dir ./qwen3-emb-8b-mlx
cat > ./qwen3-emb-8b-mlx/Modelfile <<'EOF'
FROM .
PARAMETER num_ctx 32768
EOF
ollama create qwen3-embedding-8b-mlx-mxfp8 -f ./qwen3-emb-8b-mlx/Modelfile --experimental
# imports successfully:
#   "MLX engine initialized" "MLX version"=0.31.2-7-ge8ebdeb device=gpu
#   "Model architecture" arch=Qwen3ForCausalLM
#   "Loaded tensors from manifest" count=651
#   "successfully imported qwen3-embedding-8b-mlx-mxfp8 with 405 layers"

curl http://localhost:11434/api/embed \
  -d '{"model":"qwen3-embedding-8b-mlx-mxfp8","input":"Hello world"}'
# HTTP 500
# {"error":"stat : no such file or directory"}    <-- empty path

2.2 Where stat : no such file or directory is produced

The empty-path stat is os.Stat("") in llm/server.go (around the runner-launch / projector-validation block, ~L735, and again in projectorMemoryRequirements, ~L855). server/images.go populates m.ProjectorPaths from the manifest layer set (application/vnd.ollama.image.projector and friends, ~L771–773). My initial hypothesis was that a sentinel/empty entry is seeded from the references in modules.json to 1_Pooling/ and 2_Normalize/, which do not physically exist in mlx-community/Qwen3-Embedding-8B-mxfp8 because mlx-embeddings handles pooling/normalize in code. Such an entry ends up as "" and is passed to os.Stat unconditionally.

I tested this by removing both modules.json and config_sentence_transformers.json before ollama create --experimental. The empty-path stat persists, so the ST configs are not the only source: at least one more code path seeds empty projector entries, likely manifest layer iteration over zero-size config blobs. This needs a separate audit (server/images.go / server/sched.go).

2.3 Why even fixing the stat does not give us /api/embed

server/images.go Capabilities():

if f.KeyValue("pooling_type").Valid() {
    capabilities = append(capabilities, model.CapabilityEmbedding)
}
// plus the manifest config branch:
if len(m.Config.Capabilities) > 0 { ... }

convert/convert_qwen3.go (current main):

arch := "qwen3"
if q.NumExperts > 0 {
    arch += "moe"
}
// ... 17 KV keys written, none of them pooling_type

So even with the stat fixed, an imported Qwen3 Safetensors model has no pooling_type KV, no embedding capability, and /api/embed will reject it with this model does not support embeddings (the closed-without-fix outcome of #12368 and #12757).
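To make the consequence concrete, here is a toy Python model of the capability check quoted above. It is not Ollama's code: the function name is hypothetical, it assumes that KeyValue("pooling_type") resolves to "<arch>.pooling_type", and it assumes that a model without that key is reported as completion-only (which matches the ollama show output in §2.0). The manifest-config branch is left out because its body is elided in the excerpt.

```python
from typing import Dict, List


def capabilities(arch: str, kv: Dict[str, object]) -> List[str]:
    """Toy model of the KV-driven capability check (illustrative only)."""
    # Assumption: the lookup is namespaced by architecture, e.g. "qwen3.pooling_type".
    if f"{arch}.pooling_type" in kv:
        return ["embedding"]
    # Assumption: without a pooling type the model is treated as completion-only.
    return ["completion"]


if __name__ == "__main__":
    # KV as written by the current Safetensors converter: no pooling_type key.
    imported = {"qwen3.block_count": 36, "qwen3.context_length": 32768}
    print(capabilities("qwen3", imported))

    # KV as shipped by a GGUF that already carries the pooling type.
    gguf = dict(imported, **{"qwen3.pooling_type": "last"})
    print(capabilities("qwen3", gguf))
```

Under these assumptions, the only difference between the working and failing cases is a single metadata key, which is why §3.1 focuses on the converter.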

2.4 The qwen3_embed ghost

The string qwen3_embed is present in the Ollama binary (confirmed via strings), and there is internal scaffolding around it (isEmbedTokensWeight, MTPEmbeddingModel, newEmbedModel, IsEmbedding, CapEmbed/CapEmbedder). But there is no dispatch path from a HuggingFace architectures: ["Qwen3ForCausalLM"] value — or any config_sentence_transformers.json content — to qwen3_embed. It appears reachable only via the official GGUF-only releases that ship with qwen3.pooling_type in their KV. This is the gap.


3. Proposed implementation

3.1 (A) Safetensors → embedding metadata

In convert/convert_qwen3.go (and analogously convert_bert.go for nomic v2), add detection of sentence-transformers configs and emit:

// Pseudocode — additions inside KV writer
if st := loadSentenceTransformerConfig(modelDir); st != nil {
    arch = "qwen3_embed" // distinct internal arch enables embed-aware capability + forward path

    kv.Set("qwen3.pooling_type", st.PoolingType)        // "last" | "mean" | "cls"
    kv.Set("qwen3.normalize",     st.Normalize)         // bool
    if st.MaxSeqLength > 0 {
        kv.Set("qwen3.context_length", st.MaxSeqLength) // override n_ctx_train if needed
    }
    if st.SimilarityFn != "" {
        kv.Set("qwen3.similarity_fn", st.SimilarityFn)  // "cosine" | "dot"
    }
    // tied-embedding LMs have no lm_head; we should not require one
    skipLmHead = true
}

Files to read (in this order, first present wins):

  1. 1_Pooling/config.json → pooling_mode_cls_token / pooling_mode_mean_tokens / pooling_mode_lasttoken. This is the canonical sentence-transformers signal.
  2. config_sentence_transformers.json → prompts.{query,document}, default_prompt_name, similarity_fn_name. (These should be exposed via /api/embed request fields too; see §3.5.)
  3. modules.json: presence of 2_Normalize ⇒ normalize=true.
  4. Fallback heuristic for Qwen3-Embedding family: tied embeddings + no lm_head ⇒ assume pooling_type=last, normalize=true. Document the heuristic.
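The lookup order above can be sketched in Python as follows. This is an illustration of the detection logic only, not converter code: `detect_pooling` and its parameters are hypothetical names, and the sketch assumes modules.json is a list of objects with a "path" field, as sentence-transformers writes it.

```python
import json
from pathlib import Path
from typing import Optional, Tuple

# Flags in 1_Pooling/config.json mapped to the pooling_type values proposed above.
FLAG_TO_POOLING = {
    "pooling_mode_cls_token": "cls",
    "pooling_mode_mean_tokens": "mean",
    "pooling_mode_lasttoken": "last",
}


def detect_pooling(model_dir: str, tied_embeddings: bool = False,
                   has_lm_head: bool = True) -> Tuple[Optional[str], bool]:
    """Return (pooling_type, normalize) for a model directory."""
    root = Path(model_dir)
    pooling: Optional[str] = None
    normalize = False

    # Step 1: the canonical sentence-transformers pooling config.
    pooling_cfg = root / "1_Pooling" / "config.json"
    if pooling_cfg.is_file():
        cfg = json.loads(pooling_cfg.read_text())
        for flag, name in FLAG_TO_POOLING.items():
            if cfg.get(flag):
                pooling = name
                break

    # Step 3: a 2_Normalize module listed in modules.json implies normalize=true.
    modules = root / "modules.json"
    if modules.is_file():
        entries = json.loads(modules.read_text())
        normalize = any(m.get("path") == "2_Normalize" for m in entries)

    # Step 4: documented fallback heuristic for the Qwen3-Embedding family.
    if pooling is None and tied_embeddings and not has_lm_head:
        pooling, normalize = "last", True

    return pooling, normalize
```

Step 2 (prompts and similarity function) is omitted here because it does not affect pooling. A converter would pass the weight-derived facts (tied embeddings, missing lm_head) in from its own tensor scan.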

Important: do not seed empty entries in m.ProjectorPaths from modules.json. The pooler/normalize for these models is a code-side op, not a separate weight blob. Either skip the path entirely or guard os.Stat("") (llm/server.go).
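The guard itself is small. Here is the idea in Python rather than Go, as a sketch only: `existing_projector_paths` is a hypothetical helper, and in Python `os.stat("")` fails with FileNotFoundError in the same way the Go call reports "no such file or directory".

```python
import os
from typing import Iterable, List


def existing_projector_paths(paths: Iterable[str]) -> List[str]:
    """Skip empty entries, then stat only the paths that were actually set."""
    checked: List[str] = []
    for path in paths:
        if not path:
            # An empty string is a placeholder, not a file; never stat it.
            continue
        os.stat(path)  # still raises FileNotFoundError for a genuinely missing file
        checked.append(path)
    return checked


if __name__ == "__main__":
    # A list containing only placeholders yields no projector paths and no error.
    print(existing_projector_paths(["", ""]))
```

Filtering at the point where the list is built is preferable to guarding each consumer, since the excerpt above notes at least two stat sites.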

3.2 (B) Modelfile escape hatch

For models whose source repo lacks ST configs (custom checkpoints, fine-tunes, or cases where the HF repo deleted the configs after upload), expose:

# Modelfile
FROM .

PARAMETER pooling_type last
PARAMETER normalize true
CAPABILITY embedding

server/parser/parser.go (Modelfile parser) needs the new keywords; server/images.go Capabilities() should OR the manifest-config / Modelfile-config / KV-derived signals.
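A rough Python sketch of the two halves of this proposal: reading the new directives, then OR-ing the three signals. The directive names come from the Modelfile above; everything else (function names, the simplified line-based parsing) is an assumption for illustration and does not mirror the real Modelfile grammar.

```python
from typing import Dict, Iterable, Set, Tuple


def parse_embedding_directives(modelfile_text: str) -> Tuple[Dict[str, str], Set[str]]:
    """Collect PARAMETER key/value pairs and CAPABILITY names from a Modelfile."""
    params: Dict[str, str] = {}
    caps: Set[str] = set()
    for raw in modelfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 2)
        keyword = parts[0].upper()
        if keyword == "PARAMETER" and len(parts) == 3:
            params[parts[1]] = parts[2]
        elif keyword == "CAPABILITY" and len(parts) >= 2:
            caps.add(parts[1].lower())
    return params, caps


def is_embedding_model(kv_has_pooling: bool, manifest_caps: Iterable[str],
                       modelfile_text: str) -> bool:
    """OR together the KV-derived, manifest-config, and Modelfile signals."""
    params, caps = parse_embedding_directives(modelfile_text)
    return (
        kv_has_pooling
        or "embedding" in set(manifest_caps)
        or "embedding" in caps
        or "pooling_type" in params
    )
```

With the example Modelfile above, `is_embedding_model(False, [], text)` would return True even though the converter emitted no pooling_type KV, which is the point of the escape hatch.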

3.3 (C) MLX engine: embedding forward path

The MLX engine already loads tied-embedding causal LMs cleanly (we see Loaded tensors from manifest count=651 for Qwen3-Embedding-8B). What is missing is:

  1. A Forward(input, mode=Embed) that returns the chosen layer's hidden states (typically last hidden state) instead of logits.
  2. Pooling op (last_token / mean_token / cls), respecting attention mask for mean.
  3. Optional L2-normalize (PR #13661 already adds mlx_linalg_norm_l2 — would slot in here).
  4. Quantized output safe path: tied-embedding models reuse embed_tokens for the output projection, so no lm_head is needed. The recent quantized-embeddings PR (#14884) is the right substrate; we just need a mode=Embed consumer.
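Steps 2 and 3 are standard operations. As a reference for what the MLX implementation would need to reproduce, here is a NumPy sketch (NumPy stands in for MLX purely for illustration; the function names are hypothetical). It takes the final hidden states and an attention mask and returns one vector per sequence.

```python
import numpy as np


def pool(hidden: np.ndarray, mask: np.ndarray, mode: str) -> np.ndarray:
    """Pool (batch, seq, dim) hidden states into (batch, dim).

    mask is (batch, seq) with 1 for real tokens and 0 for padding.
    """
    mask = mask.astype(hidden.dtype)
    if mode == "cls":
        return hidden[:, 0, :]
    if mode == "mean":
        # Zero out padded positions, then divide by the number of real tokens.
        summed = (hidden * mask[:, :, None]).sum(axis=1)
        counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
        return summed / counts
    if mode == "last":
        # Index of the last real token per row; valid for left or right padding.
        last = mask.shape[1] - 1 - np.argmax(mask[:, ::-1], axis=1)
        return hidden[np.arange(hidden.shape[0]), last, :]
    raise ValueError(f"unknown pooling mode: {mode}")


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm, guarding against division by zero."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)
```

A parity test against mlx-embeddings or transformers would compare the output of `l2_normalize(pool(...))` with the vector returned by /api/embed for the same input.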

For nomic-embed-text-v2-moe (BERT-MoE, not causal): the BERT path needs MoE-aware forward (router + top-k experts) and cls or mean pooling. Existing bert_embed plumbing can be reused; only the MoE block differs.
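For orientation, the "router + top-k experts" block can be sketched in NumPy as below. This is a generic top-k MoE layer, not the nomic-embed-text-v2-moe implementation: in particular, renormalizing the router scores over only the selected experts is an assumption, and the real model may apply its softmax differently. All names are hypothetical.

```python
from typing import Callable, Sequence

import numpy as np


def moe_layer(x: np.ndarray, router_w: np.ndarray,
              experts: Sequence[Callable[[np.ndarray], np.ndarray]],
              top_k: int = 2) -> np.ndarray:
    """Generic top-k mixture-of-experts layer.

    x: (tokens, dim) float array; router_w: (dim, n_experts);
    experts: one callable per expert mapping a (dim,) vector to a (dim,) vector.
    """
    logits = x @ router_w                                   # (tokens, n_experts)
    top = np.argsort(-logits, axis=1)[:, :top_k]            # chosen expert ids
    top_logits = np.take_along_axis(logits, top, axis=1)
    # Assumption: softmax over the selected experts only.
    weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            expert = experts[top[t, slot]]
            out[t] += weights[t, slot] * expert(x[t])
    return out
```

The per-token loop keeps the routing explicit; a real engine would batch tokens per expert instead.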

3.4 Multimodal /api/embed

Schema extension (request):

POST /api/embed
{
  "model":     "jina-clip-v2-mlx",
  "input":     ["a small black cat"],          // optional if images/audio given
  "images":    ["data:image/png;base64,..."],  // base64 or URL; aligns with /api/generate
  "audio":     ["data:audio/wav;base64,..."],  // for ImageBind
  "modality":  "auto"                          // optional: "text" | "image" | "audio" | "auto"
}

Response: same shape as today (embeddings: [[...]]); for late-interaction models (ColPali / ColQwen2), add a multivector: true field and return embeddings: [[[...], [...], ...]] with one vector per image patch / token.
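To show why the nested shape matters to clients, here is a sketch of ColBERT-style MaxSim scoring over a multi-vector response. Scoring would remain client-side and is not part of the proposed schema; the function name is hypothetical and the vectors are assumed to be L2-normalized.

```python
import numpy as np


def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_vecs: (n_query_tokens, dim); doc_vecs: (n_doc_vectors, dim).
    For each query vector, take its best-matching document vector, then sum.
    """
    sims = query_vecs @ doc_vecs.T          # (n_query_tokens, n_doc_vectors)
    return float(sims.max(axis=1).sum())


if __name__ == "__main__":
    # Hypothetical 2-dimensional vectors standing in for embeddings[i] entries.
    query = np.array([[1.0, 0.0], [0.0, 1.0]])
    page = np.array([[1.0, 0.0]])
    print(maxsim(query, page))
```

Because each document contributes a variable number of vectors, a flat embeddings: [[...]] response cannot represent these models, hence the extra nesting level and the multivector flag.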

The MLX engine already runs the vision towers for Qwen3VLForConditionalGeneration etc. — we just need an embed-mode path that stops at the projection layer instead of going into the LM head.

3.5 Sentence-transformers prompt prefixes

Qwen3-Embedding ships with config_sentence_transformers.json prompts:

{
  "prompts": {
    "query":    "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery:",
    "document": ""
  },
  "similarity_fn_name": "cosine"
}

The /api/embed request should accept task: "query" | "document" | "<custom>" (or prompt_name) and Ollama should automatically prepend the matching prefix. Today this has to be done client-side, which is fragile.
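The server-side behavior being requested is essentially what sentence-transformers does: prepend the named prompt to the input text. A minimal sketch, with a hypothetical function name and the prompts taken from the JSON above:

```python
from typing import Dict, Optional


def apply_prompt(text: str, prompts: Dict[str, str], task: Optional[str] = None,
                 default_prompt_name: Optional[str] = None) -> str:
    """Prepend the prompt selected by task (or the default) to the input text."""
    name = task or default_prompt_name
    if name is None:
        return text  # no task requested and no default configured
    if name not in prompts:
        raise KeyError(f"unknown prompt name: {name}")
    return prompts[name] + text


if __name__ == "__main__":
    prompts = {
        "query": "Instruct: Given a web search query, retrieve relevant "
                 "passages that answer the query\nQuery:",
        "document": "",
    }
    print(apply_prompt("what is the capital of france", prompts, task="query"))
```

Rejecting unknown names explicitly avoids silently embedding queries without their prefix, which degrades retrieval quality without producing any error.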


4. New endpoints

4.1 /api/rerank (proposed)

POST /api/rerank
{
  "model":    "qwen3-reranker-8b-mlx-mxfp8",
  "query":    "what is the capital of france",
  "documents": ["paris is ...", "london is ...", "berlin is ..."],
  "top_k":    10                  // optional, default = len(documents)
}
// → { "results": [{ "index": 0, "score": 0.998 }, ...] }

Capability: model.CapabilityRerank. Detection: <arch>.task = "reranker" KV (Qwen3-Reranker ships this in its config).
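The response shaping implied by the schema above is straightforward. The following sketch assumes the model has already produced one score per document; `rank_results` is a hypothetical name, and returning results in descending score order is an assumption, since the example response shows only the first entry.

```python
from typing import Dict, List, Optional, Sequence


def rank_results(scores: Sequence[float],
                 top_k: Optional[int] = None) -> Dict[str, List[dict]]:
    """Build a results list of {index, score}, best first, truncated to top_k."""
    # Default top_k is the number of documents, as in the proposed schema.
    k = len(scores) if top_k is None else min(top_k, len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {"results": [{"index": i, "score": scores[i]} for i in order[:k]]}


if __name__ == "__main__":
    # Hypothetical scores for the three documents in the request above.
    print(rank_results([0.998, 0.012, 0.020], top_k=10))
```

Each index refers to the position in the request's documents array, so clients can map scores back without the server echoing document text.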

4.2 /api/embed schema additions (multimodal, prefix, multivector)

See §3.4 + §3.5 above.


5. Non-goals / out of scope for this issue

  • Training / fine-tuning embedders inside Ollama.
  • Replacing GGUF embedders that already work — those should keep working as-is.
  • Vector DB integration (out of scope; /api/embed is the plumbing).

6. Test plan

For each of the models in §1.1–1.3, after the patch:

# Functional
curl http://localhost:11434/api/embed -d '{"model":"qwen3-embedding-8b-mlx-mxfp8","input":"Hello"}'
# → 200, embedding length = 4096

# Cosine sanity vs reference (mlx-embeddings or transformers)
python3 - <<'PY'
import requests, numpy as np
def emb(text):
    r = requests.post("http://localhost:11434/api/embed",
                      json={"model":"qwen3-embedding-8b-mlx-mxfp8","input":text})
    return np.array(r.json()["embeddings"][0])
a, b, c = emb("a cat"), emb("a kitten"), emb("a freight train")
sim = lambda x,y: float(np.dot(x,y) / (np.linalg.norm(x)*np.linalg.norm(y)))
assert sim(a,b) > sim(a,c) + 0.10, (sim(a,b), sim(a,c))
PY

# Multimodal (Jina-CLIP-v2)
curl http://localhost:11434/api/embed -d '{"model":"jina-clip-v2-mlx","images":["data:image/png;base64,..."]}'

# Rerank
curl http://localhost:11434/api/rerank -d '{"model":"qwen3-reranker-8b-mlx-mxfp8","query":"...","documents":["..."]}'

CI bumps: add a small fixture (the 0.6B variants — Qwen3-Embedding-0.6B is ~600 MB at 4-bit) so PRs can run the embedding parity test in <60s on a Mac runner.


7. Cross-references

  • #10989: Original feature request for qwen3-embedding / qwen3-reranker (open since Jun 2025, ~70 upvotes, no maintainer engagement).
  • #10602: unsupported architecture "Qwen3ForCausalLM". This is the non-experimental import path. With --experimental, import works, but that just exposes the next layer of issues this issue is about.
  • #12368: Qwen3-Embedding-0.6B failing since v0.12.0. Closed without resolution; root cause = backend switch dropped pooling_type pickup.
  • #12757: Qwen3-Embedding-8B "model does not support embeddings". Closed without resolution, same root cause.
  • #13054: embedding crash on macOS.
  • #14884 (merged Mar 2026): mlx: quantized embeddings, fast SwiGLU, runtime fixes. Adds quantized-embedding layer primitives. Useful substrate for §3.3 but does not by itself enable /api/embed for these models.
  • #15621 (open): nil-guard optional embedding components + exact GELU for BERT. Overlaps slightly with the §3.3 nomic-v2 path.
  • #13661 (open): mlx: implement L2Norm using native mlx_linalg_norm_l2. Needed for §3.3 step 3.
  • #14739 (open): handle NaN values in embedding responses. Orthogonal but related.
  • #15760 / #15759 (open) / #15744 / #15743 (closed): x/mlxrunner per-tensor quant overrides for mixed-precision MoE. Relevant to nomic-embed-text-v2-moe (MoE).

8. What I am offering

  • I have already imported mlx-community/Qwen3-Embedding-8B-mxfp8 via ollama create --experimental on 0.23.2 and reproduced the failure. The model is mirrored at charaf/qwen3-embedding-8b-mlx-mxfp8 on the public hub for anyone who wants a known-broken-on-embed reproducer.
  • I am willing to drive the convert/convert_qwen3.go patch (§3.1 A) as a first PR, gated on a maintainer signing off on the metadata schema (pooling_type / normalize / similarity_fn keys, qwen3_embed arch name).
  • A second PR for the Modelfile escape hatch (§3.2 B) is small and isolatable.
  • The MLX engine forward / pooling path (§3.3 C) is the largest piece and probably needs a maintainer driver; happy to write tests + the BERT-MoE variant.

Please confirm:

  1. Whether qwen3_embed is the intended arch name or whether you prefer qwen3 + pooling_type KV alone.
  2. Whether you prefer the Modelfile keyword form (PARAMETER pooling_type last) or a manifest-config form (CAPABILITY embedding).
  3. Whether /api/rerank is in scope here or should be a separate issue.
  4. Whether multimodal /api/embed (§3.4) should be split off.

Happy to split this into 3–4 issues if that is easier to triage.

Thanks!


Environment

  • macOS 15.x, Apple Silicon (M-series), 128 GiB unified memory
  • Ollama 0.23.2
  • MLX 0.31.2-7-ge8ebdeb (mlx_metal_v4/libmlx.dylib)
  • mlx-community/Qwen3-Embedding-8B-mxfp8 (Qwen3ForCausalLM, mxfp8, ~7.6B params)
  • nomic-ai/nomic-embed-text-v2-moe (NomicBertMoE, ~475M active / ~1.5B total)
