ollama - ✅(Solved) Fix BERT-derived embedding models produce incorrect embeddings for non-ASCII text (strip_accents preprocessing dropped in gguf conversion) [1 pull requests, 2 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
ollama/ollama#15609Fetched 2026-04-17 08:23:18
View on GitHub
Comments
2
Participants
3
Timeline
4
Reactions
0
Timeline (top)
commented ×2cross-referenced ×1labeled ×1

Root Cause

Unrelated words containing diacritics collapse to cosine similarity ≈ 1.0 because they all tokenize to [CLS] [UNK] [SEP]. Same-word ASCII-vs-diacritic pairs (e.g. HokkaidōHokkaido) no longer cluster.

PR fix notes

PR #15627: fix: preserve strip_accents preprocessing in BERT tokenizer conversion

Description (problem / solution / changelog)

Summary

BERT-derived embedding models produce incorrect embeddings for non-ASCII text because the strip_accents preprocessing step is dropped during GGUF conversion.

Root Cause

The HuggingFace BasicTokenizer used by BERT models applies NFD normalization and strips combining diacritical marks (accents) before WordPiece tokenization. This preprocessing is essential - without it, words like Hokkaidō and Éire fail to match their ASCII equivalents (Hokkaido and Eire) because the diacritics cause the words to tokenize to [UNK].

Changes

  1. convert/tokenizer.go: Read strip_accents field from okenizer_config.json during tokenizer parsing
  2. convert/convert_bert.go & convert/convert_nomicbert.go: Store strip_accents in GGUF metadata
  3. tokenizer/wordpiece.go:
    • Add stripAccents field to WordPiece struct
    • Add stripAccents() helper function to remove combining diacritical marks (U+0300-U+036F)
    • Apply accent stripping in Encode() when stripAccents is enabled
  4. model/models/bert/embed.go & model/models/nomicbert/model.go: Pass strip_accents config to NewWordPiece()

Impact

This fix ensures that models like mxbai-embed-large, omic-embed-text, and ll-minilm correctly handle non-ASCII text, matching the behavior of the original HuggingFace models.

Fixes #15609

Changed files

  • convert/convert_bert.go (modified, +1/-0)
  • convert/convert_nomicbert.go (modified, +1/-0)
  • convert/tokenizer.go (modified, +11/-2)
  • model/models/bert/embed.go (modified, +1/-1)
  • model/models/nomicbert/model.go (modified, +1/-0)
  • tokenizer/wordpiece.go (modified, +26/-5)
  • tokenizer/wordpiece_test.go (modified, +2/-1)

Code Example

import ollama, numpy as np

def embed(model, text):
    r = ollama.embeddings(model=model, prompt=text)
    v = np.array(r["embedding"])
    return v / np.linalg.norm(v)

def cos(a, b): return float(a @ b)

for model in ["mxbai-embed-large", "nomic-embed-text", "all-minilm"]:
    h_diac  = embed(model, "Hokkaidō")
    h_ascii = embed(model, "Hokkaido")
    eire    = embed(model, "Éire")
    print(f"{model}: Hokkaidō↔Hokkaido={cos(h_diac, h_ascii):.3f}  "
          f"Hokkaidō↔Éire={cos(h_diac, eire):.3f}")

---

Hokkaidō  ->  ['hokkaido']   # identical to Hokkaido
Éire      ->  ['e', '##ire'] # identical to Eire
Zürich    ->  ['zurich']     # identical to Zurich

---
RAW_BUFFERClick to expand / collapse

What is the issue?

All three BERT-derived embedding models I tested in Ollama produce incorrect embeddings for text containing combining diacritics, while the same models via the HuggingFace transformers pipeline behave correctly. The defect is not model-specific — it appears to be a systematic loss of BasicTokenizer's strip_accents preprocessing somewhere in the HF→gguf conversion path (likely in llama.cpp's convert_hf_to_gguf.py, surfaced to end-users via Ollama).

Unrelated words containing diacritics collapse to cosine similarity ≈ 1.0 because they all tokenize to [CLS] [UNK] [SEP]. Same-word ASCII-vs-diacritic pairs (e.g. HokkaidōHokkaido) no longer cluster.

This affects every production RAG/search/clustering deployment using these models on any multilingual or Unicode-containing corpus.

Affected models (verified)

ModelSame-word diac↔ASCII (should be high)Unrelated diac↔diac (should be low)ASCII controlFailure mode
mxbai-embed-large0.5110.9040.513Full UNK collapse
nomic-embed-text0.8880.9920.426Diacritic attractor
all-minilm0.2400.8750.214Broken ASCII equivalence + attractor

The "Unrelated diac↔diac" column is the headline: unrelated words like Hokkaidō and Éire are embedded as near-identical vectors.

Reproduction

Minimal repro (Python, requires ollama package and the models pulled locally):

import ollama, numpy as np

def embed(model, text):
    r = ollama.embeddings(model=model, prompt=text)
    v = np.array(r["embedding"])
    return v / np.linalg.norm(v)

def cos(a, b): return float(a @ b)

for model in ["mxbai-embed-large", "nomic-embed-text", "all-minilm"]:
    h_diac  = embed(model, "Hokkaidō")
    h_ascii = embed(model, "Hokkaido")
    eire    = embed(model, "Éire")
    print(f"{model}: Hokkaidō↔Hokkaido={cos(h_diac, h_ascii):.3f}  "
          f"Hokkaidō↔Éire={cos(h_diac, eire):.3f}")

Expected: same-word pair ≈ 1.0, unrelated pair ≈ low. Actual: unrelated pair ≈ 0.9–1.0.

Full verification pipeline and 147k-pair empirical analysis:

Likely root cause

The upstream HuggingFace tokenizers for these models use BasicTokenizer with do_lower_case=True, which (by default) applies NFD normalization and strips combining diacritical marks before WordPiece. Verified directly against mixedbread-ai/mxbai-embed-large-v1:

Hokkaidō  ->  ['hokkaido']   # identical to Hokkaido
Éire      ->  ['e', '##ire'] # identical to Eire
Zürich    ->  ['zurich']     # identical to Zurich

The gguf-converted tokenizer used by Ollama does not appear to apply this preprocessing. Raw diacritics hit WordPiece, miss the vocab, and fall back to [UNK]. Because [UNK] is a fixed token, every short diacritic-only string that can't be WordPiece-split produces [CLS] [UNK] [SEP], which embeds to a single attractor point.

The fix is almost certainly in llama.cpp/convert_hf_to_gguf.py — preserving the BasicTokenizer.strip_accents behavior through conversion. I'd expect it to require writing a flag into the gguf metadata and honoring it at tokenization time, but I haven't traced the conversion code in detail and the maintainers will know better.

Scope

Not tested exhaustively, but the mechanism implies any BERT-derived embedding model converted via this pipeline is affected. The three models above cover the most common production choices for local RAG.

Relevant log output

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.17.1

extent analysis

TL;DR

The issue can be fixed by modifying the convert_hf_to_gguf.py script in llama.cpp to preserve the BasicTokenizer.strip_accents behavior during the conversion process.

Guidance

  • Investigate the convert_hf_to_gguf.py script in llama.cpp to understand how the BasicTokenizer preprocessing is handled during the conversion from HuggingFace to gguf format.
  • Verify that the do_lower_case=True and NFD normalization are applied correctly in the gguf-converted tokenizer to strip combining diacritical marks before WordPiece.
  • Consider adding a flag to the gguf metadata to honor the BasicTokenizer.strip_accents behavior at tokenization time.
  • Test the modified conversion script with the affected models to ensure the issue is resolved.

Example

No code snippet is provided as the issue requires modifications to the convert_hf_to_gguf.py script, which is not publicly available.

Notes

The fix may require collaboration with the maintainers of llama.cpp to ensure the correct implementation of the BasicTokenizer.strip_accents behavior in the gguf-converted tokenizer.

Recommendation

Apply a workaround by modifying the convert_hf_to_gguf.py script to preserve the BasicTokenizer.strip_accents behavior, as this is the most likely root cause of the issue.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

ollama - ✅(Solved) Fix BERT-derived embedding models produce incorrect embeddings for non-ASCII text (strip_accents preprocessing dropped in gguf conversion) [1 pull requests, 2 comments, 3 participants]