ollama - ✅(Solved) Fix BERT-derived embedding models produce incorrect embeddings for non-ASCII text (strip_accents preprocessing dropped in gguf conversion) [1 pull request, 2 comments, 3 participants]

ollama 2026-04-15 20:25:57


ollama/ollama#15609•Fetched 2026-04-17 08:23:18


Root Cause

Unrelated words containing diacritics collapse to cosine similarity ≈ 1.0 because they all tokenize to [CLS] [UNK] [SEP]. Same-word ASCII-vs-diacritic pairs (e.g. Hokkaidō ↔ Hokkaido) no longer cluster.

PR fix notes

PR #15627: fix: preserve strip_accents preprocessing in BERT tokenizer conversion

Repository: ollama/ollama
Author: MasterOfFeelingFish
State: open | merged: False
Link: https://github.com/ollama/ollama/pull/15627

Description (problem / solution / changelog)

Summary

BERT-derived embedding models produce incorrect embeddings for non-ASCII text because the strip_accents preprocessing step is dropped during GGUF conversion.

Root Cause

The HuggingFace BasicTokenizer used by BERT models applies NFD normalization and strips combining diacritical marks (accents) before WordPiece tokenization. This preprocessing is essential - without it, words like Hokkaidō and Éire fail to match their ASCII equivalents (Hokkaido and Eire) because the diacritics cause the words to tokenize to [UNK].
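The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration of the technique (NFD normalization followed by dropping nonspacing combining marks), not the code from the PR; the function name strip_accents is chosen here for illustration.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """NFD-decompose the text, then drop nonspacing combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for word in ["Hokkaidō", "Éire", "Zürich"]:
    # Lowercasing mirrors do_lower_case=True; it is a separate step from accent stripping.
    print(word, "->", strip_accents(word).lower())
```

With this step in place, the accented and unaccented spellings reach WordPiece as the same string, which is why they are expected to tokenize identically.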

Changes

convert/tokenizer.go: Read the strip_accents field from tokenizer_config.json during tokenizer parsing
convert/convert_bert.go & convert/convert_nomicbert.go: Store strip_accents in GGUF metadata
tokenizer/wordpiece.go:
- Add stripAccents field to WordPiece struct
- Add stripAccents() helper function to remove combining diacritical marks (U+0300-U+036F)
- Apply accent stripping in Encode() when stripAccents is enabled
model/models/bert/embed.go & model/models/nomicbert/model.go: Pass strip_accents config to NewWordPiece()
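One detail that matters when reading strip_accents from tokenizer_config.json is that the field is often absent or null. To my understanding, Hugging Face's BERT tokenizer then lets accent stripping follow do_lower_case, which defaults to True. The sketch below shows that resolution in Python; the helper name effective_strip_accents and the sample config are hypothetical, and the PR's Go code may resolve the default differently.

```python
import json

def effective_strip_accents(config: dict) -> bool:
    """Resolve the accent-stripping setting a converter would need to persist.

    Hypothetical helper. Assumes an unset or null strip_accents follows
    do_lower_case, and that do_lower_case defaults to True.
    """
    strip = config.get("strip_accents")
    if strip is None:
        return bool(config.get("do_lower_case", True))
    return bool(strip)

# Example tokenizer_config.json fragment (made up for illustration).
sample = json.loads('{"do_lower_case": true, "strip_accents": null}')
print(effective_strip_accents(sample))  # True
```

Persisting the resolved boolean rather than the raw field means the runtime tokenizer does not have to reimplement the defaulting rule.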

Impact

This fix ensures that models like mxbai-embed-large, nomic-embed-text, and all-minilm correctly handle non-ASCII text, matching the behavior of the original HuggingFace models.

Fixes #15609

Changed files

convert/convert_bert.go (modified, +1/-0)
convert/convert_nomicbert.go (modified, +1/-0)
convert/tokenizer.go (modified, +11/-2)
model/models/bert/embed.go (modified, +1/-1)
model/models/nomicbert/model.go (modified, +1/-0)
tokenizer/wordpiece.go (modified, +26/-5)
tokenizer/wordpiece_test.go (modified, +2/-1)

What is the issue?

All three BERT-derived embedding models I tested in Ollama produce incorrect embeddings for text containing combining diacritics, while the same models via the HuggingFace transformers pipeline behave correctly. The defect is not model-specific — it appears to be a systematic loss of BasicTokenizer's strip_accents preprocessing somewhere in the HF→gguf conversion path (likely in llama.cpp's convert_hf_to_gguf.py, surfaced to end-users via Ollama).

This affects every production RAG/search/clustering deployment using these models on any multilingual or Unicode-containing corpus.

Affected models (verified)

| Model | Same-word diac↔ASCII (should be high) | Unrelated diac↔diac (should be low) | ASCII control | Failure mode |
| --- | --- | --- | --- | --- |
| `mxbai-embed-large` | 0.511 | 0.904 | 0.513 | Full UNK collapse |
| `nomic-embed-text` | 0.888 | 0.992 | 0.426 | Diacritic attractor |
| `all-minilm` | 0.240 | 0.875 | 0.214 | Broken ASCII equivalence + attractor |

The "Unrelated diac↔diac" column is the headline: unrelated words like Hokkaidō and Éire are embedded as near-identical vectors.

Reproduction

Minimal repro (Python, requires the ollama and numpy packages and the models pulled locally):

import numpy as np
import ollama

def embed(model, text):
    # Request an embedding from the local Ollama server and L2-normalize it.
    r = ollama.embeddings(model=model, prompt=text)
    v = np.array(r["embedding"])
    return v / np.linalg.norm(v)

def cos(a, b):
    # Inputs are unit vectors, so the dot product is the cosine similarity.
    return float(a @ b)

for model in ["mxbai-embed-large", "nomic-embed-text", "all-minilm"]:
    h_diac  = embed(model, "Hokkaidō")
    h_ascii = embed(model, "Hokkaido")
    eire    = embed(model, "Éire")
    print(f"{model}: Hokkaidō↔Hokkaido={cos(h_diac, h_ascii):.3f}  "
          f"Hokkaidō↔Éire={cos(h_diac, eire):.3f}")

Expected: same-word pair ≈ 1.0, unrelated pair low. Actual: unrelated pair ≈ 0.9–1.0.

Full verification pipeline and 147k-pair empirical analysis:

Repro script: https://github.com/emmaleonhart/latent-space-cartography/blob/main/scripts/verify_tokenizer_divergence.py
Writeup with full tables + mechanism walkthrough: https://emmaleonhart.github.io/latent-space-cartography/

Likely root cause

The upstream HuggingFace tokenizers for these models use BasicTokenizer with do_lower_case=True, which (by default) applies NFD normalization and strips combining diacritical marks before WordPiece. Verified directly against mixedbread-ai/mxbai-embed-large-v1:

Hokkaidō  ->  ['hokkaido']   # identical to Hokkaido
Éire      ->  ['e', '##ire'] # identical to Eire
Zürich    ->  ['zurich']     # identical to Zurich

The gguf-converted tokenizer used by Ollama does not appear to apply this preprocessing. Raw diacritics hit WordPiece, miss the vocab, and fall back to [UNK]. Because [UNK] is a fixed token, every short diacritic-only string that can't be WordPiece-split produces [CLS] [UNK] [SEP], which embeds to a single attractor point.
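The [UNK] fallback described above can be reproduced with a toy greedy longest-match WordPiece. This is a sketch under stated assumptions: the five-entry vocabulary is made up for illustration (real vocabularies are far larger and may split these words differently), and the function names are not taken from Ollama or llama.cpp.

```python
import unicodedata

# Toy vocabulary (assumption): contains only the unaccented pieces.
VOCAB = {"[UNK]", "hokkaido", "e", "##ire", "zurich"}

def strip_accents(text):
    """NFD-decompose, then drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def wordpiece(word, vocab=VOCAB):
    """Greedy longest-match-first WordPiece for a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # one unmatched span turns the whole word into [UNK]
        pieces.append(piece)
        start = end
    return pieces

for raw in ["Hokkaidō", "Éire", "Zürich"]:
    lowered = raw.lower()
    print(raw, "raw:", wordpiece(lowered), "stripped:", wordpiece(strip_accents(lowered)))
```

In this toy setup every accented input maps to the same single [UNK] token, while the stripped inputs recover the splits shown in the issue. That is the collapse mechanism in miniature: distinct words become indistinguishable before the model ever sees them.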

The fix is almost certainly in llama.cpp/convert_hf_to_gguf.py — preserving the BasicTokenizer.strip_accents behavior through conversion. I'd expect it to require writing a flag into the gguf metadata and honoring it at tokenization time, but I haven't traced the conversion code in detail and the maintainers will know better.

Scope

Not tested exhaustively, but the mechanism implies any BERT-derived embedding model converted via this pipeline is affected. The three models above cover the most common production choices for local RAG.

Relevant log output

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.17.1

extent analysis

TL;DR

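The issue report proposes fixing this in llama.cpp's convert_hf_to_gguf.py by preserving the BasicTokenizer.strip_accents behavior during conversion. The linked PR #15627 instead makes the change in Ollama's own converter (convert/) and WordPiece tokenizer (tokenizer/wordpiece.go).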

Guidance

Investigate the convert_hf_to_gguf.py script in llama.cpp to understand how the BasicTokenizer preprocessing is handled during the conversion from HuggingFace to gguf format.
Verify that the gguf-converted tokenizer applies lowercasing (do_lower_case=True) and NFD normalization, and strips combining diacritical marks before WordPiece.
Consider adding a flag to the gguf metadata to honor the BasicTokenizer.strip_accents behavior at tokenization time.
Test the modified conversion script with the affected models to ensure the issue is resolved.
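The testing step above could be automated along the following lines. This is a sketch, not an established test suite: check_accent_handling and both thresholds are assumptions chosen to separate the broken and ASCII-control values in the table from the issue, and embed is any callable that maps a string to a vector (for example a thin wrapper around ollama.embeddings).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors given as sequences of numbers."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_accent_handling(embed, same_word_pairs, unrelated_pairs,
                          same_min=0.95, unrelated_max=0.8):
    """Return (text_a, text_b, similarity) for every pair that violates a threshold.

    Thresholds are illustrative assumptions, not values from the issue or the PR.
    """
    failures = []
    for a, b in same_word_pairs:
        s = cosine(embed(a), embed(b))
        if s < same_min:  # accented and ASCII spellings should nearly coincide
            failures.append((a, b, s))
    for a, b in unrelated_pairs:
        s = cosine(embed(a), embed(b))
        if s > unrelated_max:  # unrelated words should not collapse together
            failures.append((a, b, s))
    return failures
```

A call such as check_accent_handling(embed, [("Hokkaidō", "Hokkaido")], [("Hokkaidō", "Éire")]) returns an empty list when both conditions hold, so it can be dropped into an assertion in a regression test.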

Example

No snippet of the fix itself is provided here; it requires changes to the conversion and tokenization code rather than to user code.

Notes

The fix may require collaboration with the maintainers of llama.cpp to ensure the correct implementation of the BasicTokenizer.strip_accents behavior in the gguf-converted tokenizer.

Recommendation

Preserve the BasicTokenizer.strip_accents behavior through conversion and honor it at tokenization time, as its loss is the most likely root cause of the issue.
