ollama - ✅(Solved) Fix gemma4:31b drops all Unicode diacritics (accented characters) in output [1 pull requests, 1 comments, 2 participants]

GitHub stats
ollama/ollama#15229 · Fetched 2026-04-08 02:34:06
Comments: 1 · Participants: 2 · Timeline: 7 · Reactions: 1
Timeline (top): cross-referenced ×2, referenced ×2, assigned ×1, closed ×1

Fix Action

Fixed

PR fix notes

PR #15232: tokenizer: add byte fallback for SentencePiece BPE encoding

Description (problem / solution / changelog)

When BPE merging produces tokens not in the vocabulary, fall back to encoding each UTF-8 byte as <0xHH> byte tokens instead of silently dropping the character. Also teach Decode to convert <0xHH> tokens back to raw bytes.

Fixes #15229, fixes #15231
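
The byte-fallback idea described above can be sketched in a few lines. This is a minimal illustration in Python, not the Go code from the PR: the function names, the toy vocabulary, and the assumption that the input is already split into pieces are all hypothetical. It only shows the two behaviors the description names: emitting one <0xHH> token per UTF-8 byte for a piece that is not in the vocabulary, and turning <0xHH> tokens back into raw bytes when decoding.

```python
import re

# Matches SentencePiece-style byte tokens such as <0xC3>.
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def byte_fallback(piece: str) -> list[str]:
    """Spell a piece as <0xHH> tokens, one per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def encode(pieces: list[str], vocab: set[str]) -> list[str]:
    """Keep pieces found in the vocabulary; fall back to bytes otherwise.

    Hypothetical helper: a real BPE encoder would run merges first. Only
    the final "is this piece in the vocabulary?" step is modeled here.
    """
    tokens: list[str] = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            # Without this branch the piece would be silently dropped.
            tokens.extend(byte_fallback(piece))
    return tokens


def decode(tokens: list[str]) -> str:
    """Rebuild text, converting <0xHH> tokens back to raw bytes."""
    buf = bytearray()
    for tok in tokens:
        match = BYTE_TOKEN.match(tok)
        if match:
            buf.append(int(match.group(1), 16))
        else:
            buf.extend(tok.encode("utf-8"))
    # Bytes are joined before decoding so that multi-byte characters
    # split across several byte tokens are reassembled correctly.
    return buf.decode("utf-8", errors="replace")
```

With a toy vocabulary that lacks accented letters, encode(list("L'élève"), {"L", "'", "l", "v", "e"}) spells each é as <0xC3> <0xA9>, and decode restores the original string.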

Changed files

  • model/models/gemma4/tokenizer_reference_test.go (added, +341/-0)
  • tokenizer/bytepairencoding.go (modified, +55/-14)
  • tokenizer/bytepairencoding_test.go (modified, +117/-35)


What is the issue?

gemma4:31b strips all multi-byte UTF-8 characters (accents, diacritics) from both the content and thinking fields. The output is intelligible but every accented character is silently removed.

Example: asking for a short French sentence with accents produces:

"L' a mang g  for."

instead of something like "L'élève a mangé gâteau en forêt." — every é, è, ê, ù, ç, à is dropped.

This does not happen with gemma3:27b on the same Ollama instance, which correctly outputs accented characters.

Steps to reproduce

# Non-streaming — accents missing
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:31b",
  "messages": [{"role":"user","content":"Écris une phrase courte en français avec des accents (é, è, ê, ù, ç, à)."}],
  "stream": false
}'
# Response: {"message":{"content":"L' a mang g  for.", ...}}

# Same prompt with gemma3:27b — accents present
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma3:27b",
  "messages": [{"role":"user","content":"Écris une phrase courte en français avec des accents (é, è, ê, ù, ç, à)."}],
  "stream": false
}'
# Response: {"message":{"content":"L'élu a dû gérer l'événement malgré..."}}

The thinking field is also affected — diacritics are missing there too.

Disabling thinking with "think": false does not fix the issue.
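
The comparison above can also be scripted. The sketch below is one possible way to do it in Python with only the standard library; the helper names are hypothetical, and it assumes the same local endpoint and response shape shown in the curl commands. It sends the prompt to a model and lists the non-ASCII characters in the reply, so an empty list for a French answer points to dropped diacritics.

```python
import json
import urllib.request

PROMPT = (
    "Écris une phrase courte en français avec des accents "
    "(é, è, ê, ù, ç, à)."
)


def build_payload(model: str) -> bytes:
    """Build the same non-streaming /api/chat body as the curl commands."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "stream": False,
    }
    return json.dumps(body).encode("utf-8")


def non_ascii_chars(text: str) -> list[str]:
    """Return every character outside the ASCII range, in order."""
    return [ch for ch in text if ord(ch) > 127]


def chat(model: str, url: str = "http://localhost:11434/api/chat") -> str:
    """Send the prompt and return message.content from the JSON reply."""
    request = urllib.request.Request(
        url,
        data=build_payload(model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["message"]["content"]
```

Calling non_ascii_chars(chat("gemma4:31b")) and non_ascii_chars(chat("gemma3:27b")) against the same instance gives a quick side-by-side check of the two models.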

Environment

  • Ollama version: 0.20.0-rc0
  • OS: Windows (remote server)
  • Model: gemma4:31b (also tested gemma4:latest)
  • Comparison: gemma3:27b works correctly on the same instance

Expected behavior

Accented characters (é, è, ê, ù, ç, à, etc.) should be preserved in the output, as they are with gemma3.

Additional context

Likely related to the tokenizer/detokenizer for the new Gemma4 architecture introduced in PR #15214. The issue affects all output fields (content and thinking), suggesting it happens at the token decoding stage.

extent analysis

TL;DR

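gemma4:31b drops accented characters because of how the tokenizer handles them, not because of the model weights. According to the notes for PR #15232, SentencePiece BPE encoding silently dropped a character when merging produced a token that was not in the vocabulary. The PR adds a fallback that encodes each UTF-8 byte as a <0xHH> byte token and teaches Decode to convert those tokens back to raw bytes.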

Guidance

  • Investigate the changes introduced in PR #15214, specifically focusing on the tokenizer/detokenizer for the Gemma4 architecture, to understand how it handles UTF-8 characters.
  • Compare the tokenizer/detokenizer configurations or implementations between gemma3:27b and gemma4:31b to identify potential differences that could cause the issue.
  • Test the gemma4:31b model with different input encoding settings or character sets to see if the issue persists, which could help isolate the problem.
  • Consider reaching out to the developers who worked on PR #15214 or the Gemma4 architecture for more insight into potential encoding or decoding issues.

Example

A general check is to confirm that text survives an encode/decode round trip unchanged: any input containing multi-byte UTF-8 characters should come back identical after being tokenized and detokenized.
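
As a generic illustration of that kind of check, a round-trip test can be sketched as follows. This is Python with toy tokenizers invented for the example; none of these names come from Ollama. The lossy tokenizer drops any character outside its small vocabulary, which mimics the symptom in this issue, while the byte-level one keeps every UTF-8 byte.

```python
from typing import Callable, Iterable

# Toy vocabulary with no accented letters (an assumption for the demo).
VOCAB = set("abcdefghijklmnopqrstuvwxyzL' .")


def roundtrip_failures(
    samples: Iterable[str],
    encode: Callable[[str], list],
    decode: Callable[[list], str],
) -> list[tuple[str, str]]:
    """Return (original, restored) pairs that did not round-trip."""
    failures = []
    for text in samples:
        restored = decode(encode(text))
        if restored != text:
            failures.append((text, restored))
    return failures


def lossy_encode(text: str) -> list[str]:
    """Drop characters missing from VOCAB, like the reported behavior."""
    return [ch for ch in text if ch in VOCAB]


def join_decode(tokens: list[str]) -> str:
    """Concatenate character tokens back into a string."""
    return "".join(tokens)


def utf8_encode(text: str) -> list[int]:
    """Byte-level encoding: every UTF-8 byte becomes one token."""
    return list(text.encode("utf-8"))


def utf8_decode(tokens: list[int]) -> str:
    """Reassemble the bytes and decode them as UTF-8."""
    return bytes(tokens).decode("utf-8")
```

Running roundtrip_failures over a few accented sentences with each tokenizer pair shows the difference: the lossy pair reports every sample that contains a diacritic, and the byte-level pair reports none.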

Notes

The fix in PR #15232 changes the tokenizer implementation (tokenizer/bytepairencoding.go and its tests) rather than the model architecture. The report covers gemma4:31b and gemma4:latest; gemma3:27b on the same Ollama instance was not affected.

Recommendation

If you are running a build that does not yet include the fix from PR #15232, use gemma3:27b as a workaround for tasks that need accented characters; the report shows it handles accents correctly on the same instance.
