ollama - 💡(How to fix) Fix Bug: Gemma 4 (gemma4:26b) corrupts Spanish tokens with multi-byte chars (ñ, ü)

ollama2026-05-25 06:58:51

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

gemma4:26b in Ollama 0.20.5 produces token corruption when generating Spanish text containing multi-byte UTF-8 characters (specifically ñ, ü). Words like cigüeñal come out as cigüeencial and tiempo comes out as tiemación, suggesting a de-tokenization buffer issue where a token's morphological tail fuses into the previous token.

This is the same class of bug as #15278 (German Umlaute on Gemma 4) and likely shares root cause — note that PR #10081 fixed the SPM tokenize side but explicitly left de-tokenize unaddressed.

Caveat: this was reproduced on Ollama 0.20.5. If a maintainer can confirm it's already fixed in a later release (≥0.21), please close as duplicate / fixed.

Root Cause

This is the same class of bug as #15278 (German Umlaute on Gemma 4) and likely shares root cause — note that PR #10081 fixed the SPM tokenize side but explicitly left de-tokenize unaddressed.

Fix Action

Workaround

Use MLX / SwiftLM or mlx_lm.server for Spanish-language Gemma 4 inference until detok fix lands in Ollama.

Detailed reproduction report (extended cross-framework controls, regex sweep over 34 prompts × T/NT, broader detection regex w/ false-positive filtering) available on request.

Code Example

$ ollama --version
ollama version is 0.20.5

$ ollama run gemma4:26b "Explica cómo funciona un motor de combustión interna de 4 tiempos. Menciona el cigüeñal."

---

Un "tiempo" corresponde a un movimiento completo del pistón... y una rotación de 180° del cigüeencial.
...la energía química... ocurre en cuatro etapas o "tiemación" distintas.

RAW_BUFFERClick to expand / collapse

Summary

This is the same class of bug as #15278 (German Umlaute on Gemma 4) and likely shares root cause — note that PR #10081 fixed the SPM tokenize side but explicitly left de-tokenize unaddressed.

Caveat: this was reproduced on Ollama 0.20.5. If a maintainer can confirm it's already fixed in a later release (≥0.21), please close as duplicate / fixed.

Reproduction

$ ollama --version
ollama version is 0.20.5

$ ollama run gemma4:26b "Explica cómo funciona un motor de combustión interna de 4 tiempos. Menciona el cigüeñal."

Excerpt of corrupted output (literal, copy-pasted):

Un "tiempo" corresponde a un movimiento completo del pistón... y una rotación de 180° del cigüeencial.
...la energía química... ocurre en cuatro etapas o "tiemación" distintas.

Expected (cigüeñal, tiempo) — confirmed correct on the same model weights via mlx-community/gemma-4-26b-a4b-it-4bit running on Apple MLX / SwiftLM, and confirmed correct on gemma3:27b via Ollama with identical prompt and config.

Environment

macOS Darwin 25.4.0, M4 Pro 24 GB
Ollama 0.20.5 (Homebrew)
OLLAMA_NEW_ENGINE=1, OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0
Reproduced in both streaming and non-streaming modes (THINK=ON and THINK=OFF).

Cross-backend control

Backend	Model	`cigüeñal` rendered correctly?
Ollama 0.20.5	`gemma4:26b`	NO (`cigüeencial`)
SwiftLM (MLX)	`mlx-community/gemma-4-26b-a4b-it-4bit`	YES
Ollama 0.20.5	`gemma3:27b`	YES

Same family, same weights upstream — only the Ollama + Gemma 4 combination breaks.

Hypothesis

SPM detokenizer buffer joins a continuation byte of a multi-byte char with the wrong following token, dropping the tail of one token and leaving the head fused with the next token's tail. Consistent with PR #10081 author note that detokenization side was deferred.

Related upstream tickets

#15278 — same bug class on Gemma 4 with German Umlaute (closed without complete fix)
#10081 — SPM tokenize fix merged (detok side explicitly left unaddressed)
#13454 — Gemma 3 QAT gibberish (related family bug)
llama.cpp PRs #21326 + #21343 — tokenizer/template fixes (April 2026), may already be absorbed in Ollama 0.21+

Suggested fix scope

Detok path for SPM tokenizer when Gemma 4 vocab. Reference llama.cpp PRs #21326 + #21343 (April 2026). Worth a regression test on cigüeñal/tiempo and similar ñ+-encial/-ación patterns.

Workaround

Use MLX / SwiftLM or mlx_lm.server for Spanish-language Gemma 4 inference until detok fix lands in Ollama.

Detailed reproduction report (extended cross-framework controls, regex sweep over 34 prompts × T/NT, broader detection regex w/ false-positive filtering) available on request.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

ollama - 💡(How to fix) Fix Bug: Gemma 4 (gemma4:26b) corrupts Spanish tokens with multi-byte chars (ñ, ü)

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Workaround

Code Example

Summary

Reproduction

Environment

Cross-backend control

Hypothesis

Related upstream tickets

Suggested fix scope

Workaround

Still need to ship something?

TRENDING