claude-code: Claude Code persistently strips Spanish accents (tildes) and ñ in generated output despite extensive user mitigations

StepCodex · 2026-05-29T18:12:18Z


Summary

Claude Code consistently strips Spanish accents (tildes) and the letter ñ from generated output, most severely in longer, structured, or file-written content (JSON, code, datasets, docs). This persists despite extensive, layered mitigations on the user side.

In Spanish (and many other languages) diacritics are not cosmetic. Dropping them changes meaning and makes output unusable in professional contexts. The canonical example: "años" (years) vs "anos" (anuses); "año" (year) vs "ano" (anus). Emitting "7 anos de experiencia" in a document is not a typo — it is a different, embarrassing word.

Environment

Claude Code (latest), model: Claude Opus 4.x (1M context)
OS: Windows 11, UTF-8 configured end to end (PYTHONUTF8=1, PYTHONIOENCODING=utf-8); files are written as valid UTF-8
Working language: Spanish (Argentina)

What I have ALREADY tried — and it still fails

This is the frustrating part. I have layered every mitigation the product offers, and the model still drops accents:

A global system prompt (~/.claude/CLAUDE.md) whose #1, top-priority rule mandates correct accents/ñ, with explicit examples and a critical-vocabulary list.
Project-local CLAUDE.md rules repeating the same.
Persistent memories marked CRITICAL.
Hooks.
Automated validators that scan output and flag missing accents/ñ.
Repeated in-conversation corrections.

None of it reliably works. This strongly suggests the behavior cannot be fixed at the user/configuration layer.
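
For reference, a validator of the kind listed above can be as small as the following sketch. It is not the exact validator used in this report: `CRITICAL_WORDS` is a hypothetical vocabulary built from the examples in this issue, and the function names are made up for illustration. It only detects the problem after the fact, and it will flag "ano"/"anos" even in the rare cases where those words are intended.

```python
import re
import unicodedata

# Hypothetical critical-vocabulary list; extend it for your own domain.
CRITICAL_WORDS = ["años", "año", "educación", "política", "católico", "también", "después"]


def strip_diacritics(word: str) -> str:
    """Return the word with combining marks removed (ñ -> n, ó -> o)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Map each stripped form back to the correct spelling, e.g. "anos" -> "años".
STRIPPED_TO_CORRECT = {strip_diacritics(w): w for w in CRITICAL_WORDS}


def find_stripped_words(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, found_word, expected_word) for each suspicious word."""
    # Compose characters first so correctly accented words are seen as one token.
    text = unicodedata.normalize("NFC", text)
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"\w+", line.lower()):
            if word in STRIPPED_TO_CORRECT:
                findings.append((line_number, word, STRIPPED_TO_CORRECT[word]))
    return findings
```

Calling `find_stripped_words` on a generated file's text returns an empty list when none of the listed words appear in stripped form, so it can gate a pipeline step, but it cannot prevent the model from emitting the stripped form in the first place.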

Observed behavior — bimodal, "all-or-nothing" per file

The failure is not gradual; it is bimodal per generated file:

Some files come out essentially perfect (~2.2% accented characters, 0 transliterations).
Others come out almost fully destroyed (~0.01% accented characters): ñ transliterated to "ni" ("anios" instead of "años") and tildes stripped ("educacion", "politica", "catolico", "tambien", "despues" instead of "educación", "política", "católico", "también", "después").

There is rarely a middle ground. A file either respects diacritics or collapses entirely. This points to a self-reinforcing / cascading decoding effect: once one unaccented word is emitted, the surrounding context anchors the model to ASCII and the pattern propagates through the rest of the file.
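
The percentages above can be approximated with a sketch like the one below. It assumes "accented characters" means the Spanish set á, é, í, ó, ú, ü, ñ (both cases) counted as a share of alphabetic characters; the report does not state the exact denominator, and the 0.5% cutoff in `classify` is an arbitrary assumption chosen to sit between the two observed modes (~2.2% and ~0.01%).

```python
import unicodedata

# Assumed definition of "accented character" for Spanish text.
SPANISH_DIACRITICS = set("áéíóúüñÁÉÍÓÚÜÑ")


def accented_ratio(text: str) -> float:
    """Share of alphabetic characters that are Spanish accented letters."""
    # Compose characters so "n" + combining tilde counts as a single "ñ".
    text = unicodedata.normalize("NFC", text)
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    accented = sum(1 for ch in letters if ch in SPANISH_DIACRITICS)
    return accented / len(letters)


def classify(text: str, collapse_threshold: float = 0.005) -> str:
    """Label a file 'ok' or 'collapsed'; the threshold is an assumed cutoff."""
    return "collapsed" if accented_ratio(text) < collapse_threshold else "ok"
```

Running `classify` over every generated file and tallying the two labels is one way to confirm the all-or-nothing pattern on your own outputs.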

Concrete, reproducible evidence

While generating a synthetic dataset (many near-identical generation tasks differing only in input data), using an identical, fully-accented prompt for every task:

Chunk A: 0 transliterations, accents correct throughout — PASS.
Chunk B (same prompt, same pipeline, different input rows): 116 stripped tildes + 29 broken ñ in just 25 records — FAIL.

Apart from the input rows, the only variable was the generation run itself. Notably, when the main model wrote the same content directly (no sub-agent), the output also collapsed to ~0.01% accents, so this is not sub-agent specific; it is a general decoding tendency that worsens with longer/structured output and after context compaction.

Likely root cause (hypothesis)

Accented characters (á, é, í, ó, ú, ñ) are comparatively rare tokens. Much of the "write-to-file / code / JSON" training distribution is ASCII (keys, identifiers, paths). When the model shifts into a "structured/file output" mode it drifts toward the higher-probability ASCII token and drops the diacritic.
Cascading/anchoring: the first unaccented token biases subsequent tokens, producing the all-or-nothing pattern.
It is not an encoding problem: the model produces correct accents in conversational prose, and the files are valid UTF-8. It is a model/decoding default.
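
One way to check the "not an encoding problem" claim on your own files is sketched below. The `diagnose` function and its labels are hypothetical, and the mojibake test is a heuristic (it looks for U+FFFD or "Ã" followed by another Latin-1 range character, which is what UTF-8 Spanish text typically looks like after being decoded with the wrong codec). A file that decodes cleanly, shows no mojibake, and is pure ASCII lost its diacritics at generation time rather than in the write path.

```python
import re


def diagnose(raw: bytes) -> str:
    """Classify raw file bytes as an encoding problem or a generation problem."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # The bytes are not valid UTF-8 at all: a genuine encoding problem.
        return "invalid-utf8"
    # Heuristic: replacement characters or "Ã" + Latin-1 byte suggest the text
    # was decoded with the wrong codec somewhere upstream.
    if "\ufffd" in text or re.search(r"Ã[\u0080-\u00ff]", text):
        return "mojibake"
    if text.isascii():
        # Valid UTF-8 but no diacritics at all: they were never written.
        return "ascii-only"
    return "utf8-with-diacritics"
```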

Impact

Affects every language with diacritics: Spanish, Portuguese, French, Catalan, Vietnamese, Czech, Turkish, German, etc.
Corrupts datasets, code string literals and comments, error messages, docs, PRs and commit messages.
Because no amount of prompt-level instruction reliably fixes it, users cannot solve it themselves — it has to be addressed in the model/decoding or via an official safeguard.

Related issues

#41205 (closed as duplicate)
#32886

The problem is recurring and still unresolved.

Requests

Acknowledge and track this as a structural model/decoding issue, not a user-configuration issue.
Consider constrained / guided decoding that preserves diacritics for non-English locales, or an optional post-generation diacritic-restoration safeguard.
At minimum, document the limitation and provide an official recommendation, since current guidance (use system-prompt rules) demonstrably does not work.
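
To illustrate what a post-generation diacritic-restoration safeguard might look like, here is a minimal dictionary-based sketch. It is not an existing Claude Code feature; `RESTORATIONS` is a hypothetical table built from the examples in this report. It deliberately leaves out "ano"/"anos" because those are valid words on their own, which is exactly why a real safeguard would need language-aware disambiguation rather than a lookup table.

```python
import re

# Hypothetical restoration table; includes the "ni" transliteration of ñ.
RESTORATIONS = {
    "anios": "años",
    "anio": "año",
    "educacion": "educación",
    "politica": "política",
    "catolico": "católico",
    "tambien": "también",
    "despues": "después",
}

# Match any listed word as a whole word, ignoring case.
_WORD = re.compile(r"\b(" + "|".join(map(re.escape, RESTORATIONS)) + r")\b", re.IGNORECASE)


def _match_case(original: str, replacement: str) -> str:
    """Carry over ALL CAPS or initial capital from the original word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def restore_diacritics(text: str) -> str:
    """Replace known stripped forms with their accented spelling."""
    return _WORD.sub(
        lambda m: _match_case(m.group(0), RESTORATIONS[m.group(0).lower()]),
        text,
    )
```

A table like this only repairs words it already knows, so it complements detection rather than replacing a fix at the model or decoding level.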

Steps to reproduce

Add a strong system-prompt rule requiring Spanish accents and ñ.
Ask Claude Code to generate a long Spanish text, or a structured file (JSON/dataset/code with Spanish string values), ideally later in a session / after context compaction.
Observe that some outputs strip tildes and transliterate ñ → "ni" despite the rule, in an all-or-nothing fashion.
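
For structured output, the observation in the last step can be made mechanical with a sketch like this one, which walks a JSON document and reports the paths of string values containing a known stripped form. `SUSPECT_WORDS`, `iter_strings`, and `suspect_paths` are hypothetical names; object keys are skipped on purpose because ASCII keys are usually legitimate.

```python
import json
import re
from typing import Any, Iterator

# Hypothetical patterns taken from the examples in this report.
SUSPECT_WORDS = {"anios", "anio", "educacion", "politica", "catolico", "tambien", "despues"}


def iter_strings(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for every string value; keys are skipped on purpose."""
    if isinstance(node, str):
        yield path, node
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{path}[{index}]")


def suspect_paths(json_text: str) -> list[str]:
    """Return the paths of string values that contain a known stripped word."""
    data = json.loads(json_text)
    hits = []
    for path, value in iter_strings(data):
        words = set(re.findall(r"\w+", value.lower()))
        if words & SUSPECT_WORDS:
            hits.append(path)
    return hits
```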

Thank you. This is a long-standing, daily pain point for Spanish-speaking (and other non-English) users, and it currently has no reliable user-side fix.
