claude-code: Claude Code persistently strips Spanish accents (tildes) and ñ in generated output despite extensive user mitigations (bimodal, all-or-nothing per file)

Summary

Claude Code consistently strips Spanish accents (tildes) and the letter ñ from generated output — most severely in longer, structured, or file-written content (JSON, code, datasets, docs). This persists despite extensive, layered mitigations on the user side.

In Spanish (and many other languages) diacritics are not cosmetic. Dropping them changes meaning and makes output unusable in professional contexts. The canonical example: "años" (years) vs "anos" (anuses); "año" (year) vs "ano" (anus). Emitting "7 anos de experiencia" in a document is not a typo — it is a different, embarrassing word.

Environment

  • Claude Code (latest), model: Claude Opus 4.x (1M context)
  • OS: Windows 11, UTF-8 configured end to end (PYTHONUTF8=1, PYTHONIOENCODING=utf-8); files are written as valid UTF-8
  • Working language: Spanish (Argentina)

What I have ALREADY tried — and it still fails

This is the frustrating part. I have layered every mitigation the product offers, and the model still drops accents:

  • A global system prompt (~/.claude/CLAUDE.md) whose #1, top-priority rule mandates correct accents/ñ, with explicit examples and a critical-vocabulary list.
  • Project-local CLAUDE.md rules repeating the same.
  • Persistent memories marked CRITICAL.
  • Hooks.
  • Automated validators that scan output and flag missing accents/ñ.
  • Repeated in-conversation corrections.

None of it reliably works. This strongly suggests the behavior cannot be fixed at the user/configuration layer.
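For readers who want to reproduce the "automated validators" item above, here is a minimal sketch of one possible validator. It is not the validator used in this report: the watch list `SUSPECT_FORMS` and the function `find_missing_diacritics` are hypothetical names, and the entries are only the examples quoted in this issue.

```python
import re

# Hypothetical watch list: stripped or transliterated form -> correct form.
# Entries are the examples quoted in this report; a real list would be longer.
# Forms that are themselves valid words (e.g. "anos") are left out on purpose,
# because a plain lookup cannot tell which word was intended.
SUSPECT_FORMS = {
    "anios": "años",
    "educacion": "educación",
    "politica": "política",
    "catolico": "católico",
    "tambien": "también",
    "despues": "después",
}

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def find_missing_diacritics(text):
    """Return (line_number, found, expected) for every suspect word in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in WORD_RE.finditer(line):
            word = match.group(0)
            expected = SUSPECT_FORMS.get(word.lower())
            if expected is not None:
                hits.append((line_number, word, expected))
    return hits


if __name__ == "__main__":
    sample = "Tiene 7 anios de experiencia.\nTambien estudió educación."
    for line_number, found, expected in find_missing_diacritics(sample):
        print(f"line {line_number}: {found!r} should probably be {expected!r}")
```

A validator like this only flags the damage after the fact; as the report says, it does not stop the model from producing it.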

Observed behavior — bimodal, "all-or-nothing" per file

The failure is not gradual; it is bimodal per generated file:

  • Some files come out essentially perfect (~2.2% accented characters, 0 transliterations).
  • Others come out almost fully destroyed (~0.01% accented characters): ñ transliterated to "ni" ("anios" instead of "años") and tildes stripped ("educacion", "politica", "catolico", "tambien", "despues" instead of "educación", "política", "católico", "también", "después").

There is rarely a middle ground. A file either respects diacritics or collapses entirely. This points to a self-reinforcing / cascading decoding effect: once one unaccented word is emitted, the surrounding context anchors the model to ASCII and the pattern propagates through the rest of the file.
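The bimodal split can be measured with a simple per-file ratio. The sketch below is one way to compute it, under stated assumptions: the report does not say whether its percentages are relative to all characters or to letters only (letters are used here), and the `threshold` default is a hypothetical cut-off placed between the two reported modes (~2.2% and ~0.01%).

```python
import unicodedata

# Spanish letters that carry a diacritic, in precomposed (NFC) form.
ACCENTED = set("áéíóúüñÁÉÍÓÚÜÑ")


def accented_ratio(text):
    """Fraction of alphabetic characters that carry a Spanish diacritic."""
    # Normalize first so decomposed sequences (n + combining tilde) count too.
    text = unicodedata.normalize("NFC", text)
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in ACCENTED for ch in letters) / len(letters)


def classify(text, threshold=0.005):
    """Label a file 'ok' or 'collapsed'; the threshold is an assumed value."""
    return "ok" if accented_ratio(text) >= threshold else "collapsed"


if __name__ == "__main__":
    print(classify("La educación pública también mejoró después."))
    print(classify("La educacion publica tambien mejoro despues."))
```

Running `classify` over every generated file gives a quick way to see whether outputs really fall into two clusters rather than a continuum.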

Concrete, reproducible evidence

While generating a synthetic dataset (many near-identical generation tasks differing only in input data), using an identical, fully-accented prompt for every task:

  • Chunk A: 0 transliterations, accents correct throughout — PASS.
  • Chunk B (same prompt, same pipeline, different input rows): 116 stripped tildes + 29 broken ñ in just 25 records — FAIL.

Apart from the input rows, the only variable was the generation run itself. Notably, when the main model wrote the same content directly (no sub-agent), the output also collapsed to ~0.01% accents. So this is not sub-agent specific; it is a general decoding tendency that worsens with longer/structured output and after context compaction.

Likely root cause (hypothesis)

  • Accented characters (á, é, í, ó, ú, ñ) are comparatively rare tokens. Much of the "write-to-file / code / JSON" training distribution is ASCII (keys, identifiers, paths). When the model shifts into a "structured/file output" mode it drifts toward the higher-probability ASCII token and drops the diacritic.
  • Cascading/anchoring: the first unaccented token biases subsequent tokens, producing the all-or-nothing pattern.
  • It is not an encoding problem: the model produces correct accents in conversational prose, and the files are valid UTF-8. It is a model/decoding default.
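The last point can be checked mechanically. A genuine encoding fault usually leaves traces (bytes that are not valid UTF-8, U+FFFD replacement characters, or mojibake such as "Ã±"), whereas text whose accents were never emitted is perfectly clean UTF-8. The sketch below illustrates that distinction; `diagnose_encoding` and `MOJIBAKE_MARKERS` are hypothetical names, and the marker list covers only the lowercase Spanish letters.

```python
# UTF-8 bytes of á é í ó ú ñ as they appear when wrongly decoded as Latin-1.
MOJIBAKE_MARKERS = ("Ã¡", "Ã©", "Ã\xad", "Ã³", "Ãº", "Ã±")


def diagnose_encoding(raw: bytes) -> str:
    """Classify file bytes as 'not-utf8', 'encoding-damage' or 'clean-utf8'."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"
    if "\ufffd" in text or any(marker in text for marker in MOJIBAKE_MARKERS):
        return "encoding-damage"
    return "clean-utf8"


if __name__ == "__main__":
    # Stripped text is still valid, undamaged UTF-8: the loss happened
    # before any bytes were written.
    print(diagnose_encoding("7 anios de experiencia".encode("utf-8")))
    # A real encoding fault looks different.
    print(diagnose_encoding("años".encode("latin-1")))
```

If a collapsed file reports `clean-utf8`, the encoding layer can be ruled out, which is consistent with the hypothesis above.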

Impact

  • Affects every language with diacritics: Spanish, Portuguese, French, Catalan, Vietnamese, Czech, Turkish, German, etc.
  • Corrupts datasets, code string literals and comments, error messages, docs, PRs and commit messages.
  • Because no amount of prompt-level instruction reliably fixes it, users cannot solve it themselves — it has to be addressed in the model/decoding or via an official safeguard.

Related issues

  • #41205 (closed as duplicate)
  • #32886

The problem is recurring and still unresolved.

Requests

  1. Acknowledge and track this as a structural model/decoding issue, not a user-configuration issue.
  2. Consider constrained / guided decoding that preserves diacritics for non-English locales, or an optional post-generation diacritic-restoration safeguard.
  3. At minimum, document the limitation and provide an official recommendation, since current guidance (use system-prompt rules) demonstrably does not work.
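To make request 2 concrete, here is a rough sketch of what a post-generation diacritic-restoration pass could look like at its simplest. It is purely illustrative and not an existing Claude Code feature: `RESTORE` and `restore_diacritics` are hypothetical names, and the lookup table holds only the examples from this report.

```python
import re

# Hypothetical lookup: stripped or transliterated form -> correct form.
RESTORE = {
    "anios": "años",
    "educacion": "educación",
    "politica": "política",
    "catolico": "católico",
    "tambien": "también",
    "despues": "después",
}

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def _match_case(template, replacement):
    """Copy the capitalization pattern of template onto replacement."""
    if template.isupper():
        return replacement.upper()
    if template[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement


def restore_diacritics(text):
    """Replace known stripped forms with their accented spelling."""
    def fix(match):
        word = match.group(0)
        target = RESTORE.get(word.lower())
        return _match_case(word, target) if target else word

    return WORD_RE.sub(fix, text)


if __name__ == "__main__":
    print(restore_diacritics("Tambien tiene 7 anios de EDUCACION."))
```

A plain lookup like this cannot resolve pairs where both spellings are valid words ("ano"/"año", "esta"/"está"), which is one reason a context-aware safeguard on the model side would be preferable to a user-side script.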

Steps to reproduce

  1. Add a strong system-prompt rule requiring Spanish accents and ñ.
  2. Ask Claude Code to generate a long Spanish text, or a structured file (JSON/dataset/code with Spanish string values), ideally later in a session / after context compaction.
  3. Observe that some outputs strip tildes and transliterate ñ → "ni" despite the rule, in an all-or-nothing fashion.

Thank you. This is a long-standing, daily pain point for Spanish-speaking (and other non-English) users, and it currently has no reliable user-side fix.
