claude-code: [BUG] Korean output: valid Hangul but wrong/missing syllables and hallucinated non-words (not U+FFFD)

Summary

This is distinct from existing CJK corruption issues (#40396, #45508, #46863, #41358), which all produce U+FFFD (`�`) replacement characters from UTF-8 byte truncation in the terminal/streaming layer.

This bug produces syntactically valid Hangul syllables that are semantically wrong: missing 종성 (final consonants), dropped syllables, and, most concerning, hallucinated non-words that do not exist in Korean appearing in the model's output. Zero U+FFFD characters are involved.

Examples (from a single Plan-mode AskUserQuestion prompt, screenshot to be attached)

| Expected | Actual | Pattern |
|---|---|---|
| 캠페인 | 캐페인 | Final consonant dropped (1 jamo missing) |
| 혼란 / 이슈 | 와멸 | Hallucinated non-word; 와멸 is not a Korean word |
| 방향 / 방안 | | 2 syllables collapsed to 1 |

The string "와멸" does not exist in Korean — not in dictionaries, not in colloquial usage, not as a loanword. The model emitted it twice in the same prompt as if it were a real word meaning "issue/disruption." This is the dangerous signature: a native reader sees valid-looking Korean that is nonsense; an encoding-based check sees nothing wrong.
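To make the "final consonant dropped" pattern concrete, here is a minimal sketch that decomposes precomposed Hangul syllables with the standard Unicode arithmetic (base U+AC00, 21 vowels, 28 final-consonant slots) and reports where an actual string lost its 종성. The function names `decompose` and `dropped_finals` are illustrative assumptions, not part of Claude Code or any library.

```python
HANGUL_BASE = 0xAC00   # first precomposed syllable, 가
HANGUL_LAST = 0xD7A3   # last precomposed syllable, 힣
VOWELS, TAILS = 21, 28  # vowel count and final-consonant slots (slot 0 = none)


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (lead, vowel, tail) indices; tail 0 means no final consonant."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset <= HANGUL_LAST - HANGUL_BASE:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(offset, VOWELS * TAILS)
    vowel, tail = divmod(rest, TAILS)
    return lead, vowel, tail


def dropped_finals(expected: str, actual: str) -> list[int]:
    """Indices where `actual` keeps lead and vowel but loses the final consonant."""
    hits = []
    for i, (e, a) in enumerate(zip(expected, actual)):
        e_lead, e_vowel, e_tail = decompose(e)
        a_lead, a_vowel, a_tail = decompose(a)
        if (e_lead, e_vowel) == (a_lead, a_vowel) and e_tail != 0 and a_tail == 0:
            hits.append(i)
    return hits


print(dropped_finals("캠페인", "캐페인"))  # [0]
```

Both strings are well-formed syllable sequences; only a comparison against the intended word reveals that the first syllable lost its final consonant.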

Why this is NOT the U+FFFD bug

  • Output contains zero U+FFFD replacement characters
  • All bytes form valid UTF-8 and valid composed Hangul syllables
  • Corruption is at the token-selection / semantic layer, not the byte layer
  • File writes show the same corruption (existing U+FFFD bug spares file writes; this one does not)

This means existing UTF-8 boundary-fix proposals (`.floor_char_boundary()` etc.) will not address it. The bug appears to be upstream in inference, not in CLI rendering.
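The difference between the two failure modes can be sketched in a few lines of Python. The first case simulates byte truncation, which is what the earlier issues describe; the second feeds the strings reported here through the same check. `passes_encoding_checks` is a hypothetical helper written for this illustration, not an existing linter.

```python
REPLACEMENT = "\ufffd"


def passes_encoding_checks(text: str) -> bool:
    """True when text round-trips through strict UTF-8 and has no U+FFFD."""
    try:
        round_trip = text.encode("utf-8").decode("utf-8")
    except UnicodeError:
        return False
    return round_trip == text and REPLACEMENT not in text


# Byte-layer failure: cutting a 3-byte syllable mid-sequence yields U+FFFD.
truncated = "캠페인".encode("utf-8")[:4].decode("utf-8", errors="replace")
print(passes_encoding_checks(truncated))  # False

# Semantic-layer failure: the strings reported in this issue pass cleanly.
for sample in ("캐페인", "와멸"):
    print(sample, passes_encoding_checks(sample))  # True for both
```

A character-boundary fix changes the outcome of the first case only; the second case never contained a split byte sequence to begin with.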

Environment

  • Claude Code 2.1.145
  • Model: Opus 4.7, 1M context
  • macOS 15.6 (build 24G84), default Claude Code TUI
  • Heaviest in Plan mode with large system prompt (CLAUDE.md + memory + skills + multi-project context)
  • Onset: roughly early May 2026; progressive — minor at session start, increasingly frequent after context compaction

Suspected root cause

Not a CLI / terminal bug. Candidates:

  1. Opus 4.7 1M-context decoder regression for low-resource-language tokens
  2. Context compaction corrupting Korean BPE token sequence integrity
  3. New tokenizer (claimed 20–35% efficiency improvement for CJK in 4.7) introducing a decoding-side regression

Impact

More dangerous than the U+FFFD bug for Korean users:

  • No visual indicator the output is corrupted (no `�`)
  • Corruption propagates through tool calls, file writes, and git commits undetected by any encoding-based linter
  • Plan-mode `.md` outputs persist the corruption into the codebase, where it may be acted on later
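Since encoding-based linters cannot see this class of corruption, one possible stopgap is a vocabulary check on Hangul runs before output is persisted. The sketch below is only an assumption about how such a check might look: `KNOWN_WORDS` is a tiny hard-coded stand-in for a real Korean lexicon or morphological analyzer, and `unknown_hangul_words` is a hypothetical name.

```python
import re

# Hypothetical stand-in for a real Korean lexicon or morphological analyzer.
KNOWN_WORDS = {"캠페인", "혼란", "이슈", "방향", "방안"}

# Runs of precomposed Hangul syllables (U+AC00 through U+D7A3).
HANGUL_RUN = re.compile(r"[\uac00-\ud7a3]+")


def unknown_hangul_words(text: str, vocabulary: set[str] = KNOWN_WORDS) -> list[str]:
    """Return Hangul runs in `text` that are absent from `vocabulary`, in order."""
    return [word for word in HANGUL_RUN.findall(text) if word not in vocabulary]


print(unknown_hangul_words("캐페인 방향 와멸"))  # ['캐페인', '와멸']
```

Korean attaches particles and endings to stems, so an exact-match word set like this would raise many false positives on real text; a practical version would need stemming or a morphological analyzer in place of the set lookup.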

(screenshot to follow as a comment)
