claude-code - [BUG] Korean text degrades to repeated "측" tokens in long mixed-language responses



Preflight Checklist

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report (please file separate reports for different bugs)
  • I am using the latest version of Claude Code

What's Wrong?

Environment

  • Model: Claude Opus 4.7 (claude-opus-4-7)
  • Claude Code: 2.1.143
  • OS: Windows 11 Home
  • Working language context: Korean (primary user language) + Japanese kanji/kana + English (technical terms)

Summary

When generating long responses that heavily mix Korean, Japanese kanji, and English, the model's Korean output progressively degrades: valid Korean words are replaced with repeated "측" tokens (one specific Hangul syllable). The degradation tends to propagate and can fill entire sentences with just "측 측 측 측 측 …".

This severely impacts Korean-speaking users working on bilingual technical projects (Japanese NLP, Chinese/Japanese text analysis, etc.) where mixing languages is unavoidable.

Reproduction Context

The issue reproduces in long-running development sessions where:

  • The user converses in Korean.
  • Code/data contains Japanese kanji (e.g., 程, 位, 思わず, GC entry names).
  • Technical terms in English (GC, regex, patternAlt, etc.) appear frequently.
  • Individual responses grow long (multiple paragraphs, tables, code blocks).

After a certain length / mix density, Korean tokens start dropping out and being replaced by "측".
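
The "mix density" mentioned above is not defined precisely in this report. One hypothetical way to put a number on it is to count characters per script in a response or transcript. The sketch below is an illustration only and was not used in the original session; the names `script_of` and `script_mix` and the coarse Unicode ranges are assumptions made for this example.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character into a coarse script bucket (assumed ranges)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:      # precomposed Hangul syllables
        return "hangul"
    if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana blocks
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs (kanji/hanja/hanzi)
        return "han"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"                  # digits, punctuation, symbols, ...


def script_mix(text: str) -> dict[str, float]:
    """Return the share of each script among classified, non-space characters."""
    counts = Counter(script_of(ch) for ch in text if not ch.isspace())
    counts.pop("other", None)       # ignore punctuation and digits
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


if __name__ == "__main__":
    sample = "nokanji katakana 보유 + 모든 kana common=0 → 動植物名/俗語"
    for script, share in sorted(script_mix(sample).items()):
        print(f"{script}: {share:.2f}")
```

Logging these shares per response alongside response length would make it possible to check whether the failure correlates with a particular mix, which the report currently describes only qualitatively.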

Example (sanitized excerpt)

nokanji katakana 보유 + 모든 kana common=0 → 動植物名/俗語/scientific katakana 측 측 specialized term → group 4 demote 측 측 측.


A more severe example (entire sentence collapses to "측"):

응답: 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측.


## Why "측"?

"측" (U+CE21) is a valid Hangul syllable meaning "side/aspect". It appears in some Korean words (측면, 관측, etc.) but is unusual as a stand-alone word. The token degradation pattern specifically converges on this single character, suggesting the model's sampling falls into a local repetition mode where one Korean token captures all probability mass.

This is similar to repetition-collapse failure modes seen in other models, but specifically scoped to Korean output when the surrounding context has heavy Japanese kanji.
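
The code point can be checked directly. Precomposed Hangul syllables are laid out arithmetically from U+AC00 as `(lead * 21 + vowel) * 28 + tail`, so the character can be split into its three jamo. The short sketch below only confirms that "측" is a single, well-formed syllable; it says nothing about how the model tokenizes it, which is not known from this report.

```python
import unicodedata

SYLLABLE = "측"

code_point = ord(SYLLABLE)
print(f"U+{code_point:04X}")

# Recover the jamo indices from the arithmetic layout of the Hangul block.
index = code_point - 0xAC00
lead, rest = divmod(index, 21 * 28)   # 21 vowels x 28 tail slots per lead consonant
vowel, tail = divmod(rest, 28)
print(lead, vowel, tail)              # indices of the lead consonant, vowel, tail consonant

# Canonical decomposition (NFD) yields the same three conjoining jamo.
jamo = unicodedata.normalize("NFD", SYLLABLE)
print([f"U+{ord(j):04X}" for j in jamo])
```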

## Reproducible Workaround

The issue can be temporarily worked around by:
1. Asking the model to respond in English instead of Korean.
2. Keeping responses very short (1–2 sentences).
3. Avoiding tables and bullet lists with mixed-language cells.

Once the degradation starts, it tends to persist for the rest of the turn even when the model "tries" to recover.

## Impact

- Korean users working on Japanese NLP / Japanese language learning tools (a non-trivial use case for Claude Code) are blocked from getting reliable bilingual output.
- The output is hard to detect programmatically as broken (the Hangul char is "valid"), so it can silently degrade documentation, commit messages, or code comments if not caught by review.
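
Per-character validity checks will not catch this, but the repetition itself is a usable signal. Below is a minimal heuristic sketch, assuming that three or more consecutive copies of the same Hangul syllable (with optional whitespace between them) are suspicious. The helper name `find_collapse_runs` and the threshold are hypothetical, and legitimate repetition such as 하하하 would be reported as a false positive.

```python
import re


def find_collapse_runs(text: str, min_repeats: int = 3) -> list[tuple[int, int, str]]:
    """Find runs where one Hangul syllable repeats at least `min_repeats` times.

    Returns (start, end, syllable) for each run. Copies may be adjacent
    or separated by whitespace, as in the excerpts from this report.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # One syllable, then at least (min_repeats - 1) further copies of it.
    pattern = re.compile(r"([\uAC00-\uD7A3])(?:\s*\1){%d,}" % (min_repeats - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    excerpt = (
        "nokanji katakana 보유 + 모든 kana common=0 → 動植物名/俗語/scientific "
        "katakana 측 측 specialized term → group 4 demote 측 측 측."
    )
    for start, end, syllable in find_collapse_runs(excerpt):
        print(f"suspicious run of {syllable!r} at {start}-{end}: {excerpt[start:end]!r}")
```

A check like this could run as a pre-commit hook or review aid over generated documentation, commit messages, and comments. A lower threshold catches the milder excerpt earlier, at the cost of more false positives.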

## Suggested Investigation Direction

- Repetition penalty / decoding policy when the active output language has low conditional probability against a dominant context language.
- Token-level probability collapse for Korean in mixed CJK+Latin contexts.
- Comparison with Haiku/Sonnet variants — does this affect all model tiers or specifically Opus 4.7?

## Severity

**Medium-High** for affected users. Functionality is technically not broken (the model still emits something), but the output is unusable.

---

(Posted by a Korean-speaking developer; conversation transcript available on request — message thread is in a bilingual session working on a Japanese NLP project in Claude Code.)


### What Should Happen?

The model should produce coherent Korean text throughout the response, regardless of how much Japanese kanji or English appears in the surrounding context. Specifically:

- Korean tokens should not collapse to a single repeated syllable (측).
- Mixed-language responses should remain readable; Korean sections should convey actual meaning, not Hangul-shaped noise.
- If the model is unable to continue cleanly in Korean for some internal reason, it should either:
  - Switch to English with an explanation, or
  - Truncate cleanly,
  rather than silently emitting Hangul-shaped garbage.
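
As a stopgap, the "truncate cleanly" option can be approximated on the client side. The sketch below is a hypothetical post-processing step, not part of Claude Code or any Anthropic API: the function name `truncate_at_collapse`, the default threshold of five repeats, and the notice text are all assumptions made for illustration.

```python
import re


def truncate_at_collapse(
    text: str,
    min_repeats: int = 5,
    notice: str = "[output truncated: repeated-syllable collapse detected]",
) -> tuple[str, bool]:
    """Cut a completed response at the first repeated-Hangul-syllable run.

    Returns (possibly truncated text, whether truncation happened).
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Same Hangul syllable repeated min_repeats times, whitespace optional.
    pattern = re.compile(r"([\uAC00-\uD7A3])(?:\s*\1){%d,}" % (min_repeats - 1))
    match = pattern.search(text)
    if match is None:
        return text, False
    kept = text[: match.start()].rstrip()   # keep everything before the run
    return f"{kept}\n{notice}", True


if __name__ == "__main__":
    broken = "응답: " + " ".join(["측"] * 20) + "."
    cleaned, was_truncated = truncate_at_collapse(broken)
    print(was_truncated)
    print(cleaned)
```

This only hides the symptom in stored output; it does not recover the lost Korean text, so the request would still need to be retried, for example in English as described in the workaround section.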

### Error Messages/Logs

```shell
```

Steps to Reproduce

  1. Launch Claude Code on Windows (PowerShell). Model: Opus 4.7.
  2. Set up a project that involves Japanese text analysis or NLP (the project working dir contains Japanese kanji file names, source comments in Japanese, and English code identifiers).
  3. Start a conversation in Korean. Have an extended back-and-forth (15+ turns) where each Claude response includes:
    • Korean prose explanations
    • Inline Japanese kanji terms (e.g., 「程 (Sino-Japanese reading てい)」, 「いつの日か」, etc.)
    • English technical terms (GC, retokenize, expression, etc.)
    • Markdown tables comparing options
  4. Ask Claude to summarize trade-offs or compare design options. Trigger the response with a question like: "이 옵션의 부작용은 뭐야? 장단점 비교해줘" ("What are the side effects of this option? Compare the pros and cons."), with Japanese kanji terms expected in the reply.
  5. Observe: somewhere in the middle of a long mixed-language response, Korean tokens begin getting replaced by 측. The degradation worsens as the response continues.

A minimal repro is harder because the model usually behaves correctly on short, single-language prompts. The failure requires accumulated mixed-language context.

Claude Model

None

Is this a regression?

Yes, this worked in a previous version

Last Working Version

No response

Claude Code Version

2.1.143

Platform

Anthropic API

Operating System

Windows

Terminal/Shell

Warp

Additional Information

<img width="3366" height="1596" alt="Image" src="https://github.com/user-attachments/assets/4803f8c6-80be-4e6b-a323-bfa15c38f855" />
