claude-code - 💡(How to fix) Fix [BUG] Korean text degrades to repeated "측" tokens in long mixed-language responses

claude-code · 2026-05-16 03:40:35

When generating long responses that heavily mix Korean, Japanese kanji, and English, the model's Korean output progressively degrades: valid Korean words are replaced with repeated "측" tokens (one specific Hangul syllable). The degradation tends to propagate and can fill entire sentences with just "측 측 측 측 측 …".

This severely impacts Korean-speaking users working on bilingual technical projects (Japanese NLP, Chinese/Japanese text analysis, etc.) where mixing languages is unavoidable.

### Preflight Checklist

- I have searched existing issues and this hasn't been reported yet
- This is a single bug report (please file separate reports for different bugs)
- I am using the latest version of Claude Code

### What's Wrong?

## Environment

- Model: Claude Opus 4.7 (claude-opus-4-7)
- Claude Code: 2.1.143
- OS: Windows 11 Home
- Working language context: Korean (primary user language) + Japanese kanji/kana + English (technical terms)

## Reproduction Context

The issue reproduces in long-running development sessions where:

- The user converses in Korean.
- Code/data contains Japanese kanji (e.g., 程, 位, 思わず, GC entry names).
- Technical terms in English (GC, regex, patternAlt, etc.) appear frequently.
- Individual responses grow long (multiple paragraphs, tables, code blocks).

After a certain length / mix density, Korean tokens start dropping out and being replaced by "측".
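The "mix density" mentioned above is not quantified in this report. One rough way to put a number on it is to count characters per script in a transcript or a single response. The sketch below is only an illustration: the function names and the coarse Unicode ranges are assumptions, and nothing here establishes a threshold at which the degradation begins.

```python
from collections import Counter
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Classify one character into a coarse script bucket (assumed ranges)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"   # precomposed Hangul syllables
    if 0x3040 <= cp <= 0x30FF:
        return "kana"     # hiragana and katakana blocks
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"      # CJK unified ideographs (kanji)
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None           # digits, punctuation, whitespace, other scripts


def script_mix(text: str) -> dict:
    """Return the share of each script among the classified characters."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


if __name__ == "__main__":
    print(script_mix("이 옵션은 程 (てい) 를 GC entry 로 처리합니다"))
```

Logging this ratio per response over a long session would make it easier to report how mixed the context was when the first "측" run appeared.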

## Example (sanitized excerpt)

nokanji katakana 보유 + 모든 kana common=0 → 動植物名/俗語/scientific katakana 측 측 specialized term → group 4 demote 측 측 측.


A more severe example (entire sentence collapses to "측"):

응답: 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측 측.


## Why "측"?

"측" (U+CE21) is a valid Hangul syllable meaning "side/aspect". It appears in some Korean words (측면, 관측, etc.) but is unusual as a stand-alone word. The degradation pattern converges specifically on this single character, which suggests that sampling falls into a local repetition mode where one Korean token captures all probability mass.

This is similar to repetition-collapse failure modes seen in other models, but specifically scoped to Korean output when the surrounding context has heavy Japanese kanji.
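The code point quoted above can be checked with the standard Hangul syllable composition arithmetic from the Unicode Standard. This is a small sanity check only; the helper name `compose_hangul` is made up for illustration.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE = 0xAC00   # first precomposed syllable, "가"
V_COUNT = 21      # number of medial vowels
T_COUNT = 28      # number of final consonants, including "no final"


def compose_hangul(initial: int, medial: int, final: int) -> str:
    """Compose a precomposed Hangul syllable from its jamo indices."""
    return chr(S_BASE + (initial * V_COUNT + medial) * T_COUNT + final)


# "측" = initial ㅊ (index 14) + medial ㅡ (index 18) + final ㄱ (index 1)
syllable = compose_hangul(14, 18, 1)
print(syllable, hex(ord(syllable)), unicodedata.name(syllable))
```

Because the character is an ordinary, well-formed syllable, encoding or font problems are an unlikely explanation for the output shown in the excerpts.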

## Reproducible Workaround

The issue can be temporarily worked around by:
1. Asking the model to respond in English instead of Korean.
2. Keeping responses very short (1–2 sentences).
3. Avoiding tables and bullet lists with mixed-language cells.

Once the degradation starts, it tends to persist for the rest of the turn even when the model "tries" to recover.
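For scripted use, workaround 1 can be wrapped in a retry. The sketch below is hypothetical throughout: `generate` stands for whatever function sends a prompt and returns the reply text, `is_degraded` is any caller-supplied check, and the wording of `ENGLISH_SUFFIX` is an assumption. It has not been verified that a retry in English always avoids the problem.

```python
from typing import Callable

# Assumed wording; adjust to the project's own prompt conventions.
ENGLISH_SUFFIX = "\n\nPlease respond in English only."


def generate_with_english_fallback(
    prompt: str,
    generate: Callable[[str], str],
    is_degraded: Callable[[str], bool],
) -> str:
    """Return the first reply if it looks fine, else re-ask in English."""
    reply = generate(prompt)
    if not is_degraded(reply):
        return reply
    # The degradation tends to persist within a turn, so instead of asking
    # the model to continue, send a fresh request that asks for English.
    return generate(prompt + ENGLISH_SUFFIX)
```

The retry costs a second request, so it is only worth using where a silently broken Korean reply would otherwise end up in a commit message or document.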

## Impact

- Korean users working on Japanese NLP / Japanese language learning tools (a non-trivial use case for Claude Code) are blocked from getting reliable bilingual output.
- The output is hard to detect programmatically as broken (the Hangul char is "valid"), so it can silently degrade documentation, commit messages, or code comments if not caught by review.
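Although each character is valid, the repeated pattern in the excerpts is fairly distinctive, so a simple heuristic can flag it before the text lands in a commit or document. The sketch below looks for one Hangul syllable repeated as a whitespace-separated token. The function name and the default threshold of three repeats are assumptions, chosen so that ordinary reduplication of a short word is less likely to trigger it.

```python
import re

HANGUL = "\uAC00-\uD7A3"  # precomposed Hangul syllables block


def find_syllable_collapse(text: str, min_repeats: int = 3) -> list:
    """Return (offset, syllable, repeat_count) for each run where a single
    stand-alone Hangul syllable is repeated at least min_repeats times."""
    pattern = re.compile(
        rf"(?<![{HANGUL}])([{HANGUL}])(?:\s+\1){{{min_repeats - 1},}}(?![{HANGUL}])"
    )
    hits = []
    for m in pattern.finditer(text):
        # Each token in the run is the same syllable, so count the tokens.
        hits.append((m.start(), m.group(1), len(m.group(0).split())))
    return hits


if __name__ == "__main__":
    sample = "specialized term → group 4 demote 측 측 측."
    for offset, syllable, count in find_syllable_collapse(sample):
        print(f"possible collapse at offset {offset}: {syllable!r} x{count}")
```

A check like this could run in a pre-commit hook or a review script. It only covers the whitespace-separated pattern shown in this report and would miss other forms of degradation.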

## Suggested Investigation Direction

- Repetition penalty / decoding policy when the active output language has low conditional probability against a dominant context language.
- Token-level probability collapse for Korean in mixed CJK+Latin contexts.
- Comparison with Haiku/Sonnet variants — does this affect all model tiers or specifically Opus 4.7?
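For readers unfamiliar with the first two items, the toy example below shows what a repetition penalty does to a collapsed next-token distribution. It uses the divide-or-multiply form common in open-source decoders. The logits are invented, and nothing in this report indicates how Claude's decoding actually works or whether it uses such a penalty.

```python
import math


def apply_repetition_penalty(logits: dict, generated: list, penalty: float = 1.2) -> dict:
    """Lower the logits of tokens that were already generated."""
    seen = set(generated)
    out = {}
    for tok, logit in logits.items():
        if tok in seen:
            # Divide positive logits and multiply negative ones, so the
            # token always becomes less likely for penalty > 1.
            out[tok] = logit / penalty if logit > 0 else logit * penalty
        else:
            out[tok] = logit
    return out


def softmax(logits: dict) -> dict:
    """Convert logits to probabilities, shifted by the max for stability."""
    top = max(logits.values())
    exps = {t: math.exp(v - top) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


def entropy(probs: dict) -> float:
    """Shannon entropy in nats; values near 0 mean one token dominates."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    # Invented logits where one token holds almost all probability mass.
    logits = {"측": 8.0, "면": 2.0, "은": 1.0}
    before = softmax(logits)
    after = softmax(apply_repetition_penalty(logits, ["측", "측"], penalty=2.0))
    print("before:", before, "entropy:", entropy(before))
    print("after: ", after, "entropy:", entropy(after))
```

Per-token entropy of this kind is one signal an investigation could track to locate where in a response the distribution collapses.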

## Severity

**Medium-High** for affected users. Functionality is technically not broken (the model still emits something), but the output is unusable.

---

(Posted by a Korean-speaking developer; conversation transcript available on request — message thread is in a bilingual session working on a Japanese NLP project in Claude Code.)


### What Should Happen?

The model should produce coherent Korean text throughout the response, regardless of how much Japanese kanji or English appears in the surrounding context. Specifically:

- Korean tokens should not collapse to a single repeated syllable (측).
- Mixed-language responses should remain readable; Korean sections should convey actual meaning, not Hangul-shaped noise.
- If the model is unable to continue cleanly in Korean for some internal reason, it should either:
  - Switch to English with an explanation, or
  - Truncate cleanly,
  rather than silently emitting Hangul-shaped garbage.

### Steps to Reproduce

1. Launch Claude Code on Windows (PowerShell). Model: Opus 4.7.
2. Set up a project that involves Japanese text analysis or NLP (the working directory contains file names with Japanese kanji, source comments in Japanese, and English code identifiers).
3. Start a conversation in Korean. Have an extended back-and-forth (15+ turns) where each Claude response includes:
   - Korean prose explanations
   - Inline Japanese kanji terms (e.g., 「程 (Sino-Japanese reading てい)」, 「いつの日か」, etc.)
   - English technical terms (GC, retokenize, expression, etc.)
   - Markdown tables comparing options
4. Ask Claude to summarize trade-offs or compare design options. Trigger the response with a question like: "이 옵션의 부작용은 뭐야? 장단점 비교해줘" ("What are the side effects of this option? Compare the pros and cons"), with Japanese kanji terms expected in the reply.
5. Observe: somewhere in the middle of a long mixed-language response, Korean tokens begin to be replaced by 측. The degradation worsens as the response continues.

A minimal repro is harder to construct because the model usually behaves correctly on short, single-language prompts. The failure requires accumulated mixed-language context.

### Claude Model

None

### Is this a regression?

Yes, this worked in a previous version

### Last Working Version

No response

### Claude Code Version

2.1.143

### Platform

Anthropic API

### Operating System

Windows

### Terminal/Shell

Warp
