claude-code: Frequent CJK character substitution errors during Japanese text generation in MCP tool calls (Notion content_updates etc.)

GitHub stats
anthropics/claude-code#54051 (fetched 2026-04-28 06:40:34)
Comments: 0 · Participants: 1 · Timeline events: 4 (labeled ×4) · Reactions: 0


Summary

When using Claude (Opus 4.7 / Sonnet 4.6) to generate or edit Japanese text via MCP tools — particularly mcp__notion__notion-update-page with single-character old_str/new_str replacements — the model frequently substitutes incorrect Unicode code points. The wrong characters are typically Simplified Chinese or rare/archaic Japanese characters that look semantically related but are wrong, indicating an issue with how the model generates CJK code points one character at a time.

Frequency observed

In a single ~3-hour session editing one Notion page in Japanese, I observed 8+ such errors:

| Intended | Actually generated | Comment |
| --- | --- | --- |
| 嬉しい (U+5B09) | 嫌しい (U+5ACC) | Self-correction attempt failed, character substituted with semantically opposite kanji |
| 春 2026 (U+6625) | 多 2026 (U+591A) | Generated unrelated kanji |
| 複数 | 列読 | Two-char span, both wrong, semantically meaningless |
| 揃え (U+63C3) | 諿え (U+8AFF) | Rare/archaic kanji substituted |
| 処分 (U+5206) | 処勭 (U+52ED) | Very rare kanji substituted |
| 億 (U+5104) | 亿 (U+4EBF) | Simplified Chinese substituted for Japanese |
| 付いてくる (U+4ED8) | 跟いてくる (U+8DDF) | Chinese-only kanji substituted |
| 側 (U+5074) | 侧 (U+4FA7) | Simplified Chinese substituted |
| 締切った | 締切りた | Conjugation error (less severe but related) |
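
The code points quoted in the table above can be checked mechanically rather than by eye. The following is a minimal sketch, not part of the original report: `codepoint_diff` is a hypothetical helper name, and it assumes the intended and generated strings are both available as Python `str` values.

```python
def codepoint_diff(intended: str, actual: str) -> list[tuple[int, str, str]]:
    """Return (index, intended, actual) for every position where the strings differ."""
    diffs = []
    for index, (want, got) in enumerate(zip(intended, actual)):
        if want != got:
            # Show the character next to its code point, e.g. "嬉 U+5B09".
            diffs.append((index, f"{want} U+{ord(want):04X}", f"{got} U+{ord(got):04X}"))
    if len(intended) != len(actual):
        # zip() stops at the shorter string, so report a length mismatch separately.
        diffs.append((min(len(intended), len(actual)),
                      f"len={len(intended)}", f"len={len(actual)}"))
    return diffs


if __name__ == "__main__":
    for row in codepoint_diff("嬉しい", "嫌しい"):
        print(row)
```

Comparing position by position keeps visually similar kanji apart, which is where manual review tends to miss substitutions.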

Reproduction pattern

Error rate is highest in three scenarios:

  1. Single-character replacements in mcp__notion__notion-update-page content_updates arrays where old_str and new_str differ by one kanji
  2. Generation of new Japanese paragraphs with mixed kanji content (especially when generating long markdown via notion-create-pages content arg)
  3. Self-correction attempts — correcting one typo frequently introduces another (compounding errors)

Hypothesized cause

The model appears to mishandle Unicode escape sequences \uXXXX when generating CJK characters one at a time. The errors are NOT transcription errors from input — input Japanese is preserved correctly when read back via notion-fetch. The errors happen during generation of new content / replacements.

This suggests a tokenizer or JSON-encoding-pipeline issue specific to single-CJK-character spans, rather than a general Japanese understanding limitation.
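
One way to narrow this hypothesis is to check locally whether `\uXXXX` escaping or Unicode normalization can, by themselves, turn one of the affected kanji into another. The sketch below uses only Python's standard `json` and `unicodedata` modules. It does not exercise the actual Claude, MCP client, or MCP server code paths, so a passing check only rules out the generic encoding step, not the pipeline itself. `survives_json_and_normalization` is a hypothetical helper name.

```python
import json
import unicodedata


def survives_json_and_normalization(text: str) -> bool:
    """True if text is unchanged by a JSON escape round trip and by NFC/NFKC."""
    # json.dumps escapes non-ASCII as \uXXXX by default (ensure_ascii=True).
    encoded = json.dumps(text)
    round_tripped = json.loads(encoded)
    normalized_same = all(
        unicodedata.normalize(form, text) == text for form in ("NFC", "NFKC")
    )
    return round_tripped == text and normalized_same


if __name__ == "__main__":
    for sample in ("嬉しい", "億", "側", "揃え"):
        print(sample, json.dumps(sample), survives_json_and_normalization(sample))
```

If a check like this passes for the intended characters, a substitution such as 億 to 亿 is unlikely to come from escaping or normalization alone, which would leave generation as the more likely source.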

Impact

  • Documents intended for external sharing (clients, business partners) require manual character-by-character review before publication
  • Self-correction workflows are unreliable — fixing one error introduces new ones
  • Trust in MCP-based document workflows degrades quickly with this error rate
  • For Japanese-language users, the workaround load is significant

Current workaround

After every MCP write, immediately notion-fetch the document, grep for known-bad characters (Simplified Chinese, rare CJK), and correct manually. This is brittle and high-toil.
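
The grep step of this workaround could be scripted roughly as follows. This is a heuristic sketch and not the reporter's actual script: it assumes the fetched page text is already in a Python string (the `notion-fetch` call itself is not shown) and treats "not encodable in Python's `shift_jis` codec", which covers JIS X 0208, as a proxy for "unlikely in ordinary Japanese text". It catches Simplified Chinese forms such as 亿 and 侧, but it cannot catch a wrong kanji that is itself common in Japanese, such as 嫌 in place of 嬉.

```python
def find_non_jis_kanji(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char, code point) for kanji outside the shift_jis repertoire."""
    hits = []
    for index, ch in enumerate(text):
        # Only inspect the main CJK Unified Ideographs block; kana, ASCII and
        # punctuation are skipped to avoid false positives.
        if not ("\u4e00" <= ch <= "\u9fff"):
            continue
        try:
            ch.encode("shift_jis")
        except UnicodeEncodeError:
            hits.append((index, ch, f"U+{ord(ch):04X}"))
    return hits


if __name__ == "__main__":
    fetched = "売上は1亿円、右侧を参照"  # stand-in for text returned by a fetch
    for hit in find_non_jis_kanji(fetched):
        print(hit)
```

Running a scan like this after every write turns the manual review into a list of positions to inspect, although semantic substitutions still need a human reader.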

Suggested investigation areas

  1. Tokenizer behavior for CJK single-character spans (possibly subword splits introducing wrong neighbors)
  2. JSON encoding pipeline between Claude → MCP client → MCP server (any Unicode normalization happening?)
  3. Whether the same error rate occurs in non-MCP Edit tool calls on local files (initial impression: somewhat, but less frequent than MCP tool calls)
  4. Whether including context in old_str/new_str (e.g., 5+ chars instead of 1) reduces error rate (anecdotally yes)
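
Item 4 can be tried without changing the tool: widen each single-character replacement until `old_str` is unique in the page before sending it. The following is a minimal sketch under stated assumptions. `widen_replacement` is a hypothetical helper, and the returned `{"old_str", "new_str"}` dictionary only mirrors the field names mentioned in this report; the real schema of a `content_updates` entry is not confirmed here.

```python
def widen_replacement(document: str, index: int, new_char: str,
                      min_context: int = 2) -> dict[str, str]:
    """Build an old_str/new_str pair that replaces document[index] with context."""
    if not 0 <= index < len(document):
        raise IndexError("index is outside the document")
    radius = min_context
    while True:
        start = max(0, index - radius)
        end = min(len(document), index + 1 + radius)
        old_str = document[start:end]
        # Unique means: the first match is at start and there is no later match.
        unique = (document.find(old_str) == start
                  and document.find(old_str, start + 1) == -1)
        if unique or (start == 0 and end == len(document)):
            break
        radius += 1  # grow the window until the span is unambiguous
    new_str = document[start:index] + new_char + document[index + 1:end]
    return {"old_str": old_str, "new_str": new_str}


if __name__ == "__main__":
    page = "売上は1亿円、費用は2亿円です"
    print(widen_replacement(page, page.index("亿"), "億"))
```

Because `new_str` is assembled by copying the surrounding characters from the document, only the single corrected character has to be supplied, which is consistent with the anecdotal observation that longer spans fail less often.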

Environment

  • Claude Code on macOS Darwin 25.2.0
  • Model: Claude Opus 4.7 (1M context)
  • MCP server: Anthropic-managed Notion connector (mcp__notion__*)
  • Use case: Japanese business documents (TERIYAKI Inc.)

Related

This appears to be a recurring issue. Documenting it here so the pattern is centrally visible for the team's CJK quality work.

extent analysis

TL;DR

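The root cause is not yet determined. The report proposes investigating tokenizer behavior for single-character CJK spans and the JSON encoding pipeline for Unicode handling. Until then, the reported mitigation is to include more context in old_str/new_str and to review the output manually.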

Guidance

  • Investigate the tokenizer's subword split behavior for single CJK characters to determine if it introduces incorrect neighbors, potentially causing the model to generate wrong characters.
  • Examine the JSON encoding pipeline between Claude and the MCP client/server for any Unicode normalization that might be contributing to the error.
  • Test whether including more context in old_str/new_str (e.g., 5+ characters instead of 1) reduces the error rate, as anecdotal evidence suggests this might be beneficial.
  • Consider comparing the error rate in non-MCP Edit tool calls on local files to determine if the issue is specific to MCP tool calls.

Example

The issue itself does not include a code example. Investigating the tokenizer and the JSON encoding pipeline would involve reviewing the settings that govern Unicode handling and character encoding.

Notes

The exact cause of the issue is not yet determined, and further investigation into the suggested areas is necessary. The problem seems to be related to how the model handles CJK characters, particularly in single-character replacements and generation tasks.

Recommendation

Apply a workaround by including more context in old_str/new_str and manually reviewing documents for incorrect characters until a more permanent fix can be implemented, as this approach has shown some promise in reducing the error rate.
