claude-code - 💡(How to fix) Fix Frequent CJK character substitution errors during Japanese text generation in MCP tool calls (Notion content_updates etc.) [1 participants]

claude-code • 2026-04-27 19:55:09


anthropics/claude-code#54051•Fetched 2026-04-28 06:40:34


Author

ryofukutani

Participants

ryofukutani


When using Claude (Opus 4.7 / Sonnet 4.6) to generate or edit Japanese text via MCP tools — particularly mcp__notion__notion-update-page with single-character old_str/new_str replacements — the model frequently substitutes incorrect Unicode code points. The wrong characters are typically Simplified Chinese or rare/archaic Japanese characters that look semantically related but are wrong, indicating an issue with how the model generates CJK code points one character at a time.



Frequency observed

In a single ~3-hour session editing one Notion page in Japanese, I observed 8+ such errors:

| Intended | Actually generated | Comment |
| --- | --- | --- |
| 嬉しい (U+5B09) | 嫌しい (U+5ACC) | Self-correction attempt failed, character substituted with semantically opposite kanji |
| 春 2026 (U+6625) | 多 2026 (U+591A) | Generated unrelated kanji |
| 複数 | 列読 | Two-char span, both wrong, semantically meaningless |
| 揃え (U+63C3) | 諿え (U+8AFF) | Rare/archaic kanji substituted |
| 処分 (U+5206) | 処勭 (U+52ED) | Very rare kanji substituted |
| 億 (U+5104) | 亿 (U+4EBF) | Simplified Chinese substituted for Japanese |
| 付いてくる (U+4ED8) | 跟いてくる (U+8DDF) | Chinese-only kanji substituted |
| 側 (U+5074) | 侧 (U+4FA7) | Simplified Chinese substituted |
| 締切った | 締切りた | Conjugation error (less severe but related) |
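
Rows like the ones above can be produced mechanically instead of by eye. The following is a minimal Python sketch, not part of the original report, that compares an intended string with the generated one and lists each differing character with its code point. The name diff_codepoints is hypothetical, and the sketch assumes both strings have the same length, which holds for the single-character substitutions listed here.

```python
def diff_codepoints(intended: str, generated: str) -> list[tuple[int, str, str, str, str]]:
    """Return (index, intended_char, intended_cp, generated_char, generated_cp)
    for every position where the two strings differ."""
    if len(intended) != len(generated):
        # Assumption: substitutions are one-for-one, so lengths must match.
        raise ValueError("strings must have the same length")
    diffs = []
    for i, (a, b) in enumerate(zip(intended, generated)):
        if a != b:
            diffs.append((i, a, f"U+{ord(a):04X}", b, f"U+{ord(b):04X}"))
    return diffs


if __name__ == "__main__":
    for row in diff_codepoints("嬉しい", "嫌しい"):
        print(row)
```

Logging this output alongside each failed edit would make it easier to tell a Simplified Chinese substitution from an unrelated-kanji one when reporting further cases.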

Reproduction pattern

Error rate is highest in three scenarios:

Single-character replacements in mcp__notion__notion-update-page content_updates arrays where old_str and new_str differ by one kanji
Generation of new Japanese paragraphs with mixed kanji content (especially when generating long markdown via notion-create-pages content arg)
Self-correction attempts — correcting one typo frequently introduces another (compounding errors)

Hypothesized cause

The model appears to mishandle Unicode escape sequences \uXXXX when generating CJK characters one at a time. The errors are NOT transcription errors from input — input Japanese is preserved correctly when read back via notion-fetch. The errors happen during generation of new content / replacements.

This suggests a tokenizer or JSON-encoding-pipeline issue specific to single-CJK-character spans, rather than a general Japanese understanding limitation.

Impact

Documents intended for external sharing (clients, business partners) require manual character-by-character review before publication
Self-correction workflows are unreliable — fixing one error introduces new ones
Trust in MCP-based document workflows degrades quickly with this error rate
For Japanese-language users, the workaround load is significant

Current workaround

After every MCP write, immediately notion-fetch the document, grep for known-bad characters (Simplified Chinese, rare CJK), and correct manually. This is brittle and high-toil.
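
The "grep for known-bad characters" step can be partly automated. The sketch below is one possible heuristic, not the reporter's actual script: it flags any CJK ideograph that cannot be encoded in Shift_JIS (JIS X 0208), on the assumption that ordinary Japanese business text stays within that repertoire. The function names are hypothetical. The heuristic catches Simplified Chinese forms such as 亿 and 侧, but it cannot catch substitutions where the wrong character is itself common in Japanese (for example 嫌 for 嬉), and it may flag legitimate rare kanji or names.

```python
def is_cjk_ideograph(ch: str) -> bool:
    """True for CJK Unified Ideographs and Extension A (BMP only)."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def find_suspect_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, char, code point) for ideographs outside Shift_JIS.

    Heuristic only: a hit means "review this character", not "this is wrong".
    """
    suspects = []
    for i, ch in enumerate(text):
        if not is_cjk_ideograph(ch):
            continue  # skip kana, ASCII, punctuation
        try:
            ch.encode("shift_jis")
        except UnicodeEncodeError:
            suspects.append((i, ch, f"U+{ord(ch):04X}"))
    return suspects


if __name__ == "__main__":
    page_text = "億と亿、側と侧"  # stand-in for text returned by notion-fetch
    for hit in find_suspect_chars(page_text):
        print(hit)
```

Running this on the fetched page text after each write narrows the manual review to the flagged positions, though a full read-through is still needed for the cases the heuristic misses.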

Suggested investigation areas

Tokenizer behavior for CJK single-character spans (possibly subword splits introducing wrong neighbors)
JSON encoding pipeline between Claude → MCP client → MCP server (any Unicode normalization happening?)
Whether the same error rate occurs in non-MCP Edit tool calls on local files (initial impression: somewhat, but less frequent than MCP tool calls)
Whether including context in old_str/new_str (e.g., 5+ chars instead of 1) reduces error rate (anecdotally yes)
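
For the JSON-encoding question above, a quick local check can rule some things in or out before looking at the real pipeline. The sketch below uses only the Python standard library and does not exercise the actual Claude, MCP client, or MCP server hops; it only shows what a lossless \uXXXX round trip and each Unicode normalization form do to a sample string. The name roundtrip_report is hypothetical.

```python
import json
import unicodedata


def roundtrip_report(text: str) -> dict[str, bool]:
    """Report which transformations leave the text unchanged."""
    report = {}
    # json.dumps escapes non-ASCII as \uXXXX by default (ensure_ascii=True).
    report["json_escaped"] = json.loads(json.dumps(text)) == text
    # Same round trip with raw UTF-8 instead of escapes.
    raw = json.dumps(text, ensure_ascii=False).encode("utf-8")
    report["json_raw_utf8"] = json.loads(raw.decode("utf-8")) == text
    # Each normalization form a hop in the pipeline might apply.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        report[form] = unicodedata.normalize(form, text) == text
    return report


if __name__ == "__main__":
    sample = "嬉しい 億 側"
    print(json.dumps(sample))
    print(roundtrip_report(sample))
```

Voiced kana such as が do change under NFD, so a mismatch there would reveal a normalizing hop. Unicode normalization does not map one unified ideograph to another, though, so a normalizing hop would not by itself explain substitutions like 億 to 亿.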

Environment

Claude Code on macOS Darwin 25.2.0
Model: Claude Opus 4.7 (1M context)
MCP server: Anthropic-managed Notion connector (mcp__notion__*)
Use case: Japanese business documents (TERIYAKI Inc.)

This appears to be a recurring issue. Documenting it here so the pattern is centrally visible for the team's CJK quality work.

extent analysis

TL;DR

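No root cause has been confirmed. The report points to two areas to investigate: tokenizer behavior for single-character CJK spans, and the JSON encoding pipeline between Claude, the MCP client, and the MCP server. Until one of those is confirmed and fixed, the practical mitigation is to include more context in old_str/new_str and to review every write.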

Guidance

Investigate the tokenizer's subword split behavior for single CJK characters to determine if it introduces incorrect neighbors, potentially causing the model to generate wrong characters.
Examine the JSON encoding pipeline between Claude and the MCP client/server for any Unicode normalization that might be contributing to the error.
Test whether including more context in old_str/new_str (e.g., 5+ characters instead of 1) reduces the error rate, as anecdotal evidence suggests this might be beneficial.
Consider comparing the error rate in non-MCP Edit tool calls on local files to determine if the issue is specific to MCP tool calls.

Example

The issue itself includes no code or configuration. Investigating the tokenizer and the JSON encoding pipeline would mean reviewing how Unicode and character encoding are handled at each hop.

Notes

The exact cause of the issue is not yet determined, and further investigation into the suggested areas is necessary. The problem seems to be related to how the model handles CJK characters, particularly in single-character replacements and generation tasks.

Recommendation

Until a permanent fix is available, include more context in old_str/new_str (the reporter suggests 5+ characters instead of 1) and manually review documents for incorrect characters. The reporter found, anecdotally, that the added context reduces the error rate.
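
One way to apply the "more context" advice consistently is to build each replacement pair programmatically from the page text. The following is a rough sketch under stated assumptions: the helper name widen_replacement is hypothetical, the old_str/new_str keys are taken from the field names mentioned in the issue, and the exact shape of a content_updates entry in the Notion connector is not verified here.

```python
def widen_replacement(text: str, index: int, new_char: str,
                      min_context: int = 2, max_context: int = 20) -> dict[str, str]:
    """Build an old_str/new_str pair that replaces text[index] with new_char,
    growing the surrounding window until old_str occurs exactly once in text."""
    if not 0 <= index < len(text):
        raise IndexError("index out of range")
    for ctx in range(min_context, max_context + 1):
        start = max(0, index - ctx)
        end = min(len(text), index + 1 + ctx)
        old_str = text[start:end]
        first = text.find(old_str)
        # Unique if no second occurrence (overlapping ones included) exists.
        if text.find(old_str, first + 1) == -1:
            new_str = text[start:index] + new_char + text[index + 1:end]
            return {"old_str": old_str, "new_str": new_str}
    raise ValueError("no unique context window found")


if __name__ == "__main__":
    page_text = "売上は1亿円、利益は2億円です。"
    print(widen_replacement(page_text, page_text.index("亿"), "億"))
```

Because the pair is derived from the fetched text by string slicing, only the single corrected character has to be supplied, which keeps the surrounding characters out of the error-prone generation path.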


