claude-code - 💡(How to fix) Fix Model emitted orphan \uD83A surrogate + garbled Korean inside AskUserQuestion, bricking session

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

API Error: 400 The request body is not valid JSON: no low surrogate in string: line 1 column 703863 (char 703862)

Root Cause

So at the same spot in the output, two layers broke at the same time. The surrounding Korean is valid Unicode but garbled as language; the orphan is invalid Unicode. The model sometimes drifts between scripts (Korean leaking into Japanese or Chinese codepoints), but those drifts stay inside valid Unicode and the parser is happy. This case is different because the encoding layer also broke. I cannot tell from outside whether these failures share the same root cause. I am only noting that the language-level corruption and the invalid surrogate appeared in the same local region of the same tool call option. The root cause could be model generation, decoding, serialization, persistence, or another layer.

Code Example

API Error: 400 The request body is not valid JSON: no low surrogate in string:
line 1 column 703863 (char 703862)

---

LINE 757 | role=assistant | ts=2026-05-23T05:23:54.040Z
LINE 758 | role=user      | ts=2026-05-23T05:24:47.296Z

---

"description":"wiki-lint.sh § 4 의 dec-* 검사 로직을 D-* 로 정정.
                완좌 단은 \uD83A는 이세 이될 — 크지는 5 분"
RAW_BUFFERClick to expand / collapse

During a long Claude Code session (Opus 4.7 1M context), the model appears to have emitted a lone UTF-16 high surrogate escape, \uD83A, inside an AskUserQuestion tool call.

A high surrogate without its matching low surrogate is not a valid Unicode scalar sequence. In this case, once that orphan was replayed as part of the session history, the Anthropic API rejected the request body as invalid JSON.

<img width="1748" height="854" alt="Image" src="https://github.com/user-attachments/assets/535b9af6-e16c-4a31-b34c-e2aa6646168d" />

Once that orphan landed in the session jsonl, every following turn replayed it as part of the conversation history, and the Anthropic API rejected each request with the same 400:

API Error: 400 The request body is not valid JSON: no low surrogate in string:
line 1 column 703863 (char 703862)

Every kind of input came back with the same 400 at the same offset: Korean text (뭐야, last-verified 누락 ...), English (wtf), a single character (?), and a clean ASCII Python snippet. The offset never moved across all of them, which pointed at something static inside the accumulated context rather than anything I was typing. /clear resolved it.

How I traced where the orphan came from

I ran two scans on the session jsonl files at ~/.claude/projects/<hash>/*.jsonl:

  1. UTF-8 raw byte scan for the byte pattern ED A0~BF (a broken surrogate written directly as UTF-8). This came up clean across every jsonl file and the rest of ~/.claude.
  2. JSON-escape regex scan for unpaired high-surrogate escapes (\uD800-\uDBFF) written as six-character ASCII escapes inside JSON strings.

The second scan found two orphans in the dead session, and both were the same \uD83A. I confirmed the source by checking the role field of each containing line:

LINE 757 | role=assistant | ts=2026-05-23T05:23:54.040Z
LINE 758 | role=user      | ts=2026-05-23T05:24:47.296Z

LINE 757 is the assistant turn that produced the orphan. LINE 758 is Claude Code re-quoting that same assistant tool call into the user-side conversation log (the "User answered Claude's questions" UI block), which is why paired=0, orphan=2 instead of just 1 — same string written twice from a single source.

The actual description field from LINE 757:

"description":"wiki-lint.sh § 4 의 dec-* 검사 로직을 D-* 로 정정.
               안 완좌 단은 \uD83A는 이세 이될 — 크지는 5 분"

The same tool call had three other options that came out clean

The AskUserQuestion here had four options. Only the one above came out broken. The other three were perfectly readable Korean from the same turn and the same inference run:

  • E (Recommended): 8 commits push 후 PR --base dev. 현재 코드 상태 독립 환경으로 공유 + [teammate-A]가 GitHub 에서 검토 가능...
  • B (above): garbled + orphan
  • A: postmortem 과 부속 몇 파일에 v2 필드 last_verified_at / last_verified_by 추가. wiki:lint 경고 해소
  • F: Phase A + audit + M3 strategy lock + orphan 완화 + memory 4 건 + seed 1 건 다 완료. 충분한 세션.

One option corrupted, three clean. That feels like a useful signal that whatever broke was local to one generated option, rather than affecting the whole turn.

Turn metadata from message.usage:

  • model: claude-opus-4-7
  • stop_reason: tool_use (clean termination, no truncation)
  • cache_read_input_tokens: 385045 (~38% of the 1M context window)
  • cache_creation_input_tokens: 3098
  • output_tokens: 1566

So this happened at around 38% context utilization, in a normal-sized 1.5k output turn, with a clean tool_use stop. Nothing in the metadata pointed at exhaustion or runaway output.

Why this looked different from normal language drift

The Korean around the orphan is broken at two distinct layers, and I think the distinction may be useful for triage:

  • Codepoint layer완좌, 이세 이될, 크지는 are all valid Hangul syllables. Each character sits inside U+AC00..U+D7AF, so Unicode-wise these are fine. (For reference, the model probably aimed for something close to "정정 작업, 약 5 분" — "the fix, about 5 minutes".)
  • Word / morpheme layer — those syllable combinations are not real Korean words. I'm a native Korean reader and the structure is incoherent at the morpheme level.
  • Surrogate \uD83A — this one breaks Unicode encoding outright (a lone high surrogate), which is what the API parser rejects.

So at the same spot in the output, two layers broke at the same time. The surrounding Korean is valid Unicode but garbled as language; the orphan is invalid Unicode. The model sometimes drifts between scripts (Korean leaking into Japanese or Chinese codepoints), but those drifts stay inside valid Unicode and the parser is happy. This case is different because the encoding layer also broke. I cannot tell from outside whether these failures share the same root cause. I am only noting that the language-level corruption and the invalid surrogate appeared in the same local region of the same tool call option. The root cause could be model generation, decoding, serialization, persistence, or another layer.

What I ruled out

  • Static file corruption — full ~/.claude byte scan came up clean (memory, hooks, settings, all jsonl files except the dead session)
  • Network truncation — the orphan was already written to disk before any retry, so this isn't a transport artifact
  • Harness-side string truncation — there is a [truncated] marker on MCP instructions in the system prompt which is a real channel for this kind of bug in principle, but the evidence here points at model emission rather than transport

A fix-layer thought (please weigh this lightly — you know your stack)

My guess is the client side is the better place to catch this, but I have no insight into your internals so take this as a suggestion rather than a recommendation:

  • Claude Code CLI — before serializing a tool call payload, scanning every string field for lone surrogates and failing that one turn fast would keep the session alive and surface the underlying model behavior cleanly. The user would lose one turn instead of the whole session.
  • API gateway — server-side sanitization would catch this across every client, which is valuable as a safety net, but it might also smooth over signal you'd want in telemetry.

Both make sense to me, and possibly both together. Layer-priority is your call.

Reproducibility

This was a single occurrence in a heavy session — body around 200KB, deep into the 1M context window, with a large skill set and a large MCP tool set loaded into the system prompt. I have not been able to reproduce it on demand from a fresh session with similar prompts. From the outside, I can only say that this happened in a large-context session and that the invalid escape was already present on disk before retries. I cannot prove whether context size was causally related.

What I can say with confidence is that the orphan was present in the on-disk jsonl before any retry, so this is not a network artifact.

Environment

  • Claude Code with Opus 4.7 (1M context), macOS Darwin 24.6.0
  • Trigger turn: 2026-05-23T05:23:54.040Z

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Model emitted orphan \uD83A surrogate + garbled Korean inside AskUserQuestion, bricking session