openclaw - ✅ (Solved) Fix [Bug]: Streaming Chinese character corruption - random single characters replaced with U+FFFD [1 pull request, 1 comment, 2 participants]

GitHub stats
openclaw/openclaw#50887 · Fetched 2026-04-08 01:06:56
Comments: 1 · Participants: 2 · Timeline: 4 · Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1, referenced ×1

When agents respond in Chinese via group chat (webchat), random individual Chinese characters are replaced with the U+FFFD replacement character. The pattern suggests a streaming token boundary issue where the last byte of a multi-byte UTF-8 sequence is occasionally dropped during stream assembly.

Fix Action

Fixed

PR fix notes

PR #50909: fix: handle UTF-8 multi-byte boundaries in streaming assembly

Description (problem / solution / changelog)

Summary

Fixes #50887 -- Chinese characters corrupted (replaced with U+FFFD) during webchat streaming.

  • Primary fix: src/infra/jsonl-socket.ts -- TCP socket data events deliver raw bytes that can split mid-character. Replaced data.toString("utf8") with StringDecoder from node:string_decoder, which buffers incomplete multi-byte sequences across chunks.
  • Secondary fix: src/agents/subagent-registry.ts -- Buffer.subarray(0, maxPayloadBytes) can cut inside a multi-byte UTF-8 character. Added a walk-back loop that retreats to the nearest character boundary before decoding.
  • Defensive fixes: src/memory/qmd-process.ts and src/tui/tui-local-shell.ts -- child process stdout/stderr data handlers had the same buf.toString("utf8") pattern; switched to StringDecoder.
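
The walk-back idea in the secondary fix can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name truncateUtf8 and its signature are hypothetical, and it assumes the input buffer holds valid UTF-8.

```typescript
// Hypothetical helper (not from the PR): truncate a UTF-8 buffer to at most
// maxBytes without cutting inside a multi-byte character.
function truncateUtf8(buf: Buffer, maxBytes: number): string {
  if (buf.length <= maxBytes) return buf.toString("utf8");
  let end = Math.max(0, maxBytes);
  // UTF-8 continuation bytes match the bit pattern 10xxxxxx. If the byte at
  // `end` is a continuation byte, cutting there would split a character, so
  // retreat until `end` points at the first byte of a character.
  while (end > 0 && (buf[end] & 0xc0) === 0x80) end--;
  return buf.subarray(0, end).toString("utf8");
}

// Each of these characters is 3 bytes, so a 4-byte limit falls inside "建".
const name = Buffer.from("张建洋", "utf8");
console.log(truncateUtf8(name, 4)); // 张
console.log(truncateUtf8(name, 6)); // 张建
```

The result may be shorter than the byte limit by up to three bytes, which is the price of never emitting a partial character.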

Root cause

Buffer.toString("utf8") on a chunk that ends mid-character (e.g., after byte 1 of a 3-byte Chinese character) produces U+FFFD for the partial sequence. The next chunk starts with the remaining bytes, which also decode incorrectly. StringDecoder holds incomplete trailing bytes until the next chunk arrives, producing correct characters.
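
The difference between the two decoding approaches can be reproduced in a few lines. The sketch below is illustrative only: the split position is chosen by hand to simulate a chunk boundary, and the sample string is taken from the issue report.

```typescript
import { StringDecoder } from "node:string_decoder";

// "日" is U+65E5, encoded in UTF-8 as the three bytes E6 97 A5.
const full = Buffer.from("3月18日", "utf8");

// Simulate a chunk boundary that falls after the first byte of "日".
const splitAt = full.length - 2;
const chunks = [full.subarray(0, splitAt), full.subarray(splitAt)];

// Naive approach: decode each chunk independently.
const naive = chunks.map((c) => c.toString("utf8")).join("");

// StringDecoder approach: incomplete trailing bytes are held back until the
// next write() supplies the rest of the character.
const decoder = new StringDecoder("utf8");
const buffered = chunks.map((c) => decoder.write(c)).join("") + decoder.end();

console.log(naive.includes("\uFFFD")); // true
console.log(buffered === "3月18日"); // true
```

Calling decoder.end() at the end flushes any bytes still held back, so a stream that genuinely ends mid-character is still reported rather than silently dropped.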

Test plan

  • Added test in src/infra/jsonl-socket.test.ts that sends a JSON line with Chinese characters split across two TCP writes at a multi-byte boundary; verifies the result contains the correct characters with no U+FFFD.
  • All 3 existing + 1 new test pass: pnpm test -- src/infra/jsonl-socket.test.ts
  • pnpm check passes (format, typecheck, lint)
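
The kind of check described in the test plan can be approximated without a real socket. The sketch below is an assumption-laden stand-in: createJsonlReader is a hypothetical line reader, not the code in src/infra/jsonl-socket.ts, and the two feed() calls play the role of two TCP writes.

```typescript
import { StringDecoder } from "node:string_decoder";

// Hypothetical JSONL reader: decodes byte chunks with StringDecoder and
// emits one parsed value per newline-terminated line.
function createJsonlReader(onLine: (value: unknown) => void) {
  const decoder = new StringDecoder("utf8");
  let pending = "";
  return (chunk: Buffer): void => {
    pending += decoder.write(chunk);
    let newline = pending.indexOf("\n");
    while (newline !== -1) {
      const line = pending.slice(0, newline).trim();
      pending = pending.slice(newline + 1);
      if (line.length > 0) onLine(JSON.parse(line));
      newline = pending.indexOf("\n");
    }
  };
}

const received: unknown[] = [];
const feed = createJsonlReader((value) => {
  received.push(value);
});

const payload = Buffer.from(JSON.stringify({ text: "特油盘点" }) + "\n", "utf8");
// Cut one byte into "盘" so the character is split across the two chunks.
const cut = payload.indexOf(Buffer.from("盘", "utf8")) + 1;
feed(payload.subarray(0, cut));
feed(payload.subarray(cut));

console.log(JSON.stringify(received));
```

If the reader decoded each chunk with toString("utf8") instead, the parsed text would contain U+FFFD in place of "盘".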

Changed files

  • src/agents/subagent-registry.ts (modified, +8/-1)
  • src/infra/jsonl-socket.test.ts (modified, +41/-0)
  • src/infra/jsonl-socket.ts (modified, +6/-1)
  • src/memory/qmd-process.ts (modified, +7/-2)
  • src/tui/tui-local-shell.ts (modified, +7/-2)

Bug type

Behavior bug (incorrect output/state without crash)

Summary

When agents respond in Chinese via group chat (webchat), random individual Chinese characters are replaced with the U+FFFD replacement character. The pattern suggests a streaming token boundary issue where the last byte of a multi-byte UTF-8 sequence is occasionally dropped during stream assembly.

Steps to reproduce

  1. Set up multiple agents in a group chat (webchat channel)
  2. Trigger an agent to produce a longer Chinese response (~200+ characters)
  3. Observe the response: random individual characters will appear as � (U+FFFD, whose UTF-8 encoding is the bytes EF BF BD)

Expected behavior

All Chinese characters should render correctly.

Actual behavior

Random individual Chinese characters are corrupted, e.g.:

  • 特油盘点 appears as 特油�点
  • 3月18日 appears as 3月18�
  • 张建洋 appears as 张�洋

The corruption rate is approximately 3-5 characters per message. Core meaning is preserved but readability is degraded.

Environment

  • OpenClaw version: 2026.3.11
  • OS: Ubuntu 24.04 LTS (Linux)
  • Channel: webchat group chat
  • Model provider: cursor2api-go (local reverse proxy, port 8003)
  • Model: claude-sonnet-4.6
  • Direct API test (curl to cursor2api): 0 corrupted characters
  • Conclusion: corruption happens between cursor2api output and OpenClaw message delivery

Additional context

This appears to be related to how OpenClaw assembles streaming tokens into complete messages. The cursor2api layer itself produces clean UTF-8 output (verified by direct curl test). The corruption only appears in the final message delivered to the chat UI.

extent analysis

Fix Plan

To address the issue of random Chinese characters being replaced with the U+FFFD replacement character due to a streaming token boundary problem, we will modify the token assembly logic in OpenClaw.

Step-by-Step Solution:

  1. Update Token Assembly Logic: Decode the byte stream incrementally rather than chunk by chunk. If a chunk ends with an incomplete multi-byte UTF-8 sequence, hold the trailing bytes and prepend them to the next chunk before decoding.
  2. Implement UTF-8 Validation: Check each assembled message for U+FFFD or invalid sequences before delivering it to the chat UI. Validation can only detect corruption: once bytes have been decoded to U+FFFD the original character cannot be recovered, so the decoding step itself has to be fixed.
  3. Adjust Buffer Handling: Wherever a buffer is truncated to a byte limit, cut at a character boundary so that a multi-byte character is never split. Changing the buffer size alone only moves the split points.

Example Code Snippet (Python):

import codecs

def assemble_tokens(chunks):
    """Decode an iterable of UTF-8 byte chunks into a single string."""
    # The incremental decoder buffers an incomplete trailing sequence
    # until the next chunk supplies the remaining bytes.
    decoder = codecs.getincrementaldecoder('utf-8')(errors='strict')
    parts = []
    for chunk in chunks:
        parts.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if the stream ended mid-character
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)

def is_valid_utf8(byte_string):
    """Return True if byte_string is a complete, valid UTF-8 sequence."""
    try:
        byte_string.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

Verification

To verify that the fix worked:

  • Test the group chat functionality with agents responding in Chinese.
  • Observe the responses for any corrupted characters.
  • Use tools like curl to test the API directly and compare the output with the messages received in the chat UI.
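
The comparison in the last step can be automated by scanning both outputs for U+FFFD. The helper below is a hypothetical sketch (findReplacementChars is not part of OpenClaw), and the two sample strings stand in for the direct API output and the message delivered to the chat UI.

```typescript
// Hypothetical verification helper: return the index of every U+FFFD
// replacement character found in a string.
function findReplacementChars(text: string): number[] {
  const positions: number[] = [];
  let i = text.indexOf("\uFFFD");
  while (i !== -1) {
    positions.push(i);
    i = text.indexOf("\uFFFD", i + 1);
  }
  return positions;
}

// Stand-ins for the two outputs being compared.
const directOutput = "张建洋";
const deliveredMessage = "张\uFFFD洋";

console.log(findReplacementChars(directOutput)); // []
console.log(findReplacementChars(deliveredMessage)); // [ 1 ]
```

A clean direct output combined with a non-empty result for the delivered message points at the assembly layer in between, which matches the conclusion in the issue report.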

Extra Tips

  • Ensure that all components in the pipeline (including any reverse proxies) are configured to handle UTF-8 encoding correctly.
  • Regularly review and test the handling of multi-byte character sequences in the token assembly logic to prevent similar issues in the future.
