openclaw - ✅(Solved) Fix [Bug]: Streaming Chinese character corruption - random single characters replaced with U+FFFD [1 pull requests, 1 comments, 2 participants]

openclaw · 2026-03-20 06:26:31

GitHub stats

openclaw/openclaw#50887 • Fetched 2026-04-08 01:06:56

Author

ColaFatty

Participants

ColaFatty

Ryce

Timeline (top)

cross-referenced ×2, commented ×1, referenced ×1

When agents respond in Chinese via group chat (webchat), random individual Chinese characters are replaced with the U+FFFD replacement character. The pattern suggests a streaming token boundary issue where the last byte of a multi-byte UTF-8 sequence is occasionally dropped during stream assembly.

Root Cause

Fix Action

Fixed

Fixed by PR: fix: handle UTF-8 multi-byte boundaries in streaming assembly (https://github.com/openclaw/openclaw/pull/50909)

PR fix notes

PR #50909: fix: handle UTF-8 multi-byte boundaries in streaming assembly

Repository: openclaw/openclaw
Author: bugkill3r
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/50909

Description (problem / solution / changelog)

Summary

Fixes #50887 -- Chinese characters corrupted (replaced with U+FFFD) during webchat streaming.

Primary fix: src/infra/jsonl-socket.ts -- TCP socket data events deliver raw bytes that can split mid-character. Replaced data.toString("utf8") with StringDecoder from node:string_decoder, which buffers incomplete multi-byte sequences across chunks.
Secondary fix: src/agents/subagent-registry.ts -- Buffer.subarray(0, maxPayloadBytes) can cut inside a multi-byte UTF-8 character. Added a walk-back loop that retreats to the nearest character boundary before decoding.
Defensive fixes: src/memory/qmd-process.ts and src/tui/tui-local-shell.ts -- child process stdout/stderr data handlers had the same buf.toString("utf8") pattern; switched to StringDecoder.
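
The walk-back idea behind the secondary fix can be sketched in Python. This is an illustrative analogue, not the code from the PR: the function name truncate_utf8 and its parameters are hypothetical, and it assumes the input is valid UTF-8 and max_bytes is non-negative.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a UTF-8 character.

    Hypothetical helper; assumes data is valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    # If the first excluded byte is a continuation byte, the cut falls
    # inside a character, so retreat until it lands on a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


if __name__ == "__main__":
    raw = "张建洋".encode("utf-8")  # 9 bytes, 3 per character
    print(truncate_utf8(raw, 4).decode("utf-8"))
```

A plain slice raw[:4] would keep one stray byte of the second character, and decoding it would yield a replacement character; the walk-back version drops the partial character instead.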

Root cause

Buffer.toString("utf8") on a chunk that ends mid-character (e.g., after byte 1 of a 3-byte Chinese character) produces U+FFFD for the partial sequence. The next chunk starts with the remaining bytes, which also decode incorrectly. StringDecoder holds incomplete trailing bytes until the next chunk arrives, producing correct characters.
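
To make the failure mode concrete, here is a small Python analogue. It is a sketch under assumptions: bytes.decode(errors="replace") stands in for Buffer.toString("utf8"), and the exact number of replacement characters produced for a broken sequence may differ between runtimes. The split position is chosen by hand for illustration.

```python
text = "特油盘点"
data = text.encode("utf-8")  # 12 bytes, 3 bytes per character

# Split one byte into the third character, as a TCP chunk boundary might.
first, second = data[:7], data[7:]

# Decoding each chunk independently corrupts the character that straddles
# the boundary: neither half is a valid UTF-8 sequence on its own.
naive = first.decode("utf-8", errors="replace") + second.decode("utf-8", errors="replace")

# The bytes themselves are intact: joining them before decoding is clean.
joined = (first + second).decode("utf-8")

if __name__ == "__main__":
    print("naive: ", naive)
    print("joined:", joined)
```

The characters on either side of the boundary survive, which matches the reported symptom of isolated corrupted characters inside otherwise readable text.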

Test plan

Added test in src/infra/jsonl-socket.test.ts that sends a JSON line with Chinese characters split across two TCP writes at a multi-byte boundary; verifies the result contains the correct characters with no U+FFFD.
All 3 existing + 1 new test pass: pnpm test -- src/infra/jsonl-socket.test.ts
pnpm check passes (format, typecheck, lint)
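
The same kind of check can be sketched in Python, splitting a JSON line at every possible byte offset rather than at a single boundary. Note that this uses a different technique from the PR: instead of a streaming decoder, it buffers raw bytes and decodes only complete lines, which is safe because the newline byte never occurs inside a multi-byte UTF-8 sequence. JsonlAssembler and check_all_split_points are hypothetical names, not part of OpenClaw.

```python
import json


class JsonlAssembler:
    """Hypothetical helper: collects byte chunks, returns parsed JSON lines."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        self._buf.extend(chunk)
        messages = []
        while True:
            newline = self._buf.find(b"\n")
            if newline == -1:
                break  # incomplete line: keep the bytes for the next chunk
            line = bytes(self._buf[:newline])
            del self._buf[: newline + 1]
            if line.strip():
                # Only complete lines are decoded, so no character is split.
                messages.append(json.loads(line.decode("utf-8")))
        return messages


def check_all_split_points(payload: dict) -> bool:
    """Send one JSON line as two writes, cut at every byte offset."""
    wire = json.dumps(payload, ensure_ascii=False).encode("utf-8") + b"\n"
    for cut in range(1, len(wire)):
        assembler = JsonlAssembler()
        received = assembler.feed(wire[:cut]) + assembler.feed(wire[cut:])
        if received != [payload]:
            return False
    return True


if __name__ == "__main__":
    print(check_all_split_points({"text": "3月18日 张建洋"}))
```

Testing every offset covers all positions inside every multi-byte character, so the result does not depend on picking the right boundary by hand.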

Changed files

src/agents/subagent-registry.ts (modified, +8/-1)
src/infra/jsonl-socket.test.ts (modified, +41/-0)
src/infra/jsonl-socket.ts (modified, +6/-1)
src/memory/qmd-process.ts (modified, +7/-2)
src/tui/tui-local-shell.ts (modified, +7/-2)


Bug type

Behavior bug (incorrect output/state without crash)

Summary

Steps to reproduce

Set up multiple agents in a group chat (webchat channel)
Trigger an agent to produce a longer Chinese response (~200+ characters)
Observe the response: random individual characters appear as � (U+FFFD, encoded in UTF-8 as the bytes EF BF BD)

Expected behavior

All Chinese characters should render correctly.

Actual behavior

Random individual Chinese characters are corrupted, e.g.:

特油盘点 appears as 特油�点
3月18日 appears as 3月18�
张建洋 appears as 张�洋

The corruption rate is approximately 3-5 characters per message. Core meaning is preserved but readability is degraded.

Environment

OpenClaw version: 2026.3.11
OS: Ubuntu 24.04 LTS (Linux)
Channel: webchat group chat
Model provider: cursor2api-go (local reverse proxy, port 8003)
Model: claude-sonnet-4.6
Direct API test (curl to cursor2api): 0 corrupted characters
Conclusion: corruption happens between cursor2api output and OpenClaw message delivery

Additional context

This appears to be related to how OpenClaw assembles streaming tokens into complete messages. The cursor2api layer itself produces clean UTF-8 output (verified by direct curl test). The corruption only appears in the final message delivered to the chat UI.

extent analysis

Fix Plan

To address the issue of random Chinese characters being replaced with the U+FFFD replacement character due to a streaming token boundary problem, we will modify the token assembly logic in OpenClaw.

Step-by-Step Solution:

Update Token Assembly Logic: Ensure that the token assembly mechanism correctly handles multi-byte UTF-8 sequences by checking for incomplete sequences at the end of each token and waiting for the next token to complete the sequence if necessary.
Implement UTF-8 Validation: Validate the UTF-8 encoding of each assembled message to detect and correct any invalid sequences before delivering the message to the chat UI.
Adjust Buffer Handling: Review and adjust the buffer size and handling in the streaming token assembly process to prevent the truncation of multi-byte UTF-8 characters.

Example Code Snippet (Python):

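import codecs

def assemble_tokens(chunks):
    """Decode an iterable of UTF-8 byte chunks into one string.

    A chunk may end in the middle of a multi-byte character.
    """
    # The incremental decoder holds incomplete trailing bytes and emits
    # the character once the remaining bytes arrive in a later chunk.
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    parts = []
    for chunk in chunks:
        parts.append(decoder.decode(chunk))
    # final=True flushes any leftover bytes; a tail that is still
    # incomplete at the end of the stream becomes U+FFFD.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)

def is_valid_utf8(byte_string):
    """Return True if byte_string is a complete, valid UTF-8 sequence."""
    try:
        byte_string.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False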

Verification

To verify that the fix worked:

Test the group chat functionality with agents responding in Chinese.
Observe the responses for any corrupted characters.
Use tools like curl to test the API directly and compare the output with the messages received in the chat UI.
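
The comparison in the last step can be automated with a small script. This is a minimal sketch: corruption_report is a hypothetical helper, and it assumes that both the direct API output and the text delivered to the chat UI have already been captured as strings.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def corruption_report(reference: str, delivered: str) -> dict:
    """Compare the delivered message against a known-good reference."""
    positions = [i for i, ch in enumerate(delivered) if ch == REPLACEMENT]
    return {
        "replacement_count": len(positions),
        "positions": positions,
        "matches_reference": delivered == reference,
    }


if __name__ == "__main__":
    print(corruption_report("特油盘点", "特油\ufffd点"))
```

A non-zero replacement_count on a message whose reference contains none points at corruption introduced after the provider, as described in the Environment section.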

Extra Tips

Ensure that all components in the pipeline (including any reverse proxies) are configured to handle UTF-8 encoding correctly.
Regularly review and test the handling of multi-byte character sequences in the token assembly logic to prevent similar issues in the future.
