openclaw - ✅(Solved) Fix [Bug] Memory indexing stalls permanently when chunker splits an emoji surrogate pair (refs #27753) [1 pull requests, 1 participants]

jensenwang560-blip · 2026-04-13T07:35:13Z



openclaw/openclaw#65782•Fetched 2026-04-14 05:40:19


Memory indexing stalls permanently for any agent whose content contains an emoji or supplementary-plane character that the chunker happens to split across a surrogate-pair boundary. Reproduced live on openclaw 2026.4.5.

This is the same family of bug as #27753 (closed as stale 2026-03-29 with no fix merged). Filing a fresh report with a current-version reproduction, root-cause analysis, and a PR.

Error Message

[memory] embeddings: batch start  (×119)
Memory index failed (clawdea): openai embeddings failed: 500
{"error":{"message":"litellm.APIConnectionError: AzureException APIConnectionError -
'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed

Root Cause

The chunker chunkMarkdown in packages/memory-host-sdk/src/host/internal.ts has two split passes:

Inner (fine) pass — already guards against splitting inside a UTF-16 surrogate pair (line 414).
Outer (coarse) pass — for (let start = 0; start < line.length; start += maxChars) advances by a fixed number of UTF-16 code units without checking the boundary. When start + maxChars lands inside a surrogate pair, the chunk ends with a lone high surrogate and the next chunk begins with a lone low surrogate.

This bug is invisible to the existing surrogate-pair test (internal.test.ts, "does not break surrogate pairs when splitting long CJK lines") because that test uses a line of all surrogate-pair characters. The high estimateStringChars weight forces the inner pass to run, where the boundary is correctly handled. A line that is mostly low-cost (Latin) with a single emoji at the right offset routes through the outer pass only and triggers the bug.
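The outer-pass failure described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `naiveCoarseSplit` and `hasLoneSurrogate` are hypothetical names, and the stride of 40 is borrowed from the regression test described in the PR.

```typescript
// Hypothetical stand-in for the outer coarse pass: fixed-stride slicing
// by UTF-16 code units, with no surrogate-boundary check.
function naiveCoarseSplit(line: string, maxChars: number): string[] {
  const parts: string[] = [];
  for (let start = 0; start < line.length; start += maxChars) {
    parts.push(line.slice(start, start + maxChars));
  }
  return parts;
}

// True when the text contains a high surrogate with no following low
// surrogate, or a low surrogate with no preceding high surrogate.
function hasLoneSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair: skip its low surrogate
        continue;
      }
      return true;
    }
    if (code >= 0xdc00 && code <= 0xdfff) return true;
  }
  return false;
}

// Mostly-Latin line with one emoji straddling offset 40 (80 code units total).
const line = "a".repeat(39) + "🌸" + "b".repeat(39);
const parts = naiveCoarseSplit(line, 40);

console.log(parts.map((p) => p.length)); // [ 40, 40 ]
console.log(parts.map(hasLoneSurrogate)); // [ true, true ]
```

Both halves are contaminated: the first ends in `\uD83C` and the second starts with `\uDF38`, which matches the character named in the error message.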

Once a contaminated chunk is produced, the embedding HTTP client (OpenAI-compatible, here proxied via LiteLLM) sends the raw text into Python's JSON encoder, which rejects lone surrogates with the 500 above. The retry policy in manager-embedding-policy.ts only matches rate_limit | too many requests | 429 | 5\d\d | cloudflare | tokens per day. The UnicodeEncodeError arrives in a 500-shaped envelope, but its body reads as a connection error rather than a transient failure, so the policy does not treat it as retryable. All progress in the run is rolled back. Every subsequent openclaw memory index invocation re-chunks, re-embeds, fails on the same batch, and rolls back again, so indexing is permanently stuck.
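One plausible way a lone surrogate survives the trip to the proxy is sketched below. It assumes the client serializes the request body with `JSON.stringify`, which the report does not confirm; the `input` field name is likewise only illustrative.

```typescript
// A chunk that ends in a lone high surrogate, as the coarse pass can produce.
const contaminated = "abc\uD83C";

// JSON.stringify does not throw on lone surrogates; it emits them as
// a \uXXXX escape, so the request body itself is valid JSON.
const body = JSON.stringify({ input: [contaminated] });
console.log(body); // {"input":["abc\ud83c"]}

// Any JSON parser on the receiving side reconstructs the lone surrogate.
const roundTripped: string = JSON.parse(body).input[0];
console.log(roundTripped === contaminated); // true
```

Under that assumption the JavaScript side never sees an error: the failure only surfaces when the receiving process tries to encode the decoded string as UTF-8, which is consistent with the `'utf-8' codec can't encode character '\ud83c'` message.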

Fix Action

Fix

PR #<TBD> does both:

Root cause — internal.ts: outer chunker loop now extends the cut by one code unit when the trailing char is a high surrogate, mirroring the inner loop's existing logic. Adds a regression test that exercises the exact-boundary case (39 ASCII + 🌸).
Defense in depth — manager-embedding-ops.ts: stripUnpairedSurrogates replaces lone surrogates with U+FFFD before embedBatch / embedBatchInputs. Adds focused unit tests.

PR fix notes

PR #65783: fix(memory): preserve surrogate pairs in chunker; sanitize embed inputs

Repository: openclaw/openclaw
Author: jensenwang560-blip
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/65783

Description (problem / solution / changelog)

Summary

Fix the outer coarse-split loop in chunkMarkdown so it does not bisect a UTF-16 surrogate pair (root cause of indexing stalls).
Add a defensive stripUnpairedSurrogates sanitizer at the embed boundary (the safety net #27753 originally requested).
Tests for both, including a new regression test that triggers the exact code path the existing surrogate-pair test missed.

Closes #27753. Fixes #65782.

Why

End-to-end symptom (verbose):

[memory] embeddings: batch start  (×N)
Memory index failed (<agent>): openai embeddings failed: 500
{"error":{"message":"... 'utf-8' codec can't encode character '\\ud83c'
in position 3229: surrogates not allowed ..."}}

Two failure modes compound:

Chunker (packages/memory-host-sdk/src/host/internal.ts) — the inner fine-split loop already guards against splitting inside a surrogate pair, but the outer coarse loop advances by start += maxChars without checking the boundary. When the cut lands inside e.g. an emoji like 🌸 (U+1F338 = \uD83C\uDF38), the chunk ends with a lone high surrogate and the next begins with a lone low surrogate. The existing surrogate-pair test ("does not break surrogate pairs when splitting long CJK lines") does not catch this because it uses an all-surrogate-pair line — the high estimateStringChars weight forces the inner pass, which is correct.
Embedding HTTP path — Python JSON encoders (e.g. via LiteLLM proxies) reject lone surrogates with a 500 wrapping a UnicodeEncodeError. The retry policy only matches rate_limit | 429 | 5\d\d | cloudflare | tokens per day, but the body shape is a connect-error, not a transient one — and the indexer is transactional, so a single bad batch rolls back the entire run. Every subsequent openclaw memory index invocation reproduces the same failure.

What changed

`packages/memory-host-sdk/src/host/internal.ts`

- for (let start = 0; start < line.length; start += maxChars) {
-   const coarse = line.slice(start, start + maxChars);
+ for (let start = 0; start < line.length; ) {
+   let coarseEnd = Math.min(line.length, start + maxChars);
+   // Avoid splitting inside a UTF-16 surrogate pair (emoji, CJK Extension B+).
+   if (coarseEnd < line.length) {
+     const lastCode = line.charCodeAt(coarseEnd - 1);
+     if (lastCode >= 0xd800 && lastCode <= 0xdbff) {
+       coarseEnd += 1; // include the trailing low surrogate
+     }
+   }
+   const coarse = line.slice(start, coarseEnd);
+   start = coarseEnd;

Mirrors the existing inner-loop guard at the same file/line. New regression test ("does not break surrogate pairs at the coarse split boundary (issue #27753)") constructs a line of 39 ASCII + 🌸 + 39 ASCII so the outer cut at start + maxChars = 40 lands between the high and low surrogate. Without the fix, the test fails by producing chunks containing lone surrogates.
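The boundary case can be checked outside the repository with a small sketch. `coarseSplit` below is a hypothetical extraction of the patched loop, not the real `chunkMarkdown`, and `isWellFormed` relies on `encodeURIComponent` throwing a `URIError` for lone surrogates.

```typescript
// Hypothetical extraction of the patched outer loop from the diff above.
function coarseSplit(line: string, maxChars: number): string[] {
  const parts: string[] = [];
  for (let start = 0; start < line.length; ) {
    let coarseEnd = Math.min(line.length, start + maxChars);
    if (coarseEnd < line.length) {
      const lastCode = line.charCodeAt(coarseEnd - 1);
      if (lastCode >= 0xd800 && lastCode <= 0xdbff) {
        coarseEnd += 1; // include the trailing low surrogate
      }
    }
    parts.push(line.slice(start, coarseEnd));
    start = coarseEnd;
  }
  return parts;
}

// encodeURIComponent throws URIError when the input has a lone surrogate.
function isWellFormed(text: string): boolean {
  try {
    encodeURIComponent(text);
    return true;
  } catch {
    return false;
  }
}

// Same shape as the regression test: 39 ASCII + 🌸 + 39 ASCII, maxChars = 40.
const line = "a".repeat(39) + "🌸" + "b".repeat(39);
const parts = coarseSplit(line, 40);

console.log(parts.map((p) => p.length)); // [ 41, 39 ]
console.log(parts.every(isWellFormed)); // true
console.log(parts.join("") === line); // true
```

The first part grows by one code unit to keep the emoji whole, so a chunk may exceed `maxChars` by at most one code unit.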

`extensions/memory-core/src/memory/manager-embedding-ops.ts`

Adds a tiny exported helper:

export function stripUnpairedSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD",
  );
}

…and applies it at the entry of embedBatchWithRetry and embedBatchInputsWithRetry. This is the change #27753 originally requested. It protects against any other code path that might produce a lone surrogate (LLM tool output, file ingestion, future chunkers) — chunker-only fix is necessary but not sufficient.

Adds manager-embedding-ops.test.ts with focused unit tests for the helper (preserves valid pairs, replaces lone high/low surrogates, handles reversed pairs, no-op on empty/non-string).
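The behaviours listed for the helper can be illustrated with a loop-based stand-in. `replaceLoneSurrogates` is a hypothetical name used here to avoid depending on regex lookbehind support; it is intended to produce the same result as the regex in `stripUnpairedSurrogates` but is not the PR's code.

```typescript
// Hypothetical loop-based equivalent of the regex helper: keeps valid
// high+low pairs, replaces every unpaired surrogate with U+FFFD.
function replaceLoneSurrogates(text: string): string {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += text.slice(i, i + 2); // valid pair, copy both units
        i++;
      } else {
        out += "\uFFFD"; // lone high surrogate
      }
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      out += "\uFFFD"; // lone low surrogate
    } else {
      out += text.charAt(i);
    }
  }
  return out;
}

const validPair = replaceLoneSurrogates("ok 🌸"); // unchanged
const loneHigh = replaceLoneSurrogates("abc\uD83C"); // "abc\uFFFD"
const loneLow = replaceLoneSurrogates("\uDF38abc"); // "\uFFFDabc"
const reversed = replaceLoneSurrogates("\uDF38\uD83C"); // "\uFFFD\uFFFD"
const empty = replaceLoneSurrogates(""); // ""
```

A reversed pair yields two replacement characters because neither code unit has a valid partner in the required order.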

Test plan

Verified live on installed runtime (openclaw 2026.4.5) by patching the equivalent bundled artifact under dist/manager-CKYnEo0k.js and running openclaw memory index --agent <id>:

	before	after
clawdea files	6	166
clawdea chunks	93	4584
dev files	2	17
dev chunks	251	588

openclaw memory search works end-to-end after the fix.
CI runs pnpm test against the new unit tests in extensions/memory-core and packages/memory-host-sdk.
Existing surrogate-pair test ("does not break surrogate pairs when splitting long CJK lines") still passes.

Notes for reviewers

I did not run pnpm install locally to execute vitest — the runtime fix is verified live, and the new tests follow the same vitest patterns as adjacent files (manager-embedding-cache.test.ts, internal.test.ts).
The retry-policy gap (fetch failed / UnicodeEncodeError 500 not retried) is a separate concern tracked in #56815 / #58255 — out of scope for this PR. With the chunker + sanitizer in place, the original failure no longer reaches the retry path.
If the team prefers the helper to live in a shared utility module (e.g. under packages/memory-host-sdk) rather than manager-embedding-ops.ts, happy to relocate.

Changed files

extensions/memory-core/src/memory/manager-embedding-ops.test.ts (added, +36/-0)
extensions/memory-core/src/memory/manager-embedding-ops.ts (modified, +28/-4)
packages/memory-host-sdk/src/host/internal.test.ts (modified, +27/-0)
packages/memory-host-sdk/src/host/internal.ts (modified, +11/-2)


Reproduction

Two locally-running agents (clawdea, dev) both stopped indexing on 2026-04-01 and were never able to recover via openclaw memory index. Symptom from the indexer (verbose):

[memory] embeddings: batch start  (×119)
Memory index failed (clawdea): openai embeddings failed: 500
{"error":{"message":"litellm.APIConnectionError: AzureException APIConnectionError -
'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed

State at the time of failure:

	clawdea	dev
session jsonl files on disk	159	11
files indexed in sqlite	3	1
`memory status` reports `dirty`	no	no
`memory status` reports `issues`	none	none

So memory status --deep happily reports the index as clean while ~98% of sessions are unindexed.


Why #27753 was right and still applies

#27753 ("[Bug]: Embedding sync loop — lone surrogates in session content not sanitized before sending to embeddings API") proposed sanitizing lone surrogates with U+FFFD at the embed boundary. That recommendation is still correct — even with the chunker fixed, defense-in-depth at the embed call protects against any future code path that might produce a lone surrogate (LLM tool output, file ingestion, etc.).
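Sanitizing at the embed boundary can be sketched as a wrapper around whatever function performs the embedding call. Everything named here (`EmbedFn`, `sanitizeForEmbedding`, `withSanitizedInputs`, `fakeEmbed`) is hypothetical and only illustrates the idea; the PR applies its own helper inside `embedBatchWithRetry` and `embedBatchInputsWithRetry`.

```typescript
// Hypothetical signature for an embedding call: texts in, vectors out.
type EmbedFn = (inputs: string[]) => Promise<number[][]>;

// Array.from walks the string by code point, so a surrogate that shows up
// as a single code unit is necessarily unpaired.
function sanitizeForEmbedding(text: string): string {
  return Array.from(text, (ch) => {
    const code = ch.charCodeAt(0);
    const lone = ch.length === 1 && code >= 0xd800 && code <= 0xdfff;
    return lone ? "\uFFFD" : ch;
  }).join("");
}

// Wraps any embed function so every input is sanitized first, regardless
// of which upstream code path produced it.
function withSanitizedInputs(embed: EmbedFn): EmbedFn {
  return (inputs) => embed(inputs.map(sanitizeForEmbedding));
}

// Fake embedder that records what it was asked to embed.
const seen: string[][] = [];
const fakeEmbed: EmbedFn = async (inputs) => {
  seen.push(inputs);
  return inputs.map(() => [0]);
};

const safeEmbed = withSanitizedInputs(fakeEmbed);
void safeEmbed(["ok 🌸", "bad\uD83C"]);
```

Placing the guard in a wrapper keeps it independent of the chunker, which is the point of the defense-in-depth argument: a lone surrogate from any source is neutralized before it leaves the process.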


Live verification

Patched the equivalent bundled artifact under /opt/homebrew/lib/node_modules/openclaw/dist/ and reran openclaw memory index against both stalled agents:

	before	after
clawdea files	6	166
clawdea chunks	93	4584
dev files	2	17
dev chunks	251	588

Search works end-to-end after the fix.

Environment

OpenClaw: 2026.4.5 (3e72c03)
OS: macOS 26.x (Darwin 25.3.0, arm64)
Node: 22.x
Embedding model: text-embedding-3-small via OpenAI-compatible endpoint (LiteLLM proxy)

Related

#27753 — the closed-as-stale predecessor with the original sanitize-at-embed proposal
#56815 — 'TypeError: fetch failed' after ~40K chunks (related family — bare network errors not in retry allowlist)
#58255 — Gemini memory indexing fails with fetch failed while direct Node fetch succeeds (same retry-policy gap)

extent analysis

TL;DR

The most likely fix for the memory indexing stall issue is to update the chunker logic in internal.ts to handle surrogate pairs correctly and add defense-in-depth by sanitizing lone surrogates before sending to the embeddings API.

Guidance

Review the internal.ts file and update the outer chunker loop to extend the cut by one code unit when the trailing character is a high surrogate, as proposed in the fix.
Add a regression test to exercise the exact-boundary case, such as 39 ASCII characters followed by an emoji (🌸).
Implement the stripUnpairedSurrogates function in manager-embedding-ops.ts to replace lone surrogates with U+FFFD before calling embedBatch or embedBatchInputs.
Verify the fix by running openclaw memory index against the stalled agents and checking the indexing progress.

Example

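// internal.ts (sketch of the outer coarse-split loop)
for (let start = 0; start < line.length; ) {
  let coarseEnd = Math.min(line.length, start + maxChars);
  // If the cut would end on a high surrogate, extend it by one code unit
  // so the trailing low surrogate stays in the same chunk.
  const lastCode = line.charCodeAt(coarseEnd - 1);
  if (coarseEnd < line.length && lastCode >= 0xd800 && lastCode <= 0xdbff) {
    coarseEnd += 1;
  }
  const coarse = line.slice(start, coarseEnd);
  start = coarseEnd;
  // ...
}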

Notes

The proposed fix and tests should address the issue, but the changes still need review and testing to confirm they introduce no regressions. In particular, stripUnpairedSurrogates should leave valid surrogate pairs untouched and replace only lone surrogates, so that well-formed text reaches the embedder unmodified.

Recommendation

Apply the proposed fix, which includes updating the chunker logic and adding defense-in-depth by sanitizing lone surrogates. This approach addresses the root cause of the issue and provides an additional layer of protection against similar problems in the future.
