openclaw - ✅ (Solved) Fix [Bug] Memory indexing stalls permanently when chunker splits an emoji surrogate pair (refs #27753) [1 pull request, 1 participant]

GitHub stats
openclaw/openclaw#65782 · Fetched 2026-04-14 05:40:19
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): cross-referenced ×3, referenced ×3

Memory indexing stalls permanently for any agent whose content contains an emoji or supplementary-plane character that the chunker happens to split across a surrogate-pair boundary. Reproduced live on openclaw 2026.4.5.

This is the same family of bug as #27753 (closed as stale 2026-03-29 with no fix merged). Filing a fresh report with a current-version reproduction, root-cause analysis, and a PR.

Error Message

[memory] embeddings: batch start (×119)
Memory index failed (clawdea): openai embeddings failed: 500
{"error":{"message":"litellm.APIConnectionError: AzureException APIConnectionError -
'utf-8' codec can't encode character '\ud83c' in position 3229: surrogates not allowed
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83c' in position 3229: surrogates not allowed

Root Cause

The chunker chunkMarkdown in packages/memory-host-sdk/src/host/internal.ts has two split passes:

  • Inner (fine) pass: already guards against splitting inside a UTF-16 surrogate pair (line 414).
  • Outer (coarse) pass: for (let start = 0; start < line.length; start += maxChars) advances by a fixed number of UTF-16 code units without checking the boundary. When start + maxChars lands inside a surrogate pair, the chunk ends with a lone high surrogate and the next chunk begins with a lone low surrogate.

This bug is invisible to the existing surrogate-pair test (internal.test.ts, "does not break surrogate pairs when splitting long CJK lines") because that test uses a line of all surrogate-pair characters. The high estimateStringChars weight forces the inner pass to run, where the boundary is correctly handled. A line that is mostly low-cost (Latin) with a single emoji at the right offset routes through the outer pass only and triggers the bug.
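To make the boundary case concrete, here is a minimal TypeScript sketch. It is not the actual chunkMarkdown code: naiveCoarseSplit and hasLoneSurrogate are hypothetical helpers that mimic only the fixed-stride loop described above, and the input (39 ASCII characters, one emoji, maxChars = 40) is borrowed from the regression test the issue describes.

```typescript
// Hypothetical stand-in for the outer (coarse) pass: fixed stride, no boundary check.
function naiveCoarseSplit(line: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < line.length; start += maxChars) {
    chunks.push(line.slice(start, start + maxChars));
  }
  return chunks;
}

// True if the text contains a high surrogate with no low surrogate after it,
// or a low surrogate with no high surrogate before it.
function hasLoneSurrogate(text: string): boolean {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/.test(text);
}

// 39 Latin letters put the emoji's high surrogate at index 39,
// so a cut at index 40 falls between the two halves of the pair.
const mostlyLatinLine = "a".repeat(39) + "🌸" + "b".repeat(39);
const naiveChunks = naiveCoarseSplit(mostlyLatinLine, 40);

console.log(hasLoneSurrogate(mostlyLatinLine));  // false
console.log(naiveChunks.map(hasLoneSurrogate));  // [ true, true ]
```

The source line is well-formed, yet both resulting chunks are not: the first ends with the high surrogate and the second starts with the low surrogate.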

Once a contaminated chunk is produced, the embedding HTTP client (OpenAI-compatible, here proxied via LiteLLM) sends the raw text into Python's JSON encoder, which rejects lone surrogates with the 500 above. The retry policy in manager-embedding-policy.ts only matches rate_limit | too many requests | 429 | 5\d\d | cloudflare | tokens per day. A bare UnicodeEncodeError returned in a 500-shaped envelope falls outside what that regex is meant to catch: the response body looks like a connection error, not a retryable one. All progress in the run is rolled back, and every subsequent openclaw memory index invocation re-chunks, re-embeds, fails on the same batch, and rolls back again. Indexing is permanently stuck.
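One way to see why the failure only surfaces on the server side: since ES2019, JSON.stringify serializes a lone surrogate as an escape sequence instead of throwing, so the request can leave Node as valid JSON. The sketch below shows only the JavaScript side, and the payload shape is a simplified assumption, not the request the client actually builds.

```typescript
// A chunk that ends with a lone high surrogate (the first half of 🌸, U+1F338).
const contaminatedChunk = "abc\uD83C";

// Simplified, assumed payload shape for an OpenAI-compatible embeddings request.
const requestBody = JSON.stringify({ input: [contaminatedChunk] });

// The lone surrogate is emitted as an escape: valid JSON, no client-side error.
console.log(requestBody); // {"input":["abc\ud83c"]}

// Parsing the body yields the same lone surrogate again, so any consumer that
// decodes the JSON and then encodes the text strictly as UTF-8 will hit it.
const roundTripped: string = JSON.parse(requestBody).input[0];
console.log(roundTripped.charCodeAt(3).toString(16)); // d83c
```

This matches the '\ud83c' that appears in the error message above, where the Python side reports that the 'utf-8' codec cannot encode it.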

Fix Action

Fix

PR #<TBD> does both:

  1. Root cause (internal.ts): outer chunker loop now extends the cut by one code unit when the trailing char is a high surrogate, mirroring the inner loop's existing logic. Adds a regression test that exercises the exact-boundary case (39 ASCII + 🌸).
  2. Defense in depth (manager-embedding-ops.ts): stripUnpairedSurrogates replaces lone surrogates with U+FFFD before embedBatch / embedBatchInputs. Adds focused unit tests.

PR fix notes

PR #65783: fix(memory): preserve surrogate pairs in chunker; sanitize embed inputs

Description (problem / solution / changelog)

Summary

  • Fix the outer coarse-split loop in chunkMarkdown so it does not bisect a UTF-16 surrogate pair (root cause of indexing stalls).
  • Add a defensive stripUnpairedSurrogates sanitizer at the embed boundary (the safety net #27753 originally requested).
  • Tests for both, including a new regression test that triggers the exact code path the existing surrogate-pair test missed.

Closes #27753. Fixes #65782.

Why

Memory indexing stalls permanently for any agent whose content contains an emoji or supplementary-plane character that the chunker happens to split across a surrogate-pair boundary. openclaw memory status reports the index as clean / no issues while ~98% of sessions are unindexed.

End-to-end symptom (verbose):

[memory] embeddings: batch start  (×N)
Memory index failed (<agent>): openai embeddings failed: 500
{"error":{"message":"... 'utf-8' codec can't encode character '\\ud83c'
in position 3229: surrogates not allowed ..."}}

Two failure modes compound:

  1. Chunker (packages/memory-host-sdk/src/host/internal.ts) — the inner fine-split loop already guards against splitting inside a surrogate pair, but the outer coarse loop advances by start += maxChars without checking the boundary. When the cut lands inside e.g. an emoji like 🌸 (U+1F338 = \uD83C\uDF38), the chunk ends with a lone high surrogate and the next begins with a lone low surrogate. The existing surrogate-pair test ("does not break surrogate pairs when splitting long CJK lines") does not catch this because it uses an all-surrogate-pair line — the high estimateStringChars weight forces the inner pass, which is correct.
  2. Embedding HTTP path — Python JSON encoders (e.g. via LiteLLM proxies) reject lone surrogates with a 500 wrapping a UnicodeEncodeError. The retry policy only matches rate_limit | 429 | 5\d\d | cloudflare | tokens per day, but the body shape is a connect-error, not a transient one — and the indexer is transactional, so a single bad batch rolls back the entire run. Every subsequent openclaw memory index invocation reproduces the same failure.

What changed

packages/memory-host-sdk/src/host/internal.ts

- for (let start = 0; start < line.length; start += maxChars) {
-   const coarse = line.slice(start, start + maxChars);
+ for (let start = 0; start < line.length; ) {
+   let coarseEnd = Math.min(line.length, start + maxChars);
+   // Avoid splitting inside a UTF-16 surrogate pair (emoji, CJK Extension B+).
+   if (coarseEnd < line.length) {
+     const lastCode = line.charCodeAt(coarseEnd - 1);
+     if (lastCode >= 0xd800 && lastCode <= 0xdbff) {
+       coarseEnd += 1; // include the trailing low surrogate
+     }
+   }
+   const coarse = line.slice(start, coarseEnd);
+   start = coarseEnd;

Mirrors the existing inner-loop guard at the same file/line. New regression test ("does not break surrogate pairs at the coarse split boundary (issue #27753)") constructs a line of 39 ASCII + 🌸 + 39 ASCII so the outer cut at start + maxChars = 40 lands between the high and low surrogate. Without the fix, the test fails by producing chunks containing lone surrogates.
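The boundary check from the diff can be exercised in isolation. The following is a sketch under assumptions: coarseSplit is a hypothetical standalone function that contains only the patched loop, whereas the real chunkMarkdown does considerably more, and the check at the end is illustrative, not the PR's actual test.

```typescript
// Hypothetical standalone version of the patched outer loop.
function coarseSplit(line: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let start = 0; start < line.length; ) {
    let end = Math.min(line.length, start + maxChars);
    if (end < line.length) {
      const lastCode = line.charCodeAt(end - 1);
      // A trailing high surrogate means the cut is mid-pair:
      // extend by one code unit to include the low surrogate.
      if (lastCode >= 0xd800 && lastCode <= 0xdbff) {
        end += 1;
      }
    }
    chunks.push(line.slice(start, end));
    start = end;
  }
  return chunks;
}

const loneSurrogatePattern =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Same shape as the regression input: the cut at index 40 would bisect the emoji.
const boundaryLine = "a".repeat(39) + "🌸" + "b".repeat(39);
const fixedChunks = coarseSplit(boundaryLine, 40);

console.log(fixedChunks.map((c) => c.length));                     // [ 41, 39 ]
console.log(fixedChunks.some((c) => loneSurrogatePattern.test(c))); // false
console.log(fixedChunks.join("") === boundaryLine);                 // true
```

One consequence of this design is that a coarse chunk may exceed maxChars by a single code unit when the cut is moved past a low surrogate.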

extensions/memory-core/src/memory/manager-embedding-ops.ts

Adds a tiny exported helper:

export function stripUnpairedSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD",
  );
}

…and applies it at the entry of embedBatchWithRetry and embedBatchInputsWithRetry. This is the change #27753 originally requested. It protects against any other code path that might produce a lone surrogate (LLM tool output, file ingestion, future chunkers) — chunker-only fix is necessary but not sufficient.

Adds manager-embedding-ops.test.ts with focused unit tests for the helper (preserves valid pairs, replaces lone high/low surrogates, handles reversed pairs, no-op on empty/non-string).
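The behaviours listed above can be checked directly against the helper as quoted in the PR description. The inputs below are illustrative and are not copied from manager-embedding-ops.test.ts; non-string handling is left out because the quoted signature only accepts strings.

```typescript
// Helper as given in the PR description.
function stripUnpairedSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD",
  );
}

// Illustrative cases: [label, input, expected output].
const surrogateCases: Array<[string, string, string]> = [
  ["valid pair is preserved", "hi 🌸", "hi 🌸"],
  ["lone high surrogate is replaced", "abc\uD83C", "abc\uFFFD"],
  ["lone low surrogate is replaced", "\uDF38abc", "\uFFFDabc"],
  ["reversed pair is replaced twice", "\uDF38\uD83C", "\uFFFD\uFFFD"],
  ["empty string is a no-op", "", ""],
];

for (const [label, input, expected] of surrogateCases) {
  console.log(label, stripUnpairedSurrogates(input) === expected);
}
```

Each line should print true. The regex has no u flag, so it matches individual UTF-16 code units, which is what lets it see the two halves of a pair separately.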

Test plan

  • Verified live on installed runtime (openclaw 2026.4.5) by patching the equivalent bundled artifact under dist/manager-CKYnEo0k.js and running openclaw memory index --agent <id>:

    beforeafter
    clawdea files6166
    clawdea chunks934584
    dev files217
    dev chunks251588

    openclaw memory search works end-to-end after the fix.

  • CI runs pnpm test against the new unit tests in extensions/memory-core and packages/memory-host-sdk.

  • Existing surrogate-pair test ("does not break surrogate pairs when splitting long CJK lines") still passes.

Notes for reviewers

  • I did not run pnpm install locally to execute vitest — the runtime fix is verified live, and the new tests follow the same vitest patterns as adjacent files (manager-embedding-cache.test.ts, internal.test.ts).
  • The retry-policy gap (fetch failed / UnicodeEncodeError 500 not retried) is a separate concern tracked in #56815 / #58255 — out of scope for this PR. With the chunker + sanitizer in place, the original failure no longer reaches the retry path.
  • If the team prefers the helper to live in a shared utility module (e.g. under packages/memory-host-sdk) rather than manager-embedding-ops.ts, happy to relocate.

Changed files

  • extensions/memory-core/src/memory/manager-embedding-ops.test.ts (added, +36/-0)
  • extensions/memory-core/src/memory/manager-embedding-ops.ts (modified, +28/-4)
  • packages/memory-host-sdk/src/host/internal.test.ts (modified, +27/-0)
  • packages/memory-host-sdk/src/host/internal.ts (modified, +11/-2)

RAW_BUFFER

Reproduction

Two locally-running agents (clawdea, dev) both stopped indexing on 2026-04-01 and were never able to recover via openclaw memory index. Symptom from the indexer (verbose):

[memory] embeddings: batch start  (×119)
Memory index failed (clawdea): openai embeddings failed: 500
{"error":{"message":"litellm.APIConnectionError: AzureException APIConnectionError -
'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\\ud83c' in position 3229: surrogates not allowed

State at the time of failure:

                                clawdea   dev
session jsonl files on disk     159       11
files indexed in sqlite         3         1
memory status reports dirty     no        no
memory status reports issues    none      none

So memory status --deep happily reports the index as clean while ~98% of sessions are unindexed.

Why #27753 was right and still applies

#27753 ("[Bug]: Embedding sync loop — lone surrogates in session content not sanitized before sending to embeddings API") proposed sanitizing lone surrogates with U+FFFD at the embed boundary. That recommendation is still correct — even with the chunker fixed, defense-in-depth at the embed call protects against any future code path that might produce a lone surrogate (LLM tool output, file ingestion, etc.).
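As a rough illustration of what "sanitize at the embed boundary" means, the sketch below wraps an embed function so every input is cleaned before it leaves the process. The EmbedBatch type, withSanitizedInputs, and fakeEmbed are hypothetical names introduced for this example; the actual embedBatch signatures in openclaw are not shown in this report.

```typescript
// Assumed, simplified shape of an embedding call (not openclaw's real signature).
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Helper as given in the PR description.
function stripUnpairedSurrogates(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "\uFFFD",
  );
}

// Hypothetical wrapper: sanitizes every input, then delegates to the real call.
function withSanitizedInputs(embed: EmbedBatch): EmbedBatch {
  return (texts) => embed(texts.map((t) => stripUnpairedSurrogates(t)));
}

// Fake provider for demonstration: records its inputs and returns dummy vectors.
const receivedTexts: string[] = [];
const fakeEmbed: EmbedBatch = async (texts) => {
  receivedTexts.push(...texts);
  return texts.map(() => [0]);
};

const safeEmbed = withSanitizedInputs(fakeEmbed);
void safeEmbed(["ok 🌸", "bad \uD83C"]).then((vectors) => {
  // The provider saw "bad \uFFFD" instead of the lone surrogate.
  console.log(receivedTexts, vectors.length);
});
```

Because the wrapper sits at the last step before the HTTP call, it covers inputs regardless of which upstream path produced them, which is the argument #27753 made.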

Live verification

Patched the equivalent bundled artifact under /opt/homebrew/lib/node_modules/openclaw/dist/ and reran openclaw memory index against both stalled agents:

beforeafter
clawdea files6166
clawdea chunks934584
dev files217
dev chunks251588

Search works end-to-end after the fix.

Environment

  • OpenClaw: 2026.4.5 (3e72c03)
  • OS: macOS 26.x (Darwin 25.3.0, arm64)
  • Node: 22.x
  • Embedding model: text-embedding-3-small via OpenAI-compatible endpoint (LiteLLM proxy)

Related

  • #27753 — the closed-as-stale predecessor with the original sanitize-at-embed proposal
  • #56815 — 'TypeError: fetch failed' after ~40K chunks (related family — bare network errors not in retry allowlist)
  • #58255 — Gemini memory indexing fails with fetch failed while direct Node fetch succeeds (same retry-policy gap)

extent analysis

TL;DR

The most likely fix for the memory indexing stall issue is to update the chunker logic in internal.ts to handle surrogate pairs correctly and add defense-in-depth by sanitizing lone surrogates before sending to the embeddings API.

Guidance

  • Review the internal.ts file and update the outer chunker loop to extend the cut by one code unit when the trailing character is a high surrogate, as proposed in the fix.
  • Add a regression test to exercise the exact-boundary case, such as 39 ASCII characters followed by an emoji (🌸).
  • Implement the stripUnpairedSurrogates function in manager-embedding-ops.ts to replace lone surrogates with U+FFFD before calling embedBatch or embedBatchInputs.
  • Verify the fix by running openclaw memory index against the stalled agents and checking the indexing progress.

Example

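// internal.ts (sketch of the outer coarse-split loop, following the PR diff)
for (let start = 0; start < line.length; ) {
  // Compute a per-iteration end index instead of mutating maxChars,
  // which would permanently enlarge every later chunk.
  let coarseEnd = Math.min(line.length, start + maxChars);
  if (coarseEnd < line.length) {
    const lastCode = line.charCodeAt(coarseEnd - 1);
    // If the trailing code unit is a high surrogate, extend the cut by one
    // so the matching low surrogate stays in the same chunk.
    if (lastCode >= 0xd800 && lastCode <= 0xdbff) {
      coarseEnd += 1;
    }
  }
  const coarse = line.slice(start, coarseEnd);
  start = coarseEnd;
  // ...
}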

Notes

The provided fix and tests should address the issue, but the changes still need review and testing to confirm they do not introduce regressions. In particular, stripUnpairedSurrogates should replace only unpaired surrogates and leave valid surrogate pairs, and all other text, unchanged.

Recommendation

Apply the proposed fix, which includes updating the chunker logic and adding defense-in-depth by sanitizing lone surrogates. This approach addresses the root cause of the issue and provides an additional layer of protection against similar problems in the future.
