openclaw - ✅(Solved) Fix [Bug]: CJK Token Estimation Underestimation Causes Context Window Overflow in Compaction Path [2 pull requests, 1 comments, 2 participants]

GitHub stats
openclaw/openclaw#70052 · Fetched 2026-04-23 07:29:55
Comments: 1 · Participants: 2 · Timeline events: 6 · Reactions: 0
Timeline (top): cross-referenced ×2, commented ×1, labeled ×1, mentioned ×1

The compaction path uses @mariozechner/pi-coding-agent's estimateTokens which assumes ~4 ASCII chars per token, severely underestimating CJK content where each character maps to ~1.5–2.0 tokens. This causes the context window guard to fail for Japanese/Chinese/Korean sessions.

Error Message

  • Severity: High (silent data loss — early instructions dropped without explicit error)
  • Safety margin inadequacy: SAFETY_MARGIN = 1.2 (20% buffer) is calibrated for ASCII error, but CJK error is 200–600%, making the buffer meaningless

Root Cause

  • Dual token estimation architecture:
    • Compaction (critical path): uses upstream estimateTokens — CJK-blind
    • Pruning (secondary): uses OpenClaw's estimateStringChars — CJK-aware
  • OpenClaw already has the correct estimator (src/utils/cjk-chars.ts:35-81) but it is NOT used in compaction
  • The upstream package @mariozechner/pi-coding-agent (v0.68.1, from badlogic/pi-mono/packages/coding-agent) is a black-box dependency

Fix Action

Fixed

PR fix notes

PR #70081: fix: use CJK-aware token estimation in compaction path

Description (problem / solution / changelog)

Summary

Closes #70052

The compaction pipeline used estimateTokens from @mariozechner/pi-coding-agent, which divides text.length by 4 to estimate tokens. This works for Latin scripts (~4 chars/token) but severely underestimates CJK content where each character is typically ≈1 token.

Example: 19 Japanese characters ("このメッセージは日本語で書かれています") produce 19 tokens, but chars/4 estimated only 5 — a 3.8× undercount.
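
The undercount in this example can be reproduced with a few lines. This is a minimal sketch: `naiveEstimateTokens` is a hypothetical stand-in for a chars/4 heuristic, not the upstream implementation, and it assumes the result is rounded up.

```typescript
// Hypothetical stand-in for a chars/4 heuristic (NOT the upstream code).
// Assumption: the heuristic rounds up.
function naiveEstimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const japanese = "このメッセージは日本語で書かれています";

// String.length counts UTF-16 code units; every character here is in the
// Basic Multilingual Plane, so length equals the visible character count.
console.log(japanese.length, naiveEstimateTokens(japanese)); // 19 5
```

If each character is closer to one token, as the PR states, the heuristic reports roughly a quarter of the real usage.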

Root Cause

OpenClaw already has a CJK-aware utility (estimateStringChars in src/utils/cjk-chars.ts) that counts each CJK character as roughly 4 characters, so the chars/4 heuristic yields about 1 token per CJK character, in line with real tokenizer behavior. However, this utility was only used in the context-pruning hook, not in the compaction path, which still relied on the upstream CJK-blind heuristic.

Fix

Introduces estimateMessageTokensCjkAware in compaction.ts, which mirrors the upstream estimateTokens message-walking logic but uses estimateStringChars for text measurement. All compaction-path consumers now use this function:

| File | Change |
|------|--------|
| src/agents/compaction.ts | New estimateMessageTokensCjkAware; estimateMessagesTokens now uses it |
| src/agents/pi-embedded-runner/compact.ts | Replaced upstream estimateTokens import with CJK-aware version |
| src/agents/pi-embedded-runner/run/preemptive-compaction.ts | Same replacement for pre-prompt estimation |

For pure ASCII content the results are identical (no behavior change).
For CJK-heavy sessions the token estimate increases by ~3–4×, so compaction triggers before the context window overflows rather than too late.
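
The approach described in this PR can be sketched roughly as follows. Everything here is an assumption for illustration: `SketchMessage` is a simplified message shape, `weightedChars` stands in for estimateStringChars with a 4× weight per CJK character, and the regex covers only a few common CJK ranges. The real implementation in compaction.ts may differ.

```typescript
// Hypothetical simplified message shape; the real AgentMessage type is richer.
type TextBlock = { type: "text"; text: string };
type SketchMessage = { role: string; content: string | TextBlock[] };

// Rough CJK detector: Hiragana, Katakana, CJK Unified Ideographs, Hangul syllables.
const CJK_CHAR = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/g;

// Stand-in for estimateStringChars: each CJK character counts as 4 "virtual" chars.
function weightedChars(text: string): number {
  const cjk = (text.match(CJK_CHAR) ?? []).length;
  return text.length - cjk + cjk * 4;
}

// Walks string or block-array content and applies the chars/4 heuristic
// to the weighted character count.
function estimateMessageTokensSketch(message: SketchMessage): number {
  const texts =
    typeof message.content === "string"
      ? [message.content]
      : message.content.map((block) => block.text);
  const chars = texts.reduce((sum, t) => sum + weightedChars(t), 0);
  return Math.ceil(chars / 4);
}
```

With this weighting, pure ASCII text keeps its chars/4 estimate, while each CJK character contributes about one token.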

Tests

  • 11 new tests in compaction-cjk.test.ts covering Japanese, Chinese, Korean, mixed content, all message roles, array content blocks, and edge cases
  • All 17 existing compaction.test.ts tests pass unchanged
  • All 9 preemptive-compaction.test.ts tests pass unchanged

Changed files

  • src/agents/compaction-cjk.test.ts (added, +145/-0)
  • src/agents/compaction.ts (modified, +90/-5)
  • src/agents/pi-embedded-runner/compact.ts (modified, +8/-5)
  • src/agents/pi-embedded-runner/run/preemptive-compaction.ts (modified, +6/-3)

PR #70112: fix(agents): use CJK-aware token estimation in compaction path (#70052)

Description (problem / solution / changelog)

Summary

The compaction path uses estimateTokens from @mariozechner/pi-coding-agent which assumes ~4 chars/token for all scripts. This severely underestimates CJK (Chinese, Japanese, Korean) content where each character maps to ~1 token, causing compaction to trigger too late and leading to context window overflow.

Root Cause

The upstream estimateTokens() counts text.length / 4 for all scripts. For CJK text, this yields ~0.25 tokens/char instead of ~1 token/char (a 4x underestimation). OpenClaw already has a CJK-aware estimator (estimateStringChars in src/utils/cjk-chars.ts) used in the pruning path, but the compaction path was not using it.

Changes

  • src/agents/compaction.ts: Add cjkTokenCorrection() helper and estimateTokensCjkAware() wrapper that adds CJK correction delta on top of the upstream estimate. Use it in estimateMessagesTokens().
  • src/agents/pi-embedded-runner/compact.ts: Replace all 4 estimateTokens references with estimateTokensCjkAware (metrics, full session estimation, hook callbacks).
  • src/agents/pi-embedded-runner/run/preemptive-compaction.ts: Replace estimateTokens with estimateTokensCjkAware for synthetic message estimation.
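
The "correction delta" idea in the changes above can be illustrated with a small sketch. The names `baseEstimate`, `cjkCorrectionSketch`, and `estimateWithCorrection` are hypothetical, the 0.75 factor is an assumption derived from the ~1 token/char target, and the PR's actual cjkTokenCorrection() may compute the delta differently.

```typescript
// Rough CJK detector: Hiragana, Katakana, CJK Unified Ideographs, Hangul syllables.
const CJK_RANGE = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/g;

// Hypothetical stand-in for the upstream chars/4 heuristic.
function baseEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Each CJK character already contributes ~1/4 token to the base estimate,
// so add the missing ~3/4 token per CJK character (assumed target: ~1 token/char).
function cjkCorrectionSketch(text: string): number {
  const cjk = (text.match(CJK_RANGE) ?? []).length;
  return Math.ceil(cjk * 0.75);
}

function estimateWithCorrection(text: string): number {
  return baseEstimate(text) + cjkCorrectionSketch(text);
}

console.log(estimateWithCorrection("このメッセージは日本語で書かれています")); // 20
```

A wrapper of this shape leaves the upstream estimate untouched for ASCII text, since the correction is zero when no CJK characters are present.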

Test

All existing compaction and CJK tests pass:

  • pnpm test -- src/agents/compaction.test.ts — 17 passed
  • pnpm test -- src/utils/cjk-chars.test.ts — 17 passed
  • pnpm test -- src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts — 9 passed

Closes #70052

Changed files

  • src/agents/compaction.ts (modified, +59/-1)
  • src/agents/pi-embedded-runner/compact.ts (modified, +5/-5)
  • src/agents/pi-embedded-runner/run/preemptive-compaction.ts (modified, +2/-3)

Code Example

## Code Evidence

### Problem Path (Compaction uses upstream CJK-blind estimator)

// src/agents/compaction.ts:103-107
import { estimateTokens } from "@mariozechner/pi-coding-agent";
export function estimateMessagesTokens(messages: AgentMessage[]): number {
  const safe = stripToolResultDetails(messages);
  return safe.reduce((sum, message) => sum + estimateTokens(message), 0);
}

Same upstream `estimateTokens` used in:
- `src/agents/pi-embedded-runner/compact.ts:280`
- `src/agents/pi-embedded-runner/compact.ts:1014`
- `src/agents/pi-embedded-runner/run/preemptive-compaction.ts:37`

### Correct Path (Pruning uses OpenClaw's CJK-aware estimator)

// src/agents/pi-hooks/context-pruning/pruner.ts:115
function estimateWeightedTextChars(text: string): number {
  return estimateStringChars(text); // CJK-aware, but pruning only!
}


### Token Comparison
| Text | length/4 estimate | CJK-aware estimate | Actual (cl100k) |
|------|-------------------|-------------------|-----------------|
| `このメッセージは日本語で書かれています` (19 chars) | 5 tokens | 20 tokens | ~20 tokens |

### Ecosystem Precedent
- Martian-Engineering/lossless-claw PR #344: Same `chars/4` bug fixed upstream
- win4r/lossless-claw-enhanced: Fork documenting CJK underestimation

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Steps to reproduce

  1. Start a session with any model (Claude, GPT-4, etc.)
  2. Send a Japanese message: このメッセージは日本語で書かれています (19 chars)
  3. Observe that estimateTokens() returns 5 tokens (19/4, rounded up)
  5. Repeat for 5–6 tasks; compaction does not trigger
  6. Context overflows; agent "forgets" instructions
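
The failure mode in these steps can be simulated with made-up numbers. This is only a sketch: the window size, per-turn message size, and the way `shouldCompact` applies SAFETY_MARGIN are assumptions for illustration, not OpenClaw's actual guard logic. It also assumes roughly 1 actual token per CJK character.

```typescript
// Hypothetical numbers for illustration only.
const CONTEXT_WINDOW = 1000; // tokens
const SAFETY_MARGIN = 1.2; // value quoted in the issue
const CHARS_PER_TURN = 400; // CJK characters added per turn

// Assumed guard: compact once the padded estimate reaches the window size.
function shouldCompact(estimatedTokens: number): boolean {
  return estimatedTokens * SAFETY_MARGIN >= CONTEXT_WINDOW;
}

let naiveEstimate = 0; // chars/4 heuristic
let actualTokens = 0; // assumption: ~1 token per CJK character
let triggerTurn = -1;
let overflowTurn = -1;

for (let turn = 1; turn <= 10; turn++) {
  naiveEstimate += Math.ceil(CHARS_PER_TURN / 4);
  actualTokens += CHARS_PER_TURN;
  if (triggerTurn < 0 && shouldCompact(naiveEstimate)) triggerTurn = turn;
  if (overflowTurn < 0 && actualTokens > CONTEXT_WINDOW) overflowTurn = turn;
}

console.log({ triggerTurn, overflowTurn }); // { triggerTurn: 9, overflowTurn: 3 }
```

Under these assumptions the real context overflows at turn 3, while the guard would not fire until turn 9; a 1.2× margin cannot absorb a 4× undercount.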

Expected behavior

OpenClaw should use its existing CJK-aware estimator (src/utils/cjk-chars.ts, estimateStringChars) in the compaction path, yielding ~20 tokens for the example above and triggering compaction before the context window overflows.

Actual behavior

Compaction uses estimateTokens from @mariozechner/pi-coding-agent, which counts CJK characters as ~0.25 tokens/char (same as ASCII). For the example 19-char Japanese message, it estimates 5 tokens instead of ~20. With 5–6 accumulated tasks, the actual context is 3–4× larger than estimated. The LLM silently truncates early messages (system prompt, instructions), and the agent loses track of tasks.

OpenClaw version

2026.4.22 (main branch, commit range includes src/agents/compaction.ts and src/utils/cjk-chars.ts)

Operating system

macOS 15.4 (also affects all platforms: Linux, Windows, Docker)

Install method

pnpm dev (from source) / npm global

Model

All models affected (tokenizer-independent): anthropic/claude-sonnet-4.6, openai/gpt-4o, google/gemini-2.5-pro, etc.

Provider / routing chain

OpenClaw -> any provider -> any model (issue is in local compaction logic, not provider-specific)

Additional provider/model setup details

No provider-specific configuration required. The bug is in the local token estimation logic at src/agents/compaction.ts which imports estimateTokens from @mariozechner/pi-coding-agent, not in any provider API routing.

Impact and severity

  • Affected users/systems/channels: All CJK-language users (Chinese, Japanese, Korean)
  • Severity: High (silent data loss — early instructions dropped without explicit error)
  • Frequency: Always for CJK-heavy sessions beyond ~5 tasks
  • Consequence: Agent loses track of user instructions; forced session restart; wasted API tokens on degraded outputs
  • Safety margin inadequacy: SAFETY_MARGIN = 1.2 (20% buffer) is calibrated for ASCII error, but CJK error is 200–600%, making the buffer meaningless

Additional information

Proposed Fix

  1. Immediate: Wrap upstream estimateTokens with OpenClaw's cjk-chars.ts in compaction path
  2. Medium-term: Recalibrate cjk-chars.ts ratios against real tokenizers (current 4× virtual chars may be too conservative; lossless-claw uses 1.5 tokens/char)
  3. Long-term: Integrate exact tokenizer (js-tiktoken/gpt-tokenizer) as optional exact backend
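
One way the long-term option could be structured is a small estimator interface with an optional exact backend and a heuristic fallback. This is a sketch under stated assumptions: `TokenCounter`, `heuristicCounter`, and `createEstimator` are hypothetical names, and no real tokenizer library is called here. An adapter around js-tiktoken or gpt-tokenizer would be passed in as `exact`.

```typescript
// Hypothetical interface: any exact tokenizer could be adapted to it.
interface TokenCounter {
  count(text: string): number;
}

// Rough CJK detector: Hiragana, Katakana, CJK Unified Ideographs, Hangul syllables.
const CJK_RE = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/g;

// Heuristic fallback: chars/4, with each CJK character weighted as 4 chars.
const heuristicCounter: TokenCounter = {
  count(text: string): number {
    const cjk = (text.match(CJK_RE) ?? []).length;
    return Math.ceil((text.length + cjk * 3) / 4);
  },
};

// Uses the exact backend when one is supplied; falls back to the heuristic
// when it is missing or throws.
function createEstimator(exact?: TokenCounter): TokenCounter {
  return {
    count(text: string): number {
      if (!exact) return heuristicCounter.count(text);
      try {
        return exact.count(text);
      } catch {
        return heuristicCounter.count(text);
      }
    },
  };
}
```

Keeping the exact backend optional means the default install stays dependency-free, while users who need precise counts can opt in.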

Maintainer

CC: @jalehman (Compaction, Context Engine maintainer per CONTRIBUTING.md)

extent analysis

TL;DR

Replace the upstream estimateTokens from @mariozechner/pi-coding-agent with a CJK-aware estimator built on OpenClaw's estimateStringChars in the compaction path, so token counts for CJK content are no longer underestimated.

Guidance

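  • Identify all occurrences of estimateTokens from @mariozechner/pi-coding-agent in the compaction path and replace them with a CJK-aware wrapper built on OpenClaw's estimateStringChars, so that CJK characters are no longer undercounted. estimateStringChars takes a string and returns weighted characters, so it cannot be swapped in directly for the message-level estimateTokens.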
  • Verify the fix by testing with CJK-heavy sessions and checking that compaction triggers correctly before the context window overflows.
  • Consider recalibrating the cjk-chars.ts ratios against real tokenizers for more accurate estimates.
  • In the long term, integrating an exact tokenizer like js-tiktoken/gpt-tokenizer as an optional backend could provide the most accurate token estimates.

Example

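// src/agents/compaction.ts (sketch; messageText is a hypothetical helper
// that extracts the text content of an AgentMessage)
import { estimateStringChars } from "../utils/cjk-chars";
export function estimateMessagesTokens(messages: AgentMessage[]): number {
  const safe = stripToolResultDetails(messages);
  // estimateStringChars takes a string and returns weighted characters, not tokens,
  // so extract each message's text and divide by the ~4 chars/token heuristic.
  return safe.reduce(
    (sum, message) => sum + Math.ceil(estimateStringChars(messageText(message)) / 4),
    0,
  );
}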

Notes

The provided fix assumes that OpenClaw's estimateStringChars is correctly implemented and provides accurate estimates for CJK characters. If this is not the case, further investigation into the cjk-chars.ts file may be necessary.

Recommendation

Apply the fix by routing compaction-path estimates through a CJK-aware estimator built on estimateStringChars instead of the upstream estimateTokens. This gives a more accurate token estimate for CJK content and prevents context window overflows.
