openclaw - ✅(Solved) Fix [Bug]: CJK Token Estimation Underestimation Causes Context Window Overflow in Compaction Path [2 pull requests, 1 comments, 2 participants]

openclaw · 2026-04-22 07:39:30

GitHub stats

openclaw/openclaw#70052 • Fetched 2026-04-23 07:29:55

Author: TerasawaShuhei

Participants: rafiki270, TerasawaShuhei

Timeline (top): cross-referenced ×2, commented ×1, labeled ×1, mentioned ×1

The compaction path uses @mariozechner/pi-coding-agent's estimateTokens which assumes ~4 ASCII chars per token, severely underestimating CJK content where each character maps to ~1.5–2.0 tokens. This causes the context window guard to fail for Japanese/Chinese/Korean sessions.
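To make the undercount concrete, here is a minimal sketch of the heuristic described above. It assumes the upstream estimator behaves like `ceil(text.length / 4)`; `naiveEstimateTokens` is a stand-in written for this illustration, not the upstream function itself.

```typescript
// Assumption: the upstream heuristic behaves like ceil(text.length / 4).
function naiveEstimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

const english = "This message is written in English.";
const japanese = "このメッセージは日本語で書かれています";

// The heuristic treats both scripts identically: one token per ~4 UTF-16 code units.
console.log(english.length, naiveEstimateTokens(english)); // 35 9
console.log(japanese.length, naiveEstimateTokens(japanese)); // 19 5
```

The Japanese sentence gets about half the estimate of the English one, even though a real tokenizer typically spends roughly one token or more on each CJK character.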

Root Cause

Dual token estimation architecture:
- Compaction (critical path): uses upstream estimateTokens, which is CJK-blind
- Pruning (secondary): uses OpenClaw's estimateStringChars, which is CJK-aware
OpenClaw already has the correct estimator (src/utils/cjk-chars.ts:35-81), but the compaction path does not use it.
The upstream package @mariozechner/pi-coding-agent (v0.68.1, from badlogic/pi-mono/packages/coding-agent) is a black-box dependency.

Fix Action

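Two pull requests address this issue. Per the PR notes below, neither had been merged when this page was fetched.

PR #70081 (closed, not merged): fix: use CJK-aware token estimation in compaction path (https://github.com/openclaw/openclaw/pull/70081)
PR #70112 (open, not merged): fix(agents): use CJK-aware token estimation in compaction path (#70052) (https://github.com/openclaw/openclaw/pull/70112)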

PR fix notes

PR #70081: fix: use CJK-aware token estimation in compaction path

Repository: openclaw/openclaw
Author: Bartok9
State: closed | merged: False
Link: https://github.com/openclaw/openclaw/pull/70081

Description (problem / solution / changelog)

Summary

Closes #70052

The compaction pipeline used estimateTokens from @mariozechner/pi-coding-agent, which divides text.length by 4 to estimate tokens. This works for Latin scripts (~4 chars/token) but severely underestimates CJK content where each character is typically ≈1 token.

Example: 19 Japanese characters ("このメッセージは日本語で書かれています") produce 19 tokens, but chars/4 estimated only 5 — a 3.8× undercount.

Root Cause

OpenClaw already has a CJK-aware utility (estimateStringChars in src/utils/cjk-chars.ts) that counts each CJK character as ~4 ASCII-equivalent characters, so that dividing by 4 yields roughly one token per CJK character. However, this utility was only used in the context-pruning hook, not in the compaction path, which still relied on the upstream CJK-blind heuristic.

Fix

Introduces estimateMessageTokensCjkAware in compaction.ts, which mirrors the upstream estimateTokens message-walking logic but uses estimateStringChars for text measurement. All compaction-path consumers now use this function:

| File | Change |
|------|--------|
| `src/agents/compaction.ts` | New `estimateMessageTokensCjkAware`; `estimateMessagesTokens` now uses it |
| `src/agents/pi-embedded-runner/compact.ts` | Replaced upstream `estimateTokens` import with CJK-aware version |
| `src/agents/pi-embedded-runner/run/preemptive-compaction.ts` | Same replacement for pre-prompt estimation |

For pure ASCII content the results are identical (no behavior change).
For CJK-heavy sessions the token estimate increases by ~3–4×, which keeps compaction from triggering too late because context usage was underestimated.
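The approach described in this PR, walking each message and measuring its text with a CJK-weighted character count, can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SketchMessage`, `isCjk`, `weightedChars`, and `estimateMessageTokensSketch` are hypothetical names, the Unicode ranges are a rough guess at what a CJK check might cover, and the real `AgentMessage` type and `estimateStringChars` implementation are not shown in the PR notes.

```typescript
// Hypothetical message shape; the real AgentMessage type is not shown in the PR notes.
type TextBlock = { type: "text"; text: string };
type SketchMessage = { role: string; content: string | TextBlock[] };

// Rough CJK test: Hiragana, Katakana, CJK Unified Ideographs (plus Extension A),
// Hangul syllables, and fullwidth forms. The real utility may use different ranges.
function isCjk(codePoint: number): boolean {
  return (
    (codePoint >= 0x3040 && codePoint <= 0x30ff) ||
    (codePoint >= 0x3400 && codePoint <= 0x4dbf) ||
    (codePoint >= 0x4e00 && codePoint <= 0x9fff) ||
    (codePoint >= 0xac00 && codePoint <= 0xd7af) ||
    (codePoint >= 0xff00 && codePoint <= 0xffef)
  );
}

// Stand-in for estimateStringChars: each CJK character counts as 4 "virtual" characters.
function weightedChars(text: string): number {
  let total = 0;
  for (const ch of text) {
    total += isCjk(ch.codePointAt(0) ?? 0) ? 4 : ch.length;
  }
  return total;
}

// Walk the message content (plain string or text blocks) and convert to tokens.
function estimateMessageTokensSketch(message: SketchMessage): number {
  const text =
    typeof message.content === "string"
      ? message.content
      : message.content.map((block) => block.text).join("");
  return Math.ceil(weightedChars(text) / 4);
}

const ascii: SketchMessage = { role: "user", content: "hello world!" };
const cjk: SketchMessage = {
  role: "user",
  content: [{ type: "text", text: "このメッセージは日本語で書かれています" }],
};
console.log(estimateMessageTokensSketch(ascii)); // 3, same as length / 4
console.log(estimateMessageTokensSketch(cjk)); // 19, one token per CJK character
```

Because non-CJK characters keep a weight of one, pure ASCII input produces the same result as the `length / 4` heuristic, which matches the "no behavior change" claim above.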

Tests

11 new tests in compaction-cjk.test.ts covering Japanese, Chinese, Korean, mixed content, all message roles, array content blocks, and edge cases
All 17 existing compaction.test.ts tests pass unchanged
All 9 preemptive-compaction.test.ts tests pass unchanged

Changed files

src/agents/compaction-cjk.test.ts (added, +145/-0)
src/agents/compaction.ts (modified, +90/-5)
src/agents/pi-embedded-runner/compact.ts (modified, +8/-5)
src/agents/pi-embedded-runner/run/preemptive-compaction.ts (modified, +6/-3)

PR #70112: fix(agents): use CJK-aware token estimation in compaction path (#70052)

Repository: openclaw/openclaw
Author: MoerAI
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/70112

Description (problem / solution / changelog)

Summary

The compaction path uses estimateTokens from @mariozechner/pi-coding-agent which assumes ~4 chars/token for all scripts. This severely underestimates CJK (Chinese, Japanese, Korean) content where each character maps to ~1 token, causing compaction to trigger too late and leading to context window overflow.

Root Cause

The upstream estimateTokens() counts text.length / 4 for all scripts. For CJK text, this yields ~0.25 tokens/char instead of ~1 token/char (a 4x underestimation). OpenClaw already has a CJK-aware estimator (estimateStringChars in src/utils/cjk-chars.ts) used in the pruning path, but the compaction path was not using it.

Changes

src/agents/compaction.ts: Add cjkTokenCorrection() helper and estimateTokensCjkAware() wrapper that adds CJK correction delta on top of the upstream estimate. Use it in estimateMessagesTokens().
src/agents/pi-embedded-runner/compact.ts: Replace all 4 estimateTokens references with estimateTokensCjkAware (metrics, full session estimation, hook callbacks).
src/agents/pi-embedded-runner/run/preemptive-compaction.ts: Replace estimateTokens with estimateTokensCjkAware for synthetic message estimation.
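The "correction delta on top of the upstream estimate" idea can be sketched as below. The PR notes name `cjkTokenCorrection()` and `estimateTokensCjkAware()` but do not show their bodies, so the functions here carry a `Sketch` suffix and are hypothetical; `upstreamEstimate` is a stand-in that assumes `ceil(length / 4)`, and the character ranges are an assumption.

```typescript
// Stand-in for the upstream estimator (assumed to behave like ceil(length / 4)).
function upstreamEstimate(text: string): number {
  return Math.ceil(text.length / 4);
}

// Count characters in a few common CJK blocks (assumed ranges, BMP only).
function countCjk(text: string): number {
  const matches = text.match(/[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/g);
  return matches ? matches.length : 0;
}

// Hypothetical correction: upstream already credits each CJK character with 1/4 token,
// so add the missing 3/4 token per CJK character.
function cjkTokenCorrectionSketch(text: string): number {
  return Math.ceil((countCjk(text) * 3) / 4);
}

function estimateTokensCjkAwareSketch(text: string): number {
  return upstreamEstimate(text) + cjkTokenCorrectionSketch(text);
}

console.log(estimateTokensCjkAwareSketch("hello world!")); // 3 (no CJK, no correction)
console.log(estimateTokensCjkAwareSketch("このメッセージは日本語で書かれています")); // 20 (5 upstream + 15 correction)
```

One plausible reason to prefer a delta over a full reimplementation is that whatever the upstream estimator does for non-text parts of a message stays untouched; the cost is that the two roundings can add up to a slightly higher figure than a single weighted count would give.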

Test

All existing compaction and CJK tests pass:

pnpm test -- src/agents/compaction.test.ts — 17 passed
pnpm test -- src/utils/cjk-chars.test.ts — 17 passed
pnpm test -- src/agents/pi-embedded-runner/run/preemptive-compaction.test.ts — 9 passed

Closes #70052

Changed files

src/agents/compaction.ts (modified, +59/-1)
src/agents/pi-embedded-runner/compact.ts (modified, +5/-5)
src/agents/pi-embedded-runner/run/preemptive-compaction.ts (modified, +2/-3)

Code Example

## Code Evidence

### Problem Path (Compaction uses upstream CJK-blind estimator)

// src/agents/compaction.ts:103-107
import { estimateTokens } from "@mariozechner/pi-coding-agent";
export function estimateMessagesTokens(messages: AgentMessage[]): number {
  const safe = stripToolResultDetails(messages);
  return safe.reduce((sum, message) => sum + estimateTokens(message), 0);
}

Same upstream `estimateTokens` used in:
- `src/agents/pi-embedded-runner/compact.ts:280`
- `src/agents/pi-embedded-runner/compact.ts:1014`
- `src/agents/pi-embedded-runner/run/preemptive-compaction.ts:37`

### Correct Path (Pruning uses OpenClaw's CJK-aware estimator)

// src/agents/pi-hooks/context-pruning/pruner.ts:115
function estimateWeightedTextChars(text: string): number {
  return estimateStringChars(text); // CJK-aware, but pruning only!
}


### Token Comparison
| Text | length/4 estimate | CJK-aware estimate | Actual (cl100k) |
|------|-------------------|-------------------|-----------------|
| `このメッセージは日本語で書かれています` (19 chars) | 5 tokens | 20 tokens | ~20 tokens |

### Ecosystem Precedent
- Martian-Engineering/lossless-claw PR #344: Same `chars/4` bug fixed upstream
- win4r/lossless-claw-enhanced: Fork documenting CJK underestimation


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Steps to reproduce

1. Start a session with any model (Claude, GPT-4, etc.)
2. Send a Japanese message: このメッセージは日本語で書かれています (19 chars)
3. Observe that estimateTokens() returns 5 tokens (19/4, rounded up)
4. Actual cl100k_base usage is ~20 tokens
5. Repeat for 5–6 tasks; compaction does not trigger
6. Context overflows; agent "forgets" instructions

Expected behavior

OpenClaw should use its existing CJK-aware estimator (src/utils/cjk-chars.ts, estimateStringChars) in the compaction path, yielding ~20 tokens for the example above and triggering compaction before the context window overflows.

Actual behavior

Compaction uses estimateTokens from @mariozechner/pi-coding-agent, which counts CJK characters at ~0.25 tokens/char (the same rate as ASCII). For the example 19-character Japanese message, it estimates 5 tokens instead of ~20. With 5–6 accumulated tasks, the actual context is 3–4× larger than estimated. The LLM silently truncates early messages (system prompt, instructions), and the agent loses track of tasks.

OpenClaw version

2026.4.22 (main branch, commit range includes src/agents/compaction.ts and src/utils/cjk-chars.ts)

Operating system

macOS 15.4 (also affects all platforms: Linux, Windows, Docker)

Install method

pnpm dev (from source) / npm global

Model

All models affected (tokenizer-independent): anthropic/claude-sonnet-4.6, openai/gpt-4o, google/gemini-2.5-pro, etc.

Provider / routing chain

OpenClaw -> any provider -> any model (issue is in local compaction logic, not provider-specific)

Additional provider/model setup details

No provider-specific configuration required. The bug is in the local token estimation logic at src/agents/compaction.ts which imports estimateTokens from @mariozechner/pi-coding-agent, not in any provider API routing.

Impact and severity

Affected users/systems/channels: All CJK-language users (Chinese, Japanese, Korean)
Severity: High (silent data loss — early instructions dropped without explicit error)
Frequency: Always for CJK-heavy sessions beyond ~5 tasks
Consequence: Agent loses track of user instructions; forced session restart; wasted API tokens on degraded outputs
Safety margin inadequacy: SAFETY_MARGIN = 1.2 (20% buffer) is calibrated for ASCII error, but CJK error is 200–600%, making the buffer meaningless
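The safety-margin point can be checked with a little arithmetic. The sketch below assumes the guard compares the padded estimate (estimate × SAFETY_MARGIN) against the context window; how OpenClaw actually applies the margin is not shown in the issue, and the window size and session size are hypothetical numbers chosen for illustration.

```typescript
const SAFETY_MARGIN = 1.2; // value quoted in the issue

// Assumed guard: compact once the padded estimate exceeds the context window.
function shouldCompact(estimatedTokens: number, contextWindow: number): boolean {
  return estimatedTokens * SAFETY_MARGIN > contextWindow;
}

const contextWindow = 200000; // hypothetical window size
const actualTokens = 380000; // hypothetical true usage of a CJK-heavy session
const undercount = 3.8; // ratio from the 19-character example in PR #70081
const estimatedTokens = Math.round(actualTokens / undercount); // 100000

console.log(shouldCompact(estimatedTokens, contextWindow)); // false: the padded estimate is about 120000
console.log(actualTokens > contextWindow); // true: the real context has already overflowed
```

Under this assumption, a 1.2 margin only absorbs an undercount of up to 1.2×, so any estimator that is off by 3–4× passes the guard long after the window is full.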

Additional information

Proposed Fix

Immediate: Wrap upstream estimateTokens with OpenClaw's cjk-chars.ts in compaction path
Medium-term: Recalibrate cjk-chars.ts ratios against real tokenizers (current 4× virtual chars may be too conservative; lossless-claw uses 1.5 tokens/char)
Long-term: Integrate exact tokenizer (js-tiktoken/gpt-tokenizer) as optional exact backend
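The long-term option, an exact tokenizer as an optional backend, could take the shape of a small pluggable estimator. This is only a sketch of the design idea: `TokenCounter` and `createTokenEstimator` are hypothetical names, and the "exact" backend below is a fake counter so the example stays dependency-free. A real setup might wrap a tokenizer library such as js-tiktoken behind the same function type.

```typescript
type TokenCounter = (text: string) => number;

// Prefer the exact backend when one is supplied; fall back to the heuristic otherwise.
function createTokenEstimator(heuristic: TokenCounter, exact?: TokenCounter): TokenCounter {
  return (text: string): number => {
    if (exact) {
      try {
        return exact(text);
      } catch {
        // Exact backend failed (for example, an unsupported model): fall back.
      }
    }
    return heuristic(text);
  };
}

const heuristic: TokenCounter = (text) => Math.ceil(text.length / 4);
// Fake "exact" backend for illustration only: one token per code point.
const fakeExact: TokenCounter = (text) => Array.from(text).length;

const fast = createTokenEstimator(heuristic);
const precise = createTokenEstimator(heuristic, fakeExact);
console.log(fast("日本語のテキスト")); // 2
console.log(precise("日本語のテキスト")); // 8
```

Keeping the heuristic as the default means installations without the tokenizer dependency behave as before, while the exact path can be enabled where its startup and memory cost is acceptable.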

Maintainer

CC: @jalehman (Compaction, Context Engine maintainer per CONTRIBUTING.md)

extent analysis

TL;DR

In the compaction path, stop relying on estimateTokens from @mariozechner/pi-coding-agent alone and use a CJK-aware estimate built on OpenClaw's estimateStringChars, so that CJK content is no longer undercounted.

Guidance

Identify all occurrences of estimateTokens from @mariozechner/pi-coding-agent in the compaction path and route them through a CJK-aware wrapper built on OpenClaw's estimateStringChars, so that CJK characters are no longer counted at ~0.25 tokens each.
Verify the fix by testing with CJK-heavy sessions and checking that compaction triggers correctly before the context window overflows.
Consider recalibrating the cjk-chars.ts ratios against real tokenizers for more accurate estimates.
In the long term, integrating an exact tokenizer like js-tiktoken/gpt-tokenizer as an optional backend could provide the most accurate token estimates.

Example

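// src/agents/compaction.ts:103-107
// estimateStringChars takes a string and returns weighted characters, so it cannot
// replace estimateTokens(message) directly. A per-message wrapper is needed, such as
// estimateTokensCjkAware (the name used in PR #70112, defined in the same file).
export function estimateMessagesTokens(messages: AgentMessage[]): number {
  const safe = stripToolResultDetails(messages);
  return safe.reduce((sum, message) => sum + estimateTokensCjkAware(message), 0);
}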

Notes

The provided fix assumes that OpenClaw's estimateStringChars is correctly implemented and provides accurate estimates for CJK characters. If this is not the case, further investigation into the cjk-chars.ts file may be necessary.

Recommendation

Apply the fix by routing compaction-path estimates through a CJK-aware wrapper around estimateStringChars. This yields higher, more realistic token estimates for CJK content and lets compaction trigger before the context window overflows.
