claude-code: How to fix [BUG] Token estimation vs API-reported tokens diverge significantly for cached context (1 comment, 2 participants)

GitHub stats: anthropics/claude-code#46422 (fetched 2026-04-11 06:20:43) · 1 comment · 2 participants · 3 timeline events (labeled ×2, commented ×1) · 0 reactions


Preflight Checklist

  • I have searched existing issues and this has not been reported yet
  • This is a single bug report
  • I am using the latest version of Claude Code

What's Wrong?

Claude Code's local token estimation (used for AutoCompact triggering, context bar display, and status line) diverges significantly from the API-reported token counts in the response usage field, particularly for sessions with high cache_read_input_tokens and cache_creation_input_tokens.

This causes:

  1. AutoCompact fires at the wrong threshold: local estimation shows, e.g., "80% full" while the API has already consumed much more of the context window
  2. Context bar misreads: the user sees "X% used", but the real percentage is different
  3. Silent context overruns: the session appears fine locally but unexpectedly hits API limits

Reproduction

  1. Start a long session with multiple rounds of tool use
  2. Note the local context bar percentage vs the API-reported input token counts (input_tokens plus the two cache fields) in responses
  3. Compare: local estimation may show 60% while the API-reported usage corresponds to 85%+ of the effective window
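
The comparison in step 3 can be scripted. This is a minimal sketch, not Claude Code code: compareContextUsage is a hypothetical helper, it assumes the usage object returned by the Messages API, and the 200K window in the example is an assumed value. Because input_tokens excludes cached tokens, the API-side figure is the sum of all three input fields.

```javascript
// Hypothetical helper for the reproduction steps above. Assumes the
// Messages API usage object and a caller-supplied context window size.
function compareContextUsage(localPercent, usage, contextWindow) {
  // input_tokens excludes cached tokens, so all three fields are summed.
  const apiTokens =
    (usage.input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0);
  // Multiply before dividing to keep the percentage exact for round inputs.
  const apiPercent = (apiTokens * 100) / contextWindow;
  return { apiTokens, apiPercent, gapPoints: apiPercent - localPercent };
}

// Example: the context bar shows 60% for a turn dominated by cache reads.
const result = compareContextUsage(
  60,
  {
    input_tokens: 5000,
    cache_read_input_tokens: 150000,
    cache_creation_input_tokens: 15000,
  },
  200000 // assumed window size
);
console.log(result); // { apiTokens: 170000, apiPercent: 85, gapPoints: 25 }
```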

Technical Detail

The divergence appears in how Claude Code estimates tokens vs how the API counts them:

Local estimation (from tokenCountWithEstimation / context bar):

  • Counts raw token content (messages + tools + system prompt)
  • Does not always correctly account for cache hit savings
  • Shows a number that may not reflect actual API context consumption

API-reported (from response usage field):

  • input_tokens: fresh tokens consumed this request
  • cache_read_input_tokens: tokens served from cache (5-minute ephemeral cache)
  • cache_creation_input_tokens: tokens written to cache this request
  • Neither cache field is subtracted from the effective context window in local estimation

Result: A session with 100K cache_read_input_tokens might show as "60% full" locally while actually having consumed far more of the API's context budget.
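
The three usage fields listed above do not overlap, so their sum is the full prompt size for a request. The sketch below totals them and reports the share served from cache reads, i.e. the "cache ratio" the Impact section refers to. summarizeUsage is a hypothetical helper and the numbers are made up for illustration.

```javascript
// Hypothetical helper; assumes the Messages API usage fields listed above.
function summarizeUsage(usage) {
  const fresh = usage.input_tokens ?? 0; // uncached input
  const cacheRead = usage.cache_read_input_tokens ?? 0; // served from cache
  const cacheWrite = usage.cache_creation_input_tokens ?? 0; // written to cache
  // The fields are disjoint, so the sum is the whole prompt.
  const total = fresh + cacheRead + cacheWrite;
  return {
    total,
    // Share of the prompt read from cache; 0 for an empty prompt.
    cacheReadShare: total === 0 ? 0 : cacheRead / total,
  };
}

const summary = summarizeUsage({
  input_tokens: 2000,
  cache_read_input_tokens: 100000,
  cache_creation_input_tokens: 18000,
});
console.log(summary.total); // 120000
console.log(summary.cacheReadShare.toFixed(2)); // 0.83
```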

Root Cause Hypothesis

The gap is in how cache_read_input_tokens are treated in getEffectiveContextWindowSize(). The API counts cache reads against the context window differently than Claude Code's local estimation does.
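
One way to check this hypothesis: in the Messages API, prompt caching lowers the billed rate for cached input, but the cached tokens are still part of the prompt the model processes. The same prompt should therefore occupy the same amount of context whether it is written to the cache (first request) or read from it (a follow-up within the cache lifetime). The usage objects below are hypothetical, and contextTokens is an illustrative helper.

```javascript
// Two hypothetical usage objects for the same 120K-token prompt:
// the first request writes the cache, the second reads from it.
const coldRequest = {
  input_tokens: 500,
  cache_read_input_tokens: 0,
  cache_creation_input_tokens: 119500,
};
const warmRequest = {
  input_tokens: 500,
  cache_read_input_tokens: 119500,
  cache_creation_input_tokens: 0,
};

// Context consumed = every input token in the request, cached or not.
function contextTokens(usage) {
  return (
    (usage.input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0)
  );
}

console.log(contextTokens(coldRequest)); // 120000
console.log(contextTokens(warmRequest)); // 120000
```

Only the split between the fields changes between the two requests, so an estimator that treats cache reads as free would see the warm request as almost empty.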

Impact

  • AutoCompact fires at the wrong time: too late or too early, depending on the cache ratio
  • Context bar is misleading: users cannot trust the displayed percentage
  • Budget surprises: sessions unexpectedly hit limits when local estimation said there was headroom
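
To illustrate the first impact, here is a hypothetical AutoCompact gate driven by API-reported usage rather than a local estimate. shouldAutoCompact, its 0.8 default threshold, and the effectiveWindow parameter (standing in for whatever getEffectiveContextWindowSize() returns) are assumptions, not Claude Code's actual logic.

```javascript
// Hypothetical gate: the signature and threshold are illustrative only.
function shouldAutoCompact(usage, effectiveWindow, threshold = 0.8) {
  const used =
    (usage.input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0) +
    // The assistant reply becomes part of the next request's prompt.
    (usage.output_tokens ?? 0);
  return used >= threshold * effectiveWindow;
}

const WINDOW = 200000; // assumed effective window for the example

// 3K fresh + 150K cache reads + 4K cache writes + 2K output = 159K tokens.
console.log(
  shouldAutoCompact(
    { input_tokens: 3000, cache_read_input_tokens: 150000,
      cache_creation_input_tokens: 4000, output_tokens: 2000 },
    WINDOW
  )
); // false (159K is below the 160K threshold)

// 6K more cache reads: 165K tokens crosses the threshold.
console.log(
  shouldAutoCompact(
    { input_tokens: 3000, cache_read_input_tokens: 156000,
      cache_creation_input_tokens: 4000, output_tokens: 2000 },
    WINDOW
  )
); // true
```

Counting only input_tokens and output_tokens, both turns would register 5K tokens and compaction would never trigger.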

Suggested Labels

area:core, area:cost, bug

extent analysis

TL;DR

Adjust the local token estimation so that cached input (cache_read_input_tokens and cache_creation_input_tokens) counts toward context consumption alongside input_tokens, instead of being treated as a saving.

Guidance

  • Review the tokenCountWithEstimation function to ensure it accurately reflects the API's context consumption, including cache hits and misses.
  • Modify the getEffectiveContextWindowSize() function so that cache_read_input_tokens and cache_creation_input_tokens are subtracted from the remaining effective context window (that is, counted as consumed, just like input_tokens), matching the API's counting method.
  • Verify the changes by comparing local estimations with API-reported token counts in various scenarios, including high cache usage.
  • Consider adding logging or debugging statements to monitor the accuracy of local token estimations and identify potential discrepancies.
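
The verification and logging steps can be combined into a drift check that compares a local estimate with the API total after each response. This is a sketch under stated assumptions: checkEstimateDrift, the 10% tolerance, and the warning format are hypothetical, and the API figure is treated as authoritative, as the Notes below assume.

```javascript
// Hypothetical drift check; tolerance and log format are assumptions.
function checkEstimateDrift(localEstimate, usage, tolerance = 0.1) {
  const apiTotal =
    (usage.input_tokens ?? 0) +
    (usage.cache_read_input_tokens ?? 0) +
    (usage.cache_creation_input_tokens ?? 0);
  // Relative drift, measured against the API figure.
  const drift = apiTotal === 0 ? 0 : (apiTotal - localEstimate) / apiTotal;
  const withinTolerance = Math.abs(drift) <= tolerance;
  if (!withinTolerance) {
    console.warn(
      `token estimate drift ${(drift * 100).toFixed(1)}%: ` +
        `local=${localEstimate} api=${apiTotal}`
    );
  }
  return { apiTotal, drift, withinTolerance };
}

// A local estimate of 100K against 170K reported by the API.
const report = checkEstimateDrift(100000, {
  input_tokens: 5000,
  cache_read_input_tokens: 150000,
  cache_creation_input_tokens: 15000,
});
// stderr: token estimate drift 41.2%: local=100000 api=170000
```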

Example

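// Sketch; actual implementation may vary. Both function names are hypothetical.
// In the Messages API, input_tokens excludes cached tokens, so the three
// usage fields are added together, not subtracted from one another.
function getContextTokensUsed(apiResponse) {
  const usage = apiResponse.usage ?? {};
  const cacheReadTokens = usage.cache_read_input_tokens ?? 0;
  const cacheCreationTokens = usage.cache_creation_input_tokens ?? 0;
  return (usage.input_tokens ?? 0) + cacheReadTokens + cacheCreationTokens;
}
// Remaining headroom: the effective window minus everything consumed.
function getRemainingContext(effectiveWindow, apiResponse) {
  return effectiveWindow - getContextTokensUsed(apiResponse);
}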

Notes

The provided solution assumes that the API's counting method is the authoritative source for token consumption. However, the actual implementation may require additional considerations, such as handling edge cases or optimizing performance.

Recommendation

Adjust the local token estimation to match the API's counting method. This gives a more accurate picture of context consumption and helps prevent unexpected budget overruns.
