claude-code - 💡(How to fix) Fix Prompt cache fully re-created after turns with many parallel tool calls (cache_read collapses to system+tools floor) — ~74% of cache writes wasted on Opus 4.8 / v2.1.15x

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Fix Action

Fix / Workaround

Confirmed live in one session: invoking ToolSearch to materialize WebFetch made the next request's cache_read drop to 0, and rewinding to before that turn — so the materialized tool was removed from the tools array — restored normal caching. That both isolates the tools-array mutation as the cause and gives a practical workaround (avoid unnecessary tool materialization; rewind past it if it happens).

RAW_BUFFERClick to expand / collapse

Describe the bug

Since ~v2.1.154 (coincident with the Opus 4.7 → 4.8 switch), the prompt cache is repeatedly invalidated mid-session in two distinct ways, causing the entire conversation to be re-cached from scratch over and over. Across four of my sessions, 74% of all cache_creation tokens were waste — re-caching content that had been cached seconds earlier — for ~$42 of needless Opus cache-write spend in those four sessions alone.

I reconstructed the real API calls from the session transcripts (collapsing assistant records by requestId) and read cache_read_input_tokens (cr) and cache_creation_input_tokens (cc) from message.usage.

In a healthy session every call satisfies cr[n] ≈ cr[n-1] + cc[n-1] (the cache grows monotonically as a prefix; cc is just the new turn's delta). Two failure modes break this:

Mode B — message-history cache invalidated (cache_read collapses to the system+tools floor) — primary, high cost

On certain turns the next request's cache_read collapses to the size of the system+tools block only (~18.7k tokens), and cache_creation balloons to ≈ the entire message history. The full history is re-written to cache, then read straight back on the following turn. The system/tools prefix stays cached; only the message-history portion is lost.

Evidence it's the whole history, not a normal delta:

  • prior context 222,540 tok; a healthy cc would be ~12,902; actual cc = 216,643, cr = 18,799
  • another turn re-cached 232,372 tokens that had been cached 70 seconds earlier (healthy cc would be ~3,121)

This has more than one trigger — parallel tool-calling is the most common but not the only one:

Trigger 1 — a preceding turn with many parallel tool_use blocks (most common). Sharply discriminated across four sessions:

preceding-turn parallel tool_use blocks
floor-miss calls (n=14)mean 16.7, median 13 (dist [0,0,12,12,12,12,13,13,15,15,17,25,43,45])
healthy calls (n=68)mean 2.5, median 2 (only one ever exceeded 8)

→ 12 of 14 floor misses in those sessions immediately follow a turn with ≥12 parallel tool calls.

Possible second trigger (tentative — single, entangled observation). In one session a floor miss occurred on a user turn ~206s after the prior turn, with no heavy parallel-tool turn preceding it (cache_read dropped from ~152k to the 18,706 floor, re-caching 139k). But that same turn also invoked ToolSearch (see Mode A), so it is not a clean isolated case — treat it as unconfirmed. The parallel-tool trigger above is the well-supported one.

These misses occur well inside Claude Code's cache TTL (Claude Code requests the 1-hour / 60-minute extended TTL, cache_control: {ttl: "1h"}). Observed miss gaps range from 28s to ~25 min — all < 60 min, so none is TTL expiry.

Mode A — full cache invalidation (cache_read → 0) after ToolSearch

When ToolSearch materializes a deferred tool, the next request's cache_read drops to 0 — the whole prefix (system + tools + history) is re-created. 3 of 3 full misses in my data are ToolSearch-preceded. This is consistent with the materialized tool schema being added to the tools array, which sits at the front of the cached prefix. (A deferred_tools_delta injected as message content does not break the cache, so it's specifically the tools-array mutation.)

Confirmed live in one session: invoking ToolSearch to materialize WebFetch made the next request's cache_read drop to 0, and rewinding to before that turn — so the materialized tool was removed from the tools array — restored normal caching. That both isolates the tools-array mutation as the cause and gives a practical workaround (avoid unnecessary tool materialization; rewind past it if it happens).

Steps to reproduce

In a long Opus 4.8 session (context > ~50k tokens), either:

  1. Mode B (parallel tools): get the assistant to issue one turn with ≥12 parallel tool calls (easy during codebase exploration — many Bash/Read/Grep at once). On the next request, cache_read drops to the system+tools floor and cache_creation ≈ the full history.
  2. Mode A (ToolSearch): call ToolSearch to materialize a deferred tool. On the next request, cache_read drops to 0. (Rewinding to before that turn restores caching.)

Inspect ~/.claude/projects/<proj>/<session>.jsonl: group type:"assistant" records by requestId, read message.usage.cache_read_input_tokens / cache_creation_input_tokens, and compare consecutive API calls.

Expected behavior

Parallel tool calls and deferred-tool materialization should not invalidate the cached conversation prefix. cache_read should keep growing monotonically; cache_creation should only ever cover the genuinely new content of the latest turn.

Actual behavior

The cached prefix is abandoned and the entire conversation history (100k–260k tokens) is re-written to cache on the next turn after a many-parallel-tool turn (Mode B) or after ToolSearch (Mode A), then read back on the turn after that. This repeats every few turns, billing cache writes (1.25× input rate) instead of cache reads (0.1×).

Likely cause (inference — cache_control is not logged in transcripts)

Mode B keeps system+tools cached but loses the entire message history, which means the request retained the system/tools breakpoint but no cache_control breakpoint covered the conversation prefix that was demonstrably cached moments earlier. The likely cause: Claude Code's rolling message-history breakpoint not surviving a heavy turn — with the API's 4-breakpoint limit, a single 12–45-block turn likely pushes the rolling breakpoints entirely inside the newly-added (uncached) block group, abandoning the breakpoint that covered the older prefix.

Mode A is more direct: materializing a deferred tool changes the tools array at the front of the prompt, invalidating the whole prefix cache (cache_read → 0).

The fix most likely belongs in Claude Code's cache-breakpoint / tools-array handling — neither aggressive parallel tool-calling nor on-demand tool materialization should invalidate an otherwise-warm prefix.

Honest confound: the CC version bump and the Opus 4.7 → 4.8 model switch happened together, and Opus 4.8 parallelizes tool calls far more (max 4–11/turn on 4.7 vs 43–45/turn on 4.8). But the within-session contrast controls for the model: inside one Opus-4.8 session, floor misses occur only after heavy turns and never after normal turns (same model both times), so the parallel-tool-block count is the causal trigger and the breakpoint handling is what fails to cope.

Impact (4 sessions analyzed)

  • 17 cache-miss turns
  • 3,062,631 cache_creation tokens billed; 2,259,367 (74%) were waste
  • $42 wasted Opus cache-write spend across just these four sessions

Environment

  • Claude Code version: 2.1.158 (also reproduced on 2.1.156; first seen at 2.1.154)
  • Model: claude-opus-4-8
  • OS: macOS 26.5 (build 25F71), Darwin 25.5.0
  • Pre-regression control: ~1,750 API calls across 8 sessions on v2.1.140–2.1.153 with claude-opus-4-7 showed only one ≥12-tool turn and zero floor misses.

Notes / red herrings ruled out

  • Records with cache_read=0, cache_creation=0, input=0 are model:"<synthetic>" (stop_reason:"stop_sequence", output 0) — locally-generated interrupt/stop placeholders, not API calls. Excluded.
  • TTL expiry excluded: Claude Code requests the 1-hour / 60-minute extended cache TTL (cache_control: {ttl: "1h"}), and every observed miss gap (28s up to ~25 min) is well under 60 min — so no miss is explained by expiry.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Parallel tool calls and deferred-tool materialization should not invalidate the cached conversation prefix. cache_read should keep growing monotonically; cache_creation should only ever cover the genuinely new content of the latest turn.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

claude-code - 💡(How to fix) Fix Prompt cache fully re-created after turns with many parallel tool calls (cache_read collapses to system+tools floor) — ~74% of cache writes wasted on Opus 4.8 / v2.1.15x