claude-code - 💡(How to fix) Fix Prompt cache fully re-created after turns with many parallel tool calls (cache_read collapses to system+tools floor) — ~74% of cache writes wasted on Opus 4.8 / v2.1.15x

StepCodex · 2026-05-30T10:53:46Z

[claude-code] Describe the bug Since ~v2.1.154 coincident with the Opus 4.7 → 4.8 switch , the prompt cache is repeatedly invalidated mid-session in two distin… ## Fix / Workaround Confirmed live in one session: invoking `ToolSearch` to materialize `WebFetch` made the next request's `cache_read` drop to 0, and **rewinding to before that turn — so the materialized tool was removed from the `tools` array — restored normal caching.** That both isolates the `tools`-array mutation as the cause and gives a practical workaround (avoid unnecessary tool materialization; rewind past it if it happens). ## Describe the bug Since ~v2.1.154 (coincident with the Opus 4.7 → 4.8 switch), the prompt cache is repeatedly invalidated mid-session in two distinct ways, causing the entire conversation to be re-cached from scratch over and over. Across four of my sessions, **74% of all `cache_creation` tokens were waste** — re-caching content that had been cached seconds earlier — for ~$42 of needless Opus cache-write spend in those four sessions alone. I reconstructed the real API calls from the session transcripts (collapsing assistant records by `requestId`) and read `cache_read_input_tokens` (cr) and `cache_creation_input_tokens` (cc) from `message.usage`. In a healthy session every call satisfies `cr[n] ≈ cr[n-1] + cc[n-1]` (the cache grows monotonically as a prefix; cc is just the new turn's delta). Two failure modes break this: ### Mode B — message-history cache invalidated (`cache_read` collapses to the system+tools floor) — primary, high cost On certain turns the *next* request's `cache_read` collapses to the size of the system+tools block only (~18.7k tokens), and `cache_creation` balloons to ≈ the **entire** message history. The full history is re-written to cache, then read straight back on the following turn. The system/tools prefix stays cached; only the message-history portion is lost. Evidence it's the whole history, not a normal delta: - prior context 222,540 tok; a healthy `cc` would be ~12,902; **actual `cc` = 216,643**, `cr` = 18,799 - another turn re-cached **232,372 tokens** that had been cached **70 seconds** earlier (healthy `cc` would be ~3,121) This has **more than one trigger** — parallel tool-calling is the most common but not the only one: **Trigger 1 — a preceding turn with many parallel `tool_use` blocks** (most common). Sharply discriminated across four sessions: | | preceding-turn parallel `tool_use` blocks | |---|---| | **floor-miss calls** (n=14) | mean **16.7**, median 13 (dist `[0,0,12,12,12,12,13,13,15,15,17,25,43,45]`) | | **healthy calls** (n=68) | mean **2.5**, median 2 (only one ever exceeded 8) | → 12 of 14 floor misses in those sessions immediately follow a turn with ≥12 parallel tool calls. **Possible second trigger (tentative — single, entangled observation).** In one session a floor miss occurred on a user turn ~206s after the prior turn, with no heavy parallel-tool turn preceding it (`cache_read` dropped from ~152k to the 18,706 floor, re-caching 139k). But that same turn also invoked `ToolSearch` (see Mode A), so it is not a clean isolated case — treat it as **unconfirmed**. The parallel-tool trigger above is the well-supported one. These misses occur well inside Claude Code's cache TTL (Claude Code requests the **1-hour / 60-minute extended TTL**, `cache_control: {ttl: "1h"}`). Observed miss gaps range from **28s to ~25 min — all < 60 min**, so none is TTL expiry. ### Mode A — full cache invalidation (`cache_read → 0`) after `ToolSearch` When `ToolSearch` materializes a deferred tool, the *next* request's `cache_read` drops to **0** — the whole prefix (system + tools + history) is re-created. 3 of 3 full misses in my data are `ToolSearch`-preceded. This is consistent with the materialized tool schema being added to the `tools` array, which sits at the front of the cached prefix. (A `deferred_tools_delta` injected as *message* content does **not** break the cache, so it's specifically the `tools`-array mutation.) Confirmed live in one session: invoking `ToolSearch` to materialize `WebFetch` made the next request's `cache_read` drop to 0, and **rewinding to before that turn — so the materialized tool was removed from the `tools` array — restored normal caching.** That both isolates the `tools`-array mutation as the cause and gives a practical workaround (avoid unnecessary tool materialization; rewind past it if it happens). ## Steps to reproduce In a long Opus 4.8 session (context > ~50k tokens), either: 1. **Mode B (parallel tools):** get the assistant to issue one turn with **≥12 parallel tool calls** (easy during codebase exploration — many `Bash`/`Read`/`Grep` at once). On the next request, `cache_read` drops to the system+tools floor and `cache_creation` ≈ the full history. 2. **Mode A (`ToolSearch`):** call `ToolSearch` to materialize a deferred tool. On the next request, `cache_read` drops to 0. (Rewinding to before that turn restores caching.) Inspect `~/.claude/projects/<proj

Fix Action

Fix / Workaround

Confirmed live in one session: invoking ToolSearch to materialize WebFetch made the next request's cache_read drop to 0, and rewinding to before that turn — so the materialized tool was removed from the tools array — restored normal caching. That both isolates the tools-array mutation as the cause and gives a practical workaround (avoid unnecessary tool materialization; rewind past it if it happens).

Describe the bug

Since ~v2.1.154 (coincident with the Opus 4.7 → 4.8 switch), the prompt cache is repeatedly invalidated mid-session in two distinct ways, causing the entire conversation to be re-cached from scratch over and over. Across four of my sessions, 74% of all cache_creation tokens were waste — re-caching content that had been cached seconds earlier — for ~$42 of needless Opus cache-write spend in those four sessions alone.

I reconstructed the real API calls from the session transcripts (collapsing assistant records by requestId) and read cache_read_input_tokens (cr) and cache_creation_input_tokens (cc) from message.usage.

In a healthy session every call satisfies cr[n] ≈ cr[n-1] + cc[n-1] (the cache grows monotonically as a prefix; cc is just the new turn's delta). Two failure modes break this:

Mode B — message-history cache invalidated (`cache_read` collapses to the system+tools floor) — primary, high cost

On certain turns the next request's cache_read collapses to the size of the system+tools block only (~18.7k tokens), and cache_creation balloons to ≈ the entire message history. The full history is re-written to cache, then read straight back on the following turn. The system/tools prefix stays cached; only the message-history portion is lost.

Evidence it's the whole history, not a normal delta:

prior context 222,540 tok; a healthy cc would be ~12,902; actual cc = 216,643, cr = 18,799
another turn re-cached 232,372 tokens that had been cached 70 seconds earlier (healthy cc would be ~3,121)

This has more than one trigger — parallel tool-calling is the most common but not the only one:

Trigger 1 — a preceding turn with many parallel tool_use blocks (most common). Sharply discriminated across four sessions:

	preceding-turn parallel `tool_use` blocks
floor-miss calls (n=14)	mean 16.7, median 13 (dist `[0,0,12,12,12,12,13,13,15,15,17,25,43,45]`)
healthy calls (n=68)	mean 2.5, median 2 (only one ever exceeded 8)

→ 12 of 14 floor misses in those sessions immediately follow a turn with ≥12 parallel tool calls.

Possible second trigger (tentative — single, entangled observation). In one session a floor miss occurred on a user turn ~206s after the prior turn, with no heavy parallel-tool turn preceding it (cache_read dropped from ~152k to the 18,706 floor, re-caching 139k). But that same turn also invoked ToolSearch (see Mode A), so it is not a clean isolated case — treat it as unconfirmed. The parallel-tool trigger above is the well-supported one.

These misses occur well inside Claude Code's cache TTL (Claude Code requests the 1-hour / 60-minute extended TTL, cache_control: {ttl: "1h"}). Observed miss gaps range from 28s to ~25 min — all < 60 min, so none is TTL expiry.

Mode A — full cache invalidation (`cache_read → 0`) after `ToolSearch`

When ToolSearch materializes a deferred tool, the next request's cache_read drops to 0 — the whole prefix (system + tools + history) is re-created. 3 of 3 full misses in my data are ToolSearch-preceded. This is consistent with the materialized tool schema being added to the tools array, which sits at the front of the cached prefix. (A deferred_tools_delta injected as message content does not break the cache, so it's specifically the tools-array mutation.)

Steps to reproduce

In a long Opus 4.8 session (context > ~50k tokens), either:

Mode B (parallel tools): get the assistant to issue one turn with ≥12 parallel tool calls (easy during codebase exploration — many Bash/Read/Grep at once). On the next request, cache_read drops to the system+tools floor and cache_creation ≈ the full history.
Mode A (ToolSearch): call ToolSearch to materialize a deferred tool. On the next request, cache_read drops to 0. (Rewinding to before that turn restores caching.)

Inspect ~/.claude/projects/<proj>/<session>.jsonl: group type:"assistant" records by requestId, read message.usage.cache_read_input_tokens / cache_creation_input_tokens, and compare consecutive API calls.

Expected behavior

Parallel tool calls and deferred-tool materialization should not invalidate the cached conversation prefix. cache_read should keep growing monotonically; cache_creation should only ever cover the genuinely new content of the latest turn.

Actual behavior

The cached prefix is abandoned and the entire conversation history (100k–260k tokens) is re-written to cache on the next turn after a many-parallel-tool turn (Mode B) or after ToolSearch (Mode A), then read back on the turn after that. This repeats every few turns, billing cache writes (1.25× input rate) instead of cache reads (0.1×).

Likely cause (inference — `cache_control` is not logged in transcripts)

Mode B keeps system+tools cached but loses the entire message history, which means the request retained the system/tools breakpoint but no cache_control breakpoint covered the conversation prefix that was demonstrably cached moments earlier. The likely cause: Claude Code's rolling message-history breakpoint not surviving a heavy turn — with the API's 4-breakpoint limit, a single 12–45-block turn likely pushes the rolling breakpoints entirely inside the newly-added (uncached) block group, abandoning the breakpoint that covered the older prefix.

Mode A is more direct: materializing a deferred tool changes the tools array at the front of the prompt, invalidating the whole prefix cache (cache_read → 0).

The fix most likely belongs in Claude Code's cache-breakpoint / tools-array handling — neither aggressive parallel tool-calling nor on-demand tool materialization should invalidate an otherwise-warm prefix.

Honest confound: the CC version bump and the Opus 4.7 → 4.8 model switch happened together, and Opus 4.8 parallelizes tool calls far more (max 4–11/turn on 4.7 vs 43–45/turn on 4.8). But the within-session contrast controls for the model: inside one Opus-4.8 session, floor misses occur only after heavy turns and never after normal turns (same model both times), so the parallel-tool-block count is the causal trigger and the breakpoint handling is what fails to cope.

Impact (4 sessions analyzed)

17 cache-miss turns
3,062,631 cache_creation tokens billed; 2,259,367 (74%) were waste
≈ $42 wasted Opus cache-write spend across just these four sessions

Environment

Claude Code version: 2.1.158 (also reproduced on 2.1.156; first seen at 2.1.154)
Model: claude-opus-4-8
OS: macOS 26.5 (build 25F71), Darwin 25.5.0
Pre-regression control: ~1,750 API calls across 8 sessions on v2.1.140–2.1.153 with claude-opus-4-7 showed only one ≥12-tool turn and zero floor misses.

Notes / red herrings ruled out

Records with cache_read=0, cache_creation=0, input=0 are model:"<synthetic>" (stop_reason:"stop_sequence", output 0) — locally-generated interrupt/stop placeholders, not API calls. Excluded.
TTL expiry excluded: Claude Code requests the 1-hour / 60-minute extended cache TTL (cache_control: {ttl: "1h"}), and every observed miss gap (28s up to ~25 min) is well under 60 min — so no miss is explained by expiry.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

claude-code - 💡(How to fix) Fix Prompt cache fully re-created after turns with many parallel tool calls (cache_read collapses to system+tools floor) — ~74% of cache writes wasted on Opus 4.8 / v2.1.15x

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Describe the bug

Mode B — message-history cache invalidated (`cache_read` collapses to the system+tools floor) — primary, high cost

Mode A — full cache invalidation (`cache_read → 0`) after `ToolSearch`

Steps to reproduce

Expected behavior

Actual behavior

Likely cause (inference — `cache_control` is not logged in transcripts)

Impact (4 sessions analyzed)

Environment

Notes / red herrings ruled out

FAQ

Expected behavior

Still need to ship something?

TRENDING

claude-code - 💡(How to fix) Fix Prompt cache fully re-created after turns with many parallel tool calls (cache_read collapses to system+tools floor) — ~74% of cache writes wasted on Opus 4.8 / v2.1.15x

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Describe the bug

Mode B — message-history cache invalidated (cache_read collapses to the system+tools floor) — primary, high cost

Mode A — full cache invalidation (cache_read → 0) after ToolSearch

Steps to reproduce

Expected behavior

Actual behavior

Likely cause (inference — cache_control is not logged in transcripts)

Impact (4 sessions analyzed)

Environment

Notes / red herrings ruled out

FAQ

Expected behavior

Still need to ship something?

TRENDING

Mode B — message-history cache invalidated (`cache_read` collapses to the system+tools floor) — primary, high cost

Mode A — full cache invalidation (`cache_read → 0`) after `ToolSearch`

Likely cause (inference — `cache_control` is not logged in transcripts)