claude-code - 💡(How to fix) Fix [BUG] CLAUDE_CODE_USE_BEDROCK=1 writes <2% of prompt to cache at >150k context, vs full prompt on Anthropic 1P direct

Root Cause

For real-world CC usage on Bedrock with large repo context (extended sessions, big code files inlined, MCP servers with large state) this turns multi-second warm-cache requests into ~30-75s cold pre-fills every turn, and charges the full uncached input-token rate every turn (vs the 90% cache-read discount). At Sonnet 4.6 input pricing this is roughly $1.05/turn ($3/M × 350k) instead of $0.10/turn warmed.

Code Example

ANTHROPIC_BASE_URL=https://api.anthropic.com \
   ANTHROPIC_API_KEY=<your-anthropic-key> \
   ANTHROPIC_MODEL=claude-sonnet-4-6 \
   claude --bare -p "<300k-token prompt>" --output-format stream-json --verbose | \
     grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

---

CLAUDE_CODE_USE_BEDROCK=1 \
   AWS_PROFILE=<profile> \
   AWS_REGION=us-east-1 \
   ANTHROPIC_MODEL=us.anthropic.claude-sonnet-4-6 \
   claude --bare -p "<same 300k-token prompt>" --output-format stream-json --verbose | \
     grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

Preflight

I have searched existing issues and this hasn't been reported yet
This is a single bug report
I am using the latest version of Claude Code (v2.1.144)

What's Wrong?

When CLAUDE_CODE_USE_BEDROCK=1 is set and the user prompt is large (>150k tokens), Claude Code emits a cache_control marker that covers only a tiny fraction of the prompt. The same prompt sent through Claude Code in default mode (Anthropic 1P direct) caches the full prompt correctly.

Concretely, with a single ~350k-token user-content prompt:

Mode	`cache_creation_input_tokens`	TTFT cold	TTFT warm
CC + Anthropic 1P direct	~790,000 (full prompt)	~12s	~5s
CC + Bedrock direct (`CLAUDE_CODE_USE_BEDROCK=1`)	~6,500 (≈1.7% of prompt)	~33-75s	~33-45s (no warm-up)

Same Claude Code version (v2.1.144), same model (claude-sonnet-4-6), same prompt fixture, run minutes apart. The Bedrock-mode path effectively never warms because almost nothing gets written to the cache.

This is not a Bedrock-side limit — see "Control test" below; Bedrock honors a single cache_control: {"type": "ephemeral"} marker at 300k context and caches the full ~393k tokens correctly when the request body has the marker placed on the user content.

What Should Happen?

CLAUDE_CODE_USE_BEDROCK=1 mode should place cache_control markers such that the bulk of the user prompt is covered, matching what the default (Anthropic 1P) mode does for the same input.

Evidence — bench measurement

3 trials × 3 task sizes × 2 modes, all using the same CC v2.1.144 binary and the same prompt fixtures:

Task	Mode	T1 cache_creation	T2 cache_creation	T3 cache_creation
50k inline	CC + 1P direct	135,372	133,082	0 (full read of T2)
50k inline	CC + Bedrock direct	135,372	0 (full read)	0 (full read)
150k inline	CC + 1P direct	398,282	398,282	0 (full read)
150k inline	CC + Bedrock direct	398,282	398,282	0 (full read)
300k inline	CC + 1P direct	790,540	792,830	0 (full read)
300k inline	CC + Bedrock direct	6,482 ⚠️	6,800 ⚠️	61,957 (timeout trial)

At 50k and 150k both modes cache correctly. At 300k the Bedrock-direct path writes only ~6.5k tokens to cache (≈1.7% of the prompt) — every trial is effectively cold pre-fill.

Control test — Bedrock works fine

Direct boto3.bedrock-runtime.invoke_model_with_response_stream against us.anthropic.claude-sonnet-4-6 in us-east-1, with the same ~350k-token prompt and one explicit cache_control: {"type": "ephemeral"} block on the user text content:

Trial	Condition	RequestId	TTFT	input_tokens	cache_creation	cache_read
1	no marker	`<redacted>`	18.4s	393,558	0	0
2	no marker	`<redacted>`	30.2s	393,558	0	0
3	marker, WRITE	`<redacted>`	19.3s	3	393,555	0
4	marker, HIT	`<redacted>`	4.3s	3	0	393,555
5	marker, HIT	`<redacted>`	4.5s	3	0	393,555

Bedrock cached 393,555 tokens (full prompt) on the first call and served warm reads on the next two — TTFT collapsed from ~24s to ~4.5s. So at 300k context, Bedrock honors cache_control correctly. The bug is on the CC client side, in how it constructs the request body for the Bedrock transport at large prompt sizes.

Steps to Reproduce

Generate a ~300k-token inline prompt (e.g., synthetic log lines + a marker line + a question). Anything well above the 150k threshold.

Run via Claude Code on Anthropic 1P (control):

ANTHROPIC_BASE_URL=https://api.anthropic.com \
ANTHROPIC_API_KEY=<your-anthropic-key> \
ANTHROPIC_MODEL=claude-sonnet-4-6 \
claude --bare -p "<300k-token prompt>" --output-format stream-json --verbose | \
  grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

Observe cache_creation_input_tokens ≈ 790,000 on the first turn.

Run via Claude Code on Bedrock direct:

CLAUDE_CODE_USE_BEDROCK=1 \
AWS_PROFILE=<profile> \
AWS_REGION=us-east-1 \
ANTHROPIC_MODEL=us.anthropic.claude-sonnet-4-6 \
claude --bare -p "<same 300k-token prompt>" --output-format stream-json --verbose | \
  grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

Observe cache_creation_input_tokens ≈ 6,500 — the bug.

Likely root cause (untested hypothesis)

Looks similar in shape to #50213 (subagent system-context block placed after the cache marker), but on the user-content path rather than the system-prompt path, and only triggering above ~150k tokens. The Bedrock transport may be splitting the prompt into chunks differently than the 1P transport and only attaching the marker to a small leading chunk.

Why this matters

Environment

Claude Code: v2.1.144 (also reproduced on v2.1.145)
Platform: Windows 11 + native Windows binary
Bedrock model: us.anthropic.claude-sonnet-4-6 cross-region inference profile
Region: us-east-1
Boundary: bug appears between 150k (works) and 300k (broken). Did not bisect further.

Notes

Not yet known whether this affects Vertex transport (CLAUDE_CODE_USE_VERTEX=1) at the same context size — please consider checking both.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering