claude-code - 💡(How to fix) Fix [BUG] CLAUDE_CODE_USE_BEDROCK=1 writes <2% of prompt to cache at >150k context, vs full prompt on Anthropic 1P direct

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

For real-world CC usage on Bedrock with large repo context (extended sessions, big code files inlined, MCP servers with large state) this turns multi-second warm-cache requests into ~30-75s cold pre-fills every turn, and charges the full uncached input-token rate every turn (vs the 90% cache-read discount). At Sonnet 4.6 input pricing this is roughly $1.05/turn ($3/M × 350k) instead of $0.10/turn warmed.

Code Example

ANTHROPIC_BASE_URL=https://api.anthropic.com \
   ANTHROPIC_API_KEY=<your-anthropic-key> \
   ANTHROPIC_MODEL=claude-sonnet-4-6 \
   claude --bare -p "<300k-token prompt>" --output-format stream-json --verbose | \
     grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

---

CLAUDE_CODE_USE_BEDROCK=1 \
   AWS_PROFILE=<profile> \
   AWS_REGION=us-east-1 \
   ANTHROPIC_MODEL=us.anthropic.claude-sonnet-4-6 \
   claude --bare -p "<same 300k-token prompt>" --output-format stream-json --verbose | \
     grep -o 'cache_creation_input_tokens":[0-9]*' | head -3
RAW_BUFFERClick to expand / collapse

Preflight

  • I have searched existing issues and this hasn't been reported yet
  • This is a single bug report
  • I am using the latest version of Claude Code (v2.1.144)

What's Wrong?

When CLAUDE_CODE_USE_BEDROCK=1 is set and the user prompt is large (>150k tokens), Claude Code emits a cache_control marker that covers only a tiny fraction of the prompt. The same prompt sent through Claude Code in default mode (Anthropic 1P direct) caches the full prompt correctly.

Concretely, with a single ~350k-token user-content prompt:

Modecache_creation_input_tokensTTFT coldTTFT warm
CC + Anthropic 1P direct~790,000 (full prompt)~12s~5s
CC + Bedrock direct (CLAUDE_CODE_USE_BEDROCK=1)~6,500 (≈1.7% of prompt)~33-75s~33-45s (no warm-up)

Same Claude Code version (v2.1.144), same model (claude-sonnet-4-6), same prompt fixture, run minutes apart. The Bedrock-mode path effectively never warms because almost nothing gets written to the cache.

This is not a Bedrock-side limit — see "Control test" below; Bedrock honors a single cache_control: {"type": "ephemeral"} marker at 300k context and caches the full ~393k tokens correctly when the request body has the marker placed on the user content.

What Should Happen?

CLAUDE_CODE_USE_BEDROCK=1 mode should place cache_control markers such that the bulk of the user prompt is covered, matching what the default (Anthropic 1P) mode does for the same input.

Evidence — bench measurement

3 trials × 3 task sizes × 2 modes, all using the same CC v2.1.144 binary and the same prompt fixtures:

TaskModeT1 cache_creationT2 cache_creationT3 cache_creation
50k inlineCC + 1P direct135,372133,0820 (full read of T2)
50k inlineCC + Bedrock direct135,3720 (full read)0 (full read)
150k inlineCC + 1P direct398,282398,2820 (full read)
150k inlineCC + Bedrock direct398,282398,2820 (full read)
300k inlineCC + 1P direct790,540792,8300 (full read)
300k inlineCC + Bedrock direct6,482 ⚠️6,800 ⚠️61,957 (timeout trial)

At 50k and 150k both modes cache correctly. At 300k the Bedrock-direct path writes only ~6.5k tokens to cache (≈1.7% of the prompt) — every trial is effectively cold pre-fill.

Control test — Bedrock works fine

Direct boto3.bedrock-runtime.invoke_model_with_response_stream against us.anthropic.claude-sonnet-4-6 in us-east-1, with the same ~350k-token prompt and one explicit cache_control: {"type": "ephemeral"} block on the user text content:

TrialConditionRequestIdTTFTinput_tokenscache_creationcache_read
1no marker<redacted>18.4s393,55800
2no marker<redacted>30.2s393,55800
3marker, WRITE<redacted>19.3s3393,5550
4marker, HIT<redacted>4.3s30393,555
5marker, HIT<redacted>4.5s30393,555

Bedrock cached 393,555 tokens (full prompt) on the first call and served warm reads on the next two — TTFT collapsed from ~24s to ~4.5s. So at 300k context, Bedrock honors cache_control correctly. The bug is on the CC client side, in how it constructs the request body for the Bedrock transport at large prompt sizes.

Steps to Reproduce

  1. Generate a ~300k-token inline prompt (e.g., synthetic log lines + a marker line + a question). Anything well above the 150k threshold.

  2. Run via Claude Code on Anthropic 1P (control):

    ANTHROPIC_BASE_URL=https://api.anthropic.com \
    ANTHROPIC_API_KEY=<your-anthropic-key> \
    ANTHROPIC_MODEL=claude-sonnet-4-6 \
    claude --bare -p "<300k-token prompt>" --output-format stream-json --verbose | \
      grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

    Observe cache_creation_input_tokens ≈ 790,000 on the first turn.

  3. Run via Claude Code on Bedrock direct:

    CLAUDE_CODE_USE_BEDROCK=1 \
    AWS_PROFILE=<profile> \
    AWS_REGION=us-east-1 \
    ANTHROPIC_MODEL=us.anthropic.claude-sonnet-4-6 \
    claude --bare -p "<same 300k-token prompt>" --output-format stream-json --verbose | \
      grep -o 'cache_creation_input_tokens":[0-9]*' | head -3

    Observe cache_creation_input_tokens ≈ 6,500 — the bug.

Likely root cause (untested hypothesis)

Looks similar in shape to #50213 (subagent system-context block placed after the cache marker), but on the user-content path rather than the system-prompt path, and only triggering above ~150k tokens. The Bedrock transport may be splitting the prompt into chunks differently than the 1P transport and only attaching the marker to a small leading chunk.

Why this matters

For real-world CC usage on Bedrock with large repo context (extended sessions, big code files inlined, MCP servers with large state) this turns multi-second warm-cache requests into ~30-75s cold pre-fills every turn, and charges the full uncached input-token rate every turn (vs the 90% cache-read discount). At Sonnet 4.6 input pricing this is roughly $1.05/turn ($3/M × 350k) instead of $0.10/turn warmed.

Environment

  • Claude Code: v2.1.144 (also reproduced on v2.1.145)
  • Platform: Windows 11 + native Windows binary
  • Bedrock model: us.anthropic.claude-sonnet-4-6 cross-region inference profile
  • Region: us-east-1
  • Boundary: bug appears between 150k (works) and 300k (broken). Did not bisect further.

Notes

  • Not yet known whether this affects Vertex transport (CLAUDE_CODE_USE_VERTEX=1) at the same context size — please consider checking both.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING