litellm: [Bug]: Anthropic /v1/messages - cache_read_input_tokens not normalized into prompt_tokens_details.cached_tokens; litellm_cached_tokens_metric_total never increments (Vertex + Bedrock confirmed)

Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

Closest related but distinct: #11935 / #11992 (partial fix for Vertex Anthropic passthrough cost tracking; never extended to prometheus.py), #11364 (cached_tokens not populated, Anthropic direct, open), #7790 (async logging callbacks drop cache fields when streaming), #11789 (Anthropic streaming cost tracking ignores cache reads), #26625 (Bedrock /v1/messages caching reported broken — likely the same passthrough-layer pattern).

What happened?

When sending Anthropic Messages-format requests with cache_control to either vertex_ai/claude-* or bedrock/...anthropic.claude-* deployments via the /v1/messages endpoint:

  • The upstream returns valid cache usage (cache_creation_input_tokens / cache_read_input_tokens populated correctly across two consecutive calls).
  • LiteLLM does not normalize these Anthropic-native fields into usage.prompt_tokens_details.cached_tokens (the OpenAI-standardized field).
  • Consequently, litellm_cached_tokens_metric_total is never incremented — verified across two providers (Vertex AI and Bedrock) on the same proxy with confirmed live cache hits at the API level.
  • We suspect litellm_spend_metric is similarly affected (cache reads billed as full-priced input tokens), in line with the pattern in #11789.
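
To give a sense of scale for the suspected spend discrepancy, here is a minimal sketch that prices the input side of call 2 from the logs in this report in two ways. The per-million-token rate and the cache-read multiplier are illustrative assumptions, not values taken from this issue or from LiteLLM's pricing tables, and `input_cost_usd` is a hypothetical helper. Cache writes are left out for brevity.

```python
def input_cost_usd(usage, base_per_mtok, read_multiplier, honor_cache):
    """Input-side cost of one request in USD (cache writes are ignored here)."""
    uncached = usage.get("input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
    # With cache awareness, cached reads are billed at a discounted rate;
    # without it, they are charged like ordinary input tokens.
    weight = read_multiplier if honor_cache else 1.0
    billable = uncached + cache_read * weight
    return billable * base_per_mtok / 1_000_000


# Usage fields from call 2 (cache hit) in this report.
call_2 = {"input_tokens": 7, "cache_read_input_tokens": 1802}

# ASSUMPTIONS: 3.00 USD per million input tokens, cache reads at 0.1x.
discounted = input_cost_usd(call_2, 3.00, 0.1, honor_cache=True)
full_price = input_cost_usd(call_2, 3.00, 0.1, honor_cache=False)
print(f"cache-aware: {discounted:.7f} USD, cache-blind: {full_price:.7f} USD")
```

Under these assumed rates the cache-blind figure is several times larger than the cache-aware one, which is the kind of gap the spend metric would show if the suspicion holds.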

Reproducing across two distinct provider paths suggests the defect lives in the shared response-normalization or async-callback layer for Anthropic-format usage objects, not in either provider's transformer.

Expected behavior

For Anthropic-format responses (Vertex partner-models, Bedrock, and Anthropic-direct), the Prometheus integration should observe non-zero values when cache hits/writes occur. Either the response normalization layer should populate prompt_tokens_details.cached_tokens from cache_read_input_tokens, or _increment_cache_metrics in litellm/integrations/prometheus.py should fall back to the Anthropic-native fields when present.
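
The first option can be sketched as a small mapping from an Anthropic-format usage object to the OpenAI-style shape. This is an illustration only: `normalize_anthropic_usage` is a hypothetical name, not an existing LiteLLM function, and it assumes that prompt_tokens should include cache-read and cache-creation tokens, following the OpenAI convention where cached_tokens is a subset of prompt_tokens.

```python
def normalize_anthropic_usage(usage: dict) -> dict:
    """Map Anthropic-native usage fields onto an OpenAI-style usage dict."""
    cache_read = usage.get("cache_read_input_tokens") or 0
    cache_write = usage.get("cache_creation_input_tokens") or 0
    # ASSUMPTION: Anthropic's input_tokens excludes cached tokens, so they
    # are added back to obtain an OpenAI-style prompt_tokens total.
    prompt_tokens = (usage.get("input_tokens") or 0) + cache_read + cache_write
    completion_tokens = usage.get("output_tokens") or 0
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "prompt_tokens_details": {"cached_tokens": cache_read},
    }


# Usage fields from call 2 (cache hit) in this report.
normalized = normalize_anthropic_usage(
    {
        "input_tokens": 7,
        "cache_creation_input_tokens": 0,
        "cache_read_input_tokens": 1802,
        "output_tokens": 16,
    }
)
print(normalized["prompt_tokens_details"])
```

Applied to call 2, the sketch yields cached_tokens = 1802, which is the value the cache metric would need to see.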

Steps to Reproduce

  1. Run LiteLLM proxy v1.83.10-stable with the prometheus callback and the cache metric explicitly enabled:
litellm_settings:
  callbacks:
    - prometheus
  prometheus_metrics_config:
    - group: cache_metrics
      metrics:
        - litellm_cache_hits_metric
        - litellm_cache_misses_metric
        - litellm_cached_tokens_metric
      include_labels: [model, team_alias]

model_list:
  - model_name: claude-sonnet-4-5-vertex
    litellm_params:
      model: vertex_ai/claude-sonnet-4-5@20250929
      vertex_project: <your-project>
      vertex_location: <your-region>

  - model_name: claude-sonnet-4-5-bedrock
    litellm_params:
      model: bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0
  2. Send two consecutive /v1/messages calls within 5 minutes against either alias, with an identical system prompt long enough to be cached (about 1,800 tokens in this run, per the usage output below) and an explicit cache_control marker:
SYS=$(python3 -c "print(('You are an analyzer of static legal text. ' * 200))")
PAYLOAD=$(jq -nc --arg model "claude-sonnet-4-5-vertex" --arg sys "$SYS" '{
  model: $model,
  max_tokens: 16,
  system: [{"type":"text","text":$sys,"cache_control":{"type":"ephemeral"}}],
  messages: [{"role":"user","content":"ok"}]
}')

for i in 1 2; do
  echo "--- call $i ---"
  curl -sS "$LITELLM_PROXY_URL/v1/messages" \
    -H "x-api-key: $LITELLM_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" | jq '.usage'
  sleep 3
done
  3. Repeat step 2 against the Bedrock alias.

  4. Query Prometheus for the cache metric series.
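
For the final step, one way to check whether a series exists without going through PromQL is to scan the text exposition served by the proxy. The sketch below is a simplified parser that assumes samples carry no trailing timestamp. `series_for` is a hypothetical helper, and the sample text is made up for illustration rather than captured from a real proxy.

```python
def series_for(metric: str, exposition: str) -> list:
    """Return (sample, value) pairs whose metric name equals `metric`."""
    found = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        # The value is the last space-separated field on the line.
        sample, _, value = line.rpartition(" ")
        name = sample.split("{", 1)[0]
        if name == metric:
            found.append((sample, float(value)))
    return found


# Made-up exposition text; label values are placeholders.
SAMPLE = """\
# TYPE litellm_input_tokens_metric_total counter
litellm_input_tokens_metric_total{model="claude-sonnet-4-5-vertex",team_alias="demo-team"} 7.0
litellm_output_tokens_metric_total{model="claude-sonnet-4-5-vertex",team_alias="demo-team"} 16.0
"""

for name in ("litellm_input_tokens_metric_total", "litellm_cached_tokens_metric_total"):
    hits = series_for(name, SAMPLE)
    print(name, "->", hits if hits else "(absent)")
```

In practice the text would be fetched from the proxy's metrics endpoint (assumed here to be served at /metrics on the proxy URL) immediately after the second call.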

Relevant log output

Response payloads — Vertex AI Anthropic. Cache write/read cycle confirmed at the upstream API level:

--- call 1 (cache write) ---
{
  "input_tokens": 7,
  "cache_creation_input_tokens": 1802,
  "cache_read_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 1802,
    "ephemeral_1h_input_tokens": 0
  },
  "output_tokens": 16,
  "total_tokens": 23
}

--- call 2 (cache hit) ---
{
  "input_tokens": 7,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1802,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 0,
    "ephemeral_1h_input_tokens": 0
  },
  "output_tokens": 16,
  "total_tokens": 23
}

Bedrock Anthropic also confirmed working at the upstream API level (same response shape — cache_read_input_tokens > 0 on call 2).

Prometheus state at the same proxy across both tests (sanitized):

# Direct query — no series exist
> litellm_cached_tokens_metric_total
(no data)

# Active-series count for related metrics across the entire proxy
> count by (__name__) ({__name__=~"litellm_.*cach.*|litellm_.*token.*"})
litellm_cache_misses_metric_total          176
litellm_input_tokens_metric_total          313
litellm_output_tokens_metric_total         313
litellm_total_tokens_metric_total          313
litellm_cache_hits_metric_total            (absent)
litellm_cached_tokens_metric_total         (absent)

litellm_cached_tokens_metric_total and litellm_cache_hits_metric_total have never emitted a single series across this proxy, despite continuous production traffic to vertex_ai/claude-* and bedrock/...anthropic.claude-* aliases that reproducibly produces cache hits per the curls above.

Suspected location

Likely candidates (cross-provider reproduction makes this almost certainly a shared-layer defect rather than a per-provider transformer issue):

  • litellm/integrations/prometheus.py: _increment_cache_metrics reads usage.prompt_tokens_details.cached_tokens and does not fall back to the Anthropic-native cache_read_input_tokens.
  • The async-callback path that fires for /v1/messages passthrough does not appear to translate Anthropic-format usage into the OpenAI-standardized shape before invoking callbacks (parallel to #7790).
  • The Anthropic-direct route is untested by us; a maintainer-side spot-check would confirm whether this is universal across all Anthropic-format responses or specific to passthrough providers.
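
The fallback described in the first bullet could look roughly like the sketch below. It does not reproduce the real `_increment_cache_metrics` signature, which is not shown in this report. `cached_tokens_from_usage`, `record_cached_tokens`, and the plain dictionary standing in for a Prometheus counter are all hypothetical.

```python
from collections import defaultdict


def cached_tokens_from_usage(usage: dict) -> int:
    """Prefer the OpenAI-style field, then fall back to the Anthropic-native one."""
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens")
    if cached is not None:
        return int(cached)
    return int(usage.get("cache_read_input_tokens") or 0)


# Stand-in for a labelled counter, keyed by (model, team_alias).
cached_tokens_total = defaultdict(int)


def record_cached_tokens(usage: dict, model: str, team_alias: str) -> None:
    """Add the request's cached tokens to the stand-in counter."""
    tokens = cached_tokens_from_usage(usage)
    if tokens > 0:
        cached_tokens_total[(model, team_alias)] += tokens


# Anthropic-native usage from call 2; "demo-team" is a placeholder label.
record_cached_tokens(
    {"input_tokens": 7, "cache_read_input_tokens": 1802},
    model="claude-sonnet-4-5-vertex",
    team_alias="demo-team",
)
print(dict(cached_tokens_total))
```

Checking the OpenAI-style field first keeps behavior unchanged for responses that are already normalized, so the fallback only takes effect for Anthropic-format usage objects.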

Happy to test additional scenarios — async vs sync logging, streaming vs non-streaming, Anthropic-direct route — to narrow further.

What part of LiteLLM is this about?

Logging / Observability (Prometheus integration + Anthropic-format response normalization)

What LiteLLM version are you on?

v1.83.10-stable

Twitter / LinkedIn details

n/a
