litellm [Bug]: Anthropic /v1/messages: cache_read_input_tokens not normalized into prompt_tokens_details.cached_tokens; litellm_cached_tokens_metric_total never increments (Vertex + Bedrock confirmed)

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

Closest related but distinct: #11935 / #11992 (partial fix for Vertex Anthropic passthrough cost tracking; never extended to prometheus.py), #11364 (cached_tokens not populated, Anthropic direct, open), #7790 (async logging callbacks drop cache fields when streaming), #11789 (Anthropic streaming cost tracking ignores cache reads), #26625 (Bedrock /v1/messages caching reported broken — likely the same passthrough-layer pattern).

What happened?

When sending Anthropic Messages-format requests with cache_control to either vertex_ai/claude-* or bedrock/...anthropic.claude-* deployments via the /v1/messages endpoint:

1. The upstream returns valid cache usage (cache_creation_input_tokens / cache_read_input_tokens populated correctly across two consecutive calls).
2. LiteLLM does not normalize these Anthropic-native fields into usage.prompt_tokens_details.cached_tokens (the OpenAI-standardized field).
3. Consequently, litellm_cached_tokens_metric_total is never incremented. This was verified across two providers (Vertex AI and Bedrock) on the same proxy, with live cache hits confirmed at the API level.
4. We suspect litellm_spend_metric is similarly affected (cache reads billed as full-priced input tokens), in line with the pattern in #11789.

Reproducing across two distinct provider paths suggests the defect lives in the shared response-normalization or async-callback layer for Anthropic-format usage objects, not in either provider's transformer.
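
The missing normalization step can be sketched as follows. This is a minimal illustration, not LiteLLM's actual code: the function name normalize_anthropic_usage is hypothetical, and folding cache reads and cache writes into prompt_tokens is an assumption about the intended OpenAI-style shape.

```python
from typing import Any, Dict


def normalize_anthropic_usage(usage: Dict[str, Any]) -> Dict[str, Any]:
    """Map an Anthropic-format usage object onto an OpenAI-style one.

    Hypothetical helper for illustration only; not part of LiteLLM.
    """
    input_tokens = int(usage.get("input_tokens") or 0)
    output_tokens = int(usage.get("output_tokens") or 0)
    cache_read = int(usage.get("cache_read_input_tokens") or 0)
    cache_write = int(usage.get("cache_creation_input_tokens") or 0)

    # Assumption: the OpenAI-style prompt_tokens includes cached tokens,
    # whereas Anthropic's input_tokens counts only the uncached remainder.
    prompt_tokens = input_tokens + cache_read + cache_write

    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": output_tokens,
        "total_tokens": prompt_tokens + output_tokens,
        "prompt_tokens_details": {"cached_tokens": cache_read},
        # Keep the native fields so existing consumers can still read them.
        "cache_read_input_tokens": cache_read,
        "cache_creation_input_tokens": cache_write,
    }


# Usage object from call 2 (cache hit) in the logs of this report.
call_2_usage = {
    "input_tokens": 7,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 1802,
    "output_tokens": 16,
}
normalized = normalize_anthropic_usage(call_2_usage)
print(normalized["prompt_tokens_details"]["cached_tokens"])
```

With a mapping like this applied before callbacks fire, any consumer that reads only prompt_tokens_details.cached_tokens would see the cache hit.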

Expected behavior

For Anthropic-format responses (Vertex partner-models, Bedrock, and Anthropic-direct), the Prometheus integration should observe non-zero values when cache hits/writes occur. Either the response normalization layer should populate prompt_tokens_details.cached_tokens from cache_read_input_tokens, or _increment_cache_metrics in litellm/integrations/prometheus.py should fall back to the Anthropic-native fields when present.
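
The second option, a fallback inside the metrics code, might look like the sketch below. resolve_cached_tokens is a hypothetical helper, and it assumes the usage object can be read as a plain mapping; the real _increment_cache_metrics may receive a typed object and would need adapting.

```python
from typing import Any, Mapping, Optional


def resolve_cached_tokens(usage: Mapping[str, Any]) -> Optional[int]:
    """Return the cached-token count, preferring the OpenAI-style field.

    Hypothetical helper for illustration only; not part of LiteLLM.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens")
    if cached is not None:
        return int(cached)

    # Fallback: Anthropic-native field, present on /v1/messages responses.
    native = usage.get("cache_read_input_tokens")
    if native is not None:
        return int(native)

    # Neither field is present, so there is nothing to record.
    return None


anthropic_only = {"input_tokens": 7, "cache_read_input_tokens": 1802}
openai_style = {"prompt_tokens_details": {"cached_tokens": 5}}

print(resolve_cached_tokens(anthropic_only))
print(resolve_cached_tokens(openai_style))
print(resolve_cached_tokens({}))
```

Returning None rather than 0 when both fields are absent lets the caller distinguish "no cache information" from "cache miss", which matters if a zero-valued series should still be emitted.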

Steps to Reproduce

1. Run LiteLLM proxy v1.83.10-stable with the prometheus callback and the cache metrics explicitly enabled:

litellm_settings:
  callbacks:
    - prometheus
  prometheus_metrics_config:
    - group: cache_metrics
      metrics:
        - litellm_cache_hits_metric
        - litellm_cache_misses_metric
        - litellm_cached_tokens_metric
      include_labels: [model, team_alias]

model_list:
  - model_name: claude-sonnet-4-5-vertex
    litellm_params:
      model: vertex_ai/claude-sonnet-4-5@20250929
      vertex_project: <your-project>
      vertex_location: <your-region>

  - model_name: claude-sonnet-4-5-bedrock
    litellm_params:
      model: bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0

2. Send two consecutive /v1/messages calls within 5 minutes against either alias, with an identical system prompt long enough to be cached (about 1,800 tokens in the logged run) and an explicit cache_control marker:

SYS=$(python3 -c "print(('You are an analyzer of static legal text. ' * 200))")
PAYLOAD=$(jq -nc --arg model "claude-sonnet-4-5-vertex" --arg sys "$SYS" '{
  model: $model,
  max_tokens: 16,
  system: [{"type":"text","text":$sys,"cache_control":{"type":"ephemeral"}}],
  messages: [{"role":"user","content":"ok"}]
}')

for i in 1 2; do
  echo "--- call $i ---"
  curl -sS "$LITELLM_PROXY_URL/v1/messages" \
    -H "x-api-key: $LITELLM_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" | jq '.usage'
  sleep 3
done

3. Repeat step 2 against the Bedrock alias (claude-sonnet-4-5-bedrock).
4. Query Prometheus for the cache metric series.
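
As an alternative to querying Prometheus, the last step can be approximated by inspecting the proxy's metrics output in the Prometheus text exposition format. The sketch below is illustrative only: count_series is a hypothetical helper, and the sample text, including its values and help string, is made up to keep the example self-contained. In practice the text would come from the proxy's metrics endpoint.

```python
from collections import Counter
from typing import Dict, Iterable


def count_series(exposition_text: str, names: Iterable[str]) -> Dict[str, int]:
    """Count sample lines per metric name in Prometheus text exposition."""
    counts: Counter = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return {name: counts.get(name, 0) for name in names}


# Illustrative sample only; values and help text are invented.
SAMPLE = """\
# HELP litellm_input_tokens_metric_total Illustrative help text
# TYPE litellm_input_tokens_metric_total counter
litellm_input_tokens_metric_total{model="claude-sonnet-4-5-vertex"} 1809.0
litellm_cache_misses_metric_total{model="claude-sonnet-4-5-vertex"} 2.0
"""

wanted = [
    "litellm_cached_tokens_metric_total",
    "litellm_cache_hits_metric_total",
    "litellm_input_tokens_metric_total",
]
result = count_series(SAMPLE, wanted)
for name, n in result.items():
    print(f"{name}: {n if n else '(absent)'}")
```

A count of zero for litellm_cached_tokens_metric_total after confirmed cache hits corresponds to the "(absent)" rows reported in the log output.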

Relevant log output

Response payloads — Vertex AI Anthropic. Cache write/read cycle confirmed at the upstream API level:

--- call 1 (cache write) ---
{
  "input_tokens": 7,
  "cache_creation_input_tokens": 1802,
  "cache_read_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 1802,
    "ephemeral_1h_input_tokens": 0
  },
  "output_tokens": 16,
  "total_tokens": 23
}

--- call 2 (cache hit) ---
{
  "input_tokens": 7,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1802,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 0,
    "ephemeral_1h_input_tokens": 0
  },
  "output_tokens": 16,
  "total_tokens": 23
}

Bedrock Anthropic also confirmed working at the upstream API level (same response shape — cache_read_input_tokens > 0 on call 2).

Prometheus state at the same proxy across both tests (sanitized):

# Direct query — no series exist
> litellm_cached_tokens_metric_total
(no data)

# Active-series count for related metrics across the entire proxy
> count by (__name__) ({__name__=~"litellm_.*cach.*|litellm_.*token.*"})
litellm_cache_misses_metric_total          176
litellm_input_tokens_metric_total          313
litellm_output_tokens_metric_total         313
litellm_total_tokens_metric_total          313
litellm_cache_hits_metric_total            (absent)
litellm_cached_tokens_metric_total         (absent)

litellm_cached_tokens_metric_total and litellm_cache_hits_metric_total have never emitted a single series across this proxy, despite continuous production traffic to vertex_ai/claude-* and bedrock/...anthropic.claude-* aliases that reproducibly produces cache hits per the curls above.
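
To gauge how much the suspected spend miscount could matter for the call-2 usage shown above, the sketch below compares input cost with and without honoring the cache read. The per-token rates are placeholder assumptions rather than values taken from LiteLLM's pricing data, and input_cost is a hypothetical helper.

```python
# Placeholder rates (assumptions); real prices depend on model and provider.
INPUT_RATE = 3.00 / 1_000_000       # assumed USD per uncached input token
CACHE_READ_RATE = 0.30 / 1_000_000  # assumed USD per cache-read token


def input_cost(usage: dict, honor_cache: bool) -> float:
    """Estimate input-side cost for one Anthropic-format usage object."""
    fresh = usage.get("input_tokens", 0)
    cached = usage.get("cache_read_input_tokens", 0)
    if honor_cache:
        # Cache reads are billed at the discounted rate.
        return fresh * INPUT_RATE + cached * CACHE_READ_RATE
    # Suspected behavior: cache reads are treated as ordinary input tokens.
    return (fresh + cached) * INPUT_RATE


call_2_usage = {"input_tokens": 7, "cache_read_input_tokens": 1802}
with_cache = input_cost(call_2_usage, honor_cache=True)
without_cache = input_cost(call_2_usage, honor_cache=False)
print(f"cache honored: ${with_cache:.6f}")
print(f"cache ignored: ${without_cache:.6f}")
```

Under these assumed rates the two figures differ by roughly a factor of ten for a request dominated by cached tokens, so checking litellm_spend_metric alongside the token metrics seems worthwhile.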

Suspected location

Likely candidates (the cross-provider reproduction makes this almost certainly a shared-layer defect rather than a per-provider transformer issue):

1. litellm/integrations/prometheus.py: _increment_cache_metrics reads usage.prompt_tokens_details.cached_tokens and does not fall back to the Anthropic-native cache_read_input_tokens.
2. The async-callback path that fires for /v1/messages passthrough: it does not appear to translate Anthropic-format usage into the OpenAI-standardized shape before invoking callbacks (parallel to #7790).
3. The Anthropic-direct route is untested by us; a maintainer-side spot-check would confirm whether this is universal across all Anthropic-format responses or specific to passthrough providers.

Happy to test additional scenarios — async vs sync logging, streaming vs non-streaming, Anthropic-direct route — to narrow further.

What part of LiteLLM is this about?

Logging / Observability (Prometheus integration + Anthropic-format response normalization)

What LiteLLM version are you on?

v1.83.10-stable

Twitter / LinkedIn details

n/a
