litellm - 💡(How to fix) Fix [Bug]: Anthropic→Responses streaming adapter never extracts cache_read_input_tokens (always 0) [1 pull requests]

Fix Action

Fixed


Bug Description

When clients call /v1/messages (Anthropic format) and LiteLLM bridges to an OpenAI Responses API model (mode: "responses"), the streaming response emitted back to the client always reports cache_read_input_tokens = 0 — even when OpenAI's underlying response correctly reports thousands of cached prompt tokens.

This breaks observability for any Anthropic-compatible client (Claude Code, Claude Agent SDK, etc.) routing OpenAI traffic through LiteLLM: dashboards, Sentry, Langfuse, billing readouts all see 0 cache reads and conclude "OpenAI prompt caching is broken" — when in fact only the reporting is broken.

Root Cause

litellm/llms/anthropic/experimental_pass_through/responses_adapters/streaming_iterator.py, lines 264–270 in AnthropicResponsesStreamWrapper._process_event, on the response.completed event:

# Prefer direct cache fields if present
cache_creation_tokens = int(
    getattr(usage, "cache_creation_input_tokens", 0) or 0
)
cache_read_tokens = int(
    getattr(usage, "cache_read_input_tokens", 0) or 0
)

The translator reads usage.cache_read_input_tokens — an Anthropic-only field name. The OpenAI Responses API usage object exposes the same value at usage.input_tokens_details.cached_tokens (consistent with Chat Completions' usage.prompt_tokens_details.cached_tokens). getattr for the non-existent Anthropic attribute always returns the default 0.
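The mismatch can be reproduced without any network calls. The sketch below uses types.SimpleNamespace as a stand-in for the shape of a Responses API usage object (the values are illustrative, not captured output) and compares the lookup the adapter performs today with the nested lookup described above:

```python
from types import SimpleNamespace

# Stand-in for an OpenAI Responses API usage object (illustrative values).
usage = SimpleNamespace(
    input_tokens=1234,
    output_tokens=20,
    input_tokens_details=SimpleNamespace(cached_tokens=1024),
)

# What the adapter does today: look up an Anthropic-only attribute name.
# The attribute does not exist, so getattr falls back to the default.
anthropic_style = int(getattr(usage, "cache_read_input_tokens", 0) or 0)

# Where the Responses API reports the value.
details = getattr(usage, "input_tokens_details", None)
responses_style = int(getattr(details, "cached_tokens", 0) or 0)

print(anthropic_style, responses_style)  # prints: 0 1024
```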

LiteLLM gets this right in the parallel Chat Completions path (litellm/llms/anthropic/experimental_pass_through/adapters/transformation.py:1367–1386 and adapters/streaming_iterator.py:287–293):

cached_tokens = (
    getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0
)

The Responses adapter just needs the equivalent.

Reproduction

import asyncio, os
import openai

# Pre-warm OpenAI's cache with a known body so subsequent calls have something to hit.
WARMUP_INSTRUCTIONS = (
    "You are a careful, methodical engineer. Respond with short factual statements. "
) * 50  # ~1.2K tokens — clears the 1024-token minimum for implicit caching.

async def main():
    direct = openai.AsyncOpenAI()
    # Warm cache, then verify the openai SDK can read it back.
    for i in range(3):
        raw = await direct.responses.with_raw_response.create(
            model="gpt-4o-mini",
            instructions=WARMUP_INSTRUCTIONS,
            input="hi",
            max_output_tokens=20,
        )
        resp = raw.parse()
        cached = (resp.usage.input_tokens_details.cached_tokens
                  if resp.usage and resp.usage.input_tokens_details else 0)
        print(f"direct call {i+1}: input={resp.usage.input_tokens} cached={cached}")

    # Now send the same body through LiteLLM's /v1/messages → /v1/responses bridge.
    # (Start `litellm --config <yaml>` with an openai-* model declared as
    # `mode: "responses"`, and point LITELLM_PROXY_URL at the proxy.)
    from anthropic import AsyncAnthropic
    client = AsyncAnthropic(
        base_url=os.environ["LITELLM_PROXY_URL"],
        api_key=os.environ["LITELLM_MASTER_KEY"],
    )
    for i in range(3):
        msg = await client.messages.create(
            model="openai-gpt-4o-mini",
            system=WARMUP_INSTRUCTIONS,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=20,
        )
        # cache_read_input_tokens is always 0 here — even when the underlying
        # OpenAI call reports cached_tokens > 0 (verifiable via wire capture).
        print(f"litellm call {i+1}: input={msg.usage.input_tokens} cache_read={msg.usage.cache_read_input_tokens}")

asyncio.run(main())

Expected output (with the fix):

direct call 1: input=1234 cached=0       # cold
direct call 2: input=1234 cached=1024    # cache hit
direct call 3: input=1234 cached=1024
litellm call 1: input=1234 cache_read=0
litellm call 2: input=1234 cache_read=1024
litellm call 3: input=1234 cache_read=1024

Actual output (current main):

direct call 1: input=1234 cached=0
direct call 2: input=1234 cached=1024
direct call 3: input=1234 cached=1024
litellm call 1: input=1234 cache_read=0
litellm call 2: input=1234 cache_read=0   # ← bug
litellm call 3: input=1234 cache_read=0   # ← bug

We verified this end-to-end by capturing the LiteLLM proxy's outbound TLS traffic with Wireshark + SSLKEYLOGFILE, decoding the OpenAI response bodies, and confirming OpenAI returns usage.input_tokens_details.cached_tokens > 0 while LiteLLM's translated Anthropic SSE shows cache_read_input_tokens: 0.
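For anyone repeating that check, a small parser saves reading the decoded SSE by hand. This is a minimal sketch: the SAMPLE_SSE text is an illustrative stand-in for a capture rather than real proxy output, and cache_reads_from_sse is a hypothetical helper name. It assumes usage appears either at the top level of an event (as in message_delta) or nested under "message" (as in message_start).

```python
import json

# Illustrative Anthropic-style SSE text; a real capture would come from the proxy.
SAMPLE_SSE = """\
event: message_start
data: {"type": "message_start", "message": {"usage": {"input_tokens": 1234, "cache_read_input_tokens": 0}}}

event: message_delta
data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 20, "cache_read_input_tokens": 0}}

event: message_stop
data: {"type": "message_stop"}
"""


def cache_reads_from_sse(sse_text: str) -> list[tuple[str, int]]:
    """Return (event type, cache_read_input_tokens) for every event carrying usage."""
    found = []
    for line in sse_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        payload = json.loads(line[len("data:"):].strip())
        # message_start nests usage under "message"; message_delta has it at top level.
        usage = payload.get("usage") or payload.get("message", {}).get("usage")
        if usage is not None:
            found.append((payload["type"], int(usage.get("cache_read_input_tokens") or 0)))
    return found


print(cache_reads_from_sse(SAMPLE_SSE))
```

If every tuple reports 0 while the decoded OpenAI response body for the same request shows cached_tokens > 0, the loss is happening in translation rather than upstream.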

Proposed Fix

In AnthropicResponsesStreamWrapper._process_event (responses_adapters/streaming_iterator.py:259-270):

if usage is not None:
    input_tokens = getattr(usage, "input_tokens", 0) or 0
    output_tokens = getattr(usage, "output_tokens", 0) or 0
    # OpenAI Responses API exposes the cache count at:
    #   usage.input_tokens_details.cached_tokens
    # NOT at usage.cache_read_input_tokens (Anthropic-only name).
    details = getattr(usage, "input_tokens_details", None)
    cache_read_tokens = 0
    if details is not None:
        cache_read_tokens = int(
            getattr(details, "cached_tokens", None)
            or (details.get("cached_tokens") if isinstance(details, dict) else 0)
            or 0
        )
    # Cache-creation isn't a Responses-API concept; keep it 0 for now
    # (Anthropic-side `cache_creation_input_tokens` has no OpenAI equivalent).
    cache_creation_tokens = 0

Lines 262–263 (cache_creation_tokens = getattr(usage, "input_tokens_details", None) followed by cache_read_tokens = getattr(usage, "output_tokens_details", None)) also appear to be dead code: each assigns a details object to a token-count variable that is overwritten two lines later, so they could be removed as part of the same cleanup.
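A regression test for the change could look roughly like the following. This is a sketch, not the adapter's actual code: read_cached_tokens is a hypothetical standalone helper that mirrors the proposed lookup, and the usage objects are SimpleNamespace stand-ins covering the shapes the lookup is meant to tolerate (object details, dict details, missing details, and a None count).

```python
from types import SimpleNamespace


def read_cached_tokens(usage) -> int:
    """Hypothetical helper mirroring the proposed lookup."""
    details = getattr(usage, "input_tokens_details", None)
    if details is None:
        return 0
    if isinstance(details, dict):
        return int(details.get("cached_tokens") or 0)
    return int(getattr(details, "cached_tokens", 0) or 0)


# Each case is a stand-in usage object; values are illustrative.
cases = {
    "object details": SimpleNamespace(
        input_tokens_details=SimpleNamespace(cached_tokens=1024)
    ),
    "dict details": SimpleNamespace(input_tokens_details={"cached_tokens": 1024}),
    "details missing": SimpleNamespace(input_tokens=1234),
    "cached_tokens is None": SimpleNamespace(
        input_tokens_details=SimpleNamespace(cached_tokens=None)
    ),
}

results = {name: read_cached_tokens(usage) for name, usage in cases.items()}
print(results)
```

Branching on the dict case first keeps the two shapes separate, which is slightly easier to read than chaining the lookups with `or`, though both approaches should give the same result for these inputs.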

Environment

  • litellm: 1.85.0 (also verified bug present on main as of 2026-05-20)
  • openai SDK: 2.30.0
  • httpx: 0.28.1
  • Python: 3.12.9
