litellm - ✅(Solved) Fix [Bug]: cache misses due to asymmetry between Anthropic disk cache formatting and live formatting [1 pull request, 1 comment, 2 participants]

GitHub stats
BerriAI/litellm#27337 (fetched 2026-05-07 03:33:04)
Comments: 1 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): labeled ×3, cross-referenced ×2, commented ×1, mentioned ×1

Root Cause

For Anthropic responses with extended thinking, the Message returned from the litellm disk cache is not byte-equivalent to the same Message returned live: the cached version has provider_specific_fields.reasoning_content populated, while the live one does not. Any standard multi-turn loop that appends the assistant message to messages and calls litellm.completion again will miss the cache on the next turn whenever an earlier turn was served from the cache, because the disk cache key hashes the full messages structure (including provider_specific_fields).
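A minimal sketch of the failure mode, assuming the key is formed by rendering each parameter with str(), concatenating, and hashing, as the issue describes for caching.py. sketch_cache_key is a hypothetical helper (the sorted parameter order and the SHA-256 step are assumptions of this sketch, not confirmed litellm behavior), and the message dicts are simplified stand-ins for model_dump() output.

```python
import hashlib


def sketch_cache_key(**params) -> str:
    """Hypothetical stand-in for the disk cache key: str() each param, concatenate, hash."""
    raw = ""
    for name in sorted(params):  # sorted only to make this sketch deterministic
        raw += f"{name}: {str(params[name])}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


first = {"role": "user", "content": "What is 2 + 2? Think briefly."}
follow_up = {"role": "user", "content": "Now multiply that by 5."}

# Assistant turn as returned live: reasoning only at the top level.
live_assistant = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "2 + 2 = 4",
    "provider_specific_fields": {"citations": None, "thinking_blocks": []},
}

# Same turn as replayed from cache: reasoning duplicated into provider_specific_fields.
cached_assistant = {
    **live_assistant,
    "provider_specific_fields": {
        **live_assistant["provider_specific_fields"],
        "reasoning_content": "2 + 2 = 4",
    },
}

model = "anthropic/claude-sonnet-4-6"
key_live = sketch_cache_key(model=model, messages=[first, live_assistant, follow_up])
key_cached = sketch_cache_key(model=model, messages=[first, cached_assistant, follow_up])
print(key_live == key_cached)  # False: the turn-2 lookup cannot find the live entry
```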

Fix Action

Fix / Workaround

Workaround: copy reasoning_content into provider_specific_fields before appending a returned message to the message history, so live and cached messages serialize identically. Alternatively, copy each cached Message and remove reasoning_content from its provider_specific_fields before reusing it.
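Both workarounds can be written as small normalizers applied to every assistant message (live or cached) before it is appended. This is a sketch over plain dicts such as message.model_dump() output; mirror_reasoning and strip_reasoning are hypothetical helper names, not litellm APIs. The comparison uses str() because a str()-based key is also sensitive to key order inside provider_specific_fields.

```python
import copy


def mirror_reasoning(message: dict) -> dict:
    """Workaround 1: copy top-level reasoning_content into provider_specific_fields."""
    msg = copy.deepcopy(message)
    reasoning = msg.get("reasoning_content")
    if reasoning is not None:
        psf = msg.get("provider_specific_fields") or {}
        psf["reasoning_content"] = reasoning
        msg["provider_specific_fields"] = psf
    return msg


def strip_reasoning(message: dict) -> dict:
    """Workaround 2: drop the replay-only duplicate from provider_specific_fields."""
    msg = copy.deepcopy(message)
    psf = msg.get("provider_specific_fields")
    if isinstance(psf, dict):
        psf.pop("reasoning_content", None)
    return msg


live = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "2 + 2 = 4",
    "provider_specific_fields": {"citations": None, "thinking_blocks": []},
}
cached = copy.deepcopy(live)
cached["provider_specific_fields"]["reasoning_content"] = "2 + 2 = 4"

print(str(live) == str(cached))                                      # False
print(str(mirror_reasoning(live)) == str(mirror_reasoning(cached)))  # True
print(str(strip_reasoning(live)) == str(strip_reasoning(cached)))    # True
```

Whichever normalizer you pick, apply it to every assistant message; mixing the two reintroduces the asymmetry. Stripping has the advantage of not depending on where the replay path inserts the extra key.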

PR fix notes

PR #27364: fix(cache): align anthropic reasoning_content with live response

Description (problem / solution / changelog)

Fixes #27337

Type

🐛 Bug Fix

Changes

  • Live Anthropic responses set reasoning_content only on Message.reasoning_content; cache replay was also writing it to provider_specific_fields, breaking disk cache keys on multi-turn calls.
  • Replay no longer adds reasoning_content to provider_specific_fields; any stale duplicate is stripped when top-level reasoning is present.
  • Strengthened existing thinking-content test and added a regression test for the duplicate-stripping case.
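The replay-side rule described in the changes list can be sketched as a standalone function over a plain message dict. normalize_replayed_message is a hypothetical name; this illustrates the described behavior and is not the code PR #27364 adds to convert_dict_to_response.py.

```python
def normalize_replayed_message(message: dict) -> dict:
    """Hypothetical replay-side rule: keep reasoning only at the top level.

    If top-level reasoning_content is present, drop any stale duplicate from
    provider_specific_fields; otherwise leave the message untouched.
    """
    msg = dict(message)
    psf = msg.get("provider_specific_fields")
    if msg.get("reasoning_content") is not None and isinstance(psf, dict):
        # Build a new dict so the caller's message is not mutated.
        msg["provider_specific_fields"] = {
            k: v for k, v in psf.items() if k != "reasoning_content"
        }
    return msg


# Regression-style check for the duplicate-stripping case.
stale = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "2 + 2 = 4",
    "provider_specific_fields": {"citations": None, "reasoning_content": "2 + 2 = 4"},
}
fixed = normalize_replayed_message(stale)
assert "reasoning_content" not in fixed["provider_specific_fields"]
assert fixed["reasoning_content"] == "2 + 2 = 4"

# Without top-level reasoning, nothing is removed.
no_top_level = {"role": "assistant", "content": "4",
                "provider_specific_fields": {"reasoning_content": "2 + 2 = 4"}}
assert normalize_replayed_message(no_top_level) == no_top_level
```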

Changed files

  • litellm/litellm_core_utils/llm_response_utils/convert_dict_to_response.py (modified, +2/-3)
  • tests/llm_translation/test_llm_response_utils/test_convert_dict_to_chat_completion.py (modified, +45/-1)


Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.


Details

reasoning_content is written into provider_specific_fields only on the cache-replay path.

The replay-side line was added for DeepSeek in PR #8288 (Feb 7, 2025), back when reasoning genuinely was provider-specific (no top-level Message.reasoning_content attribute existed yet). When Anthropic thinking landed in PR #8843 (Feb 26, 2025), the new transform set reasoning_content only at the top level, never mirroring it into provider_specific_fields, but cache replay continued running through the older code path that does.

Fundamentally, the cache is miss-prone because its key is not derived from what is actually sent in the outgoing request. factory.py reads only specific keys out of provider_specific_fields (compaction blocks, web search results, tool results, signatures); reasoning_content is not among them, so the asymmetry does not change what is sent to api.anthropic.com.

The disk cache key, however, is built in caching.py:300 as cache_key += f"{param}: {str(param_value)}". When param_value is a list[Message], this dumps the full Pydantic repr, regardless of what factory.py later reads from it. As a result, both messages.append(msg) and messages.append(msg.model_dump()) bake the asymmetric field into the cache key.
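A sketch of this asymmetry under stated assumptions: build_outgoing_payload is a hypothetical stand-in for factory.py that forwards only an allowlist of provider_specific_fields keys (the allowlist here is illustrative, not the set factory.py actually reads), while str_based_key hashes str() of the whole list, the way caching.py:300 is described as doing.

```python
import hashlib

# Hypothetical allowlist; factory.py's real set of forwarded keys is not reproduced here.
FORWARDED_PSF_KEYS = {"thinking_blocks"}


def build_outgoing_payload(messages: list) -> list:
    """Stand-in for the request builder: forwards only allowlisted provider fields."""
    payload = []
    for m in messages:
        psf = m.get("provider_specific_fields") or {}
        payload.append({
            "role": m["role"],
            "content": m["content"],
            **{k: v for k, v in psf.items() if k in FORWARDED_PSF_KEYS},
        })
    return payload


def str_based_key(messages: list) -> str:
    """Key built from str() of the full structure, as the issue describes."""
    return hashlib.sha256(f"messages: {str(messages)}".encode("utf-8")).hexdigest()


live = {"role": "assistant", "content": "4",
        "provider_specific_fields": {"citations": None, "thinking_blocks": []}}
cached = {"role": "assistant", "content": "4",
          "provider_specific_fields": {"citations": None,
                                       "reasoning_content": "2 + 2 = 4",
                                       "thinking_blocks": []}}

print(build_outgoing_payload([live]) == build_outgoing_payload([cached]))  # True
print(str_based_key([live]) == str_based_key([cached]))                    # False
```

The request is identical, yet the key changes. Deriving the key from the normalized outgoing payload would avoid this class of miss; that is a design observation, not current litellm behavior.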

Steps to Reproduce

import litellm, tempfile, time
from litellm.types.caching import LiteLLMCacheType

litellm.enable_cache(type=LiteLLMCacheType.DISK, disk_cache_dir=tempfile.mkdtemp())

MODEL = "anthropic/claude-sonnet-4-6"
THINKING = {"thinking": {"type": "enabled", "budget_tokens": 4000}}

def two_turn(label):
    messages = [{"role": "user", "content": "What is 2 + 2? Think briefly."}]
    timings = []

    t = time.perf_counter()
    r = litellm.completion(model=MODEL, messages=messages, max_tokens=8000, **THINKING)
    timings.append(time.perf_counter() - t)
    messages.append(r.choices[0].message.model_dump())

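    # Workaround: uncommenting the next line copies reasoning_content into provider_specific_fields,
    # so the live message matches the cached one and turn 2 hits the cache.
    # messages[-1]['provider_specific_fields']['reasoning_content'] = messages[-1]['reasoning_content']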

    messages.append({"role": "user", "content": "Now multiply that by 5."})
    t = time.perf_counter()
    r = litellm.completion(model=MODEL, messages=messages, max_tokens=8000, **THINKING)
    timings.append(time.perf_counter() - t)

    print(f"[{label}] " + ", ".join(f"{x:.2f}s" for x in timings))

two_turn("live")      # e.g. 0.9s, 1.2s
two_turn("cached")    # e.g. 0.0s, 1.2s   <-- turn 2 should also be ~0s

Output:

[live] 0.90s, 1.22s
[cached] 0.00s, 1.18s

The diff between the assistant message produced by the cold call (live) and the same call repeated against the populated cache:

   "provider_specific_fields": {
     "citations": null,
+    "reasoning_content": "<joined thinking text>",
     "thinking_blocks": [...]
   }
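The diff above can be produced without eyeballing JSON. A minimal sketch, assuming both messages were captured with message.model_dump() as in the reproduction script; psf_key_drift is a hypothetical helper that compares only which keys are present, not their values.

```python
def psf_key_drift(live: dict, cached: dict) -> dict:
    """List provider_specific_fields keys that appear on only one side."""
    live_keys = set((live.get("provider_specific_fields") or {}).keys())
    cached_keys = set((cached.get("provider_specific_fields") or {}).keys())
    return {
        "only_in_live": sorted(live_keys - cached_keys),
        "only_in_cached": sorted(cached_keys - live_keys),
    }


# Shapes modeled on the diff; real messages would come from message.model_dump().
live = {"role": "assistant", "content": "4",
        "provider_specific_fields": {"citations": None, "thinking_blocks": []}}
cached = {"role": "assistant", "content": "4",
          "provider_specific_fields": {"citations": None,
                                       "reasoning_content": "<joined thinking text>",
                                       "thinking_blocks": []}}

print(psf_key_drift(live, cached))
# {'only_in_live': [], 'only_in_cached': ['reasoning_content']}
```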


What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on?

v1.83.13

