Root Cause

For Anthropic responses with extended thinking, the Message returned from the litellm disk cache is not byte-equivalent to the same Message returned live. The cached version has provider_specific_fields.reasoning_content populated; the live one doesn't. Any standard multi-turn loop that appends the assistant message back into messages and re-calls litellm.completion will cache-miss as soon as one earlier turn was a cache hit, because the disk cache key hashes the full messages structure (including provider_specific_fields).

Fix Action

Fix / Workaround

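Workaround: before appending a returned assistant message to the message history, copy its top-level reasoning_content into provider_specific_fields["reasoning_content"], so that live messages take the same shape as cached ones. Alternatively, copy each cached Message and remove reasoning_content from provider_specific_fields before reusing it, so that cached messages take the live shape. Either direction works as long as live and cached messages end up identical.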
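The second workaround can be sketched as a small normalizer applied to every assistant message before it goes back into the history. This is a minimal sketch, not litellm code: strip_reasoning_duplicate is a hypothetical helper name, and the live and cached dicts are hand-written stand-ins for the output of message.model_dump(), not real API responses.

```python
import copy


def strip_reasoning_duplicate(message: dict) -> dict:
    """Return a copy of an assistant message dict without the duplicate
    provider_specific_fields["reasoning_content"] entry (hypothetical helper)."""
    normalized = copy.deepcopy(message)  # leave the caller's message untouched
    fields = normalized.get("provider_specific_fields")
    if isinstance(fields, dict):
        fields.pop("reasoning_content", None)
    return normalized


# Hand-written stand-ins for the two shapes described in the report.
live = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "2 + 2 = 4",
    "provider_specific_fields": {"citations": None, "thinking_blocks": []},
}
cached = copy.deepcopy(live)
cached["provider_specific_fields"]["reasoning_content"] = "2 + 2 = 4"

# After normalization, both shapes compare equal.
print(strip_reasoning_duplicate(live) == strip_reasoning_duplicate(cached))
```

In a multi-turn loop this would be called as messages.append(strip_reasoning_duplicate(r.choices[0].message.model_dump())), so live and cached turns contribute the same structure to the next request. The top-level reasoning_content is left in place.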

PR fix notes

PR #27364: fix(cache): align anthropic reasoning_content with live response

Repository: BerriAI/litellm
Author: nehaaprasaad
State: open | merged: False
Link: https://github.com/BerriAI/litellm/pull/27364

Description (problem / solution / changelog)

Fixes #27337

Type

🐛 Bug Fix

Changes

Live Anthropic responses set reasoning_content only on Message.reasoning_content; cache replay was also writing it to provider_specific_fields, which broke disk cache keys on multi-turn calls.
Replay no longer adds reasoning_content to provider_specific_fields; any stale duplicate is stripped when top-level reasoning is present.
Strengthened the existing thinking-content test and added a regression test for the duplicate-stripping case.

Changed files

litellm/litellm_core_utils/llm_response_utils/convert_dict_to_response.py (modified, +2/-3)
tests/llm_translation/test_llm_response_utils/test_convert_dict_to_chat_completion.py (modified, +45/-1)

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

Details

reasoning_content is set into provider_specific_fields only on the cache-replay path:

Live: transformation.py::_build_provider_specific_fields sets citations and thinking_blocks only. reasoning_content is set as a top-level Message.reasoning_content attribute and never added to provider_specific_fields.
Replay: convert_dict_to_response.py:599 does set provider_specific_fields["reasoning_content"] when reconstructing the Message.

The replay-side line was added for DeepSeek in PR #8288 (Feb 7, 2025), back when reasoning genuinely was provider-specific (no top-level Message.reasoning_content attribute existed yet). When Anthropic thinking landed in PR #8843 (Feb 26, 2025), the new transform set reasoning_content only at the top level, never mirroring it into provider_specific_fields, but cache replay continued running through the older code path that does.

Fundamentally, the cache is miss-prone because its key is not derived from what is actually sent in the outgoing request. factory.py reads only specific keys out of provider_specific_fields (compaction blocks, web search results, tool results, signatures), and reasoning_content is not among them, so the asymmetry does not change what is sent to api.anthropic.com. The disk cache key, however, is built in caching.py:300 as cache_key += f"{param}: {str(param_value)}", so when param_value is a list[Message] it serializes the full Pydantic repr regardless of what factory.py later reads from it. As a result, both messages.append(msg) and messages.append(msg.model_dump()) bake the asymmetric field into the cache key.
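To see why an extra key inside provider_specific_fields is enough to change the key, here is a toy illustration of the string-concatenation pattern quoted above. It is a sketch under stated assumptions: toy_cache_key is a hypothetical stand-in, not litellm's actual key function, the SHA-256 step is only there to produce a compact key, and the messages are hand-written dicts.

```python
import hashlib


def toy_cache_key(**params) -> str:
    """Toy stand-in for the quoted pattern: concatenate "param: str(value)"
    for every parameter, then hash the result (hashing is an assumption)."""
    raw = ""
    for param, param_value in sorted(params.items()):
        raw += f"{param}: {str(param_value)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


user_turn = {"role": "user", "content": "What is 2 + 2? Think briefly."}
follow_up = {"role": "user", "content": "Now multiply that by 5."}

# Assistant message as produced live: no reasoning_content in the nested dict.
live_assistant = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "2 + 2 = 4",
    "provider_specific_fields": {"citations": None, "thinking_blocks": []},
}
# Same message as replayed from cache: the nested duplicate is present.
cached_assistant = {
    **live_assistant,
    "provider_specific_fields": {
        "citations": None,
        "reasoning_content": "2 + 2 = 4",
        "thinking_blocks": [],
    },
}

live_key = toy_cache_key(model="m", messages=[user_turn, live_assistant, follow_up])
cached_key = toy_cache_key(model="m", messages=[user_turn, cached_assistant, follow_up])

# The stringified messages differ, so the keys differ.
print(live_key == cached_key)
```

The two histories would produce the same outgoing request under the behavior described above, yet their keys differ because the whole structure is stringified.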

Steps to Reproduce

import tempfile
import time

import litellm
from litellm.types.caching import LiteLLMCacheType

litellm.enable_cache(type=LiteLLMCacheType.DISK, disk_cache_dir=tempfile.mkdtemp())

MODEL = "anthropic/claude-sonnet-4-6"
THINKING = {"thinking": {"type": "enabled", "budget_tokens": 4000}}

def two_turn(label):
    messages = [{"role": "user", "content": "What is 2 + 2? Think briefly."}]
    timings = []

    t = time.perf_counter()
    r = litellm.completion(model=MODEL, messages=messages, max_tokens=8000, **THINKING)
    timings.append(time.perf_counter() - t)
    messages.append(r.choices[0].message.model_dump())

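    # Workaround: uncomment the next line to mirror reasoning_content into
    # provider_specific_fields, so live messages match the shape of cached ones.
    # messages[-1]['provider_specific_fields']['reasoning_content'] = messages[-1]['reasoning_content']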

    messages.append({"role": "user", "content": "Now multiply that by 5."})
    t = time.perf_counter()
    r = litellm.completion(model=MODEL, messages=messages, max_tokens=8000, **THINKING)
    timings.append(time.perf_counter() - t)

    print(f"[{label}] " + ", ".join(f"{x:.2f}s" for x in timings))

two_turn("live")      # e.g. 0.9s, 1.2s
two_turn("cached")    # e.g. 0.0s, 1.2s   <-- turn 2 should also be ~0s

Output:

[live] 0.90s, 1.22s
[cached] 0.00s, 1.18s

The diff between the assistant message produced by the cold call (live) and the same call repeated against the populated cache:

   "provider_specific_fields": {
     "citations": null,
+    "reasoning_content": "<joined thinking text>",
     "thinking_blocks": [...]
   }
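A diff like the one above can be produced with the standard library alone. The following is a minimal sketch: message_diff is a hypothetical helper, and the two dicts are hand-written placeholders standing in for message.model_dump() output from the cold call and from the cache hit.

```python
import difflib
import json


def message_diff(live: dict, cached: dict) -> str:
    """Unified diff of two message dicts, serialized as sorted, indented JSON."""
    # default=str keeps the dump from failing on non-JSON values.
    a = json.dumps(live, indent=2, sort_keys=True, default=str).splitlines()
    b = json.dumps(cached, indent=2, sort_keys=True, default=str).splitlines()
    return "\n".join(
        difflib.unified_diff(a, b, fromfile="live", tofile="cached", lineterm="")
    )


# Placeholder messages; real ones would come from message.model_dump().
live = {
    "role": "assistant",
    "content": "4",
    "reasoning_content": "<joined thinking text>",
    "provider_specific_fields": {"citations": None, "thinking_blocks": []},
}
cached = {
    **live,
    "provider_specific_fields": {
        "citations": None,
        "reasoning_content": "<joined thinking text>",
        "thinking_blocks": [],
    },
}

diff = message_diff(live, cached)
print(diff)
```

Sorting the keys before dumping keeps the diff limited to real content differences rather than key ordering.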

Relevant log output

What part of LiteLLM is this about?

SDK (litellm Python package)

What LiteLLM version are you on ?

v1.83.13

Twitter / LinkedIn details

No response


litellm - ✅(Solved) Fix [Bug]: cache misses due to asymmetry between Anthropic disk cache formatting and live formatting [1 pull requests, 1 comments, 2 participants]
