litellm - [Bug]: Prompt caching router affinity TTL is hardcoded to 5 minutes (breaks 1-hour ephemeral cache routing)

Check for existing issues

I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

LiteLLM's prompt-caching-aware routing (optional_pre_call_checks: ["prompt_caching"]) stores the cacheable_prefix_hash → model_id binding with a hardcoded 5-minute TTL. That matched Anthropic's original 5-minute ephemeral cache, but it is too short for the 1-hour cache (cache_control: {"type": "ephemeral", "ttl": "1h"}) now supported by Anthropic and Bedrock.

In a multi-deployment setup (e.g., the same Bedrock model across multiple AWS accounts/regions, or multiple Anthropic API keys), this means:

1. Request at t=0 lands on deployment A with cache_control: {"type": "ephemeral", "ttl": "1h"}. The provider caches the prefix for 60 minutes. LiteLLM stores prefix_hash → A for 5 minutes.
2. Request at t=6 min with the same prefix arrives. LiteLLM's affinity entry has expired, so the router falls back to its normal strategy (e.g., least-busy) and may pick deployment B.
3. Deployment B has no cache for the prefix → provider-side cache miss, full input token cost, even though deployment A is still holding the cache warm for another ~54 minutes.
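
The timeline above can be reproduced with a toy model. This is only a sketch: TTLStore is a hypothetical in-memory stand-in with an explicit clock, not LiteLLM's cache class, and it ignores the fact that a provider may refresh its cache on each hit.

```python
from typing import Dict, Optional, Tuple


class TTLStore:
    """Hypothetical in-memory TTL map with an explicit clock (illustration only)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float, now: float) -> None:
        # Store the value together with its absolute expiry time.
        self._entries[key] = (value, now + ttl)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None  # missing or expired
        return entry[0]


ROUTER_AFFINITY_TTL = 300   # the hardcoded value described in this issue
PROVIDER_CACHE_TTL = 3600   # "ttl": "1h" on the cache_control block

affinity = TTLStore()           # stands in for the router's prefix_hash -> model_id map
provider_cache_a = TTLStore()   # stands in for deployment A's provider-side cache

# t=0: the request lands on deployment A and both entries are written.
affinity.set("prefix_hash", "deployment-A", ROUTER_AFFINITY_TTL, now=0)
provider_cache_a.set("prefix_hash", "cached-prefix", PROVIDER_CACHE_TTL, now=0)

# t=6 min: the router has forgotten A, but A's provider cache is still warm.
t = 6 * 60
router_pick = affinity.get("prefix_hash", now=t)
provider_still_warm = provider_cache_a.get("prefix_hash", now=t) is not None
print(router_pick, provider_still_warm)
```

With router_pick empty at t=6 min, the router has nothing tying the prefix to deployment A, which is the gap described in step 2.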

Expected: the router should keep the affinity for as long as the provider would honor the cache. If the cacheable prefix carries cache_control.ttl == "1h", the affinity entry should live for ~3600 seconds (or be configurable).

Actual: the TTL is the literal 300 in two places, with no override.

litellm/router_utils/prompt_caching_cache.py:182-219:

def add_model_id(self, model_id, messages, tools) -> None:
    ...
    self.cache.set_cache(
        cache_key, PromptCachingCacheValue(model_id=model_id), ttl=300
    )

async def async_add_model_id(self, model_id, messages, tools) -> None:
    ...
    await self.cache.async_set_cache(
        cache_key,
        PromptCachingCacheValue(model_id=model_id),
        ttl=300,  # store for 5 minutes
    )

There is no router setting that overrides this, and extract_cacheable_prefix does not look at the ttl sub-field of cache_control blocks. Verified on latest main (commit 79b4578671).

Suggested fix: thread an effective TTL through add_model_id / async_add_model_id. Default to 300; if any cache_control block in the cacheable prefix declares "ttl": "1h", use 3600. Optionally expose a prompt_caching_affinity_ttl_seconds router setting for users who want to tune it explicitly (matching the existing deployment_affinity_ttl_seconds knob).
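
A minimal sketch of how the effective TTL could be resolved, under stated assumptions. resolve_affinity_ttl and _iter_cache_controls are hypothetical helpers, not existing LiteLLM functions, and override_seconds stands in for the proposed prompt_caching_affinity_ttl_seconds setting, which does not exist today. For simplicity the sketch scans every message and tool rather than only the cacheable prefix, and it assumes cache_control appears either on a message, on a content block, or at the top level of a tool definition.

```python
from typing import Any, Dict, Iterator, List, Optional

DEFAULT_AFFINITY_TTL_SECONDS = 300     # the value currently hardcoded
ONE_HOUR_AFFINITY_TTL_SECONDS = 3600   # matches cache_control "ttl": "1h"


def _iter_cache_controls(
    messages: Optional[List[Dict[str, Any]]],
    tools: Optional[List[Dict[str, Any]]],
) -> Iterator[Dict[str, Any]]:
    """Yield every cache_control dict found on messages, content blocks, or tools."""
    for message in messages or []:
        if isinstance(message.get("cache_control"), dict):
            yield message["cache_control"]
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and isinstance(block.get("cache_control"), dict):
                    yield block["cache_control"]
    for tool in tools or []:
        if isinstance(tool, dict) and isinstance(tool.get("cache_control"), dict):
            yield tool["cache_control"]


def resolve_affinity_ttl(
    messages: Optional[List[Dict[str, Any]]],
    tools: Optional[List[Dict[str, Any]]] = None,
    override_seconds: Optional[int] = None,
) -> int:
    """Pick the affinity TTL: explicit override first, then "1h" blocks, else 300."""
    if override_seconds is not None:
        return override_seconds
    for cache_control in _iter_cache_controls(messages, tools):
        if cache_control.get("ttl") == "1h":
            return ONE_HOUR_AFFINITY_TTL_SECONDS
    return DEFAULT_AFFINITY_TTL_SECONDS
```

add_model_id and async_add_model_id could then pass ttl=resolve_affinity_ttl(messages, tools) in place of the literal 300, so requests without a "1h" block keep today's behavior.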

Steps to Reproduce

Create a proxy_config.yaml with two deployments of the same model and prompt-caching routing enabled:

router_settings:
  optional_pre_call_checks: ["prompt_caching"]

model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
      aws_profile_name: account-1
  - model_name: claude-sonnet
    litellm_params:
      model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
      aws_profile_name: account-2

Send a request whose system prompt is marked with the 1-hour ephemeral cache:

curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet",
    "messages": [
      {"role": "system",
       "content": [{"type": "text",
                    "text": "<long system prompt ≥1024 tokens>",
                    "cache_control": {"type": "ephemeral", "ttl": "1h"}}]},
      {"role": "user", "content": "hi"}
    ]
  }'

Wait more than 5 minutes (e.g., 10 minutes), then send the same request again. In the logs or usage data, the second request can route to the other AWS account and the provider response reports no cache_read_input_tokens, even though the original account still holds the prefix in cache for the remainder of the 1-hour window.
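
For scripting the two requests, a rough Python equivalent of the curl call might look like the following. The helper names are made up for illustration, the URL and model name are taken from the steps above, and reading cache_read_input_tokens from the top-level usage object of the proxy response is an assumption that may need adjusting for a given LiteLLM version.

```python
import json
import urllib.request
from typing import Any, Dict


def build_payload(system_prompt: str) -> Dict[str, Any]:
    """Same request body as the curl example, with the 1-hour cache marker."""
    return {
        "model": "claude-sonnet",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": system_prompt,
                        "cache_control": {"type": "ephemeral", "ttl": "1h"},
                    }
                ],
            },
            {"role": "user", "content": "hi"},
        ],
    }


def send(payload: Dict[str, Any], api_key: str,
         base_url: str = "http://localhost:4000") -> Dict[str, Any]:
    """POST the payload to the proxy and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def cache_read_tokens(response_body: Dict[str, Any]) -> int:
    """Return cache_read_input_tokens from usage, or 0 if absent (assumed location)."""
    usage = response_body.get("usage") or {}
    return int(usage.get("cache_read_input_tokens") or 0)
```

Calling send(build_payload(prompt), api_key) once, waiting past the 5-minute mark, and calling it again should show cache_read_tokens(...) returning 0 on the second response whenever the router picked the other deployment.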

Relevant log output

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

main

Twitter / LinkedIn details

No response
