litellm - [Bug]: Prompt caching router affinity TTL is hardcoded to 5 minutes; breaks 1-hour ephemeral cache routing



Check for existing issues

  • I have searched the existing issues and checked that my issue is not a duplicate.

What happened?

LiteLLM's prompt-caching-aware routing (optional_pre_call_checks: ["prompt_caching"]) stores the cacheable_prefix_hash → model_id binding with a hardcoded 5-minute TTL. This was correct when Anthropic only offered the 5-minute ephemeral cache, but it is too short for the 1-hour cache (cache_control: {"type": "ephemeral", "ttl": "1h"}) supported by Anthropic and Bedrock.

In a multi-deployment setup (e.g., the same Bedrock model across multiple AWS accounts/regions, or multiple Anthropic API keys), this means:

  1. Request at t=0 lands on deployment A with cache_control: {"type": "ephemeral", "ttl": "1h"}. The provider caches the prefix for 60 minutes. LiteLLM stores prefix_hash → A for 5 minutes.
  2. Request at t=6 min with the same prefix arrives. LiteLLM's affinity entry has expired, so the router falls back to its normal strategy (e.g., least-busy) and may pick deployment B.
  3. Deployment B has no cache for the prefix → provider-side cache miss, full input token cost, even though deployment A is still holding the cache warm for another ~54 minutes.
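
The timeline above can be sketched with a toy in-memory TTL store. This is only an illustration of the mismatch between the two lifetimes: ToyTTLStore and the "prefix_hash" key are hypothetical names, not LiteLLM's cache classes, and time is passed in explicitly instead of being read from a clock.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ToyTTLStore:
    """Hypothetical stand-in for a TTL cache; not LiteLLM's cache class."""

    _data: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def set(self, key: str, value: str, ttl: float, now: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, now + ttl)

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or now >= entry[1]:
            return None  # missing or expired
        return entry[0]


PROVIDER_CACHE_TTL = 3600  # provider keeps a "1h" prefix for 60 minutes
AFFINITY_TTL = 300  # router affinity entry, hardcoded per the issue

affinity = ToyTTLStore()
provider_cache = ToyTTLStore()

# t=0: the request lands on deployment A; both entries are created.
affinity.set("prefix_hash", "A", ttl=AFFINITY_TTL, now=0)
provider_cache.set("prefix_hash", "A", ttl=PROVIDER_CACHE_TTL, now=0)

# t=6 min: the affinity entry has expired, the provider cache has not.
t = 6 * 60
pinned = affinity.get("prefix_hash", now=t)
warm_on = provider_cache.get("prefix_hash", now=t)
print(pinned, warm_on)  # None A
```

With no pinned deployment, the router is free to pick B even though A still holds the prefix.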

Expected: the router should keep the affinity for as long as the provider would honor the cache. If the cacheable prefix carries cache_control.ttl == "1h", the affinity entry should live for ~3600 seconds (or be configurable).

Actual: the TTL is the literal 300 in two places, with no override.

litellm/router_utils/prompt_caching_cache.py:182-219:

def add_model_id(self, model_id, messages, tools) -> None:
    ...
    self.cache.set_cache(
        cache_key, PromptCachingCacheValue(model_id=model_id), ttl=300
    )

async def async_add_model_id(self, model_id, messages, tools) -> None:
    ...
    await self.cache.async_set_cache(
        cache_key,
        PromptCachingCacheValue(model_id=model_id),
        ttl=300,  # store for 5 minutes
    )

There is no router setting that overrides this, and extract_cacheable_prefix does not look at the ttl sub-field of cache_control blocks. Verified on latest main (commit 79b4578671).

Suggested fix: thread an effective TTL through add_model_id / async_add_model_id. Default to 300; if any cache_control block in the cacheable prefix declares "ttl": "1h", use 3600. Optionally expose a prompt_caching_affinity_ttl_seconds router setting for users who want to tune it explicitly (matching the existing deployment_affinity_ttl_seconds knob).
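
A minimal sketch of the TTL selection described in the suggested fix, assuming OpenAI-style message dicts like the one in the reproduction request. resolve_affinity_ttl, _iter_cache_controls, and the override_seconds parameter are hypothetical names for illustration, not existing LiteLLM functions. For simplicity it scans every message and tool instead of only the cacheable prefix, and it assumes a tool carries cache_control as a top-level key.

```python
from typing import Any, Dict, Iterator, List, Optional

DEFAULT_AFFINITY_TTL_SECONDS = 300  # the currently hardcoded value
ONE_HOUR_TTL_SECONDS = 3600


def _iter_cache_controls(
    messages: List[Dict[str, Any]], tools: Optional[List[Dict[str, Any]]]
) -> Iterator[Dict[str, Any]]:
    """Yield each cache_control dict on messages, content blocks, and tools."""
    for item in list(messages) + list(tools or []):
        if isinstance(item.get("cache_control"), dict):
            yield item["cache_control"]
        content = item.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and isinstance(
                    block.get("cache_control"), dict
                ):
                    yield block["cache_control"]


def resolve_affinity_ttl(
    messages: List[Dict[str, Any]],
    tools: Optional[List[Dict[str, Any]]] = None,
    override_seconds: Optional[int] = None,
) -> int:
    """Pick the affinity TTL: explicit override, else 3600 for "1h", else 300."""
    if override_seconds is not None:
        return override_seconds
    for cache_control in _iter_cache_controls(messages, tools):
        if cache_control.get("ttl") == "1h":
            return ONE_HOUR_TTL_SECONDS
    return DEFAULT_AFFINITY_TTL_SECONDS


messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "<long system prompt>",
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            }
        ],
    },
    {"role": "user", "content": "hi"},
]
print(resolve_affinity_ttl(messages))  # 3600
```

Under this sketch, the returned value would replace the literal 300 passed as ttl= in add_model_id and async_add_model_id, and override_seconds would be fed from a setting such as the proposed prompt_caching_affinity_ttl_seconds.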

Steps to Reproduce

  1. Create a proxy_config.yaml with two deployments of the same model and prompt-caching routing enabled:

    router_settings:
      optional_pre_call_checks: ["prompt_caching"]
    
    model_list:
      - model_name: claude-sonnet
        litellm_params:
          model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
          aws_profile_name: account-1
      - model_name: claude-sonnet
        litellm_params:
          model: us.anthropic.claude-sonnet-4-5-20250929-v1:0
          aws_profile_name: account-2
  2. Send a request whose system prompt is marked with the 1-hour ephemeral cache:

    curl -s http://localhost:4000/v1/chat/completions \
      -H "Authorization: Bearer $LITELLM_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "claude-sonnet",
        "messages": [
          {"role": "system",
           "content": [{"type": "text",
                        "text": "<long system prompt ≥1024 tokens>",
                        "cache_control": {"type": "ephemeral", "ttl": "1h"}}]},
          {"role": "user", "content": "hi"}
        ]
      }'
  3. Wait > 5 minutes (e.g., 10 minutes), then send the same request again. Observe in logs/usage that the request can route to the other AWS account, with no cache_read_input_tokens on the provider response — even though the original account would still have the prefix cached for the remainder of the 1-hour window.
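
To make the check in step 3 less manual, a small helper can read the cache-read count from a parsed response body. This is a sketch under assumptions: cache_read_tokens is a hypothetical helper, and the field locations (usage.cache_read_input_tokens, with usage.prompt_tokens_details.cached_tokens as a fallback) are assumed and should be confirmed against the actual response your proxy returns.

```python
from typing import Any, Dict


def cache_read_tokens(response: Dict[str, Any]) -> int:
    """Return the number of prompt tokens served from the provider cache.

    Field names are assumptions; a missing field is treated as 0.
    """
    usage = response.get("usage") or {}
    direct = usage.get("cache_read_input_tokens")
    if isinstance(direct, int):
        return direct
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens")
    return cached if isinstance(cached, int) else 0


# Illustrative payloads only; the numbers are made up, not captured logs.
hit = {"usage": {"prompt_tokens": 1500, "cache_read_input_tokens": 1400}}
miss = {"usage": {"prompt_tokens": 1500}}

print(cache_read_tokens(hit) > 0)   # True
print(cache_read_tokens(miss) > 0)  # False
```

A second request that returns 0 here while the first request's deployment is still inside its 1-hour window is the symptom described above.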

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on?

main

