litellm - [Bug]: `redis-semantic` cache never produces semantic hits, `_get_cache_key_filter_expression` uses full request hash as RediSearch pre-filter, making KNN unreachable

Semantic Redis caching (cache_params.type: redis-semantic) has been silently broken since the isolation feature was introduced. Every cache lookup returns a miss regardless of semantic similarity between prompts. The root cause is that _get_cache_key_filter_expression returns a RediSearch Tag filter scoped to the full SHA256 request hash. This pre-filter runs before the KNN vector search and always produces an empty candidate set for any prompt that is not a byte-for-byte repeat of a previously cached request. As a result, the KNN similarity search never has a candidate to rank, the distance threshold is never evaluated, and the feature does not work as documented.



What happened?

1. LiteLLM Configuration File

litellm_settings:
  cache: true
  enable_redis_auth_cache: true

  cache_params:
    type: redis-semantic
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    ttl: 3600 # applies to responses
    max_connections: 100
    similarity_threshold: 0.75
    redis_semantic_cache_embedding_model: ollama/mxbai-embed-large:335m

2. Initialization Command

docker compose up
# via Docker Compose, image: ghcr.io/berriai/litellm-database:main-latest

3. LiteLLM Version

  • Current version: 1.85.1
  • Issue first appeared: introduced by the Redis semantic cache isolation feature (May 2025 staging merge)
  • Confirmed present through: 1.86.2

4. Environment Variables

REDIS_HOST=redis-stack
REDIS_PORT=6379
OLLAMA_API_BASE=http://host.docker.internal:11434

5. Server Specifications

Docker container, single instance
OS: Linux (Docker)

6. Database and Redis Usage

  • Database: Yes, PostgreSQL
  • Redis: Yes — Redis Stack Server (latest), RediSearch, Standalone
  • Cache type: redis-semantic

7. Endpoints

/v1/chat/completions

8. Request Example

First request — populates cache:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

Second request — semantically identical, different wording:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Which city is the capital of France?"}]
  }'

Expected: Cache hit. Cosine distance between the two embeddings is ~0.008, well within the configured threshold.

Actual: Cache miss every time. Request forwarded to model. Second entry written to Redis.
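
The expected outcome rests on simple arithmetic, which can be sketched as follows. This is a minimal illustration, not LiteLLM's code: it assumes the configured similarity_threshold is converted to a distance threshold as 1 - similarity_threshold (an assumption about the mapping, not something stated in this report), and it uses toy 3-dimensional vectors in place of real 1024-dimensional embeddings. The helper names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_semantic_hit(distance, similarity_threshold):
    # Assumption: the cache accepts an entry when its distance is at most
    # 1 - similarity_threshold.
    return distance <= 1.0 - similarity_threshold


# Toy stand-ins for the embeddings of the two prompts (not real model output).
stored_vec = [0.60, 0.80, 0.00]
query_vec = [0.62, 0.78, 0.05]

toy_distance = cosine_distance(stored_vec, query_vec)
print(toy_distance, is_semantic_hit(toy_distance, 0.75))

# The distance reported in this issue (~0.008) against the configured 0.75.
print(is_semantic_hit(0.008, 0.75))
```

Under that assumption, a distance of 0.008 is far inside the accepted range of 0.25, so a miss cannot be explained by the threshold.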

9. Error Logs

No errors are raised. The failure is silent — the cache appears to function but never produces a semantic hit. With --detailed_debug you can observe the embedding call succeeding and the Redis lookup returning empty on every request.


Environment

Using docker.litellm.ai/berriai/litellm:main-stable image
litellm version: 1.85.1
Redis: Redis Stack server
Python: 3.13
Deployment: Docker / Docker Compose
Embedding model: ollama/mxbai-embed-large:335m (1024-dim)

Configuration

litellm_settings:
  cache: true
  cache_params:
    type: redis-semantic
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    ttl: 3600
    similarity_threshold: 0.7
    redis_semantic_cache_embedding_model: ollama/mxbai-embed-large:335m

Root Cause Analysis

The broken code path

File: litellm/caching/redis_semantic_cache.py

On every cache read, async_get_cache calls _get_cache_key_filter_expression(key) where key is the SHA256 hash of model + messages + temperature + all_other_params.

# Current implementation
def _get_cache_key_filter_expression(self, key: str) -> Optional[Tag]:
    return Tag("litellm_cache_key") == key

This filter is then passed directly to redisvl's acheck():

results = await self.semantic_cache.acheck(
    vector=query_vector,
    filter_expression=Tag("litellm_cache_key") == key  # ← the problem
)

What RediSearch actually executes

When filter_expression is set, redisvl constructs a hybrid query:

FT.SEARCH litellm_semantic_cache_index
  "(@litellm_cache_key:{<SHA256_of_current_request>})"
  => [KNN 3 @vector $query_vec AS vector_distance]
  PARAMS 2 query_vec <blob>
  SORTBY vector_distance
  DIALECT 2

In RediSearch hybrid queries, the pre-filter runs first and produces the candidate set for KNN. Since every unique prompt produces a unique SHA256 hash, and the stored litellm_cache_key is the hash of the original request, the pre-filter returns zero documents for any non-identical prompt. KNN is then executed against an empty candidate set and returns nothing.

The distance threshold, the cosine similarity calculation, and the stored vectors are never evaluated.
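
The effect of filtering before ranking can be reproduced without Redis. The sketch below is a toy in-memory stand-in for a filtered vector search, not RediSearch or redisvl code; the function filtered_knn, the document layout, and the placeholder hash strings are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def filtered_knn(docs, query_vec, k=3, tag_filter=None):
    """Apply the tag filter first, then rank the survivors by distance."""
    candidates = [
        doc for doc in docs
        if tag_filter is None or doc["tags"].get(tag_filter[0]) == tag_filter[1]
    ]
    ranked = sorted(candidates, key=lambda doc: cosine_distance(doc["vector"], query_vec))
    return ranked[:k]


# One cached entry, written by the first request.
index = [
    {
        "tags": {"litellm_cache_key": "hash-of-first-request"},
        "vector": [0.60, 0.80, 0.00],
        "response": "Paris",
    },
]
query_vec = [0.62, 0.78, 0.05]  # toy embedding of the paraphrased prompt

# The paraphrase hashes to a different key, so the filter removes everything.
with_hash_filter = filtered_knn(
    index, query_vec, tag_filter=("litellm_cache_key", "hash-of-second-request")
)
without_filter = filtered_knn(index, query_vec)

print(len(with_hash_filter), len(without_filter))  # prints: 0 1
```

The nearest neighbour exists and is close, but the ranking step only ever sees the documents that survive the equality filter.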

The post-retrieval check compounds the problem

Even if acheck() somehow returned a result, a second exact-match check would reject it:

# Current implementation
def _cache_hit_matches_key(self, cache_hit: dict, key: str) -> bool:
    return cache_hit.get("litellm_cache_key") == key

cache_hit["litellm_cache_key"] is the hash of the original request. key is the hash of the current request. For any semantically similar but non-identical prompt these are always different strings. This check would reject every valid semantic hit.
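
A short sketch makes the mismatch concrete. The request_hash function here is a hypothetical stand-in (SHA256 over a JSON serialization of the request) and is not LiteLLM's actual key derivation; cache_hit_matches_key mirrors the exact-match check quoted above.

```python
import hashlib
import json


def request_hash(model, messages, **params):
    """Hypothetical cache key: SHA256 over the serialized request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_hit_matches_key(cache_hit, key):
    # Same logic as the exact-match post-retrieval check.
    return cache_hit.get("litellm_cache_key") == key


first_key = request_hash(
    "your-model",
    [{"role": "user", "content": "What is the capital of France?"}],
)
second_key = request_hash(
    "your-model",
    [{"role": "user", "content": "Which city is the capital of France?"}],
)

stored_entry = {"litellm_cache_key": first_key, "response": "Paris"}

print(first_key == second_key)                          # False
print(cache_hit_matches_key(stored_entry, second_key))  # False: paraphrase rejected
print(cache_hit_matches_key(stored_entry, first_key))   # True: only exact repeats pass
```

Any hash over the message content behaves this way, which is why a key derived from the request cannot double as the isolation scope.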

Why this is architecturally wrong

The isolation feature was introduced to prevent cross-tenant cache leakage, which is a legitimate concern. However, the implementation chose the full request hash as the isolation scope key. The request hash includes the message content, the very thing semantic matching needs to be flexible about. Using it as a hard equality filter collapses semantic retrieval to exact retrieval.

The correct isolation scope key is caller identity, not request content. These must be kept separate:

  • Who is asking → isolation boundary → should be user_api_key_hash, team_id, or user_id
  • What they are asking → semantic matching → should be KNN + distance threshold only

What The Fix Should Look Like

Option A: Scope by caller identity (recommended)

def _get_cache_key_filter_expression(
    self, 
    key: str, 
    metadata: Optional[dict] = None
) -> Optional[FilterExpression]:
    """
    Scope semantic cache lookups by caller identity, not request content.
    This preserves cross-tenant isolation while allowing semantic matching
    within a caller's own cached entries.
    """
    if metadata:
        if team_id := metadata.get("team_id"):
            return Tag("team_id") == team_id
        if api_key_hash := metadata.get("user_api_key_hash"):
            return Tag("api_key_hash") == api_key_hash
    # No identity context — return None to search all entries
    # (single-tenant deployments where isolation is not required)
    return None

def _cache_hit_matches_key(self, cache_hit: dict, key: str) -> bool:
    # Distance threshold applied inside acheck() is the sole
    # acceptance criterion. Key equality check is not appropriate
    # for semantic matching.
    return True

The write path would also need to store team_id or api_key_hash in the Redis document alongside the vector so the filter has something to match against.
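
How the write and read paths could fit together under Option A is sketched below as an in-memory model. It is not a patch for redis_semantic_cache.py: identity_tags, store, and lookup are hypothetical helpers, the index is a plain list, and the vectors are toy values. The field names team_id and api_key_hash follow the proposal above.

```python
import math
from typing import Optional


def cosine_distance(a, b):
    """Cosine distance, defined as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identity_tags(metadata: Optional[dict]) -> dict:
    """Collect the caller-identity tags to store alongside the vector."""
    tags = {}
    if metadata:
        if metadata.get("team_id"):
            tags["team_id"] = metadata["team_id"]
        if metadata.get("user_api_key_hash"):
            tags["api_key_hash"] = metadata["user_api_key_hash"]
    return tags


def store(index, vector, response, metadata=None):
    # Write path: the identity tags travel with the cached entry.
    index.append({"vector": vector, "response": response, "tags": identity_tags(metadata)})


def lookup(index, query_vec, distance_threshold, metadata=None):
    # Read path: scope by identity (team first, then key), never by content.
    scope = identity_tags(metadata)
    for field in ("team_id", "api_key_hash"):
        if field in scope:
            candidates = [doc for doc in index if doc["tags"].get(field) == scope[field]]
            break
    else:
        candidates = index  # no identity context: search every entry
    hits = [doc for doc in candidates
            if cosine_distance(doc["vector"], query_vec) <= distance_threshold]
    return min(hits, key=lambda doc: cosine_distance(doc["vector"], query_vec), default=None)


index = []
store(index, [0.60, 0.80, 0.00], "Paris", {"team_id": "team-a"})

paraphrase_vec = [0.62, 0.78, 0.05]
same_team = lookup(index, paraphrase_vec, 0.25, {"team_id": "team-a"})
other_team = lookup(index, paraphrase_vec, 0.25, {"team_id": "team-b"})

print(same_team["response"], other_team)  # prints: Paris None
```

In this model the paraphrase is served from cache for the team that wrote the entry and stays invisible to a different team, which is the isolation property the original filter was meant to provide.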

Option B: No filter (single-tenant deployments)

For deployments where cross-tenant isolation is not a concern, filter_expression=None passed to acheck() searches the full index and lets the distance threshold be the only gate. This restores the documented behaviour completely.


Impact

  • redis-semantic cache type has been non-functional for semantic matching since the isolation feature was introduced
  • It silently degrades to exact matching: no errors are raised, misses are not flagged, and the feature appears to work while providing no semantic benefit
  • Confirmed broken through v1.86.2
  • Any user relying on semantic caching to reduce model calls is getting zero benefit from it

  • LiteLLM semantic caching documentation: documented flow shows no filter step

Steps to Reproduce

1. Send an initial request to populate the cache:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'

2. Confirm the vector was stored in Redis:

redis-cli FT.SEARCH litellm_semantic_cache_index "*" RETURN 1 litellm_cache_key LIMIT 0 5

You will see entries with litellm_cache_key values that are SHA256 hashes of the full request.

3. Send a semantically equivalent but differently worded prompt:

curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LITELLM_KEY" \
  -d '{
    "model": "your-model",
    "messages": [{"role": "user", "content": "Which city is the capital of France?"}]
  }'

Expected: Cache hit. The two prompts embed to vectors with cosine distance well within the configured threshold (~0.008 in testing). The stored response should be returned without forwarding to the model.

Actual: Cache miss. Request is forwarded to the model. A second distinct entry is written to Redis.

4. Confirm vectors are genuinely similar — the embedding infrastructure is not the problem:

import asyncio
from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="litellm_semantic_cache_index",
    redis_url="redis://localhost:6379",
    distance_threshold=0.9
)

# pre-computed vector for "Which city is the capital of France?"
# (1024-dim, obtained from your embedding model)
query_vector = [...]  # your actual vector here

results = asyncio.run(cache.acheck(vector=query_vector))
print(results)
# Returns 3 hits with distances ~0.008 — well within threshold
# Proves vectors are stored correctly and redisvl works correctly
# The bug is entirely in the filter applied before acheck() is called

What part of LiteLLM is this about?

Proxy

What LiteLLM version are you on ?

1.85.1

Twitter / LinkedIn details

https://www.linkedin.com/in/joshua-ike/
