litellm - 💡(How to fix) Fix [Bug] Embedding cache merge corrupts `data[*].index` on partial cache hits



What Happened

When an embedding request is partly served from the cache and partly forwarded to the upstream provider, the merged response returned to the client contains duplicate and wrong index values in data[*]. The duplicates cause downstream clients (e.g. our Java ETL using Apache HttpClient) to reject the response as malformed.

The cache write path keeps the indices correct, and the DB spend log re-enumerates them, so this bug is invisible in LiteLLM_SpendLogs.response. Only the client-visible response is broken.

Affected Versions

Reproduced on v1.86.2 and confirmed still present on main (HEAD f27df8d at time of writing).

Root Cause

litellm/caching/caching_handler.py, function _combine_cached_embedding_response_with_api_result (lines 590–640).

Cache hits are inserted with the final position as index (correct, see _process_async_embedding_cached_response at lines 444–448):

final_embedding_cached_response.data[idx] = Embedding(
    embedding=embedding_data,
    index=idx,             # position-based — correct
    object="embedding",
)

But when merging the provider's response, items are taken verbatim with their sub-batch index intact:

# line 615ff
idx = 0
final_data_list = []
for item in _caching_handler_response.final_embedding_cached_response.data:
    if item is None and embedding_response.data is not None:
        final_data_list.append(embedding_response.data[idx])  # <-- BUG: index not re-mapped
        idx += 1
    else:
        final_data_list.append(item)

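The provider numbers its items 0..K-1, where K is the number of cache misses, and those items keep that numbering when they are placed at their final positions. Every provider item that lands after at least one cache hit therefore carries an index lower than its position. In the end-to-end run reported below, for example, a provider item lands at final position 18 still carrying index=16, duplicating an index that already appears earlier in the response.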
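To make the drift concrete, here is a minimal stand-alone sketch of the merge loop quoted above. It does not import litellm: plain dicts stand in for Embedding objects, and the name buggy_merge is hypothetical, chosen only for this illustration.

```python
from typing import Optional


def buggy_merge(cached: list[Optional[dict]], provider: list[dict]) -> list[dict]:
    """Mimic the unpatched loop: fill each None slot with the next provider item, unchanged."""
    idx = 0
    merged = []
    for item in cached:
        if item is None:
            merged.append(provider[idx])  # sub-batch index is kept as-is
            idx += 1
        else:
            merged.append(item)
    return merged


# Batch of 8 with cache hits at positions 1 and 4 (same layout as the repro in this report).
cached = [{"index": p} if p in {1, 4} else None for p in range(8)]
provider = [{"index": i} for i in range(6)]  # the provider numbers its 6 items 0..5

merged_indices = [item["index"] for item in buggy_merge(cached, provider)]
print(merged_indices)  # [0, 1, 1, 2, 4, 3, 4, 5]
```

The list order is still positionally correct; only the index field of the provider items is stale.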

Behavior

  • Drift: a provider item placed at final position P carries index P minus the number of cache hits at positions < P.
  • 0% cache hit (everything fresh) → indices 0..N-1, no bug.
  • 100% cache hit (everything cached) → indices 0..N-1, no bug.
  • Partial cache hit (1 ≤ hits < N) → drift, duplicates, missing indices.
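The drift rule above can be checked with a few lines of arithmetic. This is a sketch under the assumptions of this report (cache hits carry their final position, provider items carry their sub-batch position); predicted_indices is a hypothetical helper, not part of litellm.

```python
def predicted_indices(batch_size: int, hit_positions: set[int]) -> list[int]:
    """Predict the index field at each final position under the unpatched merge."""
    result = []
    hits_so_far = 0
    for pos in range(batch_size):
        if pos in hit_positions:
            result.append(pos)  # cache hits carry their final position
            hits_so_far += 1
        else:
            result.append(pos - hits_so_far)  # provider item: drift = hits before pos
    return result


print(predicted_indices(8, {1, 4}))        # [0, 1, 1, 2, 4, 3, 4, 5]
print(predicted_indices(4, set()))         # [0, 1, 2, 3]  (0% hits, no drift)
print(predicted_indices(4, {0, 1, 2, 3}))  # [0, 1, 2, 3]  (100% hits, no drift)
```

The first call reproduces the failure output of the unit test in this report, which uses the same hit positions.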

Minimal, Self-contained Repro

This requires no proxy and no real provider. It calls the buggy function directly with synthetic Embedding objects:

import pytest
from datetime import datetime

from litellm.caching.caching_handler import (
    LLMCachingHandler,
    CachingHandlerResponse,
)
from litellm.types.utils import Embedding, EmbeddingResponse


def emb(index, marker):
    return Embedding(embedding=[float(marker)], index=index, object="embedding")


def test_partial_cache_hit_does_not_corrupt_indices():
    """
    Repro for the partial-cache-hit index merge bug.

    Cache hits at positions 1 and 4 in a batch of 8.
    Provider returns the remaining 6 items with sub-batch indices 0..5.
    After merge, every data[i].index must equal i.
    """
    batch_size = 8
    cache_positions = {1, 4}

    # Build what _process_async_embedding_cached_response would produce:
    # cache hits already carry position-based indices; misses are None.
    cached_data = []
    for pos in range(batch_size):
        if pos in cache_positions:
            cached_data.append(emb(index=pos, marker=100 + pos))
        else:
            cached_data.append(None)

    cached_response = EmbeddingResponse(model="x", data=cached_data, object="list")

    # Provider response: 6 items (the cache misses), with sub-batch indices.
    provider_data = [emb(index=i, marker=200 + i) for i in range(6)]
    provider_response = EmbeddingResponse(model="x", data=provider_data, object="list")

    handler = LLMCachingHandler.__new__(LLMCachingHandler)  # skip __init__
    chr_ = CachingHandlerResponse(final_embedding_cached_response=cached_response)

    merged = handler._combine_cached_embedding_response_with_api_result(
        _caching_handler_response=chr_,
        embedding_response=provider_response,
        start_time=datetime.now(),
        end_time=datetime.now(),
    )

    assert merged is not None
    assert len(merged.data) == batch_size

    actual = [d.index for d in merged.data]
    expected = list(range(batch_size))
    assert actual == expected, f"index mismatch: actual={actual} expected={expected}"

    seen = set()
    for i, d in enumerate(merged.data):
        assert d.index not in seen, f"duplicate index {d.index} at position {i}"
        seen.add(d.index)

On v1.86.2 this fails with:

index mismatch: actual=[0, 1, 1, 2, 4, 3, 4, 5] expected=[0, 1, 2, 3, 4, 5, 6, 7]

End-to-end Reproduction (optional, against a live proxy)

# 1) Configure litellm_settings.cache: True with type: redis
# 2) Send a primer batch to populate the cache
#    (the Authorization header is double-quoted so the shell expands $KEY)
curl -X POST "$PROXY/v1/embeddings" \
  -H "Authorization: Bearer $KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model":"text-embedding-3-small","input":["foo","bar"]}'

# 3) Send a mixed batch: "foo" and "bar" will be cache hits
curl -X POST "$PROXY/v1/embeddings" \
  -H "Authorization: Bearer $KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model":"text-embedding-3-small",
       "input":["fresh1","foo","fresh2","fresh3","bar","fresh4","fresh5","fresh6"]}'
# Observe: data[0..7].index is NOT [0,1,2,3,4,5,6,7]

Suggested Fix

Re-map the provider item's index to its final position when inserting it into the merged list. Use __setitem__ (api_item["index"] = pos) because providers may return either Embedding pydantic instances or plain dicts — both implement __setitem__ (Embedding defines it explicitly), but only the pydantic model exposes .index as an attribute.

         idx = 0
         final_data_list = []
-        for item in _caching_handler_response.final_embedding_cached_response.data:
+        for pos, item in enumerate(_caching_handler_response.final_embedding_cached_response.data):
             if item is None and embedding_response.data is not None:
-                final_data_list.append(embedding_response.data[idx])
+                api_item = embedding_response.data[idx]
+                # Re-map the provider's sub-batch index to this item's final position.
+                # api_item may be an Embedding pydantic model or a plain dict — both
+                # support __setitem__ (Embedding defines it explicitly).
+                api_item["index"] = pos
+                final_data_list.append(api_item)
                 idx += 1
             else:
                 final_data_list.append(item)
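The same re-mapping can be exercised outside litellm. The sketch below is an illustration only: FakeEmbedding is a hypothetical stand-in for an object that supports item assignment (as this report says Embedding does), and patched_merge mirrors the loop in the diff rather than the real method signature.

```python
from typing import Any, Optional


class FakeEmbedding:
    """Hypothetical stand-in for an object that supports item get/set like a dict."""

    def __init__(self, index: int) -> None:
        self.index = index

    def __setitem__(self, key: str, value: Any) -> None:
        setattr(self, key, value)

    def __getitem__(self, key: str) -> Any:
        return getattr(self, key)


def patched_merge(cached: list[Optional[Any]], provider: list[Any]) -> list[Any]:
    """Fill each None slot with the next provider item, re-indexed to its final position."""
    idx = 0
    merged = []
    for pos, item in enumerate(cached):
        if item is None:
            api_item = provider[idx]
            api_item["index"] = pos  # works for dicts and for FakeEmbedding alike
            merged.append(api_item)
            idx += 1
        else:
            merged.append(item)
    return merged


cached = [{"index": p} if p in {1, 4} else None for p in range(8)]
# Mix dicts and objects to exercise both shapes the provider might return.
provider = [{"index": i} if i % 2 == 0 else FakeEmbedding(i) for i in range(6)]

fixed_indices = [item["index"] for item in patched_merge(cached, provider)]
print(fixed_indices)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Like the proposed diff, this mutates the provider items in place instead of copying them, which keeps the change small but means the provider response object is modified as a side effect.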

Workaround

Disable embedding caching globally by setting litellm_settings.cache: False. Note that:

  • cache_params.supported_call_types excluding embedding/aembedding does not prevent this — embedding writes happen regardless (likely the bug tracked in #20456).
  • cache_params.mode: default_off does not prevent it either.

Impact

  • Severity: high for any consumer that batches embedding requests and validates response shape (vector DBs, ETL pipelines, OpenAI-compatible clients that assert on data[i].index == i).
  • Silent corruption: HTTP 200, no exception, malformed payload.
  • Reproduces deterministically — easy to validate the fix.
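Because the failure is silent, a consumer has to check the payload itself. The following is a minimal sketch of such a check over the data list of a parsed embeddings response. The function name and the wrong_index label are hypothetical; the duplicate_index wording simply follows the format used in the verification results of this report.

```python
from typing import Optional


def first_index_error(data: list[dict]) -> Optional[str]:
    """Describe the first bad index in an embeddings `data` list, or return None if clean."""
    seen: set[int] = set()
    for pos, item in enumerate(data):
        index = item["index"]
        if index in seen:
            return f"duplicate_index: index={index} at position {pos}"
        if index != pos:
            return f"wrong_index: index={index} at position {pos}"
        seen.add(index)
    return None


corrupted = [{"index": i} for i in [0, 1, 1, 2, 4, 3, 4, 5]]
clean = [{"index": i} for i in range(8)]

print(first_index_error(corrupted))  # duplicate_index: index=1 at position 2
print(first_index_error(clean))      # None
```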

Why this bug has been hiding in plain sight

The bug is only triggered when a real, persistent cache backend holds enough entries to produce a partial cache hit on a multi-input batch. The default type: local (InMemoryCache) backend masks it:

  • litellm/caching/in_memory_cache.py defaults to max_size_in_memory=200 with LRU eviction.
  • Cache(type="local") in litellm/caching/caching.py:221 constructs InMemoryCache() with no arguments — there is no way to raise that limit from cache_params in config.yaml.
  • Any meaningful warmup blows past 200 entries quickly. Earlier inputs get evicted long before a later batch could see them as cache hits, so the partial-hit code path in _combine_cached_embedding_response_with_api_result never runs.
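The masking effect can be illustrated with a toy size-capped LRU. This is a sketch, not litellm's InMemoryCache: TinyLRU is a hypothetical stand-in, and the warm-up shape (30 batches of 128 unique inputs, with the overlap taken from the first batch) is an assumption made for the example.

```python
from collections import OrderedDict


class TinyLRU:
    """Hypothetical size-capped LRU cache used only for this illustration."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._store = OrderedDict()

    def set(self, key: str, value: list) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)
        return self._store[key]


def hits_after_warmup(max_size: int) -> int:
    """Count cache hits in the target batch after the assumed warm-up traffic."""
    cache = TinyLRU(max_size)
    for batch in range(30):          # assumed: 30 warm-up batches
        for i in range(128):         # assumed: 128 unique inputs per batch
            cache.set(f"warm-{batch}-{i}", [0.0])
    # The target batch reuses two strings from the first warm-up batch.
    target = ["warm-0-16", "warm-0-17"] + [f"fresh-{i}" for i in range(126)]
    return sum(cache.get(text) is not None for text in target)


print(hits_after_warmup(200))   # 0: the overlapping entries were evicted, no partial hit
print(hits_after_warmup(5000))  # 2: partial hit, so the merge path would run
```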

Reproductions therefore need either:

  • a backend without a hard item limit (Redis, Disk, S3/GCS/Azure Blob), or
  • a direct call into _combine_cached_embedding_response_with_api_result (as the unit tests below do).

Our production stack uses Redis, which is what surfaced this.

Cross-backend confirmation (4-way matrix)

End-to-end against a local LiteLLM proxy from this source, 30 warmup batches + 1 target batch (128 inputs, 2 overlapping with warmup):

| Backend | Item limit | Unpatched | Patched |
| --- | --- | --- | --- |
| Redis (type: redis) | unlimited (TTL only) | duplicate_index: 16 @ pos 18 | ✅ clean |
| In-Memory (type: local, default) | 200, LRU | ✅ clean (bug masked by eviction) | ✅ clean |
| In-Memory, patched to InMemoryCache(max_size_in_memory=5000) | 5000, LRU | duplicate_index: 16 @ pos 18 | ✅ clean |

Conclusion: the bug is purely in the merge function — independent of the cache backend. The only reason it doesn't show up with the default in-memory cache is the 200-item LRU window.

Verification

Tested locally against main (HEAD f27df8d):

Unit tests (tests/test_litellm/caching/test_partial_cache_merge.py, 7 cases)

| State | Result |
| --- | --- |
| Unpatched master | 4 failed, 3 passed (sanity cases) |
| With patch | 7 passed |

The 3 pre-existing failures in test_llm_caching_handler.py (LLMClientCache async eviction tests) are present on both patched and unpatched main, i.e. unrelated to this change.

End-to-end against a real LiteLLM proxy

Started a local LiteLLM proxy from the source tree (litellm --config ... --port 4444) configured with cache_params.type: redis (Redis DB 1, cleared before each run) and hosted_vllm/lexcom/bge-m3 as the model. Sent raw HTTP requests at the proxy from a separate Python client (no litellm library on the caller side — same shape as our Java ETL).

Replayed 30 warm-up batches then the target batch (128 inputs, 2 strings overlapping the warm-up → 2 cache hits at the same positions).

| Proxy code | First target attempt |
| --- | --- |
| Unpatched | duplicate_index: index=16 at position 18 (deterministic) |
| Patched | clean: data[*].index is 0..127 over 3 attempts |

This exercises the full production code path (raw HTTP → LiteLLM proxy → Redis cache → upstream provider → cache merge → HTTP response) and confirms the patch fixes the bug at the actual integration point.

Smaller direct-library test (8-item batch via litellm.aembedding(caching=True) against the same upstream): unpatched [0,1,1,2,4,3,4,5], patched [0,1,2,3,4,5,6,7].
