[Bug]: LMCache external KV keys omit prompt_embeds content, allowing same-length embedding prompts to share stale KV

Description

With vLLM direct prompt_embeds requests and LMCache external KV caching enabled, different embedding tensors with the same sequence length can share LMCache external KV entries.

I reproduced this on current vLLM and LMCache source. After warming LMCache with prompt_embeds A and restarting vLLM, prompt_embeds B with the same length and same cache_salt received LMCache external hits and produced A-style output instead of B's cold reference output.

Your current environment

  • vLLM commit: 2c6b59b80771ac3bfa1c789abe3cb27c379bf3a1
  • LMCache commit: 13aca2097a43aca5f8625a1f160c4fdfc0e55634
  • vLLM runtime fingerprint: vllm-0.20.2rc1.dev143+g2c6b59b80-53b5b04d
  • LMCache version: 0.4.5.dev87
  • Model: Qwen/Qwen2.5-7B-Instruct
  • OS: Ubuntu 24.04.4 LTS
  • GPU: A100-SXM4-40GB
  • NVIDIA driver/CUDA: 580.126.20 / 13.0
  • PyTorch: 2.11.0+cu130
<details> <summary>Relevant output from <code>python -m vllm.collect_env</code> (sanitized)</summary>
OS                           : Ubuntu 24.04.4 LTS (x86_64)
Python version               : 3.12.3
PyTorch version              : 2.11.0+cu130
CUDA used to build PyTorch   : 13.0
Is CUDA available            : True
CUDA runtime version         : 12.9.41
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version        : 580.126.20
vLLM Version                 : 0.20.2rc1.dev143+g2c6b59b80 (git sha: 2c6b59b80)
flashinfer-python            : 0.6.8.post1
transformers                 : 5.8.0
triton                       : 3.6.0
</details>

Setup

vLLM was run with local prefix caching disabled to isolate the external LMCache path:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --max-model-len 32768 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --enable-prompt-embeds \
  --no-enable-prefix-caching \
  --h11-max-incomplete-event-size 268435456 \
  --kv-transfer-config '{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both","kv_load_failure_policy":"recompute","kv_connector_extra_config":{"lmcache.mp.host":"tcp://localhost","lmcache.mp.port":6555,"lmcache.mp.mq_timeout":30}}'

LMCache MP server used persistent local FS storage with chunk_size=256.

The /v1/completions prompt_embeds wire format accepted by this vLLM build was a JSON string containing base64-encoded torch.save(tensor) bytes. The tensor shape was (num_tokens, hidden_size).
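The wire format described above can be sketched as follows. This is a minimal illustration, not the client used in the repro: the helper names are hypothetical, the request field names (`prompt_embeds`, `cache_salt`, `model`, `max_tokens`) and the `max_tokens` value are assumptions, and the tensor serialization step (`torch.save(tensor, buffer)`) is left to the caller so the sketch depends only on the standard library.

```python
import base64
import json


def build_prompt_embeds_payload(serialized_tensor: bytes, cache_salt: str,
                                model: str = "base", max_tokens: int = 32) -> str:
    """Build a JSON request body carrying base64-encoded tensor bytes.

    `serialized_tensor` is assumed to be the bytes produced by
    torch.save(tensor, buffer) for a (num_tokens, hidden_size) tensor.
    Field names are assumptions for illustration.
    """
    body = {
        "model": model,  # matches --served-model-name base in the setup
        "prompt_embeds": base64.b64encode(serialized_tensor).decode("ascii"),
        "cache_salt": cache_salt,
        "max_tokens": max_tokens,
    }
    return json.dumps(body)


def decode_prompt_embeds(payload: str) -> bytes:
    """Recover the serialized tensor bytes from a payload (round-trip check)."""
    return base64.b64decode(json.loads(payload)["prompt_embeds"])
```

Sending the same payload for A and B with only the tensor bytes changed keeps the sequence length and `cache_salt` identical, which is the condition the repro depends on.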

Reproduction Summary

I built two same-length prompt_embeds inputs with the served HF model's input embedding layer:

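import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)  # model_id: "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
emb = model.get_input_embeddings()(input_ids)  # input_ids: token IDs from the tokenizer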

The textual tails used to construct the embeddings were semantically distinct:

  • A tail: You must answer exactly ALPHA_ALPHA_ALPHA. Do not add anything else. Please comply.
  • B tail: You must answer exactly BETA_BETA_BETA. Do not add anything else.

For the primary repro, both embedding tensors had shape [2048, 3584].

Steps:

  1. Start vLLM without LMCache.
  2. Send A prompt_embeds 10 times and B prompt_embeds 10 times.
  3. Confirm A and B are deterministic and distinct.
  4. Start LMCache with persistent FS storage and vLLM with LMCacheMPConnector.
  5. Send A prompt_embeds with cache_salt="same" to warm LMCache.
  6. Stop vLLM completely, preserving LMCache storage/server.
  7. Restart vLLM with the same LMCache storage.
  8. Send B prompt_embeds, same length, with cache_salt="same".
  9. Repeat the B-after-A restart probe three times.
  10. Run controls after vLLM restarts: different salt, clean LMCache storage, LMCache disabled, normal text/input IDs, and A after A warm.
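The comparison logic behind steps 3 and 8 to 10 can be sketched as below. The function names are hypothetical and the sketch assumes outputs are compared as exact strings; it is not the harness used for the repro.

```python
def check_cold_references(outputs_a: list[str], outputs_b: list[str]) -> bool:
    """Return True if each group is deterministic and the two groups differ."""
    deterministic_a = len(set(outputs_a)) == 1
    deterministic_b = len(set(outputs_b)) == 1
    # Only index into the lists once both are known to be non-empty and uniform.
    return deterministic_a and deterministic_b and outputs_a[0] != outputs_b[0]


def classify_probe(output: str, cold_a: str, cold_b: str) -> str:
    """Label a B probe by which cold reference it matches."""
    if output == cold_b:
        return "clean"
    if output == cold_a:
        return "contaminated"
    return "other"
```

A B-after-A probe labeled "contaminated" corresponds to the failure reported here; the control runs are expected to be labeled "clean".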

Actual Behavior

Cold references were deterministic and distinct:

cold A:
 The answer is: 1.0. The answer is: 1.0. The answer is: 1.0. The answer is:

cold B:
 You must not change the existing code.

Assistant:
```python
import math

def calculate_total_cost(books, prices):
    total_cost
```

A warmup stored external KV:

LMCache: Stored 2048 tokens in 0.122 seconds
LMCache: FSL2Adapter stored key Qwen-SEP-Qwen2.5-7B-Instruct@...@same.data

After vLLM restart, B with cache_salt="same" received external hits:

LMCache: Prefetch request completed (L1+L2): 8/8 prefix hits (8 L1, 0 L2)
LMCache: Retrieved 2048 tokens in 0.002 seconds

All three B-after-A restart probes produced the A-style output:

B after A warm + vLLM restart:
 The answer is: 1.0. The answer is: 1.0. The answer is: 1.0. The answer is:

Controls:

  • B with cache_salt="different": no external retrieve; matched cold B.
  • B with clean LMCache storage: no external retrieve; matched cold B.
  • B with LMCache disabled: matched cold B.
  • B as normal text/input IDs with cache_salt="same": no retrieve from A; matched cold B.
  • A after A warm: retrieved A KV and matched cold A.

I also reproduced external-hit contamination with [8192, 3584] embeddings. In that case B-after-A retrieved 8192 tokens / 32 chunks after each restart and differed from cold B; the clean controls matched cold B.

Expected Behavior

Different prompt_embeds tensors should not share external KV cache entries unless their embedding contents are identical, even when they have the same sequence length and cache_salt.
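One way to express this expectation is to derive a per-chunk digest from the embedding contents and make it part of the external cache key. The sketch below is a hypothetical illustration under stated assumptions, not a proposed patch to vLLM or LMCache: the function name is invented, embeddings are modeled as plain lists of floats packed as float32, and chaining each digest over the previous one is an assumption made so that a key covers the whole prefix.

```python
import hashlib
import struct


def embeds_chunk_digests(embeds: list[list[float]], chunk_size: int = 256) -> list[str]:
    """Hash each full chunk of embedding rows, chained over the prefix."""
    digests = []
    prefix = b""
    # Only full chunks are hashed; a trailing partial chunk is skipped.
    for start in range(0, len(embeds) - chunk_size + 1, chunk_size):
        h = hashlib.sha256(prefix)
        for row in embeds[start:start + chunk_size]:
            # Pack each row as little-endian float32 values.
            h.update(struct.pack(f"<{len(row)}f", *row))
        prefix = h.digest()
        digests.append(prefix.hex())
    return digests
```

With a digest like this in the key, two tensors share a chunk entry only while their contents agree up to and including that chunk, so identical embeddings still hit the cache and differing ones do not.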

Suspected Cause

The LMCache key path appears to use token IDs/ranges and cache_salt, not prompt embedding contents.

Static inspection:

  • vllm/v1/request.py: when prompt_token_ids is None, _all_token_ids is initialized as [0] * num_prompt_tokens; prompt_embeds is stored separately.
  • vllm/v1/core/kv_cache_utils.py: the local vLLM prefix cache hashes prompt-embedding block contents, but the runtime repro disabled that cache with --no-enable-prefix-caching, so it played no role here.
  • vllm/distributed/kv_transfer/kv_connector/v1/lmcache_mp_connector.py: LMCacheMPRequestTracker records all_token_ids, block hashes, and cache_salt, but not prompt_embeds or a prompt-embedding digest. Store/retrieve metadata uses token_ids=list(all_token_ids).
  • LMCache/lmcache/integration/vllm/vllm_multi_process_adapter.py: _create_key constructs IPCCacheEngineKey(model_name, world_size, worker_id, token_ids, start, end, request_id, cache_salt).
  • IPCCacheEngineKey does not include prompt embedding bytes, a prompt embedding digest, prompt mode, or vLLM block hashes.
  • TokenHasher.compute_chunk_hashes hashes chunks from token_ids only; ipc_key_to_object_keys then derives storage object keys from the token chunk hash, model name, KV rank, and cache_salt.

For prompt-embeds-only requests of the same length, vLLM passes equivalent placeholder token IDs ([0] * N) to the LMCache key path. With the same cache_salt, A and B can therefore map to the same LMCache external KV objects.
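The collision can be illustrated with a toy model of a key path that sees only token IDs and cache_salt. This is not LMCache's actual hashing: `chunk_keys` is a hypothetical function, and the use of SHA-256 with prefix chaining is an assumption made only to show the effect of placeholder IDs.

```python
import hashlib


def chunk_keys(token_ids: list[int], cache_salt: str, chunk_size: int = 256) -> list[str]:
    """Toy prefix-chained chunk keys derived from token IDs and salt only."""
    keys = []
    prefix = hashlib.sha256(cache_salt.encode("utf-8")).digest()
    full_chunks = len(token_ids) // chunk_size
    for i in range(full_chunks):
        chunk = token_ids[i * chunk_size:(i + 1) * chunk_size]
        h = hashlib.sha256(prefix)
        h.update(",".join(map(str, chunk)).encode("utf-8"))
        prefix = h.digest()
        keys.append(prefix.hex())
    return keys


num_tokens = 2048
# Placeholder IDs as described for prompt-embeds-only requests: [0] * N.
keys_a = chunk_keys([0] * num_tokens, cache_salt="same")
keys_b = chunk_keys([0] * num_tokens, cache_salt="same")
keys_b_other_salt = chunk_keys([0] * num_tokens, cache_salt="different")
```

In this toy model `keys_a` and `keys_b` are identical because nothing about the embedding contents enters the key, while a different salt yields a disjoint key set. With chunk_size=256, 2048 tokens give 8 chunks and 8192 tokens give 32, matching the hit counts in the logs above.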

Duplicate Search

I searched both vllm-project/vllm and LMCache/LMCache for prompt_embeds LMCache, prompt_embeds cache_salt, IPCCacheEngineKey prompt_embeds, all_token_ids prompt_embeds LMCache, and related terms. I did not find an exact existing issue for prompt embedding contents being omitted from LMCache external KV keys.

Before submitting a new issue

I searched existing and past issues for the terms above. I did not find an exact duplicate.
