vllm - [Bug]: LMCache external KV keys omit prompt_embeds content, allowing same-length embedding prompts to share stale KV

vllm, 2026-05-08 22:11:49


With vLLM direct prompt_embeds requests and LMCache external KV caching enabled, different embedding tensors with the same sequence length can share LMCache external KV entries.

I reproduced this on current vLLM and LMCache source. After warming LMCache with prompt_embeds A and restarting vLLM, prompt_embeds B with the same length and same cache_salt received LMCache external hits and produced A-style output instead of B's cold reference output.




Your current environment

vLLM commit: 2c6b59b80771ac3bfa1c789abe3cb27c379bf3a1
LMCache commit: 13aca2097a43aca5f8625a1f160c4fdfc0e55634
vLLM runtime fingerprint: vllm-0.20.2rc1.dev143+g2c6b59b80-53b5b04d
LMCache version: 0.4.5.dev87
Model: Qwen/Qwen2.5-7B-Instruct
OS: Ubuntu 24.04.4 LTS
GPU: A100-SXM4-40GB
NVIDIA driver/CUDA: 580.126.20 / 13.0
PyTorch: 2.11.0+cu130

<details> <summary>Relevant output from <code>python -m vllm.collect_env</code> (sanitized)</summary>

OS                           : Ubuntu 24.04.4 LTS (x86_64)
Python version               : 3.12.3
PyTorch version              : 2.11.0+cu130
CUDA used to build PyTorch   : 13.0
Is CUDA available            : True
CUDA runtime version         : 12.9.41
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version        : 580.126.20
vLLM Version                 : 0.20.2rc1.dev143+g2c6b59b80 (git sha: 2c6b59b80)
flashinfer-python            : 0.6.8.post1
transformers                 : 5.8.0
triton                       : 3.6.0

</details>

Setup

vLLM was run with local prefix caching disabled to isolate the external LMCache path:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --host 127.0.0.1 \
  --port 8000 \
  --served-model-name base \
  --max-model-len 32768 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.90 \
  --trust-remote-code \
  --enable-prompt-embeds \
  --no-enable-prefix-caching \
  --h11-max-incomplete-event-size 268435456 \
  --kv-transfer-config '{"kv_connector":"LMCacheMPConnector","kv_role":"kv_both","kv_load_failure_policy":"recompute","kv_connector_extra_config":{"lmcache.mp.host":"tcp://localhost","lmcache.mp.port":6555,"lmcache.mp.mq_timeout":30}}'

LMCache MP server used persistent local FS storage with chunk_size=256.

The /v1/completions prompt_embeds wire format accepted by this vLLM build was a JSON string containing base64-encoded torch.save(tensor) bytes. The tensor shape was (num_tokens, hidden_size).
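
As a rough illustration of that wire format, the sketch below builds a request body using only the standard library. The helper name `build_prompt_embeds_request` is hypothetical, `embeds_bytes` is assumed to already hold the output of `torch.save(tensor)`, and placing `cache_salt` as a top-level field next to `prompt_embeds` is an assumption about this build rather than something the report spells out.

```python
import base64
import json


def build_prompt_embeds_request(embeds_bytes: bytes, cache_salt: str,
                                model: str = "base", max_tokens: int = 32) -> str:
    """Return a JSON body for POST /v1/completions carrying prompt_embeds.

    embeds_bytes is assumed to be the raw bytes written by torch.save(tensor)
    for a tensor of shape (num_tokens, hidden_size). Producing those bytes is
    out of scope for this sketch.
    """
    payload = {
        # "base" matches --served-model-name in the serve command.
        "model": model,
        # The report describes the field as a base64 string of torch.save bytes.
        "prompt_embeds": base64.b64encode(embeds_bytes).decode("ascii"),
        # Assumed top-level placement; the report only states the salt value.
        "cache_salt": cache_salt,
        # Illustrative value, not taken from the report.
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)
```

The resulting string can be sent with any HTTP client. Large tensors produce large bodies, which is presumably why the serve command raises `--h11-max-incomplete-event-size`.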

Reproduction Summary

I built two same-length prompt_embeds inputs with the served HF model's input embedding layer:

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
emb = model.get_input_embeddings()(input_ids)

The textual tails used to construct the embeddings were semantically distinct:

A tail: You must answer exactly ALPHA_ALPHA_ALPHA. Do not add anything else. Please comply.
B tail: You must answer exactly BETA_BETA_BETA. Do not add anything else.

For the primary repro, both embedding tensors had shape [2048, 3584].

Steps:

1. Start vLLM without LMCache.
2. Send A prompt_embeds 10 times and B prompt_embeds 10 times.
3. Confirm A and B are deterministic and distinct.
4. Start LMCache with persistent FS storage and vLLM with LMCacheMPConnector.
5. Send A prompt_embeds with cache_salt="same" to warm LMCache.
6. Stop vLLM completely, preserving LMCache storage/server.
7. Restart vLLM with the same LMCache storage.
8. Send B prompt_embeds, same length, with cache_salt="same".
9. Repeat the B-after-A restart probe three times.
10. Run controls after vLLM restarts: different salt, clean LMCache storage, LMCache disabled, normal text/input IDs, and A after A warm.
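
The comparison at the heart of these steps can be sketched as a small classifier over completion text. The function name `classify_probe` and its labels are hypothetical, and the approach assumes the cold references are deterministic, as the earlier step confirms.

```python
def classify_probe(output: str, cold_a: str, cold_b: str) -> str:
    """Classify a B probe against the cold reference outputs.

    Returns "clean" if the probe matches cold B, "contaminated" if it matches
    cold A (B was served A's KV), and "diverged" if it matches neither.
    """
    if output == cold_b:
        return "clean"
    if output == cold_a:
        return "contaminated"
    return "diverged"
```

A probe that differs from cold B without matching cold A exactly still counts as a failure, so "diverged" should be treated as suspicious rather than as a pass.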

Actual Behavior

Cold references were deterministic and distinct:

cold A:
 The answer is: 1.0. The answer is: 1.0. The answer is: 1.0. The answer is:

cold B:
 You must not change the existing code.

Assistant:
```python
```python
import math

def calculate_total_cost(books, prices):
    total_cost

A warmup stored external KV:

LMCache: Stored 2048 tokens in 0.122 seconds
LMCache: FSL2Adapter stored key Qwen-SEP-Qwen2.5-7B-Instruct@...@same.data

After vLLM restart, B with cache_salt="same" received external hits:

LMCache: Prefetch request completed (L1+L2): 8/8 prefix hits (8 L1, 0 L2)
LMCache: Retrieved 2048 tokens in 0.002 seconds

All three B-after-A restart probes produced the A-style output:

B after A warm + vLLM restart:
 The answer is: 1.0. The answer is: 1.0. The answer is: 1.0. The answer is:

Controls:

- B with cache_salt="different": no external retrieve; matched cold B.
- B with clean LMCache storage: no external retrieve; matched cold B.
- B with LMCache disabled: matched cold B.
- B as normal text/input IDs with cache_salt="same": no retrieve from A; matched cold B.
- A after A warm: retrieved A KV and matched cold A.

I also reproduced external-hit contamination with [8192, 3584] embeddings. In that case B-after-A retrieved 8192 tokens / 32 chunks after each restart and differed from cold B; the clean controls matched cold B.
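
The hit counts in the logs line up with the configured chunk size. A minimal arithmetic check, assuming hits are counted per complete chunk (both prompt lengths here are exact multiples of 256, so the assumption does not change the result):

```python
CHUNK_SIZE = 256  # from the setup: LMCache MP server used chunk_size=256


def full_chunks(num_tokens: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of complete chunks in a prompt of num_tokens tokens."""
    return num_tokens // chunk_size


print(full_chunks(2048))  # 8, matching "8/8 prefix hits"
print(full_chunks(8192))  # 32, matching "32 chunks"
```

Every chunk of B hit an entry written by A, which is consistent with the whole prompt, not just a shared prefix, mapping to the same keys.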

Expected Behavior

Different prompt_embeds tensors should not share external KV cache entries unless their embedding contents are identical, even when they have the same sequence length and cache_salt.
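
Until the key path accounts for embedding contents, one possible client-side stopgap is to derive `cache_salt` from the embedding bytes, since the cache_salt="different" control did not retrieve A's KV. This is a sketch of that idea, not a fix in vLLM or LMCache. The helper name `content_salt` is hypothetical, and whether the server accepts a salt of this length is an assumption.

```python
import hashlib


def content_salt(embeds_bytes: bytes, base_salt: str = "") -> str:
    """Derive a cache_salt that changes whenever the embedding bytes change.

    embeds_bytes should be a stable serialization of the tensor. Raw tensor
    bytes are likely a safer input than torch.save output, which is not
    guaranteed to be byte-identical across versions (an assumption to verify).
    """
    digest = hashlib.sha256(embeds_bytes).hexdigest()
    return f"{base_salt}:{digest}" if base_salt else digest
```

The trade-off is that two embedding prompts sharing a long common prefix no longer share any external KV, because the salt differs for the whole request.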

Suspected Cause

The LMCache key path appears to use token IDs/ranges and cache_salt, not prompt embedding contents.

Static inspection:

- vllm/v1/request.py: when prompt_token_ids is None, _all_token_ids is initialized as [0] * num_prompt_tokens; prompt_embeds is stored separately.
- vllm/v1/core/kv_cache_utils.py: local vLLM prefix cache hashes prompt-embedding block contents. The runtime repro used --no-enable-prefix-caching anyway.
- vllm/distributed/kv_transfer/kv_connector/v1/lmcache_mp_connector.py: LMCacheMPRequestTracker records all_token_ids, block hashes, and cache_salt, but not prompt_embeds or a prompt-embedding digest. Store/retrieve metadata uses token_ids=list(all_token_ids).
- LMCache/lmcache/integration/vllm/vllm_multi_process_adapter.py: _create_key constructs IPCCacheEngineKey(model_name, world_size, worker_id, token_ids, start, end, request_id, cache_salt).
- IPCCacheEngineKey does not include prompt embedding bytes, a prompt embedding digest, prompt mode, or vLLM block hashes.
- TokenHasher.compute_chunk_hashes hashes chunks from token_ids only; ipc_key_to_object_keys then derives storage object keys from the token chunk hash, model name, KV rank, and cache_salt.

For prompt-embeds-only requests of the same length, vLLM passes equivalent placeholder token IDs ([0] * N) to the LMCache key path. With the same cache_salt, A and B can therefore map to the same LMCache external KV objects.
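
The collision can be illustrated with a toy key function. This is not LMCache's `TokenHasher`: the prefix-chained SHA-256 scheme, the `chunk_keys` name, and the optional per-chunk `embeds_digests` argument are all assumptions made for illustration only.

```python
import hashlib
from typing import Optional, Sequence

CHUNK_SIZE = 256  # from the setup: chunk_size=256


def chunk_keys(token_ids: Sequence[int], cache_salt: str,
               embeds_digests: Optional[Sequence[str]] = None) -> list[str]:
    """Toy prefix-chained chunk keys built from token IDs and cache_salt.

    If embeds_digests is given (one digest per complete chunk), it is mixed
    into each chunk's key, modelling a key that covers embedding contents.
    """
    keys = []
    prev = cache_salt.encode()
    # Iterate over complete chunks only.
    for i in range(0, len(token_ids) - CHUNK_SIZE + 1, CHUNK_SIZE):
        h = hashlib.sha256(prev)
        h.update(repr(list(token_ids[i:i + CHUNK_SIZE])).encode())
        if embeds_digests is not None:
            h.update(embeds_digests[i // CHUNK_SIZE].encode())
        prev = h.digest()
        keys.append(prev.hex())
    return keys


# Prompt-embeds-only requests carry placeholder IDs, so A and B look identical.
placeholder = [0] * 2048
keys_a = chunk_keys(placeholder, "same")
keys_b = chunk_keys(placeholder, "same")
print(len(keys_a), keys_a == keys_b)  # 8 True

# Mixing in per-chunk content digests (stand-ins here) separates the keys.
dig_a = [hashlib.sha256(f"A-{i}".encode()).hexdigest() for i in range(8)]
dig_b = [hashlib.sha256(f"B-{i}".encode()).hexdigest() for i in range(8)]
print(chunk_keys(placeholder, "same", dig_a) == chunk_keys(placeholder, "same", dig_b))  # False
```

In this model, any key that ignores embedding contents collides for every same-length, same-salt pair, which matches the observed 8/8 hits. Adding a content digest, or reusing vLLM's existing block hashes, is one direction a real fix might take.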

Duplicate Search

I searched both vllm-project/vllm and LMCache/LMCache for prompt_embeds LMCache, prompt_embeds cache_salt, IPCCacheEngineKey prompt_embeds, all_token_ids prompt_embeds LMCache, and related terms. I did not find an exact existing issue for prompt embedding contents being omitted from LMCache external KV keys.

Before submitting a new issue

I searched existing and past issues for the terms above. I did not find an exact duplicate.
