vllm - ✅(Solved) Fix [Bug]: KV block corruption under rapid LoRA adapter alternation [1 pull requests, 2 comments, 3 participants]

vllm • 2026-03-31 03:37:45


GitHub stats

vllm-project/vllm#38606 • Fetched 2026-04-08 01:58:57


Timeline (top)

commented ×2, referenced ×2, assigned ×1, cross-referenced ×1

Fix Action

Fix / Workaround

#37076 / PR #37164 fixes a TOCTOU race where request B steals a cached block between get_computed_blocks() and allocate_slots(). The fix pre-pins blocks immediately after lookup. This issue looks different and is most likely not patched by that fix.
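The pre-pinning idea described above can be illustrated with a toy reference-counted block pool. This is a minimal sketch, not vLLM's code: `ToyBlockPool`, `lookup_and_pin`, `unpin`, and `evict_one` are hypothetical names chosen for illustration. It only shows why pinning at lookup time closes the gap between lookup and allocation.

```python
class ToyBlockPool:
    """Toy prefix-cache pool; hypothetical, not the vLLM block pool API."""

    def __init__(self):
        self.cached = {}      # block_hash -> block_id
        self.ref_count = {}   # block_id -> number of requests pinning it

    def add(self, block_hash, block_id):
        self.cached[block_hash] = block_id
        self.ref_count[block_id] = 0

    def lookup_and_pin(self, block_hash):
        """Look up a cached block and pin it immediately, leaving no gap."""
        block_id = self.cached.get(block_hash)
        if block_id is not None:
            self.ref_count[block_id] += 1
        return block_id

    def unpin(self, block_id):
        self.ref_count[block_id] -= 1

    def evict_one(self):
        """Evict one unpinned block; return its id, or None if all are pinned."""
        for block_hash, block_id in list(self.cached.items()):
            if self.ref_count[block_id] == 0:
                del self.cached[block_hash]
                del self.ref_count[block_id]
                return block_id
        return None


pool = ToyBlockPool()
pool.add("hash-A", 7)
hit = pool.lookup_and_pin("hash-A")   # request A pins block 7 at lookup time
stolen = pool.evict_one()             # request B tries to reclaim a block
```

Because the block is pinned in the same call that finds it, a second request cannot reclaim it before the first request finishes allocating.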

PR fix notes

PR #38715: [Bugfix] Fix intra-step KV block corruption from stale prefix cache hits

Repository: vllm-project/vllm
Author: KrxGu
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/38715

Description (problem / solution / changelog)

Problem

When two requests share a prefix and are scheduled in the same step, the second request can find the first request's blocks in the prefix cache and read from them before the GPU has written any KV data. allocate_slots() calls cache_blocks() at scheduling time, ahead of the forward pass, so the registered blocks contain stale memory. The corruption is most visible under rapid LoRA adapter alternation where different adapters share prefixes frequently (#38606).

Fix

MambaManager already had protection via a cached_blocks_this_step set. This PR moves that guard into the base SingleTypeKVCacheManager class so FullAttentionManager and SlidingWindowManager get the same protection.

When the last prefix-cache hit block was registered in the current step, get_num_blocks_to_allocate returns num_gpu_blocks + 1. The scheduler treats this as "no capacity" and defers the request to the next step, by which time the GPU will have committed the KV data.
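The guard described in the PR can be sketched with a toy cache manager. This is an illustration under stated assumptions, not the code from PR #38715: `ToyPrefixCacheManager`, `find_hits`, `allocate_and_cache`, and `try_schedule` are hypothetical, and only the names `cached_blocks_this_step`, `new_step_starts`, `get_num_blocks_to_allocate`, and the `num_gpu_blocks + 1` sentinel come from the PR description.

```python
class ToyPrefixCacheManager:
    """Toy model of the intra-step guard; illustrative only."""

    def __init__(self, num_gpu_blocks):
        self.num_gpu_blocks = num_gpu_blocks
        self.cached = {}                      # block_hash -> block_id
        self.cached_blocks_this_step = set()  # block ids registered this step
        self.next_free = 0

    def new_step_starts(self):
        # Assumed to run after the forward pass: earlier blocks now hold KV data.
        self.cached_blocks_this_step.clear()

    def find_hits(self, block_hashes):
        """Return block ids for the longest cached prefix of block_hashes."""
        hits = []
        for h in block_hashes:
            if h not in self.cached:
                break
            hits.append(self.cached[h])
        return hits

    def get_num_blocks_to_allocate(self, block_hashes):
        hits = self.find_hits(block_hashes)
        if hits and hits[-1] in self.cached_blocks_this_step:
            # Sentinel larger than the pool: the caller reads it as "no capacity".
            return self.num_gpu_blocks + 1
        return len(block_hashes) - len(hits)

    def allocate_and_cache(self, block_hashes):
        """Allocate missing blocks and register them before any GPU write."""
        for h in block_hashes[len(self.find_hits(block_hashes)):]:
            self.cached[h] = self.next_free
            self.cached_blocks_this_step.add(self.next_free)
            self.next_free += 1


def try_schedule(mgr, block_hashes):
    """Return True if scheduled now, False if deferred to a later step."""
    free_blocks = mgr.num_gpu_blocks - mgr.next_free
    if mgr.get_num_blocks_to_allocate(block_hashes) > free_blocks:
        return False
    mgr.allocate_and_cache(block_hashes)
    return True


mgr = ToyPrefixCacheManager(num_gpu_blocks=16)
shared_prefix = ["h0", "h1", "h2"]
first = try_schedule(mgr, shared_prefix)             # registers 3 blocks
second_same_step = try_schedule(mgr, shared_prefix)  # hits unwritten blocks
mgr.new_step_starts()
second_next_step = try_schedule(mgr, shared_prefix)  # reuse is safe now
```

In this toy, the second request is deferred within the same step and admitted once `new_step_starts()` has cleared the set, which mirrors the deferral behavior the PR describes.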

Tests

Updated existing tests to call new_step_starts() between requests that belong to different scheduler steps.
Added test_no_intra_step_prefix_reuse and test_no_intra_step_prefix_reuse_with_lora as direct regression tests.

Fixes unsafe intra-step prefix reuse, which is a likely corruption path behind #38606.

Changed files

tests/v1/core/test_prefix_caching.py (modified, +125/-0)
vllm/v1/core/single_type_kv_cache_manager.py (modified, +37/-25)



Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

/nfshomes/yunze/miniconda3/envs/vllm-fuzz/lib/python3.11/site-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you. import pynvml # type: ignore[import] Collecting environment information...

    System Info

============================== OS : Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64) GCC version : (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28) Clang version : Could not collect CMake version : version 3.26.5 Libc version : glibc-2.28

============================== PyTorch Info

PyTorch version : 2.9.0+cu128 Is debug build : False CUDA used to build PyTorch : 12.8 ROCM used to build PyTorch : N/A

============================== Python Environment

Python version : 3.11.14 (main, Oct 21 2025, 18:31:21) [GCC 11.2.0] (64-bit runtime) Python platform : Linux-4.18.0-553.109.1.el8_10.x86_64-x86_64-with-glibc2.28

============================== CUDA / GPU Info

Is CUDA available : True CUDA runtime version : 13.1.115 CUDA_MODULE_LOADING set to : GPU models and configuration : GPU 0: NVIDIA RTX A6000 Nvidia driver version : 590.48.01 cuDNN version : Could not collect HIP runtime version : N/A MIOpen runtime version : N/A Is XNNPACK available : True

============================== CPU Info

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 2 NUMA node(s): 2 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7302 16-Core Processor Stepping: 0 CPU MHz: 3000.000 CPU max MHz: 3000.0000 CPU min MHz: 1500.0000 BogoMIPS: 6000.12 Virtualization: AMD-V L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 16384K NUMA node0 CPU(s): 0-15 NUMA node1 CPU(s): 16-31 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python 0.5.2 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cudnn-frontend 1.18.0 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.13.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-cutlass-dsl 4.4.1 pypi_0 pypi
[conda] nvidia-cutlass-dsl-libs-base 4.4.1 pypi_0 pypi
[conda] nvidia-ml-py 13.590.48 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvshmem-cu12 3.3.20 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] pynvml 13.0.1 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.9.0 pypi_0 pypi
[conda] torchaudio 2.9.0 pypi_0 pypi
[conda] torchvision 0.24.0 pypi_0 pypi
[conda] transformers 4.57.6 pypi_0 pypi
[conda] triton 3.5.0 pypi_0 pypi

============================== vLLM Info

ROCM Version : Could not collect vLLM Version : 0.11.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled GPU Topology: GPU0 NIC0 NIC1 NIC2 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS SYS NODE 4,9 0 N/A NIC0 SYS X PIX SYS NIC1 SYS PIX X SYS NIC2 NODE SYS SYS X

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_2 NIC1: mlx5_3 NIC2: mlx5_bond_0

============================== Environment Variables

LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64: CUDA_HOME=/opt/common/cuda/cuda-13.1.1 CUDA_HOME=/opt/common/cuda/cuda-13.1.1 CUDA_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0 PYTORCH_NVML_BASED_CUDA_CHECK=1 TORCHINDUCTOR_COMPILE_THREADS=1

</details>

🐛 Describe the bug

Possibly related to: #37076

When fuzzing vLLM 0.18.0 with LoRA, I found an independent trigger for KV cache block corruption, specific to multi-adapter deployments. Rapidly alternating between two LoRA adapters while long-prefix requests are mid-prefill causes non-deterministic output at temperature=0. Confirmed in 10/10 runs on a single minimal trace.

The same trace reproduces at 5/10 on a base-model server without --enable-lora, but the LoRA-aware KV cache manager doubles the reproduction rate, pointing to an additional aliasing surface in the per-adapter block namespace.

The divergence is not cancel-induced: the corrupted request (thrash_9) is sent at 72ms, and the first cancel in the trace does not occur until 225ms.

Why a different issue from #37076:

The cancel-induced path (reported separately) triggers corruption when a cancelled request's blocks are freed and recycled before the GPU has cleared them.

The trigger here differs in two ways:

First, cancellation is not required. A stripped-down reproduction script with no cancel or disconnect events still confirms corruption in 7–9/10 runs across four requests (thrash_1, thrash_3, thrash_5, thrash_9). The original fuzzer-found trace does contain cancels, but they are not the cause.

Second, LoRA namespace pressure. When --enable-lora is active, the KV cache manager keys blocks on (lora_id, block_hash) rather than just block_hash. Rapidly alternating between lora_a and lora_b at 8ms intervals while two anchor requests (r1, r2) are chunked-prefilling a 600-token shared prefix forces the cache manager to resolve cross-adapter block ownership at every scheduler step. The same trace confirms at only 5/10 on a base server, so the adapter-namespace multiplexing appears to be an amplifying factor that creates a new aliasing window on top of the base race.
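To make the (lora_id, block_hash) keying concrete, here is a minimal sketch of how per-block cache keys could be derived. It is an assumption-laden illustration, not vLLM's hashing code: `block_keys`, the SHA-256 chaining scheme, and the `BLOCK_SIZE` of 16 tokens are all hypothetical choices.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed value for this sketch)


def block_keys(token_ids, lora_id=None, block_size=BLOCK_SIZE):
    """Return one (lora_id, block_hash) key per full block of the prompt.

    Each block hash chains in the previous block's hash, so a key identifies
    the whole prefix up to and including that block.
    """
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append((lora_id, parent.hex()))
    return keys


prompt = list(range(600))                      # 600-token prompt, as in the trace
keys_a = block_keys(prompt, lora_id="lora_a")
keys_b = block_keys(prompt, lora_id="lora_b")

# Same tokens, different adapters: the hash part matches, the full key does not.
same_hash = keys_a[0][1] == keys_b[0][1]
same_key = keys_a[0] == keys_b[0]
```

Under this scheme two adapters never share a cache entry even for identical prompts, so any cross-adapter contamination would have to come from how blocks are registered or reused rather than from a key collision.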

Original Trace:

The original trace has 11 "thrash" requests that alternate between lora_a and lora_b every 8ms, bracketed by anchor requests with long shared prefixes. Note that reproduction results vary between runs. I ran 3 rounds of 10 runs each, and all of them confirmed the divergence. The table below shows the results from one of those rounds.

event	request	offset_ms	adapter	prompt_len	prefix_len	max_tokens	stream	diverged
send	r1	0	lora_a	600	512	256	true	✓
send	thrash_0	0	lora_a	600	599	256	true
send	thrash_1	8	lora_b	600	0	256	true	✓
send	thrash_2	16	lora_a	600	0	256	true	✓
send	thrash_3	24	lora_b	600	0	256	true	✓
send	thrash_4	32	lora_a	600	0	256	true	✓
send	thrash_5	40	lora_b	600	0	256	true	✓
send	thrash_6	48	lora_a	600	0	256	true	✓
send	r2	50	lora_b	600	599	256	true
send	thrash_7	56	lora_b	600	0	256	true
send	thrash_8	64	lora_a	600	0	256	true
send	thrash_9	72	lora_b	600	0	256	true	✓
send	thrash_10	80	lora_a	600	0	256	true
send	r3	100	lora_a	600	512	256	true
send	r4	150	lora_b	513	512	256	true
cancel	r3	225	—	—	—	—	—
send	r5	300	lora_b	600	512	256	true
send	r6	400	lora_a	600	512	256	true	✓
cancel	r5	402	—	—	—	—	—

Some interesting facts:

9 of the 17 requests in the 19-event trace diverge: r1, thrash_1 through thrash_6, thrash_9, and r6. The corruption is not isolated to one request; it propagates across the majority of the thrash burst.
r1 and r2 arrive simultaneously with nearly full prefix overlap (512/599 of 600 tokens), forcing a long chunked prefill spanning multiple scheduler steps.
Eleven thrash requests arrive every 8ms, alternating adapters. This creates sustained cross-adapter cache pressure throughout the prefill window.
Cancels at 225ms and 402ms are irrelevant to the divergence. The earliest diverging request (r1) fires at offset=0ms; the first cancel does not occur until 225ms.
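The thrash burst in the trace follows a simple pattern that can be generated programmatically. The sketch below is not taken from the fuzzer or from repro.py; `thrash_schedule` is a hypothetical helper that only reproduces the offsets and adapter alternation listed in the table.

```python
def thrash_schedule(count=11, interval_ms=8, adapters=("lora_a", "lora_b")):
    """Build the alternating-adapter burst: one request every interval_ms."""
    return [
        {
            "request": f"thrash_{i}",
            "offset_ms": i * interval_ms,
            "adapter": adapters[i % len(adapters)],
        }
        for i in range(count)
    ]


schedule = thrash_schedule()
for event in schedule:
    print(f"{event['offset_ms']:>3} ms  {event['request']:<10} {event['adapter']}")
```

Even-indexed requests go to lora_a and odd-indexed ones to lora_b, matching the table (for example, thrash_9 at 72ms on lora_b).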

Standalone repro output (no cancels, vLLM 0.11.2, 10 runs):

  [DIVERGED] thrash_1  (9/10 runs differ from run 1)
    Run  1: ' u9659_9860 u9659_1138 u9659_7561 u9659_2174 u9659_8725 ...'
    Run  2: ' u9659_298  u9659_998  u9659_7618 u9659_826  u9659_2823 ...'
  [DIVERGED] thrash_3  (7/10 runs differ from run 1)
    Run  1: ' u1112_834 u1112_134 u1112_884 u1112_134 u1112_884 ...'  ← repeating loop
    Run  2: ' u1112_834 u1112_134 u1112_884 u1112_2124 u1112_8077 ...'
  [DIVERGED] thrash_5  (9/10 runs differ from run 1)
    Run  1: ' u3439_888 u3439_2646 u3439_2446 u3439_888 u3439_984 ...'
    Run  2: ' u3439_832 u3439_2646 u3439_656  u3439_888 u3439_312 ...'
  [DIVERGED] thrash_9  (9/10 runs differ from run 1)
    Run  1: ' u8427_788 u8427_107 u8427_384 u8427_924 u8427_780 ...'
    Run  2: ' u8427_7886 u8427_107 u8427_218 u8427_8344 u8427_333 ...'

Notable: all four corrupted requests in the standalone repro are lora_b requests (odd-indexed thrash events), whereas in the original trace requests on both adapters diverge. The lora_a requests (thrash_2, thrash_4, thrash_6) are clean across all runs. This asymmetry is interesting and may be specific to this trace; I think one adapter's block registration window is consistently losing the race to the other's.
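The "N/10 runs differ from run 1" figures can be computed with a simple comparison against the first run. The following is a sketch of that check, not the logic of repro.py; `count_divergent_runs`, `report`, and the STABLE label are hypothetical, and the sample strings are made up for illustration.

```python
def count_divergent_runs(outputs):
    """Count runs whose output differs from run 1 (the first entry)."""
    if not outputs:
        return 0
    reference = outputs[0]
    return sum(1 for text in outputs[1:] if text != reference)


def report(request_id, outputs):
    """Format one summary line in the style of the repro output."""
    differing = count_divergent_runs(outputs)
    status = "DIVERGED" if differing else "STABLE"
    return f"[{status}] {request_id}  ({differing}/{len(outputs)} runs differ from run 1)"


# Made-up outputs for illustration only; not data from the issue.
runs = ["alpha beta", "alpha beta", "alpha gamma", "alpha delta"]
summary = report("example_request", runs)
print(summary)
```

At temperature=0 every run is expected to produce the same text, so any nonzero count is a signal worth investigating.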

Reproduce:

Here are the scripts you will need: gen_lora_weights.py, repro.py

Step 1 — generate LoRA weights (requires torch, transformers, peft):

python3 gen_lora_weights.py --output-dir /tmp/lora_weights

This creates two adapters (lora_a, lora_b) with different random seeds so they produce distinct outputs, making contamination immediately visible.

Step 2 — start vLLM:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --enable-lora \
    --lora-modules lora_a=/tmp/lora_weights/lora_a \
                   lora_b=/tmp/lora_weights/lora_b \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768

Step 3 — run the reproduction script (requires httpx):

python3 repro.py --base-url http://localhost:8000 --runs 10

Expected output: 4 of the 9 monitored requests (thrash_1, thrash_3, thrash_5, thrash_9) diverge in 7–9 of 10 runs. All four are lora_b requests; lora_a requests are clean.

To confirm the base race is also present without LoRA, restart vLLM without --enable-lora and rerun; expect ~5/10 with fewer diverged requests.
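For readers without the scripts, a single greedy request against one adapter can be issued as follows. This is a minimal sketch using only the standard library, not the contents of repro.py: `build_completion_request` and `complete` are hypothetical helpers, the sketch assumes the adapter is selected through the `model` field of the OpenAI-compatible `/v1/completions` endpoint, and it uses a non-streaming request for simplicity even though the trace uses streaming.

```python
import json
import urllib.request


def build_completion_request(base_url, adapter, prompt, max_tokens=256):
    """Build a greedy-decoding request that targets one LoRA adapter by name."""
    payload = {
        "model": adapter,      # adapter name as registered with --lora-modules
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,      # greedy decoding: repeated runs should match
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, adapter, prompt):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(base_url, adapter, prompt)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


req = build_completion_request("http://localhost:8000", "lora_b", "hello")
```

Calling `complete(...)` repeatedly with the same prompt and adapter, while other requests alternate adapters concurrently, is the kind of comparison the reproduction relies on.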


extent analysis

TL;DR

The issue can be mitigated by modifying the cache manager to handle cross-adapter block ownership more robustly, potentially by introducing additional synchronization or by modifying the cache key to reduce aliasing.

Guidance

Investigate the cache manager's handling of cross-adapter block ownership and consider introducing additional synchronization to prevent aliasing.
Modify the reproduction script to test different cache key strategies, such as using a unique identifier for each adapter, to reduce aliasing.
Consider implementing a mechanism to detect and handle cache corruption, such as checksums or error-correcting codes, to improve the robustness of the system.
Review the documentation for --enable-lora and --enable-prefix-caching to ensure that the expected behavior is clearly described and that any potential issues or limitations are highlighted.
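The checksum suggestion in the guidance above can be illustrated with a host-side toy. This is a sketch of the general technique only: `ChecksummedBlockStore` is hypothetical, real KV blocks live in GPU memory where such a check would carry a cost, and nothing here reflects an existing vLLM feature.

```python
import zlib


class ChecksummedBlockStore:
    """Toy block store that records a CRC32 per block at write time."""

    def __init__(self):
        self._data = {}
        self._crc = {}

    def write(self, block_id, payload: bytes):
        self._data[block_id] = payload
        self._crc[block_id] = zlib.crc32(payload)

    def read(self, block_id) -> bytes:
        payload = self._data[block_id]
        if zlib.crc32(payload) != self._crc[block_id]:
            raise ValueError(f"block {block_id} failed checksum verification")
        return payload


store = ChecksummedBlockStore()
store.write(3, b"kv-bytes-for-block-3")
store._data[3] = b"overwritten-by-another-request"   # simulate corruption
try:
    store.read(3)
    corrupted = False
except ValueError:
    corrupted = True
```

A check like this detects a block that changed after it was written; it would serve as a debugging aid rather than a fix for the underlying race.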

Example

No specific code example is provided, as the issue requires a deeper understanding of the cache manager and the LoRA-aware KV cache manager implementation. One approach is to include a unique identifier for each adapter in the cache key, such as:

cache_key = (lora_id, block_hash)

Note that, according to the issue description, the cache manager already keys blocks this way when --enable-lora is active, so this key on its own would not change current behavior.

Notes

The issue appears to be related to the interaction between the LoRA-aware KV cache manager and the cache manager, and may require a deeper understanding of the implementation details to resolve. The provided reproduction script and standalone repro output can be used to test and validate any potential fixes.

Recommendation

Apply a workaround by modifying the cache manager to handle cross-adapter block ownership more robustly, such as by introducing additional synchronization or modifying the cache key to reduce aliasing. This can help mitigate the issue until a more permanent fix can be implemented.
