vllm - ✅(Solved) Fix [Bug]: KV block corruption in base scheduler, non-deterministic output at temperature=0 without prefix caching [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#39146Fetched 2026-04-08 03:01:46
View on GitHub
Comments
0
Participants
1
Timeline
1
Reactions
6
Author
Participants
Timeline (top)
labeled ×1

Root Cause

According to the summary of PR #39283, the KV block zeroing pipeline from #35219 was gated to Mamba-only models. FullAttention models could therefore receive recycled KV blocks whose partial-block tail slots still held stale K/V, which can propagate NaN through the masked softmax. This is distinct from #37076 / PR #37164, which require --enable-prefix-caching and shared prefix content; the bug here sits in the base scheduler's non-APC path.

Fix Action

Fix / Workaround

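The fix for this issue is PR #39283 (notes below), which enables zeroing of recycled KV cache blocks for FullAttention models. #37076 / PR #37164 is a separate fix: it addresses a TOCTOU race where cache_full_blocks inserts newly allocated blocks into the prefix cache hash table before the GPU forward pass completes, and it pre-pins blocks inside get_computed_blocks().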

PR fix notes

PR #39283: [Bugfix] Zero recycled KV cache blocks for FullAttention models

Description (problem / solution / changelog)

Summary

Closes #39146. The KV block zeroing pipeline from #35219 was gated to Mamba-only models; enabling it for FullAttention prevents stale K/V in partial-block tail slots from propagating NaN through masked softmax.
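A minimal sketch of why a masked position can still poison the result, assuming the stale tail slot happens to hold a non-finite value. The function name masked_attention and the toy numbers are illustrative only; this is plain Python, not vLLM's attention kernel. The mask drives the weight to exactly 0.0, but 0.0 * NaN is still NaN, so the weighted sum is contaminated unless the slot was zeroed.

```python
import math


def masked_attention(scores, mask, values):
    """Softmax over scores (masked slots get -inf), then a weighted sum of values."""
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, values))


scores = [0.0, 0.0, 0.0, 0.0]
mask = [True, True, False, False]  # last two slots are the unused block tail

# Tail slots hold leftover non-finite data (assumed for illustration).
stale = masked_attention(scores, mask, [1.0, 3.0, float("nan"), float("nan")])
# Tail slots were zeroed when the block was recycled.
clean = masked_attention(scores, mask, [1.0, 3.0, 0.0, 0.0])

print(stale, clean)
```

Whether real stale slots contain NaN or merely wrong finite numbers depends on what the previous owner left behind; the sketch only shows that a zero attention weight is not enough on its own to neutralise a non-finite value.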

Changed files

  • tests/v1/core/test_kv_cache_utils.py (modified, +20/-0)
  • vllm/v1/kv_cache_interface.py (modified, +5/-1)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
/nfshomes/yunze/miniconda3/envs/vllm-fuzz/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Collecting environment information...
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version                  : (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.28

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Oct 21 2025, 18:31:21) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.109.1.el8_10.x86_64-x86_64-with-glibc2.28

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.1.115
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA RTX A6000
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7302 16-Core Processor
Stepping:            0
CPU MHz:             3000.000
CPU max MHz:         3000.0000
CPU min MHz:         1500.0000
BogoMIPS:            6000.12
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                    0.5.2            pypi_0           pypi
[conda] numpy                                2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                   12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12               12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12               12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12             12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                    9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                1.18.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                    11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                   1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                   10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                 11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                 12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12               0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                   4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base         4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                         13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                     2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                  3.3.20           pypi_0           pypi
[conda] nvidia-nvtx-cu12                     12.8.90          pypi_0           pypi
[conda] pynvml                               13.0.1           pypi_0           pypi
[conda] pyzmq                                27.1.0           pypi_0           pypi
[conda] torch                                2.9.0            pypi_0           pypi
[conda] torchaudio                           2.9.0            pypi_0           pypi
[conda] torchvision                          0.24.0           pypi_0           pypi
[conda] transformers                         4.57.6           pypi_0           pypi
[conda] triton                               3.5.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	NODE	4,9	0		N/A
NIC0	SYS	 X 	PIX	SYS				
NIC1	SYS	PIX	 X 	SYS				
NIC2	NODE	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64:
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Related to: #37076, PR #37164

Summary

We fuzzed with prefix caching enabled but forgot to fuzz without it 😅. While testing --speculative-config, we found a KV block corruption bug that reproduces without --enable-prefix-caching. Identical prompts at temperature=0 produce completely different output sequences across runs, confirmed in 10/10 runs on three independent traces.

The findings were originally discovered with --speculative-config active, but a controlled isolation test (re-running each trace against a server with speculative decoding removed) confirmed that all three reproduce identically without it. The minimum reproduction config is a fully stock vLLM server: no APC, no spec, no LoRA.

This is distinct from #37076, which requires --enable-prefix-caching and shared prefix content. PR #37164 addresses the TOCTOU race inside get_computed_blocks(); although it is not merged yet, that race should not affect base vLLM. So these findings point to a separate block lifecycle bug in the base scheduler's non-APC path.

Background: how this differs from #37076 and PR #37164

#37076 / PR #37164 fix a TOCTOU race where cache_full_blocks inserts newly allocated blocks into the prefix cache hash table before the GPU forward pass completes. The patch pre-pins blocks inside get_computed_blocks().

As I see it, what we have here is independent of #37076 in two respects:

  1. No --enable-prefix-caching required. get_computed_blocks() is never called without APC. PR #37164 does not touch this code path.

  2. No shared prefix required. All requests in our traces have completely unique prompts (prefix_len=0, distinct token sequences). There is no shared cache content to race over.

The corruption reproduces with 4–5 concurrent requests on a fully default server. Any production deployment is potentially affected.

Primary finding — finding_00450 (cleanest)

Note on attached JSON artifacts: the server_flags field in each finding JSON reflects the original discovery config, which included --speculative-config. This field is recorded at discovery time and is not updated by subsequent isolation tests. The isolation test results are reported separately above and confirm spec is not required.

Five requests, no shared state, no cancellations involved in the corruption.

event   request  offset_ms  prompt_len  prefix_len  max_tokens  stream  diverged
send    r1       0          512         0           512         true    -
send    r2       100        512         0           512         true    ✓
send    r3       200        512         0           512         true    ✓
send    r4       300        512         0           512         true    ✓
send    r5       2000       8192        0           16          true    -
cancel  r1       3605       -           -           -           -       -

Key observations:

  • r1 and r5 are clean across all 10 runs. r2, r3, r4 diverge in every run.
  • The cancel of r1 occurs at 3605ms — long after r2/r3/r4 would have completed. It is not the cause.
  • r5 (8192 tokens) is a large request submitted 2 seconds after the short ones. Its memory pressure changes the block allocation state visible to subsequent runs.
  • No prefix sharing, no APC, no spec engine involvement.
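The timing of this trace can be sketched as a small replay loop. This is an illustration under stated assumptions, not the author's repro.py: the Event class, the replay function, and the time_scale parameter are hypothetical names, and only the offsets come from the finding_00450 table. The dispatch callback stands in for whatever actually sends or cancels a request.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str        # "send" or "cancel"
    request: str     # request id from the trace
    offset_ms: int   # milliseconds after the start of the replay


# Offsets copied from the finding_00450 table.
FINDING_00450 = [
    Event("send", "r1", 0),
    Event("send", "r2", 100),
    Event("send", "r3", 200),
    Event("send", "r4", 300),
    Event("send", "r5", 2000),
    Event("cancel", "r1", 3605),
]


async def replay(events, dispatch, time_scale=1.0):
    """Call dispatch(event) at each event's offset, measured from replay start."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    for event in sorted(events, key=lambda e: e.offset_ms):
        target = start + event.offset_ms / 1000.0 * time_scale
        await asyncio.sleep(max(0.0, target - loop.time()))
        dispatch(event)


# Dry run: record the firing order, with time compressed 1000x.
fired = []
asyncio.run(replay(FINDING_00450, fired.append, time_scale=0.001))
print([(e.kind, e.request) for e in fired])
```

In a real replay, dispatch would start a streaming request as a background task (or cancel one) rather than append to a list, so that the four short requests overlap as they do in the trace.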

Second finding: finding_01410, same pattern as above :)

A more heavily mutated trace with 21 concurrent requests (mix of 3000-token and 512-token prompts), all prefix_len=0. 11 of 21 requests diverge in 10/10 runs. The larger batch and mixed sizes amplify the corruption rate, consistent with the hypothesis that block allocation order under concurrency is the trigger.

diverged: r2, r3, r1_b, r2_b, r4_b, r5_s_s_b, r4, r5_s_s_b_b,
          r4_b_b, r4_s_s_b_storm_b, r4_s_s_b_storm  (11 / 21 total)
runs_diverged: 10 / 10
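A summary like the one above could be computed roughly as follows. This is a sketch under assumptions: the issue does not show how repro.py defines divergence, so the choice of comparing every run against a single reference output set, the function name summarize, and the dict-per-run data shape are all hypothetical.

```python
def summarize(reference, runs):
    """Compare each run's outputs against a reference set of outputs.

    reference: {request_id: output_text}
    runs:      list of {request_id: output_text}, one dict per run
    Returns (sorted ids that diverged in any run, number of runs with a mismatch).
    """
    diverged = set()
    runs_diverged = 0
    for run in runs:
        mismatched = {req for req, text in reference.items() if run.get(req) != text}
        diverged |= mismatched
        runs_diverged += bool(mismatched)
    return sorted(diverged), runs_diverged


# Toy data: r1 is stable, r2 changes in both runs, r3 changes in one.
reference = {"r1": "alpha", "r2": "beta", "r3": "gamma"}
runs = [
    {"r1": "alpha", "r2": "BETA?", "r3": "gamma"},
    {"r1": "alpha", "r2": "b3ta", "r3": "gamm4"},
]
ids, count = summarize(reference, runs)
print(f"diverged: {', '.join(ids)}  ({len(ids)} / {len(reference)} total)")
print(f"runs_diverged: {count} / {len(runs)}")
```

At temperature=0 any mismatch is a signal, so exact string comparison is used here instead of a similarity threshold.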

Related finding — finding_00030 (cancel path)

A cancel/retry pattern: 5 requests cancelled mid-generation, 5 fresh retries sent 60ms later. The original requests (r01–r05) are clean. The retry requests (r01_retry–r05_retry) diverge in 10/10 runs.

This is potentially a different issue. I am including it here because I suspect the underlying cause is the same, but I am not entirely sure yet.

event   request              offset_ms  prompt_len  prefix_len  diverged
send    r01–r05              0–40       256         0           -
cancel  r01–r05              200–240    -           -           -
send    r01_retry–r05_retry  300–340    256         0           ✓ all 5

Isolation: speculative decoding is not the cause

Because the findings were discovered with --speculative-config in use, we re-ran each trace against a server with speculative decoding fully removed to rule out the spec engine as the cause. All three reproduced identically — same diverged requests, same 10/10 rate.

My hypothesis

We know that without --enable-prefix-caching, the V1 scheduler's block allocator does not track block identity through a hash table. When requests complete or are cancelled, their KV blocks are returned to the free pool. If those blocks are not zeroed before reuse, a subsequent request that receives them will decode from stale KV data belonging to a different request.

The pattern in finding_00450 (r1 and r5 clean, r2/r3/r4 corrupted) is consistent with r1's blocks being the "first" fresh allocation (the pool is clean on the very first run), while r2/r3/r4 receive blocks recycled from the completed requests of a prior reproduction run. The large r5 (8192 tokens) changes the block pressure enough that, across successive runs, the allocation order and thus the distribution of "dirty" blocks shifts, producing different outputs each time.

finding_00030's cancel path is the same mechanism, triggered by a few explicit cancellations: r01–r05 are cancelled mid-generation, which frees their blocks immediately. The retries arrive 60ms later and receive those dirty blocks.

Again, this seems different from #37076's uninitialized-but-registered block race. There, a block is registered in the hash table before its GPU data is written. Here, a block that previously held valid data for request A is recycled to request B without clearing the GPU memory first.
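The hypothesised mechanism can be illustrated with a toy free pool. This is a sketch only: ToyBlockPool, BLOCK_SIZE, and the zero_on_reuse switch are hypothetical names, the "KV data" is a plain Python list rather than a GPU tensor, and nothing here mirrors vLLM's actual allocator code. Request A fills a block and releases it; request B gets the same block and writes only part of it, so the tail slots hold whatever A left unless the block is cleared on reuse.

```python
BLOCK_SIZE = 4


class ToyBlockPool:
    def __init__(self, num_blocks, zero_on_reuse):
        self.storage = [[0.0] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free = list(range(num_blocks))
        self.zero_on_reuse = zero_on_reuse

    def allocate(self):
        block_id = self.free.pop(0)
        if self.zero_on_reuse:
            # Clear leftovers from the previous owner before handing the block out.
            self.storage[block_id] = [0.0] * BLOCK_SIZE
        return block_id

    def release(self, block_id):
        # The data is left in place; only ownership changes.
        self.free.append(block_id)


def tail_after_reuse(zero_on_reuse):
    pool = ToyBlockPool(num_blocks=1, zero_on_reuse=zero_on_reuse)
    a = pool.allocate()
    pool.storage[a] = [1.0, 2.0, 3.0, 4.0]   # request A fills the whole block
    pool.release(a)
    b = pool.allocate()                      # request B receives the same block
    pool.storage[b][:2] = [9.0, 9.0]         # B writes only the first two slots
    return pool.storage[b][2:]               # what sits in B's unused tail


print(tail_after_reuse(zero_on_reuse=False))
print(tail_after_reuse(zero_on_reuse=True))
```

The sketch zeroes at allocation time purely for brevity; zeroing on release would give the same result for this toy case.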

Reproduction:

You will need these findings:

  • primary: finding_00450_862114934.json
  • second (corroboration): finding_01410_1760617970.json
  • cancel/retry: finding_00030_999829240.json

and repro.py

Step 1 — start vLLM as it is:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768

Note: Make sure the finding files are in the same directory as repro.py, and do not rename them; the script imports them directly by name.

Step 2 — run the script (requires httpx):

python3 repro.py --base-url http://localhost:8000
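For readers without the attached repro.py, a much simpler determinism probe might look like the sketch below. It is not the author's script: it uses the standard library instead of httpx, the helper names are hypothetical, and it sends requests one at a time, whereas the findings above need 4–5 concurrent requests to trigger the corruption. It assumes the server exposes the OpenAI-compatible /v1/completions route and returns the text under choices[0].text.

```python
import json
import urllib.request


def build_payload(prompt, model="Qwen/Qwen2.5-0.5B-Instruct", max_tokens=64):
    """Request body for a greedy (temperature=0) completion."""
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def complete(base_url, payload, timeout=60.0):
    """POST the payload to the completions route and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["choices"][0]["text"]


def is_deterministic(base_url, prompt, runs=10):
    """True if every run of the same greedy request returns the same text."""
    payload = build_payload(prompt)
    outputs = {complete(base_url, payload) for _ in range(runs)}
    return len(outputs) == 1
```

A call such as is_deterministic("http://localhost:8000", "Hello") needs the server from Step 1 to be running; a True result from this sequential probe does not rule out the concurrent-request bug described in this issue.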

extent analysis

TL;DR

The most likely fix is to ensure that KV blocks are properly cleared before being reused by the V1 scheduler's block allocator.

Guidance

  • Review the block allocation and deallocation logic in the V1 scheduler to identify where blocks are not being properly cleared.
  • Implement a mechanism to zero out KV blocks before they are returned to the free pool to prevent stale data from being reused.
  • Verify that the fix resolves the issue by re-running the reproduction script with the modified code.
  • Consider adding additional logging or debugging statements to monitor block allocation and deallocation to catch any similar issues in the future.

Example

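# Example of how to zero out a block before returning it to the free pool.
# `block` is any object with a mutable `kv_data` list (illustrative, not a vLLM API).
free_pool = []

def return_block_to_pool(block):
    # Zero out the block's KV data in place so no stale values survive
    block.kv_data[:] = [0] * len(block.kv_data)
    # Return the block to the free pool
    free_pool.append(block)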

Notes

  • The issue seems to be related to the V1 scheduler's block allocator not properly handling block reuse, leading to stale data being used by subsequent requests.
  • The provided reproduction script and findings should be used to verify the fix and ensure that it resolves the issue.

Recommendation

Apply the workaround of zeroing out KV blocks before returning them to the free pool to prevent stale data from being reused. This should resolve the issue and prevent similar problems in the future.
