vllm - ✅(Solved) Fix [Bug]: Async cancel/resubmit with reused runtime request id can drive num_output_placeholders below zero [1 pull request, 2 comments, 3 participants]

GitHub stats
vllm-project/vllm#42492 · Fetched 2026-05-14 03:29:40
Comments: 2 · Participants: 3 · Timeline events: 10 · Reactions: 0
Timeline (top): commented ×2, mentioned ×2, subscribed ×2, closed ×1

I found a frontend-legal async request-identity bug.

The reproducer first queues work for one live request (req2a), then cancels that request, and finally submits a new request (req2b) that reuses the same runtime request id (late_req2).

Because the old async batch is still queued, draining that old batch later can apply its output to the new live request object instead of the cancelled one. When that happens, vLLM decrements request.num_output_placeholders on the wrong request object and the async scheduler invariant below fails:

assert request.num_output_placeholders >= 0

This is not a malformed-input bug. The workload is frontend-legal. The failure comes from stale queued output being rebound to the wrong live request after runtime request id reuse.

This issue is different from the double-streaming_update row-mapping issue. This one specifically depends on cancel -> resubmit with the same runtime request id -> drain old queued batch.

Error Message

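  • error.txt on failure

Representative traceback from the verified run: an AssertionError raised by assert request.num_output_placeholders >= 0 in vllm/v1/core/sched/async_scheduler.py (line 53, _update_request_with_output), reached from update_from_output in vllm/v1/core/sched/scheduler.py.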

Root Cause

This is not a malformed-input bug. The problem is that async queued work for an old request can outlive that request object, while a new request is allowed to reuse the same runtime request identity.

In the verified run, the old req2a batch is still queued after cancellation. Then req2b is submitted with the same runtime request id. When the old batch finally drains, vLLM updates the current live request object for that id, which is now req2b, not the cancelled req2a.

That causes stale queued output from the old request to consume placeholder credit on the new request. The result is that request.num_output_placeholders is decremented on the wrong object and falls below zero.

So the real bug is in how async queued output is rebound during runtime request id reuse after cancel/resubmit.
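The mechanism described above can be illustrated with a toy model. This is only a sketch, not vLLM code: ToyRequest, ToyScheduler, and reproduce_underflow are hypothetical names, and the real scheduler tracks far more state. It shows how a drain loop that looks requests up by id alone charges the old batch's output to the new object.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for vLLM's Request; only the fields needed here."""
    request_id: str
    num_output_placeholders: int = 0


class ToyScheduler:
    """Toy model of name-keyed draining; not vLLM's real Scheduler."""

    def __init__(self) -> None:
        self.requests: dict[str, ToyRequest] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.requests[request.request_id] = request

    def cancel(self, request_id: str) -> None:
        self.requests.pop(request_id, None)

    def schedule(self) -> dict[str, int]:
        # Reserve one placeholder per live request and return the batch as
        # req_id -> number of tokens the batch will eventually produce.
        batch: dict[str, int] = {}
        for req_id, request in self.requests.items():
            request.num_output_placeholders += 1
            batch[req_id] = 1
        return batch

    def drain(self, batch: dict[str, int]) -> None:
        for req_id, num_new_tokens in batch.items():
            # Name-keyed lookup: resolves to whichever object owns the id now.
            request = self.requests.get(req_id)
            if request is None:
                continue
            request.num_output_placeholders -= num_new_tokens
            assert request.num_output_placeholders >= 0


def reproduce_underflow() -> int:
    sched = ToyScheduler()
    sched.add_request(ToyRequest("late_req2"))  # plays the role of req2a
    old_batch = sched.schedule()                # req2a now holds 1 placeholder
    sched.cancel("late_req2")
    req2b = ToyRequest("late_req2")             # same id, different object
    sched.add_request(req2b)
    try:
        sched.drain(old_batch)                  # stale output lands on req2b
    except AssertionError:
        pass
    return req2b.num_output_placeholders
```

In this toy, req2b never reserved a placeholder, so the stale decrement takes its counter from 0 to -1 and the invariant fails.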

Fix Action

Fix / Workaround


PR fix notes

PR #42523: [Bugfix][V1] Drop stale outputs after runtime request id reuse

Description (problem / solution / changelog)

Summary

Fixes #42492.

In v1 async scheduling, an async batch produced by Scheduler.schedule() can be drained later by Scheduler.update_from_output(). The drain loop looks up each request via self.requests.get(req_id) — a name-keyed lookup. If the request is cancelled and a new request is submitted reusing the same runtime request id between the two calls, that lookup resolves to the new request, and the old batch's output is applied to it. This decrements num_output_placeholders on a request that never participated in the old batch, underflowing past zero and tripping assert request.num_output_placeholders >= 0 in async_scheduler.py:53.

The bug is frontend-legal: the API server and tests both allow id reuse after cancellation. The reproducer in the issue confirms the assertion crash.

Root cause

request_id is a name. Across cancel + resubmit, the name is reused but the underlying Request object identity changes. The scheduler's async accounting (num_output_placeholders, spec rejection refunds, encoder-input frees, etc.) is tied to the specific Request object that was scheduled, so a name-only lookup at drain time can bind the output to the wrong object.

Fix

Snapshot per-batch Request object identity at schedule time and use it to gate the drain loop:

  1. Scheduler.__init__: add self._inflight_request_snapshots: dict[int, dict[str, Request]] = {}.
  2. Scheduler._update_after_schedule: build a req_id -> Request mapping in the existing per-request loop and store it under id(scheduler_output).
  3. Scheduler.update_from_output: pop the snapshot at entry (always — so id() reuse cannot resurrect a stale entry), and in the per-req drain loop, after the existing is_finished() early-continue, skip any req where request_snapshot[req_id] is not request.

The snapshot is bounded by the live SchedulerOutputs (≤ batch_queue_size in PP mode, ≤ 1 otherwise) and has no IPC implications — id() is only used in the scheduler process, where each SchedulerOutput Python object is held alive by the caller (EngineCore.step / step_with_batch_queue's batch_queue) until update_from_output runs. The happy path is unaffected: request_snapshot[req_id] is request holds whenever the id has not been reused.

AsyncScheduler does not override update_from_output and chains super() in _update_after_schedule, so the fix in the base class covers both schedulers without subclass changes.
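As a rough illustration of the three steps above, the sketch below applies the same idea to a toy scheduler. Only the attribute name _inflight_request_snapshots and the `is not` identity check come from the PR description; ToyRequest, ToyBatch, SnapshotScheduler, and demo_stale_drop are hypothetical, and the actual patch in scheduler.py may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for vLLM's Request."""
    request_id: str
    num_output_placeholders: int = 0


class ToyBatch:
    """Hypothetical stand-in for SchedulerOutput."""

    def __init__(self, num_tokens: dict[str, int]) -> None:
        self.num_tokens = num_tokens


class SnapshotScheduler:
    """Toy scheduler that gates draining on Request object identity."""

    def __init__(self) -> None:
        self.requests: dict[str, ToyRequest] = {}
        # id(batch) -> {req_id -> the Request object scheduled into that batch}
        self._inflight_request_snapshots: dict[int, dict[str, ToyRequest]] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.requests[request.request_id] = request

    def cancel(self, request_id: str) -> None:
        self.requests.pop(request_id, None)

    def schedule(self) -> ToyBatch:
        batch = ToyBatch({req_id: 1 for req_id in self.requests})
        snapshot: dict[str, ToyRequest] = {}
        for req_id, request in self.requests.items():
            request.num_output_placeholders += 1
            snapshot[req_id] = request
        # The caller must keep `batch` alive until it is drained, so that
        # id(batch) stays unique for the lifetime of this entry.
        self._inflight_request_snapshots[id(batch)] = snapshot
        return batch

    def update_from_output(self, batch: ToyBatch) -> list[str]:
        # Always pop, so a later object with the same id() cannot find
        # a leftover entry.
        snapshot = self._inflight_request_snapshots.pop(id(batch), {})
        dropped: list[str] = []
        for req_id, num_new_tokens in batch.num_tokens.items():
            request = self.requests.get(req_id)
            if request is None:
                continue
            if snapshot.get(req_id) is not request:
                dropped.append(req_id)  # id was reused: output is stale
                continue
            request.num_output_placeholders -= num_new_tokens
            assert request.num_output_placeholders >= 0
        return dropped


def demo_stale_drop() -> tuple[list[str], int]:
    sched = SnapshotScheduler()
    sched.add_request(ToyRequest("late_req2"))
    old_batch = sched.schedule()
    sched.cancel("late_req2")
    req2b = ToyRequest("late_req2")
    sched.add_request(req2b)
    dropped = sched.update_from_output(old_batch)
    return dropped, req2b.num_output_placeholders
```

Here the stale output for late_req2 is skipped and req2b keeps a placeholder count of 0. When the id is not reused, the identity check passes and the drain behaves as before.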

Tests

  • New regression test test_async_runtime_id_reuse_after_cancel_no_underflow in tests/v1/core/test_async_scheduler.py (CPU-only, pytest.mark.cpu_test). It schedules request_a with id late_req2, cancels it, submits request_b reusing late_req2, then drains the old scheduler_output. Without the fix, the assertion fires; with the fix, request_b.num_output_placeholders == 0 and request_b is untouched.
  • Two existing tests (test_abort_request_when_structured_output_fsm_cannot_advance in both test_scheduler.py and test_async_scheduler.py) construct a Scheduler via object.__new__ and bypass __init__. Each gets a one-line scheduler._inflight_request_snapshots = {} in its manual setup so the new attribute is present.
$ pytest tests/v1/core/test_scheduler.py tests/v1/core/test_async_scheduler.py -m cpu_test
================ 106 passed, 17 warnings in 176.58s (0:02:56) =================

I also verified the new test fails on main and passes on this branch by reverting only the scheduler.py hunk.

Incidental finding (disclosure, not in this PR)

While auditing the scheduler for other places that look up a request by request_id after a cross-step delay, I noticed the same identity-vs-name shape in Scheduler._update_from_kv_xfer_finished (the KV connector finished_recving / finished_sending path): a stale finished_sending for a reused id would reach _free_blocks(self.requests[req_id]) and free the new request's blocks. It is not reproduced by #42492 (the traceback there is purely the placeholder underflow), and the KV transfer path drains a different output structure with no back-pointer to the originating SchedulerOutput, so the per-SchedulerOutput snapshot mechanism used here does not transplant to it directly (a request-local or per-transfer identity token would be the natural fit). Flagging it here so it is not forgotten; happy to file a separate issue and/or follow-up PR if maintainers want it tracked, but keeping this PR focused on the one reproducer in #42492.
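One way to picture the "per-transfer identity token" idea mentioned above is sketched below. Everything here is an assumption for illustration: identity_token, ToyKVScheduler, start_transfer, and on_transfer_finished are hypothetical names, not part of vLLM's Request or KV connector API, and the real finished_sending / finished_recving structures may not be able to carry such a token without further changes.

```python
import itertools
from dataclasses import dataclass, field

# Process-local counter; each new ToyRequest gets a fresh token.
_next_token = itertools.count()


@dataclass
class ToyRequest:
    request_id: str
    # Hypothetical per-object identity token; not a field on vLLM's Request.
    identity_token: int = field(default_factory=lambda: next(_next_token))


class ToyKVScheduler:
    """Toy model of a transfer-finished path gated by an identity token."""

    def __init__(self) -> None:
        self.requests: dict[str, ToyRequest] = {}
        self.freed: list[tuple[str, int]] = []  # (req_id, token) actually freed

    def add_request(self, request: ToyRequest) -> None:
        self.requests[request.request_id] = request

    def cancel(self, request_id: str) -> None:
        self.requests.pop(request_id, None)

    def start_transfer(self, req_id: str) -> tuple[str, int]:
        # The completion event carries the token of the object that started it.
        return req_id, self.requests[req_id].identity_token

    def on_transfer_finished(self, finished: list[tuple[str, int]]) -> None:
        for req_id, token in finished:
            request = self.requests.get(req_id)
            if request is None or request.identity_token != token:
                continue  # stale event for a cancelled object: free nothing
            self.freed.append((req_id, token))
            del self.requests[req_id]


def demo_stale_transfer() -> tuple[bool, bool]:
    sched = ToyKVScheduler()
    sched.add_request(ToyRequest("late_req2"))
    stale_event = sched.start_transfer("late_req2")
    sched.cancel("late_req2")
    new_request = ToyRequest("late_req2")  # same id, new token
    sched.add_request(new_request)
    sched.on_transfer_finished([stale_event])
    survived = sched.requests.get("late_req2") is new_request
    sched.on_transfer_finished([sched.start_transfer("late_req2")])
    freed_only_new = sched.freed == [("late_req2", new_request.identity_token)]
    return survived, freed_only_new
```

Unlike the per-SchedulerOutput snapshot, the token travels with the event itself, which is why it could suit a path that has no back-pointer to the originating batch.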

Duplicate-work check

$ gh pr list --repo vllm-project/vllm --state open --search "42492 in:body"
(none)
$ gh pr list --repo vllm-project/vllm --state open --search "num_output_placeholders"
(no matching PR addresses runtime id reuse / cancel+resubmit)

#35759 (AsyncScheduler streaming-input / chunked-prefill crash) and #38624 (preempt path async tokens) both touch async scheduler accounting but address different code paths.

Changed files

  • tests/v1/core/test_async_scheduler.py (modified, +45/-0)
  • tests/v1/core/test_scheduler.py (modified, +1/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +37/-0)

Code Example

assert request.num_output_placeholders >= 0

---

# repro_g10_sameid_underflow_local.py
   {"type": "run_async_queue_until_deferred", "label": "queue_old_req2a_batch", ...}
   {"type": "cancel", "label": "cancel_late_target_immediate", ...}
   {"type": "submit", "label": "submit_late_reentry_sameid", ...}
   {"type": "drain_async_queue", "label": "drain_old_req2a_batch"}

---

# vllm/v1/core/sched/scheduler.py
   def add_request(self, request: Request) -> None:
       ...
       self.waiting.add_request(request)
       self.requests[request.request_id] = request

---

# vllm/v1/core/sched/scheduler.py
   def finish_requests(...):
       ...
       request.status = finished_status
       self._free_request(request, ...)

   def _free_blocks(self, request: Request):
       ...
       del self.requests[request.request_id]

---

# vllm/v1/core/sched/async_scheduler.py
   cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))
   request.num_output_placeholders += 1 + cur_num_spec_tokens
   request.spec_token_ids = self._spec_token_placeholders

---

# vllm/v1/core/sched/scheduler.py
   request = self.requests[req_id]
   ...
   new_token_ids, stopped = self._update_request_with_output(
       request, new_token_ids
   )

---

# vllm/v1/core/sched/async_scheduler.py
   request.num_output_placeholders -= len(new_token_ids)
   assert request.num_output_placeholders >= 0

---

export POC_PY=/path/to/python3
export G10_LOCAL=/path/to/repro_g10_sameid_underflow_local.py
export VLLM_POC_G10_MODEL=/path/to/qwen3

CUDA_VISIBLE_DEVICES=0 "$POC_PY" "$G10_LOCAL" \
  --model "$VLLM_POC_G10_MODEL" \
  --run-name g10_sameid_underflow_local

---

File ".../vllm/v1/core/sched/scheduler.py", line 1358, in update_from_output
    new_token_ids, stopped = self._update_request_with_output(
  File ".../vllm/v1/core/sched/async_scheduler.py", line 53, in _update_request_with_output
    assert request.num_output_placeholders >= 0
AssertionError

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0
Clang version                : 16.0.6 (++20231112100510+7cbf1a259152-1~exp1~20231112100554.106)
CMake version                : version 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:56:27) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-6.5.0-35-generic-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.61
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version        : 570.86.10
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.3.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             384
On-line CPU(s) list:                0-383
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9654 96-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 96
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3707.8120
CPU min MHz:                        1500.0000
BogoMIPS:                           4792.57
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          6 MiB (192 instances)
L1i cache:                          6 MiB (192 instances)
L2 cache:                           192 MiB (192 instances)
L3 cache:                           768 MiB (24 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-95,192-287
NUMA node1 CPU(s):                  96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] optree==0.15.0
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu128
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu128
[pip3] torchvision==0.25.0+cu128
[pip3] transformers==4.56.1
[pip3] triton==3.6.0
[conda] flashinfer-python         0.6.4                    pypi_0    pypi
[conda] numpy                     2.0.2                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.18.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.4.2                    pypi_0    pypi
[conda] nvidia-cutlass-dsl-libs-base 4.4.2                    pypi_0    pypi
[conda] nvidia-ml-py              13.590.48                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.4.5                    pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] optree                    0.15.0                   pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.10.0+cu128             pypi_0    pypi
[conda] torch-c-dlpack-ext        0.1.5                    pypi_0    pypi
[conda] torchaudio                2.10.0+cu128             pypi_0    pypi
[conda] torchvision               0.25.0+cu128             pypi_0    pypi
[conda] transformers              4.56.1                   pypi_0    pypi
[conda] triton                    3.6.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: 8.9; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    96-191,288-383  1               N/A
GPU1    NODE     X      96-191,288-383  1               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
TORCH_CUDA_ARCH_LIST=8.9
CUDA_PATH=/usr/local/cuda
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/home/neil/code/llm/llama.cpp/build-cuda/bin
CUDA_HOME=/usr/local/cuda
CUDAToolkit_ROOT=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_neil
</details>

🐛 Describe the bug

Version: vLLM 0.17.1
Model: Qwen/Qwen3-0.6B-GPTQ-Int8
Hardware reproduced on: NVIDIA GeForce RTX 4090, single GPU

Summary

I found a frontend-legal async request-identity bug.

The reproducer first queues work for one live request (req2a), then cancels that request, and finally submits a new request (req2b) that reuses the same runtime request id (late_req2).

Because the old async batch is still queued, draining that old batch later can apply its output to the new live request object instead of the cancelled one. When that happens, vLLM decrements request.num_output_placeholders on the wrong request object and the async scheduler invariant below fails:

assert request.num_output_placeholders >= 0

This is not a malformed-input bug. The workload is frontend-legal. The failure comes from stale queued output being rebound to the wrong live request after runtime request id reuse.

This issue is different from the double-streaming_update row-mapping issue. This one specifically depends on cancel -> resubmit with the same runtime request id -> drain old queued batch.

Trigger chain

  1. Submit req0 and req1 so the engine is already running a live async mixed batch.
  2. Submit req2a with runtime request id late_req2.
  3. Let the engine queue an async batch that still contains work for req2a.
  4. Cancel req2a.
  5. Submit a new request req2b that reuses the same runtime request id late_req2.
  6. Drain the old queued batch after req2b is already live.
  7. The old batch output is applied to the new request object instead of the cancelled old one.
  8. request.num_output_placeholders is decremented on the wrong request object, falls below zero, and triggers the async scheduler assertion.

Details

Trigger path in code

  1. The reproducer explicitly creates this runtime-id reuse sequence:
    • queue old req2a batch
    • cancel req2a
    • submit req2b with the same runtime_request_id="late_req2"
    • drain the old queued batch
    This is visible directly in the standalone script:
    # repro_g10_sameid_underflow_local.py
    {"type": "run_async_queue_until_deferred", "label": "queue_old_req2a_batch", ...}
    {"type": "cancel", "label": "cancel_late_target_immediate", ...}
    {"type": "submit", "label": "submit_late_reentry_sameid", ...}
    {"type": "drain_async_queue", "label": "drain_old_req2a_batch"}
  2. Scheduler state is keyed by request.request_id. A newly added request is inserted into self.requests[request.request_id].
    # vllm/v1/core/sched/scheduler.py
    def add_request(self, request: Request) -> None:
        ...
        self.waiting.add_request(request)
        self.requests[request.request_id] = request
  3. When a request is cancelled, the scheduler frees it and removes that key from the same request map.
    # vllm/v1/core/sched/scheduler.py
    def finish_requests(...):
        ...
        request.status = finished_status
        self._free_request(request, ...)
    
    def _free_blocks(self, request: Request):
        ...
        del self.requests[request.request_id]
  4. In async scheduling mode, placeholder accounting is added before outputs are returned.
    # vllm/v1/core/sched/async_scheduler.py
    cur_num_spec_tokens = len(spec_decode_tokens.get(req_id, ()))
    request.num_output_placeholders += 1 + cur_num_spec_tokens
    request.spec_token_ids = self._spec_token_placeholders
  5. Later, when queued output is drained, update_from_output() looks up the request object for each req_id from the current scheduler map.
    # vllm/v1/core/sched/scheduler.py
    request = self.requests[req_id]
    ...
    new_token_ids, stopped = self._update_request_with_output(
        request, new_token_ids
    )
    After cancel/resubmit with the same runtime request id, that lookup can now resolve to the new live request object instead of the cancelled old one.
  6. The async scheduler then subtracts the drained output tokens from request.num_output_placeholders.
    # vllm/v1/core/sched/async_scheduler.py
    request.num_output_placeholders -= len(new_token_ids)
    assert request.num_output_placeholders >= 0
  7. In this issue, the subtraction runs against the wrong live request object, so the placeholder count underflows and the assertion fires.
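
The seven steps above can be replayed in a toy model. This is a minimal sketch, not vLLM code: `ToyRequest` and `ToyAsyncScheduler` are hypothetical stand-ins that keep only the behaviours quoted above (a map keyed by request id, placeholder credit added at schedule time, lookup by id at drain time), and the token value `101` is arbitrary.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Stand-in for a scheduler request; not vLLM's Request class."""
    request_id: str
    label: str
    num_output_placeholders: int = 0


class ToyAsyncScheduler:
    """Toy scheduler: requests are tracked in a dict keyed by request id."""

    def __init__(self) -> None:
        self.requests: dict[str, ToyRequest] = {}

    def add_request(self, request: ToyRequest) -> None:
        self.requests[request.request_id] = request

    def finish_request(self, request_id: str) -> None:
        # Cancellation removes the key, which frees the id for reuse.
        del self.requests[request_id]

    def schedule(self, request_id: str) -> str:
        # One placeholder per scheduled step (no speculative tokens here).
        self.requests[request_id].num_output_placeholders += 1
        return request_id  # the queued batch remembers only the id

    def update_from_output(self, queued_req_id: str, new_token_ids: list[int]) -> ToyRequest:
        # Lookup by id resolves to whichever object owns the id right now.
        request = self.requests[queued_req_id]
        request.num_output_placeholders -= len(new_token_ids)
        return request


def replay() -> ToyRequest:
    sched = ToyAsyncScheduler()
    sched.add_request(ToyRequest("late_req2", "req2a"))
    queued = sched.schedule("late_req2")                 # old batch queued for req2a
    sched.finish_request("late_req2")                    # cancel req2a
    sched.add_request(ToyRequest("late_req2", "req2b"))  # same id, new object
    return sched.update_from_output(queued, [101])       # drain the old batch


victim = replay()
print(victim.label, victim.num_output_placeholders)
```

The toy omits the assertion so the final state can be inspected: the object that ends up with a count of -1 is req2b, which never received any placeholder credit of its own.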

Local script breakdown

repro_g10_sameid_underflow_local.py is a standalone local reproducer.

  • It reads the Qwen3 checkpoint path from VLLM_POC_G10_MODEL or the built-in /path/to/qwen3 placeholder
  • It creates one EngineCore directly in async scheduling mode
  • It submits:
    • one anchor request
    • one overlapping peer request
    • one late target request
  • It explicitly queues the late target batch, cancels that request, and then submits a new request that reuses the same runtime_request_id
  • It drains the old queued batch after the new request is already live
  • It writes:
    • campaign.json
    • repro_config.json
    • request_history.json
    • request_states.json
    • action_log.json
    • error.txt on failure
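
For triaging a finished run, a small helper along these lines could check which of the files listed above were written. This is only a sketch: the file names come from the list above, but the run directory location, the flat layout, and the `summarize_run` helper itself are assumptions, not part of the reproducer.

```python
from pathlib import Path

# File names taken from the list above; the directory layout is an assumption.
EXPECTED_ARTIFACTS = [
    "campaign.json",
    "repro_config.json",
    "request_history.json",
    "request_states.json",
    "action_log.json",
]


def summarize_run(run_dir: str) -> dict:
    """Report which artifacts exist and whether error.txt mentions the placeholder counter."""
    root = Path(run_dir)
    present = [name for name in EXPECTED_ARTIFACTS if (root / name).is_file()]
    error_file = root / "error.txt"
    error_text = error_file.read_text(errors="replace") if error_file.is_file() else ""
    return {
        "present": present,
        "missing": [name for name in EXPECTED_ARTIFACTS if name not in present],
        "failed": error_file.is_file(),  # error.txt is only written on failure
        "placeholder_underflow": "num_output_placeholders" in error_text,
    }
```

Calling `summarize_run("/path/to/run_dir")` on a failing run would be expected to report `failed` as true, since error.txt is only written on failure.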

Local repro

repro_g10_sameid_underflow_local.py

export POC_PY=/path/to/python3
export G10_LOCAL=/path/to/repro_g10_sameid_underflow_local.py
export VLLM_POC_G10_MODEL=/path/to/qwen3

CUDA_VISIBLE_DEVICES=0 "$POC_PY" "$G10_LOCAL" \
  --model "$VLLM_POC_G10_MODEL" \
  --run-name g10_sameid_underflow_local

Reproduce Environment

Item           Value
OS             Ubuntu 22.04.5 LTS
Kernel         Linux 6.5.0-35-generic
GPU            2 x NVIDIA GeForce RTX 4090
GPU memory     24564 MiB each
Driver         570.86.10
CUDA runtime   12.8 (nvidia-smi)
CUDA toolkit   12.8.61 (nvcc)
Python         3.12.9
vLLM           0.17.1
PyTorch        2.10.0+cu128
transformers   4.56.1
tokenizers     0.22.0
flash_attn     2.8.3
triton         3.6.0
numpy          2.0.2

Observed result

Representative traceback from the verified run:

  File ".../vllm/v1/core/sched/scheduler.py", line 1358, in update_from_output
    new_token_ids, stopped = self._update_request_with_output(
  File ".../vllm/v1/core/sched/async_scheduler.py", line 53, in _update_request_with_output
    assert request.num_output_placeholders >= 0
AssertionError

Root cause

This is not a malformed-input bug. The problem is that async queued work for an old request can outlive that request object, while a new request is allowed to reuse the same runtime request identity.

In the verified run, the old req2a batch is still queued after cancellation. Then req2b is submitted with the same runtime request id. When the old batch finally drains, vLLM updates the current live request object for that id, which is now req2b, not the cancelled req2a.

That causes stale queued output from the old request to consume placeholder credit on the new request. The result is that request.num_output_placeholders is decremented on the wrong object and falls below zero.

So the real bug is in how async queued output is rebound during runtime request id reuse after cancel/resubmit.
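
To make the missing identity check concrete, here is a toy sketch of one possible guard: the queued batch remembers a per-object token, and stale output is dropped when the id now maps to a different object. This is an illustration only and should not be read as the fix adopted upstream; `ToyRequest`, the `generation` field, and `drain` are all hypothetical names that do not exist in vLLM.

```python
import itertools
from dataclasses import dataclass, field

_generation = itertools.count()


@dataclass
class ToyRequest:
    request_id: str
    # Hypothetical per-object token; not a field of vLLM's Request.
    generation: int = field(default_factory=lambda: next(_generation))
    num_output_placeholders: int = 0


def drain(requests: dict[str, ToyRequest], req_id: str,
          queued_generation: int, new_token_ids: list[int]) -> bool:
    """Apply queued output only if the id still maps to the object it was queued for."""
    request = requests.get(req_id)
    if request is None or request.generation != queued_generation:
        return False  # stale output: the original request is gone or was replaced
    request.num_output_placeholders -= len(new_token_ids)
    assert request.num_output_placeholders >= 0
    return True


requests: dict[str, ToyRequest] = {}
old = ToyRequest("late_req2")
requests[old.request_id] = old
old.num_output_placeholders += 1     # batch queued for req2a
queued_generation = old.generation   # the batch remembers which object it belongs to
del requests["late_req2"]            # cancel req2a
new = ToyRequest("late_req2")        # req2b reuses the id
requests[new.request_id] = new
applied = drain(requests, "late_req2", queued_generation, [101])
print(applied, new.num_output_placeholders)
```

In this toy the stale batch is rejected and req2b's counter stays at 0. Whether the real scheduler should drop stale output, reject id reuse while old work is still queued, or key its state differently is a design decision this sketch does not settle.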

Attachments

The attachment bundle for this report should contain:

  • repro_g10_sameid_underflow_local.py

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
