vllm: [Performance]: V1 sample_tokens p99 can include sampled-output readiness; moving the wait to get_output did not improve serving throughput in my setup

vllm, 2026-05-22 21:30:52

I found a reproducible V1 profiling-attribution issue on vLLM 0.21.0: in my setup, sample_tokens p99 appears to include a sampled-output readiness/event-sync wait.

On a comparison path (MRV2, which behaves like a readiness split), the visible wait moves out of sample_tokens and into AsyncModelRunnerOutput.get_output(), but stock steady-state serving progress did not improve.

This is not a throughput-win claim. The narrower result is:

V1 sample_tokens can include a large sampled-output readiness/event-sync fence.
A comparison path collapses sample_tokens p99 from ~19-31 ms to ~1 ms.
The output wait moves to AsyncModelRunnerOutput.get_output().
Expected token counts and logprob counts matched in all reported rows.
Continuous decode throughput and streaming ITL did not improve in this setup.

Caveat: VLLM_USE_V2_MODEL_RUNNER=1 is not a minimal one-line readiness-split patch; it enables Model Runner V2, which has other implementation differences. I am using it here as a comparison path to show where the readiness wait is observed, not as proof that MRV2 itself is the proposed fix.

Related prior work: #22754 discusses sampled-token D2H readiness and blocking behavior around sampled_token_ids.tolist(), with #22760 listed there as the event-sync/non-blocking-copy mitigation. This issue is adjacent but narrower: it is about where sampled-output readiness appears in V1 profiling timings, and whether moving that wait changes stock serving progress.

Root Cause

Profiling attribution: sample_tokens can look much worse than the sampler itself because it includes sampled-output readiness/event synchronization.
Future output scheduling: if there are modes where output materialization can be naturally deferred or batched, keeping decode-critical sampled state separate from host-visible output materialization may matter.
Regression avoidance: a patch that only reduces sample_tokens p99 can be misleading unless AsyncModelRunnerOutput.get_output, ITL, and steady-state throughput are measured too.

Fix / Workaround

I am not proposing an immediate performance patch or claiming a serving throughput win.

Code Example

V1 mixed-history sample_tokens p99:                 19.237 ms
V1 update_async_event_synchronize p99:              18.224 ms

---

comparison-path mixed-history sample_tokens p99:     1.111 ms
Host AsyncModelRunnerOutput.get_output p99:         ~19 ms

---

V1 no-history sample_tokens p99:                    19.287 ms
V1 no-history sample_tokens p99, logprobs disabled: 18.708 ms

Proposal to improve performance

I am not proposing an immediate performance patch or claiming a serving throughput win.

This issue is a profiling/attribution discussion: in my setup, V1 sample_tokens p99 appears to include sampled-output readiness/event-sync work. A comparison path moves the visible wait out of sample_tokens and into AsyncModelRunnerOutput.get_output(), but steady-state serving throughput did not improve.

The possible future direction, if maintainers think this matters, would be to clarify ownership of sampled-output readiness and avoid interpreting a lower sample_tokens p99 as a serving-level win unless get_output, ITL, and steady-state throughput are measured too.

Report of performance regression

This is not a regression claim.

I am not claiming that vLLM 0.21.0 regressed versus a previous release, and I am not claiming that the comparison path improves current serving performance. The evidence below is about where sampled-output readiness appears in profiler timings.

Misc discussion on performance

Internal timing signal

On vLLM 0.21.0 V1:

V1 mixed-history sample_tokens p99:                 19.237 ms
V1 update_async_event_synchronize p99:              18.224 ms

Using the comparison path:

comparison-path mixed-history sample_tokens p99:     1.111 ms
Host AsyncModelRunnerOutput.get_output p99:         ~19 ms

So the sampled-output readiness wait appears movable out of sample_tokens, but it is still paid at the output boundary.
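The effect described above can be reproduced in miniature without vLLM or CUDA. The following is only an illustrative sketch under stated assumptions: a background thread and a threading.Event stand in for the GPU work and the readiness event, the names fake_device_step and run_step are hypothetical, and the 20 ms work duration is an arbitrary value chosen to resemble the timings in this report.

```python
import threading
import time


def fake_device_step(ready: threading.Event, work_s: float) -> None:
    """Stand-in for GPU work plus the device-to-host copy of sampled tokens."""
    time.sleep(work_s)
    ready.set()


def run_step(wait_inside_sample: bool, work_s: float = 0.02) -> dict:
    """Time one simulated step; all returned values are in seconds."""
    ready = threading.Event()
    worker = threading.Thread(target=fake_device_step, args=(ready, work_s))
    timings = {}

    t_step = time.perf_counter()
    worker.start()

    # Region attributed to "sample_tokens".
    t0 = time.perf_counter()
    if wait_inside_sample:
        ready.wait()  # readiness fence paid inside the sampler timing
    timings["sample_tokens"] = time.perf_counter() - t0

    # Region attributed to "get_output".
    t0 = time.perf_counter()
    ready.wait()  # returns immediately if the fence was already paid
    timings["get_output"] = time.perf_counter() - t0

    timings["step_total"] = time.perf_counter() - t_step
    worker.join()
    return timings


if __name__ == "__main__":
    for label, inside in (("wait in sample_tokens", True),
                          ("wait in get_output", False)):
        t = run_step(inside)
        print(label, {k: f"{v * 1e3:.1f} ms" for k, v in t.items()})
```

In both modes step_total stays close to the simulated work time; only the region that absorbs the wait changes, which is the same pattern as the sample_tokens and get_output numbers above.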

One changed result versus my older local evidence: on 0.21.0, the old "no-history is fast" control no longer holds in my setup:

V1 no-history sample_tokens p99:                    19.287 ms
V1 no-history sample_tokens p99, logprobs disabled: 18.708 ms

So I would not frame this as a history/logprob-only issue. The narrower interpretation is: V1 can put sampled-output host readiness on the sample_tokens timing path even when the serving-level wait will still be paid later at output consumption.

One possible explanation for the no-history control is that this is not only about explicit history-dependent logits processors. In V1 async scheduling, sampled token IDs can still be retained and later synchronized for output-token bookkeeping, so the readiness wait can appear even when the workload is not using my older history/logprob-heavy trigger.

Continuous decode control

I then checked whether moving the wait out of sample_tokens improves steady-state serving progress. It did not in this setup.

| configuration | throughput (tok/s) | sample_tokens p99 | get_output p99 | streaming ITL p99 |
| --- | --- | --- | --- | --- |
| V1, max_num_seqs=8, ~128 out tok/req | 359.2 | 19.096 ms | 0.026 ms | n/a |
| comparison path, max_num_seqs=8, ~128 out tok/req | 357.8 | 0.990 ms | 18.466 ms | n/a |
| V1 streaming, max_num_seqs=8 | 358.1 | 19.120 ms | 0.026 ms | 20.392 ms |
| comparison path streaming, max_num_seqs=8 | 358.5 | 0.999 ms | 18.346 ms | 29.224 ms |
| V1, max_num_seqs=32, ~128 out tok/req | 1208.8 | 31.381 ms | 0.028 ms | n/a |
| comparison path, max_num_seqs=32, ~128 out tok/req | 1205.1 | 1.071 ms | 22.147 ms | n/a |
| comparison path, forced immediate output consumption | 344.5 | 20.275 ms | 19.312 ms | n/a |

All reported rows matched expected token counts and logprob counts.
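As a quick cross-check of the paired rows, the sketch below computes the relative throughput change and the combined sample_tokens plus get_output p99 for each V1/comparison pair. The numbers are copied from the table above; the names PAIRS, compare, and SUMMARY are hypothetical. Adding two p99 values is only a rough indicator, since percentiles of different distributions do not add exactly.

```python
# (tok/s, sample_tokens p99 in ms, get_output p99 in ms), copied from the
# reported rows. Each entry pairs the V1 row with its comparison-path row.
PAIRS = {
    "max_num_seqs=8": ((359.2, 19.096, 0.026), (357.8, 0.990, 18.466)),
    "streaming, max_num_seqs=8": ((358.1, 19.120, 0.026), (358.5, 0.999, 18.346)),
    "max_num_seqs=32": ((1208.8, 31.381, 0.028), (1205.1, 1.071, 22.147)),
}


def compare(v1, cmp_path):
    """Summarize one V1 row against its comparison-path row."""
    v1_tps, v1_sample, v1_out = v1
    c_tps, c_sample, c_out = cmp_path
    return {
        "tok_s_change_pct": 100.0 * (c_tps - v1_tps) / v1_tps,
        "v1_combined_p99_ms": v1_sample + v1_out,
        "cmp_combined_p99_ms": c_sample + c_out,
    }


SUMMARY = {name: compare(*pair) for name, pair in PAIRS.items()}

if __name__ == "__main__":
    for name, s in SUMMARY.items():
        print(
            f"{name}: tok/s change {s['tok_s_change_pct']:+.2f}%, "
            f"combined p99 {s['v1_combined_p99_ms']:.3f} ms -> "
            f"{s['cmp_combined_p99_ms']:.3f} ms"
        )
```

For every pair the throughput change stays within half a percent, which is why the rows are read here as flat rather than as a win or a loss.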

Interpretation

The readiness split is visible in profiling: sample_tokens p99 collapses from ~19-31 ms to ~1 ms in the comparison path.
Async materialization windows can overlap decode/sampling work.
But stock steady-state serving progress does not improve in this setup: throughput is flat, and streaming ITL p99 did not improve in the tested row (20.392 ms on V1 versus 29.224 ms on the comparison path).
Forced-immediate output consumption removes the overlap and restores a ~20 ms sample_tokens p99, confirming this is movement of the readiness fence rather than removal of the underlying output materialization cost.

Why this may still be useful

This may be useful for maintainers for three reasons:

Profiling attribution: sample_tokens can look much worse than the sampler itself because it includes sampled-output readiness/event synchronization.
Future output scheduling: if there are modes where output materialization can be naturally deferred or batched, keeping decode-critical sampled state separate from host-visible output materialization may matter.
Regression avoidance: a patch that only reduces sample_tokens p99 can be misleading unless AsyncModelRunnerOutput.get_output, ITL, and steady-state throughput are measured too.
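The regression-avoidance point can be expressed as a small acceptance check. This is a sketch under assumptions, not an existing vLLM utility: RunMetrics, sample_tokens_improved, and serving_level_win are hypothetical names, and the 2% minimum throughput gain is an arbitrary threshold picked for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunMetrics:
    tok_per_s: float
    sample_tokens_p99_ms: float
    get_output_p99_ms: float
    itl_p99_ms: Optional[float] = None  # None when streaming was not measured


def sample_tokens_improved(base: RunMetrics, cand: RunMetrics) -> bool:
    """True when the candidate lowers sample_tokens p99 (a profiling signal only)."""
    return cand.sample_tokens_p99_ms < base.sample_tokens_p99_ms


def serving_level_win(base: RunMetrics, cand: RunMetrics,
                      min_gain: float = 0.02) -> bool:
    """True only if throughput rises by min_gain and ITL p99 does not regress."""
    tps_gain = cand.tok_per_s / base.tok_per_s - 1.0
    itl_ok = True
    if base.itl_p99_ms is not None and cand.itl_p99_ms is not None:
        itl_ok = cand.itl_p99_ms <= base.itl_p99_ms
    return tps_gain >= min_gain and itl_ok


if __name__ == "__main__":
    # Streaming rows at max_num_seqs=8 from this report.
    v1 = RunMetrics(358.1, 19.120, 0.026, 20.392)
    comparison = RunMetrics(358.5, 0.999, 18.346, 29.224)
    print("sample_tokens improved:", sample_tokens_improved(v1, comparison))
    print("serving-level win:", serving_level_win(v1, comparison))
```

Applied to the streaming rows in this report, the first check passes while the second does not, which is the distinction the list above is drawing.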

What I am not claiming

I am not claiming this improves current offline generate() wall time.
I am not claiming a serving throughput win in stock continuous decode.
I am not claiming the event sync itself is wrong; it is where the host observes outstanding readiness work.
I am not claiming this is specific to history-dependent logits processors, since the 0.21.0 no-history control also has the tail.
I am not claiming that VLLM_USE_V2_MODEL_RUNNER=1 is an isolated minimal fix; it is a comparison path with other implementation differences.

Question for maintainers

Is this worth tracking as a V1 profiling/output-readiness ownership issue, or is the current behavior expected and best handled as documentation/profiling guidance?

I can share the local probe/artifacts if useful. I would not propose this as a performance PR unless there is a serving mode maintainers care about where moving sampled-output readiness out of sample_tokens creates a real serving-level win.

Your current environment (if you think it is necessary)

Relevant setup for the measurements:

vLLM: 0.21.0
Engine: V1
Baseline path: VLLM_USE_V2_MODEL_RUNNER=0
Comparison path: VLLM_USE_V2_MODEL_RUNNER=1 / MRV2
GPU: RTX 3090 24GB
Model: Qwen/Qwen2.5-Coder-7B-Instruct
Workload: CUDA graph decode probes, logprobs enabled, mixed sampling params, ignore_eos=True, output length around 128 tokens/request
Instrumentation: sample_tokens, update_async_event_synchronize, AsyncModelRunnerOutput.get_output, async materialization windows, expected token/logprob count checks, and a manual streaming loop with RequestOutputKind.DELTA
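For readers who want to collect comparable numbers, here is a minimal host-side timing sketch. It is not the probe used for this report: timed, percentile, and inter_token_latencies are hypothetical helper names, the percentile uses the nearest-rank definition (other definitions give slightly different tails), and where to wrap calls such as sample_tokens or get_output inside vLLM is left as an assumption. It measures host wall time only, so a region that blocks on a GPU event will include that wait.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager

# Region name -> list of durations in milliseconds.
SAMPLES = defaultdict(list)


@contextmanager
def timed(name: str):
    """Record the host wall time spent inside the with-block under `name`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        SAMPLES[name].append((time.perf_counter() - t0) * 1e3)


def percentile(values, pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def inter_token_latencies(arrival_times_s):
    """Gaps in milliseconds between consecutive token arrival timestamps."""
    return [(b - a) * 1e3 for a, b in zip(arrival_times_s, arrival_times_s[1:])]


if __name__ == "__main__":
    for _ in range(50):
        with timed("sample_tokens"):
            time.sleep(0.001)  # placeholder for the region being measured
    print("sample_tokens p99 (ms):", percentile(SAMPLES["sample_tokens"], 99))
    print("ITL (ms):", inter_token_latencies([0.00, 0.02, 0.05]))
```

Timing both the sampling region and the output-consumption region with the same helper makes it easier to see whether a wait disappeared or only moved.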

Output of python collect_env.py:

Collecting environment information...
  uv is set
  ==============================
          System Info
  ==============================
  OS                           : Ubuntu 24.04.4 LTS (x86_64)
  GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
  Clang version                : Could not collect
  CMake version                : version 3.28.3
  Libc version                 : glibc-2.39

  ==============================
         PyTorch Info
  ==============================
  PyTorch version              : 2.11.0+cu130
  Is debug build               : False
  CUDA used to build PyTorch   : 13.0
  ROCM used to build PyTorch   : N/A
  XPU used to build PyTorch    : N/A

  ==============================
        Python Environment
  ==============================
  Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
  Python platform              : Linux-6.17.0-29-generic-x86_64-with-glibc2.39

  ==============================
         CUDA / GPU Info
  ==============================
  Is CUDA available            : True
  CUDA runtime version         : 13.0.88
  CUDA_MODULE_LOADING set to   :
  GPU models and configuration : GPU 0: NVIDIA GeForce RTX 3090
  Nvidia driver version        : 580.159.03
  cuDNN version                : Could not collect
  HIP runtime version          : N/A
  MIOpen runtime version       : N/A
  Is XNNPACK available         : True

  ==============================
            CPU Info
  ==============================
  Architecture:                            x86_64
  CPU op-mode(s):                          32-bit, 64-bit
  Address sizes:                           48 bits physical, 48 bits virtual
  Byte Order:                              Little Endian
  CPU(s):                                  16
  On-line CPU(s) list:                     0-15
  Vendor ID:                               AuthenticAMD
  Model name:                              AMD Ryzen 7 5800X 8-Core Processor
  CPU family:                              25
  Model:                                   33
  Thread(s) per core:                      2
  Core(s) per socket:                      8
  Socket(s):                               1
  Stepping:                                0
  Frequency boost:                         enabled
  CPU(s) scaling MHz:                      69%
  CPU max MHz:                             4853.5850
  CPU min MHz:                             555.5310
  BogoMIPS:                                7600.20
  Virtualization:                          AMD-V
  NUMA node(s):                            1
  NUMA node0 CPU(s):                       0-15

  ==============================
  Versions of relevant libraries
  ==============================
  [pip3] No relevant packages
  [conda] Could not collect

  ==============================
           vLLM Info
  ==============================
  ROCM Version                 : Could not collect
  vLLM Version                 : 0.21.0
  vLLM Build Flags:
    CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
  GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
  GPU0   X      0-15    0               N/A

  Legend:

    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks

  ==============================
       Environment Variables
  ==============================
  TORCH_EXTENSIONS_DIR=/data/ai/cache/torch_extensions
