vllm - ✅(Solved) Fix [Bug]: vllm bench: "Peak output token throughput" is "less than Output token throughput" [1 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#37666 (fetched 2026-04-08 01:04:12)
Comments: 2 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×1, labeled ×1

Fix Action

Fix / Workaround


PR fix notes

PR #37690: fix(bench): compute peak output throughput from token-volume decode windows

Description (problem / solution / changelog)

Summary

Fix vllm bench serve peak output throughput calculation so it is based on generated token volume over decode windows, not stream chunk-event count.

Root Cause

max_output_tokens_per_s was computed from ttft/itl event timestamps. In chunked streaming, one event can contain multiple tokens, so event-count bucketing can undercount output token throughput and report peak lower than average throughput.
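The undercount described above can be reproduced with a toy stream. This is a minimal sketch, not the code from vllm/benchmarks/serve.py: the function names and the synthetic event list are hypothetical, and timestamps are integer milliseconds to keep the bucketing exact.

```python
from collections import Counter

def peak_by_event_count(event_times_ms):
    """Count stream events per 1-second bucket (the flawed approach)."""
    buckets = Counter(t // 1000 for t in event_times_ms)
    return max(buckets.values(), default=0)

def peak_by_token_volume(events):
    """Sum tokens per 1-second bucket; events are (time_ms, n_tokens) pairs."""
    buckets = Counter()
    for t_ms, n_tokens in events:
        buckets[t_ms // 1000] += n_tokens
    return max(buckets.values(), default=0)

# Hypothetical chunked stream: one chunk every 200 ms for 2 s, 4 tokens per chunk.
events = [(i * 200, 4) for i in range(10)]

total_tokens = sum(n for _, n in events)          # 40 tokens
average_tok_s = total_tokens / 2                  # 20 tok/s over 2 seconds
event_peak = peak_by_event_count([t for t, _ in events])
volume_peak = peak_by_token_volume(events)
print(event_peak, volume_peak, average_tok_s)
```

With four tokens per chunk, the event-count peak is 5 while the average is 20 tok/s, which is the same kind of inversion the issue reports. Counting tokens instead gives a peak of 20.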

Changes

  • Reworked peak output throughput calculation in vllm/benchmarks/serve.py:
    • Use actual_output_lens token counts with decode windows (start_time + ttft to start_time + latency)
    • Accumulate per-second token volume by overlap with 1-second buckets
    • Keep concurrent request peak tracking with overlap-based bucket counting
  • Added regression test tests/benchmarks/test_serve_metrics.py to cover chunked-streaming behavior.
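The changes listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the function names and the tuple layout are hypothetical, and it assumes each request's tokens are spread uniformly over its decode window (start_time + ttft to start_time + latency), which the PR description does not spell out.

```python
import math
from collections import defaultdict

def per_second_token_volume(requests):
    """requests: iterable of (start_time, ttft, latency, output_len) in seconds."""
    buckets = defaultdict(float)
    for start, ttft, latency, n_tokens in requests:
        t0, t1 = start + ttft, start + latency   # decode window
        if n_tokens <= 0:
            continue
        if t1 <= t0:
            # Degenerate window: attribute all tokens to a single bucket.
            buckets[math.floor(t0)] += n_tokens
            continue
        rate = n_tokens / (t1 - t0)               # assumed uniform token rate
        for b in range(math.floor(t0), math.ceil(t1)):
            overlap = min(t1, b + 1) - max(t0, b)  # seconds of window inside bucket b
            buckets[b] += rate * overlap
    return dict(buckets)

def peak_output_tokens_per_s(requests):
    return max(per_second_token_volume(requests).values(), default=0.0)

def peak_concurrent_requests(requests):
    """Count requests whose [start, start + latency) span touches each bucket."""
    active = defaultdict(int)
    for start, _ttft, latency, _n in requests:
        last = max(math.ceil(start + latency), math.floor(start) + 1)
        for b in range(math.floor(start), last):
            active[b] += 1
    return max(active.values(), default=0)

# Two identical hypothetical requests: 100 tokens decoded between t=0.5 s and t=2.5 s.
requests = [(0.0, 0.5, 2.5, 100), (0.0, 0.5, 2.5, 100)]
print(peak_output_tokens_per_s(requests), peak_concurrent_requests(requests))
```

A regression test in the spirit of the one the PR adds could assert that the peak is never below the run average. Here the two requests average 80 tok/s over 2.5 s, and the bucketed peak is 100 tok/s.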

Why this is not duplicating an existing open PR

Tests

  • python3 -m py_compile vllm/benchmarks/serve.py tests/benchmarks/test_serve_metrics.py
  • python3 -m pytest -q tests/benchmarks/test_serve_metrics.py ❌ (not runnable in this local environment: missing pytest/torch dependencies)

AI Assistance

This change was developed with AI-assisted tooling and reviewed by a human contributor.

Closes #37666

Changed files

  • tests/benchmarks/test_serve_metrics.py (added, +47/-0)
  • vllm/benchmarks/serve.py (modified, +50/-22)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...                                                                       
==============================                                                                              
        System Info           
==============================          
OS                           : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version                  : (GCC) 12.3.1 (openEuler 12.3.1-99.oe2403sp2)
Clang version                : 17.0.6 ( 17.0.6-45.oe2403sp2)
CMake version                : version 4.2.1
Libc version                 : glibc-2.38  
                                                      
==============================             
       PyTorch Info                             
==============================                  
PyTorch version              : 2.8.0+cpu        
Is debug build               : False            
CUDA used to build PyTorch   : None             
ROCM used to build PyTorch   : N/A              
                                                      
==============================                  
      Python Environment                        
==============================                                                                              
Python version               : 3.11.14 (main, Jan 21 2026, 07:05:42) [GCC 12.3.1 (openEuler 12.3.1-99.oe2403sp2)] (64-bit runtime)
Python platform              : Linux-5.10.0-296.0.0.199.oe2203sp4.aarch64-aarch64-with-glibc2.38
                                                      
==============================
       CUDA / GPU Info        
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       aarch64     
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             640
On-line CPU(s) list:                0-639
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         -
BIOS Model name:                    Kunpeng 920 7280Z To be filled by O.E.M. CPU @ 2.9GHz
BIOS CPU family:                    280
Model:                              0                                                                                                                                                                                   
Thread(s) per core:                 2                                                                                                                                                                                   
Core(s) per socket:                 80                                                                                                                                                                                  
Socket(s):                          4                                                                                                                                                                                   
Stepping:                           0x0                                                                                                                                                                                 
Frequency boost:                    disabled                                                                                                                                                                            
CPU(s) scaling MHz:                 100%
CPU max MHz:                        2900.0000
CPU min MHz:                        400.0000
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp
 flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng ecv
L1d cache:                          20 MiB (320 instances)
L1i cache:                          20 MiB (320 instances)
L2 cache:                           400 MiB (320 instances)
L3 cache:                           560 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-79
NUMA node1 CPU(s):                  80-159
NUMA node2 CPU(s):                  160-239
NUMA node3 CPU(s):                  240-319
NUMA node4 CPU(s):                  320-399
NUMA node5 CPU(s):                  400-479
NUMA node6 CPU(s):                  480-559
NUMA node7 CPU(s):                  560-639
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Not affected
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cpu
[pip3] torch_npu==2.8.0
[pip3] torchvision==0.23.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.13.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/cann-8.5.0/lib64:/usr/local/Ascend/cann-8.5.0/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.0/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64/plugin:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
OMP_NUM_THREADS=1
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

</details>


### 🐛 Describe the bug

    vllm bench serve \
        --dataset-name random \
        --num-prompts "${NUM_PROMPTS}" \
        --max-concurrency 10 \
        --random-input 1024 \
        --random-output 1024 \
        --host "${HOST}" \
        --port "${PORT}" \
        --backend "openai-chat" \
        --percentile-metrics ttft,tpot,itl,e2el \
        --model "${MODEL_NAME}" \
        --tokenizer "${TOKENIZER_PATH}" \
        --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     20        
Failed requests:                         0         
Maximum request concurrency:             10        
Benchmark duration (s):                  56.71     
Total input tokens:                      20460     
Total generated tokens:                  20480     
Request throughput (req/s):              0.35      
Output token throughput (tok/s):         361.11    
Peak output token throughput (tok/s):    180.00    
Peak concurrent requests:                13.00     
Total token throughput (tok/s):          721.88    
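The figures in this report can be checked by hand. A peak taken over 1-second buckets cannot be lower than the total token count divided by the number of buckets, so it should sit at or near the run average or above it. The sketch below uses only the numbers printed above; the variable names are illustrative, and the bucket count is an assumed conservative upper bound.

```python
import math

total_generated_tokens = 20480
duration_s = 56.71
reported_average = 361.11
reported_peak = 180.00

# Average throughput recomputed from the report's own totals.
average = total_generated_tokens / duration_s

# Conservative bound: the run spans at most ceil(duration) + 1 one-second buckets,
# and the fullest bucket holds at least total / bucket_count tokens.
max_buckets = math.ceil(duration_s) + 1
peak_lower_bound = total_generated_tokens / max_buckets

print(f"recomputed average: {average:.2f} tok/s")
print(f"lowest possible peak: {peak_lower_bound:.2f} tok/s")
print(f"reported peak is impossible: {reported_peak < peak_lower_bound}")
```

The recomputed average agrees with the reported 361.11 tok/s to within rounding of the duration, while the reported peak of 180.00 tok/s falls well below the lowest value a true per-second peak could take.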

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The reported numbers point to a measurement bug in the benchmark client, not a serving performance problem: a peak measured over 1-second windows cannot be lower than the average over the whole run. Changing OMP_NUM_THREADS, TORCHINDUCTOR_COMPILE_THREADS, --max-concurrency, the model, or the tokenizer does not change how the peak is computed, so none of them addresses this issue. Here are the steps:

  • Use a vLLM build that includes the change from PR #37690:
    • The environment in the report is vLLM 0.13.0, which shows the bug
    • The change is confined to vllm/benchmarks/serve.py and the new test file, so the server itself does not need to be reconfigured
  • Until such a build is in use, treat "Peak output token throughput" as unreliable when the server streams several tokens per chunk:
    • Rely on "Output token throughput" and "Total token throughput" instead; in this report both match the token totals divided by the benchmark duration
  • Keep the benchmark command unchanged when re-running, so the results stay comparable

Example code:

vllm bench serve \
    --dataset-name random \
    --num-prompts "${NUM_PROMPTS}" \
    --max-concurrency 10 \
    --random-input 1024 \
    --random-output 1024 \
    --host "${HOST}" \
    --port "${PORT}" \
    --backend "openai-chat" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --model "${MODEL_NAME}" \
    --tokenizer "${TOKENIZER_PATH}" \
    --ignore-eos

Verification

To verify that the fix worked, re-run the same vllm bench serve command and compare the two throughput lines. "Peak output token throughput" should now be greater than or equal to "Output token throughput". The average itself is not expected to change, because the fix only affects how the peak is computed.

Extra Tips

  • The PR notes that its regression test was not run in the contributor's local environment (missing pytest/torch), so run python3 -m pytest -q tests/benchmarks/test_serve_metrics.py in an environment that has both.
  • The size of the undercount depends on how many tokens arrive per stream event, so the gap between peak and average can vary between servers and backends.
  • If the peak is still lower than the average after updating, report it on the issue together with the full benchmark output.
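One way to check a saved benchmark report for the symptom in this issue is to parse the two throughput lines and compare them. This is a minimal sketch: the helper names are hypothetical, and the regular expression assumes the "label: value" layout shown in the report above.

```python
import re

# Matches lines such as "Output token throughput (tok/s):         361.11".
LINE = re.compile(r"^(?P<name>.+?):[ \t]+(?P<value>[0-9]+(?:\.[0-9]+)?)[ \t]*$", re.M)

def parse_report(report: str) -> dict:
    """Return a mapping from metric label to its numeric value."""
    return {m["name"].strip(): float(m["value"]) for m in LINE.finditer(report)}

def peak_below_average(report: str) -> bool:
    """True when the report shows the inversion described in this issue."""
    metrics = parse_report(report)
    peak = metrics["Peak output token throughput (tok/s)"]
    average = metrics["Output token throughput (tok/s)"]
    return peak < average

sample = """\
============ Serving Benchmark Result ============
Successful requests:                     20
Benchmark duration (s):                  56.71
Output token throughput (tok/s):         361.11
Peak output token throughput (tok/s):    180.00
Total token throughput (tok/s):          721.88
"""

print(peak_below_average(sample))
```

Running it on the report from this issue flags the inversion. After updating to a build with the fix, the same check on a fresh report should return False.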
