vllm - ✅(Solved) Fix [Bug]: vllm bench: "Peak output token throughput" is "less than Output token throughput" [1 pull requests, 2 comments, 2 participants]

vllm · 2026-03-20 09:39:44


vllm-project/vllm#37666•Fetched 2026-04-08 01:04:12


Author

AskyJx

Participants

AskyJx

howardpen9

Timeline (top)

commented ×2, cross-referenced ×1, labeled ×1

Fix Action


PR fix notes

PR #37690: fix(bench): compute peak output throughput from token-volume decode windows

Repository: vllm-project/vllm
Author: howardpen9
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37690

Description (problem / solution / changelog)

Summary

Fix vllm bench serve peak output throughput calculation so it is based on generated token volume over decode windows, not stream chunk-event count.

Root Cause

max_output_tokens_per_s was computed from ttft/itl event timestamps. In chunked streaming, one event can contain multiple tokens, so event-count bucketing can undercount output token throughput and report peak lower than average throughput.
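
The undercount described above can be illustrated with a small sketch. This is not the vLLM implementation; the function names and the synthetic stream (10 chunks per second, 4 tokens per chunk) are assumptions chosen only to show why counting events instead of tokens can push the reported peak below the average.

```python
from collections import Counter


def peak_by_event_count(event_times):
    """Peak per-second rate when every stream event is counted as one token."""
    buckets = Counter(int(t) for t in event_times)
    return max(buckets.values(), default=0)


def peak_by_token_count(events):
    """Peak per-second rate when each event contributes its real token count."""
    buckets = Counter()
    for t, n_tokens in events:
        buckets[int(t)] += n_tokens
    return max(buckets.values(), default=0)


# Hypothetical chunked stream: 20 events over 2 seconds, 4 tokens per event.
events = [(i / 10, 4) for i in range(20)]

total_tokens = sum(n for _, n in events)
average = total_tokens / 2.0  # tokens per second over the 2-second run

peak_events = peak_by_event_count([t for t, _ in events])
peak_tokens = peak_by_token_count(events)

print(f"average={average}, peak by events={peak_events}, peak by tokens={peak_tokens}")
```

With these made-up numbers the event-based peak lands below the average while the token-based peak does not, which mirrors the symptom in the report.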

Changes

Reworked peak output throughput calculation in vllm/benchmarks/serve.py:
- Use actual_output_lens token counts with decode windows (start_time + ttft to start_time + latency)
- Accumulate per-second token volume by overlap with 1-second buckets
- Keep concurrent request peak tracking with overlap-based bucket counting
Added regression test tests/benchmarks/test_serve_metrics.py to cover chunked-streaming behavior.
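
A minimal sketch of the overlap-based accumulation listed above, assuming each request's tokens are spread uniformly across its decode window. The function name, the tuple layout, and the handling of zero-length windows are illustrative assumptions; the PR's actual code in vllm/benchmarks/serve.py may differ.

```python
import math
from collections import defaultdict


def peak_output_tokens_per_s(requests):
    """Estimate peak tokens/s from (start_time, ttft, latency, output_len) tuples.

    Times are in seconds. Tokens are assumed to be generated at a uniform
    rate between start_time + ttft and start_time + latency.
    """
    buckets = defaultdict(float)
    for start_time, ttft, latency, output_len in requests:
        if output_len <= 0:
            continue
        decode_start = start_time + ttft
        decode_end = start_time + latency
        window = decode_end - decode_start
        if window <= 0:
            # Assumed handling of a degenerate window: credit every token
            # to the single bucket that contains the decode start.
            buckets[math.floor(decode_start)] += output_len
            continue
        rate = output_len / window
        sec = math.floor(decode_start)
        while sec < decode_end:
            # Length of the intersection of [sec, sec + 1) with the window.
            overlap = min(decode_end, sec + 1) - max(decode_start, sec)
            if overlap > 0:
                buckets[sec] += rate * overlap
            sec += 1
    return max(buckets.values(), default=0.0)


# Two hypothetical requests that overlap during second 1.
example = [
    (0.0, 0.5, 2.5, 100),  # 100 tokens decoded between t=0.5 and t=2.5
    (0.0, 1.0, 2.0, 30),   # 30 tokens decoded between t=1.0 and t=2.0
]
peak = peak_output_tokens_per_s(example)
print(peak)
```

Because every generated token is credited to some bucket, the largest bucket in this sketch cannot fall below the mean across the buckets that received tokens.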

Why this is not duplicating an existing open PR

Open PR #35471 also touches this metric area, but this patch is materially different:
- this patch removes dependency on chunk-event count (itl) entirely for peak token throughput,
- uses decode-window overlap integration with actual token volume,
- and includes explicit degenerate-window handling.
Coordination/differentiation notes were added on issue #37666:
- https://github.com/vllm-project/vllm/issues/37666#issuecomment-4098247969
- https://github.com/vllm-project/vllm/issues/37666#issuecomment-4098250795

Tests

python3 -m py_compile vllm/benchmarks/serve.py tests/benchmarks/test_serve_metrics.py ✅
python3 -m pytest -q tests/benchmarks/test_serve_metrics.py ❌ (not runnable in this local environment: missing pytest/torch dependencies)

AI Assistance

This change was developed with AI-assisted tooling and reviewed by a human contributor.

Closes #37666

Changed files

tests/benchmarks/test_serve_metrics.py (added, +47/-0)
vllm/benchmarks/serve.py (modified, +50/-22)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...                                                                       
==============================                                                                              
        System Info           
==============================          
OS                           : openEuler 24.03 (LTS-SP2) (aarch64)
GCC version                  : (GCC) 12.3.1 (openEuler 12.3.1-99.oe2403sp2)
Clang version                : 17.0.6 ( 17.0.6-45.oe2403sp2)
CMake version                : version 4.2.1
Libc version                 : glibc-2.38  
                                                      
==============================             
       PyTorch Info                             
==============================                  
PyTorch version              : 2.8.0+cpu        
Is debug build               : False            
CUDA used to build PyTorch   : None             
ROCM used to build PyTorch   : N/A              
                                                      
==============================                  
      Python Environment                        
==============================                                                                              
Python version               : 3.11.14 (main, Jan 21 2026, 07:05:42) [GCC 12.3.1 (openEuler 12.3.1-99.oe2403sp2)] (64-bit runtime)
Python platform              : Linux-5.10.0-296.0.0.199.oe2203sp4.aarch64-aarch64-with-glibc2.38
                                                      
==============================
       CUDA / GPU Info        
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       aarch64     
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             640
On-line CPU(s) list:                0-639
Vendor ID:                          HiSilicon
BIOS Vendor ID:                     HiSilicon
Model name:                         -
BIOS Model name:                    Kunpeng 920 7280Z To be filled by O.E.M. CPU @ 2.9GHz
BIOS CPU family:                    280
Model:                              0                                                                                                                                                                                   
Thread(s) per core:                 2                                                                                                                                                                                   
Core(s) per socket:                 80                                                                                                                                                                                  
Socket(s):                          4                                                                                                                                                                                   
Stepping:                           0x0                                                                                                                                                                                 
Frequency boost:                    disabled                                                                                                                                                                            
CPU(s) scaling MHz:                 100%
CPU max MHz:                        2900.0000
CPU min MHz:                        400.0000
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb dcpodp
 flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng ecv
L1d cache:                          20 MiB (320 instances)
L1i cache:                          20 MiB (320 instances)
L2 cache:                           400 MiB (320 instances)
L3 cache:                           560 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-79
NUMA node1 CPU(s):                  80-159
NUMA node2 CPU(s):                  160-239
NUMA node3 CPU(s):                  240-319
NUMA node4 CPU(s):                  320-399
NUMA node5 CPU(s):                  400-479
NUMA node6 CPU(s):                  480-559
NUMA node7 CPU(s):                  560-639
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Not affected
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cpu
[pip3] torch_npu==2.8.0
[pip3] torchvision==0.23.0
[pip3] transformers==4.57.6
[pip3] triton-ascend==3.2.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.13.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/cann-8.5.0/lib64:/usr/local/Ascend/cann-8.5.0/lib64/plugin/opskernel:/usr/local/Ascend/cann-8.5.0/lib64/plugin/nnengine:/usr/local/Ascend/cann-8.5.0/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64:/usr/local/Ascend/cann-8.5.0/tools/aml/lib64/plugin:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
OMP_NUM_THREADS=1
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

</details>


### 🐛 Describe the bug

    vllm bench serve \
        --dataset-name random \
        --num-prompts "${NUM_PROMPTS}" \
        --max-concurrency 10 \
        --random-input 1024 \
        --random-output 1024 \
        --host "${HOST}" \
        --port "${PORT}" \
        --backend "openai-chat" \
        --percentile-metrics ttft,tpot,itl,e2el \
        --model "${MODEL_NAME}" \
        --tokenizer "${TOKENIZER_PATH}" \
        --ignore-eos

    ============ Serving Benchmark Result ============
    Successful requests:                     20
    Failed requests:                         0
    Maximum request concurrency:             10
    Benchmark duration (s):                  56.71
    Total input tokens:                      20460
    Total generated tokens:                  20480
    Request throughput (req/s):              0.35
    Output token throughput (tok/s):         361.11
    Peak output token throughput (tok/s):    180.00
    Peak concurrent requests:                13.00
    Total token throughput (tok/s):          721.88
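
The inconsistency can be checked with a few lines of arithmetic. The sketch below only restates the figures printed above; the variable names are illustrative, and the small gap between the recomputed and reported averages is assumed to come from the duration being rounded to two decimals in the output.

```python
# Figures copied from the benchmark result above.
total_generated_tokens = 20480
benchmark_duration_s = 56.71
reported_average = 361.11  # Output token throughput (tok/s)
reported_peak = 180.00     # Peak output token throughput (tok/s)

# The average is reproducible from total tokens and duration.
recomputed_average = total_generated_tokens / benchmark_duration_s

# A peak measured over the same tokens should not fall below the average.
is_consistent = reported_peak >= reported_average

print(f"recomputed average: {recomputed_average:.2f} tok/s")
print(f"peak >= average: {is_consistent}")
```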

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The numbers in the report point to a bug in how vllm bench serve computes the peak metric, not to a slow server. The reported peak (180.00 tok/s) is below the reported average (361.11 tok/s), which should not happen if both metrics count the same tokens. Raising OMP_NUM_THREADS, TORCHINDUCTOR_COMPILE_THREADS, or --max-concurrency, or switching the model or tokenizer, may change how fast the server runs but does not change how the metric is calculated.

Apply the metric fix proposed in PR #37690 (open and unmerged when this page was fetched):
- The patch changes vllm/benchmarks/serve.py so that peak output throughput is derived from actual_output_lens over each request's decode window instead of from ttft/itl event counts
- It adds the regression test tests/benchmarks/test_serve_metrics.py
Until a fix is merged, treat the peak value with caution:
- Use "Output token throughput" as the figure to rely on
- Expect "Peak output token throughput" to be undercounted whenever one stream chunk carries several tokens

Reproduction command (the same as in the report; no environment variables need to be changed):

vllm bench serve \
    --dataset-name random \
    --num-prompts "${NUM_PROMPTS}" \
    --max-concurrency 10 \
    --random-input 1024 \
    --random-output 1024 \
    --host "${HOST}" \
    --port "${PORT}" \
    --backend "openai-chat" \
    --percentile-metrics ttft,tpot,itl,e2el \
    --model "${MODEL_NAME}" \
    --tokenizer "${TOKENIZER_PATH}" \
    --ignore-eos

Verification

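To verify the fix, rerun the same vllm bench serve command against a build that includes the patch and compare the two metrics: "Peak output token throughput" should be at least as high as "Output token throughput". Request throughput and output token throughput themselves are not expected to change, because the patch only alters how the peak is computed.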

Extra Tips

PR #35471 is also open and touches the same metric area, so check the status of both pull requests before choosing a patch.
The author of PR #37690 reports that the regression test could not be run locally (missing pytest/torch dependencies), so run tests/benchmarks/test_serve_metrics.py yourself before relying on the patch.
The average in the report matches total generated tokens divided by benchmark duration (20480 / 56.71 s, about 361 tok/s), so the average appears to be unaffected by chunked streaming.
