vllm - ✅(Solved) Fix Port custom ops to native Inductor multi-stream support [1 pull request, 2 comments, 3 participants]

vllm · 2026-03-18 03:12:35


GitHub stats

vllm-project/vllm#37372 • Fetched 2026-04-08 00:53:13


Timeline (top)

mentioned ×3, subscribed ×3, commented ×2, assigned ×1

Fix Action

Fixed

Fixed by PR: [Perf] Enable dual stream execution of input projection for Qwen3 (https://github.com/vllm-project/vllm/pull/36795)

PR fix notes

PR #36795: [Perf] Enable dual stream execution of input projection for Qwen3

Repository: vllm-project/vllm
Author: xyang16
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/36795

Description (problem / solution / changelog)

Purpose

This PR enables dual-stream execution of the input projection for Qwen3 Next.

Parallelize the execution of in_proj_qkvz and in_proj_ba across two streams, because their outputs are independent.
Wrap the implementation in a custom op so that it works with torch.compile.
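The idea of overlapping two independent projections on separate CUDA streams can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the helper name `run_projections_dual_stream` and the toy `torch.nn.Linear` layers are assumptions, and the custom-op wrapping for torch.compile described above is left out. Without CUDA, the sketch simply runs the two projections one after the other.

```python
import torch


def run_projections_dual_stream(x, proj_a, proj_b):
    """Apply two independent projections to x, overlapping them on CUDA.

    Hypothetical helper for illustration; not the code from PR #36795.
    """
    if not x.is_cuda:
        # No CUDA streams on CPU: run the two projections sequentially.
        return proj_a(x), proj_b(x)

    main = torch.cuda.current_stream()
    side = torch.cuda.Stream()

    # The side stream must not read x before the main stream has produced it.
    side.wait_stream(main)

    out_a = proj_a(x)  # enqueued on the main stream
    with torch.cuda.stream(side):
        out_b = proj_b(x)  # enqueued on the side stream

    # Later work on the main stream must wait for the side stream's result.
    main.wait_stream(side)
    # Tell the caching allocator that out_b is also used on the main stream.
    out_b.record_stream(main)
    return out_a, out_b


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4, 32, device=device)
    proj_qkvz = torch.nn.Linear(32, 64).to(device)  # stand-in for in_proj_qkvz
    proj_ba = torch.nn.Linear(32, 8).to(device)     # stand-in for in_proj_ba
    with torch.no_grad():
        qkvz, ba = run_projections_dual_stream(x, proj_qkvz, proj_ba)
    print(qkvz.shape, ba.shape)
```

A real implementation would likely create the side stream once and reuse it, since this sketch allocates a new stream on every call.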

Profiling

Main: nvjet_tst_64x8_64x16_4x2_h_bz_TNT (in_proj_qkvz) and nvjet_tst_64x8_64x16_1x2_h_bz_TNT (in_proj_ba) kernels launched sequentially.

PR: kernels launched in parallel.

Benchmarking

Benchmarked on H200.

Qwen3

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching

vllm bench serve \
        --model Qwen/Qwen3-Next-80B-A3B-Instruct \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  121.46    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              7.90      
Output token throughput (tok/s):         2371.20   
Peak output token throughput (tok/s):    2640.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          4175.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          176.72    
Median TTFT (ms):                        193.47    
P99 TTFT (ms):                           222.77    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.18      
Median TPOT (ms):                        6.15      
P99 TPOT (ms):                           6.48      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.18      
Median ITL (ms):                         6.13      
P99 ITL (ms):                            6.94      
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  120.20    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              7.99      
Output token throughput (tok/s):         2396.09   
Peak output token throughput (tok/s):    2672.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          4219.28   
---------------Time to First Token----------------
Mean TTFT (ms):                          191.90    
Median TTFT (ms):                        214.37    
P99 TTFT (ms):                           249.63    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.06      
Median TPOT (ms):                        6.04      
P99 TPOT (ms):                           6.40      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.06      
Median ITL (ms):                         6.02      
P99 ITL (ms):                            6.72      
==================================================

Qwen3 fp8

vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching

vllm bench serve \
        --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  195.14    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              4.92      
Output token throughput (tok/s):         1475.89   
Peak output token throughput (tok/s):    1648.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2641.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          162.16    
Median TTFT (ms):                        156.37    
P99 TTFT (ms):                           234.19    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.33     
Median TPOT (ms):                        10.31     
P99 TPOT (ms):                           10.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.33     
Median ITL (ms):                         10.24     
P99 ITL (ms):                            11.43     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  191.31    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              5.02      
Output token throughput (tok/s):         1505.38   
Peak output token throughput (tok/s):    1712.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2650.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          236.09    
Median TTFT (ms):                        236.95    
P99 TTFT (ms):                           380.30    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.87      
Median TPOT (ms):                        9.83      
P99 TPOT (ms):                           10.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.87      
Median ITL (ms):                         9.81      
P99 ITL (ms):                            10.99     
==================================================

Qwen3.5

vllm serve Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching

vllm bench serve \
        --model Qwen/Qwen3.5-35B-A3B \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  197.96    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              4.85      
Output token throughput (tok/s):         1454.81   
Peak output token throughput (tok/s):    1648.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2604.25   
---------------Time to First Token----------------
Mean TTFT (ms):                          142.74    
Median TTFT (ms):                        152.15    
P99 TTFT (ms):                           199.72    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.55     
Median TPOT (ms):                        10.59     
P99 TPOT (ms):                           11.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.55     
Median ITL (ms):                         10.24     
P99 ITL (ms):                            12.05     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  190.61    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              5.04      
Output token throughput (tok/s):         1510.93   
Peak output token throughput (tok/s):    1715.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2704.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          169.85    
Median TTFT (ms):                        173.71    
P99 TTFT (ms):                           254.99    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.05     
Median TPOT (ms):                        10.06     
P99 TPOT (ms):                           10.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.05     
Median ITL (ms):                         10.06     
P99 ITL (ms):                            11.37     
==================================================
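To make the before/after comparison easier to read, the relative change of a few mean metrics can be computed directly from the tables above. The sketch below is only arithmetic on the reported numbers; the dictionary layout and function names are illustrative choices, not part of vLLM's benchmark tooling.

```python
# Mean values copied from the benchmark tables above, as (main, pr) pairs.
RESULTS = {
    "Qwen3": {
        "output_tok_per_s": (2371.20, 2396.09),
        "mean_tpot_ms": (6.18, 6.06),
        "mean_ttft_ms": (176.72, 191.90),
    },
    "Qwen3 fp8": {
        "output_tok_per_s": (1475.89, 1505.38),
        "mean_tpot_ms": (10.33, 9.87),
        "mean_ttft_ms": (162.16, 236.09),
    },
    "Qwen3.5": {
        "output_tok_per_s": (1454.81, 1510.93),
        "mean_tpot_ms": (10.55, 10.05),
        "mean_ttft_ms": (142.74, 169.85),
    },
}


def percent_change(main, pr):
    """Relative change from main to PR, in percent."""
    return (pr - main) / main * 100.0


def summarize(results):
    """Return {model: {metric: percent change}} for every entry."""
    return {
        model: {name: percent_change(*pair) for name, pair in metrics.items()}
        for model, metrics in results.items()
    }


if __name__ == "__main__":
    for model, changes in summarize(RESULTS).items():
        line = ", ".join(f"{k} {v:+.1f}%" for k, v in changes.items())
        print(f"{model}: {line}")
```

In these runs, output throughput rises and mean TPOT falls for all three models, while mean TTFT rises. The Qwen3 fp8 pair reports different total input tokens for Main and PR (227546 vs 219140), so that comparison may not be strictly like for like.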

Accuracy Testing

Qwen3

python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8575|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8150|±  |0.0107|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8552|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8082|±  |0.0108|

Qwen3 fp8

python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8491|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8575|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|

Qwen3.5

python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3.5-35B-A3B,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8476|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8370|±  |0.0102|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8499|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8332|±  |0.0103|
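One way to judge whether the accuracy differences above are meaningful is to compare each difference against the reported standard errors. The sketch below does this under the assumption that the Main and PR runs are independent, so their standard errors combine in quadrature; that assumption and the helper name `z_score` are mine, not from the PR.

```python
import math

# gsm8k exact_match (value, stderr) copied from the tables above: (main, pr).
ACCURACY = {
    ("Qwen3", "flexible-extract"): ((0.8575, 0.0096), (0.8552, 0.0097)),
    ("Qwen3", "strict-match"): ((0.8150, 0.0107), (0.8082, 0.0108)),
    ("Qwen3 fp8", "flexible-extract"): ((0.8491, 0.0099), (0.8575, 0.0096)),
    ("Qwen3 fp8", "strict-match"): ((0.8127, 0.0107), (0.8127, 0.0107)),
    ("Qwen3.5", "flexible-extract"): ((0.8476, 0.0099), (0.8499, 0.0098)),
    ("Qwen3.5", "strict-match"): ((0.8370, 0.0102), (0.8332, 0.0103)),
}


def z_score(main, pr):
    """Difference (PR minus main) in units of the combined standard error.

    Assumes the two runs are independent (an assumption of this sketch).
    """
    (main_value, main_err), (pr_value, pr_err) = main, pr
    return (pr_value - main_value) / math.sqrt(main_err**2 + pr_err**2)


if __name__ == "__main__":
    for (model, filt), (main, pr) in ACCURACY.items():
        print(f"{model:10s} {filt:17s} z = {z_score(main, pr):+.2f}")
```

Every difference in the tables is smaller than one combined standard error, which is consistent with the PR leaving gsm8k accuracy unchanged within noise.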


Changed files

vllm/model_executor/models/qwen3_5.py (modified, +6/-2)
vllm/model_executor/models/qwen3_next.py (modified, +61/-3)
vllm/utils/multi_stream_utils.py (added, +48/-0)

Issue body

Native multi-stream support in torch.compile is expected soon. We should clean up custom op implementations such as https://github.com/vllm-project/vllm/pull/36795 once native multi-stream support is available in the next PyTorch release.

extent analysis

Fix Plan

The fix involves removing the custom op implementation for multi-stream support in torch.compile once native support is available.

Steps to Fix

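Wait for the next PyTorch release with native multi-stream support in torch.compile.
Remove the custom op implementation:
- Delete the custom op code added in PR #36795. The PR's changed files are vllm/utils/multi_stream_utils.py, vllm/model_executor/models/qwen3_next.py, and vllm/model_executor/models/qwen3_5.py.
- Update the model code to rely on native multi-stream support in torch.compile instead.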

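Example code snippet (a sketch; the issue does not say how native multi-stream support is enabled, so no extra torch.compile option is shown):

import torch

# Placeholder module standing in for the real model (hypothetical).
model = torch.nn.Linear(8, 8)

# Before: the dual-stream input projection was wrapped in a custom op
# (PR #36795) so that torch.compile would treat it as an opaque call.

# After: with native multi-stream support, the plain module is compiled
# and the custom op wrapper is no longer needed.
model = torch.compile(model)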

Update any dependent code to use the native torch.compile API.
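While both code paths coexist, one option is to gate the custom op on the installed PyTorch version. The sketch below is hypothetical: the issue does not name the release that brings native multi-stream support, so the threshold is a caller-supplied parameter, and the function names are illustrative rather than vLLM or PyTorch APIs.

```python
import re


def parse_release(version):
    """Extract (major, minor, patch) from strings such as '2.9.1+cu128'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def use_native_multi_stream(torch_version, min_native_version):
    """Return True when torch_version is at least min_native_version.

    min_native_version is supplied by the caller because the issue does not
    state which release adds native multi-stream support.
    """
    return parse_release(torch_version) >= parse_release(min_native_version)


if __name__ == "__main__":
    # "2.10.0" is a placeholder threshold for the demo, not a known release.
    print(use_native_multi_stream("2.9.1+cu128", "2.10.0"))
    print(use_native_multi_stream("2.10.0", "2.10.0"))
```

In practice the first argument would be `torch.__version__`. This simple parse ignores pre-release suffixes, so a nightly such as `2.10.0a0` compares equal to `2.10.0`.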

Verification

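Verify that the model compiles successfully with the native torch.compile API.
Profile the model to confirm that the in_proj_qkvz and in_proj_ba kernels still launch in parallel, and rerun the serving benchmark and gsm8k accuracy checks from the PR to confirm there is no regression.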

Extra Tips

Monitor the PyTorch release notes for the native multi-stream support announcement.
Review the PyTorch documentation for any updates on using torch.compile with multi-stream support.
