vllm - ✅ (Solved) Fix Port custom ops to native Inductor multi-stream support [1 pull request, 2 comments, 3 participants]

vllm-project/vllm#37372 · Fetched 2026-04-08 00:53:13
Fix Action

Fixed

PR fix notes

PR #36795: [Perf] Enable dual stream execution of input projection for Qwen3

Description (problem / solution / changelog)

Purpose

This PR enables dual-stream execution of the input projection for Qwen3 Next.

  • Parallelize the execution of in_proj_qkvz and in_proj_ba in 2 streams, because their outputs are independent.
  • Wrap the implementation in a custom op for torch.compile.
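
The dual-stream pattern described in the two points above can be sketched in plain PyTorch as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the function name `dual_stream_projections`, the weight shapes, and the CPU fallback are made up for the example, and the custom-op wrapping for torch.compile is left out.

```python
import torch


def dual_stream_projections(x, w_qkvz, w_ba, aux_stream=None):
    """Compute two independent projections, overlapping them on CUDA if possible."""
    if aux_stream is None or not x.is_cuda:
        # CPU, or no side stream supplied: fall back to sequential execution.
        return x @ w_qkvz.t(), x @ w_ba.t()

    main_stream = torch.cuda.current_stream()
    # The side stream must not read x before the main stream has produced it.
    aux_stream.wait_stream(main_stream)
    with torch.cuda.stream(aux_stream):
        ba = x @ w_ba.t()
    # This matmul is queued on the main stream and may overlap with the one above.
    qkvz = x @ w_qkvz.t()
    # Rejoin: later main-stream work waits until the side stream is done.
    main_stream.wait_stream(aux_stream)
    # ba was allocated on the side stream but is consumed on the main stream.
    ba.record_stream(main_stream)
    return qkvz, ba


# Illustrative shapes only (not the real Qwen3 Next dimensions).
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 64, device=device)
w_qkvz = torch.randn(128, 64, device=device)
w_ba = torch.randn(16, 64, device=device)
aux = torch.cuda.Stream() if device == "cuda" else None

qkvz, ba = dual_stream_projections(x, w_qkvz, w_ba, aux)
```

The two `wait_stream` calls are what make the overlap safe: the first orders the side stream after the producer of `x`, the second orders all later main-stream work after the side-stream matmul.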

Profiling

Main:

<img width="1722" height="180" alt="Screenshot 2026-03-10 at 6 08 16 PM" src="https://github.com/user-attachments/assets/58f0249e-c9de-47a6-8f00-218781344132" />

PR:

<img width="1723" height="139" alt="Screenshot 2026-03-10 at 6 08 31 PM" src="https://github.com/user-attachments/assets/10cd4f6b-df85-42b2-b164-92aab942c98b" />

Main: nvjet_tst_64x8_64x16_4x2_h_bz_TNT (in_proj_qkvz) and nvjet_tst_64x8_64x16_1x2_h_bz_TNT (in_proj_ba) kernels launched sequentially.

PR: kernels launched in parallel.

Benchmarking

Benchmarked on H200.

  • Qwen3
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching
vllm bench serve \
        --model Qwen/Qwen3-Next-80B-A3B-Instruct \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  121.46    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              7.90      
Output token throughput (tok/s):         2371.20   
Peak output token throughput (tok/s):    2640.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          4175.46   
---------------Time to First Token----------------
Mean TTFT (ms):                          176.72    
Median TTFT (ms):                        193.47    
P99 TTFT (ms):                           222.77    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.18      
Median TPOT (ms):                        6.15      
P99 TPOT (ms):                           6.48      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.18      
Median ITL (ms):                         6.13      
P99 ITL (ms):                            6.94      
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  120.20    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              7.99      
Output token throughput (tok/s):         2396.09   
Peak output token throughput (tok/s):    2672.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          4219.28   
---------------Time to First Token----------------
Mean TTFT (ms):                          191.90    
Median TTFT (ms):                        214.37    
P99 TTFT (ms):                           249.63    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          6.06      
Median TPOT (ms):                        6.04      
P99 TPOT (ms):                           6.40      
---------------Inter-token Latency----------------
Mean ITL (ms):                           6.06      
Median ITL (ms):                         6.02      
P99 ITL (ms):                            6.72      
==================================================
  • Qwen3 fp8
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching
vllm bench serve \
        --model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8 \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  195.14    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              4.92      
Output token throughput (tok/s):         1475.89   
Peak output token throughput (tok/s):    1648.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2641.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          162.16    
Median TTFT (ms):                        156.37    
P99 TTFT (ms):                           234.19    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.33     
Median TPOT (ms):                        10.31     
P99 TPOT (ms):                           10.77     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.33     
Median ITL (ms):                         10.24     
P99 ITL (ms):                            11.43     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  191.31    
Total input tokens:                      219140    
Total generated tokens:                  288000    
Request throughput (req/s):              5.02      
Output token throughput (tok/s):         1505.38   
Peak output token throughput (tok/s):    1712.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2650.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          236.09    
Median TTFT (ms):                        236.95    
P99 TTFT (ms):                           380.30    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.87      
Median TPOT (ms):                        9.83      
P99 TPOT (ms):                           10.49     
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.87      
Median ITL (ms):                         9.81      
P99 ITL (ms):                            10.99     
==================================================
  • Qwen3.5
vllm serve Qwen/Qwen3.5-35B-A3B \
    --tensor-parallel-size 1 \
    --max-num-seqs 16 \
    --no-enable-prefix-caching
vllm bench serve \
        --model Qwen/Qwen3.5-35B-A3B \
        --dataset-name sharegpt \
        --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json \
        --sharegpt-output-len 300 \
        --num-prompts ${num_prompts} \
        --max-concurrency 16 \
        --num-warmups 50 \
        --ignore-eos \
        --temperature 0

Main:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  197.96    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              4.85      
Output token throughput (tok/s):         1454.81   
Peak output token throughput (tok/s):    1648.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2604.25   
---------------Time to First Token----------------
Mean TTFT (ms):                          142.74    
Median TTFT (ms):                        152.15    
P99 TTFT (ms):                           199.72    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.55     
Median TPOT (ms):                        10.59     
P99 TPOT (ms):                           11.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.55     
Median ITL (ms):                         10.24     
P99 ITL (ms):                            12.05     
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     960       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  190.61    
Total input tokens:                      227546    
Total generated tokens:                  288000    
Request throughput (req/s):              5.04      
Output token throughput (tok/s):         1510.93   
Peak output token throughput (tok/s):    1715.00   
Peak concurrent requests:                32.00     
Total token throughput (tok/s):          2704.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          169.85    
Median TTFT (ms):                        173.71    
P99 TTFT (ms):                           254.99    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.05     
Median TPOT (ms):                        10.06     
P99 TPOT (ms):                           10.28     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.05     
Median ITL (ms):                         10.06     
P99 ITL (ms):                            11.37     
==================================================
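
To read the three result sets side by side, the relative change of each headline metric can be computed from the reported means. The sketch below does only that arithmetic; the helper name `pct_change` and the dictionary layout are illustrative choices. Note that the Qwen3 fp8 runs report different total input token counts for Main and PR (227546 vs 219140), so that pair may not be a like-for-like comparison.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (Main, PR) values copied from the benchmark tables above.
results = {
    "Qwen3": {
        "mean_tpot_ms": (6.18, 6.06),
        "mean_ttft_ms": (176.72, 191.90),
        "output_tok_s": (2371.20, 2396.09),
    },
    "Qwen3 fp8": {
        "mean_tpot_ms": (10.33, 9.87),
        "mean_ttft_ms": (162.16, 236.09),
        "output_tok_s": (1475.89, 1505.38),
    },
    "Qwen3.5": {
        "mean_tpot_ms": (10.55, 10.05),
        "mean_ttft_ms": (142.74, 169.85),
        "output_tok_s": (1454.81, 1510.93),
    },
}

for model, metrics in results.items():
    summary = ", ".join(
        f"{name} {pct_change(main, pr):+.2f}%" for name, (main, pr) in metrics.items()
    )
    print(f"{model}: {summary}")
```

On these numbers, mean TPOT drops by roughly 2% to 5% and output throughput rises by roughly 1% to 4%, while mean TTFT is higher with the PR in all three configurations.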

Accuracy Testing

  • Qwen3
python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8575|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8150|±  |0.0107|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8552|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8082|±  |0.0108|
  • Qwen3 fp8
python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8491|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8575|±  |0.0096|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|
  • Qwen3.5
python3 -m lm_eval --model local-completions \
  --model_args model=Qwen/Qwen3.5-35B-A3B,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=16 \
  --tasks gsm8k

Main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8476|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8370|±  |0.0102|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8499|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8332|±  |0.0103|
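
A quick way to judge whether the Main and PR accuracy numbers differ meaningfully is to compare each difference against the combined standard error. The sketch below assumes the two runs can be treated as independent, which is only an approximation since both use the same gsm8k test set; the helper name `within_noise` and the one-standard-error threshold are illustrative choices.

```python
import math


def within_noise(main, pr, k=1.0):
    """True if |main - pr| is at most k combined standard errors."""
    (value_main, se_main), (value_pr, se_pr) = main, pr
    combined_se = math.sqrt(se_main**2 + se_pr**2)
    return abs(value_main - value_pr) <= k * combined_se


# ((Main value, Main stderr), (PR value, PR stderr)) copied from the tables above.
gsm8k = {
    ("Qwen3", "flexible-extract"): ((0.8575, 0.0096), (0.8552, 0.0097)),
    ("Qwen3", "strict-match"): ((0.8150, 0.0107), (0.8082, 0.0108)),
    ("Qwen3 fp8", "flexible-extract"): ((0.8491, 0.0099), (0.8575, 0.0096)),
    ("Qwen3 fp8", "strict-match"): ((0.8127, 0.0107), (0.8127, 0.0107)),
    ("Qwen3.5", "flexible-extract"): ((0.8476, 0.0099), (0.8499, 0.0098)),
    ("Qwen3.5", "strict-match"): ((0.8370, 0.0102), (0.8332, 0.0103)),
}

for (model, filter_name), (main, pr) in gsm8k.items():
    verdict = "within noise" if within_noise(main, pr) else "differs"
    print(f"{model} / {filter_name}: {verdict}")
```

Under this check, every Main/PR pair falls within one combined standard error.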


Changed files

  • vllm/model_executor/models/qwen3_5.py (modified, +6/-2)
  • vllm/model_executor/models/qwen3_next.py (modified, +61/-3)
  • vllm/utils/multi_stream_utils.py (added, +48/-0)
Issue description

Native multi-stream support in torch.compile is expected soon. Once it is available in the next PyTorch release, we should clean up custom op implementations such as https://github.com/vllm-project/vllm/pull/36795.

extent analysis

Fix Plan

The fix involves removing the custom op implementation for multi-stream support in torch.compile once native support is available.

Steps to Fix

  • Wait for the next PyTorch release with native multi-stream support.
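  • Remove the custom op implementation:
    • Delete the custom op code added in PR #36795, e.g., the code the PR added in vllm/utils/multi_stream_utils.py.
    • Update the model code (qwen3_next.py, qwen3_5.py) to rely on native multi-stream support under torch.compile.
  • Example code snippet (illustrative; the module and class names are placeholders):
    import torch
    
    # Before (custom op implementation)
    # from custom_op import CustomOp
    # model = torch.compile(CustomOp())
    
    # After (native multi-stream support). Any option needed to turn the
    # native feature on is not known yet; check the PyTorch release notes.
    model = torch.nn.Linear(8, 8)  # stand-in for the real model
    model = torch.compile(model, dynamic=True)
  • Update any dependent code to use the native torch.compile API.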
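
During the transition, the steps above could be gated on the installed PyTorch version so that the custom op path stays in place until native support ships. The sketch below is one possible shape for such a gate. The threshold `MIN_NATIVE_MULTI_STREAM` is a hypothetical placeholder, since the issue does not name the release, and both helper names are invented for this example.

```python
import re

# Hypothetical placeholder: the first PyTorch release with native multi-stream
# support is not named in the issue, so this value must be replaced.
MIN_NATIVE_MULTI_STREAM = (99, 0)


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.9.0+cu128'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def use_native_multi_stream(torch_version: str, minimum=MIN_NATIVE_MULTI_STREAM) -> bool:
    """True if the given PyTorch version is at or above the chosen threshold."""
    return parse_version(torch_version) >= minimum


# Example: choose a code path from a version string.
path = "native" if use_native_multi_stream("2.9.0+cu128") else "custom op"
print(path)
```

In real code the version string would come from `torch.__version__`; once the minimum supported PyTorch version passes the threshold, the gate and the custom op path can both be deleted.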

Verification

  • Verify that the model compiles successfully with the native torch.compile API.
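  • Profile the compiled model to confirm that the in_proj_qkvz and in_proj_ba kernels still overlap, and re-run the benchmark and gsm8k accuracy checks from PR #36795 to confirm there is no regression.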
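
When checking a profiler trace, as in the PR's Profiling section, overlap between the two projection kernels can be confirmed from their start and end timestamps. The sketch below shows that check; the `KernelSpan` type, the `overlap_us` helper, and all timestamps are made up for illustration, and only the kernel names come from the PR.

```python
from typing import NamedTuple


class KernelSpan(NamedTuple):
    name: str
    start_us: float
    end_us: float


def overlap_us(a: KernelSpan, b: KernelSpan) -> float:
    """Length of the time window in which both kernels are running."""
    return max(0.0, min(a.end_us, b.end_us) - max(a.start_us, b.start_us))


# Illustrative timestamps only; real values would come from a profiler trace.
qkvz = KernelSpan("nvjet_tst_64x8_64x16_4x2_h_bz_TNT", 0.0, 40.0)
ba_sequential = KernelSpan("nvjet_tst_64x8_64x16_1x2_h_bz_TNT", 40.0, 50.0)
ba_parallel = KernelSpan("nvjet_tst_64x8_64x16_1x2_h_bz_TNT", 5.0, 15.0)

print(overlap_us(qkvz, ba_sequential))  # sequential launch: no overlap
print(overlap_us(qkvz, ba_parallel))    # parallel launch: positive overlap
```

A zero overlap after the migration would suggest the native path is not scheduling the two projections on separate streams.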

Extra Tips

  • Monitor the PyTorch release notes for the native multi-stream support announcement.
  • Review the PyTorch documentation for any updates on using torch.compile with multi-stream support.
