vllm - ✅(Solved) Fix [Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 Bf16 DP/EP [2 pull requests, 15 comments, 3 participants]

robertgshaw2-redhat · 2026-03-21T19:19:01Z


vllm-project/vllm#37758•Fetched 2026-04-08 01:13:01


Timeline (top)

commented ×15, subscribed ×4, cross-referenced ×1, labeled ×1

PR fix notes

PR #36838: enable flashinfer moe kernel for DP + EP

Repository: vllm-project/vllm
Author: czhu-cohere
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36838

Description (problem / solution / changelog)

Purpose

Previously, the BF16 FlashInfer MoE kernel was disabled when dp > 1. I think the kernel itself should be able to support it; we just need to enable it on the vLLM side. This PR also adds a test to verify that the kernel selection logic works as intended.

Test Plan

Run pytest tests/kernels/moe/test_unquantized_backend_selection.py, then run GSM8K with BF16 Qwen3-30B-A3B on 2×B200 (DP2, EP2) and compare the results across the different MoE backends.

server command

# triton/default backend
vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

# flashinfer cutlass
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

# flashinfer trtllm
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

test command

python -m lm_eval \
  --model local-completions \
  --model_args "model=Qwen/Qwen3-30B-A3B,base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=5,tokenized_requests=False,tokenizer=Qwen/Qwen3-30B-A3B" \
  --tasks gsm8k_cot \
  --batch_size auto \
  --log_samples \
  --output_path /tmp/lm_eval_qwen_dp2_ep

Test Result

Backend	flexible-extract	stderr
Triton (default)	0.8870	±0.0087
FlashInfer CUTLASS (throughput)	0.8992	±0.0083
FlashInfer TRTLLM (latency)	0.9007	±0.0082

pytest tests/kernels/moe/test_unquantized_backend_selection.py passes.

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

tests/kernels/moe/test_unquantized_backend_selection.py (modified, +88/-2)
vllm/model_executor/layers/fused_moe/oracle/unquantized.py (modified, +1/-7)
vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +8/-0)

PR #39593: [Bug]: Fix FlashInfer CUTLASS BF16 + CUDA graphs IMA

Repository: vllm-project/vllm
Author: yzong-rh
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39593

Description (problem / solution / changelog)

Purpose

Fix illegal memory access crash in FlashInfer CUTLASS BF16 MoE with DP/DEP + CUDA graphs.

Closes https://github.com/vllm-project/vllm/issues/37758

Root cause: When CUDA graphs are enabled under data parallelism, all ranks pad to the same batch size. Attention often produces NaN for padding tokens. In BF16, NaN propagates through RMSNorm, projections, and residual adds until it reaches the MoE router. The topK kernel picks the same expert index K times when encountering NaN values. FlashInfer CUTLASS MoE kernels do not expect duplicate expert indices per token; the resulting out-of-bounds access causes a hard CUDA error.

<details> <summary>NaNs in padding tokens post-attention</summary>

Instrumentation of attention output (Qwen3-30B-A3B BF16, piecewise CUDA graphs, FlashInfer CUTLASS MoE, GSM8K eval) confirms NaN is present in padding rows on virtually every padded step:

Config	Steps with padding	NaN in padding	Inf in padding	Mean pad tokens
1 GPU	62	62 (100%)	5	4.0
DP=2	832	829 (99.6%)	22	16.7
DEP=2	729	729 (100%)	11	14.2

DP/DEP produce ~13× more padding events with much larger pad regions than single-GPU bucket rounding, which is why the bug manifests under DP but not single GPU.
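
As a quick check of the ratios implied by the table above, the reported counts can be compared directly. This is plain arithmetic on the numbers in the table; the variable names are illustrative only and do not come from the PR.

```python
# Values copied from the padding table above.
steps_with_padding = {"1 GPU": 62, "DP=2": 832, "DEP=2": 729}
mean_pad_tokens = {"1 GPU": 4.0, "DP=2": 16.7, "DEP=2": 14.2}

baseline = "1 GPU"
step_ratios = {}
for config in ("DP=2", "DEP=2"):
    # How many times more padded steps than the single-GPU run.
    step_ratio = steps_with_padding[config] / steps_with_padding[baseline]
    # How much larger the average padded region is.
    pad_ratio = mean_pad_tokens[config] / mean_pad_tokens[baseline]
    step_ratios[config] = step_ratio
    print(f"{config}: {step_ratio:.1f}x padded steps, {pad_ratio:.2f}x mean pad tokens")
```

The DP=2 row gives the "~13×" figure quoted above; the DEP=2 row comes out slightly lower, at a little under 12×.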

</details> <details> <summary>Flashinfer CUTLASS kernel IMA with duplicate expert indices</summary>

cutlass_fp8_bf16_simple.py shows that calling FlashInfer's cutlass_fused_moe in eager mode with all tokens routed to the same expert (simulating the duplicate-ID condition) crashes with an illegal memory access on both BF16 and FP8 paths.

cutlass_fp8_bf16_simple.py

</details>

FP8 was immune because the FP8 quantization kernel (scaled_fp8_conversion) uses fminf/fmaxf, which silently clamp NaN to ±448.0, acting as an implicit NaN firewall at every linear layer.
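
The clamping behaviour described above can be sketched in plain Python. This is an emulation of the C/CUDA fminf/fmaxf rule (when exactly one operand is NaN, the numeric operand is returned), not the actual scaled_fp8_conversion code; the helper names and the order of the two clamps are assumptions made for illustration.

```python
import math

FP8_E4M3_MAX = 448.0  # clamp bound quoted in the text above


def fmaxf(a: float, b: float) -> float:
    """Emulate C fmaxf: if exactly one operand is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def fminf(a: float, b: float) -> float:
    """Emulate C fminf: if exactly one operand is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def clamp_for_fp8(x: float) -> float:
    # Assumed clamp order (lower bound first); the real kernel may differ,
    # which only changes whether NaN lands on +448.0 or -448.0.
    return fminf(fmaxf(x, -FP8_E4M3_MAX), FP8_E4M3_MAX)


for value in (float("nan"), 1000.0, 0.5):
    print(value, "->", clamp_for_fp8(value))
```

A NaN input leaves the clamp as a finite value of magnitude 448.0, so nothing downstream of the quantization step sees NaN. The BF16 path has no equivalent step, which is why NaN can reach the router there.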

Fix: Sanitize NaN in the fused softmax/sigmoid+topk CUDA kernel (csrc/moe/topk_softmax_kernels.cu). This is applied in three places:

moeSoftmax (fallback path for non-power-of-2 expert counts): NaN is replaced with -FLT_MAX in the max-reduce, exp-sum, and final softmax loops.
moeSigmoid (fallback sigmoid path): NaN is replaced before the sigmoid computation.
topkGating (fused warp-level path for power-of-2 / multiple-of-64 expert counts): NaN is replaced in-register immediately after loading, before softmax/sigmoid and argmax.

topK now returns K distinct indices, never duplicates.
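
The failure mode and the fix can be illustrated with a simplified pure-Python model of softmax followed by an iterative argmax. It is a sketch, not the CUDA kernel: the function names, the comparison-based argmax, and the -10000.0 mask value are assumptions chosen to show why NaN comparisons keep selecting index 0 and why replacing NaN with -FLT_MAX restores distinct indices.

```python
import math

FLT_MAX = 3.4028234663852886e38  # largest finite float32 value


def sanitize(row):
    """Replace NaN logits with -FLT_MAX, as the fix describes."""
    return [-FLT_MAX if math.isnan(v) else v for v in row]


def softmax(row):
    # Max-reduce with a plain comparison; any comparison involving NaN is False.
    m = row[0]
    for v in row[1:]:
        if v > m:
            m = v
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def topk_indices(probs, k):
    """Pick k experts by repeated argmax, masking each winner afterwards."""
    work = list(probs)
    chosen = []
    for _ in range(k):
        best_idx, best_val = 0, work[0]
        for i in range(1, len(work)):
            if work[i] > best_val:  # never True when either side is NaN
                best_idx, best_val = i, work[i]
        chosen.append(best_idx)
        work[best_idx] = -10000.0  # assumed mask value for the selected expert
    return chosen


nan_row = [float("nan")] * 8
print("unsanitized:", topk_indices(softmax(nan_row), 4))
print("sanitized:  ", topk_indices(softmax(sanitize(nan_row)), 4))
```

In this model the unsanitized all-NaN row selects expert 0 on every iteration, while the sanitized row becomes a uniform distribution and yields four distinct experts. A row with a single NaN behaves the same way before sanitization, because the NaN poisons the softmax sum for the whole row.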

Test Plan

tests/kernels/moe/test_routing.py
- Includes the additional tests test_topk_nan_row_distinct_experts and test_grouped_topk_nan_row_distinct_experts, which inject all-NaN rows into top-k. They verify that NaN rows return distinct expert indices and that non-NaN rows match the baseline.
End-to-end GSM8K correctness (Qwen3-30B-A3B BF16 and Qwen3.5-35B-A3B BF16, FlashInfer CUTLASS MoE, piecewise CUDA graphs):

Model	Measured	Expected
Qwen3-30B-A3B BF16 (DP=2)	0.8870	0.8800
Qwen3-30B-A3B BF16 (DEP=2)	0.8901	0.8800
Qwen3.5-35B-A3B BF16 (DP=2)	0.8408	0.8400
Qwen3.5-35B-A3B BF16 (DEP=2)	0.8560	0.8400

Micro-benchmark (benchmark_router_select_experts.py, 128 tokens × 128 experts, top_k=8, bfloat16, 10% NaNs, B200, mirrors Qwen3-30B):

The NaN check is fused into the existing kernel rather than added as a separate torch.nan_to_num(-float('inf')) call before routing. A separate nan_to_num launch adds ~30% overhead:

Variant	Kernel	p20	p50	p80
Before fix	topk_softmax	10.21 µs	10.27 µs	10.37 µs
Before fix	topk_sigmoid	10.24 µs	12.13 µs	12.32 µs
After fix (fused)	topk_softmax	10.11 µs	10.24 µs	10.30 µs
After fix (fused)	topk_sigmoid	10.24 µs	12.16 µs	10.32 µs
torch.nan_to_num	topk_softmax	11.42 µs	13.28 µs	13.44 µs
torch.nan_to_num	topk_sigmoid	13.15 µs	13.31 µs	13.44 µs

benchmark_router_select_experts.py
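
The overhead figure can be reproduced from the p50 column of the table above. The snippet below is only arithmetic on the reported medians; the dictionary layout and function name are illustrative.

```python
# p50 latencies in microseconds, copied from the micro-benchmark table.
p50_us = {
    ("before", "topk_softmax"): 10.27,
    ("before", "topk_sigmoid"): 12.13,
    ("fused", "topk_softmax"): 10.24,
    ("fused", "topk_sigmoid"): 12.16,
    ("nan_to_num", "topk_softmax"): 13.28,
    ("nan_to_num", "topk_sigmoid"): 13.31,
}


def overhead_pct(variant: str, kernel: str) -> float:
    """Relative change in p50 latency versus the pre-fix kernel, in percent."""
    return (p50_us[(variant, kernel)] / p50_us[("before", kernel)] - 1.0) * 100.0


for variant in ("fused", "nan_to_num"):
    for kernel in ("topk_softmax", "topk_sigmoid"):
        print(f"{variant:>10} {kernel}: {overhead_pct(variant, kernel):+.1f}%")
```

On these numbers the "~30%" figure corresponds to the topk_softmax median; the topk_sigmoid median shows a smaller increase of roughly 10%, and the fused variant stays within a fraction of a percent of the baseline for both kernels.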

E2E Latency:

vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --max-model-len 8192

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-30B-A3B \
  --endpoint /v1/completions \
  --num-prompts 128 \
  --random-input-len 2 \
  --random-output-len 512 \
  --num-warmups 128 \
  --request-rate inf \
  --temperature 0

Results are within run-to-run variance.

<details> <summary>Before fix:</summary>

============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.74      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              27.01     
Output token throughput (tok/s):         13826.70  
Peak output token throughput (tok/s):    14336.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13880.71  
---------------Time to First Token----------------
Mean TTFT (ms):                          97.14     
Median TTFT (ms):                        99.54     
P99 TTFT (ms):                           116.43    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.06      
Median TPOT (ms):                        9.06      
P99 TPOT (ms):                           9.06      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.06      
Median ITL (ms):                         9.06      
P99 ITL (ms):                            10.59     
==================================================

</details> <details> <summary>After fix:</summary>

============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.72      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              27.11     
Output token throughput (tok/s):         13881.52  
Peak output token throughput (tok/s):    14336.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13935.74  
---------------Time to First Token----------------
Mean TTFT (ms):                          95.84     
Median TTFT (ms):                        97.08     
P99 TTFT (ms):                           109.43    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.03      
Median TPOT (ms):                        9.03      
P99 TPOT (ms):                           9.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.03      
Median ITL (ms):                         9.00      
P99 ITL (ms):                            11.15     
==================================================

</details> <details> <summary>torch.nan_to_num</summary>

============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.83      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              26.48     
Output token throughput (tok/s):         13556.04  
Peak output token throughput (tok/s):    14031.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13608.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          97.80     
Median TTFT (ms):                        104.58    
P99 TTFT (ms):                           115.65    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.23      
Median TPOT (ms):                        9.23      
P99 TPOT (ms):                           9.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.23      
Median ITL (ms):                         9.22      
P99 ITL (ms):                            11.01     
==================================================

</details>

Test Result

See above.


Changed files

csrc/moe/topk_softmax_kernels.cu (modified, +8/-3)
tests/kernels/moe/test_routing.py (modified, +168/-3)
vllm/model_executor/layers/fused_moe/oracle/unquantized.py (modified, +0/-6)


Your current environment

b200, main

🐛 Describe the bug

Both of these commands fail:

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt


extent analysis

Fix Plan

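Setting the environment variables is not the fix: the two commands in the bug report already set VLLM_USE_FLASHINFER_MOE_FP16=1 and VLLM_FLASHINFER_MOE_BACKEND, and they still fail. According to the PR fix notes above, the failure is addressed in vLLM itself by two changes, both of which are listed as open (not merged) in this report.

Step-by-Step Solution

Use a vLLM build that includes PR #36838, which enables the BF16 FlashInfer MoE kernel when dp > 1 (the backend selection previously disabled it).
Use a vLLM build that includes PR #39593, which sanitizes NaN values in csrc/moe/topk_softmax_kernels.cu so that top-k returns K distinct expert indices and the FlashInfer CUTLASS kernel no longer hits an illegal memory access under DP/DEP with CUDA graphs. The PR description covers the CUTLASS crash; it does not separately describe the TRTLLM failure.
Re-run the original commands unchanged. VLLM_USE_FLASHINFER_MOE_FP16=1 selects the FlashInfer kernel for the BF16 MoE path, and VLLM_FLASHINFER_MOE_BACKEND chooses latency (TRTLLM) or throughput (CUTLASS). No extra pytest option is needed; the commands already pass --gpus 2 to chg run.

Example command (latency backend):

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency \
chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py \
--config-list-file=configs/models-qwen35-blackwell.txt

Alternatively, for the throughput backend:

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput \
chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py \
--config-list-file=configs/models-qwen35-blackwell.txt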

Verification

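Run both commands and verify that the GSM8K correctness tests pass without a CUDA illegal memory access. For reference, PR #39593 reports measured GSM8K scores of 0.8408 (DP=2) and 0.8560 (DEP=2) for Qwen3.5-35B-A3B BF16, against an expected 0.8400.

Extra Tips

The routing unit tests added in PR #39593 (tests/kernels/moe/test_routing.py) check the NaN handling in isolation and can be run before the end-to-end evaluation.
If issues persist, unset VLLM_USE_FLASHINFER_MOE_FP16 and VLLM_FLASHINFER_MOE_BACKEND to fall back to the default Triton backend, which confirms whether the failure is specific to the FlashInfer backends.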
