vllm - ✅(Solved) Fix [Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 Bf16 DP/EP [2 pull requests, 15 comments, 3 participants]

GitHub stats
vllm-project/vllm#37758 · Fetched 2026-04-08 01:13:01
Comments: 15
Participants: 3
Timeline events: 23
Reactions: 0
Timeline (top): commented ×15, subscribed ×4, cross-referenced ×1, labeled ×1

PR fix notes

PR #36838: enable flashinfer moe kernel for DP + EP

Description (problem / solution / changelog)

Purpose

Previously, the BF16 FlashInfer MoE kernel was disabled when dp > 1. The kernel itself should be able to support this case; it only needs to be enabled on the vLLM side. This PR also adds a test to verify that the kernel selection logic works as intended.

Test Plan

pytest tests/kernels/moe/test_unquantized_backend_selection.py

Run GSM8K with BF16 Qwen3-30B-A3B on 2×B200 (DP2, EP2) and compare the results across the different MoE backends.

server command

# triton/default backend
vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

# flashinfer cutlass
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

# flashinfer trtllm
VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000

test command

python -m lm_eval \
  --model local-completions \
  --model_args "model=Qwen/Qwen3-30B-A3B,base_url=http://localhost:8000/v1/completions,num_concurrent=128,max_retries=5,tokenized_requests=False,tokenizer=Qwen/Qwen3-30B-A3B" \
  --tasks gsm8k_cot \
  --batch_size auto \
  --log_samples \
  --output_path /tmp/lm_eval_qwen_dp2_ep

Test Result

| Backend | flexible-extract | stderr |
|---|---|---|
| Triton (default) | 0.8870 | ±0.0087 |
| FlashInfer CUTLASS (throughput) | 0.8992 | ±0.0083 |
| FlashInfer TRTLLM (latency) | 0.9007 | ±0.0082 |

pytest tests/kernels/moe/test_unquantized_backend_selection.py passes.



Changed files

  • tests/kernels/moe/test_unquantized_backend_selection.py (modified, +88/-2)
  • vllm/model_executor/layers/fused_moe/oracle/unquantized.py (modified, +1/-7)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +8/-0)

PR #39593: [Bug]: Fix FlashInfer CUTLASS BF16 + CUDA graphs IMA

Description (problem / solution / changelog)

Purpose

Fix illegal memory access crash in FlashInfer CUTLASS BF16 MoE with DP/DEP + CUDA graphs.

Closes https://github.com/vllm-project/vllm/issues/37758

Root cause: When CUDA graphs are enabled under data parallelism, all ranks pad to the same batch size. Attention often produces NaN for padding tokens. In BF16, NaN propagates through RMSNorm, projections, and residual adds until it reaches the MoE router. The topK kernel picks the same expert index K times when it encounters NaN values. The FlashInfer CUTLASS MoE kernels do not expect duplicate expert indices per token, and the resulting out-of-bounds access causes a hard CUDA error.
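The duplicate-index behaviour described in the root cause can be reproduced with a small pure-Python model. This is a simplified sketch, not the vLLM kernel: `naive_topk` is a hypothetical helper that assumes the selection loop relies only on `>` comparisons and masks each winner after picking it. Because every comparison involving NaN is false, an all-NaN row never moves off index 0.

```python
import math


def naive_topk(logits, k):
    """Iterative argmax top-k using only '>' comparisons (simplified model)."""
    work = list(logits)
    picked = []
    for _ in range(k):
        best_idx, best_val = 0, work[0]
        for i in range(1, len(work)):
            # Any comparison involving NaN is False, so NaN never replaces the best.
            if work[i] > best_val:
                best_idx, best_val = i, work[i]
        picked.append(best_idx)
        work[best_idx] = -math.inf  # mask the winner for the next round
    return picked


healthy = naive_topk([0.1, 2.0, -1.0, 0.7], k=2)
nan_row = naive_topk([math.nan] * 4, k=2)

print(healthy)  # [1, 3]: two distinct experts
print(nan_row)  # [0, 0]: the same expert picked twice
```

In this model the masked slot still wins the second round, because `nan > -inf` is also false. A downstream kernel that assumes K distinct experts per token would then be handed an input it was not written for.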

<details> <summary>NaNs in padding tokens post-attention</summary>

Instrumentation of attention output (Qwen3-30B-A3B BF16, piecewise CUDA graphs, FlashInfer CUTLASS MoE, GSM8K eval) confirms NaN is present in padding rows on virtually every padded step:

| Config | Steps with padding | NaN in padding | Inf in padding | Mean pad tokens |
|---|---|---|---|---|
| 1 GPU | 62 | 62 (100%) | 5 | 4.0 |
| DP=2 | 832 | 829 (99.6%) | 22 | 16.7 |
| DEP=2 | 729 | 729 (100%) | 11 | 14.2 |

DP/DEP produce ~13× more padding events with much larger pad regions than single-GPU bucket rounding, which is why the bug manifests under DP but not single GPU.
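A rough way to see why padding grows under DP is to model the padding rule directly. The sketch below is illustrative only: `CAPTURE_SIZES`, `padded_size`, and `pad_tokens_dp` are hypothetical names, the bucket list is made up, and it assumes each rank pads up to the capture size that fits the largest rank.

```python
import bisect

# Hypothetical CUDA graph capture sizes; real deployments configure their own list.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64]


def padded_size(num_tokens):
    """Smallest capture size that fits num_tokens (must not exceed the largest bucket)."""
    return CAPTURE_SIZES[bisect.bisect_left(CAPTURE_SIZES, num_tokens)]


def pad_tokens_single(num_tokens):
    """Padding on one GPU: only bucket rounding."""
    return padded_size(num_tokens) - num_tokens


def pad_tokens_dp(tokens_per_rank):
    """Padding under DP: every rank pads to the bucket of the busiest rank."""
    target = padded_size(max(tokens_per_rank))
    return [target - n for n in tokens_per_rank]


single = pad_tokens_single(7)   # 7 tokens round up to the 8-token bucket
dp = pad_tokens_dp([7, 30])     # both ranks pad to the 32-token bucket

print(single)  # 1
print(dp)      # [25, 2]
```

Under these assumptions the lightly loaded rank carries 25 padding rows instead of 1, which is the kind of enlarged pad region the table above reports for DP and DEP.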

</details> <details> <summary>Flashinfer CUTLASS kernel IMA with duplicate expert indices</summary>

cutlass_fp8_bf16_simple.py shows that calling FlashInfer's cutlass_fused_moe in eager mode with all tokens routed to the same expert (simulating the duplicate-ID condition) crashes with an illegal memory access on both BF16 and FP8 paths.

cutlass_fp8_bf16_simple.py

</details>

FP8 was immune because the FP8 quantization kernel (scaled_fp8_conversion) uses fminf/fmaxf, which silently clamp NaN to ±448.0, acting as an implicit NaN firewall at every linear layer.
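The clamping effect can be illustrated by emulating the C `fminf`/`fmaxf` rule that, when exactly one argument is NaN, the other argument is returned. This is a sketch of that IEEE behaviour in Python, not the `scaled_fp8_conversion` source; the helper names and the nesting order of the clamp are assumptions, and the nesting order determines whether NaN lands on +448.0 or -448.0.

```python
import math

FP8_E4M3_MAX = 448.0  # the ±448.0 bound quoted above


def fminf(a, b):
    """Emulate C fminf: a single NaN argument is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return min(a, b)


def fmaxf(a, b):
    """Emulate C fmaxf: a single NaN argument is ignored."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def clamp_to_fp8_range(x):
    """Clamp to [-448, 448]; the nesting order here is an assumption."""
    return fmaxf(-FP8_E4M3_MAX, fminf(x, FP8_E4M3_MAX))


print(clamp_to_fp8_range(math.nan))  # 448.0: NaN is replaced by a finite bound
print(clamp_to_fp8_range(1000.0))    # 448.0
print(clamp_to_fp8_range(-3.0))      # -3.0
```

Python's built-in `min` and `max` do not follow this rule (their result depends on argument order when NaN is present), which is why the sketch spells the NaN checks out explicitly.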

Fix: Sanitize NaN in the fused softmax/sigmoid+topk CUDA kernel (csrc/moe/topk_softmax_kernels.cu). This is applied in three places:

  1. moeSoftmax (fallback path for non-power-of-2 expert counts): NaN is replaced with -FLT_MAX in the max-reduce, exp-sum, and final softmax loops.
  2. moeSigmoid (fallback sigmoid path): NaN is replaced before the sigmoid computation.
  3. topkGating (fused warp-level path for power-of-2 / multiple-of-64 expert counts): NaN is replaced in-register immediately after loading, before softmax/sigmoid and argmax.

topK now returns K distinct indices, never duplicates.
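The effect of the sanitization can be sketched in pure Python. This is a simplified model under stated assumptions, not the CUDA kernel: `softmax_nan_safe` and `topk` are hypothetical helpers, NaN is replaced with `-FLT_MAX` before the softmax as the fix describes, and the selection loop is a plain iterative argmax that masks each winner.

```python
import math

FLT_MAX = 3.4028234663852886e38  # largest finite float32, standing in for C's FLT_MAX


def softmax_nan_safe(logits):
    """Replace NaN with -FLT_MAX, then compute a numerically stable softmax."""
    clean = [-FLT_MAX if math.isnan(x) else x for x in logits]
    peak = max(clean)
    exps = [math.exp(x - peak) for x in clean]
    total = sum(exps)
    return [e / total for e in exps]


def topk(scores, k):
    """Iterative argmax with masking; ties resolve to the lowest index."""
    work = list(scores)
    picked = []
    for _ in range(k):
        best = 0
        for i in range(1, len(work)):
            if work[i] > work[best]:
                best = i
        picked.append(best)
        work[best] = -math.inf  # mask the winner for the next round
    return picked


all_nan = topk(softmax_nan_safe([math.nan] * 4), k=2)
mixed = topk(softmax_nan_safe([math.nan, 2.0, math.nan, 0.5]), k=2)

print(all_nan)  # [0, 1]: distinct experts even for an all-NaN row
print(mixed)    # [1, 3]: the finite logits win
```

Once the scores are finite, masking the previous winner works as intended, so an all-NaN row degrades to a uniform distribution with distinct indices instead of repeating one expert.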

Test Plan

  1. tests/kernels/moe/test_routing.py

    • Includes additional tests test_topk_nan_row_distinct_experts and test_grouped_topk_nan_row_distinct_experts, which inject all-NaN rows into top-k. They verify that NaN rows return distinct expert indices and that non-NaN rows match the baseline.
  2. End-to-end GSM8K correctness (Qwen3-30B-A3B BF16 and Qwen3.5-35B-A3B BF16, FlashInfer CUTLASS MoE, piecewise CUDA graphs):

    | Model | Measured | Expected |
    |---|---|---|
    | Qwen3-30B-A3B BF16 (DP=2) | 0.8870 | 0.8800 |
    | Qwen3-30B-A3B BF16 (DEP=2) | 0.8901 | 0.8800 |
    | Qwen3.5-35B-A3B BF16 (DP=2) | 0.8408 | 0.8400 |
    | Qwen3.5-35B-A3B BF16 (DEP=2) | 0.8560 | 0.8400 |
  3. Micro-benchmark (benchmark_router_select_experts.py, 128 tokens × 128 experts, top_k=8, bfloat16, 10% NaNs, B200, mirrors Qwen3-30B):

    The NaN check is fused into the existing kernel rather than added as a separate torch.nan_to_num(-float('inf')) call before routing. A separate nan_to_num launch adds ~30% overhead:

    | Variant | Kernel | p20 | p50 | p80 |
    |---|---|---|---|---|
    | Before fix | topk_softmax | 10.21 µs | 10.27 µs | 10.37 µs |
    | Before fix | topk_sigmoid | 10.24 µs | 12.13 µs | 12.32 µs |
    | After fix (fused) | topk_softmax | 10.11 µs | 10.24 µs | 10.30 µs |
    | After fix (fused) | topk_sigmoid | 10.24 µs | 12.16 µs | 10.32 µs |
    | torch.nan_to_num | topk_softmax | 11.42 µs | 13.28 µs | 13.44 µs |
    | torch.nan_to_num | topk_sigmoid | 13.15 µs | 13.31 µs | 13.44 µs |

benchmark_router_select_experts.py

  4. E2E latency:
vllm serve Qwen/Qwen3-30B-A3B \
  --data-parallel-size 2 \
  --max-model-len 8192

vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-30B-A3B \
  --endpoint /v1/completions \
  --num-prompts 128 \
  --random-input-len 2 \
  --random-output-len 512 \
  --num-warmups 128 \
  --request-rate inf \
  --temperature 0

Results are within run-to-run variance.

<details> <summary>Before fix:</summary>
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.74      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              27.01     
Output token throughput (tok/s):         13826.70  
Peak output token throughput (tok/s):    14336.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13880.71  
---------------Time to First Token----------------
Mean TTFT (ms):                          97.14     
Median TTFT (ms):                        99.54     
P99 TTFT (ms):                           116.43    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.06      
Median TPOT (ms):                        9.06      
P99 TPOT (ms):                           9.06      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.06      
Median ITL (ms):                         9.06      
P99 ITL (ms):                            10.59     
==================================================
</details> <details> <summary>After fix:</summary>
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.72      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              27.11     
Output token throughput (tok/s):         13881.52  
Peak output token throughput (tok/s):    14336.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13935.74  
---------------Time to First Token----------------
Mean TTFT (ms):                          95.84     
Median TTFT (ms):                        97.08     
P99 TTFT (ms):                           109.43    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.03      
Median TPOT (ms):                        9.03      
P99 TPOT (ms):                           9.03      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.03      
Median ITL (ms):                         9.00      
P99 ITL (ms):                            11.15     
==================================================
</details> <details> <summary>torch.nan_to_num</summary>
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  4.83      
Total input tokens:                      256       
Total generated tokens:                  65536     
Request throughput (req/s):              26.48     
Output token throughput (tok/s):         13556.04  
Peak output token throughput (tok/s):    14031.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          13608.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          97.80     
Median TTFT (ms):                        104.58    
P99 TTFT (ms):                           115.65    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.23      
Median TPOT (ms):                        9.23      
P99 TPOT (ms):                           9.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.23      
Median ITL (ms):                         9.22      
P99 ITL (ms):                            11.01     
==================================================
</details>
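To put the reported numbers side by side, the relative changes can be computed directly from the figures quoted in this PR description. The snippet is plain arithmetic on those values; `pct_change` is a hypothetical helper and nothing here is a new measurement.

```python
def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Micro-benchmark, topk_softmax p50 latency in µs: before fix vs. separate torch.nan_to_num.
micro_overhead = pct_change(10.27, 13.28)

# E2E output token throughput in tok/s, relative to the "Before fix" run.
e2e_fused = pct_change(13826.70, 13881.52)
e2e_nan_to_num = pct_change(13826.70, 13556.04)

print(f"micro nan_to_num overhead: {micro_overhead:.1f}%")  # roughly 29%
print(f"e2e fused fix:             {e2e_fused:+.2f}%")      # well under 1%
print(f"e2e torch.nan_to_num:      {e2e_nan_to_num:+.2f}%") # about a 2% drop
```

The micro-benchmark figure matches the "~30% overhead" quoted for a separate `nan_to_num` launch, while the fused fix moves end-to-end throughput by a fraction of a percent.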

Test Result

See above.



Changed files

  • csrc/moe/topk_softmax_kernels.cu (modified, +8/-3)
  • tests/kernels/moe/test_routing.py (modified, +168/-3)
  • vllm/model_executor/layers/fused_moe/oracle/unquantized.py (modified, +0/-6)


Your current environment

b200, main

🐛 Describe the bug

Both of these commands fail:

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-qwen35-blackwell.txt


extent analysis

Fix Plan

The commands in the issue are already correct; the failure is a vLLM bug rather than a configuration problem. It is addressed by the two PRs described above: PR #36838 enables the FlashInfer BF16 MoE kernels when dp > 1, and PR #39593 sanitizes NaN router logits inside the fused top-k kernel so that the FlashInfer CUTLASS kernel never receives duplicate expert indices.

Step-by-Step Solution

  • Use a vLLM build that includes PR #36838 and PR #39593.
  • Set the VLLM_USE_FLASHINFER_MOE_FP16 environment variable to 1 to select the FlashInfer MoE kernels for unquantized (FP16/BF16) weights.
  • Set the VLLM_FLASHINFER_MOE_BACKEND environment variable to latency (FlashInfer TRTLLM) or throughput (FlashInfer CUTLASS).
  • Run the original pytest command unchanged; no additional pytest option is needed.

Example code:

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=latency \
chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py \
--config-list-file=configs/models-qwen35-blackwell.txt

Alternatively, for the throughput backend:

VLLM_USE_FLASHINFER_MOE_FP16=1 VLLM_FLASHINFER_MOE_BACKEND=throughput \
chg run --gpus 2 -- pytest -s -v evals/gsm8k/test_gsm8k_correctness.py \
--config-list-file=configs/models-qwen35-blackwell.txt

Verification

Rerun both commands and confirm that the GSM8K correctness tests pass without an illegal memory access error. For reference, PR #39593 reports GSM8K scores of 0.8408 (DP=2) and 0.8560 (DEP=2) for Qwen3.5-35B-A3B BF16 against an expected 0.8400.

Extra Tips

  • Ensure that the configs/models-qwen35-blackwell.txt file exists and is correctly formatted.
  • If the crash persists, confirm that the installed build contains the NaN sanitization in csrc/moe/topk_softmax_kernels.cu from PR #39593.
  • To isolate the FlashInfer backends, unset VLLM_USE_FLASHINFER_MOE_FP16 and rerun with the default Triton backend.
