vllm - ✅(Solved) Fix [Bug][Tracking Issue]: NaNs in CUDA Graph padding regions corrupt activations in some per-token kernels [1 pull request, 1 participant]

vllm · 2026-04-16 18:06:45


vllm-project/vllm#40047 · Fetched 2026-04-17 08:27:28

Author: tlrmchlsmth

While debugging NaNs found during WideEP GB200 deployments of DeepSeek-R1-0528-NVFP4-v2 (https://github.com/vllm-project/vllm/issues/37890), we identified several kernels that leak NaNs from the CUDA Graph padding region into the activations of real tokens.
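For context on where the padding comes from: a CUDA graph is captured for a fixed set of batch sizes, so a batch of `n` tokens is typically run at the next larger captured size, and the extra rows hold whatever the reused buffers happened to contain. The sketch below models only that bookkeeping. `CAPTURE_SIZES`, `pad_to_capture_size` and `build_padded_batch` are hypothetical names with made-up values, not vLLM's API, and modeling stale rows as NaN is an assumption about the worst case.

```python
import bisect
import math

# Hypothetical batch sizes for which CUDA graphs were captured (illustrative only).
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]


def pad_to_capture_size(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Return the smallest captured size that can hold num_tokens."""
    idx = bisect.bisect_left(capture_sizes, num_tokens)
    if idx == len(capture_sizes):
        raise ValueError("batch is larger than every captured graph")
    return capture_sizes[idx]


def build_padded_batch(real_rows, stale_value=math.nan):
    """Copy real rows into a batch padded up to a captured size.

    Padding rows are filled with `stale_value` to model whatever a reused,
    uninitialized buffer happens to contain (NaN in the worst case).
    """
    padded_len = pad_to_capture_size(len(real_rows))
    width = len(real_rows[0])
    batch = [list(row) for row in real_rows]
    batch += [[stale_value] * width for _ in range(padded_len - len(real_rows))]
    return batch


batch = build_padded_batch([[1.0, 2.0]] * 5)
print(len(batch))  # 8: five real rows plus three padding rows
```

A kernel that really is per-token never reads the three padding rows when it computes results for the five real ones. The bugs tracked here are cases where that assumption breaks.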

Even though each of these kernels is supposed to process every token independently, NaNs in some tokens can affect others. In some cases, the leak happens through the warp reductions used to compute the scales for group quantization.
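The following toy model is not the actual kernel code. It shows why a scale reduction that is not confined to a single token lets padding rows change a real token's scale. It makes three simplifying assumptions:

- The absolute-max reduction propagates NaN.
- A hypothetical `rows_per_warp` layout means one reduction covers several tokens.
- NVFP4's 16-element groups and the FP4 (E2M1) maximum of 6.0 are used, but the FP8 encoding of the scale is ignored.

```python
import math

GROUP_SIZE = 16  # NVFP4 quantizes in 16-element groups
FP4_MAX = 6.0    # largest magnitude representable in FP4 E2M1


def amax(values):
    """Absolute max that propagates NaN (an assumption of this toy model)."""
    result = 0.0
    for v in values:
        if math.isnan(v):
            return math.nan
        result = max(result, abs(v))
    return result


def scales_per_token(rows):
    """Intended behavior: each token's scale depends only on that token.

    Only the first quantization group of each row is used, for brevity.
    """
    return [amax(row[:GROUP_SIZE]) / FP4_MAX for row in rows]


def scales_shared_reduction(rows, rows_per_warp=2):
    """Buggy toy: one reduction covers every row a 'warp' handles."""
    scales = []
    for start in range(0, len(rows), rows_per_warp):
        chunk = rows[start:start + rows_per_warp]
        shared = amax([v for row in chunk for v in row[:GROUP_SIZE]])
        scales.extend([shared / FP4_MAX] * len(chunk))
    return scales


real = [1.0] * GROUP_SIZE
nan_pad = [math.nan] * GROUP_SIZE
garbage_pad = [1e4] * GROUP_SIZE
zero_pad = [0.0] * GROUP_SIZE

print(scales_per_token([real, nan_pad])[0])             # 1.0 / FP4_MAX
print(scales_shared_reduction([real, nan_pad])[0])      # nan
print(scales_shared_reduction([real, garbage_pad])[0])  # 1e4 / FP4_MAX: wrong but finite
print(scales_shared_reduction([real, zero_pad])[0])     # 1.0 / FP4_MAX again
```

The finite-garbage case is the dangerous one, because it produces no NaN that would flag the problem. Zero-filling the padding makes the shared reduction agree with the per-token one. That is why the band-aid fixes work even though the kernels themselves are unchanged.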

We are collecting the issues here to avoid filing a separate issue for each. We have landed a band-aid fix for (1) and identified a somewhat intrusive band-aid fix for (2), (3), and likely (4).

(1) FlashInfer: Padding NaNs corrupt activation scales in TRT-LLM `mm_fp4`

See FlashInfer issue: https://github.com/flashinfer-ai/flashinfer/issues/2861

We landed a band-aid fix in https://github.com/vllm-project/vllm/pull/38148, which resolves the issue by zeroing out the scale padding. The workaround can be removed once the `mm_fp4` bug is fixed.

(2) FlashInfer Bug: Padding NaNs corrupt activation scales in `silu_and_mul_scaled_nvfp4_experts_quantize` and `scaled_fp4_grouped_quantize`

See FlashInfer issue: https://github.com/flashinfer-ai/flashinfer/issues/3057

See the failing test in vLLM Bug Hunt: https://github.com/tlrmchlsmth/vllm/pull/33

Band-aid fix: zero out the padding at the beginning of the MoE layer: https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea

Potential FlashInfer fix: https://github.com/flashinfer-ai/flashinfer/compare/main...tlrmchlsmth:flashinfer:fix/nvfp4-expert-quant-mask-warp-sync?expand=1

(3) FlashInfer: `grouped_gemm_nt_masked` cross-expert NaN contamination

Repro: https://gist.github.com/elvircrn/6fd6acdf75a44757362de660cb81ca54

Band-aid fix: zero out the padding at the beginning of the MoE layer: https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea

(4) vLLM Bug: Padding NaNs corrupt activation scales in `silu_and_mul_scaled_fp4_experts_quant`

This is a vLLM kernel in `nvfp4_experts_quant.cu` and affects the `CutlassExpertsFp4` code path. Repro in vLLM Bug Hunt: https://github.com/tlrmchlsmth/vllm/pull/33

Possible (untested) band-aid fix: zero out the padding at the beginning of the MoE layer: https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea


PR fix notes

PR #39743: [Bugfix] Fix FlashInfer NVFP4 cross-row scale corruption in MoE quant

Repository: vllm-project/vllm
Author: tlrmchlsmth
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/39743

Description (problem / solution / changelog)

FlashInfer's `silu_and_mul_scaled_nvfp4_experts_quantize` and `scaled_fp4_grouped_quantize` kernels corrupt the scales of real tokens when padding rows (beyond `masked_m`) contain NaN or garbage data.

This affects the `FlashInferCuteDSLBatchedExperts` MoE path used with NVFP4 weights (e.g., `nvidia/DeepSeek-R1-0528-NVFP4-v2` with DeepEP LL). The corruption produces wrong but finite values, i.e., silent accuracy degradation.

Fix: zero-fill the padding rows in `flashinfer_cutedsl_moe_masked` before calling the FlashInfer quantization kernels.
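Here is a minimal sketch of that zero-fill step. It uses plain Python lists so it runs anywhere and is not the code changed in the PR. It assumes an activation layout of `[num_experts][max_tokens][hidden]` with `masked_m[e]` valid rows for expert `e`. The name `zero_fill_padding_rows` is hypothetical. In a PyTorch implementation, the same mask (`token_index >= masked_m[expert]`) would more likely be applied in one vectorized call such as `Tensor.masked_fill_` than in a Python loop.

```python
def zero_fill_padding_rows(hidden_states, masked_m):
    """Zero every row at index >= masked_m[e] for each expert e, in place.

    hidden_states: per-expert list of rows, shape [num_experts][max_tokens][hidden]
    masked_m: number of valid (real) rows for each expert
    """
    if len(hidden_states) != len(masked_m):
        raise ValueError("need exactly one masked_m entry per expert")
    for expert_rows, num_valid in zip(hidden_states, masked_m):
        # Rows past num_valid are padding: overwrite NaN/garbage with zeros
        for t in range(num_valid, len(expert_rows)):
            expert_rows[t] = [0.0] * len(expert_rows[t])
    return hidden_states


NAN = float("nan")
experts = [
    [[1.0, 2.0], [NAN, NAN], [3.0, NAN]],  # expert 0: 1 real row, 2 padding rows
    [[4.0, 5.0], [6.0, 7.0], [NAN, 1e9]],  # expert 1: 2 real rows, 1 padding row
]
zero_fill_padding_rows(experts, masked_m=[1, 2])
print(experts[0][1], experts[1][2])  # [0.0, 0.0] [0.0, 0.0]
```

Real rows are left untouched, so the fill is only a workaround. It costs an extra write over the padding on every MoE layer, which is part of why the issue calls this class of fix somewhat intrusive.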

Tests:

`test_silu_quant_cross_row_corruption`: direct kernel test (xfail; demonstrates that the underlying FlashInfer kernel bug exists)
`test_grouped_quant_cross_row_corruption`: direct kernel test (xfail)
`test_cutedsl_wrapper_nan_padding`: wrapper test (passes with the fix)
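The test names suggest a simple property: outputs for real tokens must not depend on what the padding holds. That property can be written as a differential check. This is a sketch of the pattern, not the code in `test_flashinfer_nvfp4_quant_padding.py`. `real_rows_unaffected`, `per_row_amax` and `batch_amax` are hypothetical stand-ins, and `batch_amax` deliberately shares one value across all rows to mimic the bug.

```python
import math


def real_rows_unaffected(kernel, real_rows, num_padding, width):
    """Run `kernel` with zero padding and with NaN padding; compare real rows.

    `kernel` maps a list of rows to one output per row. The check passes only
    if every real row's output is identical in both runs (NaN matches NaN).
    """
    def run(fill):
        batch = [list(r) for r in real_rows]
        batch += [[fill] * width for _ in range(num_padding)]
        return kernel(batch)[: len(real_rows)]

    clean, poisoned = run(0.0), run(math.nan)
    return all(
        (math.isnan(a) and math.isnan(b)) or a == b
        for a, b in zip(clean, poisoned)
    )


def per_row_amax(rows):
    """Correct per-token kernel: each output depends on its own row only."""
    return [max(abs(v) for v in row) for row in rows]


def batch_amax(rows):
    """Buggy kernel: one value shared by all rows, so padding leaks in."""
    flat = [abs(v) for row in rows for v in row]
    shared = math.nan if any(math.isnan(v) for v in flat) else max(flat)
    return [shared] * len(rows)


rows = [[1.0, -2.0], [0.5, 0.25]]
print(real_rows_unaffected(per_row_amax, rows, num_padding=2, width=2))  # True
print(real_rows_unaffected(batch_amax, rows, num_padding=2, width=2))    # False
```

Comparing against a zero-padded run, rather than against hand-computed expected values, keeps the check independent of the quantization math.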

Changed files

.buildkite/test_areas/kernels.yaml (modified, +3/-0)
tests/kernels/moe/test_flashinfer_nvfp4_quant_padding.py (added, +329/-0)
vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_batched_moe.py (modified, +17/-1)


Extent analysis

TL;DR

Zeroing out padding at the beginning of the MoE layer may mitigate NaN contamination issues in various FlashInfer and vLLM kernels.

Guidance

Identify and review the specific kernel causing the NaN leak, such as `mm_fp4`, `silu_and_mul_scaled_nvfp4_experts_quantize`, `scaled_fp4_grouped_quantize`, `grouped_gemm_nt_masked`, or `silu_and_mul_scaled_fp4_experts_quant`.
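For (2), (3), and likely (4), apply the band-aid fix of zeroing out the padding at the beginning of the MoE layer (https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea). For (1), the scale-padding fix in https://github.com/vllm-project/vllm/pull/38148 has already landed.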
Verify the fix by re-running the failing tests, such as those in https://github.com/tlrmchlsmth/vllm/pull/33.
Consider exploring the potential FlashInfer fix proposed in https://github.com/flashinfer-ai/flashinfer/compare/main...tlrmchlsmth:flashinfer:fix/nvfp4-expert-quant-mask-warp-sync?expand=1 for a more permanent solution.

Example

No code snippet is provided as the issue does not contain sufficient information for a concrete example.

Notes

The fixes listed here are band-aids. They keep NaNs out of the padding instead of fixing the kernels that let padding rows affect real tokens. A permanent solution requires fixing those kernels, as tracked in the linked FlashInfer issues.

Recommendation

Apply the workaround of zeroing out the padding at the beginning of the MoE layer. It has been shown to mitigate the NaN contamination in some cases, although the fix for (4) is still untested.
