vllm - ✅(Solved) Fix [Bug][Tracking Issue]: NaNs in CUDA Graph padding regions corrupt activations in some per-token kernels [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#40047 (fetched 2026-04-17 08:27:28)
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 1

While debugging NaNs found during WideEP GB200 deployments of DeepSeek-R1-0528-NVFP4-v2 (https://github.com/vllm-project/vllm/issues/37890), we identified several kernels that leak NaNs from the CUDA Graph padding region into activation tokens.

Even though each of these kernels is supposed to operate on each token independently, NaNs in some tokens can affect the others. In some cases this happens due to warp reductions used to compute scales for group quantization.
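The leak described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the actual kernel code: the function names are hypothetical, and the assumption that a single reduction spans a group of adjacent rows is an illustration only. The real kernels may fail in a different way.

```python
import math

NAN = math.nan

def nan_propagating_absmax(values):
    """Abs-max that returns NaN as soon as any input is NaN."""
    result = 0.0
    for v in values:
        if math.isnan(v):
            return NAN
        result = max(result, abs(v))
    return result

def independent_scales(rows):
    """Intended behavior: each row's scale depends only on that row."""
    return [nan_propagating_absmax(row) for row in rows]

def coupled_scales(rows, rows_per_group=2):
    """Hypothetical bug: one reduction spans a group of adjacent rows."""
    scales = []
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        shared = nan_propagating_absmax([v for row in group for v in row])
        scales.extend([shared] * len(group))
    return scales

# Row 0 is a real token; row 1 is padding that happens to hold NaN.
rows = [[1.0, -2.0, 0.5], [NAN, NAN, NAN]]
clean_scales = independent_scales(rows)  # row 0 keeps a finite scale
leaked_scales = coupled_scales(rows)     # row 0 inherits the padding NaN
```

In the independent version the padding row can only damage its own scale; in the coupled version the real token's scale is lost as well, even though the token data itself is valid.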

Collecting the issues here to avoid filing a separate issue for each. We've landed a band-aid fix for (1) and have identified a somewhat intrusive band-aid fix for (2), (3), and likely (4).

(1) FlashInfer: Padding NaNs corrupt activation scales in TRT-LLM mm_fp4

See FlashInfer issue: https://github.com/flashinfer-ai/flashinfer/issues/2861
We landed a band-aid fix in https://github.com/vllm-project/vllm/pull/38148, which resolves the issue by zeroing out the scale padding. It can be removed once the mm_fp4 bug is fixed.

(2) FlashInfer Bug: Padding NaNs corrupt activation scales in silu_and_mul_scaled_nvfp4_experts_quantize and scaled_fp4_grouped_quantize

See FlashInfer issue: https://github.com/flashinfer-ai/flashinfer/issues/3057
See failing test in vLLM Bug Hunt: https://github.com/tlrmchlsmth/vllm/pull/33
Band-aid fix: Zero out padding at the beginning of the MoE layer https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea
Potential FlashInfer fix: https://github.com/flashinfer-ai/flashinfer/compare/main...tlrmchlsmth:flashinfer:fix/nvfp4-expert-quant-mask-warp-sync?expand=1

(3) FlashInfer: grouped_gemm_nt_masked cross-expert NaN contamination

Repro: https://gist.github.com/elvircrn/6fd6acdf75a44757362de660cb81ca54
Band-aid fix: Zero out padding at the beginning of the MoE layer https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea

(4) vLLM Bug: Padding NaNs corrupt activation scales in silu_and_mul_scaled_fp4_experts_quant

This is a vLLM kernel in nvfp4_experts_quant.cu. It affects the CutlassExpertsFp4 codepath.
Repro in vLLM Bug Hunt: https://github.com/tlrmchlsmth/vllm/pull/33

Possible (untested) band-aid fix: Zero out padding at the beginning of the MoE layer https://github.com/elvircrn/vllm/commit/b77030c452f2d4173aa7915d6d7cb510f04c80ea

PR fix notes

PR #39743: [Bugfix] Fix FlashInfer NVFP4 cross-row scale corruption in MoE quant

Description (problem / solution / changelog)

FlashInfer's silu_and_mul_scaled_nvfp4_experts_quantize and scaled_fp4_grouped_quantize kernels corrupt real token scales when padding rows (beyond masked_m) contain NaN or garbage data.

This affects the FlashInferCuteDSLBatchedExperts MoE path used with NVFP4 weights (e.g. nvidia/DeepSeek-R1-0528-NVFP4-v2 with DeepEP LL). The corruption produces wrong finite values (silent accuracy degradation).

Fix: zero-fill padding rows in flashinfer_cutedsl_moe_masked before calling the FlashInfer quantization kernels.
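The zero-fill step can be sketched as follows. This is a simplified illustration, not the code from the PR: the actual change operates on GPU tensors inside flashinfer_cutedsl_moe_masked, while this sketch uses nested Python lists. The helper name zero_fill_padding and the (expert, row, column) layout are assumptions; only masked_m, the per-expert count of valid rows, comes from the description above.

```python
def zero_fill_padding(hidden, masked_m):
    """Overwrite rows at index >= masked_m[e] with zeros, in place.

    hidden:   list of experts, each a list of rows, each a list of floats
    masked_m: number of valid (non-padding) rows for each expert
    """
    for expert_rows, valid in zip(hidden, masked_m):
        for row in expert_rows[valid:]:
            for i in range(len(row)):
                row[i] = 0.0
    return hidden

nan = float("nan")
hidden = [
    [[1.0, 2.0], [nan, nan], [nan, nan]],  # expert 0: 1 valid row
    [[3.0, 4.0], [5.0, 6.0], [nan, 7.0]],  # expert 1: 2 valid rows
]
masked_m = [1, 2]
zero_fill_padding(hidden, masked_m)
# Valid rows are untouched; every padding row is now all zeros.
```

Zero is a safe filler here because it is finite, so even a kernel that mixes padding rows into a real row's scale computation no longer sees NaN or garbage values.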

Tests:

  • test_silu_quant_cross_row_corruption: direct kernel test (xfail, proving the underlying FlashInfer kernel bug exists)
  • test_grouped_quant_cross_row_corruption: direct kernel test (xfail)
  • test_cutedsl_wrapper_nan_padding: wrapper test (PASSES with fix)
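A padding-independence check of the kind these tests perform can be sketched as below. This is not the content of test_flashinfer_nvfp4_quant_padding.py: all names are hypothetical, and the two stand-in functions only mimic a well-behaved kernel and a kernel whose rows are coupled. The idea is to run the same function on a clean buffer and on a copy whose padding rows are NaN, then compare the outputs for the valid rows only.

```python
import copy
import math

def poison_padding(batch, num_valid):
    """Return a copy of batch whose rows at index >= num_valid are NaN."""
    poisoned = copy.deepcopy(batch)
    for row in poisoned[num_valid:]:
        for i in range(len(row)):
            row[i] = math.nan
    return poisoned

def valid_rows_unaffected(fn, batch, num_valid):
    """True if fn's output for the valid rows ignores NaN in padding rows."""
    clean = fn(batch)[:num_valid]
    dirty = fn(poison_padding(batch, num_valid))[:num_valid]
    # clean is finite for a clean batch, so any leaked NaN makes them differ.
    return clean == dirty

def row_sum(batch):
    """Stand-in for a per-row kernel that treats rows independently."""
    return [sum(row) for row in batch]

def subtract_batch_total(batch):
    """Stand-in for a kernel whose rows are (wrongly) coupled."""
    total = sum(sum(row) for row in batch)
    return [sum(row) - total for row in batch]

batch = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
independent_ok = valid_rows_unaffected(row_sum, batch, num_valid=2)
coupled_ok = valid_rows_unaffected(subtract_batch_total, batch, num_valid=2)
```

A check like this fails for the coupled stand-in, which mirrors the role of the xfail kernel tests, and passes for the independent one, which mirrors the wrapper test passing once padding is zero-filled.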

Changed files

  • .buildkite/test_areas/kernels.yaml (modified, +3/-0)
  • tests/kernels/moe/test_flashinfer_nvfp4_quant_padding.py (added, +329/-0)
  • vllm/model_executor/layers/fused_moe/experts/flashinfer_cutedsl_batched_moe.py (modified, +17/-1)

Extent analysis

TL;DR

Zeroing out padding at the beginning of the MoE layer is the proposed band-aid for items (2), (3), and likely (4). Item (1) was handled separately by zeroing out the scale padding.

Guidance

Example

The issue itself contains no code snippet. Concrete reproductions are in the linked repro gist for (3) and in the vLLM Bug Hunt PR for (2) and (4).

Notes

The fixes listed here are band-aids: they sanitize the padding before the affected kernels run but leave the kernels themselves unchanged. A permanent solution requires fixing the kernels, as tracked in the linked FlashInfer issues and the potential FlashInfer fix for (2).

Recommendation

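Apply the workaround of zeroing out padding at the beginning of the MoE layer. PR #39743 zero-fills the padding rows in flashinfer_cutedsl_moe_masked for (2), and its wrapper test passes with the fix. The issue proposes the same workaround for (3) and lists it as possible but untested for (4).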
