vllm - ✅ (Solved) Add fused_topk to CI. Follow-up to #39391 [2 pull requests, 1 comment, 1 participant]

GitHub stats

vllm-project/vllm#40457 (fetched 2026-04-22 07:45:30)

  • Comments: 1
  • Participants: 1
  • Timeline events: 6
  • Reactions: 0
  • Timeline (top): cross-referenced ×2, closed ×1, commented ×1, mentioned ×1

Root Cause

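#39391 was merged before all review feedback was addressed, because the fix was wanted in v0.20. The open follow-ups were CI coverage and a test for fused_topk_bias.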

Fix Action

Fixed

PR fix notes

PR #39391: fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs

Description (problem / solution / changelog)

Purpose

Fix https://github.com/vllm-project/vllm/issues/39244

Fix CUDA illegal memory access crash when serving MoE models (e.g., Qwen3.5-397B-A17B-FP8) with FlashInfer CUTLASS MoE and CUDA graphs at high concurrency on H200.

CUDA graph replay pads the batch to the nearest capture size. Padded tokens have degenerate hidden states that produce NaN gating logits. The topkGating kernel's softmax outputs all-NaN, and the argmax loop picks expert 0 for every top-k slot (IEEE 754: NaN > NaN is false), producing duplicate expert IDs [0,0,0,0,0,0,0,0]. These duplicates trigger an uninitialized-memory bug in FlashInfer's three-step MoE sort, causing finalizeMoeRoutingKernel to dereference wild pointers.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring in topkGating, before the argmax selection loop. With all-zero scores, argmax picks unique experts [0,1,2,...,k-1] via index tie-breaking. Zero performance overhead.

Test Plan

  • Kernel unit test: verify topk_softmax produces unique expert IDs for NaN/Inf/normal gating inputs

  • Kernel microbenchmark: compare eager + CUDA graph replay latency for normal vs NaN inputs (batch 1-512, 128/256 experts)

  • End-to-end: serve Qwen3.5-397B-A17B-FP8 (TP=4, EP=4, CUDA graphs, VLLM_USE_FLASHINFER_MOE_FP8=1) with 8 concurrent requests

  • Full sweep: sglang benchmark conc 1-512, ISL=1600, OSL=600, REPEAT=5 on 4x H200

  • tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp

Test Result

Kernel correctness (H200):

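| Input   | Before fix                        | After fix                   |
|---------|-----------------------------------|-----------------------------|
| normal  | unique IDs                        | unique IDs (unchanged)      |
| all_nan | [0,0,0,0,0,0,0,0] (512/512 dup)   | [0,1,2,3,4,5,6,7] (0 dup)   |
| all_inf | [0,0,0,0,0,0,0,0] (512/512 dup)   | [0,1,2,3,4,5,6,7] (0 dup)   |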

Kernel perf (H200, CUDA graph replay, median of 1000 runs):

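| Batch | Experts | normal (us) | all_nan (us) | diff  |
|-------|---------|-------------|--------------|-------|
| 1     | 128     | 8.29        | 8.22         | -0.8% |
| 128   | 128     | 8.26        | 8.32         | +0.8% |
| 512   | 128     | 8.51        | 8.64         | +1.5% |
| 512   | 256     | 8.90        | 8.93         | +0.4% |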

All within noise. Zero measurable overhead.

End-to-end (4x H200, Qwen3.5-397B-A17B-FP8):

| Test              | Before               | After        |
|-------------------|----------------------|--------------|
| 8 concurrent curl | 5/8 OK, 3/8 crash    | 8/8 HTTP 200 |
| sweep conc 1-512  | crash at conc 16+    | all pass     |

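Unit test:

| State                           | Kernel      | Test result          |
|---------------------------------|-------------|----------------------|
| BEFORE                          | partial fix | 60 failed, 12 passed |
| AFTER (all three clamps active) | full fix    | 72 passed            |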

Summary

vLLM crashes with CUDA error: an illegal memory access was encountered when serving Qwen3.5-397B-A17B-FP8 with VLLM_USE_FLASHINFER_MOE_FP8=1 and CUDA graphs enabled. The crash occurs at high concurrency (8+ requests) when the MoE batch size exceeds 256 tokens.

Root Cause

CUDA graph replay pads the batch to the nearest capture size (e.g., 300 real tokens padded to 512). Padded tokens have stale/degenerate hidden states that produce NaN gating logits in the MoE router. The topk_softmax CUDA kernel then produces duplicate expert IDs for NaN inputs (e.g., [0,0,0,0,0,0,0,0] for every padded token), because IEEE 754 NaN > NaN is always false, so the argmax never updates from expert 0, and the -10000 zeroing of the winner also fails (-10000 > NaN is false).
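The NaN comparison behaviour described above can be reproduced without a GPU. The following is a minimal sketch in Python, not the kernel's actual code: `topk_repeated_argmax` is a hypothetical helper that models top-k as "take the argmax k times, masking each winner with -10000", using a strict `>` comparison.

```python
NAN = float("nan")

def topk_repeated_argmax(scores, k, mask_value=-10000.0):
    """Simplified model: pick the max k times, masking each winner.

    Hypothetical helper for illustration; the real kernel is CUDA.
    """
    work = list(scores)
    picked = []
    for _ in range(k):
        best = 0
        for i in range(1, len(work)):
            # Strict '>' is False whenever either operand is NaN,
            # so a NaN candidate can never displace the current best.
            if work[i] > work[best]:
                best = i
        picked.append(best)
        work[best] = mask_value  # mask the winner for the next round
    return picked

print(topk_repeated_argmax([0.1, 0.5, 0.2, 0.2], 2))  # [1, 2]
print(topk_repeated_argmax([NAN] * 8, 4))             # [0, 0, 0, 0]
```

In this model, once slot 0 has been masked, every remaining NaN still fails the `>` test against -10000, so expert 0 is selected again on each round.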

These duplicate expert IDs trigger a latent bug in FlashInfer's blockExpertPrefixSumKernel (three-step MoE sort path, used when num_tokens > 256): it uses break after the first expert match, so duplicate expert slots leave unpermuted_row_to_permuted_row entries uninitialized. finalizeMoeRoutingKernel then reads garbage values as row indices, causing wild pointer dereferences.

Chain of events

CUDA graph replay with padded tokens
  -> stale hidden states -> NaN gating logits
    -> topk_softmax produces [0,0,0,0,0,0,0,0] for padded tokens
      -> duplicate expert IDs enter cutlass_fused_moe (num_tokens > 256)
        -> blockExpertPrefixSumKernel skips duplicate slots (break)
          -> unpermuted_row_to_permuted_row has uninitialized entries
            -> finalizeMoeRoutingKernel reads garbage -> OOB -> CRASH

Why it only happens with CUDA graphs

In eager mode, there are no padded tokens -- the batch contains only real tokens with valid hidden states, the router produces unique expert IDs, and the three-step sort works correctly. The crash requires:

  1. Batch size > 256 (three-step sort path)
  2. Duplicate expert IDs (from NaN gating on padded tokens)

Both conditions only occur together during CUDA graph replay at high concurrency.
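The two conditions can be sketched as a small check. This is illustrative only: the capture-size list and both function names are assumptions, not vLLM's configuration or API. Only the 256-token threshold and the "300 padded to 512" example come from the description above.

```python
import bisect

# Hypothetical capture sizes; the real list depends on the vLLM configuration.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

def padded_batch_size(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Return the smallest capture size >= num_tokens.

    Batches larger than every capture size are assumed to run unpadded.
    """
    i = bisect.bisect_left(capture_sizes, num_tokens)
    return capture_sizes[i] if i < len(capture_sizes) else num_tokens

def uses_three_step_sort(num_tokens):
    """Threshold taken from the description: the path is used above 256 tokens."""
    return num_tokens > 256

real = 300
padded = padded_batch_size(real)
print(padded, padded - real, uses_three_step_sort(padded))  # 512 212 True
```

Under these assumptions, a 300-token batch replays the 512-token graph, which carries 212 padded rows and falls on the three-step sort path.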

Fix

Clamp NaN/Inf values to 0 in topk_softmax after softmax/sigmoid scoring, before the argmax selection loop:

// csrc/moe/topk_softmax_kernels.cu, after line 443
#pragma unroll
for (int ii = 0; ii < VPT; ++ii) {
    if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
        row_chunk[ii] = 0.f;
    }
}

With all-zero scores, the argmax uses index tie-breaking to pick unique experts [0,1,2,...,k-1], preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.
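The effect of the clamp can be checked on the host with a small Python sketch. It is not the CUDA kernel: `softmax`, `clamp_non_finite`, and `topk_with_index_tiebreak` are hypothetical helpers, and the sort-based selection only models "largest score first, lowest index wins ties".

```python
import math

def softmax(row):
    """Numerically stabilised softmax; all-NaN or all-Inf rows yield all NaN."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def clamp_non_finite(scores):
    """Replace NaN/Inf with 0.0, mirroring the isnan/isinf check in the fix."""
    return [s if math.isfinite(s) else 0.0 for s in scores]

def topk_with_index_tiebreak(scores, k):
    """Model of top-k selection where ties go to the lowest index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]

for bad in (float("nan"), float("inf")):
    scores = clamp_non_finite(softmax([bad] * 8))
    print(topk_with_index_tiebreak(scores, 4))  # [0, 1, 2, 3]

normal = softmax([1.0, 3.0, 2.0, 0.5])
print(clamp_non_finite(normal) == normal)  # True: the clamp is a no-op here
```

An all-Inf row becomes NaN inside the softmax because `inf - inf` is NaN, which is why both degenerate inputs end up on the same path.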

Why this is the right fix location

The topk_softmax kernel (csrc/moe/topk_softmax_kernels.cu:266) is where the NaN propagates into duplicate expert IDs. Fixing it here:

  • Prevents the bad input from reaching ANY downstream MoE kernel (FlashInfer, Triton, etc.)
  • Zero performance overhead (see benchmarks below)
  • Handles all NaN sources (CUDA graph padding, numerical overflow, any future degenerate input)

Performance Impact

Benchmarked on H200, production MoE configs (128/256 experts, top_k=8). The fix adds isnan/isinf checks (single PTX predicate instructions) per element. The kernel is memory-bandwidth bound, so the extra comparisons are invisible:

Eager mode (us, median of 1000 runs)

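| Batch | Experts | normal | all_nan | diff  |
|-------|---------|--------|---------|-------|
| 1     | 128     | 10.94  | 10.91   | -0.3% |
| 8     | 128     | 10.72  | 10.72   | 0.0%  |
| 32    | 128     | 10.72  | 10.75   | +0.3% |
| 128   | 128     | 10.72  | 10.66   | -0.6% |
| 256   | 128     | 10.78  | 10.72   | -0.6% |
| 512   | 128     | 10.66  | 10.69   | +0.3% |
| 512   | 256     | 10.72  | 10.72   | 0.0%  |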

CUDA graph replay mode (us, median of 1000 runs)

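| Batch | Experts | normal | all_nan | diff  |
|-------|---------|--------|---------|-------|
| 1     | 128     | 8.29   | 8.22    | -0.8% |
| 8     | 128     | 8.19   | 8.16    | -0.4% |
| 32    | 128     | 8.29   | 8.19    | -1.2% |
| 128   | 128     | 8.26   | 8.32    | +0.8% |
| 256   | 128     | 8.32   | 8.32    | 0.0%  |
| 512   | 128     | 8.51   | 8.64    | +1.5% |
| 512   | 256     | 8.90   | 8.93    | +0.4% |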

All differences are within noise (<2%). Zero measurable overhead.

Verification

Standalone (topk kernel)

Before fix:

all_nan: dup_tokens=512/512  topk_ids=[0,0,0,0,0,0,0,0]
all_inf: dup_tokens=512/512  topk_ids=[0,0,0,0,0,0,0,0]

After fix:

all_nan: dup_tokens=0/512  topk_ids=[0,1,2,3,4,5,6,7]
all_inf: dup_tokens=0/512  topk_ids=[0,1,2,3,4,5,6,7]
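A duplicate count like `dup_tokens` above can be computed with a few lines of Python. This is a sketch of one way to do it, assuming `topk_ids` is a list of per-token expert ID lists; `count_dup_tokens` is a hypothetical helper and not the script used in the PR.

```python
def count_dup_tokens(topk_ids):
    """Count tokens whose selected expert IDs are not all distinct."""
    return sum(1 for ids in topk_ids if len(set(ids)) != len(ids))

before = [[0] * 8 for _ in range(512)]        # every token picked expert 0 eight times
after = [list(range(8)) for _ in range(512)]  # every token picked experts 0..7

print(f"dup_tokens={count_dup_tokens(before)}/{len(before)}")  # dup_tokens=512/512
print(f"dup_tokens={count_dup_tokens(after)}/{len(after)}")    # dup_tokens=0/512
```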



Changed files

  • csrc/moe/topk_softmax_kernels.cu (modified, +19/-2)
  • tests/kernels/moe/test_fused_topk.py (modified, +67/-0)

PR #40553: test: add nan/inf clamp regression test for fused_topk_bias

Description (problem / solution / changelog)

Add test_fused_topk_bias_nan_inf_clamp to cover the same NaN/Inf scenario as test_fused_topk_nan_inf_clamp but for the fused_topk_bias entry point. On CUDA, fused_topk_bias routes through the same topk_softmax/topk_sigmoid kernels fixed in #39391 (lines 131/154 of topk_softmax_kernels.cu), so the clamp already applies.

Closes #40457

Purpose

PR #39391 fixed NaN/Inf gating logits producing duplicate expert IDs in fused_topk by clamping scores to 0 in topk_softmax_kernels.cu (lines 131/154 for the warp-level path, line 457 for the fallback path). However, the regression test added in that PR only covered fused_topk. This PR adds the same nan/inf regression test for fused_topk_bias, which is used by DeepSeek-style models with e_score_correction_bias.

On CUDA, fused_topk_bias routes through the same topk_softmax / topk_sigmoid C++ kernels that were patched in #39391, so the clamp already applies. This PR confirms that coverage with an explicit test.

Test Plan

  • Added test_fused_topk_bias_nan_inf_clamp to tests/kernels/moe/test_fused_topk.py, parametrized over:
    • dtype: bfloat16, float16, float32
    • scoring_func: softmax, sigmoid
    • bad_value: NaN, Inf
    • num_experts: 6, 8, 16
    • topk: 3, 4
    • Total: 144 test cases
  • Verified the test is automatically collected by the existing Kernels MoE Test CI step (no .buildkite/ changes needed):
    # Simulate exactly what buildkite runs:
    pytest tests/kernels/moe --collect-only -q | grep nan_inf
    # Shows both test_fused_topk_nan_inf_clamp (added in #39391) and
    # test_fused_topk_bias_nan_inf_clamp (this PR) are collected
  • Ran the full test_fused_topk.py suite (720 tests) to check for regressions.

Test Result

nan/inf regression tests (144 cases)

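| Kernel clamp                    | nan/inf tests passed | nan/inf tests failed |
|---------------------------------|----------------------|----------------------|
| Disabled (pre-#39391 behaviour) | 36                   | 108                  |
| Enabled (post-#39391, this PR)  | 144                  | 0                    |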

Full test suite (720 cases, 2×H200, CUDA 13.0)

720 passed, 16 warnings in 104.91s


Changed files

  • tests/kernels/moe/test_fused_topk.py (modified, +69/-0)
Issue body

#39391 was merged because we wanted the fix in v0.20.

This review comment was left unaddressed: https://github.com/vllm-project/vllm/pull/39391#pullrequestreview-4144377845

  • add to CI
  • add fused_topk_bias

extent analysis

TL;DR

The issue asks for two follow-ups from the #39391 review: make sure the NaN/Inf regression test runs in CI, and add equivalent coverage for fused_topk_bias. PR #40553 addresses both.

Guidance

  • Review the pull request #39391 and the specific comment https://github.com/vllm-project/vllm/pull/39391#pullrequestreview-4144377845 to understand the required changes.
  • Add a NaN/Inf regression test for the fused_topk_bias entry point, mirroring test_fused_topk_nan_inf_clamp.
  • Confirm the new test is collected by the existing Kernels MoE Test CI step, so it runs automatically.
  • Run the full tests/kernels/moe/test_fused_topk.py suite to check that the new test passes and nothing else regresses.

Notes

The issue itself does not describe fused_topk_bias. PR #40553 notes that on CUDA it routes through the same topk_softmax / topk_sigmoid kernels patched in #39391, so the kernel clamp already applies and only the test coverage was missing.

Recommendation

Add the fused_topk_bias regression test and verify that CI collects it. No further kernel change is needed: the clamp itself was already merged in #39391 so that it would land in v0.20.
