vllm - ✅ (Solved) Add fused_topk to CI. Follow-up to #39391 [2 pull requests, 1 comment, 1 participant]

vllm • 2026-04-21 10:58:16

GitHub stats

vllm-project/vllm#40457 • Fetched 2026-04-22 07:45:30

Author

vadiklyutiy

Participants

vadiklyutiy

Timeline (top)

cross-referenced ×2, closed ×1, commented ×1, mentioned ×1

Root Cause

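#39391 was merged before every review comment was addressed, because the fix was wanted in v0.20. Two review items were left open: adding the test to CI and covering fused_topk_bias.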

Fix Action

Fixed

Fixed by PR: fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs (https://github.com/vllm-project/vllm/pull/39391)
Fixed by PR: test: add nan/inf clamp regression test for fused_topk_bias (https://github.com/vllm-project/vllm/pull/40553)

PR fix notes

PR #39391: fix: clamp NaN/Inf in topk_softmax to prevent duplicate expert IDs

Repository: vllm-project/vllm
Author: jhaotingc
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/39391

Description (problem / solution / changelog)

Purpose

Fix https://github.com/vllm-project/vllm/issues/39244

Fix CUDA illegal memory access crash when serving MoE models (e.g., Qwen3.5-397B-A17B-FP8) with FlashInfer CUTLASS MoE and CUDA graphs at high concurrency on H200.

CUDA graph replay pads the batch to the nearest capture size. Padded tokens have degenerate hidden states that produce NaN gating logits. The topkGating kernel's softmax outputs all-NaN, and the argmax loop picks expert 0 for every top-k slot (IEEE 754: NaN > NaN is false), producing duplicate expert IDs [0,0,0,0,0,0,0,0]. These duplicates trigger an uninitialized-memory bug in FlashInfer's three-step MoE sort, causing finalizeMoeRoutingKernel to dereference wild pointers.

The fix clamps NaN/Inf values to 0 after softmax/sigmoid scoring in topkGating, before the argmax selection loop. With all-zero scores, argmax picks unique experts [0,1,2,...,k-1] via index tie-breaking. Zero performance overhead.
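The selection behaviour described above can be reproduced with a small pure-Python model. This is a sketch under stated assumptions, not the kernel source: `topk_by_repeated_argmax` is a hypothetical name, the running maximum is assumed to start at the first element, and the previous winner is assumed to be overwritten with -10000 before the next pass.

```python
def topk_by_repeated_argmax(scores, k):
    """Pick k expert indices by running a strict-greater argmax k times.

    Toy model only: the comparison is a strict `>`, so any comparison
    that involves NaN is False and the running best never moves.
    """
    work = list(scores)
    picked = []
    for _ in range(k):
        best_idx, best_val = 0, work[0]
        for idx in range(1, len(work)):
            if work[idx] > best_val:  # False whenever either side is NaN
                best_idx, best_val = idx, work[idx]
        picked.append(best_idx)
        work[best_idx] = -10000.0  # mask the winner for the next pass
    return picked


nan_row = [float("nan")] * 8
zero_row = [0.0] * 8  # what an all-NaN row looks like after clamping to 0

dup_ids = topk_by_repeated_argmax(nan_row, 8)
unique_ids = topk_by_repeated_argmax(zero_row, 8)
```

In this model the all-NaN row selects expert 0 in every slot, while the all-zero row walks through the indices in order because each masked winner drops below the remaining zeros.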

Test Plan

Kernel unit test: verify topk_softmax produces unique expert IDs for NaN/Inf/normal gating inputs
Kernel microbenchmark: compare eager + CUDA graph replay latency for normal vs NaN inputs (batch 1-512, 128/256 experts)
End-to-end: serve Qwen3.5-397B-A17B-FP8 (TP=4, EP=4, CUDA graphs, VLLM_USE_FLASHINFER_MOE_FP8=1) with 8 concurrent requests
Full sweep: sglang benchmark conc 1-512, ISL=1600, OSL=600, REPEAT=5 on 4x H200
tests/kernels/moe/test_fused_topk.py::test_fused_topk_nan_inf_clamp

Test Result

Kernel correctness (H200):

| Input | Before fix | After fix |
| --- | --- | --- |
| normal | unique IDs | unique IDs (unchanged) |
| all_nan | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |
| all_inf | `[0,0,0,0,0,0,0,0]` (512/512 dup) | `[0,1,2,3,4,5,6,7]` (0 dup) |

Kernel perf (H200, CUDA graph replay, median of 1000 runs):

| Batch | Experts | normal (us) | all_nan (us) | diff |
| --- | --- | --- | --- | --- |
| 1 | 128 | 8.29 | 8.22 | -0.8% |
| 128 | 128 | 8.26 | 8.32 | +0.8% |
| 512 | 128 | 8.51 | 8.64 | +1.5% |
| 512 | 256 | 8.90 | 8.93 | +0.4% |

All within noise. Zero measurable overhead.

End-to-end (4x H200, Qwen3.5-397B-A17B-FP8):

| Test | Before | After |
| --- | --- | --- |
| 8 concurrent curl | 5/8 OK, 3/8 crash | 8/8 HTTP 200 |
| sweep conc 1-512 | crash at conc 16+ | all pass |

Unit test:

| State | Kernel | Test result |
| --- | --- | --- |
| BEFORE | partial fix | 60 failed, 12 passed |
| AFTER (all three clamps active) | full fix | 72 passed |

Summary

vLLM crashes with CUDA error: an illegal memory access was encountered when serving Qwen3.5-397B-A17B-FP8 with VLLM_USE_FLASHINFER_MOE_FP8=1 and CUDA graphs enabled. The crash occurs at high concurrency (8+ requests) when the MoE batch size exceeds 256 tokens.

Root Cause

CUDA graph replay pads the batch to the nearest capture size (e.g., 300 real tokens padded to 512). Padded tokens have stale/degenerate hidden states that produce NaN gating logits in the MoE router. The topk_softmax CUDA kernel then produces duplicate expert IDs for NaN inputs (e.g., [0,0,0,0,0,0,0,0] for every padded token), because IEEE 754 NaN > NaN is always false, so the argmax never updates from expert 0, and the -10000 zeroing of the winner also fails (-10000 > NaN is false).
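As a rough illustration of the padding step, the sketch below picks the smallest capture size that fits a batch. The capture-size list and both function names are assumptions made for this example; the real list depends on the vLLM configuration, and this toy simply raises when the batch exceeds every captured size.

```python
import bisect

# Hypothetical capture sizes, sorted ascending (an assumption for this sketch).
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]


def padded_batch_size(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Return the smallest captured batch size that can hold num_tokens."""
    i = bisect.bisect_left(capture_sizes, num_tokens)
    if i == len(capture_sizes):
        raise ValueError("batch is larger than any captured graph")
    return capture_sizes[i]


def padding_rows(num_tokens, capture_sizes=CAPTURE_SIZES):
    """Number of padded rows that carry no real token."""
    return padded_batch_size(num_tokens, capture_sizes) - num_tokens


padded = padded_batch_size(300)
extra = padding_rows(300)
```

With these assumed sizes, 300 real tokens land in the 512-row graph, so 212 rows carry whatever the buffer last held. Those are the rows that can produce NaN gating logits.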

These duplicate expert IDs trigger a latent bug in FlashInfer's blockExpertPrefixSumKernel (three-step MoE sort path, used when num_tokens > 256): it uses break after the first expert match, so duplicate expert slots leave unpermuted_row_to_permuted_row entries uninitialized. finalizeMoeRoutingKernel then reads garbage values as row indices, causing wild pointer dereferences.
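The effect of stopping at the first match can be shown with a toy model. This is not FlashInfer's code: the function name, the flat `(token, slot)` layout, and the `UNINIT` sentinel are all assumptions chosen to make the failure visible.

```python
UNINIT = -1  # stands in for whatever garbage the real buffer would hold


def build_row_map(topk_ids, num_experts):
    """Toy sort step that stops scanning a token at its first matching slot.

    topk_ids[t] lists the expert IDs chosen for token t. The result maps
    each flat (token, slot) position to a permuted row index.
    """
    k = len(topk_ids[0])
    row_map = [UNINIT] * (len(topk_ids) * k)
    next_row = 0
    for expert in range(num_experts):
        for t, ids in enumerate(topk_ids):
            for slot, e in enumerate(ids):
                if e == expert:
                    row_map[t * k + slot] = next_row
                    next_row += 1
                    break  # assumes an expert appears at most once per token
    return row_map


ok_map = build_row_map([[0, 1], [1, 0]], num_experts=2)
bad_map = build_row_map([[0, 0], [0, 0]], num_experts=2)
```

With unique IDs every position receives a row index. With duplicated IDs the second slot of each token is never written, which mirrors the uninitialized entries that a later kernel would read as row indices.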

Chain of events

CUDA graph replay with padded tokens
  -> stale hidden states -> NaN gating logits
    -> topk_softmax produces [0,0,0,0,0,0,0,0] for padded tokens
      -> duplicate expert IDs enter cutlass_fused_moe (num_tokens > 256)
        -> blockExpertPrefixSumKernel skips duplicate slots (break)
          -> unpermuted_row_to_permuted_row has uninitialized entries
            -> finalizeMoeRoutingKernel reads garbage -> OOB -> CRASH

Why it only happens with CUDA graphs

In eager mode, there are no padded tokens -- the batch contains only real tokens with valid hidden states, the router produces unique expert IDs, and the three-step sort works correctly. The crash requires:

Batch size > 256 (three-step sort path)
Duplicate expert IDs (from NaN gating on padded tokens)

Both conditions only occur together during CUDA graph replay at high concurrency.

Fix

Clamp NaN/Inf values to 0 in topk_softmax after softmax/sigmoid scoring, before the argmax selection loop:

// csrc/moe/topk_softmax_kernels.cu, after line 443
#pragma unroll
for (int ii = 0; ii < VPT; ++ii) {
    if (isnan(row_chunk[ii]) || isinf(row_chunk[ii])) {
        row_chunk[ii] = 0.f;
    }
}

With all-zero scores, the argmax uses index tie-breaking to pick unique experts [0,1,2,...,k-1], preventing duplicates. Normal (non-NaN) inputs are unaffected -- the clamp is a no-op.
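For readers who prefer to check the logic outside CUDA, here is a pure-Python analogue of the clamp followed by an index-tie-breaking top-k. It is a sketch, not a port of the kernel: `clamp_non_finite` and `topk_with_index_ties` are hypothetical helper names, and the stable sort merely stands in for the kernel's argmax loop.

```python
import math


def clamp_non_finite(scores):
    """Replace NaN and +/-Inf with 0.0; finite values pass through unchanged."""
    return [0.0 if (math.isnan(s) or math.isinf(s)) else s for s in scores]


def topk_with_index_ties(scores, k):
    """Top-k indices by descending score; equal scores keep index order."""
    # sorted() is stable, so tied scores stay in ascending index order.
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


mixed = [float("nan"), float("inf"), float("-inf"), 0.25]
clamped = clamp_non_finite(mixed)
all_nan_ids = topk_with_index_ties(clamp_non_finite([float("nan")] * 8), 8)
normal_ids = topk_with_index_ties(clamp_non_finite([0.1, 0.6, 0.3]), 2)
```

The normal row is untouched by the clamp, so its selection is the same as without it. Only rows that contain non-finite scores change.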

Why this is the right fix location

The topk_softmax kernel (csrc/moe/topk_softmax_kernels.cu:266) is where the NaN propagates into duplicate expert IDs. Fixing it here:

Prevents the bad input from reaching ANY downstream MoE kernel (FlashInfer, Triton, etc.)
Zero performance overhead (see benchmarks below)
Handles all NaN sources (CUDA graph padding, numerical overflow, any future degenerate input)

Performance Impact

Benchmarked on H200, production MoE configs (128/256 experts, top_k=8). The fix adds isnan/isinf checks (single PTX predicate instructions) per element. The kernel is memory-bandwidth bound, so the extra comparisons are invisible:

Eager mode (us, median of 1000 runs)

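| Batch | Experts | normal | all_nan | diff |
| --- | --- | --- | --- | --- |
| 1 | 128 | 10.94 | 10.91 | -0.3% |
| 8 | 128 | 10.72 | 10.72 | 0.0% |
| 32 | 128 | 10.72 | 10.75 | +0.3% |
| 128 | 128 | 10.72 | 10.66 | -0.6% |
| 256 | 128 | 10.78 | 10.72 | -0.6% |
| 512 | 128 | 10.66 | 10.69 | +0.3% |
| 512 | 256 | 10.72 | 10.72 | 0.0% |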

CUDA graph replay mode (us, median of 1000 runs)

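| Batch | Experts | normal | all_nan | diff |
| --- | --- | --- | --- | --- |
| 1 | 128 | 8.29 | 8.22 | -0.8% |
| 8 | 128 | 8.19 | 8.16 | -0.4% |
| 32 | 128 | 8.29 | 8.19 | -1.2% |
| 128 | 128 | 8.26 | 8.32 | +0.8% |
| 256 | 128 | 8.32 | 8.32 | 0.0% |
| 512 | 128 | 8.51 | 8.64 | +1.5% |
| 512 | 256 | 8.90 | 8.93 | +0.4% |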

All differences are within noise (<2%). Zero measurable overhead.
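The "within 2%" claim can be sanity-checked from the published numbers. The sketch below recomputes the relative difference for the CUDA graph replay rows; the values are copied from the table, and `relative_diff_pct` is a hypothetical helper. Because the table entries are already rounded to two decimals, a recomputed figure can differ from the diff column by about 0.1 percentage point (the batch 128 row comes out near +0.7% rather than +0.8%).

```python
# (batch, experts, normal_us, all_nan_us) from the CUDA graph replay table
ROWS = [
    (1, 128, 8.29, 8.22),
    (8, 128, 8.19, 8.16),
    (32, 128, 8.29, 8.19),
    (128, 128, 8.26, 8.32),
    (256, 128, 8.32, 8.32),
    (512, 128, 8.51, 8.64),
    (512, 256, 8.90, 8.93),
]


def relative_diff_pct(normal_us, nan_us):
    """Latency change of the all_nan run relative to the normal run, in percent."""
    return (nan_us - normal_us) / normal_us * 100.0


diffs = [relative_diff_pct(n, a) for _, _, n, a in ROWS]
worst = max(abs(d) for d in diffs)
```

The largest absolute difference comes from the batch 512, 128 experts row and stays below the 2% bound quoted above.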

Verification

Standalone (topk kernel)

Before fix:

all_nan: dup_tokens=512/512  topk_ids=[0,0,0,0,0,0,0,0]
all_inf: dup_tokens=512/512  topk_ids=[0,0,0,0,0,0,0,0]

After fix:

all_nan: dup_tokens=0/512  topk_ids=[0,1,2,3,4,5,6,7]
all_inf: dup_tokens=0/512  topk_ids=[0,1,2,3,4,5,6,7]
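A counter like the `dup_tokens` figure above could be computed as follows. This is a minimal sketch: `count_dup_tokens` is a hypothetical helper, and plain Python lists stand in for the kernel's output tensor.

```python
def count_dup_tokens(topk_ids):
    """Count rows (tokens) whose selected expert IDs are not all distinct."""
    return sum(1 for ids in topk_ids if len(set(ids)) != len(ids))


# 512 rows shaped like the outputs quoted above
before_fix = [[0] * 8 for _ in range(512)]
after_fix = [list(range(8)) for _ in range(512)]

dup_before = count_dup_tokens(before_fix)
dup_after = count_dup_tokens(after_fix)
```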

Changed files

csrc/moe/topk_softmax_kernels.cu (modified, +19/-2)
tests/kernels/moe/test_fused_topk.py (modified, +67/-0)

PR #40553: test: add nan/inf clamp regression test for fused_topk_bias

Repository: vllm-project/vllm
Author: jhaotingc
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/40553

Description (problem / solution / changelog)

Add test_fused_topk_bias_nan_inf_clamp to cover the same NaN/Inf scenario as test_fused_topk_nan_inf_clamp but for the fused_topk_bias entry point. On CUDA, fused_topk_bias routes through the same topk_softmax/topk_sigmoid kernels fixed in #39391 (lines 131/154 of topk_softmax_kernels.cu), so the clamp already applies.

Closes #40457

Purpose

PR #39391 fixed NaN/Inf gating logits producing duplicate expert IDs in fused_topk by clamping scores to 0 in topk_softmax_kernels.cu (lines 131/154 for the warp-level path, line 457 for the fallback path). However, the regression test added in that PR only covered fused_topk. This PR adds the same nan/inf regression test for fused_topk_bias, which is used by DeepSeek-style models with e_score_correction_bias.

On CUDA, fused_topk_bias routes through the same topk_softmax / topk_sigmoid C++ kernels that were patched in #39391, so the clamp already applies. This PR confirms that coverage with an explicit test.

Test Plan

Added test_fused_topk_bias_nan_inf_clamp to tests/kernels/moe/test_fused_topk.py, parametrized over:
- dtype: bfloat16, float16, float32
- scoring_func: softmax, sigmoid
- bad_value: NaN, Inf
- num_experts: 6, 8, 16
- topk: 3, 4
- Total: 144 test cases
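As a quick cross-check of the parametrization, the axes listed above can be enumerated directly. Multiplying them gives 72 combinations, half of the reported 144, so the remaining factor of two presumably comes from a parameter that the list does not itemize. That is an inference from the arithmetic, not something the PR text states. The `AXES` dictionary is an illustration, not the test's actual decorator arguments.

```python
from itertools import product

# Axes exactly as listed in the test plan; values are labels only.
AXES = {
    "dtype": ["bfloat16", "float16", "float32"],
    "scoring_func": ["softmax", "sigmoid"],
    "bad_value": ["nan", "inf"],
    "num_experts": [6, 8, 16],
    "topk": [3, 4],
}

cases = list(product(*AXES.values()))
listed_total = len(cases)
missing_factor = 144 // listed_total
```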

Verified the test is automatically collected by the existing Kernels MoE Test CI step (no .buildkite/ changes needed):

# Simulate exactly what buildkite runs:
pytest tests/kernels/moe --collect-only -q | grep nan_inf
# Shows both test_fused_topk_nan_inf_clamp (added in #39391) and
# test_fused_topk_bias_nan_inf_clamp (this PR) are collected

Ran the full test_fused_topk.py suite (720 tests) to check for regressions.

Test Result

nan/inf regression tests (144 cases)

| Kernel clamp | nan/inf tests passed | nan/inf tests failed |
| --- | --- | --- |
| Disabled (pre-#39391 behaviour) | 36 | 108 |
| Enabled (post-#39391, this PR) | 144 | 0 |

Full test suite (720 cases, 2×H200, CUDA 13.0)

720 passed, 16 warnings in 104.91s

Changed files

tests/kernels/moe/test_fused_topk.py (modified, +69/-0)

RAW_BUFFER

#39391 was merged because we wanted it fixed in v0.20.

Two items from this review comment were left unaddressed: https://github.com/vllm-project/vllm/pull/39391#pullrequestreview-4144377845

- add to CI
- add fused_topk_bias

extent analysis

TL;DR

The issue was resolved by PR #40553, which adds a NaN/Inf regression test for fused_topk_bias. The existing Kernels MoE Test CI step collects the test automatically, so no CI configuration change was needed.

Guidance

Review pull request #39391 and the review comment at https://github.com/vllm-project/vllm/pull/39391#pullrequestreview-4144377845 to see which items were left open.
Add a NaN/Inf regression test that covers fused_topk_bias, mirroring test_fused_topk_nan_inf_clamp.
Confirm that the new test is collected by the existing Kernels MoE Test CI step (PR #40553 reports that no .buildkite/ changes were needed).
Verify that the full tests/kernels/moe/test_fused_topk.py suite still passes.

Notes

fused_topk_bias already exists. According to PR #40553, on CUDA it routes through the same topk_softmax / topk_sigmoid kernels patched in #39391, so no kernel change was required, only test coverage.

Recommendation

No workaround is needed: both follow-up items from the review are covered by PR #40553, which is merged and closes #40457.
