vllm - 💡(How to fix) Fix [CI Failure]: IMA in tests/kernels/moe/test_cutedsl_moe.py [1 comments, 2 participants]

vllm-project/vllm#39522 (fetched 2026-04-11 06:13:01)

Error Message

[06:41:07.520334] coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
[06:41:07.520340] coredump:   - Device: 0
[06:41:07.520342] coredump:   - SM: 142
[06:41:07.520345] coredump:   - Warp: 0
[06:41:07.520347] coredump:   - PC 0x7bdd7b5ef520
[06:41:07.520556] coredump: Stack trace (lane masks: active 0xFFFFFFFF, valid 0xFFFFFFFF):
[06:41:07.520563] coredump:   #0	0x7bdd7b5efc70	kernel_cutlass_kernel_flashinfergemmkernelsgrouped_gemm_masked_blackwellSm100BlockScaledPersistentDenseGemmKernel_object_at__TiledMMA_ThrLayoutVMNK11110000_PermutationMNK____MMAAtom_ThrID_0

Name of failing test

tests/kernels/moe/test_cutedsl_moe.py::test_flashinfer_cutedsl_moe_masked[1-2-128-256]

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

I'm seeing this test fail on the 04/09/25 nightly (https://buildkite.com/vllm/ci/builds/60760). The test runs in both the Kernels (B200) and Kernels Fp4 MoE Test (B200) test groups. I have not been able to reproduce the failure locally on a B200 machine.

Relevant log snippet

[06:41:07.520334] coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
--
[06:41:07.520340] coredump:   - Device: 0
[06:41:07.520342] coredump:   - SM: 142
[06:41:07.520345] coredump:   - Warp: 0
[06:41:07.520347] coredump:   - PC 0x7bdd7b5ef520
[06:41:07.520556] coredump: Stack trace (lane masks: active 0xFFFFFFFF, valid 0xFFFFFFFF):
[06:41:07.520563] coredump:   #0	0x7bdd7b5efc70	kernel_cutlass_kernel_flashinfergemmkernelsgrouped_gemm_masked_blackwellSm100BlockScaledPersistentDenseGemmKernel_object_at__TiledMMA_ThrLayoutVMNK11110000_PermutationMNK____MMAAtom_ThrID_0
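
To compare failures across CI runs (for example, whether the exception type or SM differs between builds), the coredump lines above can be parsed into structured fields. The following is a minimal sketch that assumes the log format shown in this snippet; `parse_coredump` and the dictionary keys are hypothetical names chosen for illustration, not part of vLLM or any CUDA tool.

```python
import re

# Sample lines copied from the log snippet above (stack frame omitted).
SAMPLE = """\
[06:41:07.520334] coredump: Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
--
[06:41:07.520340] coredump:   - Device: 0
[06:41:07.520342] coredump:   - SM: 142
[06:41:07.520345] coredump:   - Warp: 0
[06:41:07.520347] coredump:   - PC 0x7bdd7b5ef520
"""

# "[timestamp] coredump: message"
LINE_RE = re.compile(r"^\[(?P<ts>[\d:.]+)\]\s+coredump:\s*(?P<msg>.*)$")
# "... exception of type NAME (code)"
EXC_RE = re.compile(r"exception of type (\w+) \((\d+)\)")
# "- Device: 0", "- PC 0x..." (the colon is absent on the PC line)
FIELD_RE = re.compile(r"^-\s*(Device|SM|Warp|PC):?\s+(\S+)$")


def parse_coredump(text):
    """Return a dict with the exception name/code and device, sm, warp, pc."""
    info = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skips separators such as "--"
        msg = match.group("msg").strip()
        exc = EXC_RE.search(msg)
        if exc:
            info["exception"] = exc.group(1)
            info["code"] = int(exc.group(2))
            continue
        field = FIELD_RE.match(msg)
        if field:
            key, value = field.groups()
            # PC is hexadecimal; the other fields are decimal.
            info[key.lower()] = int(value, 16) if value.startswith("0x") else int(value)
    return info
```

Running `parse_coredump` over the saved log of each failing build gives records that can be compared directly. Lines the sketch does not recognize, including the stack frame, are ignored.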

📝 History of failing test

The first instance of this bug in CI in the Kernels (B200) group is https://buildkite.com/vllm/ci/builds/59119 which is from 03/31/25 and is based on https://github.com/vllm-project/vllm/commit/517b769b5858a8d8d233d277f54461acfc9def63.

The first instance of this bug in CI in the Kernels Fp4 MoE Test (B200) group is https://buildkite.com/vllm/ci/builds/58266 which is from 03/26/25 and is based on https://github.com/vllm-project/vllm/commit/be1a85b7a2929f25c93d469fdd733a3576609e70.

Given that https://github.com/vllm-project/vllm/commit/be1a85b7a2929f25c93d469fdd733a3576609e70 modifies relevant logic associated with this test, it seems plausible to me that it could have introduced this failure.

CC List.

@zhewenl What do you think? Do you think that https://github.com/vllm-project/vllm/commit/be1a85b7a2929f25c93d469fdd733a3576609e70 could have caused these spurious failures we are seeing in CI?

Extent analysis

TL;DR

Investigate the changes introduced in commit be1a85b7a2929f25c93d469fdd733a3576609e70 as a potential cause of the test failures.

Guidance

  • Review the code changes in commit be1a85b7a2929f25c93d469fdd733a3576609e70 to understand their impact on the test_flashinfer_cutedsl_moe_masked test.
  • Check the test logs for any patterns or correlations between the test failures and specific input parameters or test cases.
  • Attempt to reproduce the test failure locally with the same input parameters and test cases to gather more information.
  • Consider reverting or modifying the changes introduced in commit be1a85b7a2929f25c93d469fdd733a3576609e70 to see if it resolves the test failures.
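
Because the failure is intermittent, a local reproduction attempt usually means running the single failing parametrization many times. The sketch below is one way to do that, assuming it is run from the vLLM repository root on a B200 machine with pytest installed. The helper names (`build_command`, `build_env`, `run_until_failure`) and the default of 50 runs are hypothetical. `CUDA_LAUNCH_BLOCKING=1` is set so that kernel launches are synchronous and an illegal memory access is reported closer to the launch that caused it; this changes timing, so it may also hide a race.

```python
import os
import subprocess
import sys

# Test ID taken from the "Name of failing test" section of the issue.
TEST_ID = (
    "tests/kernels/moe/test_cutedsl_moe.py::"
    "test_flashinfer_cutedsl_moe_masked[1-2-128-256]"
)


def build_command(test_id=TEST_ID):
    """pytest invocation for a single test ID; -x stops at the first failure."""
    return [sys.executable, "-m", "pytest", "-x", "-q", test_id]


def build_env(base=None):
    """Copy the environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_until_failure(max_runs=50, runner=subprocess.run):
    """Run the test repeatedly; return the 1-based attempt that failed, or None."""
    cmd, env = build_command(), build_env()
    for attempt in range(1, max_runs + 1):
        result = runner(cmd, env=env)
        if result.returncode != 0:
            return attempt
    return None
```

Calling `run_until_failure()` returns the attempt number of the first failing run, or `None` if all runs pass. The `runner` parameter exists only so the loop can be exercised without a GPU by passing a stand-in for `subprocess.run`.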

Notes

The exact cause of the test failures is unclear, and further investigation is needed to determine the root cause.

Recommendation

Workaround to try: revert or modify the changes introduced in commit be1a85b7a2929f25c93d469fdd733a3576609e70, which is a plausible but unconfirmed cause, and check whether the test failures stop.
