vllm - [CI] test_moe_layer: all modelopt_fp4 subtests failing on 2 B200s

GitHub stats
vllm-project/vllm#39536 (fetched 2026-04-11 06:12:55)
Comments: 0 | Participants: 1 | Timeline: 1 (closed ×1) | Reactions: 0


Name of failing test

tests/kernels/moe/test_moe_layer.py::test_moe_layer — all modelopt_fp4 parameterized subtests

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

In the 2026-04-09 nightly build (#60697), the Kernels FusedMoE Layer Test (2 B200s) step failed with 26 failing subtests. All of them use the modelopt_fp4 quantization format, across three communication backends:

  • flashinfer_nvlink_two_sided — 20 failed subtests
  • flashinfer_nvlink_one_sided — 20 failed subtests
  • deepep_high_throughput — 12 failed subtests + 1 SIGABRT

Step link: https://buildkite.com/vllm/ci/builds/60697#019d740c-e286-4513-b9dd-7d6dc04ab473

Commit: e5de19ff9a64

Error details

All failures raise RuntimeError from _parallel_worker (line 1629) with a list of failed subtests. The deepep_high_throughput-2-1-True variant also terminated with SIGABRT.

Example failed subtest parameter sets:

[1-128-256-8-2-bfloat16-modelopt_fp4-False-False-False-False-False-flashinfer_nvlink_two_sided-1-1-2]
[1-128-256-8-2-bfloat16-modelopt_fp4-True-False-False-False-False-flashinfer_nvlink_two_sided-1-1-2]
[32-1024-512-64-6-bfloat16-modelopt_fp4-True-False-True-False-False-deepep_high_throughput-1-1-2]
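
To check whether failures cluster by backend, the bracketed IDs can be grouped mechanically. The following is a minimal sketch, not part of the vLLM test suite: the helper names `backend_of` and `tally_by_backend` are hypothetical, and it assumes only that each ID is a hyphen-separated list of fields in which the quantization format and backend appear verbatim. It makes no assumption about what the other fields mean.

```python
from collections import Counter
from typing import List, Optional

# Backend names exactly as they appear in this report.
KNOWN_BACKENDS = {
    "flashinfer_nvlink_two_sided",
    "flashinfer_nvlink_one_sided",
    "deepep_high_throughput",
}


def backend_of(test_id: str) -> Optional[str]:
    """Return the backend named in a bracketed test ID, or None if absent."""
    fields = test_id.strip("[]").split("-")
    return next((f for f in fields if f in KNOWN_BACKENDS), None)


def tally_by_backend(test_ids: List[str], quant: str = "modelopt_fp4") -> Counter:
    """Count IDs per backend, keeping only IDs that contain the quant field."""
    counts: Counter = Counter()
    for test_id in test_ids:
        fields = test_id.strip("[]").split("-")
        if quant in fields:
            counts[backend_of(test_id) or "unknown"] += 1
    return counts


# The three example IDs quoted in the report (not the full failure list).
failed_ids = [
    "[1-128-256-8-2-bfloat16-modelopt_fp4-False-False-False-False-False-flashinfer_nvlink_two_sided-1-1-2]",
    "[1-128-256-8-2-bfloat16-modelopt_fp4-True-False-False-False-False-flashinfer_nvlink_two_sided-1-1-2]",
    "[32-1024-512-64-6-bfloat16-modelopt_fp4-True-False-True-False-False-deepep_high_throughput-1-1-2]",
]

for backend, count in tally_by_backend(failed_ids).items():
    print(f"{backend}: {count}")
```

Feeding the complete list of failed IDs from the CI log into `tally_by_backend` would give the per-backend breakdown for the whole run.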

Test results summary

26 failed, 163 passed, 235 skipped, 18 warnings in 586.63s
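
For tracking this step across nightly runs, the summary line can be turned into numbers. This is a small sketch under the assumption that the line keeps the `<count> <label>` shape shown above; `parse_summary` is a hypothetical helper name, not a pytest API.

```python
import re
from typing import Dict

# Matches pairs such as "26 failed" or "18 warnings"; the trailing
# duration ("586.63s") has no space after its digits and is ignored.
COUNT_RE = re.compile(r"(\d+) ([a-z]+)")


def parse_summary(line: str) -> Dict[str, int]:
    """Map each outcome label in a pytest summary line to its count."""
    return {label: int(number) for number, label in COUNT_RE.findall(line)}


summary = parse_summary("26 failed, 163 passed, 235 skipped, 18 warnings in 586.63s")

executed = summary["failed"] + summary["passed"]
collected = executed + summary["skipped"]
print(f"collected={collected} executed={executed} failed={summary['failed']}")
```

For this run, 424 cases were collected, 189 of them executed, and 26 of those failed.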

Potentially causal PRs

These PRs were merged shortly before the nightly and touch MoE/FP4 code paths:

  • #39322 — [Feature] Batch invariant nvfp4 linear support (merged 2026-04-08)
  • #39315 — [Bugfix] FlashInfer MXINT4 MoE crashes, missing do_finalize (merged 2026-04-09)
  • #39045 — [Gemma4] Support quantized MoE (merged 2026-04-09)

Related issues

  • #39503 — Similar FusedMoE Layer Test flakiness on 2 H100s (deepep_low_latency), but this is a different failure pattern (all modelopt_fp4, B200s)

Auto-generated by CI Watch Bot

extent analysis

TL;DR

Investigate the recent changes in PRs #39322, #39315, and #39045 to identify potential causes of the RuntimeError in the FusedMoE Layer Test.

Guidance

  • Review the code changes in PRs #39322, #39315, and #39045 to see if they introduced any issues with the modelopt_fp4 quantization format.
  • Check the test results for any patterns or correlations between the failed subtests and the communication backends (flashinfer_nvlink_two_sided, flashinfer_nvlink_one_sided, deepep_high_throughput).
  • Investigate the SIGABRT termination in the deepep_high_throughput-2-1-True variant to determine if it's related to the RuntimeError issues.
  • Consider re-running the failed subtests individually to gather more information about the failures.
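
Re-running a single case could look like the sketch below. It assumes the bracketed strings in the report are pytest parameter IDs of `test_moe_layer`, so that they can be appended to the node ID; if the subtests are instead dispatched inside the test's worker processes, only the `-k` form applies. The helper names are hypothetical, and the commands need the same two-GPU environment as the CI step.

```python
import shlex
from typing import List

TEST_FILE = "tests/kernels/moe/test_moe_layer.py"
TEST_NAME = "test_moe_layer"


def rerun_single(param_id: str) -> List[str]:
    """Build a pytest command for one parameterized case (assumed node-ID form)."""
    node_id = f"{TEST_FILE}::{TEST_NAME}{param_id}"
    # -x stops at the first failure, -v prints each test ID as it runs.
    return ["pytest", "-x", "-v", node_id]


def rerun_by_keyword(keyword: str) -> List[str]:
    """Build a pytest command selecting every case whose ID matches a keyword."""
    return ["pytest", "-v", f"{TEST_FILE}::{TEST_NAME}", "-k", keyword]


single = rerun_single(
    "[1-128-256-8-2-bfloat16-modelopt_fp4-False-False-False-False-False-"
    "flashinfer_nvlink_two_sided-1-1-2]"
)
# shlex.join quotes the brackets so the line can be pasted into a shell.
print(shlex.join(single))
print(shlex.join(rerun_by_keyword("modelopt_fp4")))
```

Running one case at a time keeps the output of the `RuntimeError` and the SIGABRT separate, which makes it easier to tell whether they share a cause.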

Notes

The failures coincide with recent changes to the MoE/FP4 code paths, so reviewing those changes is the most direct way to look for the root cause. None of the listed PRs has been confirmed as the cause.

Recommendation

Apply workaround: Revert or modify the changes from PRs #39322, #39315, and #39045, one PR at a time, and re-run the test after each to see whether the failures go away. These PRs are the most likely candidates for the RuntimeError, but none has been confirmed as the cause.
