vllm [Bug]: FLASHINFER_CUTLASS_MXFP4_MXFP8 produces wrong output under expert parallelism




Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

The post-all-to-all swizzle in the EP/DP-EP prepare-finalize wrappers is gated on quant_dtype == "nvfp4", which means mxfp8 activation scales arrive at the CUTLASS kernel unswizzled. The CUTLASS MoE kernel requires swizzled scales, so outputs are garbage.

This is a pre-existing limitation surfaced by review of #42089, not a regression from that PR.

Affected lines (the gate currently checks only quant_dtype == "nvfp4"):

  • vllm/model_executor/layers/fused_moe/prepare_finalize/naive_dp_ep.py:62
  • vllm/model_executor/layers/fused_moe/prepare_finalize/flashinfer_nvlink_one_sided.py:132-135
  • vllm/model_executor/layers/fused_moe/prepare_finalize/flashinfer_nvlink_two_sided.py:197
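The effect of the gate described above can be illustrated with a toy model. This is a minimal sketch, not vLLM code: `swizzle`, `post_all_to_all`, and `cutlass_moe_accepts` are hypothetical stand-ins, and the "layout" is only a tag on the data. Only the `quant_dtype == "nvfp4"` condition is taken from the report.

```python
def swizzle(scales):
    """Hypothetical stand-in for a swizzle helper; it only tags the data."""
    return ("swizzled", scales)


def post_all_to_all(quant_dtype, scales):
    """Toy version of the post-all-to-all step with the reported gate."""
    # The gate from the report: only nvfp4 scales take the swizzle path.
    if quant_dtype == "nvfp4":
        return swizzle(scales)
    # Every other dtype, including mxfp8, falls through unswizzled.
    return ("linear", scales)


def cutlass_moe_accepts(tagged_scales):
    """Toy check: the kernel is described as requiring swizzled scales."""
    return tagged_scales[0] == "swizzled"


nvfp4_ok = cutlass_moe_accepts(post_all_to_all("nvfp4", [1.0, 2.0]))
mxfp8_ok = cutlass_moe_accepts(post_all_to_all("mxfp8", [1.0, 2.0]))
print(nvfp4_ok, mxfp8_ok)
```

In the real kernel there is no such check: the unswizzled scales are simply read as if they were swizzled, which is consistent with the garbage outputs described in the report.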

Repro (the GPQA eval times out at 1800 s; only ~6 of 1584 items complete in 11 minutes):

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 \
  vllm serve openai/gpt-oss-20b \
    --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.4
# then run gpt_oss.evals gpqa against it

Prepare-finalize (PF) implementation in use: MoEPrepareAndFinalizeNaiveDPEPModular. Backend confirmed: FLASHINFER_CUTLASS_MXFP4_MXFP8.

Why the obvious fix isn't quite right

Just expanding the gate to include "mxfp8" and reusing nvfp4_block_scale_interleave would be wrong, because that helper produces an nvfp4-specific layout. mxfp8 needs swizzle_mxfp8_scale from vllm/model_executor/layers/quantization/utils/mxfp8_utils.py:14, which produces the F8_128x4 layout and needs the (M, K) dims at the call site.
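One way to see why (M, K) must be available at the call site is to compute the padded shape of the scale matrix. The sketch below is illustrative only. It assumes one scale per 32 elements (the MX block size) and assumes that F8_128x4 means padding to tiles of 128 rows by 4 scale columns. `swizzled_scale_shape` is a hypothetical helper, not part of vLLM.

```python
import math

MX_BLOCK = 32      # MX formats share one scale per 32 elements
TILE_ROWS = 128    # assumed from the layout name F8_128x4
TILE_COLS = 4      # assumed from the layout name F8_128x4


def swizzled_scale_shape(m: int, k: int) -> tuple[int, int]:
    """Padded (rows, cols) of the scale matrix for an (M, K) activation."""
    scale_cols = math.ceil(k / MX_BLOCK)
    rows = math.ceil(m / TILE_ROWS) * TILE_ROWS
    cols = math.ceil(scale_cols / TILE_COLS) * TILE_COLS
    return rows, cols


# Both activations carry 4096 raw scales (64 x 64 and 256 x 16),
# yet their padded shapes differ, so a flat buffer alone is ambiguous.
shape_a = swizzled_scale_shape(64, 2048)
shape_b = swizzled_scale_shape(256, 512)
print(shape_a, shape_b)
```

Under these assumptions, two activations with the same number of scales can need different padding, so the swizzle cannot be derived from the scale buffer's element count alone.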

Proposed direction (also tracked separately): replace the orthogonal quant_dtype + is_scale_swizzled pair with merged dtype variants ("nvfp4_swizzled" / "mxfp8_swizzled") so the dispatcher can route to the right swizzle helper from a single field, removing this bug class entirely.
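A single-field dispatcher along these lines might look like the sketch below. The variant names "nvfp4_swizzled" and "mxfp8_swizzled" come from the proposal; everything else (`SWIZZLERS`, `LINEAR_DTYPES`, `prepare_scales`, and the two stand-in helpers that only tag their input) is hypothetical and does not reflect vLLM's actual interfaces.

```python
def nvfp4_swizzle(scales, m, k):
    """Stand-in for the nvfp4 helper; tags the data with its layout."""
    return ("nvfp4_layout", scales)


def mxfp8_swizzle(scales, m, k):
    """Stand-in for the mxfp8 helper, which also needs the (M, K) dims."""
    return ("F8_128x4", scales, (m, k))


# One field selects the helper; there is no separate swizzle flag to forget.
SWIZZLERS = {
    "nvfp4_swizzled": nvfp4_swizzle,
    "mxfp8_swizzled": mxfp8_swizzle,
}
LINEAR_DTYPES = {"nvfp4", "mxfp8"}


def prepare_scales(quant_dtype, scales, m, k):
    """Route scales to the helper implied by the merged dtype variant."""
    if quant_dtype in SWIZZLERS:
        return SWIZZLERS[quant_dtype](scales, m, k)
    if quant_dtype in LINEAR_DTYPES:
        return ("linear", scales)
    # Unknown variants fail loudly instead of silently skipping the swizzle.
    raise ValueError(f"unsupported quant_dtype: {quant_dtype!r}")


print(prepare_scales("mxfp8_swizzled", [1.0], 4, 64))
```

Raising on an unknown variant is a deliberate choice in this sketch: a new dtype that lacks a registered helper surfaces as an error rather than as wrong output.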

