vllm - [Bug]: FLASHINFER_CUTLASS_MXFP4_MXFP8 produces wrong output under expert parallelism

vllm · 2026-05-08 21:35:37

Fix Action

Fix / Workaround

Proposed direction (also tracked separately): replace the orthogonal quant_dtype + is_scale_swizzled flag with merged dtype variants ("nvfp4_swizzled" / "mxfp8_swizzled") so the dispatcher can route to the right swizzle helper from a single field, removing this bug class entirely.
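As a rough illustration of that direction, the sketch below routes scales through a lookup keyed on a single merged dtype string. Only the two variant names come from the proposal; the table, the stub helpers, and `route_scales` are hypothetical stand-ins and do not reflect vLLM's actual API.

```python
from typing import Callable, Dict, List, Tuple

Routed = Tuple[str, List[float]]

# Hypothetical stand-ins for the real swizzle helpers; they only tag the layout.
def _nvfp4_swizzle_stub(scales: List[float], m: int, k: int) -> Routed:
    return ("nvfp4-swizzled", scales)

def _mxfp8_swizzle_stub(scales: List[float], m: int, k: int) -> Routed:
    return ("mxfp8-swizzled", scales)

# One field decides the helper, so there is no separate is_scale_swizzled
# flag that can disagree with quant_dtype.
SWIZZLE_BY_DTYPE: Dict[str, Callable[[List[float], int, int], Routed]] = {
    "nvfp4_swizzled": _nvfp4_swizzle_stub,
    "mxfp8_swizzled": _mxfp8_swizzle_stub,
}

def route_scales(scales: List[float], quant_dtype: str, m: int, k: int) -> Routed:
    """Pick the swizzle helper from the merged dtype string alone."""
    helper = SWIZZLE_BY_DTYPE.get(quant_dtype)
    if helper is None:
        # Dtypes without a swizzled variant pass through untouched.
        return ("unswizzled", scales)
    return helper(scales, m, k)

nvfp4_layout = route_scales([1.0], "nvfp4_swizzled", 128, 64)[0]
mxfp8_layout = route_scales([1.0], "mxfp8_swizzled", 128, 64)[0]
plain_layout = route_scales([1.0], "fp8", 128, 64)[0]
```

Adding a new swizzled format in this scheme would mean adding one table entry rather than touching each prepare-finalize wrapper's gate.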

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

The post-all-to-all swizzle in the EP/DP-EP prepare-finalize wrappers is gated on quant_dtype == "nvfp4", which means mxfp8 activation scales arrive at the CUTLASS kernel unswizzled. The CUTLASS MoE kernel requires swizzled scales, so outputs are garbage.

This is a pre-existing limitation surfaced by review of #42089, not a regression from that PR.

Affected lines (gate currently quant_dtype == "nvfp4" only):

vllm/model_executor/layers/fused_moe/prepare_finalize/naive_dp_ep.py:62
vllm/model_executor/layers/fused_moe/prepare_finalize/flashinfer_nvlink_one_sided.py:132-135
vllm/model_executor/layers/fused_moe/prepare_finalize/flashinfer_nvlink_two_sided.py:197
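The failure mode can be modeled in a few lines. This is a toy sketch, not the code at the locations listed above: `Scales`, `swizzle_stub`, `post_all_to_all`, and `cutlass_moe_accepts` are hypothetical names that only track whether the scales were swizzled.

```python
from dataclasses import dataclass

@dataclass
class Scales:
    """Stand-in for an activation scale tensor; tracks only its layout."""
    layout: str = "linear"

def swizzle_stub(scales: Scales) -> Scales:
    # Hypothetical placeholder for the real swizzle helper.
    return Scales(layout="swizzled")

def post_all_to_all(scales: Scales, quant_dtype: str) -> Scales:
    # Mirrors the gate described in the issue: only nvfp4 triggers the swizzle.
    if quant_dtype == "nvfp4":
        return swizzle_stub(scales)
    return scales

def cutlass_moe_accepts(scales: Scales) -> bool:
    # Per the issue, the CUTLASS MoE kernel requires swizzled scales.
    return scales.layout == "swizzled"

nvfp4_ok = cutlass_moe_accepts(post_all_to_all(Scales(), "nvfp4"))
mxfp8_ok = cutlass_moe_accepts(post_all_to_all(Scales(), "mxfp8"))
```

In this model `mxfp8_ok` is false because the mxfp8 scales fall through the gate unchanged, which corresponds to the kernel reading unswizzled scales.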

Repro (GPQA Eval times out at 1800 s; ~6/1584 items in 11 minutes):

VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS=1 \
  vllm serve openai/gpt-oss-20b \
    --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.4
# then run gpt_oss.evals gpqa against it

PF in use: MoEPrepareAndFinalizeNaiveDPEPModular. Backend confirmed: FLASHINFER_CUTLASS_MXFP4_MXFP8.

Why the obvious fix isn't quite right

Just expanding the gate to include "mxfp8" and reusing nvfp4_block_scale_interleave would be wrong, because that helper produces an nvfp4-specific layout. mxfp8 needs swizzle_mxfp8_scale from vllm/model_executor/layers/quantization/utils/mxfp8_utils.py:14, which produces the F8_128x4 layout and needs the (M, K) dims at the call site.
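A minimal sketch of what a dtype-aware gate might look like follows. The two stub functions stand in for nvfp4_block_scale_interleave and swizzle_mxfp8_scale; their signatures here are assumptions for illustration, as is the `swizzle_after_all_to_all` wrapper and its `dims` parameter.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, List[float]]

def nvfp4_interleave_stub(scales: List[float]) -> Tagged:
    # Stand-in for the nvfp4 helper: no extra shape arguments in this sketch.
    return ("nvfp4_layout", scales)

def mxfp8_swizzle_stub(scales: List[float], m: int, k: int) -> Tagged:
    # Stand-in for the mxfp8 helper: the issue notes it needs (M, K).
    if m <= 0 or k <= 0:
        raise ValueError("M and K must be positive")
    return ("F8_128x4", scales)

def swizzle_after_all_to_all(
    scales: List[float],
    quant_dtype: str,
    dims: Optional[Tuple[int, int]] = None,
) -> Tagged:
    """Route each dtype to its own helper instead of sharing one."""
    if quant_dtype == "nvfp4":
        return nvfp4_interleave_stub(scales)
    if quant_dtype == "mxfp8":
        if dims is None:
            # The caller must thread (M, K) through to this point.
            raise ValueError("mxfp8 swizzle needs (M, K) at the call site")
        m, k = dims
        return mxfp8_swizzle_stub(scales, m, k)
    return ("linear", scales)

mxfp8_layout = swizzle_after_all_to_all([1.0, 2.0], "mxfp8", dims=(256, 128))[0]
nvfp4_layout = swizzle_after_all_to_all([1.0, 2.0], "nvfp4")[0]
```

The extra `dims` argument is the part a one-line gate change would miss: each prepare-finalize wrapper would have to supply the post-all-to-all shape, not just the dtype.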

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
