vllm: [Bug] MXFP8 MoE always falls back to MARLIN on SM_121 (DGX Spark / GB10): TrtLlmFp8ExpertsBase gates on family(100), excluding SM_12x consumer Blackwell


Summary

On NVIDIA GB10 / DGX Spark (SM_121) serving a modelopt_mxfp8 checkpoint, the MXFP8 MoE backend selector skips FLASHINFER_TRTLLM and falls back to MARLIN W8A16. MoE expert weights are dequantized to BF16 before compute, losing the block-32 MX FP8 precision and significant throughput from the native Blackwell tcgen05.mma MX path.

This issue is distinct from #43507, which covers VLLM_CUTLASS MoE for tensor/token-scaled FP8-Dynamic models. It is specifically about the FLASHINFER_TRTLLM backend path for MXFP8 (OCP block-32 MX) models.

Affected hardware

  • NVIDIA GB10 / DGX Spark → SM_121
  • NVIDIA RTX 5000-series consumer Blackwell → SM_120
  • Any SM_12x consumer/prosumer Blackwell variant

Affected model

  • modelopt_mxfp8 checkpoints (OCP MX FP8 E4M3, block-32 weight + activation scales)
  • Confirmed on: gemma-4-26B-A4B MoE architecture

Symptom

At startup, the engine selects MARLIN instead of FLASHINFER_TRTLLM:

INFO [mxfp8.py:88] Using 'MARLIN' MxFp8 MoE backend.

Root cause — precise diagnosis

1. Device family check in TrtLlmFp8ExpertsBase._supports_current_device()

# vllm/model_executor/layers/fused_moe/modular_kernel.py (TrtLlmFp8ExpertsBase)
@staticmethod
def _supports_current_device() -> bool:
    """Supports only Blackwell-family GPUs."""
    p = current_platform
    return (
        p.is_cuda()
        and p.is_device_capability_family(100)   # <-- requires SM_10x only
        and has_flashinfer_trtllm_fused_moe()
    )

is_device_capability_family(100) checks for SM_10x (datacenter Blackwell: B100/B200 = SM_100). SM_121 (GB10 / DGX Spark) has compute capability 12.1, so family(100) returns False and both TrtLlmFp8ExpertsMonolithic and TrtLlmFp8ExpertsModular are rejected.
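The family comparison can be illustrated in isolation. The following is a minimal sketch, assuming the check compares compute capabilities after dropping the minor digit (12.1 becomes 121, family 12). `capability_to_int` and `in_capability_family` are hypothetical stand-ins written for this illustration, not vLLM's actual implementation.

```python
def capability_to_int(major: int, minor: int) -> int:
    """Encode a compute capability such as 12.1 as the integer 121."""
    return major * 10 + minor


def in_capability_family(major: int, minor: int, family: int) -> bool:
    """Return True when the device shares its major version with `family`.

    Assumption: a family is identified by dropping the last digit,
    so 100 covers SM_10x and 120 covers SM_12x.
    """
    return capability_to_int(major, minor) // 10 == family // 10


if __name__ == "__main__":
    # Datacenter Blackwell (SM_100) against the family(100) gate.
    print("SM_100 in family 100:", in_capability_family(10, 0, 100))
    # GB10 / DGX Spark (SM_121) against family(100) and family(120).
    print("SM_121 in family 100:", in_capability_family(12, 1, 100))
    print("SM_121 in family 120:", in_capability_family(12, 1, 120))
```

Under this assumption, a gate written only against family 100 can never admit an SM_12x device, whatever kernels are installed.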

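2. flashinfer_trtllm_moe module cannot be imported on the SM_121 build

>>> from vllm.model_executor.layers.fused_moe.experts.flashinfer_trtllm_moe import has_flashinfer_trtllm_fused_moe
ModuleNotFoundError: No module named 'vllm.model_executor.layers.fused_moe.experts.flashinfer_trtllm_moe'

A ModuleNotFoundError shows that this Python module path is absent from the installed vLLM build. It does not by itself show whether the kernel binary was compiled or included for SM_121, so kernel availability for SM_12x still needs to be confirmed separately.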
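To separate "the Python module path does not exist" from "the kernel is missing", the import path can be probed without executing the target module. This is a minimal sketch using only the standard library. `module_exists` is a hypothetical helper, not a vLLM API, and the demo uses standard-library module names so that it runs anywhere.

```python
import importlib.util


def module_exists(dotted_name: str) -> bool:
    """Report whether `dotted_name` resolves to an importable module.

    find_spec imports parent packages but not the target module itself.
    It returns None when the final component is missing and raises
    ModuleNotFoundError when a parent package cannot be imported.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


if __name__ == "__main__":
    # Replace these names with the vLLM module path under investigation.
    for name in ("json.decoder", "json.no_such_module", "no_such_pkg_xyz.sub"):
        print(name, "->", module_exists(name))
```

If the probe reports the module as missing, the failure is at the Python packaging level, and the compiled-kernel question has to be answered by other means.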

3. MXFP8 oracle offers only TRTLLM and MARLIN on CUDA

# vllm/model_executor/layers/fused_moe/oracle/mxfp8.py
_SUPPORTED_BACKENDS = (
    Fp8MoeBackend.FLASHINFER_TRTLLM,
    Fp8MoeBackend.MARLIN,
    Fp8MoeBackend.XPU,
)

Unlike the FP8 oracle (which has FLASHINFER_CUTLASS and DEEPGEMM as intermediate options), the MXFP8 oracle has no intermediate fast path between TRTLLM and MARLIN, and the XPU entry does not apply to NVIDIA GPUs. When TRTLLM is unavailable on CUDA, MARLIN is the only fallback.
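The fallback behaviour can be sketched as a first-match walk over a priority list. This is an illustration under stated assumptions, not the oracle's real code: `Backend`, `select_backend`, and the support predicates are hypothetical names, and the actual selector may apply additional checks.

```python
from enum import Enum
from typing import Callable, Mapping, Sequence


class Backend(Enum):
    """Hypothetical mirror of the two CUDA-relevant MXFP8 MoE backends."""
    FLASHINFER_TRTLLM = "FLASHINFER_TRTLLM"
    MARLIN = "MARLIN"


def select_backend(
    priority: Sequence[Backend],
    is_supported: Mapping[Backend, Callable[[], bool]],
) -> Backend:
    """Return the first backend in `priority` whose predicate passes."""
    for backend in priority:
        if is_supported[backend]():
            return backend
    raise RuntimeError("no supported MXFP8 MoE backend")


PRIORITY = (Backend.FLASHINFER_TRTLLM, Backend.MARLIN)


def choose(trtllm_ok: bool) -> Backend:
    """Assume MARLIN is always usable and TRTLLM depends on its device gate."""
    return select_backend(
        PRIORITY,
        {
            Backend.FLASHINFER_TRTLLM: lambda: trtllm_ok,
            Backend.MARLIN: lambda: True,
        },
    )


if __name__ == "__main__":
    print("gate passes:", choose(True).value)
    print("gate fails: ", choose(False).value)
```

With only two CUDA candidates in the list, any rejection of the first one lands on MARLIN, which matches the log line reported above.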

Impact

On SM_121 with a modelopt_mxfp8 checkpoint:

  • MoE layers (the majority of parameters in MoE models) run MARLIN W8A16: weights are dequantized FP8→BF16, compute in BF16. No MX path.
  • Non-MoE linear layers correctly use FlashInferCutlassMxfp8LinearKernel with native tcgen05.mma MX instructions (after the SM_121 CUTLASS patch in #40082).
  • Net result: the Blackwell MX speedup applies only to dense layers. MoE experts — which dominate memory bandwidth and FLOP counts in MoE models — run on the legacy Marlin path.

Confirmation

Verified on live DGX Spark (GB10, SM_121) serving gemma-4-26B-A4B-it-MXFP8W8A8:

>>> current_platform.is_device_capability_family(100)
False
# SM_121 has compute capability 12.1 — family 100 = SM_10x only

Non-MoE linear layers correctly report:

INFO [cuda.py:...] Using FlashInferCutlassMxfp8LinearKernel

MoE layers fall back:

INFO [mxfp8.py:88] Using 'MARLIN' MxFp8 MoE backend.
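When validating a candidate fix, the selected backend can be read from the startup log instead of being checked by eye. This is a minimal sketch that assumes the log line keeps the wording shown above. `selected_moe_backend` is a hypothetical helper, and the message format may differ between vLLM versions.

```python
import re
from typing import Optional

# Matches lines such as: INFO [mxfp8.py:88] Using 'MARLIN' MxFp8 MoE backend.
_BACKEND_RE = re.compile(r"Using '(?P<backend>[A-Za-z0-9_]+)' MxFp8 MoE backend")


def selected_moe_backend(log_text: str) -> Optional[str]:
    """Return the MXFP8 MoE backend named in `log_text`, or None if absent."""
    match = _BACKEND_RE.search(log_text)
    return match.group("backend") if match else None


if __name__ == "__main__":
    sample = "INFO [mxfp8.py:88] Using 'MARLIN' MxFp8 MoE backend."
    print(selected_moe_backend(sample))
```

A test run on SM_121 would be expected to report FLASHINFER_TRTLLM here only after both parts of the fix are in place.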

Why SM_12x should be supported

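SM_120 (RTX 5000-series) and SM_121 (DGX Spark) are Blackwell parts with native block-scaled MX tensor core support, as the working linear-kernel path from #40082 demonstrates. If the TRTLLM FP8 MoE kernel relies only on instructions that SM_12x provides, it should be compatible once the appropriate compilation target is added. If it depends on SM_100-specific features such as tcgen05.mma, porting work would be needed in addition to the new target, so this should be checked against the PTX ISA target tables before the gate is widened.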

Proposed fix

Two parts:

1. Update device capability check in TrtLlmFp8ExpertsBase._supports_current_device() to include SM_12x:

@staticmethod
def _supports_current_device() -> bool:
    p = current_platform
    return (
        p.is_cuda()
        and (p.is_device_capability_family(100) or p.is_device_capability_family(120))
        and has_flashinfer_trtllm_fused_moe()
    )

2. Build flashinfer_trtllm_moe for SM_12x — add sm_121 (and sm_120) as compilation targets in the FlashInfer TRTLLM MoE build configuration, analogous to the SM_121 additions in the CUTLASS linear kernel path (#40082).

Hardware available for testing

We have a DGX Spark (GB10, SM_121) that can validate this fix end-to-end on a real modelopt_mxfp8 checkpoint with MXFP8 MoE inference. Happy to test PRs.

Related issues

  • #43507 — CUTLASS MoE backend unavailable on SM_120/SM_121 for tensor/token-scaled FP8 (different backend/quant format, same hardware exclusion pattern)
  • #40082 — SM_121 FlashInfer + cutlass-dsl support for non-MoE linear layers (the fix that did work for linear layers, demonstrating SM_121 capability)
