vllm - ✅(Solved) Fix [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880) [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#38022 (fetched 2026-04-08 01:27:02)
Comments: 0 · Participants: 1 · Timeline events: 4 · Reactions: 0
Timeline (top): referenced ×2, cross-referenced ×1, labeled ×1

When serving an MXFP4-quantized GPT-OSS 20B model (MoE architecture), vLLM crashes at inference time with a RuntimeError from the Marlin MoE GEMM kernel. The kernel fails to compute a valid thread configuration for the model's matrix dimensions (M=32768, K=2880, N=2880).

Error Message

File ".../vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 177, in _fused_marlin_moe
    output = ops.moe_wna16_marlin_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../vllm/_custom_ops.py", line 2467, in moe_wna16_marlin_gemm
    return torch.ops._moe_C.moe_wna16_marlin_gemm(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1, num_threads = -1 for MKN = [32768, 2880, 2880] and num_bits = 4, group_size = 32, has_act_order = 0, is_k_full = 1, has_zp = 0, is_zp_float = 0, max_shared_mem = 101376
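
When triaging a log, it can help to pull the numeric fields out of this message and see which ones the kernel failed to resolve. The following is a minimal sketch that assumes the message keeps the `name = value` layout shown above; `parse_thread_config_error` is a hypothetical helper, not part of vLLM.

```python
import re

# Error text copied from the traceback in this report.
ERROR = (
    "Invalid thread config: thread_m_blocks = 4, thread_k = -1, "
    "thread_n = -1, num_threads = -1 for MKN = [32768, 2880, 2880] and "
    "num_bits = 4, group_size = 32, has_act_order = 0, is_k_full = 1, "
    "has_zp = 0, is_zp_float = 0, max_shared_mem = 101376"
)


def parse_thread_config_error(message: str) -> dict:
    """Extract the integer fields and the MKN triple from the error text."""
    # Matches pairs such as "thread_k = -1"; "MKN = [" is skipped because
    # the value after "=" is not a plain integer.
    fields = {name: int(value)
              for name, value in re.findall(r"(\w+) = (-?\d+)", message)}
    mkn = re.search(r"MKN = \[(\d+), (\d+), (\d+)\]", message)
    if mkn:
        fields["M"], fields["K"], fields["N"] = (int(g) for g in mkn.groups())
    return fields


info = parse_thread_config_error(ERROR)
# Fields reported as -1 are the ones the config lookup could not fill in.
unresolved = sorted(name for name, value in info.items() if value == -1)
print(unresolved)
print(info["M"], info["K"], info["N"])
```

Here the unresolved fields are exactly thread_k, thread_n, and num_threads, which points at the tile selection rather than at the quantization parameters.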

Root Cause

The Marlin MoE kernel (moe_wna16_marlin_gemm) requires the K and N matrix dimensions to be aligned to its tile sizes. Per the fix PR, K must be divisible by MIN_THREAD_K (128) and N by MIN_THREAD_N (64). GPT-OSS 20B has hidden_size=2880 and intermediate_size=2880, and 2880 is not a multiple of 128 (2880 / 128 = 22.5, remainder 64), so the kernel's thread configuration lookup finds no valid entry and returns -1 for thread_k, thread_n, and num_threads.

Fix Action

Workaround

Currently the only workaround is to avoid MXFP4 quantization and use FP8 dynamic quantization instead, which uses different kernels without the alignment constraint. However, MXFP4 is the preferred quantization format for this model.

PR fix notes

PR #38222: [Bugfix] Add dimension alignment check to Marlin MoE kernel selection

Description (problem / solution / changelog)

Motivation

When serving an MXFP4-quantized MoE model with non-128-aligned dimensions (e.g. GPT-OSS 20B with hidden_size=2880, intermediate_size=2880), the Marlin MoE kernel crashes at inference time with:

RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1,
num_threads = -1 for MKN = [32768, 2880, 2880]

The Marlin GEMM kernel requires K to be divisible by MIN_THREAD_K (128) and N by MIN_THREAD_N (64), but 2880 % 128 = 64. The non-MoE Marlin path already has verify_marlin_supports_shape() that catches this at load time, but the MoE path (MarlinExpertsBase) had no equivalent check, so the oracle selected Marlin even though it can't handle the dimensions.
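
The arithmetic behind that statement can be checked directly. This is a small sketch, assuming the two minimums quoted above are the only constraints that matter; `marlin_alignment_problems` is a hypothetical name used for illustration and is not the vLLM function.

```python
# Tile minimums as quoted in the PR description; treated here as
# assumptions about the kernel, not values read from the vLLM source.
MIN_THREAD_K = 128
MIN_THREAD_N = 64


def marlin_alignment_problems(k: int, n: int) -> list[str]:
    """Return a human-readable entry for each violated constraint."""
    problems = []
    if k % MIN_THREAD_K != 0:
        problems.append(
            f"K={k} leaves remainder {k % MIN_THREAD_K} modulo {MIN_THREAD_K}"
        )
    if n % MIN_THREAD_N != 0:
        problems.append(
            f"N={n} leaves remainder {n % MIN_THREAD_N} modulo {MIN_THREAD_N}"
        )
    return problems


print(marlin_alignment_problems(2880, 2880))  # only the K check fails
print(marlin_alignment_problems(3072, 2880))  # both checks pass
```

For 2880 only the K constraint fails: 2880 = 45 × 64, so the N minimum is met, while 2880 = 22 × 128 + 64 misses the K minimum.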

Modifications

Override _supports_shape() and is_supported_config() in MarlinExpertsBase to validate both hidden_dim and intermediate_size_per_partition against the kernel's alignment requirements. When Marlin can't handle the dimensions, the oracle skips it and falls back to a compatible kernel (e.g. FlashInfer TRTLLM on Blackwell) instead of crashing.

The check mirrors the existing verify_marlin_supports_shape() logic in marlin_utils.py but operates at the oracle level so the system can fall back gracefully.
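
The selection-time fallback described here can be pictured roughly as follows. This is a simplified sketch, not the code from the PR: `KernelCandidate`, `select_kernel`, and the candidate list are hypothetical, and it assumes each of the two dimensions acts as K in one expert GEMM and as N in the other, so each must meet both minimums.

```python
from dataclasses import dataclass
from typing import Callable

MIN_THREAD_K = 128  # quoted in the PR description
MIN_THREAD_N = 64   # quoted in the PR description


@dataclass(frozen=True)
class KernelCandidate:
    name: str
    supports_shape: Callable[[int, int], bool]


def marlin_supports_shape(hidden_dim: int, intermediate_size: int) -> bool:
    # Assumption: each dimension is K in one GEMM and N in the other.
    return all(
        dim % MIN_THREAD_K == 0 and dim % MIN_THREAD_N == 0
        for dim in (hidden_dim, intermediate_size)
    )


def always_supported(hidden_dim: int, intermediate_size: int) -> bool:
    # Placeholder for a kernel without these alignment constraints.
    return True


# Ordered by preference: Marlin first, then a generic fallback.
CANDIDATES = [
    KernelCandidate("marlin", marlin_supports_shape),
    KernelCandidate("fallback", always_supported),
]


def select_kernel(hidden_dim: int, intermediate_size: int,
                  candidates: list[KernelCandidate] = CANDIDATES) -> str:
    """Return the first candidate whose shape check passes."""
    for candidate in candidates:
        if candidate.supports_shape(hidden_dim, intermediate_size):
            return candidate.name
    raise ValueError("no kernel supports this shape")


print(select_kernel(2880, 2880))  # Marlin is skipped
print(select_kernel(4096, 2048))  # Marlin is eligible
```

Checking the shape when the kernel is chosen, rather than catching the RuntimeError at the first request, means an unsupported shape never reaches the CUDA op.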

Fixes #38022

Changed files

  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +37/-0)

Your current environment

  • vLLM version: 0.18.0
  • PyTorch version: 2.10.0+cu128
  • CUDA version: 12.8
  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB)
  • Driver: 580.126.09
  • Python: 3.12.3
  • Transformers: 4.57.3
  • compressed-tensors: 0.13.0
  • OS: Ubuntu (Linux 6.17.0-1009-aws)

Model

openai/gpt-oss-20b — a Mixture-of-Experts model with 32 experts (4 active per token), hidden_size=2880, intermediate_size=2880, head_dim=64.

The model was fine-tuned (LoRA SFT), merged, and quantized to MXFP4 format using llmcompressor with compressed-tensors. The resulting config.json shows quant_method: "compressed-tensors" with format: "mxfp4-pack-quantized", group_size: 32, num_bits: 4.

🐛 Describe the bug

How to reproduce

vllm serve <mxfp4_quantized_model> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

The model loads successfully to GPU, but crashes on the first inference request.

Full traceback

File ".../vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 177, in _fused_marlin_moe
    output = ops.moe_wna16_marlin_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../vllm/_custom_ops.py", line 2467, in moe_wna16_marlin_gemm
    return torch.ops._moe_C.moe_wna16_marlin_gemm(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1, num_threads = -1 for MKN = [32768, 2880, 2880] and num_bits = 4, group_size = 32, has_act_order = 0, is_k_full = 1, has_zp = 0, is_zp_float = 0, max_shared_mem = 101376

The full call chain:

vllm/compilation/cuda_graph.py:251 __call__
→ vllm/compilation/piecewise_backend.py:367 __call__
→ vllm/compilation/compiler_interface.py:445 compiled_graph_wrapper
→ torch._inductor (compiled graph execution)
→ torch.ops.vllm.moe_forward.default
→ vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py:85 _moe_forward
→ vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py:693 forward_impl
→ vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:354 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1581 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1365 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1203 _fused_experts
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:716 apply
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:326 fused_marlin_moe
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:177 _fused_marlin_moe
→ vllm/_custom_ops.py:2467 moe_wna16_marlin_gemm
→ RuntimeError: Invalid thread config

Expected behavior

vLLM should either:

  1. Support dimensions that are not multiples of 128 in the Marlin MoE kernel by adding thread configuration entries for sizes like 2880 (divisible by 32 but not by 128), or
  2. Gracefully fall back to a non-Marlin kernel (e.g., Triton-based fused MoE) when the Marlin kernel cannot handle the given dimensions, rather than crashing with an opaque RuntimeError, or
  3. Fail early at model load time with a clear error message indicating that MXFP4 quantization with Marlin is unsupported for these matrix dimensions, rather than crashing at the first inference request.

Relevant model config

{
  "model_type": "gpt_oss",
  "hidden_size": 2880,
  "intermediate_size": 2880,
  "num_hidden_layers": 24,
  "num_attention_heads": 64,
  "num_key_value_heads": 8,
  "head_dim": 64,
  "num_local_experts": 32,
  "num_experts_per_tok": 4,
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "mxfp4-pack-quantized",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 4,
          "group_size": 32,
          "type": "float",
          "strategy": "group",
          "symmetric": true
        },
        "input_activations": {
          "num_bits": 4,
          "group_size": 32,
          "type": "float",
          "strategy": "group",
          "symmetric": true,
          "dynamic": true
        }
      }
    },
    "ignore": ["lm_head"]
  }
}

Additional context

  • The model loads to GPU without errors — the crash only happens at inference time when the MoE kernel is invoked.
  • Adding --enforce-eager does not resolve the issue since the error originates in the Marlin CUDA kernel itself, not in torch.compile/CUDA graphs.
  • The dimension 2880 is divisible by 32 (the quantization group size), so the quantization itself is valid — only the Marlin kernel's thread tiling is incompatible.
  • This likely affects any MoE model with hidden_size or intermediate_size not aligned to 128 when served with MXFP4/WNA16 Marlin kernels.
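
To see how a dimension can be valid for quantization yet unusable for the kernel, the sketch below classifies a few sizes. It assumes tensor parallelism splits the intermediate size evenly across ranks (an assumption, not something stated in this report), and `classify` is a hypothetical helper.

```python
GROUP_SIZE = 32      # quantization group size from the model config
MIN_THREAD_K = 128   # K-tile minimum quoted in the fix PR


def classify(dim: int) -> str:
    """Describe whether a dimension fits the group size and the K tile."""
    if dim % GROUP_SIZE != 0:
        return "not a multiple of the group size"
    if dim % MIN_THREAD_K != 0:
        return "group-aligned but not 128-aligned"
    return "group-aligned and 128-aligned"


# Assumed per-rank size when 2880 is split across 1, 2, or 4 ranks.
for tp_size in (1, 2, 4):
    per_partition = 2880 // tp_size
    print(tp_size, per_partition, classify(per_partition))
```

Under that assumption, splitting 2880 across ranks does not rescue the alignment: 1440 is still not a multiple of 128, and 720 is no longer even a multiple of the group size.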

extent analysis

Fix Plan

There are two ways to resolve the issue: extend the Marlin MoE kernel so it handles dimensions that are not multiples of its tile sizes, or stop selecting Marlin for shapes it cannot handle. PR #38222 takes the second route. The candidate steps are:

  • Extend the moe_wna16_marlin_gemm kernel with thread configuration entries that cover dimensions like 2880.
  • Update the thread_k, thread_n, and num_threads selection so it no longer returns -1 for such dimensions.
  • Add a fallback to a non-Marlin kernel (e.g., Triton-based fused MoE) when the Marlin kernel cannot handle the given dimensions.

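Example code changes (an illustrative Python sketch, not the merged patch):

# Stand-in for the thread-config selection behind moe_wna16_marlin_gemm.
# The real logic lives in the kernel sources; the tile values below are
# placeholders carried over from this plan, not tuned kernel parameters.
THREAD_CONFIGS = [
    (128, 128, 8),  # (thread_k, thread_n, num_threads) for aligned dimensions
    (32, 32, 4),    # proposed extra entry for dimensions like 2880
]

def calculate_thread_config(M, K, N):
    # Return the first entry whose tiles evenly divide K and N
    for thread_k, thread_n, num_threads in THREAD_CONFIGS:
        if K % thread_k == 0 and N % thread_n == 0:
            return thread_k, thread_n, num_threads
    # No entry fits: mirror the -1 values seen in the error message
    return -1, -1, -1

# In fused_marlin_moe.py: marlin_gemm and fallback_gemm are hypothetical
# callables standing in for the Marlin op and a non-Marlin replacement.
def fused_moe_with_fallback(marlin_gemm, fallback_gemm, *args, **kwargs):
    try:
        # Try the Marlin kernel first
        return marlin_gemm(*args, **kwargs)
    except RuntimeError:
        # Marlin rejected the shape: fall back to the non-Marlin kernel
        return fallback_gemm(*args, **kwargs)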

Verification

To verify the fix, run the following command:

vllm serve <mxfp4_quantized_model> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

The model should load successfully and complete inference requests without crashing.
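
Because the crash only appeared on the first inference request, verification needs at least one request after the server is up. The sketch below builds such a request with the standard library, assuming the server listens on the default OpenAI-compatible endpoint at http://localhost:8000; `build_completion_request` and `smoke_test` are hypothetical helpers, and the model name is the same placeholder as in the command above.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str,
                             prompt: str) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 16}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, model: str) -> str:
    """Send one short completion request and return the generated text."""
    request = build_completion_request(base_url, model, "Hello")
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Building the request needs no running server; sending it does.
request = build_completion_request(
    "http://localhost:8000", "<mxfp4_quantized_model>", "Hello"
)
print(request.full_url)
```

With the server running, calling `smoke_test("http://localhost:8000", "<mxfp4_quantized_model>")` should return text instead of surfacing the RuntimeError in the server log.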

Extra Tips

  • Make sure to test the modified kernel with different input sizes and dimensions to ensure it works correctly.
  • Consider adding a warning or error message when the Marlin kernel is not supported for a given model configuration.
  • If you encounter issues with the fallback mechanism, try using a different non-Marlin kernel or adjusting the thread configuration calculations.
