vllm - ✅(Solved) Fix [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880) [1 pull request, 1 participant]

vllm · 2026-03-24 17:04:38

GitHub stats

vllm-project/vllm#38022 • Fetched 2026-04-08 01:27:02

Author: zweack
Participants: zweack
Timeline (top): referenced ×2, cross-referenced ×1, labeled ×1

When serving an MXFP4-quantized GPT-OSS 20B model (MoE architecture), vLLM crashes at inference time with a RuntimeError from the Marlin MoE GEMM kernel. The kernel fails to compute a valid thread configuration for the model's matrix dimensions (M=32768, K=2880, N=2880).

The root cause appears to be that the Marlin MoE kernel (moe_wna16_marlin_gemm) requires matrix K and N dimensions to be aligned to certain boundaries (e.g., multiples of 128). GPT-OSS 20B has hidden_size=2880 and intermediate_size=2880, and 2880 is not evenly divisible by 128 (2880 / 128 = 22.5), which causes the kernel's thread configuration lookup to fail and return -1 for thread_k, thread_n, and num_threads.

Error Message

File ".../vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 177, in _fused_marlin_moe
    output = ops.moe_wna16_marlin_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../vllm/_custom_ops.py", line 2467, in moe_wna16_marlin_gemm
    return torch.ops._moe_C.moe_wna16_marlin_gemm(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1, num_threads = -1 for MKN = [32768, 2880, 2880] and num_bits = 4, group_size = 32, has_act_order = 0, is_k_full = 1, has_zp = 0, is_zp_float = 0, max_shared_mem = 101376

Workaround

Currently the only workaround is to avoid MXFP4 quantization and use FP8 dynamic quantization instead, which uses different kernels without the alignment constraint. However, MXFP4 is the preferred quantization format for this model.

PR fix notes

PR #38222: [Bugfix] Add dimension alignment check to Marlin MoE kernel selection

Repository: vllm-project/vllm
Author: he-yufeng
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38222

Description (problem / solution / changelog)

Motivation

When serving an MXFP4-quantized MoE model with non-128-aligned dimensions (e.g. GPT-OSS 20B with hidden_size=2880, intermediate_size=2880), the Marlin MoE kernel crashes at inference time with:

RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1,
num_threads = -1 for MKN = [32768, 2880, 2880]

The Marlin GEMM kernel requires K to be divisible by MIN_THREAD_K (128) and N by MIN_THREAD_N (64), but 2880 % 128 = 64. The non-MoE Marlin path already has verify_marlin_supports_shape() that catches this at load time, but the MoE path (MarlinExpertsBase) had no equivalent check, so the oracle selected Marlin even though it can't handle the dimensions.
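The arithmetic behind the failure can be checked in a few lines. The sketch below is illustrative only: `MIN_THREAD_K` and `MIN_THREAD_N` take the values quoted above, and `marlin_shape_ok` is a hypothetical helper, not vLLM's `verify_marlin_supports_shape()`.

```python
# Alignment values as quoted in the PR description; the helper is hypothetical.
MIN_THREAD_K = 128
MIN_THREAD_N = 64


def marlin_shape_ok(k: int, n: int) -> bool:
    """Return True when K and N meet the quoted alignment requirements."""
    return k % MIN_THREAD_K == 0 and n % MIN_THREAD_N == 0


k = n = 2880  # hidden_size and intermediate_size of GPT-OSS 20B

print(k % MIN_THREAD_K)             # 64, so K is not 128-aligned
print(n % MIN_THREAD_N)             # 0, so N alone would pass
print(marlin_shape_ok(k, n))        # False
print(marlin_shape_ok(3072, 3072))  # True (3072 = 24 * 128)
```

Only the K requirement fails for 2880, which is consistent with the kernel loading the weights but finding no usable thread configuration.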

Modifications

Override _supports_shape() and is_supported_config() in MarlinExpertsBase to validate both hidden_dim and intermediate_size_per_partition against the kernel's alignment requirements. When Marlin can't handle the dimensions, the oracle skips it and falls back to a compatible kernel (e.g. FlashInfer TRTLLM on Blackwell) instead of crashing.

The check mirrors the existing verify_marlin_supports_shape() logic in marlin_utils.py but operates at the oracle level so the system can fall back gracefully.
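A rough sketch of what selection-time fallback could look like follows. It is not the PR's code: `KernelCandidate`, `select_kernel`, and the `"fallback"` backend are hypothetical names, and checking both dimensions against 128 is an assumption based on each dimension acting as K in one of the two expert GEMMs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class KernelCandidate:
    """Hypothetical stand-in for one MoE backend the oracle can choose."""
    name: str
    supports_shape: Callable[[int, int], bool]


def marlin_supports(hidden_dim: int, intermediate_size: int) -> bool:
    # Assumption: each dimension serves as K in one of the two expert GEMMs,
    # so both are checked against the stricter 128 alignment.
    return hidden_dim % 128 == 0 and intermediate_size % 128 == 0


def select_kernel(candidates: Sequence[KernelCandidate],
                  hidden_dim: int,
                  intermediate_size: int) -> KernelCandidate:
    """Return the first candidate whose shape check passes."""
    for candidate in candidates:
        if candidate.supports_shape(hidden_dim, intermediate_size):
            return candidate
    # Nothing fits: fail at selection time instead of at the first request.
    raise ValueError(
        f"no MoE kernel supports hidden_dim={hidden_dim}, "
        f"intermediate_size={intermediate_size}"
    )


CANDIDATES = [
    KernelCandidate("marlin", marlin_supports),
    KernelCandidate("fallback", lambda h, i: True),  # placeholder backend
]

print(select_kernel(CANDIDATES, 2880, 2880).name)  # fallback
print(select_kernel(CANDIDATES, 4096, 4096).name)  # marlin
```

Because the predicate runs before any weights are repacked for a specific kernel, a rejected candidate costs nothing at inference time.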

Fixes #38022

Changed files

vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +37/-0)

Your current environment

vLLM version: 0.18.0
PyTorch version: 2.10.0+cu128
CUDA version: 12.8
GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB)
Driver: 580.126.09
Python: 3.12.3
Transformers: 4.57.3
compressed-tensors: 0.13.0
OS: Ubuntu (Linux 6.17.0-1009-aws)

Model

openai/gpt-oss-20b — a Mixture-of-Experts model with 32 experts (4 active per token), hidden_size=2880, intermediate_size=2880, head_dim=64.

The model was fine-tuned (LoRA SFT), merged, and quantized to MXFP4 format using llmcompressor with compressed-tensors. The resulting config.json shows quant_method: "compressed-tensors" with format: "mxfp4-pack-quantized", group_size: 32, num_bits: 4.

🐛 Describe the bug

How to reproduce

vllm serve <mxfp4_quantized_model> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

The model loads successfully to GPU, but crashes on the first inference request.

Description

Full traceback

File ".../vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 177, in _fused_marlin_moe
    output = ops.moe_wna16_marlin_gemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".../vllm/_custom_ops.py", line 2467, in moe_wna16_marlin_gemm
    return torch.ops._moe_C.moe_wna16_marlin_gemm(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1, num_threads = -1 for MKN = [32768, 2880, 2880] and num_bits = 4, group_size = 32, has_act_order = 0, is_k_full = 1, has_zp = 0, is_zp_float = 0, max_shared_mem = 101376
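When triaging similar reports, it can help to pull the numeric fields out of the error text. The sketch below assumes the message keeps the `name = value` layout shown above; `parse_thread_config_error` is a hypothetical helper, not part of vLLM.

```python
import re

ERROR = (
    "Invalid thread config: thread_m_blocks = 4, thread_k = -1, thread_n = -1, "
    "num_threads = -1 for MKN = [32768, 2880, 2880] and num_bits = 4, "
    "group_size = 32, has_act_order = 0, is_k_full = 1, has_zp = 0, "
    "is_zp_float = 0, max_shared_mem = 101376"
)


def parse_thread_config_error(message: str) -> dict:
    """Extract the integer fields and the M, K, N triple from the message."""
    # Scalar fields such as "thread_k = -1"; "MKN = [" is skipped here
    # because "[" is not a digit.
    fields = {name: int(value)
              for name, value in re.findall(r"(\w+) = (-?\d+)", message)}
    mkn = re.search(r"MKN = \[(\d+), (\d+), (\d+)\]", message)
    if mkn:
        fields["M"], fields["K"], fields["N"] = (int(g) for g in mkn.groups())
    return fields


info = parse_thread_config_error(ERROR)
print(info["K"], info["N"])  # 2880 2880
print(info["thread_k"])      # -1
```

A value of -1 for `thread_k`, `thread_n`, and `num_threads` indicates that no thread configuration was found for the reported K and N.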

The full call chain:

vllm/compilation/cuda_graph.py:251 __call__
→ vllm/compilation/piecewise_backend.py:367 __call__
→ vllm/compilation/compiler_interface.py:445 compiled_graph_wrapper
→ torch._inductor (compiled graph execution)
→ torch.ops.vllm.moe_forward.default
→ vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py:85 _moe_forward
→ vllm/model_executor/layers/fused_moe/runner/default_moe_runner.py:693 forward_impl
→ vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py:354 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1581 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1365 apply
→ vllm/model_executor/layers/fused_moe/modular_kernel.py:1203 _fused_experts
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:716 apply
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:326 fused_marlin_moe
→ vllm/model_executor/layers/fused_moe/fused_marlin_moe.py:177 _fused_marlin_moe
→ vllm/_custom_ops.py:2467 moe_wna16_marlin_gemm
→ RuntimeError: Invalid thread config

Expected behavior

vLLM should either:

1. Support non-power-of-2-aligned dimensions in the Marlin MoE kernel: add thread configuration entries for dimensions like 2880 (divisible by 32 but not 128), or
2. Gracefully fall back to a non-Marlin kernel (e.g., Triton-based fused MoE) when the Marlin kernel cannot handle the given dimensions, rather than crashing with an opaque RuntimeError, or
3. Fail early at model load time with a clear error message indicating that MXFP4 quantization with Marlin is unsupported for these matrix dimensions, rather than crashing at the first inference request.

Relevant model config

{
  "model_type": "gpt_oss",
  "hidden_size": 2880,
  "intermediate_size": 2880,
  "num_hidden_layers": 24,
  "num_attention_heads": 64,
  "num_key_value_heads": 8,
  "head_dim": 64,
  "num_local_experts": 32,
  "num_experts_per_tok": 4,
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "mxfp4-pack-quantized",
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {
          "num_bits": 4,
          "group_size": 32,
          "type": "float",
          "strategy": "group",
          "symmetric": true
        },
        "input_activations": {
          "num_bits": 4,
          "group_size": 32,
          "type": "float",
          "strategy": "group",
          "symmetric": true,
          "dynamic": true
        }
      }
    },
    "ignore": ["lm_head"]
  }
}

Workaround

Additional context

The model loads to GPU without errors — the crash only happens at inference time when the MoE kernel is invoked.
Adding --enforce-eager does not resolve the issue since the error originates in the Marlin CUDA kernel itself, not in torch.compile/CUDA graphs.
The dimension 2880 is divisible by 32 (the quantization group size), so the quantization itself is valid — only the Marlin kernel's thread tiling is incompatible.
This likely affects any MoE model with hidden_size or intermediate_size not aligned to 128 when served with MXFP4/WNA16 Marlin kernels.
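To see whether another checkpoint is likely to hit the same problem, the relevant fields of its config.json can be checked before serving. This is a minimal sketch: `unaligned_dims` is a hypothetical helper, the 128 alignment comes from the discussion above, and the even split of `intermediate_size` across tensor-parallel ranks is an assumption.

```python
import json

# Trimmed copy of the fields from the model config in this report.
CONFIG_JSON = """
{
  "hidden_size": 2880,
  "intermediate_size": 2880,
  "num_local_experts": 32
}
"""


def unaligned_dims(config: dict, alignment: int = 128, tp_size: int = 1) -> dict:
    """Map each checked dimension to its nonzero remainder modulo `alignment`."""
    dims = {
        "hidden_size": config["hidden_size"],
        # Assumption: tensor parallelism splits intermediate_size evenly.
        "intermediate_size_per_partition": config["intermediate_size"] // tp_size,
    }
    return {name: value % alignment
            for name, value in dims.items()
            if value % alignment != 0}


config = json.loads(CONFIG_JSON)
print(unaligned_dims(config))
# {'hidden_size': 64, 'intermediate_size_per_partition': 64}
print(unaligned_dims({"hidden_size": 4096, "intermediate_size": 14336}))
# {}
```

An empty result means both dimensions are 128-aligned; it does not by itself guarantee that a given kernel supports the model.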

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

One way to resolve the issue is to extend the Marlin MoE kernel to dimensions that are not 128-aligned and to add a fallback for shapes it still cannot handle. PR #38222 implements only the selection-time shape check and fallback; it does not change the kernel. The steps for the broader approach:

Modify the moe_wna16_marlin_gemm kernel to add thread configuration entries for dimensions like 2880.
Update the thread_k, thread_n, and num_threads calculations to handle non-aligned dimensions.
Add a fallback mechanism to use a non-Marlin kernel (e.g., Triton-based fused MoE) when the Marlin kernel cannot handle the given dimensions.

Example code changes (illustrative Python, not the actual CUDA source):

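# In the moe_wna16_marlin_gemm thread-config selection (illustrative Python;
# the real logic is CUDA/C++ and the tile values below are placeholders)
def calculate_thread_config(M, K, N):
    # Candidate (thread_k, thread_n, num_threads) tiles, tried in order.
    # Supporting 2880 means adding a tile the kernel can actually run,
    # not hard-coding one model's dimensions. M is unused in this sketch.
    candidates = [(128, 128, 256), (64, 128, 128), (128, 64, 128)]
    for thread_k, thread_n, num_threads in candidates:
        if K % thread_k == 0 and N % thread_n == 0:
            return thread_k, thread_n, num_threads
    # No tile divides both K and N: report -1, as in the error message
    return -1, -1, -1

# In fused_marlin_moe.py (sketch: both GEMM callables are passed in because
# the name of a suitable fallback op is not confirmed here)
def _fused_marlin_moe(marlin_gemm, fallback_gemm, *args, **kwargs):
    try:
        # Try to use the Marlin kernel
        return marlin_gemm(*args, **kwargs)
    except RuntimeError:
        # Last resort: fall back to a non-Marlin kernel at call time.
        # PR #38222 checks the shape during kernel selection instead.
        return fallback_gemm(*args, **kwargs)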

Verification

To verify the fix, run the following command:

vllm serve <mxfp4_quantized_model> \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

The model should load successfully and complete inference requests without crashing.
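Since the crash only appears on the first inference request, verification needs at least one request after startup. The sketch below builds a completion request with the standard library; it assumes the server listens on vLLM's default `http://localhost:8000` with the OpenAI-compatible `/v1/completions` route, and that the served model name matches the placeholder passed to `vllm serve`. Both helper functions are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 16}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Assumed host, port, and model name; adjust to the actual deployment.
request = build_completion_request(
    "http://localhost:8000", "<mxfp4_quantized_model>", "Hello"
)
print(request.full_url)
# With the server running: print(send(request)["choices"][0]["text"])
```

A successful JSON response here, rather than the RuntimeError shown earlier, indicates that a kernel able to handle K=2880 and N=2880 was selected.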

Extra Tips

Make sure to test the modified kernel with different input sizes and dimensions to ensure it works correctly.
Consider adding a warning or error message when the Marlin kernel is not supported for a given model configuration.
If you encounter issues with the fallback mechanism, try using a different non-Marlin kernel or adjusting the thread configuration calculations.
