vllm - [Bug]: [XPU] compressed-tensors WNA16 MoE selector ignores XPU platform, crashes on Marlin path

Your current environment

<details>
<summary>Environment</summary>

- vLLM version: `0.21.1rc1.dev315+g0b68f21e7` (built from current `main` via `docker/Dockerfile.xpu`)
- Platform: `device_config=xpu`
- Hardware: Intel Arc Pro B60 (Battlemage, BMG)
- OS: Ubuntu 24.04 host, container based on `intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04`
- PyTorch: `torch-xpu` from `https://download.pytorch.org/whl/xpu`
- Triton: `triton-xpu==3.7.0`

</details>

🐛 Describe the bug

On XPU, loading a compressed-tensors W4A16 MoE model crashes during process_weights_after_loading with:

AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'

The cause is the backend selector in compressed_tensors_moe.py, which only excludes ROCm from the Marlin path:

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe.py

# Prefer to use the MarlinMoE kernel when it is supported.
if (
    not check_moe_marlin_supports_layer(layer, group_size)
    or current_platform.is_rocm()
):
    # ... CompressedTensorsWNA16MoEMethod (Triton, works on XPU)
else:
    # ... CompressedTensorsWNA16MarlinMoEMethod (CUDA-only, crashes on XPU)
    return CompressedTensorsWNA16MarlinMoEMethod(...)

XPU has the same problem as ROCm: the Marlin custom ops are not registered. On the XPU build the entire torch.ops._C namespace is effectively empty, since vllm._C isn't built for XPU. Because XPU isn't in the bypass condition, the selector falls through to Marlin and crashes.
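Whether an op is registered can be probed directly. This is a minimal sketch, assuming only that looking up a missing op raises AttributeError (as the error above shows for torch.ops._C). The helper name op_is_registered and the stand-in namespace are hypothetical and not part of vLLM:

```python
from types import SimpleNamespace


def op_is_registered(namespace, op_name: str) -> bool:
    """Return True if `namespace` exposes `op_name`.

    hasattr() returns False when the attribute lookup raises AttributeError,
    which is how a missing custom op surfaces in the traceback.
    """
    return hasattr(namespace, op_name)


# Hypothetical stand-in for an op namespace with no Marlin ops registered.
empty_ops = SimpleNamespace()
print(op_is_registered(empty_ops, "gptq_marlin_repack"))  # False

# With PyTorch installed, the same check against the real namespace would be:
#   import torch
#   op_is_registered(torch.ops._C, "gptq_marlin_repack")
```

A check like this only diagnoses the condition; the selector change proposed in this issue is what avoids the Marlin path on XPU.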

Dense W4A16 models in the same format work fine on XPU because they go through a different code path (the log shows "Using XPUwNa16LinearKernel for CompressedTensorsWNA16"), so the bug affects only compressed-tensors W4A16 MoE models.

Relationship to #41426

PR #41426 adds a native XPU INT4 MoE path for INC/GPTQ-quantized models via vllm-xpu-kernels. That work targets a different selector (override_quantization_method + INC schemes) and doesn't address compressed-tensors. The fix proposed here is a small, scoped change to the compressed-tensors selector. It unblocks compressed-tensors-formatted W4A16 MoE checkpoints on XPU using the existing Triton path, which already works on XPU (the same path is used by other XPU MoE flows).

The two changes are independent and complementary.

Reproduction

Run any compressed-tensors W4A16 MoE model on a vLLM build with VLLM_TARGET_DEVICE=xpu. Confirmed with:

dhruvil237/gemma-4-26B-A4B-it-W4A16
(Expected to reproduce with any other *-quantized.w4a16 MoE model, e.g. RedHatAI's MoE w4a16 quants)

Minimal CLI:

vllm serve dhruvil237/gemma-4-26B-A4B-it-W4A16 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 16384

Engine log

INFO  compressed_tensors_wNa16.py:112] Using XPUwNa16LinearKernel for CompressedTensorsWNA16
INFO  xpu.py:78] Using Triton backend.
INFO  compressed_tensors_moe.py:122] Using CompressedTensorsWNA16MarlinMoEMethod
INFO  compressed_tensors_moe_wna16_marlin.py:88] Using Marlin backend for WNA16 MoE (group_size=64, num_bits=4)
...
INFO  default_loader.py:397] Loading weights took 18.09 seconds
ERROR core.py:1165] EngineCore failed to start.
ERROR core.py:1165] Traceback (most recent call last):
ERROR core.py:1165]   File ".../v1/engine/core.py", line 1139, in run_engine_core
ERROR core.py:1165]     engine_core = EngineCoreProc(*args, ...)
...
ERROR core.py:1165]   File ".../compressed_tensors_moe_wna16_marlin.py", line 403,
                       in process_weights_after_loading
ERROR core.py:1165]     marlin_w13_qweight = ops.gptq_marlin_moe_repack(
ERROR core.py:1165]   File ".../vllm/_custom_ops.py", line 1241, in gptq_marlin_moe_repack
ERROR core.py:1165]     output[e] = torch.ops._C.gptq_marlin_repack(...)
ERROR core.py:1165]   File ".../torch/_ops.py", line 1379, in __getattr__
ERROR core.py:1165]     raise AttributeError(...)
ERROR core.py:1165] AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'

Suggested fix

Add is_xpu() to the existing platform bypass, matching the existing ROCm treatment:

                 if (
                     not check_moe_marlin_supports_layer(layer, group_size)
                     or current_platform.is_rocm()
+                    or current_platform.is_xpu()
                 ):
                     from .compressed_tensors_moe_wna16 import (
                         CompressedTensorsWNA16MoEMethod,
                     )

Verification of the fix

I applied this change locally and the same model loads, compiles via torch.compile, runs warmup, and serves coherent generations end-to-end on B60. After the patch the selector log line becomes:

INFO  compressed_tensors_moe.py:114] Using CompressedTensorsWNA16MoEMethod
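To confirm which method was selected without reading the whole engine log, the selector line can be matched mechanically. This is a small sketch based only on the two log messages quoted in this issue; the helper name is hypothetical:

```python
import re
from typing import Iterable, Optional

# Matches the selector log line for either MoE method quoted in this issue.
_SELECTOR_RE = re.compile(r"Using (CompressedTensorsWNA16(?:Marlin)?MoEMethod)\b")


def selected_moe_method(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first MoE method class named in the log, or None."""
    for line in log_lines:
        match = _SELECTOR_RE.search(line)
        if match:
            return match.group(1)
    return None


patched_log = [
    "INFO  xpu.py:78] Using Triton backend.",
    "INFO  compressed_tensors_moe.py:114] Using CompressedTensorsWNA16MoEMethod",
]
print(selected_moe_method(patched_log))  # CompressedTensorsWNA16MoEMethod
```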

The Triton-based CompressedTensorsWNA16MoEMethod in compressed_tensors_moe_wna16.py does not reference torch.ops._C or any other CUDA-only custom op, so it has no platform-specific dependencies beyond what the Triton-XPU runtime already provides. That runtime is already known-good: the same path is used by other XPU MoE flows on main.

A simple smoke test on the patched setup:

$ curl -s .../v1/chat/completions -d '{"model":"gemma-4-26B-A4B",
    "messages":[{"role":"user","content":"Explain why the sky is blue, in 3 paragraphs."}],
    "max_tokens": 400}' | jq -r '.choices[0].message.content'

…returns a coherent, on-topic explanation (Rayleigh scattering, wavelengths, sunset effect), confirming the routed-experts MoE path is functionally correct via the Triton method on XPU.
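The same smoke test can be scripted with the Python standard library. This is a sketch only. The helper names are hypothetical, the server address is left as a placeholder because the curl command elides it, and the explicit Content-Type header is an assumption that the curl command does not show:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 400) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumed header
        method="POST",
    )


def extract_content(response: dict) -> str:
    """Equivalent of the jq filter `.choices[0].message.content`."""
    return response["choices"][0]["message"]["content"]


# Usage (supply the address of your own server as base_url):
#   req = build_chat_request(base_url, "gemma-4-26B-A4B",
#                            "Explain why the sky is blue, in 3 paragraphs.")
#   with urllib.request.urlopen(req) as resp:
#       print(extract_content(json.load(resp)))
```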

Note:

This issue was created with the help of Claude. If it's irrelevant or not helpful, feel free to close it, but fwiw the fix DOES work in my local environment and Claude seemed to think it was worth submitting.
