vllm - [Bug]: [XPU] compressed-tensors WNA16 MoE selector ignores XPU platform, crashes on Marlin path


Your current environment

  • vLLM version: `0.21.1rc1.dev315+g0b68f21e7` (built from current `main` via `docker/Dockerfile.xpu`)
  • Platform: `device_config=xpu`
  • Hardware: Intel Arc Pro B60 (Battlemage, BMG)
  • OS: Ubuntu 24.04 host, container based on `intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04`
  • PyTorch: `torch-xpu` from `https://download.pytorch.org/whl/xpu`
  • Triton: `triton-xpu==3.7.0`

🐛 Describe the bug

On XPU, loading a compressed-tensors W4A16 MoE model crashes during process_weights_after_loading with:

AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'

The cause is the backend selector in compressed_tensors_moe.py, which only excludes ROCm from the Marlin path:

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe.py

# Prefer to use the MarlinMoE kernel when it is supported.
if (
    not check_moe_marlin_supports_layer(layer, group_size)
    or current_platform.is_rocm()
):
    # ... CompressedTensorsWNA16MoEMethod (Triton, works on XPU)
else:
    # ... CompressedTensorsWNA16MarlinMoEMethod (CUDA-only, crashes on XPU)
    return CompressedTensorsWNA16MarlinMoEMethod(...)

XPU has the same problem as ROCm: the Marlin custom ops are not registered (the entire torch.ops._C namespace is effectively empty on the XPU build, since vllm._C isn't built for XPU). Because XPU isn't in the bypass condition, the selector falls through to the Marlin path and crashes.
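
One way to confirm the missing op on a given build is to probe the namespace before anything tries to call it. The sketch below is illustrative only: op_is_registered is a hypothetical helper, and the two SimpleNamespace objects are stand-ins for a build that registered the Marlin repack op and one that did not.

```python
from types import SimpleNamespace


def op_is_registered(namespace: object, op_name: str) -> bool:
    """Return True if `namespace` exposes an attribute named `op_name`.

    A missing op surfaces as AttributeError on attribute access (the error
    seen in the traceback), and hasattr() turns that into False.
    """
    return hasattr(namespace, op_name)


# Stand-ins for illustration only (not real torch namespaces).
cuda_like = SimpleNamespace(gptq_marlin_repack=lambda *args: None)
xpu_like = SimpleNamespace()

print(op_is_registered(cuda_like, "gptq_marlin_repack"))  # True
print(op_is_registered(xpu_like, "gptq_marlin_repack"))   # False
```

On a real install the same helper could presumably be pointed at torch.ops._C once vLLM has loaded its extensions; that usage is an assumption and was not run here.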

Same-format dense W4A16 models work fine on XPU because they go through a different code path (Using XPUwNa16LinearKernel for CompressedTensorsWNA16), so the bug only affects compressed-tensors W4A16 MoE specifically.

Relationship to #41426

PR #41426 adds a native XPU INT4 MoE path for INC/GPTQ-quantized models via vllm-xpu-kernels. That work targets a different selector (override_quantization_method + INC schemes) and doesn't address compressed-tensors. The fix proposed here is a small, scoped change to the compressed-tensors selector that unblocks compressed-tensors-formatted W4A16 MoE checkpoints on XPU using the existing Triton path (which already works on XPU — same path used by other XPU MoE flows).

The two changes are independent and complementary.

Reproduction

Run any compressed-tensors W4A16 MoE model on a vLLM build targeting XPU (VLLM_TARGET_DEVICE=xpu). Confirmed with:

  • dhruvil237/gemma-4-26B-A4B-it-W4A16
  • (Expected to reproduce with any other *-quantized.w4a16 MoE model, e.g. RedHatAI's MoE w4a16 quants)

Minimal CLI:

vllm serve dhruvil237/gemma-4-26B-A4B-it-W4A16 \
    --trust-remote-code \
    --dtype bfloat16 \
    --max-model-len 16384

Engine log

INFO  compressed_tensors_wNa16.py:112] Using XPUwNa16LinearKernel for CompressedTensorsWNA16
INFO  xpu.py:78] Using Triton backend.
INFO  compressed_tensors_moe.py:122] Using CompressedTensorsWNA16MarlinMoEMethod
INFO  compressed_tensors_moe_wna16_marlin.py:88] Using Marlin backend for WNA16 MoE (group_size=64, num_bits=4)
...
INFO  default_loader.py:397] Loading weights took 18.09 seconds
ERROR core.py:1165] EngineCore failed to start.
ERROR core.py:1165] Traceback (most recent call last):
ERROR core.py:1165]   File ".../v1/engine/core.py", line 1139, in run_engine_core
ERROR core.py:1165]     engine_core = EngineCoreProc(*args, ...)
...
ERROR core.py:1165]   File ".../compressed_tensors_moe_wna16_marlin.py", line 403, in process_weights_after_loading
ERROR core.py:1165]     marlin_w13_qweight = ops.gptq_marlin_moe_repack(
ERROR core.py:1165]   File ".../vllm/_custom_ops.py", line 1241, in gptq_marlin_moe_repack
ERROR core.py:1165]     output[e] = torch.ops._C.gptq_marlin_repack(...)
ERROR core.py:1165]   File ".../torch/_ops.py", line 1379, in __getattr__
ERROR core.py:1165]     raise AttributeError(...)
ERROR core.py:1165] AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_marlin_repack'

Suggested fix

Add is_xpu() to the existing platform bypass, matching the existing ROCm treatment:

                 if (
                     not check_moe_marlin_supports_layer(layer, group_size)
                     or current_platform.is_rocm()
+                    or current_platform.is_xpu()
                 ):
                     from .compressed_tensors_moe_wna16 import (
                         CompressedTensorsWNA16MoEMethod,
                     )
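
The effect of the extra condition can be sketched in isolation. This is a simplified model of the selector, not the vLLM source: FakePlatform is a hypothetical stand-in for current_platform, the marlin_supports_layer flag stands in for the result of check_moe_marlin_supports_layer, and the function returns class names as strings instead of constructing the methods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakePlatform:
    """Hypothetical stand-in for vLLM's current_platform."""

    name: str

    def is_rocm(self) -> bool:
        return self.name == "rocm"

    def is_xpu(self) -> bool:
        return self.name == "xpu"


def select_wna16_moe_method(platform: FakePlatform, marlin_supports_layer: bool) -> str:
    """Mirror the patched condition and return the chosen method's class name."""
    if (
        not marlin_supports_layer
        or platform.is_rocm()
        or platform.is_xpu()  # the proposed addition
    ):
        return "CompressedTensorsWNA16MoEMethod"  # Triton path
    return "CompressedTensorsWNA16MarlinMoEMethod"  # Marlin path


for name in ("cuda", "rocm", "xpu"):
    print(name, select_wna16_moe_method(FakePlatform(name), marlin_supports_layer=True))
```

With the is_xpu() check in place, only the "cuda" case still selects the Marlin method when the layer is Marlin-compatible; "rocm" and "xpu" both take the Triton method.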

Verification of the fix

I applied this change locally and the same model loads, compiles via torch.compile, runs warmup, and serves coherent generations end-to-end on B60. After the patch the selector log line becomes:

INFO  compressed_tensors_moe.py:114] Using CompressedTensorsWNA16MoEMethod

The Triton-based CompressedTensorsWNA16MoEMethod in compressed_tensors_moe_wna16.py does not reference torch.ops._C or any other CUDA-only custom op, so it has no platform-specific dependencies beyond what the Triton-XPU runtime already provides. That runtime is already known-good: the same path is used by other XPU MoE flows on main.
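
A rough way to check the "no torch.ops._C references" claim for a file is to scan its source text. The helper below is a hypothetical sketch, and it only catches direct torch.ops._C.<name> references; indirect calls through wrappers such as ops.gptq_marlin_moe_repack would need a separate check.

```python
import re

# Matches direct references such as torch.ops._C.gptq_marlin_repack
CUSTOM_OP_PATTERN = re.compile(r"torch\.ops\._C\.(\w+)")


def find_direct_custom_ops(source: str) -> list[str]:
    """Return the sorted, de-duplicated op names referenced via torch.ops._C."""
    return sorted(set(CUSTOM_OP_PATTERN.findall(source)))


# Sample strings for illustration; the first mirrors the line in the traceback.
marlin_like = "output[e] = torch.ops._C.gptq_marlin_repack(w, perm)\n"
triton_like = "y = fused_experts(x)\n"

print(find_direct_custom_ops(marlin_like))  # ['gptq_marlin_repack']
print(find_direct_custom_ops(triton_like))  # []
```

To apply it to a real file, read the module's text (for example with pathlib.Path.read_text) and pass it in.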

A simple smoke test on the patched setup:

$ curl -s .../v1/chat/completions -d '{"model":"gemma-4-26B-A4B",
    "messages":[{"role":"user","content":"Explain why the sky is blue, in 3 paragraphs."}],
    "max_tokens": 400}' | jq -r '.choices[0].message.content'

…returns a coherent, on-topic explanation (Rayleigh scattering, wavelengths, sunset effect), confirming the routed-experts MoE path is functionally correct via the Triton method on XPU.
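
The same smoke test can be sketched in Python with only the standard library. The server address is elided in the curl command above, so base_url is a parameter here, and the http://localhost:8000 value in the usage note is an assumption. build_chat_request, first_message, and run_smoke_test are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 400
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_message(response_body: bytes) -> str:
    """Extract the first choice's message content (same field as the jq filter)."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def run_smoke_test(base_url: str) -> str:
    """Send the prompt from the curl example and return the generated text."""
    request = build_chat_request(
        base_url,
        "gemma-4-26B-A4B",
        "Explain why the sky is blue, in 3 paragraphs.",
    )
    with urllib.request.urlopen(request) as response:
        return first_message(response.read())
```

Calling print(run_smoke_test("http://localhost:8000")) against a running server should print the generation, assuming the server listens at that address.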

Note:

This issue was created with the help of Claude. If it's irrelevant or not helpful, feel free to close it, but fwiw the fix DOES work in my local environment and Claude seemed to think it was worth submitting.
