vllm - 💡(How to fix) Fix [Bug] Regression: GPTQ models fail to load on Intel XPU in v0.19.0 (missing XPU branches in gptq.py) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39474Fetched 2026-04-11 06:13:24
View on GitHub
Comments
0
Participants
1
Timeline
0
Reactions
0
Author
Participants

Error Message

AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_shuffle'

Root Cause

Comparing vllm/model_executor/layers/quantization/gptq.py between v0.17.x (intel/vllm:latest docker image) and v0.19.0:

Fix Action

Fix / Workaround

Note: --quantization gptq_marlin does not work as a workaround — v0.19.0's ModelConfig strictly rejects a mismatch between the model's quantization_config.quant_method (gptq) and the CLI-passed --quantization argument (gptq_marlin):

We have a working patch at https://github.com/bryanvine/vllm-xpu/commit/33aacbcb2 (branch xpu-build-0.19.0).

With the patch applied, vLLM 0.19.0 on Intel Arc Pro B70 successfully:

  • Loads Qwen3-30B-A3B-Instruct-2507-gptq-4bit (MoE, 18GB weights)
  • Runs EAGLE3 speculative decoding with draft model
  • Serves chat completions (verified end-to-end)

Code Example

vLLM 0.19.0 (built from v0.19.0 tag via docker/Dockerfile.xpu)
PyTorch 2.10.0+xpu
triton-xpu 3.6.0
vllm-xpu-kernels 0.1.5
Intel Arc Pro B70 (Battlemage G31, 32GB VRAM, PCI 8086:e223)
Ubuntu 25.10 host, container based on intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04

---

AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_shuffle'

---

git clone https://github.com/vllm-project/vllm.git && cd vllm && git checkout v0.19.0
   docker build -f docker/Dockerfile.xpu -t vllm-xpu:0.19.0 .

---

docker run --rm -it --device /dev/dri:/dev/dri --group-add render --group-add video \
     -v /path/to/models:/models:ro \
     vllm-xpu:0.19.0 \
     vllm serve /models/Qwen3-30B-A3B-Instruct-2507-gptq-4bit \
       --dtype auto --trust-remote-code --gpu-memory-utilization 0.9

---

(EngineCore pid=49) ERROR [core.py:1108] EngineCore failed to start.
(EngineCore pid=49) ERROR [core.py:1108] Traceback (most recent call last):
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=49) ERROR [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=49) ERROR [core.py:1108]     super().__init__(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=49) ERROR [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=49) ERROR [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     self.model_runner.load_model(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     self.model = model_loader.load_model(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/model_loader/base_loader.py", line 81, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=49) ERROR [core.py:1108]     quant_method.process_weights_after_loading(module)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/layers/quantization/gptq.py", line 368, in process_weights_after_loading
(EngineCore pid=49) ERROR [core.py:1108]     ops.gptq_shuffle(layer.qweight, layer.g_idx, self.quant_config.weight_bits)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/_custom_ops.py", line 685, in gptq_shuffle
(EngineCore pid=49) ERROR [core.py:1108]     torch.ops._C.gptq_shuffle(q_weight, q_perm, bit)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../torch/_ops.py", line 1319, in __getattr__
(EngineCore pid=49) ERROR [core.py:1108]     raise AttributeError(...)
(EngineCore pid=49) ERROR [core.py:1108] AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_shuffle'

---

WARNING [interface.py:229] Failed to import from vllm._C: ModuleNotFoundError("No module named 'vllm._C'")
WARNING [gptq.py:99] Currently, the 4-bit gptq_gemm kernel for GPTQ is buggy. Please switch to gptq_marlin.

---

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
  Value error, Quantization method specified in the model config (gptq) does not match
  the quantization method specified in the `quantization` argument (gptq_marlin).

---

if current_platform.is_xpu():
    from vllm_xpu_kernels.quantization._quantize_convert import (
        GPTQUtils,
        transpose_onednn_woq_format,
    )
    if self.quant_config.desc_act and layer.g_idx is not None:
        gptq_utils = GPTQUtils(bits=4, blocksize=self.quant_config.group_size)
        qweight_new, g_idx_new = gptq_utils.shuffle(layer.qweight, layer.g_idx)
        layer.qweight.data.copy_(qweight_new)
        layer.g_idx.data.copy_(g_idx_new)
        del qweight_new, g_idx_new
    transpose_onednn_woq_format(layer, "gptq", True)
    return

---

if current_platform.is_xpu():
    reshaped_x = x.reshape(-1, x.shape[-1])
    out = torch.ops._xpu_C.int4_gemm_w4a16(
        reshaped_x,
        layer.qweight,
        bias,
        layer.scales,
        layer.qzeros,
        self.quant_config.group_size,
        None,
    )
    return out.reshape(x.shape[:-1] + (layer.qweight.shape[-1],))

---

_xpu_C::int4_gemm_w4a16
_xpu_C::int4_gemm_w4a8
_xpu_C::fp8_gemm
_xpu_C::fp8_gemm_w8a16
_xpu_C::gdn_attention
_xpu_C::deepseek_scaling_rope
_xpu_C::bgmv_expand / bgmv_shrink / bgmv_expand_slice
_xpu_C::cutlass_grouped_gemm_interface
_xpu_C::is_bmg / is_pvc
RAW_BUFFERClick to expand / collapse

Your current environment

vLLM 0.19.0 (built from v0.19.0 tag via docker/Dockerfile.xpu)
PyTorch 2.10.0+xpu
triton-xpu 3.6.0
vllm-xpu-kernels 0.1.5
Intel Arc Pro B70 (Battlemage G31, 32GB VRAM, PCI 8086:e223)
Ubuntu 25.10 host, container based on intel/deep-learning-essentials:2025.3.2-0-devel-ubuntu24.04

🐛 Describe the bug

Starting any GPTQ-quantized model on Intel XPU with vLLM 0.19.0 fails during engine initialization with:

AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_shuffle'

This is a regression from v0.17.x, which handled GPTQ on XPU correctly. v0.17 had explicit if current_platform.is_xpu(): branches in vllm/model_executor/layers/quantization/gptq.py that used vllm_xpu_kernels APIs. These branches were removed in v0.19, causing fallthrough to the CUDA-only vllm._C ops. The XPU build doesn't ship vllm._C at all (WARNING [interface.py:229] Failed to import from vllm._C: ModuleNotFoundError).

Additionally, the XPU kernels shared library at vllm_xpu_kernels/_xpu_C.abi3.so is not auto-loaded by v0.19, so torch.ops._xpu_C.int4_gemm_w4a16 is not registered when gptq.py's apply() method would need it.

Reproduction

  1. Build vLLM 0.19.0 for XPU:

    git clone https://github.com/vllm-project/vllm.git && cd vllm && git checkout v0.19.0
    docker build -f docker/Dockerfile.xpu -t vllm-xpu:0.19.0 .
  2. Attempt to serve any GPTQ model (e.g., btbtyler09/Qwen3-30B-A3B-Instruct-2507-gptq-4bit):

    docker run --rm -it --device /dev/dri:/dev/dri --group-add render --group-add video \
      -v /path/to/models:/models:ro \
      vllm-xpu:0.19.0 \
      vllm serve /models/Qwen3-30B-A3B-Instruct-2507-gptq-4bit \
        --dtype auto --trust-remote-code --gpu-memory-utilization 0.9
  3. Engine core crashes after weight loading with the traceback below.

Full traceback

(EngineCore pid=49) ERROR [core.py:1108] EngineCore failed to start.
(EngineCore pid=49) ERROR [core.py:1108] Traceback (most recent call last):
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=49) ERROR [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=49) ERROR [core.py:1108]     super().__init__(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/engine/core.py", line 114, in __init__
(EngineCore pid=49) ERROR [core.py:1108]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=49) ERROR [core.py:1108]     self.driver_worker.load_model()
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     self.model_runner.load_model(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     self.model = model_loader.load_model(...)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/model_loader/base_loader.py", line 81, in load_model
(EngineCore pid=49) ERROR [core.py:1108]     process_weights_after_loading(model, model_config, target_device)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=49) ERROR [core.py:1108]     quant_method.process_weights_after_loading(module)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/model_executor/layers/quantization/gptq.py", line 368, in process_weights_after_loading
(EngineCore pid=49) ERROR [core.py:1108]     ops.gptq_shuffle(layer.qweight, layer.g_idx, self.quant_config.weight_bits)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../vllm/_custom_ops.py", line 685, in gptq_shuffle
(EngineCore pid=49) ERROR [core.py:1108]     torch.ops._C.gptq_shuffle(q_weight, q_perm, bit)
(EngineCore pid=49) ERROR [core.py:1108]   File ".../torch/_ops.py", line 1319, in __getattr__
(EngineCore pid=49) ERROR [core.py:1108]     raise AttributeError(...)
(EngineCore pid=49) ERROR [core.py:1108] AttributeError: '_OpNamespace' '_C' object has no attribute 'gptq_shuffle'

Relevant log lines earlier in the run:

WARNING [interface.py:229] Failed to import from vllm._C: ModuleNotFoundError("No module named 'vllm._C'")
WARNING [gptq.py:99] Currently, the 4-bit gptq_gemm kernel for GPTQ is buggy. Please switch to gptq_marlin.

Note: --quantization gptq_marlin does not work as a workaround — v0.19.0's ModelConfig strictly rejects a mismatch between the model's quantization_config.quant_method (gptq) and the CLI-passed --quantization argument (gptq_marlin):

pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
  Value error, Quantization method specified in the model config (gptq) does not match
  the quantization method specified in the `quantization` argument (gptq_marlin).

Root cause analysis

Comparing vllm/model_executor/layers/quantization/gptq.py between v0.17.x (intel/vllm:latest docker image) and v0.19.0:

v0.17 had TWO XPU branches that v0.19.0 removed:

In GPTQLinearMethod.process_weights_after_loading (v0.17):

if current_platform.is_xpu():
    from vllm_xpu_kernels.quantization._quantize_convert import (
        GPTQUtils,
        transpose_onednn_woq_format,
    )
    if self.quant_config.desc_act and layer.g_idx is not None:
        gptq_utils = GPTQUtils(bits=4, blocksize=self.quant_config.group_size)
        qweight_new, g_idx_new = gptq_utils.shuffle(layer.qweight, layer.g_idx)
        layer.qweight.data.copy_(qweight_new)
        layer.g_idx.data.copy_(g_idx_new)
        del qweight_new, g_idx_new
    transpose_onednn_woq_format(layer, "gptq", True)
    return

In GPTQLinearMethod.apply (v0.17):

if current_platform.is_xpu():
    reshaped_x = x.reshape(-1, x.shape[-1])
    out = torch.ops._xpu_C.int4_gemm_w4a16(
        reshaped_x,
        layer.qweight,
        bias,
        layer.scales,
        layer.qzeros,
        self.quant_config.group_size,
        None,
    )
    return out.reshape(x.shape[:-1] + (layer.qweight.shape[-1],))

In v0.19.0, both branches are gone and the code goes directly to ops.gptq_shuffle / ops.gptq_gemm, which call torch.ops._C.* — ops that don't exist in the XPU build (vllm._C is CUDA-only).

_xpu_C.abi3.so not auto-loaded

Separately, the XPU kernel shared library vllm_xpu_kernels/_xpu_C.abi3.so is shipped in the wheel but nothing loads it before torch.ops._xpu_C.int4_gemm_w4a16 is needed. The ops registered by the .so include:

_xpu_C::int4_gemm_w4a16
_xpu_C::int4_gemm_w4a8
_xpu_C::fp8_gemm
_xpu_C::fp8_gemm_w8a16
_xpu_C::gdn_attention
_xpu_C::deepseek_scaling_rope
_xpu_C::bgmv_expand / bgmv_shrink / bgmv_expand_slice
_xpu_C::cutlass_grouped_gemm_interface
_xpu_C::is_bmg / is_pvc

Without an explicit torch.ops.load_library(...) call, these are not registered when gptq.py (or any other file expecting them) tries to use them.

Proposed fix

Restore the v0.17 XPU branches in gptq.py, and add an explicit load of the XPU kernels .so when running on XPU.

We have a working patch at https://github.com/bryanvine/vllm-xpu/commit/33aacbcb2 (branch xpu-build-0.19.0).

The auto-loading of _xpu_C.abi3.so could alternatively live in vllm/platforms/xpu.py or vllm_xpu_kernels/__init__.py so it benefits any module that needs _xpu_C ops, not just GPTQ. That may be the cleaner long-term home for it.

Verified working

With the patch applied, vLLM 0.19.0 on Intel Arc Pro B70 successfully:

  • Loads Qwen3-30B-A3B-Instruct-2507-gptq-4bit (MoE, 18GB weights)
  • Runs EAGLE3 speculative decoding with draft model
  • Serves chat completions (verified end-to-end)

The same pattern likely affects other GPTQ-derived quant methods on XPU if they share code paths — worth checking gptq_marlin.py, compressed_tensors.py, etc. for similar removed branches.

Related

  • vllm-xpu-kernels v0.1.5 ships the _xpu_C.abi3.so with int4_gemm_w4a16 and related ops
  • The Intel vLLM docker image at intel/vllm:latest still ships a v0.17-era build that does not have this regression
  • No GPTQ XPU CI coverage appears to exist in the vLLM repo — adding a minimal smoke test loading a small GPTQ model on XPU would have caught this

extent analysis

TL;DR

The most likely fix for the issue is to restore the XPU branches in gptq.py and add an explicit load of the XPU kernels .so when running on XPU.

Guidance

  • Restore the v0.17 XPU branches in gptq.py to handle GPTQ quantization on XPU.
  • Add an explicit load of the XPU kernels .so (_xpu_C.abi3.so) when running on XPU, which can be done in vllm/platforms/xpu.py or vllm_xpu_kernels/__init__.py.
  • Verify that the fix works by loading a GPTQ model on XPU and running it without errors.
  • Consider adding CI coverage for GPTQ on XPU to catch similar regressions in the future.

Example

A working patch is available at https://github.com/bryanvine/vllm-xpu/commit/33aacbcb2 (branch xpu-build-0.19.0), which demonstrates the necessary changes to restore the XPU branches and load the XPU kernels .so.

Notes

The issue is specific to vLLM 0.19.0 and XPU, and the proposed fix is based on the analysis of the code changes between v0.17.x and v0.19.0. The fix may need to be adapted for other versions or platforms.

Recommendation

Apply the workaround by restoring the XPU branches in gptq.py and adding an explicit load of the XPU kernels .so, as this is the most straightforward way to fix the issue without waiting for an official update.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug] Regression: GPTQ models fail to load on Intel XPU in v0.19.0 (missing XPU branches in gptq.py) [1 participants]