pytorch - 💡(How to fix) Fix [vllm] [2.12 regression][Quantization] CPU offload diverges from non-offload for w4a16 (Qwen1.5-MoE-A2.7B) [1 comments, 2 participants]

pytorch • 2026-04-27 19:05:45

GitHub stats

pytorch/pytorch#181634 • Fetched 2026-04-28 06:24:13

Author

atalman

Participants

angelayi

atalman

Timeline (top)

mentioned ×8, subscribed ×8, labeled ×2, commented ×1

Under torch 2.12.0 + triton 3.7.0, vLLM's test_cpu_offload_compressed_tensors fails because results from running with --cpu-offload-gb 1 differ from running without CPU offload (same args otherwise) on a w4a16-quantized Qwen1.5-MoE model:

AssertionError: Results for model='nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16' are not the same.

The test is a parity check between two settings of the same model (one offloading, one not): passing means CPU offload is functionally transparent. Failure means that, on torch 2.12, CPU offload produces different outputs from the fully on-GPU run. The test passes on torch 2.11. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Summary

AssertionError: Results for model='nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16' are not the same.

Environment

torch: 2.12.0+cu130 (test channel)
triton: 3.7.0
CUDA: 13.0
Python: 3.12.13
Model: nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 (compressed-tensors w4a16)
Quant method: gptq_marlin

Reproduction

Failing test:

tests/quantization/test_cpu_offload.py::test_cpu_offload_compressed_tensors

Test body:

@pytest.mark.skipif(
    not is_quant_method_supported("gptq_marlin"),
    reason="gptq_marlin is not supported on this GPU type.",
)
def test_cpu_offload_compressed_tensors(monkeypatch):
    monkeypatch.setenv("VLLM_TEST_FORCE_LOAD_FORMAT", "auto")
    compare_two_settings(
        "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16",
        ["--enforce_eager"],
        ["--enforce_eager", "--cpu-offload-gb", "1"],
        max_wait_seconds=480,
    )

compare_two_settings runs the model under both arg sets and compares output text — they must match. They don't, on torch 2.12.
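
The comparison described above can be pictured with a minimal sketch. This is not vLLM's actual compare_two_settings implementation: `first_divergence` and the sample strings are hypothetical, and the sketch assumes the outputs of both runs have already been collected as lists of strings.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[str], offload: Sequence[str]) -> Optional[int]:
    """Return the index of the first differing output, or None if all match.

    Hypothetical helper: a length mismatch counts as a divergence at the
    end of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip(baseline, offload)):
        if a != b:
            return i
    if len(baseline) != len(offload):
        return min(len(baseline), len(offload))
    return None


# Illustrative outputs only; these are not taken from the failing CI run.
gpu_only = ["Hello, world", "The answer is 4"]
with_offload = ["Hello, world", "The answer is 5"]
idx = first_divergence(gpu_only, with_offload)
```

Reporting the first diverging index, rather than only a pass/fail result, makes it easier to tell an immediate mismatch from one that accumulates over later outputs.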

Reproducibility

torch-2.12 branch: failed https://buildkite.com/vllm/ci/builds/63095#019dcf15-80ae-4e8a-a68a-b2f75c554b00
main (torch 2.11) — passes on every recent build:
- 2026-04-25 daily: https://buildkite.com/vllm/ci/builds/62981
- 2026-04-26 nightly: https://buildkite.com/vllm/ci/builds/62990
- 2026-04-26 daily: https://buildkite.com/vllm/ci/builds/63026
- 2026-04-27 nightly: https://buildkite.com/vllm/ci/builds/63061

Diagnosis request

The divergence under CPU offload suggests that, on torch 2.12, the host-side branch of the quantized matmul (gptq_marlin / compressed-tensors w4a16) no longer matches the GPU-resident branch bit for bit; the cause could be a change in tensor copy, cast, or dequantization. Could a maintainer check whether torch 2.12 changes the device-transfer or dequantize numeric behavior for w4a16 weights when part of the model is offloaded to CPU (here, --cpu-offload-gb 1)?
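
The request turns on bit-identical equality rather than approximate equality. A minimal sketch of that distinction follows, assuming 16-bit activations are IEEE 754 half precision and using plain Python floats instead of tensors; `fp16_bits` and `bit_mismatches` are hypothetical names, not part of torch or vLLM.

```python
import struct
from typing import List, Sequence


def fp16_bits(value: float) -> int:
    """Round a Python float to IEEE 754 half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def bit_mismatches(a: Sequence[float], b: Sequence[float]) -> List[int]:
    """Indices where the two sequences differ once both are rounded to float16."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if fp16_bits(x) != fp16_bits(y)]


# 1.0009765625 is 1 + 2**-10, the next representable float16 after 1.0, so the
# second pair differs by a single unit in the last place.
mismatched = bit_mismatches([1.0, 1.0], [1.0, 1.0009765625])
```

A difference of one unit in the last place would pass a tolerance-based comparison, yet it can be enough to change a sampled token and therefore the output text that the test compares.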

extent analysis

TL;DR

No fix has been identified yet. The most likely next step is to investigate the difference in tensor copy, cast, or dequantization behavior between torch 2.11 and torch 2.12 for w4a16 weights.

Guidance

Verify that the issue is indeed caused by the difference in device-transfer or dequantize numeric behavior for w4a16 weights between torch 2.11 and torch 2.12 by checking the torch documentation and release notes for any changes related to quantized matmul and CPU offload.
Check the compare_two_settings function to ensure that it correctly compares the output text from both settings and that the comparison is not affected by any external factors.
Investigate the gptq_marlin quantization method and its interaction with CPU offload in torch 2.12 to identify any potential issues or changes that could be causing the divergence.
Consider testing the model with different quantization methods or settings to see if the issue is specific to gptq_marlin or w4a16 weights.
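
One way to act on the guidance above is to fingerprint the raw bytes of each weight buffer in both configurations and list the names whose digests differ. This is a sketch under stated assumptions: the caller has already obtained each buffer as `bytes`, and the function names and buffer names are hypothetical.

```python
import hashlib
from typing import Dict, List


def digest(buffers: Dict[str, bytes]) -> Dict[str, str]:
    """Map each buffer name to the SHA-256 hex digest of its raw bytes."""
    return {name: hashlib.sha256(raw).hexdigest() for name, raw in buffers.items()}


def differing_buffers(reference: Dict[str, bytes], candidate: Dict[str, bytes]) -> List[str]:
    """Sorted names whose bytes differ, or that exist in only one of the two sets."""
    ref, cand = digest(reference), digest(candidate)
    return sorted(name for name in ref.keys() | cand.keys() if ref.get(name) != cand.get(name))


# Illustrative data only; the names do not refer to real checkpoint keys.
on_gpu = {"layer0.qweight": b"\x12\x34", "layer0.scales": b"\x00\x3c"}
offloaded = {"layer0.qweight": b"\x12\x34", "layer0.scales": b"\x01\x3c"}
changed = differing_buffers(on_gpu, offloaded)
```

If the weight bytes match in both configurations, the transfer itself is unlikely to be the cause and attention can shift to the kernels that consume the weights.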

Example

The issue does not include a minimal standalone reproducer: the divergence arises from the interaction between components and versions rather than from a single identifiable code bug.

Notes

The issue seems to be specific to torch 2.12 and the gptq_marlin quantization method, and it may be related to changes in the device-transfer or dequantize numeric behavior for w4a16 weights. Further investigation is needed to identify the root cause and develop a fix.

Recommendation

As a workaround, stay on torch 2.11, where the test passes, until the regression is resolved in a later torch release; the failure has only been reported on torch 2.12.
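
A guard for the affected version could look like the sketch below. It assumes that only the 2.12 series is affected, since that is the only failing version reported in the issue; `major_minor` and `is_known_affected` are hypothetical helpers, not part of torch or vLLM.

```python
import re
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.12.0+cu130'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_known_affected(torch_version: str) -> bool:
    # Assumption: only the 2.12 series is treated as affected, because that is
    # the only failing version reported in the issue.
    return major_minor(torch_version) == (2, 12)
```

In practice the argument would typically be `torch.__version__`, and the result could gate a skip condition or a startup warning.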
