pytorch: [vllm] [2.12 regression][Quantization] CPU offload diverges from non-offload for w4a16 (Qwen1.5-MoE-A2.7B) [1 comment, 2 participants]

GitHub stats
pytorch/pytorch#181634, fetched 2026-04-28 06:24:13
Comments: 1 · Participants: 2 · Timeline events: 20 · Reactions: 0
Timeline (top): mentioned ×8, subscribed ×8, labeled ×2, commented ×1

Under torch 2.12.0 + triton 3.7.0, vLLM's test_cpu_offload_compressed_tensors fails because results from running with --cpu-offload-gb 1 differ from running without CPU offload (same args otherwise) on a w4a16-quantized Qwen1.5-MoE model:

AssertionError: Results for model='nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16' are not the same.

The test is a parity check between two settings of the same model (one offloading, one not) — passing means CPU offload is functionally transparent. Failure means CPU offload produces different outputs vs. fully on-GPU on torch 2.12. Passes on torch 2.11. Blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Environment

  • torch: 2.12.0+cu130 (test channel)
  • triton: 3.7.0
  • CUDA: 13.0
  • Python: 3.12.13
  • Model: nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16 (compressed-tensors w4a16)
  • Quant method: gptq_marlin

Reproduction

Failing test:

tests/quantization/test_cpu_offload.py::test_cpu_offload_compressed_tensors

Test body:

@pytest.mark.skipif(
    not is_quant_method_supported("gptq_marlin"),
    reason="gptq_marlin is not supported on this GPU type.",
)
def test_cpu_offload_compressed_tensors(monkeypatch):
    monkeypatch.setenv("VLLM_TEST_FORCE_LOAD_FORMAT", "auto")
    compare_two_settings(
        "nm-testing/Qwen1.5-MoE-A2.7B-Chat-quantized.w4a16",
        ["--enforce_eager"],
        ["--enforce_eager", "--cpu-offload-gb", "1"],
        max_wait_seconds=480,
    )

compare_two_settings runs the model under both argument sets and compares the output text, which must match. On torch 2.12 it does not.
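To narrow down where the two runs part ways, it can help to compare the outputs per prompt rather than as a single pass/fail. The sketch below is not vLLM's compare_two_settings; `first_divergence` and `report_mismatches` are hypothetical helpers, and they assume each run has already been reduced to a list of outputs (strings or token-id lists) in the same prompt order.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence(a: Sequence, b: Sequence) -> Optional[int]:
    """Return the first index where a and b differ, or None if they are identical."""
    sentinel = object()  # never equal to a real element, so a length mismatch counts as a difference
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=sentinel)):
        if x != y:
            return i
    return None


def report_mismatches(baseline: Sequence[Sequence], offload: Sequence[Sequence]) -> dict:
    """Map prompt index -> first differing position, for every prompt that diverges."""
    if len(baseline) != len(offload):
        raise ValueError("both runs must cover the same prompts")
    report = {}
    for idx, (ref, out) in enumerate(zip(baseline, offload)):
        pos = first_divergence(ref, out)
        if pos is not None:
            report[idx] = pos
    return report
```

A divergence at position 0 would point at a gross numeric difference, while a late divergence is more consistent with small rounding differences that eventually flip a token choice.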

Reproducibility

Diagnosis request

CPU offload divergence suggests that, on torch 2.12, the offloaded path of the quantized matmul (gptq_marlin / compressed-tensors w4a16) is not bit-identical to the GPU-resident path; the cause could be a change in tensor copy, cast, or dequantization. Could a maintainer check whether torch 2.12 changes the device-transfer or dequantize numeric behavior for w4a16 weights when part of the model's weights is offloaded to CPU?
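One way to test the copy/cast/dequant hypothesis is to dequantize the same packed weights before and after a transfer and compare the results bit for bit. The sketch below shows that comparison in pure Python under loud assumptions: the packing layout (eight unsigned 4-bit values per 32-bit word, lowest nibble first), the symmetric zero point of 8, and all function names are illustrative only and are not the gptq_marlin or compressed-tensors layout. A plain list copy stands in for the host round trip.

```python
import struct
from typing import List

PACK_FACTOR = 8  # assumed: eight 4-bit values per 32-bit word


def pack_int4(values: List[int]) -> List[int]:
    """Pack unsigned 4-bit values (0..15) into 32-bit words, lowest nibble first."""
    if len(values) % PACK_FACTOR:
        raise ValueError("length must be a multiple of 8")
    words = []
    for start in range(0, len(values), PACK_FACTOR):
        word = 0
        for shift, v in enumerate(values[start:start + PACK_FACTOR]):
            if not 0 <= v <= 15:
                raise ValueError("value out of 4-bit range")
            word |= v << (4 * shift)
        words.append(word)
    return words


def unpack_int4(words: List[int]) -> List[int]:
    """Inverse of pack_int4."""
    return [(w >> (4 * s)) & 0xF for w in words for s in range(PACK_FACTOR)]


def dequantize(words: List[int], scale: float, zero_point: int = 8) -> List[float]:
    """Dequantize with a single scale and zero point (assumed scheme)."""
    return [(q - zero_point) * scale for q in unpack_int4(words)]


def bit_identical(a: List[float], b: List[float]) -> bool:
    """Compare raw IEEE-754 bytes, so 0.0 vs -0.0 counts as a mismatch."""
    return len(a) == len(b) and all(
        struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b)
    )
```

In a real investigation the same idea would apply to the actual packed weight tensors: if the packed integers survive the round trip unchanged but the dequantized values differ, the transfer can be ruled out and attention shifts to the cast or kernel.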

Links

  • vLLM PR: vllm-project/vllm#40077
  • Umbrella: pytorch/pytorch#180899

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel

extent analysis

TL;DR

No fix is known yet. The most likely next step is to investigate the difference in tensor copy, cast, or dequantization behavior between torch 2.11 and torch 2.12 for w4a16 weights.

Guidance

  • Verify that the issue is indeed caused by the difference in device-transfer or dequantize numeric behavior for w4a16 weights between torch 2.11 and torch 2.12 by checking the torch documentation and release notes for any changes related to quantized matmul and CPU offload.
  • Check the compare_two_settings function to ensure that it correctly compares the output text from both settings and that the comparison is not affected by any external factors.
  • Investigate the gptq_marlin quantization method and its interaction with CPU offload in torch 2.12 to identify any potential issues or changes that could be causing the divergence.
  • Consider testing the model with different quantization methods or settings to see if the issue is specific to gptq_marlin or w4a16 weights.

Example

No code snippet is provided as the issue is more related to the interaction between different components and versions rather than a specific code bug.

Notes

The issue seems to be specific to torch 2.12 and the gptq_marlin quantization method, and it may be related to changes in the device-transfer or dequantize numeric behavior for w4a16 weights. Further investigation is needed to identify the root cause and develop a fix.

Recommendation

Apply a workaround by downgrading to torch 2.11 until the issue is resolved in a future version of torch, as the test passes on torch 2.11 and the issue is specific to torch 2.12.
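If the downgrade is not immediately possible, a version gate can at least make the known-bad combination explicit in test or startup code. This is a minimal sketch: `parse_release` and `is_reported_affected` are hypothetical names, and treating the whole 2.12 series as affected is an assumption, since the issue only reports 2.12.0.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from strings such as '2.12.0+cu130'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_reported_affected(version: str) -> bool:
    """True for the 2.12 series (assumption: the issue only confirms 2.12.0)."""
    return parse_release(version)[:2] == (2, 12)
```

A caller could pass `torch.__version__` to `is_reported_affected` and, for example, mark the parity test as an expected failure or log a warning when CPU offload is combined with w4a16 weights.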
