vllm - ✅(Solved) Fix [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8 [2 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38064 (fetched 2026-04-08 01:26:50)
Comments: 0 · Participants: 1 · Timeline events: 4 (cross-referenced ×2, referenced ×2) · Reactions: 0

The W4A8-INT quantization path in compressed_tensors silently degrades to W4A16 on all GPUs. Activations are never quantized to int8 — the kernel receives bf16 input regardless of the configured quantization scheme.

Root Cause

In compressed_tensors_w4a8_int.py, create_weights passes act_type=params_dtype (which is torch.bfloat16) instead of torch.int8 to MPLinearLayerConfig:

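# Current (broken)
mp_linear_kernel_config = MPLinearLayerConfig(
    # ... other arguments unchanged ...
    act_type=params_dtype,  # bf16 here, so activations are never quantized
)

# Fix
mp_linear_kernel_config = MPLinearLayerConfig(
    # ... other arguments unchanged ...
    act_type=torch.int8,  # request int8 activation quantization
)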

Because act_type is bf16, the Marlin kernel's apply_gptq_marlin_linear never enters the int8 quantization branch:

if input_dtype == torch.int8:
    reshaped_x, a_scales = marlin_quant_input(...)  # ← never reached
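
To make the gating concrete, here is a minimal pure-Python sketch of a dtype-gated quantization step. The names `quantize_per_token_int8` and `prepare_activations` are hypothetical stand-ins and are not vLLM's `marlin_quant_input`; the symmetric per-token scaling is an assumption chosen for illustration only.

```python
from typing import List, Optional, Tuple

Row = List[float]


def quantize_per_token_int8(rows: List[Row]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric per-row int8 quantization (illustrative, not vLLM's kernel)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in rows:
        max_abs = max((abs(v) for v in row), default=0.0)
        # Map the largest magnitude in the row to 127; avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def prepare_activations(rows: List[Row], act_type: str):
    """Quantize only when the configured activation type is int8."""
    if act_type == "int8":
        return quantize_per_token_int8(rows)
    # Any other act_type (for example "bfloat16") skips quantization,
    # which mirrors the silent W4A16 fallback described in this issue.
    return rows, None


x = [[0.5, -1.0, 0.25], [2.0, 0.0, -4.0]]
_, scales_bf16 = prepare_activations(x, "bfloat16")
q_int8, scales_int8 = prepare_activations(x, "int8")
print("bf16 path scales:", scales_bf16)   # None: nothing was quantized
print("int8 path codes:", q_int8)
```

The point of the sketch is that nothing fails on the non-int8 path: the caller still gets usable activations back, which is why the misconfiguration goes unnoticed.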

Fix

See: https://github.com/yeshihai/vllm/tree/feat/marlin-w4a8-int

Tested on L40 (SM89) with Qwen2.5-7B:

  • Latency (bs=1, input=512, output=128): 2.72x improvement over W4A16
  • Throughput (16 prompts): 2.14x improvement over W4A16

Related: #38063

PR fix notes

PR #1: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Description (problem / solution / changelog)

Summary

Fixes two bugs that together made the W4A8-INT compressed_tensors path completely non-functional:

  1. scalar_types.int4 not recognized → W4A8-INT models with signed int4 weights cannot be deployed at all
  2. Activations never quantized to int8 → even when deployment succeeds (uint4b8 weights), the kernel silently runs W4A16

Fixes #38063, Fixes #38064

Root Causes & Changes

1. compressed_tensors_w4a8_int.py

  • act_type=params_dtype (bf16) → act_type=torch.int8
  • Add weight_type=self.quant_type to MPLinearLayerConfig (was missing entirely)
  • Fix group_size passthrough for channelwise quantization
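
The PR notes do not spell out what the group_size passthrough fix looks like. As a rough illustration of why channelwise quantization needs special handling, the sketch below assumes a sentinel of -1 for "one scale per output channel"; the sentinel value and the helper names `normalize_group_size` and `scale_shape` are assumptions for this example, not code taken from the PR.

```python
from typing import Optional, Tuple

CHANNELWISE = -1  # assumed sentinel: a single group spanning the whole K dimension


def normalize_group_size(group_size: Optional[int]) -> int:
    """Map a missing group size to the channelwise sentinel."""
    return CHANNELWISE if group_size is None else group_size


def scale_shape(n: int, k: int, group_size: int) -> Tuple[int, int]:
    """Shape of the weight-scale tensor as (groups along K, N)."""
    if group_size == CHANNELWISE:
        return (1, n)
    if group_size <= 0 or k % group_size != 0:
        raise ValueError(f"group_size={group_size} does not divide K={k}")
    return (k // group_size, n)


print(scale_shape(n=4096, k=4096, group_size=128))
print(scale_shape(n=4096, k=4096, group_size=normalize_group_size(None)))
```

If the channelwise case is forwarded as a raw `None` or as the full K instead of the value the kernel expects, the scale layout no longer matches, which is the kind of mismatch a passthrough fix would address.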

2. marlin_utils.py

  • Add scalar_types.int4 to query_marlin_supported_quant_types so Marlin can be selected for signed int4 weights
  • Widen assert in apply_gptq_marlin_linear to allow int4 alongside uint4b8
  • Guard a_scales * input_global_scale behind if input_global_scale is not None (channelwise path has no global scale)
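
The None guard in the last bullet can be sketched in plain Python. The helper name `effective_activation_scales` is hypothetical, and lists of floats stand in for tensors; the real code operates on `a_scales` and `input_global_scale` inside apply_gptq_marlin_linear.

```python
from typing import List, Optional


def effective_activation_scales(
    a_scales: List[float],
    input_global_scale: Optional[float],
) -> List[float]:
    """Fold the global scale into per-token scales only when one exists."""
    if input_global_scale is None:
        # Channelwise path: there is no global scale, so use a_scales as-is.
        return list(a_scales)
    return [s * input_global_scale for s in a_scales]


print(effective_activation_scales([0.5, 0.25], None))
print(effective_activation_scales([0.5, 0.25], 2.0))
```

Without the guard, the multiplication would be attempted with `None` on the channelwise path and raise a `TypeError`.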

3. marlin.py (MarlinLinearKernel)

  • Widen assert in process_weights_after_loading to allow int4 in W4A8 path
  • Add int4 packing in transform_w_q: convert [N, K] int8 (signed int4 values) → pack 8 values per int32 → transpose to [K//8, N] → pass through gptq_marlin_repack(is_a_8bit=True)
  • Add effective_wtype remap in apply_weights: pass uint4b8 to the kernel for signed int4 weights (the kernel interprets packed bits identically; the +8 bias is absorbed during packing)
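
The packing and the "+8 bias" remark can be illustrated with a small pure-Python sketch. It packs eight signed int4 values into one 32-bit word in the biased uint4b8 encoding. The nibble order (value i in bits 4i to 4i+3) and the helper names are assumptions made for this example; the PR's actual layout is additionally transformed by gptq_marlin_repack.

```python
from typing import List


def pack_int4_as_uint4b8(values: List[int]) -> int:
    """Pack 8 signed int4 values into a 32-bit word using the +8 bias."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not -8 <= v <= 7:
            raise ValueError(f"{v} is outside the signed int4 range")
        biased = v + 8  # uint4b8: stored nibble = real value + 8
        word |= biased << (4 * i)
    return word


def unpack_uint4b8(word: int) -> List[int]:
    """Inverse of pack_int4_as_uint4b8: read nibbles and remove the bias."""
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(8)]


vals = [-8, -1, 0, 7, 3, -4, 5, -2]
word = pack_int4_as_uint4b8(vals)
assert unpack_uint4b8(word) == vals

# Adding 8 is the same as flipping the sign bit of the two's-complement nibble.
assert all((v + 8) == ((v & 0xF) ^ 0x8) for v in range(-8, 8))
print(hex(word))
```

Because the bias is applied while packing, a kernel that decodes nibbles as uint4b8 recovers the original signed values, which is consistent with the remap described in the last bullet.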

Testing

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B W4A8-INT (compressed_tensors):

Metric                            W4A16 baseline   W4A8-INT (this PR)   Improvement
Latency (bs=1, in=512, out=128)   2.62 s           0.96 s               2.72x
Throughput (16 prompts)           531 tok/s        1137 tok/s           2.14x

Activation dtype confirmed as torch.int8 (verified via dtype print in apply_gptq_marlin_linear).

Kernel Path on SM89

  • CutlassW4A8LinearKernel: requires SM90 + FP8 activations — not applicable
  • MarlinLinearKernel: min SM75, supports INT8 activations — used on SM89
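
The selection logic implied by these two bullets can be sketched as a small capability table. Only the constraints stated above are encoded; the `KernelSpec` structure, the `choose_kernel` helper, and treating SM90 as a minimum rather than an exact requirement are assumptions for illustration, not vLLM's actual dispatch code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class KernelSpec:
    name: str
    min_capability: int          # e.g. 89 for SM89
    act_types: FrozenSet[str]    # activation types listed in the notes above


# Candidates in priority order, using only the facts from the PR notes.
CANDIDATES: List[KernelSpec] = [
    KernelSpec("CutlassW4A8LinearKernel", 90, frozenset({"fp8"})),
    KernelSpec("MarlinLinearKernel", 75, frozenset({"int8"})),
]


def choose_kernel(capability: int, act_type: str) -> Optional[str]:
    """Return the first candidate whose constraints are satisfied."""
    for spec in CANDIDATES:
        if capability >= spec.min_capability and act_type in spec.act_types:
            return spec.name
    return None


print(choose_kernel(89, "int8"))  # L40 (SM89) with int8 activations
print(choose_kernel(89, "fp8"))   # no candidate in this table matches
```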

Changed files

  • vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +20/-4)
  • vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
  • vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +9/-4)

PR #38066: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

The description (summary, root causes and changes, testing, kernel path on SM89) is identical to that of PR #1 above; only the diff sizes under Changed files differ.

Changed files

  • vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +24/-4)
  • vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
  • vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +10/-5)

Verification

Add a dtype print in apply_gptq_marlin_linear:

print(f"x.dtype={x.dtype}", file=sys.stderr, flush=True)
# Outputs: x.dtype=torch.bfloat16  ← should be torch.int8

(Use VLLM_ENABLE_V1_MULTIPROCESSING=0 to force single-process so the print is visible.)

Impact

All GPUs (A100, L40, A30, etc.). Every W4A8-INT model loaded via compressed_tensors actually runs as W4A16, wasting memory bandwidth and defeating the purpose of activation quantization.


extent analysis

Fix Plan

To fix the issue, update create_weights in compressed_tensors_w4a8_int.py so that it passes act_type=torch.int8 instead of act_type=params_dtype to MPLinearLayerConfig. For models with signed int4 weights, the PR notes above list further changes in marlin_utils.py and marlin.py that are also required.

  • Update the mp_linear_kernel_config creation:

mp_linear_kernel_config = MPLinearLayerConfig(
    # ... other arguments unchanged ...
    act_type=torch.int8,  # request int8 activation quantization
)

  • Verify the fix by adding a dtype print in apply_gptq_marlin_linear:

print(f"x.dtype={x.dtype}", file=sys.stderr, flush=True)

Run the model with VLLM_ENABLE_V1_MULTIPROCESSING=0 to force single-process execution, then check the output on stderr.

Verification

After applying the fix, the output of the print statement should be:

x.dtype=torch.int8

This indicates that the input to the Marlin kernel is now correctly quantized to int8.

Extra Tips

  • Make sure to test the fix on different GPUs (e.g., A100, L40, A30) to ensure the issue is resolved across all platforms.
  • Monitor latency and throughput after applying the fix. The reported results on L40 (SM89) with Qwen2.5-7B were a 2.72x latency improvement and a 2.14x throughput improvement over W4A16.
