vllm - ✅(Solved) Fix [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8 [2 pull requests, 1 participants]

vllm • 2026-03-25 03:59:30

vllm-project/vllm#38064 • Fetched 2026-04-08 01:26:50

Author: yeshihai
Participants: yeshihai

The W4A8-INT quantization path in compressed_tensors silently degrades to W4A16 on all GPUs. Activations are never quantized to int8 — the kernel receives bf16 input regardless of the configured quantization scheme.

Root Cause

In compressed_tensors_w4a8_int.py, create_weights passes act_type=params_dtype (which is torch.bfloat16) instead of torch.int8 to MPLinearLayerConfig:

# Current (broken)
mp_linear_kernel_config = MPLinearLayerConfig(
    ...
    act_type=params_dtype,   # bf16 — wrong!
    ...
)

# Fix
mp_linear_kernel_config = MPLinearLayerConfig(
    ...
    act_type=torch.int8,     # correct
    ...
)

Because act_type is bf16, the Marlin kernel's apply_gptq_marlin_linear never enters the int8 quantization branch:

if input_dtype == torch.int8:
    reshaped_x, a_scales = marlin_quant_input(...)  # ← never reached
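
The effect of the wrong act_type can be illustrated with a small standalone model of the dispatch. This is a sketch under stated assumptions, not vLLM code: LayerConfigSketch, quantize_to_int8, and apply_linear_sketch are hypothetical names, and dtypes are plain strings so the example runs without PyTorch.

```python
from dataclasses import dataclass


@dataclass
class LayerConfigSketch:
    """Hypothetical stand-in for the kernel config; only act_type is modeled."""
    act_type: str  # "bfloat16" or "int8"


def quantize_to_int8(x):
    """Symmetric per-tensor int8 quantization of a list of floats (illustrative)."""
    max_abs = max((abs(v) for v in x), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [round(v / scale) for v in x], scale


def apply_linear_sketch(x, config):
    """Return (activations handed to the kernel, dtype label, scale)."""
    if config.act_type == "int8":
        q, scale = quantize_to_int8(x)
        return q, "int8", scale
    # Any other act_type skips quantization: the kernel sees the original values.
    return x, config.act_type, None


x = [0.5, -1.0, 0.25]
broken = apply_linear_sketch(x, LayerConfigSketch(act_type="bfloat16"))
fixed = apply_linear_sketch(x, LayerConfigSketch(act_type="int8"))
print(broken[1], fixed[1])
```

With act_type set to the parameter dtype, the quantization branch is skipped without any error, which is why the degradation is silent.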

Fix

See: https://github.com/yeshihai/vllm/tree/feat/marlin-w4a8-int

Tested on L40 (SM89) with Qwen2.5-7B:

Latency (bs=1, input=512, output=128): 2.72x improvement over W4A16
Throughput (16 prompts): 2.14x improvement over W4A16

Related: #38063

PR fix notes

PR #1: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Repository: yeshihai/vllm
Author: yeshihai
State: closed | merged: False
Link: https://github.com/yeshihai/vllm/pull/1

Description (problem / solution / changelog)

Summary

Fixes two bugs that together made the W4A8-INT compressed_tensors path completely non-functional:

scalar_types.int4 not recognized → W4A8-INT models with signed int4 weights cannot be deployed at all
Activations never quantized to int8 → even when deployment succeeds (uint4b8 weights), the kernel silently runs W4A16

Fixes #38063, Fixes #38064

Root Causes & Changes

1. `compressed_tensors_w4a8_int.py`

act_type=params_dtype (bf16) → act_type=torch.int8
Add weight_type=self.quant_type to MPLinearLayerConfig (was missing entirely)
Fix group_size passthrough for channelwise quantization

2. `marlin_utils.py`

Add scalar_types.int4 to query_marlin_supported_quant_types so Marlin can be selected for signed int4 weights
Widen assert in apply_gptq_marlin_linear to allow int4 alongside uint4b8
Guard a_scales * input_global_scale behind if input_global_scale is not None (channelwise path has no global scale)
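
The None guard in the last item can be sketched as follows. This is a minimal illustration, assuming plain Python floats in place of tensors; combine_scales is a hypothetical helper name and does not appear in the PR.

```python
def combine_scales(a_scales, input_global_scale=None):
    """Fold an optional global scale into per-token activation scales."""
    if input_global_scale is None:
        # Channelwise path: there is no global scale, so the
        # per-token scales are passed through unchanged.
        return list(a_scales)
    return [s * input_global_scale for s in a_scales]


print(combine_scales([0.5, 2.0]))        # no global scale
print(combine_scales([0.5, 2.0], 2.0))   # global scale applied
```

Without the guard, multiplying by a missing global scale would fail on the channelwise path instead of leaving the scales untouched.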

3. `marlin.py` (MarlinLinearKernel)

Widen assert in process_weights_after_loading to allow int4 in W4A8 path
Add int4 packing in transform_w_q: convert [N, K] int8 (signed int4 values) → pack 8 values per int32 → transpose to [K//8, N] → pass through gptq_marlin_repack(is_a_8bit=True)
Add effective_wtype remap in apply_weights: pass uint4b8 to the kernel for signed int4 weights (the kernel interprets packed bits identically; the +8 bias is absorbed during packing)
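
The packing step and the absorbed +8 bias can be illustrated in isolation. The sketch below assumes the lowest nibble holds the first value, which the PR text does not specify, and it does not reproduce the transpose or the gptq_marlin_repack layout. Both function names are hypothetical.

```python
def pack_int4_to_uint4b8_words(values):
    """Pack signed int4 values into 32-bit words, 8 per word, with a +8 bias."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not -8 <= v <= 7:
                raise ValueError(f"{v} is outside the signed int4 range")
            # Biased nibble in [0, 15]; lowest nibble first (assumed order).
            word |= (v + 8) << (4 * j)
        words.append(word)
    return words


def unpack_uint4b8_words(words):
    """Inverse of the packing above: extract nibbles and remove the bias."""
    out = []
    for word in words:
        for j in range(8):
            out.append(((word >> (4 * j)) & 0xF) - 8)
    return out


vals = [-8, -1, 0, 1, 7, 3, -4, 2]
packed = pack_int4_to_uint4b8_words(vals)
assert unpack_uint4b8_words(packed) == vals
```

Because the bias is applied while packing, a kernel that expects uint4b8 nibbles can consume the packed words without knowing the weights were originally signed.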

Testing

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B W4A8-INT (compressed_tensors):

Metric	W4A16 baseline	W4A8-INT (this PR)	Improvement
Latency (bs=1, in=512, out=128)	2.62s	0.96s	2.72x
Throughput (16 prompts)	531 tok/s	1137 tok/s	2.14x
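
As a quick sanity check, the improvement column can be recomputed from the rounded values in the table. The helper name speedup is illustrative. Latency improves when it goes down and throughput when it goes up, so the two ratios are taken in opposite directions.

```python
def speedup(baseline, improved, higher_is_better):
    """Ratio greater than 1 means the improved run is better."""
    return improved / baseline if higher_is_better else baseline / improved


latency_speedup = speedup(2.62, 0.96, higher_is_better=False)
throughput_speedup = speedup(531, 1137, higher_is_better=True)
print(f"latency: {latency_speedup:.2f}x, throughput: {throughput_speedup:.2f}x")
```

From the rounded inputs the latency ratio comes out at roughly 2.73x, which agrees with the reported 2.72x to within the rounding of the displayed timings; the throughput ratio matches 2.14x.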

Activation dtype confirmed as torch.int8 (verified via dtype print in apply_gptq_marlin_linear).

Kernel Path on SM89

CutlassW4A8LinearKernel: requires SM90 + FP8 activations — not applicable
MarlinLinearKernel: min SM75, supports INT8 activations — used on SM89
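
The two constraints above can be written as a tiny selector to show why Marlin is the only candidate on SM89. This is a hypothetical sketch based solely on those two lines; vLLM's real kernel selection considers more conditions, and treating SM90 as a minimum rather than an exact match is an assumption.

```python
def pick_kernel_sketch(sm, act_type):
    """Hypothetical selector using only the two constraints listed above."""
    # Assumption: "requires SM90" is treated as a minimum capability.
    if sm >= 90 and act_type == "fp8":
        return "CutlassW4A8LinearKernel"
    if sm >= 75 and act_type == "int8":
        return "MarlinLinearKernel"
    return None


print(pick_kernel_sketch(89, "int8"))  # L40 with INT8 activations
```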

Changed files

vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +20/-4)
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +9/-4)

PR #38066: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Repository: vllm-project/vllm
Author: yeshihai
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38066

Description (problem / solution / changelog): identical to the description of PR #1 above.

Changed files

vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +24/-4)
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +10/-5)

Verification

Add a dtype print in apply_gptq_marlin_linear:

print(f"x.dtype={x.dtype}", file=sys.stderr, flush=True)
# Outputs: x.dtype=torch.bfloat16  ← should be torch.int8

(Use VLLM_ENABLE_V1_MULTIPROCESSING=0 to force single-process so the print is visible.)
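
One way to apply that setting without changing your shell environment is to launch the inference script as a child process and capture its stderr. The sketch below uses a stand-in child command so that it runs anywhere; run_single_process is a hypothetical helper, and the child would be replaced by your own script.

```python
import os
import subprocess
import sys


def run_single_process(argv):
    """Run a command with VLLM_ENABLE_V1_MULTIPROCESSING=0 and return its stderr."""
    env = dict(os.environ, VLLM_ENABLE_V1_MULTIPROCESSING="0")
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=True)
    return result.stderr


# Stand-in child process that echoes the variable to stderr;
# replace it with the command that runs your inference script.
child = [
    sys.executable,
    "-c",
    "import os, sys; print(os.environ['VLLM_ENABLE_V1_MULTIPROCESSING'], file=sys.stderr)",
]
stderr_text = run_single_process(child)
print(stderr_text.strip())
```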

Impact

Affects all GPUs (A100, L40, A30, etc.). Every W4A8-INT model loaded via compressed_tensors actually runs as W4A16, which wastes memory bandwidth and defeats the purpose of activation quantization.
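
A rough byte count shows the activation-side cost of staying in bf16. The shape below is purely illustrative and not taken from the issue, and the calculation ignores per-token scales and weight traffic.

```python
BYTES_PER_ELEMENT = {"bfloat16": 2, "int8": 1}


def activation_bytes(num_tokens, hidden_size, dtype):
    """Size in bytes of a [num_tokens, hidden_size] activation tensor."""
    return num_tokens * hidden_size * BYTES_PER_ELEMENT[dtype]


# Illustrative shape only: 512 tokens, hidden size 4096.
bf16_bytes = activation_bytes(512, 4096, "bfloat16")
int8_bytes = activation_bytes(512, 4096, "int8")
print(bf16_bytes, int8_bytes, bf16_bytes / int8_bytes)
```

The bf16 tensor is twice the size of its int8 counterpart, so every linear layer that skips quantization reads twice as many activation bytes as the configured scheme intends.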

extent analysis

Fix Plan

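To fix the activation dtype, update create_weights in compressed_tensors_w4a8_int.py so that it passes act_type=torch.int8 instead of act_type=params_dtype to MPLinearLayerConfig. According to the linked PR, models with signed int4 weights also need the accompanying changes in marlin_utils.py and marlin.py.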

Update the mp_linear_kernel_config creation:

mp_linear_kernel_config = MPLinearLayerConfig(
    ...
    act_type=torch.int8,     # correct
    ...
)

Verify the fix by adding a dtype print in apply_gptq_marlin_linear:

print(f"x.dtype={x.dtype}", file=sys.stderr, flush=True)

Run the model with VLLM_ENABLE_V1_MULTIPROCESSING=0 to force single-process and check the output.

Verification

After applying the fix, the output of the print statement should be:

x.dtype=torch.int8

This indicates that the input to the Marlin kernel is now correctly quantized to int8.
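
If the run produces many log lines, the check can be automated by scanning captured stderr for the debug print. This is a minimal sketch that assumes the exact print format shown above; activation_dtypes and fix_is_active are hypothetical helper names.

```python
import re

# Matches the debug print format "x.dtype=<dtype>".
DTYPE_LINE = re.compile(r"x\.dtype=(\S+)")


def activation_dtypes(log_text):
    """Return the set of dtypes reported by the debug print."""
    return set(DTYPE_LINE.findall(log_text))


def fix_is_active(log_text):
    """True only if at least one print was seen and all report torch.int8."""
    return activation_dtypes(log_text) == {"torch.int8"}


sample_log = "x.dtype=torch.int8\nother output\nx.dtype=torch.int8\n"
print(fix_is_active(sample_log))
```

An empty log counts as a failure here, which avoids mistaking a run where the print never fired for a passing one.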

Extra Tips

Test the fix on several GPUs (e.g., A100, L40, A30) to confirm the issue is resolved across platforms.
Measure latency and throughput after applying the fix. The reported results (2.72x latency improvement and 2.14x throughput improvement over W4A16) were obtained on an L40 with Qwen2.5-7B, so other hardware and models may differ.
