vllm - ✅(Solved) Fix [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable [2 pull requests, 1 participant]

vllm · 2026-03-25 03:59:13

GitHub stats

vllm-project/vllm#38063 • Fetched 2026-04-08 01:26:52

Author

yeshihai

Participants

yeshihai

Timeline (top)

cross-referenced ×3

When loading a compressed_tensors W4A8-INT model with int4 weight type (e.g., quantized with LLM Compressor), vLLM fails to deploy because scalar_types.int4 is not included in the Marlin kernel's supported quantization types.

Root Cause

query_marlin_supported_quant_types in marlin_utils.py only returns [scalar_types.uint4b8, scalar_types.uint8b128] for the non-zero-point path. scalar_types.int4 is missing, so MarlinLinearKernel.can_implement() returns False and no kernel is selected.

# Current (broken)
res = [scalar_types.uint4b8, scalar_types.uint8b128]

# Fix
res = [scalar_types.uint4b8, scalar_types.uint8b128, scalar_types.int4]

Additionally, process_weights_after_loading has an assert that only allows uint4b8 in the W4A8 path:

# Current (broken)
assert c.weight_type == scalar_types.uint4b8

# Fix
assert c.weight_type in (scalar_types.uint4b8, scalar_types.int4)
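The gating behaviour described above can be sketched without vLLM installed. This is a minimal illustration, not the library's code: the string constants stand in for `scalar_types` members, and `supported_quant_types` and `can_implement` are hypothetical simplifications of `query_marlin_supported_quant_types` and `MarlinLinearKernel.can_implement()`.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-ins for vLLM's scalar_types members (plain strings here).
UINT4B8, UINT8B128, INT4 = "uint4b8", "uint8b128", "int4"


def supported_quant_types(include_int4: bool) -> List[str]:
    """Simplified non-zero-point branch of the supported-type query."""
    res = [UINT4B8, UINT8B128]
    if include_int4:
        res.append(INT4)  # the one-line fix proposed in the issue
    return res


def can_implement(weight_type: str, include_int4: bool) -> Tuple[bool, Optional[str]]:
    """Reject any weight type that is absent from the supported list."""
    supported = supported_quant_types(include_int4)
    if weight_type not in supported:
        return False, f"{weight_type} not in {supported}"
    return True, None


before = can_implement(INT4, include_int4=False)  # rejected: no kernel selected
after = can_implement(INT4, include_int4=True)    # accepted once int4 is listed
print(before[0], after[0])
```

With the list unchanged, the int4 weight type is rejected before any kernel code runs, which matches the "no kernel is selected" symptom.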

Fix Action

Fix

See: https://github.com/yeshihai/vllm/tree/feat/marlin-w4a8-int

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B.

PR fix notes

PR #1: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Repository: yeshihai/vllm
Author: yeshihai
State: closed | merged: False
Link: https://github.com/yeshihai/vllm/pull/1

Description (problem / solution / changelog)

Summary

Fixes two bugs that together made the W4A8-INT compressed_tensors path completely non-functional:

scalar_types.int4 not recognized → W4A8-INT models with signed int4 weights cannot be deployed at all
Activations never quantized to int8 → even when deployment succeeds (uint4b8 weights), the kernel silently runs W4A16

Fixes #38063, Fixes #38064

Root Causes & Changes

1. `compressed_tensors_w4a8_int.py`

act_type=params_dtype (bf16) → act_type=torch.int8
Add weight_type=self.quant_type to MPLinearLayerConfig (was missing entirely)
Fix group_size passthrough for channelwise quantization

2. `marlin_utils.py`

Add scalar_types.int4 to query_marlin_supported_quant_types so Marlin can be selected for signed int4 weights
Widen assert in apply_gptq_marlin_linear to allow int4 alongside uint4b8
Guard a_scales * input_global_scale behind if input_global_scale is not None (channelwise path has no global scale)
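The `None` guard in the last item can be illustrated with plain Python. This is a sketch under assumptions: symmetric per-token int8 quantization is assumed for the activation path (the source does not spell out the scheme), and both function names are hypothetical.

```python
from typing import List, Optional, Tuple


def quantize_per_token_int8(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization of one token: scale = max|x| / 127 (assumed scheme)."""
    amax = max(abs(v) for v in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, scale


def combine_scales(a_scale: float, input_global_scale: Optional[float]) -> float:
    """Fold in the global scale only when it exists; the channelwise path has none."""
    if input_global_scale is not None:
        return a_scale * input_global_scale
    return a_scale


q, a_scale = quantize_per_token_int8([127.0, -64.0, 0.0])
channelwise = combine_scales(a_scale, None)  # no global scale: returned unchanged
with_global = combine_scales(a_scale, 0.5)   # global scale present: multiplied in
```

Without the guard, the multiplication would be attempted with `None` as an operand, which raises a `TypeError` in Python.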

3. `marlin.py` (MarlinLinearKernel)

Widen assert in process_weights_after_loading to allow int4 in W4A8 path
Add int4 packing in transform_w_q: convert [N, K] int8 (signed int4 values) → pack 8 values per int32 → transpose to [K//8, N] → pass through gptq_marlin_repack(is_a_8bit=True)
Add effective_wtype remap in apply_weights: pass uint4b8 to the kernel for signed int4 weights (the kernel interprets packed bits identically; the +8 bias is absorbed during packing)
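The packing step can be sketched in pure Python. Several details here are assumptions rather than facts from the PR: the nibble order (lowest nibble first along K), the reading of "the +8 bias is absorbed during packing" as adding 8 to each signed value before packing, and the helper names. The subsequent `gptq_marlin_repack` call is not reproduced.

```python
from typing import List


def pack_int4_rows(w: List[List[int]]) -> List[List[int]]:
    """Pack an [N, K] matrix of signed int4 values into [K // 8, N] 32-bit words."""
    n, k = len(w), len(w[0])
    assert k % 8 == 0, "K must be a multiple of 8"
    packed = [[0] * n for _ in range(k // 8)]
    for row in range(n):
        for col in range(k):
            s = w[row][col]
            assert -8 <= s <= 7, "value out of signed int4 range"
            u = s + 8  # assumed bias: signed int4 -> uint4b8 encoding in [0, 15]
            # Assumed layout: element col % 8 occupies nibble col % 8, lowest first.
            packed[col // 8][row] |= u << (4 * (col % 8))
    return packed


def unpack_word(word: int) -> List[int]:
    """Inverse of the packing for one word: eight nibbles back to signed int4."""
    return [((word >> (4 * i)) & 0xF) - 8 for i in range(8)]


w = [[-8, -7, -6, -5, -4, -3, -2, -1],
     [0, 1, 2, 3, 4, 5, 6, 7]]
packed = pack_int4_rows(w)  # shape [1, 2]: the transpose is built in
```

Python integers are unbounded, so the words above are shown as unsigned values; an int32 tensor would hold the same bit patterns. Because the stored nibbles already follow the uint4b8 encoding under this reading, passing `uint4b8` to the kernel for signed int4 weights is consistent with the remap described above.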

Testing

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B W4A8-INT (compressed_tensors):

| Metric | W4A16 baseline | W4A8-INT (this PR) | Improvement |
| --- | --- | --- | --- |
| Latency (bs=1, in=512, out=128) | 2.62 s | 0.96 s | 2.72x |
| Throughput (16 prompts) | 531 tok/s | 1137 tok/s | 2.14x |
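As a quick sanity check, the improvement ratios can be recomputed from the rounded figures in the table. The variable names are illustrative only.

```python
# Figures copied from the table; latency improves as baseline / new,
# throughput improves as new / baseline.
baseline_latency_s, w4a8_latency_s = 2.62, 0.96
baseline_tok_s, w4a8_tok_s = 531, 1137

latency_speedup = baseline_latency_s / w4a8_latency_s
throughput_gain = w4a8_tok_s / baseline_tok_s

print(f"latency speedup: {latency_speedup:.3f}x")
print(f"throughput gain: {throughput_gain:.3f}x")
```

From the rounded inputs the latency ratio comes out slightly above the reported 2.72x. One plausible explanation, which the source does not confirm, is that the reported value was computed from unrounded timings. The throughput ratio agrees with the reported 2.14x.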

Activation dtype confirmed as torch.int8 (verified via dtype print in apply_gptq_marlin_linear).

Kernel Path on SM89

CutlassW4A8LinearKernel: requires SM90 + FP8 activations — not applicable
MarlinLinearKernel: min SM75, supports INT8 activations — used on SM89
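The selection logic implied by these two constraints can be sketched as a capability filter. This is a simplified model, not vLLM's kernel chooser: `KernelSpec` and `choose_kernel` are hypothetical, "requires SM90" is treated as a minimum (an assumption that does not change the SM89 outcome), and only the activation types named above are listed.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class KernelSpec:
    name: str
    min_capability: int        # compute capability, e.g. 89 for SM89
    act_types: FrozenSet[str]  # activation types named in the PR notes


# Candidates in priority order, using only the constraints stated above.
CANDIDATES: List[KernelSpec] = [
    KernelSpec("CutlassW4A8LinearKernel", 90, frozenset({"fp8"})),
    KernelSpec("MarlinLinearKernel", 75, frozenset({"int8"})),
]


def choose_kernel(capability: int, act_type: str) -> Optional[str]:
    """Return the first candidate whose hardware and activation needs are met."""
    for spec in CANDIDATES:
        if capability >= spec.min_capability and act_type in spec.act_types:
            return spec.name
    return None


selected_on_l40 = choose_kernel(89, "int8")
print(selected_on_l40)
```

On SM89 the CUTLASS candidate fails the capability check, so the Marlin kernel is the only remaining option for INT8 activations.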

Changed files

vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +20/-4)
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +9/-4)

PR #38066: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Repository: vllm-project/vllm
Author: yeshihai
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38066

Changed files

vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +24/-4)
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +10/-5)

Impact

Any W4A8-INT model saved with int4 weight type (signed) cannot be loaded. This includes models quantized via LLM Compressor's compressed_tensors format.

extent analysis

Fix Plan

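The issue proposes two minimal changes, one to query_marlin_supported_quant_types and one to process_weights_after_loading. The linked PR goes further (int8 activation quantization, int4 packing in transform_w_q, and a weight-type remap in apply_weights), so these two steps alone may not be sufficient: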

Update query_marlin_supported_quant_types to include scalar_types.int4:

res = [scalar_types.uint4b8, scalar_types.uint8b128, scalar_types.int4]

Update process_weights_after_loading to allow scalar_types.int4 weight type:

assert c.weight_type in (scalar_types.uint4b8, scalar_types.int4)

Verification

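To verify the fix, load a W4A8-INT model with int4 weight type and confirm that it deploys, i.e. that MarlinLinearKernel.can_implement() no longer returns False and a kernel is selected. The PR also confirmed the activation dtype as torch.int8 with a dtype print in apply_gptq_marlin_linear.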

Extra Tips

The fix was tested only on L40 (SM89, Ada Lovelace) with Qwen2.5-7B, so test it on other hardware configurations as well.
See the GitHub branch https://github.com/yeshihai/vllm/tree/feat/marlin-w4a8-int for the full set of changes.
