vllm - ✅(Solved) Fix [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable [2 pull requests, 1 participants]

vllm-project/vllm#38063 · Fetched 2026-04-08 01:26:52
Comments: 0 · Participants: 1 · Timeline events: 3 (cross-referenced ×3) · Reactions: 0

When loading a compressed_tensors W4A8-INT model with int4 weight type (e.g., quantized with LLM Compressor), vLLM fails to deploy because scalar_types.int4 is not included in the Marlin kernel's supported quantization types.

Root Cause

query_marlin_supported_quant_types in marlin_utils.py only returns [scalar_types.uint4b8, scalar_types.uint8b128] for the non-zero-point path. scalar_types.int4 is missing, so MarlinLinearKernel.can_implement() returns False and no kernel is selected.

# Current (broken)
res = [scalar_types.uint4b8, scalar_types.uint8b128]

# Fix
res = [scalar_types.uint4b8, scalar_types.uint8b128, scalar_types.int4]

Additionally, process_weights_after_loading has an assert that only allows uint4b8 in the W4A8 path:

# Current (broken)
assert c.weight_type == scalar_types.uint4b8

# Fix
assert c.weight_type in (scalar_types.uint4b8, scalar_types.int4)
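
How the supported-type list gates kernel selection can be sketched in isolation. The snippet below is a minimal stand-in, not vLLM code: the string constants and the functions `supported_quant_types`, `can_implement`, and `check_w4a8_weight_type` are hypothetical simplifications of `scalar_types`, `query_marlin_supported_quant_types`, `MarlinLinearKernel.can_implement()`, and the assert in `process_weights_after_loading`.

```python
# Simplified stand-ins (assumption): the real scalar types are objects, not strings.
UINT4B8, UINT8B128, INT4 = "uint4b8", "uint8b128", "int4"


def supported_quant_types(include_int4: bool) -> list[str]:
    """Mimic the non-zero-point branch of the supported-type query."""
    res = [UINT4B8, UINT8B128]
    if include_int4:  # the proposed fix appends the signed int4 type
        res.append(INT4)
    return res


def can_implement(weight_type: str, include_int4: bool) -> bool:
    """A kernel is eligible only if the weight type is in its supported list."""
    return weight_type in supported_quant_types(include_int4)


def check_w4a8_weight_type(weight_type: str) -> None:
    """Widened W4A8 check: accept biased-unsigned and signed 4-bit weights."""
    assert weight_type in (UINT4B8, INT4), f"unsupported W4A8 weight type: {weight_type}"


before = can_implement(INT4, include_int4=False)  # list without int4
after = can_implement(INT4, include_int4=True)    # list with int4 appended
print(before, after)
```

With the original list the int4 lookup fails, so no kernel would be chosen. With the extended list the lookup succeeds and the widened assert no longer rejects the type.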

Fix Action

Fix

See: https://github.com/yeshihai/vllm/tree/feat/marlin-w4a8-int

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B.

PR fix notes

PR #1: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Description (problem / solution / changelog)

Summary

Fixes two bugs that together made the W4A8-INT compressed_tensors path completely non-functional:

  1. scalar_types.int4 not recognized → W4A8-INT models with signed int4 weights cannot be deployed at all
  2. Activations never quantized to int8 → even when deployment succeeds (uint4b8 weights), the kernel silently runs W4A16

Fixes #38063, Fixes #38064

Root Causes & Changes

1. compressed_tensors_w4a8_int.py

  • act_type=params_dtype (bf16) → act_type=torch.int8
  • Add weight_type=self.quant_type to MPLinearLayerConfig (was missing entirely)
  • Fix group_size passthrough for channelwise quantization

2. marlin_utils.py

  • Add scalar_types.int4 to query_marlin_supported_quant_types so Marlin can be selected for signed int4 weights
  • Widen assert in apply_gptq_marlin_linear to allow int4 alongside uint4b8
  • Guard a_scales * input_global_scale behind if input_global_scale is not None (channelwise path has no global scale)
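
The activation-side changes can be illustrated with a small pure-Python sketch. It assumes symmetric per-token int8 quantization, which the PR notes do not spell out, and the helper names `quantize_per_token_int8` and `effective_scale` are hypothetical. Only the guarded multiply by `input_global_scale` is taken from the notes above.

```python
from typing import Optional


def quantize_per_token_int8(row: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization of one token's activations (assumed scheme)."""
    max_abs = max(abs(v) for v in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in row]
    return q, scale


def effective_scale(a_scale: float, input_global_scale: Optional[float]) -> float:
    """Fold in the global scale only when one exists (the guarded multiply)."""
    if input_global_scale is not None:
        return a_scale * input_global_scale
    return a_scale  # channelwise path: no global scale to apply


q, s = quantize_per_token_int8([0.4, -1.0, 0.25])
print(q, effective_scale(s, None), effective_scale(s, 2.0))
```

Without the guard, the channelwise path would attempt to multiply by `None` and fail with a type error.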

3. marlin.py (MarlinLinearKernel)

  • Widen assert in process_weights_after_loading to allow int4 in W4A8 path
  • Add int4 packing in transform_w_q: convert [N, K] int8 (signed int4 values) → pack 8 values per int32 → transpose to [K//8, N] → pass through gptq_marlin_repack(is_a_8bit=True)
  • Add effective_wtype remap in apply_weights: pass uint4b8 to the kernel for signed int4 weights (the kernel interprets packed bits identically; the +8 bias is absorbed during packing)
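
The packing step described above can be sketched in pure Python. This is an illustration under stated assumptions, not the PR's code: the function names are hypothetical, the nibble order (element i of each group of 8 in bits 4*i to 4*i+3) is assumed, and words are plain unsigned Python integers rather than int32 tensors. The subsequent `gptq_marlin_repack` permutation is not modeled.

```python
def pack_signed_int4(w: list[list[int]]) -> list[list[int]]:
    """Pack an [N, K] matrix of signed int4 values into [K // 8, N] 32-bit words.

    Each value v in [-8, 7] is stored as the unsigned nibble v + 8, so a
    reader that decodes nibbles as uint4b8 (nibble - 8) recovers v.
    """
    n, k = len(w), len(w[0])
    assert k % 8 == 0, "K must be a multiple of 8"
    packed = [[0] * n for _ in range(k // 8)]
    for row in range(n):
        for col in range(k):
            v = w[row][col]
            assert -8 <= v <= 7, "value out of signed int4 range"
            nibble = v + 8  # the +8 bias is absorbed at packing time
            packed[col // 8][row] |= nibble << (4 * (col % 8))
    return packed


def unpack_uint4b8(packed: list[list[int]]) -> list[list[int]]:
    """Inverse: decode nibbles as uint4b8 and restore the [N, K] layout."""
    k8, n = len(packed), len(packed[0])
    out = [[0] * (k8 * 8) for _ in range(n)]
    for g in range(k8):
        for row in range(n):
            for i in range(8):
                out[row][g * 8 + i] = ((packed[g][row] >> (4 * i)) & 0xF) - 8
    return out


weights = [[-8, -1, 0, 1, 7, 2, -3, 4], [0] * 8]
packed = pack_signed_int4(weights)
print([[hex(x) for x in r] for r in packed])
```

The round trip through `unpack_uint4b8` shows why the kernel can be handed `uint4b8` for signed int4 weights under this encoding: the biased nibbles decode back to the original signed values.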

Testing

Tested on L40 (SM89, Ada Lovelace) with Qwen2.5-7B W4A8-INT (compressed_tensors):

Metric                           | W4A16 baseline | W4A8-INT (this PR) | Improvement
Latency (bs=1, in=512, out=128)  | 2.62 s         | 0.96 s             | 2.72x
Throughput (16 prompts)          | 531 tok/s      | 1137 tok/s         | 2.14x

Activation dtype confirmed as torch.int8 (verified via dtype print in apply_gptq_marlin_linear).

Kernel Path on SM89

  • CutlassW4A8LinearKernel: requires SM90 + FP8 activations — not applicable
  • MarlinLinearKernel: min SM75, supports INT8 activations — used on SM89
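
The kernel choice on SM89 follows directly from the two requirements listed above. The sketch below encodes them in a hypothetical `select_w4a8_kernel` helper. It is not vLLM's selection logic, and it assumes "requires SM90" means SM90 or newer.

```python
from typing import Optional


def select_w4a8_kernel(sm: int, act_type: str) -> Optional[str]:
    """Return the first kernel whose stated requirements are met, else None."""
    # Cutlass path: SM90 (treated as a minimum here, an assumption) + FP8 activations.
    if sm >= 90 and act_type == "fp8":
        return "CutlassW4A8LinearKernel"
    # Marlin path: minimum SM75, supports INT8 activations.
    if sm >= 75 and act_type == "int8":
        return "MarlinLinearKernel"
    return None


print(select_w4a8_kernel(89, "int8"))
```

For an L40 (SM89) with int8 activations, only the Marlin branch matches, which is consistent with the path exercised in the test results.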

Changed files

  • vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +20/-4)
  • vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
  • vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +9/-4)

PR #38066: feat(quantization): fix W4A8-INT activation quantization and int4 support in Marlin kernel

Description, root causes, changes, and test results are identical to PR #1 above.
Changed files

  • vllm/model_executor/kernels/linear/mixed_precision/marlin.py (modified, +24/-4)
  • vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a8_int.py (modified, +2/-2)
  • vllm/model_executor/layers/quantization/utils/marlin_utils.py (modified, +10/-5)


Impact

Any W4A8-INT model saved with int4 weight type (signed) cannot be loaded. This includes models quantized via LLM Compressor's compressed_tensors format.


Extent Analysis

Fix Plan

To fix the issue, we need to update the query_marlin_supported_quant_types function and the process_weights_after_loading function. Here are the steps:

  • Update query_marlin_supported_quant_types to include scalar_types.int4:
      res = [scalar_types.uint4b8, scalar_types.uint8b128, scalar_types.int4]
  • Update process_weights_after_loading to allow the scalar_types.int4 weight type:
      assert c.weight_type in (scalar_types.uint4b8, scalar_types.int4)

Verification

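To verify the fix, load a W4A8-INT model with int4 weight type and confirm that MarlinLinearKernel is selected and the model deploys. The PR additionally confirmed that activations reach apply_gptq_marlin_linear as torch.int8.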
