vllm - [RFC] Support Intel ARK Toolkit for AutoRound Quantization on Intel Platforms [1 comment, 2 participants]

Summary

Intel ARK (Auto Round Kernel) is an advanced toolkit designed to improve LLM quantization support on Intel platforms.

This RFC proposes adopting the Intel ARK toolkit in vLLM to handle the supported AutoRound format on Intel platforms.

Motivation

Integrating Intel ARK natively into vLLM is motivated by two related needs:

1. An Advanced Toolkit specifically tailored for Intel AutoRound Quantization.

vLLM already supports generic quantization formats and vendor-owned quantization toolkits. For example, AMD Quark gives the AMD ecosystem a first-class vendor quantization toolkit that simplifies and enhances quantization in vLLM. Users deploying on Intel platforms need a comparable, seamless path for Intel's own quantization workflow, especially for checkpoints produced by Intel AutoRound.

ARK fills that exact role for Intel platforms, similar to how Quark does for AMD. By providing unified quantized Linear and MoE computational capabilities, ARK bridges the quantization workflow gap between vLLM and AutoRound.

2. Native Support for the AutoRound Format on Intel Platforms.

While vLLM provides excellent abstractions for generalized formats (such as Compressed Tensors), Intel's native AutoRound format has unique layout and algorithmic characteristics designed to maximize performance on Intel hardware. Integrating ARK provides highly optimized kernel execution specifically tailored for the AutoRound format. This natively unlocks the advanced matrix capabilities of Intel platforms.

Goals

Integrate ARK into vLLM as the quantization toolkit for models in the supported AutoRound format.
Integrate more quantization schemes, in line with the Intel Neural Compressor (INC) quantization roadmap in #37979.

Build and Dependencies

Add auto-round-lib to requirements/xpu.txt.
Keep ARK aligned with the oneAPI stack already used in the mainstream Intel XPU path to minimize compatibility risks and maintenance overhead within vLLM.
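Because ARK would arrive as an extra package in the XPU requirements, an integration may want to check whether the module is present before selecting the ARK path. The following is a minimal sketch of such a check using only the standard library. The helper name `ark_available` is hypothetical and is not part of vLLM or AutoRound; the module path is the one used in the import examples in this RFC.

```python
import importlib.util


def ark_available(module_name="auto_round_extension.ark"):
    """Return True if the given module can be located on this system.

    Hypothetical helper for illustration; not a vLLM or AutoRound API.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec imports parent packages for dotted names, so a missing
        # parent raises ModuleNotFoundError (a subclass of ImportError).
        return False
```

A caller could use the result to fall back to an existing quantized linear path when the package is not installed.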

API

The integration introduces a clean, unified Python module exposed via PyBind11. The API design strictly separates weight preprocessing from the execution hot-path, seamlessly plugging into vLLM's standard LinearMethodBase abstraction.

https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/qlinear.py

import torch  # needed for torch.bfloat16 in the signature below

# The current import surface is:
from auto_round_extension.ark import QuantLinear

# QuantLinear is the base ARK integer weight-only linear layer and the main
# class for the generic ARK integer quantized linear path. It is intended for
# packed integer weights with per-group scales and optional zero-points.
# Constructor signature (a listing of parameters, not a runnable call):
QuantLinear(
    bits,
    group_size,
    infeatures,
    outfeatures,
    bias,
    kernel_switch_threshold=128,
    trainable=False,
    weight_dtype=torch.bfloat16,
    **kwargs,
)
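To make "integer weights with per-group scales and optional zero-points" concrete, here is a small dependency-free sketch of asymmetric group-wise quantization for a single weight row. It only illustrates the arithmetic. It is not ARK's packing layout, rounding scheme, or kernel, and the function names `quantize_row` and `dequantize_row` are hypothetical.

```python
# Illustrative only: NOT ARK's storage format or algorithm.

def quantize_row(weights, bits=4, group_size=4):
    """Quantize a list of floats group by group.

    Returns (codes, scales, zero_points), with one scale and one zero-point
    per group of `group_size` consecutive weights.
    """
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    codes, scales, zero_points = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Include 0.0 in the range so the zero-point is a valid code.
        lo = min(min(group), 0.0)
        hi = max(max(group), 0.0)
        scale = (hi - lo) / qmax or 1.0  # avoid a zero scale for all-zero groups
        zp = max(0, min(qmax, round(-lo / scale)))
        scales.append(scale)
        zero_points.append(zp)
        codes.extend(max(0, min(qmax, round(w / scale) + zp)) for w in group)
    return codes, scales, zero_points


def dequantize_row(codes, scales, zero_points, group_size=4):
    """Reconstruct approximate floats: w ~= (code - zero_point) * scale."""
    return [
        (q - zero_points[i // group_size]) * scales[i // group_size]
        for i, q in enumerate(codes)
    ]
```

With `bits=4` every code fits in the range 0 to 15, and each reconstructed weight differs from the original by at most one quantization step of its group. A real weight-only kernel would additionally pack several codes into one machine word and fuse the dequantization into the matrix multiply.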

A simple and stable Python contract for ARK quantized linear layers:

1. Construct the layer.
2. Load the state.
3. Call post_init.
4. Run forward.

import torch
from auto_round_extension.ark import QuantLinear
from auto_round_extension.ark.qlinear import ark_post_init

class DemoModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Construct: 4-bit weights with group_size=128.
        self.fc1 = QuantLinear(
            bits=4,
            group_size=128,
            infeatures=4096,
            outfeatures=4096,
            bias=True,
        )

    def forward(self, x):
        return self.fc1(x)

model = DemoModel()
# Load state: state_dict holds the quantized checkpoint tensors and must be
# obtained beforehand (not shown here).
model.load_state_dict(state_dict, strict=False)
# Post-init: call once the weights are loaded.
model = ark_post_init(model)

# Forward: input of shape (1, 16, 4096) on CPU.
x = torch.randn(1, 16, 4096, device="cpu", dtype=torch.float32)
y = model(x)
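The reason for a separate post_init step is the split between one-time weight preprocessing and the execution hot path described at the start of this section. The toy class below sketches that split in plain Python. Every name in it (`ToyQuantLinear`, `load_state`, `post_init`, the `qweight` and `scale` keys) is hypothetical and chosen for illustration; it does not reproduce ARK's or vLLM's classes.

```python
# Hypothetical sketch of the construct -> load -> post_init -> forward flow.

class ToyQuantLinear:
    def __init__(self, infeatures, outfeatures):
        # Step 1: construct. Only shapes are known at this point.
        self.infeatures = infeatures
        self.outfeatures = outfeatures
        self.qweight = None     # checkpoint layout, filled by load_state
        self.scale = None
        self._prepared = None   # compute-friendly layout, filled by post_init

    def load_state(self, state):
        # Step 2: copy checkpoint data in as-is; no layout work happens here.
        self.qweight = state["qweight"]  # [outfeatures][infeatures] integer codes
        self.scale = state["scale"]      # one scale per output row
        return self

    def post_init(self):
        # Step 3: one-time preprocessing. Here: transpose to [in][out] and
        # fold the scale in, so forward() does no per-call conversion.
        self._prepared = [
            [self.qweight[o][i] * self.scale[o] for o in range(self.outfeatures)]
            for i in range(self.infeatures)
        ]
        return self

    def forward(self, x):
        # Step 4: hot path. It only reads the prepared layout.
        if self._prepared is None:
            raise RuntimeError("call post_init() after loading weights")
        return [
            sum(x[i] * self._prepared[i][o] for i in range(self.infeatures))
            for o in range(self.outfeatures)
        ]
```

In vLLM, this kind of separation is commonly expressed by doing the one-time work in a quantization method's `process_weights_after_loading` hook and keeping `apply` lean; how the ARK integration maps onto those hooks is up to the PR and is not specified here.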

User-Facing Behavior

The default user experience remains unchanged. Users continue to load AutoRound or INC checkpoints the same way they do today.

python3 examples/basic/offline_inference/generate.py \
  --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc \
  --block-size 64 \
  --enforce-eager \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.5
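For readers who drive vLLM from Python rather than through the example script, the same settings can be passed as keyword arguments to `vllm.LLM`. This is a sketch under the assumption that the CLI flags above map one-to-one onto the engine arguments of the same name; the `ENGINE_KWARGS` dictionary and the `generate` helper are illustrative names, not part of the RFC.

```python
# Sketch: the CLI flags above expressed as keyword arguments for vllm.LLM.
ENGINE_KWARGS = {
    "model": "OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc",
    "block_size": 64,
    "enforce_eager": True,
    "max_model_len": 8192,
    "gpu_memory_utilization": 0.5,
}


def generate(prompts, max_tokens=32):
    """Run offline generation and return one completion string per prompt."""
    # Imported lazily so this module can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_KWARGS)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=max_tokens))
    return [out.outputs[0].text for out in outputs]
```

Calling `generate(["Hello, my name is"])` on a machine with vLLM and a supported Intel device would load the checkpoint and return the generated text. No ARK-specific argument appears, which matches the stated goal that the user experience stays unchanged.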

Implementation Plan & Milestones

Phase 1: vLLM Integration PR (WIP). Integrate ARK into vLLM to support the AutoRound W4A16 linear path, starting with the initial PR #39778.
Phase 2: Align with INC Refactor (WIP). Track the upstream INC refactor in #40601 and adapt the ARK integration so that it fits the modular package structure available at merge time.

Feedback Period.

No response

CC List.

@chensuyue @yiliu30 @luoyu-intel @lvliang-intel

extent analysis

TL;DR

Integrate the Intel ARK toolkit into vLLM to support the AutoRound format on Intel platforms by following the proposed implementation plan and milestones.

Guidance

Review the implementation plan and milestones to ensure a smooth integration of ARK into vLLM.
Verify that the ARK integration aligns with the Intel Neural Compressor (INC) quantization roadmap.
Test the QuantLinear class with different input parameters to ensure correct functionality.
Ensure that the default user experience remains unchanged after integrating ARK into vLLM.

Example

import torch
from auto_round_extension.ark import QuantLinear

# Create a QuantLinear instance
fc1 = QuantLinear(
    bits=4,
    group_size=128,
    infeatures=4096,
    outfeatures=4096,
    bias=True,
)

# Test the QuantLinear instance.
# Per the contract in the API section, load the quantized weights and call
# post_init before relying on the output of a forward pass.
x = torch.randn(1, 16, 4096, device="cpu", dtype=torch.float32)
y = fc1(x)

Notes

The implementation plan and milestones should be followed carefully to ensure a successful integration of ARK into vLLM. Additionally, the QuantLinear class should be thoroughly tested to ensure correct functionality.

Recommendation

Apply the proposed implementation plan and milestones to integrate ARK into vLLM, as it provides a clear and structured approach to supporting the AutoRound format on Intel platforms.
