vllm: [RFC] Support Intel ARK Toolkit for AutoRound Quantization on Intel Platforms (1 comment, 2 participants)

vllm-project/vllm#40675 (fetched 2026-04-24 05:52:12)
Summary

Intel ARK (Auto Round Kernel) is an advanced toolkit designed to improve LLM quantization support on Intel platforms.

This RFC proposes adopting the Intel ARK toolkit for supported AutoRound format on Intel platforms in vLLM.

Motivation

Integrating Intel ARK natively into vLLM is motivated by two related needs:

1. An Advanced Toolkit specifically tailored for Intel AutoRound Quantization.

vLLM already supports generic quantization formats as well as vendor-owned quantization toolkits. For example, AMD Quark gives the AMD ecosystem a first-class vendor toolkit that simplifies quantized-model deployment in vLLM. Users deploying on Intel platforms need a comparable path for Intel's own quantization workflow, especially for checkpoints produced by Intel AutoRound.

ARK fills that role for Intel platforms, much as Quark does for AMD. By providing unified quantized Linear and MoE compute, ARK bridges the workflow gap between vLLM and AutoRound.

2. Native Support for the AutoRound Format on Intel Platforms

While vLLM provides good abstractions for generalized formats (such as Compressed Tensors), Intel's native AutoRound format has its own layout and algorithmic characteristics, designed for performance on Intel hardware. Integrating ARK provides kernels optimized specifically for the AutoRound format, so that vLLM can use the matrix-compute capabilities of Intel platforms directly.

Goals

  • Integrate ARK into vLLM to serve as the quantization toolkit for models in the supported AutoRound format.

  • Integrate more quantization schemes to align with the Intel Neural Compressor (INC) quantization roadmap in #37979.

Build and Dependencies

  • Add auto-round-lib to requirements/xpu.txt.

  • Keep ARK aligned with the oneAPI stack already used in the mainstream Intel XPU path to minimize compatibility risks and maintenance overhead within vLLM.
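
Because the ARK module would arrive through an extra requirement, it can help to check whether it is importable before selecting an ARK code path. The following is a minimal sketch using only the standard library; the helper name `module_available` is hypothetical, and the module path is the import path quoted in the API section of this RFC.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located without fully importing it."""
    try:
        # For dotted names, find_spec imports the parent package first and
        # raises ModuleNotFoundError (an ImportError) if the parent is missing.
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    print("ARK extension importable:", module_available("auto_round_extension.ark"))
```

A check like this lets a caller fall back to another path or raise a clear error instead of failing at import time.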

API

The integration introduces a single Python module exposed via PyBind11. The API design separates weight preprocessing from the execution hot path and plugs into vLLM's standard LinearMethodBase abstraction.

https://github.com/intel/auto-round/blob/main/auto_round_extension/ark/qlinear.py

# The current import surface is:
from auto_round_extension.ark import QuantLinear

# QuantLinear is the base ARK integer weight-only linear layer.
# It is intended for packed integer weights with per-group scales and
# optional zero-points, and is the main class for the generic ARK integer
# quantized linear path.
# The expected constructor signature (shown as a comment, not a runnable call) is:
#
#   QuantLinear(
#       bits,
#       group_size,
#       infeatures,
#       outfeatures,
#       bias,
#       kernel_switch_threshold=128,
#       trainable=False,
#       weight_dtype=torch.bfloat16,
#       **kwargs,
#   )
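
To make "integer weights with per-group scales and optional zero-points" concrete, here is a minimal pure-Python sketch of group-wise weight-only quantization. It illustrates the general scheme only: it is not ARK's kernel, it does not reproduce ARK's packing layout, and all function names are hypothetical.

```python
def quantize_group(values, bits=4):
    """Quantize one group of floats to unsigned integer codes.

    Returns (codes, scale, zero_point). The range is widened to include 0.0
    so that the zero-point itself fits in [0, 2**bits - 1].
    """
    qmax = (1 << bits) - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size, bits=4):
    """Split one weight row into groups, each with its own scale and zero-point."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Reassemble a row from its quantized groups."""
    out = []
    for codes, scale, zero_point in groups:
        out.extend(dequantize_group(codes, scale, zero_point))
    return out
```

With `bits=4` and `group_size=128`, as in the demo below, every 128 consecutive input features of a row would share one scale and one zero-point. A production kernel would additionally pack several 4-bit codes into each stored integer.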

A simple and stable Python contract for ARK quantized linear layers:

  • construct
  • load state
  • call post_init
  • run forward

import torch
from auto_round_extension.ark import QuantLinear
from auto_round_extension.ark.qlinear import ark_post_init

class DemoModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = QuantLinear(
            bits=4,
            group_size=128,
            infeatures=4096,
            outfeatures=4096,
            bias=True,
        )

    def forward(self, x):
        x = self.fc1(x)
        return x

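model = DemoModel()
# `state_dict` is assumed to hold the quantized checkpoint tensors, loaded beforehand.
# strict=False tolerates key mismatches between the checkpoint and the module.
model.load_state_dict(state_dict, strict=False)
# Run ARK post-initialization once, after loading and before the first forward call.
model = ark_post_init(model)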

x = torch.randn(1, 16, 4096, device="cpu", dtype=torch.float32)
y = model(x)
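
The separation between weight preprocessing and the execution hot path described in this section can be pictured with a small stand-in. This sketch is an assumption about the shape of the integration, not code from the PR: `ToyLayer` and `ToyLinearMethod` are hypothetical, the method names are chosen to resemble vLLM's quantization-method hooks, and plain dequantized floats stand in for whatever layout a real kernel would prepare.

```python
class ToyLayer:
    """Hypothetical stand-in for a layer object holding quantized tensors."""

    def __init__(self, codes, scales, zero_points, group_size):
        self.codes = codes              # codes[out][in], small unsigned ints
        self.scales = scales            # scales[out][group]
        self.zero_points = zero_points  # zero_points[out][group]
        self.group_size = group_size
        self.prepared_weight = None     # filled once after loading


class ToyLinearMethod:
    """Hypothetical method object: one-time preprocessing vs. per-call apply."""

    def process_weights_after_loading(self, layer):
        # One-time step: convert the checkpoint layout into the layout the
        # forward pass consumes (here: plain dequantized floats).
        g = layer.group_size
        layer.prepared_weight = [
            [
                (code - layer.zero_points[o][i // g]) * layer.scales[o][i // g]
                for i, code in enumerate(row)
            ]
            for o, row in enumerate(layer.codes)
        ]

    def apply(self, layer, x):
        # Hot path: no layout work here, only the matrix-vector product.
        if layer.prepared_weight is None:
            raise RuntimeError("call process_weights_after_loading first")
        return [sum(w * v for w, v in zip(row, x)) for row in layer.prepared_weight]
```

Keeping the layout conversion out of `apply` means its cost is paid once per model load rather than once per forward call, which is the role `ark_post_init` plays in the contract above.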

User-Facing Behavior

The default user experience remains unchanged. Users continue to load AutoRound or INC checkpoints the same way they do today.

python3 examples/basic/offline_inference/generate.py \
  --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc \
  --block-size 64 \
  --enforce-eager \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.5
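
The same settings can also be passed through vLLM's Python API. This is a minimal sketch, assuming the CLI flags above map to identically named `LLM` constructor arguments and that a vLLM build with the relevant Intel backend is installed; the helper names, the prompt, and the sampling settings are placeholders.

```python
def engine_kwargs():
    """Engine settings mirroring the command-line flags above."""
    return {
        "model": "OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc",
        "block_size": 64,
        "enforce_eager": True,
        "max_model_len": 8192,
        "gpu_memory_utilization": 0.5,
    }


def generate(prompts, max_tokens=32):
    """Run offline generation; vLLM is imported lazily so that the settings
    above can be inspected on machines without vLLM installed."""
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs())
    outputs = llm.generate(prompts, SamplingParams(max_tokens=max_tokens))
    return [out.outputs[0].text for out in outputs]


if __name__ == "__main__":
    for text in generate(["Briefly explain weight-only quantization."]):
        print(text)
```

Nothing in the call mentions ARK, which matches the stated goal that the user experience stays unchanged.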

Implementation Plan & Milestones

  • Phase 1: vLLM Integration PR (WIP). Integrate ARK into vLLM to support the AutoRound W4A16 linear path, starting with the initial PR #39778.

  • Phase 2: Align with INC Refactor (WIP). Track the upstream INC refactor in #40601 and adapt the ARK integration so that it fits the modular package structure available at merge time.

Feedback Period.

No response

CC List.

@chensuyue @yiliu30 @luoyu-intel @lvliang-intel

extent analysis

TL;DR

Integrate the Intel ARK toolkit into vLLM to support the AutoRound format on Intel platforms by following the proposed implementation plan and milestones.

Guidance

  • Review the implementation plan and milestones to ensure a smooth integration of ARK into vLLM.
  • Verify that the ARK integration aligns with the Intel Neural Compressor (INC) quantization roadmap.
  • Test the QuantLinear class with different input parameters to ensure correct functionality.
  • Ensure that the default user experience remains unchanged after integrating ARK into vLLM.

Example

import torch
from auto_round_extension.ark import QuantLinear

# Create a QuantLinear instance
fc1 = QuantLinear(
    bits=4,
    group_size=128,
    infeatures=4096,
    outfeatures=4096,
    bias=True,
)

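# Test the QuantLinear instance.
# Note: per the contract in the API section, load the checkpoint state and
# call ark_post_init before this forward call; a freshly constructed layer
# carries no checkpoint weights.
x = torch.randn(1, 16, 4096, device="cpu", dtype=torch.float32)
y = fc1(x)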

Notes

The implementation plan and milestones should be followed carefully to ensure a successful integration of ARK into vLLM. Additionally, the QuantLinear class should be thoroughly tested to ensure correct functionality.

Recommendation

Apply the proposed implementation plan and milestones to integrate ARK into vLLM, as it provides a clear and structured approach to supporting the AutoRound format on Intel platforms.
