vllm - ✅(Solved) Fix [RFC]: Intel Quantization Support Roadmap (H1 2026) [2 pull requests, 1 comment, 1 participant]

vllm · 2026-03-24 08:14:22

GitHub stats

vllm-project/vllm#37979 • Fetched 2026-04-08 01:22:15

Author

yiliu30

Participants

yiliu30

Timeline (top)

subscribed ×10, mentioned ×8, commented ×1, cross-referenced ×1

Fix Action

Fix / Workaround

Broad scheme coverage — support the quantization formats that matter for Intel hardware (wNa16 INT, w8a16 FP8), for both Linear and MoE layers, on both XPU and CPU.
Architectural cleanup — decouple the quant_method dispatch logic from INCConfig so INC acts purely as a config translator, not a kernel router.

PR fix notes

PR #37986: [Quantization][Autoround][XPU] Add `W4A16` Support

Repository: vllm-project/vllm
Author: yiliu30
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37986

Description (problem / solution / changelog)

Purpose

This PR restores AutoRound W4A16 support on XPU. It is part of https://github.com/vllm-project/vllm/issues/37979.

Test Plan

Test

python3 examples/basic/offline_inference/generate.py   \
    --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc   \
    --block-size 64 --enforce-eager --max-model-len 4096  --gpu-memory-utilization  0.8

--------------------------------------------------
Prompt: 'Hello, my name is'
Generated text: " Kaitlynn and I'm a 16-year-old high school student"
--------------------------------------------------
Prompt: 'The president of the United States is'
Generated text: ' a person. The president of the United States is an official in charge of the'
--------------------------------------------------
Prompt: 'The capital of France is'
Generated text: " Paris. It is the largest city in Europe, and it's a beautiful place"
--------------------------------------------------
Prompt: 'The future of AI is'
Generated text: ' not just about the technology, but also about how it will be used in our'
--------------------------------------------------

cc @hshen14 @thuang6 @wenhuach21 @Zhenzhong1

Changed files

.buildkite/scripts/hardware_ci/run-xpu-test.sh (modified, +1/-0)
vllm/model_executor/layers/quantization/inc.py (modified, +164/-11)

PR #39778: [Quantization][Autoround][ARK] Add W4A16 Support

Repository: vllm-project/vllm
Author: Zhenzhong1
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39778

Description (problem / solution / changelog)

Motivation

We want to introduce robust native XPU/CPU support for the INT4/FP8 AutoRound format via Intel's auto-round-kernel (ARK), enabling efficient and stable inference for AutoRound-quantized LLMs.

Plan

This PR is part of RFC: Intel Quantization Support Roadmap

ARK will be introduced in stages. Stage 1: enable the code via auto-round-lib. Stage 2: collect API feedback and plan to release the source code soon.

Test

python3 examples/basic/offline_inference/generate.py   \
    --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc   \
    --block-size 64 --enforce-eager --max-model-len 4096  --gpu-memory-utilization  0.8

Changed files

vllm/model_executor/layers/quantization/inc.py (modified, +250/-31)

Motivation

Previously, we merged auto_round.py into inc.py as the unified Intel quantization backend. The vllm-xpu-kernels replacement for IPEX is a work in progress. This H1 2026 roadmap covers completing the consolidation, migrating from IPEX to vllm-xpu-kernels, and extending quantization scheme coverage on Intel CPU and XPU platforms.

Goals

Broad scheme coverage — support the quantization formats that matter for Intel hardware (wNa16 INT, w8a16 FP8), for both Linear and MoE layers, on both XPU and CPU.
Architectural cleanup — decouple the quant_method dispatch logic from INCConfig so INC acts purely as a config translator, not a kernel router.

1. Extend Quantization Scheme Coverage

Expand Intel platform support for the quantization schemes needed by production workloads.

1a. wNa16 (INT) — Weight-Only Integer Quantization

XPU: Linear (W4A16), MoE (W4A16)
CPU: Linear (W4A16), MoE (W4A16)
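
For readers unfamiliar with the wNa16 naming: weights are stored as N-bit integers while activations stay in a 16-bit float type. The sketch below illustrates that storage idea with plain symmetric round-to-nearest quantization of one weight group. It is a generic illustration only, not the AutoRound algorithm and not the kernel path used by vLLM; the function names and the sample group are made up for this example.

```python
def quantize_group_sym(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # One scale per group; fall back to 1.0 for an all-zero group.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate weights from integer codes and the group scale."""
    return [c * scale for c in codes]


# Hypothetical weight group, chosen only for illustration.
group = [0.42, -0.17, 0.05, -0.63, 0.31, 0.0, -0.08, 0.29]
codes, scale = quantize_group_sym(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))
```

With round-to-nearest and no clipping, the per-weight reconstruction error is bounded by half the group scale, which is why smaller groups (and therefore more scales) generally lower the error at the cost of extra metadata.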

1b. w8a16 (FP8) — FP8 Weight-Only Quantization

XPU: Linear (W8A16 FP8), MoE (W8A16 FP8)
CPU: Linear (W8A16 FP8), MoE (W8A16 FP8)

Note: Some schemes may depend on kernel readiness.

2. Architectural Cleanup

Decouple the quant_method dispatch logic from INCConfig. Today, INCConfig.get_quant_method() contains per-backend routing that duplicates logic already in GPTQ/AWQ/Marlin configs:

get_quant_method()
├── is_cpu/is_xpu/ipex?  → apply_ipex_quant_layer()   # dead code
├── gptq format?         → apply_gptq_quant_layer()    # duplicates GPTQConfig
└── awq format?          → apply_awq_quant_layer()     # duplicates AWQConfig

Proposed Refactoring

Replace the monolithic get_quant_method() with a two-level dispatch architecture, inspired by the compressed-tensors design:

Level 1 — Module-type method (AutoRoundQuantLinearMethod, AutoRoundMoEMethod): Implements the vLLM method interface (LinearMethodBase / FusedMoEMethodBase). A static factory (get_method() / get_moe_method()) resolves the per-layer quantization scheme and delegates to the correct Level 2 impl.

Level 2 — Scheme-specific impl (e.g. AutoRoundWNA16LinearImpl, AutoRoundFP8LinearImpl): Implements an abstract AutoRoundQuantImpl base class that defines create_weights(), process_weights_after_loading(), and apply_weights(). Each impl owns a single quantization scheme and its kernel calls.

INCConfig.get_quant_method(layer, prefix)
│
├── LinearBase  → AutoRoundQuantLinearMethod.get_method(config, layer, prefix)
│                   ├── wNa16 INT  → AutoRoundWNA16LinearImpl
│                   └── w8a16 FP8  → AutoRoundFP8LinearImpl
│
└── FusedMoE   → AutoRoundMoEMethod.get_moe_method(config, layer, prefix)
                    ├── wNa16 INT   → AutoRoundWNA16MoEImpl
                    └── w8a16 FP8   → AutoRoundFP8MoEImpl
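
The tree above can be sketched as a small, self-contained dispatcher that branches first on layer type and then on the per-layer scheme. This is a minimal sketch under stated assumptions, not the vLLM implementation: `LinearBase` and `FusedMoE` are empty stand-ins for the real vLLM classes, the Level 1 factories are collapsed into lookup tables, and `INCConfigSketch`, the scheme strings, and the `per_layer_scheme` override map are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


# Empty stand-ins for vLLM's layer classes (illustration only).
class LinearBase: ...
class FusedMoE: ...


# Empty stand-ins for the Level 2 impls named in the RFC.
class AutoRoundWNA16LinearImpl: ...
class AutoRoundFP8LinearImpl: ...
class AutoRoundWNA16MoEImpl: ...
class AutoRoundFP8MoEImpl: ...


# Second dispatch level: scheme -> impl, one table per layer type.
LINEAR_IMPLS = {
    "wNa16_int": AutoRoundWNA16LinearImpl,
    "w8a16_fp8": AutoRoundFP8LinearImpl,
}
MOE_IMPLS = {
    "wNa16_int": AutoRoundWNA16MoEImpl,
    "w8a16_fp8": AutoRoundFP8MoEImpl,
}


@dataclass
class INCConfigSketch:
    default_scheme: str
    # Hypothetical per-layer overrides keyed by layer prefix.
    per_layer_scheme: dict = field(default_factory=dict)

    def get_quant_method(self, layer, prefix):
        # First dispatch level: pick the table by layer type.
        if isinstance(layer, LinearBase):
            table = LINEAR_IMPLS
        elif isinstance(layer, FusedMoE):
            table = MOE_IMPLS
        else:
            return None  # layer types this config does not handle
        scheme = self.per_layer_scheme.get(prefix, self.default_scheme)
        if scheme not in table:
            raise ValueError(f"Unsupported scheme {scheme!r} for {prefix}")
        return table[scheme]()


cfg = INCConfigSketch(
    default_scheme="wNa16_int",
    per_layer_scheme={"model.layers.0.mlp.experts": "w8a16_fp8"},
)
linear_method = cfg.get_quant_method(LinearBase(), "model.layers.0.self_attn.qkv_proj")
moe_method = cfg.get_quant_method(FusedMoE(), "model.layers.0.mlp.experts")
```

The point of the split is that the config only translates "which scheme applies to this layer" into a choice of impl; it holds no kernel or backend routing of its own, which matches the RFC's goal of keeping INC a config translator.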

Feedback Period.

Please comment on the proposal or suggest alternatives. If there are no strong objections, we will proceed with the timeline above and submit implementation PRs. Thanks!

CC List.

cc @hshen14 @thuang6 @wenhuach21 @Zhenzhong1 @jikunshang @xinyu-intel @xuechendi cc @robertgshaw2-redhat

extent analysis

Fix Plan

To implement the proposed refactoring, follow these steps:

Create a two-level dispatch architecture:
- Level 1: Implement module-type methods (AutoRoundQuantLinearMethod, AutoRoundMoEMethod) that resolve the per-layer quantization scheme and delegate to the correct Level 2 implementation.
- Level 2: Implement scheme-specific implementations (e.g., AutoRoundWNA16LinearImpl, AutoRoundFP8LinearImpl) that own a single quantization scheme and its kernel calls.

Example code for Level 1:

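from vllm.model_executor.layers.fused_moe import FusedMoEMethodBase
from vllm.model_executor.layers.linear import LinearMethodBase

# Note: `config.quant_scheme` and its string values are placeholders; the RFC
# does not specify how the per-layer scheme is stored. The MoE impl classes
# are assumed to mirror the linear ones shown under Level 2.

class AutoRoundQuantLinearMethod(LinearMethodBase):
    @staticmethod
    def get_method(config, layer, prefix):
        # Resolve the scheme and hand off to the matching Level 2 impl.
        if config.quant_scheme == "wNa16 INT":
            return AutoRoundWNA16LinearImpl()
        if config.quant_scheme == "w8a16 FP8":
            return AutoRoundFP8LinearImpl()
        raise ValueError(f"Unsupported quantization scheme: {config.quant_scheme!r}")

class AutoRoundMoEMethod(FusedMoEMethodBase):
    @staticmethod
    def get_moe_method(config, layer, prefix):
        # Same resolution as above, but for fused MoE layers.
        if config.quant_scheme == "wNa16 INT":
            return AutoRoundWNA16MoEImpl()
        if config.quant_scheme == "w8a16 FP8":
            return AutoRoundFP8MoEImpl()
        raise ValueError(f"Unsupported quantization scheme: {config.quant_scheme!r}")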

Example code for Level 2:

from abc import ABC, abstractmethod

# Argument lists are elided here; the RFC names the methods but not their
# signatures.

class AutoRoundQuantImpl(ABC):
    """Base class for scheme-specific implementations (Level 2)."""

    @abstractmethod
    def create_weights(self):
        """Register the quantized weight parameters on the layer."""

    @abstractmethod
    def process_weights_after_loading(self):
        """Convert loaded weights into the layout the kernel expects."""

    @abstractmethod
    def apply_weights(self):
        """Run the quantized forward computation."""

class AutoRoundWNA16LinearImpl(AutoRoundQuantImpl):
    def create_weights(self):
        # TODO: allocate packed INT weights and their quantization parameters
        pass

    def process_weights_after_loading(self):
        # TODO: repack the INT weights for the target wNa16 kernel
        pass

    def apply_weights(self):
        # TODO: call the wNa16 linear kernel
        pass

class AutoRoundFP8LinearImpl(AutoRoundQuantImpl):
    def create_weights(self):
        # TODO: allocate FP8 weights and their scales
        pass

    def process_weights_after_loading(self):
        # TODO: convert the FP8 weights for the target w8a16 kernel
        pass

    def apply_weights(self):
        # TODO: call the w8a16 FP8 linear kernel
        pass

Verification

To verify the fix, test the refactored code with different quantization schemes and layer types. Ensure that the correct scheme-specific implementation is used for each layer and that the quantization is applied correctly.
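
One way to structure the dispatch part of that check is a table-driven unit test that walks every layer-type and scheme combination and also confirms that an unknown scheme is rejected. The sketch below only shows the test shape: `resolve_impl_name` is a hypothetical stand-in for the refactored factory, and the scheme strings (including the unsupported `"w4a4_fp4"`) are placeholders. In a real test the stand-in would be replaced by the actual dispatcher, and numerical correctness would still need a separate check on hardware.

```python
import unittest


def resolve_impl_name(layer_kind, scheme):
    """Hypothetical stand-in for the dispatcher under test."""
    prefix = {"wNa16_int": "AutoRoundWNA16", "w8a16_fp8": "AutoRoundFP8"}.get(scheme)
    suffix = {"linear": "LinearImpl", "moe": "MoEImpl"}.get(layer_kind)
    if prefix is None or suffix is None:
        raise ValueError(f"unsupported combination: {layer_kind}/{scheme}")
    return prefix + suffix


# Expected impl for every supported (layer type, scheme) pair in the RFC tree.
EXPECTED = {
    ("linear", "wNa16_int"): "AutoRoundWNA16LinearImpl",
    ("linear", "w8a16_fp8"): "AutoRoundFP8LinearImpl",
    ("moe", "wNa16_int"): "AutoRoundWNA16MoEImpl",
    ("moe", "w8a16_fp8"): "AutoRoundFP8MoEImpl",
}


class DispatchTest(unittest.TestCase):
    def test_every_supported_combination(self):
        for (kind, scheme), impl in EXPECTED.items():
            with self.subTest(kind=kind, scheme=scheme):
                self.assertEqual(resolve_impl_name(kind, scheme), impl)

    def test_unsupported_scheme_raises(self):
        with self.assertRaises(ValueError):
            resolve_impl_name("linear", "w4a4_fp4")
```

Saved as a module, this can be run with `python -m unittest`; using `subTest` means a failure reports exactly which combination was routed to the wrong impl.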

Extra Tips

Make sure to update the documentation to reflect the changes in the code.
Consider adding unit tests to cover the different quantization schemes and layer types.
Review the code
