vllm - ✅ (Solved) Fix [RFC]: Intel Quantization Support Roadmap (H1 2026) [2 pull requests, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#37979 · Fetched 2026-04-08 01:22:15
Comments: 1
Participants: 1
Timeline: 21
Reactions: 1
Timeline (top): subscribed ×10, mentioned ×8, commented ×1, cross-referenced ×1

Fix Action

Fix / Workaround

  1. Broad scheme coverage — support the quantization formats that matter for Intel hardware (wNa16 INT, w8a16 FP8), for both Linear and MoE layers, on both XPU and CPU.
  2. Architectural cleanup — decouple the quant_method dispatch logic from INCConfig so INC acts purely as a config translator, not a kernel router.

Decouple the quant_method dispatch logic from INCConfig. Today, INCConfig.get_quant_method() contains per-backend routing that duplicates logic already in the GPTQ, AWQ, and Marlin configs.

Replace the monolithic get_quant_method() with a two-level dispatch architecture, inspired by the compressed-tensors design.

PR fix notes

PR #37986: [Quantization][Autoround][XPU] Add W4A16 Support

Description (problem / solution / changelog)

Purpose

This PR restores AutoRound W4A16 support on XPU. It is part of https://github.com/vllm-project/vllm/issues/37979.

Test Plan

Test

python3 examples/basic/offline_inference/generate.py   \
    --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc   \
    --block-size 64 --enforce-eager --max-model-len 4096  --gpu-memory-utilization  0.8

--------------------------------------------------
Prompt: 'Hello, my name is'
Generated text: " Kaitlynn and I'm a 16-year-old high school student"
--------------------------------------------------
Prompt: 'The president of the United States is'
Generated text: ' a person. The president of the United States is an official in charge of the'
--------------------------------------------------
Prompt: 'The capital of France is'
Generated text: " Paris. It is the largest city in Europe, and it's a beautiful place"
--------------------------------------------------
Prompt: 'The future of AI is'
Generated text: ' not just about the technology, but also about how it will be used in our'
--------------------------------------------------

cc @hshen14 @thuang6 @wenhuach21 @Zhenzhong1

Changed files

  • .buildkite/scripts/hardware_ci/run-xpu-test.sh (modified, +1/-0)
  • vllm/model_executor/layers/quantization/inc.py (modified, +164/-11)

PR #39778: [Quantization][Autoround][ARK] Add W4A16 Support

Description (problem / solution / changelog)

Motivation

We want to introduce robust native XPU/CPU support for the INT4/FP8 AutoRound format via Intel's auto-round-kernel (ARK), enabling efficient and stable inference for AutoRound-quantized LLMs.

Plan

This PR is part of RFC: Intel Quantization Support Roadmap

ARK will be introduced in stages. Stage 1: enable the code path via auto-round-lib. Stage 2: collect API feedback, with a source code release planned soon.

Test

python3 examples/basic/offline_inference/generate.py \
    --model OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc \
    --block-size 64 --enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.8
<img width="1078" height="263" alt="image" src="https://github.com/user-attachments/assets/a5c83994-6aaf-404d-a153-db9c7cc613a9" />

Changed files

  • vllm/model_executor/layers/quantization/inc.py (modified, +250/-31)

Code Example

get_quant_method()
├── is_cpu/is_xpu/ipex?  → apply_ipex_quant_layer()   # dead code
├── gptq format?         → apply_gptq_quant_layer()    # duplicates GPTQConfig
└── awq format?          → apply_awq_quant_layer()     # duplicates AWQConfig

---

INCConfig.get_quant_method(layer, prefix)
├── LinearBase  → AutoRoundQuantLinearMethod.get_method(config, layer, prefix)
│                   ├── wNa16 INT  → AutoRoundWNA16LinearImpl
│                   └── w8a16 FP8  → AutoRoundFP8LinearImpl
└── FusedMoE   → AutoRoundMoEMethod.get_moe_method(config, layer, prefix)
                    ├── wNa16 INT   → AutoRoundWNA16MoEImpl
                    └── w8a16 FP8   → AutoRoundFP8MoEImpl

Motivation.

Related RFCs

Motivation

Previously, we merged auto_round.py into inc.py as the unified Intel quantization backend. The vllm-xpu-kernels replacement for IPEX is a work in progress. This H1 2026 roadmap covers completing the consolidation, migrating from IPEX to vllm-xpu-kernels, and extending quantization scheme coverage on Intel CPU and XPU platforms.

Goals

  1. Broad scheme coverage — support the quantization formats that matter for Intel hardware (wNa16 INT, w8a16 FP8), for both Linear and MoE layers, on both XPU and CPU.
  2. Architectural cleanup — decouple the quant_method dispatch logic from INCConfig so INC acts purely as a config translator, not a kernel router.

1. Extend Quantization Scheme Coverage

Expand Intel platform support for the quantization schemes needed by production workloads.

1a. wNa16 (INT) — Weight-Only Integer Quantization

  • XPU: Linear (W4A16), MoE (W4A16)
  • CPU: Linear (W4A16), MoE (W4A16)

1b. w8a16 (FP8) — FP8 Weight-Only Quantization

  • XPU: Linear (W8A16 FP8), MoE (W8A16 FP8)
  • CPU: Linear (W8A16 FP8), MoE (W8A16 FP8)

Note: Some schemes may depend on kernel readiness.
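The coverage targets in 1a and 1b form a small platform × layer type × scheme matrix. A minimal sketch of that matrix as a lookup table follows; the string keys and the helper name are illustrative assumptions, not identifiers from vLLM or the RFC, and the table records roadmap targets rather than current kernel readiness.

```python
from itertools import product

# Hypothetical labels for the roadmap dimensions (not vLLM identifiers).
PLATFORMS = ("xpu", "cpu")
LAYER_TYPES = ("linear", "moe")
SCHEMES = ("wNa16_int", "w8a16_fp8")

# Every combination listed in 1a and 1b is a roadmap target.
ROADMAP_TARGETS = set(product(PLATFORMS, LAYER_TYPES, SCHEMES))


def is_roadmap_target(platform: str, layer_type: str, scheme: str) -> bool:
    """Return True if the combination appears in the H1 2026 coverage goals."""
    return (platform.lower(), layer_type.lower(), scheme) in ROADMAP_TARGETS


if __name__ == "__main__":
    print(len(ROADMAP_TARGETS))                            # 8
    print(is_roadmap_target("xpu", "moe", "w8a16_fp8"))    # True
    print(is_roadmap_target("cuda", "linear", "wNa16_int"))  # False
```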

2. Architectural Cleanup

Decouple the quant_method dispatch logic from INCConfig. Today, INCConfig.get_quant_method() contains per-backend routing that duplicates logic already in GPTQ/AWQ/Marlin configs:

get_quant_method()
├── is_cpu/is_xpu/ipex?  → apply_ipex_quant_layer()   # dead code
├── gptq format?         → apply_gptq_quant_layer()    # duplicates GPTQConfig
└── awq format?          → apply_awq_quant_layer()     # duplicates AWQConfig

Proposed Refactoring

Replace the monolithic get_quant_method() with a two-level dispatch architecture, inspired by the compressed-tensors design:

Level 1 — Module-type method (AutoRoundQuantLinearMethod, AutoRoundMoEMethod): Implements the vLLM method interface (LinearMethodBase / FusedMoEMethodBase). A static factory (get_method() / get_moe_method()) resolves the per-layer quantization scheme and delegates to the correct Level 2 impl.

Level 2 — Scheme-specific impl (e.g. AutoRoundWNA16LinearImpl, AutoRoundFP8LinearImpl): Implements an abstract AutoRoundQuantImpl base class that defines create_weights(), process_weights_after_loading(), and apply_weights(). Each impl owns a single quantization scheme and its kernel calls.

INCConfig.get_quant_method(layer, prefix)
├── LinearBase  → AutoRoundQuantLinearMethod.get_method(config, layer, prefix)
│                   ├── wNa16 INT  → AutoRoundWNA16LinearImpl
│                   └── w8a16 FP8  → AutoRoundFP8LinearImpl
└── FusedMoE   → AutoRoundMoEMethod.get_moe_method(config, layer, prefix)
                    ├── wNa16 INT   → AutoRoundWNA16MoEImpl
                    └── w8a16 FP8   → AutoRoundFP8MoEImpl
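The tree above can be sketched end to end in plain Python. This is a toy model under stated assumptions, not the implementation from the PRs: LinearBase and FusedMoE are empty stand-ins for the vLLM layer classes, the `scheme` field on INCConfig and its string values are hypothetical, and the "kernel calls" just return strings. It shows INCConfig routing on module type only, with each Level 1 method wrapping the Level 2 impl that its factory selected.

```python
from dataclasses import dataclass


class LinearBase:   # stand-in for vLLM's LinearBase layer
    pass


class FusedMoE:     # stand-in for vLLM's FusedMoE layer
    pass


class AutoRoundQuantImpl:
    """Level 2 base: one quantization scheme and its kernel calls."""

    def apply_weights(self, layer, x):
        raise NotImplementedError


class AutoRoundWNA16LinearImpl(AutoRoundQuantImpl):
    def apply_weights(self, layer, x):
        return f"wNa16 linear kernel({x})"   # placeholder for a real kernel


class AutoRoundFP8LinearImpl(AutoRoundQuantImpl):
    def apply_weights(self, layer, x):
        return f"w8a16 fp8 linear kernel({x})"


class AutoRoundWNA16MoEImpl(AutoRoundQuantImpl):
    def apply_weights(self, layer, x):
        return f"wNa16 moe kernel({x})"


class AutoRoundFP8MoEImpl(AutoRoundQuantImpl):
    def apply_weights(self, layer, x):
        return f"w8a16 fp8 moe kernel({x})"


def _pick_impl(impls, config, prefix):
    """Shared helper (hypothetical): map the configured scheme to an impl."""
    impl_cls = impls.get(config.scheme)
    if impl_cls is None:
        raise ValueError(f"Unsupported scheme {config.scheme!r} for {prefix}")
    return impl_cls()


class AutoRoundQuantLinearMethod:
    """Level 1 for Linear layers: holds one impl and forwards calls to it."""
    _IMPLS = {"wNa16": AutoRoundWNA16LinearImpl, "w8a16_fp8": AutoRoundFP8LinearImpl}

    def __init__(self, impl):
        self.impl = impl

    @staticmethod
    def get_method(config, layer, prefix):
        impl = _pick_impl(AutoRoundQuantLinearMethod._IMPLS, config, prefix)
        return AutoRoundQuantLinearMethod(impl)

    def apply(self, layer, x):
        return self.impl.apply_weights(layer, x)


class AutoRoundMoEMethod:
    """Level 1 for MoE layers."""
    _IMPLS = {"wNa16": AutoRoundWNA16MoEImpl, "w8a16_fp8": AutoRoundFP8MoEImpl}

    def __init__(self, impl):
        self.impl = impl

    @staticmethod
    def get_moe_method(config, layer, prefix):
        impl = _pick_impl(AutoRoundMoEMethod._IMPLS, config, prefix)
        return AutoRoundMoEMethod(impl)

    def apply(self, layer, x):
        return self.impl.apply_weights(layer, x)


@dataclass
class INCConfig:
    scheme: str  # hypothetical field; the RFC does not say how the scheme is stored

    def get_quant_method(self, layer, prefix):
        # Route on module type only; no kernel or backend logic lives here.
        if isinstance(layer, FusedMoE):
            return AutoRoundMoEMethod.get_moe_method(self, layer, prefix)
        if isinstance(layer, LinearBase):
            return AutoRoundQuantLinearMethod.get_method(self, layer, prefix)
        return None  # layer type not handled by this config


if __name__ == "__main__":
    config = INCConfig(scheme="wNa16")
    method = config.get_quant_method(LinearBase(), prefix="model.layers.0.qkv_proj")
    print(type(method.impl).__name__)  # AutoRoundWNA16LinearImpl
```

Keeping the scheme lookup in a per-method table means that supporting a new scheme touches only a Level 2 class and one table entry, while INCConfig stays a pure translator.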

Feedback Period.

Please comment on the proposal or suggest alternatives. If there are no strong objections, we will proceed with the timeline above and submit implementation PRs. Thanks!

CC List.

cc @hshen14 @thuang6 @wenhuach21 @Zhenzhong1 @jikunshang @xinyu-intel @xuechendi cc @robertgshaw2-redhat


extent analysis

Fix Plan

To address the proposed refactoring, follow these steps:

  • Create a two-level dispatch architecture:
    • Level 1: Implement module-type methods (AutoRoundQuantLinearMethod, AutoRoundMoEMethod) that resolve the per-layer quantization scheme and delegate to the correct Level 2 implementation.
    • Level 2: Implement scheme-specific implementations (e.g., AutoRoundWNA16LinearImpl, AutoRoundFP8LinearImpl) that own a single quantization scheme and its kernel calls.

Example code for Level 1:

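# Import paths are assumptions; check them against your vLLM version.
from vllm.model_executor.layers.fused_moe import FusedMoEMethodBase
from vllm.model_executor.layers.linear import LinearMethodBase


class AutoRoundQuantLinearMethod(LinearMethodBase):
    @staticmethod
    def get_method(config, layer, prefix):
        # config.quant_scheme is a placeholder attribute, not a confirmed
        # INCConfig field. The impl classes are defined in the Level 2 example.
        if config.quant_scheme == "wNa16 INT":
            return AutoRoundWNA16LinearImpl()
        if config.quant_scheme == "w8a16 FP8":
            return AutoRoundFP8LinearImpl()
        raise ValueError(
            f"Unsupported quantization scheme for {prefix}: {config.quant_scheme!r}"
        )


class AutoRoundMoEMethod(FusedMoEMethodBase):
    @staticmethod
    def get_moe_method(config, layer, prefix):
        # Same placeholder attribute as above, resolved to the MoE impls.
        if config.quant_scheme == "wNa16 INT":
            return AutoRoundWNA16MoEImpl()
        if config.quant_scheme == "w8a16 FP8":
            return AutoRoundFP8MoEImpl()
        raise ValueError(
            f"Unsupported quantization scheme for {prefix}: {config.quant_scheme!r}"
        )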

Example code for Level 2:

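from abc import ABC, abstractmethod


class AutoRoundQuantImpl(ABC):
    # Argument lists are illustrative; the RFC names only the three methods.

    @abstractmethod
    def create_weights(self, layer):
        """Register the quantized parameters for this scheme on the layer."""

    @abstractmethod
    def process_weights_after_loading(self, layer):
        """Convert loaded weights into the layout the kernel expects."""

    @abstractmethod
    def apply_weights(self, layer, x, bias=None):
        """Run the quantized forward pass for this scheme."""


class AutoRoundWNA16LinearImpl(AutoRoundQuantImpl):
    def create_weights(self, layer):
        # TODO: allocate the packed INT weight and scale tensors (wNa16)
        raise NotImplementedError

    def process_weights_after_loading(self, layer):
        # TODO: repack the INT weights for the target kernel
        raise NotImplementedError

    def apply_weights(self, layer, x, bias=None):
        # TODO: call the wNa16 linear kernel
        raise NotImplementedError


class AutoRoundFP8LinearImpl(AutoRoundQuantImpl):
    def create_weights(self, layer):
        # TODO: allocate the FP8 weight and scale tensors (w8a16)
        raise NotImplementedError

    def process_weights_after_loading(self, layer):
        # TODO: prepare the FP8 weights for the target kernel
        raise NotImplementedError

    def apply_weights(self, layer, x, bias=None):
        # TODO: call the w8a16 FP8 linear kernel
        raise NotImplementedError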

Verification

To verify the refactoring, exercise each combination of scheme (wNa16 INT, w8a16 FP8) and layer type (Linear, MoE). Check that the factory returns the matching scheme-specific implementation for each layer, and that generated text stays coherent, for example by rerunning the generate.py command from the PR test plans.
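A dispatch check of this kind could look like the sketch below. It is self-contained and uses stand-ins only: get_method mirrors the shape of the Level 1 factory in the Fix Plan, and quant_scheme is the same hypothetical attribute used there. The test functions follow the pytest naming convention but run without pytest.

```python
from types import SimpleNamespace


class AutoRoundWNA16LinearImpl:   # stand-in for the real Level 2 impl
    pass


class AutoRoundFP8LinearImpl:     # stand-in for the real Level 2 impl
    pass


def get_method(config, layer=None, prefix=""):
    """Stand-in factory; the real one would live on AutoRoundQuantLinearMethod."""
    impls = {
        "wNa16 INT": AutoRoundWNA16LinearImpl,
        "w8a16 FP8": AutoRoundFP8LinearImpl,
    }
    if config.quant_scheme not in impls:
        raise ValueError(f"Unsupported quantization scheme: {config.quant_scheme!r}")
    return impls[config.quant_scheme]()


def test_each_scheme_selects_its_impl():
    cases = [
        ("wNa16 INT", AutoRoundWNA16LinearImpl),
        ("w8a16 FP8", AutoRoundFP8LinearImpl),
    ]
    for scheme, impl_cls in cases:
        config = SimpleNamespace(quant_scheme=scheme)
        assert isinstance(get_method(config), impl_cls), scheme


def test_unknown_scheme_is_rejected():
    try:
        get_method(SimpleNamespace(quant_scheme="w2a16 INT"))
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unsupported scheme")


if __name__ == "__main__":
    test_each_scheme_selects_its_impl()
    test_unknown_scheme_is_rejected()
    print("dispatch checks passed")
```

Checks like these cover only the routing; numerical correctness of each kernel still needs a model-level run on the target hardware.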

Extra Tips

  • Make sure to update the documentation to reflect the changes in the code.
  • Consider adding unit tests to cover the different quantization schemes and layer types.
  • Review the code
