vllm - ✅(Solved) Fix [vLLM IR] Port QuantFP8 to IR op [2 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#38745 · Fetched 2026-04-08 02:22:54
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): assigned ×1, issue_type_added ×1, labeled ×1

PR fix notes

PR #39267: [vllm IR] 1/N Port FP8 Quantization to vLLM IR Ops

Description (problem / solution / changelog)

Purpose

Ports the native and aiter quant FP8 ops to IR, following #38745. The table below tracks all quant FP8 IR ops and their provider implementations.

Legend: ✅ ported in this PR; 👨🏽‍💻 to be ported in 2/N; ❌ no existing implementation.

Op | native | aiter | triton | vllm_c
static_quant_fp8 👨🏽‍💻
dynamic_quant_fp8 👨🏽‍💻
dynamic_group_quant_fp8 👨🏽‍💻 👨🏽‍💻
static_group_quant_fp8 👨🏽‍💻

Follow up:

  • Port triton ops
  • Port vllm_c ops

Test Plan

Test Result


<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • tests/compile/passes/distributed/test_fusion_all_reduce.py (modified, +1/-1)
  • tests/compile/passes/test_fusion.py (modified, +20/-20)
  • tests/compile/passes/test_fusion_attn.py (modified, +7/-5)
  • tests/compile/passes/test_mla_attn_quant_fusion.py (modified, +8/-5)
  • tests/compile/passes/test_silu_mul_quant_fusion.py (modified, +13/-6)
  • tests/rocm/aiter/test_grouped_quant.py (modified, +15/-15)
  • vllm/_aiter_ops.py (modified, +0/-134)
  • vllm/compilation/passes/fusion/matcher_utils.py (modified, +11/-17)
  • vllm/config/kernel.py (modified, +12/-0)
  • vllm/ir/ops/__init__.py (modified, +13/-1)
  • vllm/ir/ops/quant.py (added, +93/-0)
  • vllm/kernels/aiter_ops.py (modified, +176/-0)
  • vllm/model_executor/layers/quantization/input_quant_fp8.py (modified, +17/-74)
  • vllm/platforms/cuda.py (modified, +8/-1)
  • vllm/platforms/rocm.py (modified, +18/-5)
  • vllm/platforms/xpu.py (modified, +7/-1)

PR #39481: [vllm IR] Port FP8 Quantization to vLLM IR Ops

Description (problem / solution / changelog)

Purpose

Ports the quant FP8 ops to IR, following #38745. The table below tracks all quant FP8 IR ops and their provider implementations.

Legend: ✅ ported; ❌ no existing implementation.

Op | native | aiter | triton | vllm_c
static_quant_fp8
dynamic_quant_fp8
dynamic_group_quant_fp8
static_group_quant_fp8

Notes:

  • per_token_group_quant_fp8_packed_for_deepgemm: Has different semantics from dynamic_group_quant_fp8 and requires a separate IR op.
  • per_token_group_quant_fp8: Not removed, because deep_gemm_moe.py calls it with a pre-allocated out_q output buffer. IR ops must be purely functional and cannot write into caller-provided output tensors, so this usage cannot be expressed as an IR op without refactoring the call site.
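
The difference between the two calling styles in the second note can be sketched in plain Python. The function names below are hypothetical and lists stand in for tensors; only the shape of the interface matters here.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def group_quant_into(x, out_q, group_size):
    """Out-buffer style: writes scaled values into the caller's `out_q`."""
    scales = []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        scale = max(max(abs(v) for v in group) / FP8_E4M3_MAX, 1e-12)
        out_q[start:start + group_size] = [v / scale for v in group]
        scales.append(scale)
    return scales


def group_quant(x, group_size):
    """Functional style: allocates its own output and returns everything."""
    out_q = [0.0] * len(x)
    scales = group_quant_into(x, out_q, group_size)
    return out_q, scales
```

Only the second form has no side effects visible to the caller, which is the property the note requires of an IR op. A call site that depends on the first form has to be refactored before it can use one.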

Follow ups

  • Aiter dynamic quant pattern: Subclass RMSNormDynamicQuantPattern once it's refactored to use VllmFusionPatternMatcherPass to eliminate the duplicate.
  • Aiter group quant pattern: Same as above for RMSNormGroupQuantPattern.
  • deep_gemm_moe: Replace the per-token group quant call in fp8_utils.py with a direct IR op.

Test Plan

Test Result



Changed files

  • tests/compile/passes/distributed/test_fusion_all_reduce.py (modified, +1/-3)
  • tests/compile/passes/test_fusion.py (modified, +10/-21)
  • tests/compile/passes/test_fusion_attn.py (modified, +3/-7)
  • tests/compile/passes/test_mla_attn_quant_fusion.py (modified, +2/-5)
  • tests/compile/passes/test_silu_mul_quant_fusion.py (modified, +14/-16)
  • tests/rocm/aiter/test_grouped_quant.py (modified, +15/-15)
  • vllm/_aiter_ops.py (modified, +0/-134)
  • vllm/compilation/passes/fusion/act_quant_fusion.py (modified, +14/-43)
  • vllm/compilation/passes/fusion/allreduce_rms_fusion.py (modified, +10/-15)
  • vllm/compilation/passes/fusion/attn_quant_fusion.py (modified, +3/-5)
  • vllm/compilation/passes/fusion/matcher_utils.py (modified, +11/-171)
  • vllm/compilation/passes/fusion/mla_attn_quant_fusion.py (modified, +3/-5)
  • vllm/compilation/passes/fusion/rms_quant_fusion.py (modified, +51/-90)
  • vllm/compilation/passes/fusion/rocm_aiter_fusion.py (modified, +52/-47)
  • vllm/compilation/passes/fusion/sequence_parallelism.py (modified, +8/-10)
  • vllm/config/kernel.py (modified, +12/-0)
  • vllm/ir/ops/__init__.py (modified, +13/-1)
  • vllm/ir/ops/quant.py (added, +148/-0)
  • vllm/kernels/__init__.py (modified, +2/-2)
  • vllm/kernels/aiter_ops.py (modified, +193/-0)
  • vllm/kernels/triton/__init__.py (added, +7/-0)
  • vllm/kernels/triton/quant.py (added, +212/-0)
  • vllm/kernels/triton_ops.py (added, +212/-0)
  • vllm/kernels/vllm_c.py (modified, +154/-0)
  • vllm/model_executor/layers/quantization/input_quant_fp8.py (modified, +29/-123)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +2/-0)
  • vllm/platforms/cuda.py (modified, +8/-1)
  • vllm/platforms/rocm.py (modified, +21/-5)
  • vllm/platforms/xpu.py (modified, +7/-1)
RAW_BUFFER

Port the various ops inside the QuantFP8 class. I think we should separate the different types of quantization instead of trying to jam them all into the same op.

Quantization types and corresponding ops:

  • static per-tensor: static_quant_fp8(x: Tensor, scale: Tensor) -> Tensor
  • static per-group: static_group_quant_fp8(x: Tensor, scale: Tensor) -> Tensor
  • dynamic per-token/per-tensor: dynamic_quant_fp8(x: Tensor, per_token: bool, scale_ub: Tensor | None = None) -> tuple[Tensor, Tensor]
    • These could be separate ops as well, but I think keeping them together consolidates things a bit; dynamic per-tensor is also very uncommon these days.
  • dynamic per-group: dynamic_group_quant_fp8(x: Tensor, group_shape: list[int], column_major: bool, use_ue8m0: bool, scale_alignment: int = 1) -> tuple[Tensor, Tensor]
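
A minimal pure-Python sketch of how the static and dynamic signatures above differ, using nested lists in place of tensors. It only scales and clamps; the actual cast to an FP8 dtype is omitted. The handling of `scale_ub` as a cap on the absolute maximum and the `1e-12` floor are assumptions made for this illustration.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def _clamp(v):
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))


def static_quant_fp8(x, scale):
    """Static per-tensor: the caller supplies one precomputed scale."""
    return [[_clamp(v / scale) for v in row] for row in x]


def dynamic_quant_fp8(x, per_token, scale_ub=None):
    """Dynamic: derive the scale from the data and return (q, scales)."""
    def scale_for(values):
        amax = max(abs(v) for v in values)
        if scale_ub is not None:
            amax = min(amax, scale_ub)  # assumed: cap on the abs-max
        return max(amax / FP8_E4M3_MAX, 1e-12)  # avoid dividing by zero

    if per_token:
        scales = [scale_for(row) for row in x]  # one scale per row (token)
        q = [[_clamp(v / s) for v in row] for row, s in zip(x, scales)]
    else:
        s = scale_for([v for row in x for v in row])  # one scale overall
        scales = [s]
        q = [[_clamp(v / s) for v in row] for row in x]
    return q, scales
```

The static op returns a single value because its scale is an input; the dynamic op returns a tuple because the scale is an output the caller needs for dequantization.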

Just like activation ops, we will need to compile the native impl for use inside MoE and MLA: #38744

extent analysis

TL;DR

Port the quantization logic currently inside the QuantFP8 class to vLLM IR ops, with a separate op for each type of quantization rather than one op that covers them all.

Guidance

  • Identify the distinct quantization types (static per-tensor, static per-group, dynamic per-token/per-tensor, dynamic per-group) and define a separate op for each.
  • Consider the trade-offs between consolidating related ops (e.g., dynamic per-token and per-tensor) versus separating them for clarity and flexibility.
  • Review the native implementation compilation requirements for MoE and MLA, as mentioned in #38744, to ensure the new op structure aligns with these needs.
  • Evaluate the potential impact on existing code and interfaces when refactoring the QuantFP8 class.
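
For the dynamic per-group type named in the guidance, the following sketch shows the core computation on a single row. It is a simplification under stated assumptions: `group_size` stands in for the issue's `group_shape`, the `column_major` and `scale_alignment` layout options are left out, and `use_ue8m0` is assumed to mean rounding each scale up to a power of two.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def dynamic_group_quant_row(row, group_size, use_ue8m0=False):
    """Quantize one row in contiguous groups, one scale per group."""
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        scale = max(max(abs(v) for v in group) / FP8_E4M3_MAX, 1e-10)
        if use_ue8m0:
            # Assumed UE8M0 behaviour: round the scale up to a power of two.
            scale = 2.0 ** math.ceil(math.log2(scale))
        q.extend(
            max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in group
        )
        scales.append(scale)
    return q, scales
```

Rounding the scale up rather than down keeps every scaled value inside the representable range, at the cost of using slightly less of that range.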

Example

from __future__ import annotations

from torch import Tensor

# Signature stubs copied from the issue: one standalone op per quantization type.
def static_quant_fp8(x: Tensor, scale: Tensor) -> Tensor:
    raise NotImplementedError  # static per-tensor

def static_group_quant_fp8(x: Tensor, scale: Tensor) -> Tensor:
    raise NotImplementedError  # static per-group

def dynamic_quant_fp8(x: Tensor, per_token: bool, scale_ub: Tensor | None = None) -> tuple[Tensor, Tensor]:
    raise NotImplementedError  # dynamic per-token or per-tensor

def dynamic_group_quant_fp8(x: Tensor, group_shape: list[int], column_major: bool, use_ue8m0: bool, scale_alignment: int = 1) -> tuple[Tensor, Tensor]:
    raise NotImplementedError  # dynamic per-group

Notes

The exact implementation details and potential dependencies between these ops are not specified, so further analysis and design are necessary to ensure a correct and efficient refactoring.

Recommendation

Follow the approach taken in the linked PRs (#39267, #39481): define one IR op per quantization type rather than a single combined op, and account for the need to compile the native implementation for use inside MoE and MLA (#38744).
