vllm - 💡(How to fix) Fix [RFC]: Porting compiler fusions to manual fusion

StepCodex · 2026-05-20T15:19:03Z


Motivation.

#42770 lays out the move from full-graph torch.compile to manual fusion expressed directly in model code. This issue tracks the concrete porting work for section 1-1 of that RFC: enumerating the fusions currently performed by vllm/compilation/passes/ and mapping each one to a work item that ports it to a manual call site in model code. To be clear, this workstream is meant to touch every existing portable model definition where a compiler fusion pass previously applied.

The goal is no performance regression as we delete each compiler pass. Each fusion below is either:

ported to a "big fused op" callable from model code,
collapsed into another fused op,
or explicitly dropped (with rationale).

Proposed Change.

Changes to QuantMethod

This change is needed for fusions that absorb the input quant op of a downstream Linear layer. When a linear layer opts in by registering self.input_quant_key = QuantKey.XX, the fused operation returns a new QuantizedActivation instead of a torch.Tensor. When the linear's kernel sees that the input is already quantized rather than a plain unquantized tensor, it passes it directly to the matmul without requantizing.

For an illustrative example of RMSNorm + Quant, see https://github.com/vllm-project/vllm/pull/42469
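
The opt-in handshake described above can be sketched in plain Python. This is a minimal illustration, not vLLM's implementation: QuantKey.TOY_INT8, the fields of QuantizedActivation, fused_rms_norm_quant, and the Linear class below are all hypothetical stand-ins, and a toy symmetric int8 scheme replaces the FP8/NVFP4 formats so the example runs without a GPU.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union


class QuantKey(Enum):
    # Hypothetical key; a toy int8 scheme stands in for FP8/NVFP4.
    TOY_INT8 = "toy_int8"


@dataclass
class QuantizedActivation:
    # Hypothetical container: quantized values plus the scale to undo them.
    data: List[int]
    scale: float
    key: QuantKey


def quantize(x: List[float], key: QuantKey) -> QuantizedActivation:
    # Symmetric per-tensor quantization to the int8 range [-127, 127].
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / 127.0
    return QuantizedActivation([round(v / scale) for v in x], scale, key)


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def fused_rms_norm_quant(
    x: List[float], weight: List[float], quant_key: Optional[QuantKey] = None
) -> Union[List[float], QuantizedActivation]:
    # Manual call site: one op that normalizes and, if the consumer opted in,
    # also quantizes and returns a QuantizedActivation.
    y = rms_norm(x, weight)
    return y if quant_key is None else quantize(y, quant_key)


class Linear:
    def __init__(self, weight: List[List[float]], input_quant_key: QuantKey):
        self.weight = weight
        self.input_quant_key = input_quant_key  # the opt-in registration
        self.quant_calls = 0  # counts how often this layer had to quantize

    def forward(self, x: Union[List[float], QuantizedActivation]) -> List[float]:
        if isinstance(x, QuantizedActivation):
            qx = x  # already quantized upstream: pass straight to the matmul
        else:
            self.quant_calls += 1
            qx = quantize(x, self.input_quant_key)
        return [qx.scale * sum(q * w for q, w in zip(qx.data, row)) for row in self.weight]
```

In this sketch, model code would call fused_rms_norm_quant(x, w, linear.input_quant_key) and hand the result to linear.forward; the linear's quant_calls counter stays at zero because the quantization already happened inside the fused op.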

Changes to Attention

TBD

Inventory of fusion passes

Listed in rough priority order. Sub-issues will be filed per fused op to track ownership, kernel coverage, and the model PRs that switch over to it.

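CUDA

allreduce + rms_norm[+quant] (AllReduceFusionPass)
    No quant
    FP8 per-tensor/per-token
    NVFP4
rms_norm[+add] + quant (RMSNormQuantFusionPass)
    FP8 per-tensor/per-token
    FP8 per-token-block (1×128)
silu_mul + quant (ActivationQuantFusionPass)
    FP8 per-tensor
    FP8 per-token-block (1×128)
    NVFP4
attn + output_quant (AttnQuantFusionPass)
    FP8 per-tensor
    NVFP4
mla_attn + output_quant (MLAAttnQuantFusionPass)
    FP8 per-tensor
    FP8 per-token-block (1×128)
    NVFP4
q_norm + k_norm + rope (QKNormRoPEFusionPass)
    Qwen3 / Gemma path.
rope + kv_cache_write (RopeKVCacheFusionPass)
mla_rope + kv_cache_concat (MLARoPEKVCacheCatFusionPass)
minimax_qk_allreduce + norm (MiniMaxQKNormPass)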

ROCm / aiter

allreduce + rms_norm[+add] (RocmAiterAllReduceFusionPass)
rms_norm[+add] + (group/dynamic) fp8 quant (RocmAiterRMSNormQuantFusionPass)
silu_mul + group fp8 quant (RocmAiterSiluMulFp8GroupQuantFusionPass)
add + rms_norm + router_pad (RocmAiterTritonAddRMSNormPadFusionPass)
mla_dual_rms_norm (MLADualRMSNormFusionPass)
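
The quantization variants in the inventory differ mainly in how many elements share one scale. The sketch below illustrates that granularity for a [tokens][hidden] activation. It is an assumption-laden reference, not a vLLM kernel: the function names are hypothetical, the group size of 128 is taken as the example group width, and 448.0 is the largest finite value of the FP8 E4M3 (e4m3fn) format; other FP8 variants use a different maximum. Real kernels typically also clamp the scale away from zero, which is omitted here.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3fn


def per_tensor_scale(x: List[List[float]]) -> float:
    # One scale shared by every element of the activation.
    return max(abs(v) for row in x for v in row) / FP8_E4M3_MAX


def per_token_scales(x: List[List[float]]) -> List[float]:
    # One scale per token (row).
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


def per_token_group_scales(x: List[List[float]], group_size: int = 128) -> List[List[float]]:
    # One scale per token per contiguous group of `group_size` hidden elements.
    return [
        [
            max(abs(v) for v in row[i:i + group_size]) / FP8_E4M3_MAX
            for i in range(0, len(row), group_size)
        ]
        for row in x
    ]
```

For a 2-token activation with hidden size 256, this yields one scale, two scales, and a 2×2 grid of scales respectively. Finer granularity confines an outlier's effect on precision to its own group rather than the whole tensor.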

Compiler housekeeping (to drop, not port)

These exist only to clean up the IR and keep the compiler-fusion world consistent. With manual fusion they have no role:

NoOpEliminationPass
VllmIRLoweringPass
UnsafeCloneEliminationPass
ScatterSplitReplacementPass + SplitCoalescingPass
FixFunctionalizationPass
PostCleanupPass

These will be removed alongside the torch.compile integration removal (roughly the end-of-June milestone in #42770).

Open questions.

1. Sequence parallelism: SP today is opt-in (pass_config.enable_sp) and matters most for long-context + high TP. It is essentially a special case of all-reduce + rms_norm that turns AR + RMSNorm[+Quant] into ReduceScatter + RMSNorm[+Quant] + AllGather. We should decide whether to keep an SP path at all in the manual-fusion world.
2. Async-TP: It is unclear whether this can be ported to manual fusion.
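
The sequence-parallel rewrite is valid because RMSNorm acts on each token independently, so normalizing a token shard and gathering gives the same result as normalizing the full all-reduced tensor. The simulation below checks that equivalence on the CPU. It is only a sketch: the collectives are modeled with nested lists rather than torch.distributed, all function names are hypothetical, the RMSNorm weight and the optional quant step are omitted, and the token count is assumed to be divisible by the world size.

```python
import math
from typing import List

Tensor2D = List[List[float]]  # [tokens][hidden]


def rms_norm(rows: Tensor2D, eps: float = 1e-6) -> Tensor2D:
    # Per-token RMSNorm without a learned weight.
    out = []
    for row in rows:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


def all_reduce(partials: List[Tensor2D]) -> List[Tensor2D]:
    # Every rank ends up with the element-wise sum over all ranks.
    world = len(partials)
    tokens, hidden = len(partials[0]), len(partials[0][0])
    summed = [
        [sum(partials[r][t][h] for r in range(world)) for h in range(hidden)]
        for t in range(tokens)
    ]
    return [summed for _ in range(world)]


def reduce_scatter(partials: List[Tensor2D]) -> List[Tensor2D]:
    # Rank r ends up with only its token slice of the summed tensor.
    world = len(partials)
    chunk = len(partials[0]) // world
    summed = all_reduce(partials)[0]
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards: List[Tensor2D]) -> List[Tensor2D]:
    # Every rank ends up with the concatenation of all token shards.
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]


def ar_rmsnorm(partials: List[Tensor2D]) -> List[Tensor2D]:
    # Baseline: AllReduce, then RMSNorm over all tokens on every rank.
    return [rms_norm(x) for x in all_reduce(partials)]


def sp_rmsnorm(partials: List[Tensor2D]) -> List[Tensor2D]:
    # SP form: ReduceScatter, RMSNorm on the local token shard, AllGather.
    return all_gather([rms_norm(shard) for shard in reduce_scatter(partials)])
```

Both paths produce the same per-rank output; the SP form lets each rank normalize only 1/world_size of the tokens between the two collectives.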

In-Progress.

@mgoin / @ProExpertProg track per-pass porting and file sub-issues from this list.
@WoosukKwon's DSv4 NVIDIA implementation will be the reference port for P0 CUDA ops.
@tjtanaa will handle the P0 ROCm ops.

