vllm - [RFC]: Porting compiler fusions to manual fusion


Motivation.

#42770 lays out the move from full-graph torch.compile to manual fusion expressed directly in model code. This issue tracks the concrete porting work for section 1-1 of that RFC: enumerating the fusions currently performed by vllm/compilation/passes/ and mapping each one to a work item that ports it to a manual call site in model code. To be clear, this workstream is meant to touch every existing portable model definition where a compiler fusion pass previously applied.

The goal is no performance regression as we delete each compiler pass. Each fusion below is either:

  • ported to a "big fused op" callable from model code,
  • collapsed into another fused op,
  • or explicitly dropped (with rationale).

Proposed Change.

Changes to QuantMethod

This is needed for fusions that absorb the input quant op of a downstream Linear layer. When a linear layer opts in by registering self.input_quant_key = QuantKey.XX, the fused operation returns a QuantizedActivation instead of a torch.Tensor. When the linear's kernel sees that its input is already quantized rather than a plain unquantized tensor, it passes the input directly to the matmul without requantizing.

For an illustrative example of RMSNorm + Quant, see https://github.com/vllm-project/vllm/pull/42469
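
The opt-in handshake described above can be sketched in plain Python. This is a toy model, not the vLLM implementation: only the names QuantKey, QuantizedActivation, and input_quant_key come from this issue. The enum member, the dataclass fields, the helpers quantize and fused_rms_norm, the requant_calls counter, and the integer quantization (a stand-in for a real FP8/NVFP4 kernel) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union


class QuantKey(Enum):
    # Hypothetical member; the RFC only writes `QuantKey.XX`.
    TOY_INT8_PER_TENSOR = "toy_int8_per_tensor"


@dataclass
class QuantizedActivation:
    # The class name comes from the RFC; these fields are assumptions.
    values: List[int]
    scale: float
    key: QuantKey


def quantize(x: List[float], key: QuantKey) -> QuantizedActivation:
    """Toy symmetric per-tensor quantization (stand-in for a real kernel)."""
    amax = max(abs(v) for v in x) or 1.0
    scale = amax / 127.0
    return QuantizedActivation([round(v / scale) for v in x], scale, key)


def fused_rms_norm(
    x: List[float],
    weight: List[float],
    eps: float = 1e-6,
    quant_key: Optional[QuantKey] = None,
) -> Union[List[float], QuantizedActivation]:
    """Hypothetical fused op: RMSNorm, plus output quant when a key is given."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    y = [v / rms * w for v, w in zip(x, weight)]
    return y if quant_key is None else quantize(y, quant_key)


class Linear:
    """Single-output toy linear layer that may opt in to pre-quantized input."""

    def __init__(self, weight_row: List[float],
                 input_quant_key: Optional[QuantKey] = None):
        self.weight_row = weight_row
        self.input_quant_key = input_quant_key  # the opt-in described above
        self.requant_calls = 0  # instrumentation for this sketch only

    def forward(self, x: Union[List[float], QuantizedActivation]) -> float:
        if not isinstance(x, QuantizedActivation):
            if self.input_quant_key is None:
                # Fully unquantized path.
                return sum(v * w for v, w in zip(x, self.weight_row))
            # Unfused path: the layer has to quantize its own input.
            self.requant_calls += 1
            x = quantize(x, self.input_quant_key)
        # Already-quantized input goes straight to the (toy) matmul.
        return x.scale * sum(q * w for q, w in zip(x.values, self.weight_row))


linear = Linear([0.5, -1.0, 2.0, 1.0],
                input_quant_key=QuantKey.TOY_INT8_PER_TENSOR)
hidden = [1.0, 2.0, 3.0, 4.0]
ones = [1.0] * 4

fused_out = linear.forward(
    fused_rms_norm(hidden, ones, quant_key=linear.input_quant_key))
unfused_out = linear.forward(fused_rms_norm(hidden, ones))
```

In this sketch the fused call leaves requant_calls at zero, while the unfused call increments it once; both paths produce the same output because they apply the same quantization to the same normalized values.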

Changes to Attention

TBD

Inventory of fusion passes

In rough priority order. Sub-issues will be filed per fused op to track ownership, kernel coverage, and the model PRs that switch over to it.

CUDA

  • allreduce + rms_norm[+quant] (AllReduceFusionPass)
    • No quant
    • FP8 per-tensor/per-token
    • NVFP4
  • rms_norm[+add] + quant (RMSNormQuantFusionPass)
    • FP8 per-tensor/per-token
    • FP8 per-token-block (1×128)
  • silu_mul + quant (ActivationQuantFusionPass)
    • FP8 per-tensor
    • FP8 per-token-block (1×128)
    • NVFP4
  • attn + output_quant (AttnQuantFusionPass)
    • FP8 per-tensor
    • NVFP4
  • mla_attn + output_quant (MLAAttnQuantFusionPass)
    • FP8 per-tensor
    • FP8 per-token-block (1×128)
    • NVFP4
  • q_norm + k_norm + rope (QKNormRoPEFusionPass) - Qwen3 / Gemma path.
  • rope + kv_cache_write (RopeKVCacheFusionPass)
  • mla_rope + kv_cache_concat (MLARoPEKVCacheCatFusionPass)
  • minimax_qk_allreduce + norm (MiniMaxQKNormPass)
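
The quant granularities named in the list above differ only in how many elements share one scale. The sketch below illustrates that on a [tokens][hidden] activation, assuming FP8 E4M3 with a maximum representable magnitude of 448 and a plain amax / max scale rule. The function names are hypothetical, and real kernels may clamp or round scales differently.

```python
from typing import List

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3

Activation = List[List[float]]  # [tokens][hidden]


def per_tensor_scale(x: Activation) -> float:
    """One scale shared by the whole activation."""
    return max(abs(v) for row in x for v in row) / FP8_E4M3_MAX


def per_token_scales(x: Activation) -> List[float]:
    """One scale per token (row)."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in x]


def per_token_block_scales(x: Activation, block: int = 128) -> List[List[float]]:
    """One scale per token per `block` contiguous hidden elements (1 x 128)."""
    return [
        [
            max(abs(v) for v in row[i:i + block]) / FP8_E4M3_MAX
            for i in range(0, len(row), block)
        ]
        for row in x
    ]


# Two tokens, hidden size 256, so each token has two 1x128 blocks.
activation = [
    [1.0] * 128 + [448.0] * 128,
    [2.0] * 256,
]
tensor_scale = per_tensor_scale(activation)        # a single float
token_scales = per_token_scales(activation)        # 2 scales
block_scales = per_token_block_scales(activation)  # 2 x 2 scales
```

Finer granularity keeps a single outlier from inflating the scale of unrelated elements, which is why each fused op has to be covered separately for every scheme it supports.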

ROCm / aiter

  • allreduce + rms_norm[+add] (RocmAiterAllReduceFusionPass)
  • rms_norm[+add] + (group/dynamic) fp8 quant (RocmAiterRMSNormQuantFusionPass)
  • silu_mul + group fp8 quant (RocmAiterSiluMulFp8GroupQuantFusionPass)
  • add + rms_norm + router_pad (RocmAiterTritonAddRMSNormPadFusionPass)
  • mla_dual_rms_norm (MLADualRMSNormFusionPass)

Compiler housekeeping (to drop, not port)

These exist only to clean up the IR and keep the compiler-fusion world consistent. With manual fusion they have no role:

  • NoOpEliminationPass
  • VllmIRLoweringPass
  • UnsafeCloneEliminationPass
  • ScatterSplitReplacementPass + SplitCoalescingPass
  • FixFunctionalizationPass
  • PostCleanupPass

These will be removed alongside the torch.compile integration removal (roughly the end-of-June milestone in #42770).

Open questions.

  1. Sequence parallelism (SP). SP today is opt-in (pass_config.enable_sp) and matters most for long-context + high TP. It is essentially a special case of all-reduce + rms_norm: it turns AR + RMSNorm[+Quant] into ReduceScatter + RMSNorm[+Quant] + AllGather. We should decide whether to keep an SP path at all in the manual-fusion world.
  2. Async-TP. It is not yet clear whether this can be ported to manual fusion.
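
The SP rewrite in question 1 relies on RMSNorm acting on each token independently, so normalizing a token shard after a reduce-scatter and then gathering gives the same result as normalizing after a full all-reduce. The single-process simulation below checks that equivalence. The collectives are plain-Python stand-ins rather than real communication calls, the norm weight is omitted, and the token count is assumed to be divisible by the world size.

```python
import math
from typing import List

Matrix = List[List[float]]  # [tokens][hidden]


def rms_norm(rows: Matrix, eps: float = 1e-6) -> Matrix:
    """Per-token RMSNorm without a learned weight."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


def all_reduce(partials: List[Matrix]) -> Matrix:
    """Element-wise sum of every rank's partial result."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def reduce_scatter(partials: List[Matrix], rank: int, world: int) -> Matrix:
    """Sum across ranks, then keep only this rank's contiguous token shard."""
    full = all_reduce(partials)
    chunk = len(full) // world  # assumes tokens % world == 0
    return full[rank * chunk:(rank + 1) * chunk]


def all_gather(shards: List[Matrix]) -> Matrix:
    """Concatenate the per-rank token shards in rank order."""
    return [row for shard in shards for row in shard]


world = 2
# Each rank holds a partial sum for 4 tokens with hidden size 3.
partials = [
    [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [4.0, 0.0, 1.0], [2.0, 2.0, 2.0]],
    [[0.0, 1.0, 1.0], [1.5, 2.5, 3.5], [1.0, 1.0, 1.0], [3.0, 0.0, 0.5]],
]

# Baseline: AR + RMSNorm on every rank.
baseline = rms_norm(all_reduce(partials))

# SP: ReduceScatter + RMSNorm on a token shard + AllGather.
sp = all_gather(
    [rms_norm(reduce_scatter(partials, r, world)) for r in range(world)]
)

assert baseline == sp
```

Under SP each rank normalizes only its 1/world share of the tokens, which is where the saving for long-context, high-TP runs comes from.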

In-Progress.

  • @mgoin / @ProExpertProg track per-pass porting and file sub-issues from this list.
  • @WoosukKwon's DSv4 NVIDIA implementation will be the reference port for P0 CUDA ops.
  • @tjtanaa will handle the P0 ROCm ops.

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response

