vllm - ✅(Solved) Fix [RFC][vLLM IR]: Automatically compile native impl for IR ops [2 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38744 · Fetched 2026-04-08 02:22:56
Comments: 0 · Participants: 1 · Timeline events: 24 · Reactions: 2
Timeline (top): subscribed ×8, mentioned ×7, cross-referenced ×4, labeled ×2

Root Cause

Alternative: register a compiled_native implementation

I think this is worse, because the dispatching logic becomes more complex, and we'd have to dispatch differently in the "global" and "wrapped" regions.

Fix Action

Fix / Workaround

Sometimes we need to call an IR op inside another opaque torch custom op. That means the IR op will be invisible to model-level compilation, and dispatching to the raw native implementation will hurt performance. This problem is not unique to vLLM IR; it happens for CustomOp instances as well, and we currently circumvent it by wrapping forward_native with torch.compile.


PR fix notes

PR #38775: [vLLM IR] 4/N Compile native implementation

Description (problem / solution / changelog)

Purpose

As described in #38744, we need to compile the native implementations of IR ops. This is a draft implementation.

Test Plan

CI, e2e

Test Result



Changed files

  • docs/design/vllm_ir.md (added, +626/-0)
  • tests/compile/backend.py (modified, +15/-2)
  • tests/compile/passes/distributed/test_sequence_parallelism.py (modified, +13/-16)
  • tests/compile/passes/ir/test_clone_cleanup.py (added, +370/-0)
  • tests/compile/passes/ir/test_inplace_functionalization.py (added, +403/-0)
  • tests/compile/passes/test_functionalization.py (modified, +2/-2)
  • tests/compile/passes/test_fusion.py (modified, +4/-15)
  • tests/ir/test_compile.py (added, +167/-0)
  • tests/ir/test_inplace_op.py (added, +102/-0)
  • tests/ir/test_op.py (modified, +80/-10)
  • tests/kernels/ir/test_layernorm.py (modified, +11/-6)
  • tests/test_config.py (modified, +6/-3)
  • vllm/_aiter_ops.py (modified, +1/-99)
  • vllm/compilation/backends.py (modified, +19/-0)
  • vllm/compilation/passes/fusion/allreduce_rms_fusion.py (modified, +16/-9)
  • vllm/compilation/passes/fusion/matcher_utils.py (modified, +0/-67)
  • vllm/compilation/passes/fusion/rms_quant_fusion.py (modified, +26/-12)
  • vllm/compilation/passes/fusion/rocm_aiter_fusion.py (modified, +30/-16)
  • vllm/compilation/passes/fusion/sequence_parallelism.py (modified, +11/-9)
  • vllm/compilation/passes/inductor_pass.py (modified, +4/-0)
  • vllm/compilation/passes/ir/clone_elimination.py (added, +117/-0)
  • vllm/compilation/passes/ir/inplace_functionalization.py (added, +98/-0)
  • vllm/compilation/passes/ir/lowering_pass.py (modified, +7/-35)
  • vllm/compilation/passes/ir/utils.py (added, +40/-0)
  • vllm/compilation/passes/pass_manager.py (modified, +7/-1)
  • vllm/config/kernel.py (modified, +9/-1)
  • vllm/config/vllm.py (modified, +1/-2)
  • vllm/envs.py (modified, +3/-7)
  • vllm/ir/op.py (modified, +290/-16)
  • vllm/ir/ops/__init__.py (modified, +2/-2)
  • vllm/ir/ops/layernorm.py (modified, +25/-1)
  • vllm/kernels/__init__.py (modified, +2/-2)
  • vllm/kernels/aiter_ops.py (modified, +71/-0)
  • vllm/kernels/oink_ops.py (modified, +46/-5)
  • vllm/kernels/triton/__init__.py (added, +3/-0)
  • vllm/kernels/triton/layernorm_batch_invariant.py (added, +59/-0)
  • vllm/kernels/vllm_c.py (modified, +28/-0)
  • vllm/model_executor/layers/batch_invariant.py (modified, +1/-0)
  • vllm/model_executor/layers/layernorm.py (modified, +10/-213)
  • vllm/platforms/cuda.py (modified, +10/-3)

PR #38780: [vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops

Description (problem / solution / changelog)

Purpose

[vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops

Test Plan

Functional testing was run with Qwen3.5-9B on an A100.

Test Result



Changed files

  • vllm/ir/ops/layernorm.py (modified, +2/-3)
  • vllm/kernels/aiter_ops.py (modified, +4/-6)
  • vllm/kernels/vllm_c.py (modified, +5/-2)
  • vllm/kernels/xpu_ops.py (modified, +3/-1)
  • vllm/model_executor/layers/layernorm.py (modified, +11/-55)
RAW_BUFFER

Motivation.

Sometimes we need to call an IR op inside another opaque torch custom op. That means the IR op will be invisible to model-level compilation, and dispatching to the raw native implementation will hurt performance. This problem is not unique to vLLM IR; it happens for CustomOp instances as well, and we currently circumvent it by wrapping forward_native with torch.compile.

Prime examples of this are SiluAndMul and QuantFP8 inside fused_moe. The same mechanism is utilized by the _DecodeConcatQuantFP8 inside the MLA custom op.

Proposed Change.

We wrap the native implementation (or multiple implementations) with a torch.compile decorator. We can do that by setting IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...) (including dynamic shape annotations).
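The wrapping step can be sketched as follows. This is a minimal illustration, not the code in #38775: the `IrOpImpl` dataclass here is a hypothetical stand-in that models only `impl_fn`, and `compile_impls` is a made-up helper name. The compiler is passed in as `compile_fn` so the sketch runs without PyTorch; in real use it would presumably be `torch.compile`.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass
class IrOpImpl:
    """Hypothetical stand-in for vLLM's IrOpImpl; only impl_fn is modeled."""
    name: str
    impl_fn: Callable[..., Any]


def compile_impls(
    impls: Iterable[IrOpImpl],
    compile_fn: Callable[..., Callable[..., Any]],
    **compile_kwargs: Any,
) -> Dict[str, Callable[..., Any]]:
    """Replace each impl_fn with compile_fn(impl_fn, **compile_kwargs).

    Returns the original functions keyed by impl name so that the caller
    can restore them later.
    """
    originals: Dict[str, Callable[..., Any]] = {}
    for impl in impls:
        originals[impl.name] = impl.impl_fn
        impl.impl_fn = compile_fn(impl.impl_fn, **compile_kwargs)
    return originals
```

With PyTorch installed, a call might look like `compile_impls([native_impl], torch.compile, dynamic=True)`. The `dynamic` keyword is one way to request dynamic-shape tracing; how the dynamic shape annotations mentioned above map onto it is not specified in this text.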

The big question is lifetime: ideally we can set this inside the set_priority context and restore it afterwards, but will the compiled code persist across entries, or will it recompile every time? If torch doesn't cache this, we could cache it manually.
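One way to picture the manual-caching fallback is a scoped swap backed by a cache of compiled wrappers. The sketch below is an assumption-laden illustration: `compiled_native`, `get_compiled`, and `_compiled_cache` are hypothetical names, `impl` is any object with an `impl_fn` attribute, and `compile_fn` stands in for `torch.compile`. It does not model the real set_priority context.

```python
from contextlib import contextmanager

# Hypothetical cache: original function -> compiled wrapper.
_compiled_cache = {}


def get_compiled(fn, compile_fn):
    """Compile fn at most once and reuse the same wrapper afterwards."""
    if fn not in _compiled_cache:
        _compiled_cache[fn] = compile_fn(fn)
    return _compiled_cache[fn]


@contextmanager
def compiled_native(impl, compile_fn):
    """Temporarily swap impl.impl_fn for its cached compiled wrapper."""
    original = impl.impl_fn
    impl.impl_fn = get_compiled(original, compile_fn)
    try:
        yield impl
    finally:
        # Always restore the raw implementation, even if the body raised.
        impl.impl_fn = original
```

Because the wrapper object is kept, re-entering the context does not call `compile_fn` again, regardless of how torch itself caches compiled code. Note that this cache is process-global and holds strong references, so it shares state across callers in the same way the "just compile once" alternative does.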

We can optionally guard this with torch.compiler.is_compiling() although I think torch.compile already does that for us?
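The optional guard could look roughly like this. It is a sketch under stated assumptions: `compile_unless_tracing` is a hypothetical helper, and both the compiler and the predicate are injected so the example runs without PyTorch. In real use they would presumably be `torch.compile` and `torch.compiler.is_compiling`.

```python
import functools


def compile_unless_tracing(fn, compile_fn, is_compiling):
    """Return a dispatcher that uses the compiled wrapper in eager mode
    and the raw function while an outer compilation is tracing."""
    compiled = compile_fn(fn)

    @functools.wraps(fn)
    def dispatch(*args, **kwargs):
        if is_compiling():
            # Already inside an outer compile: expose the raw body to it.
            return fn(*args, **kwargs)
        return compiled(*args, **kwargs)

    return dispatch
```

Whether this guard is redundant, because torch.compile already handles nested compilation, is exactly the question raised above; the sketch only shows where such a check would sit.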

Draft implementation in #38775.

Alternative: just compile once

I worry this would let state escape arbitrarily, so multiple LLM instances with different configs would affect each other.

Alternative: register a compiled_native implementation

I think this is worse, because the dispatching logic becomes more complex, and we'd have to dispatch differently in the "global" and "wrapped" regions.

Feedback Period.

4/1 - 4/8

CC List.

@zou3519 @tjtanaa @gmagogsfm @angelayi @bringlein @LucasWilkinson @mgoin


extent analysis

TL;DR

Wrap the native implementation of IR ops with torch.compile to improve performance when calling them inside another opaque torch custom op.

Guidance

  • Consider setting IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...) to compile the native implementation, including dynamic shape annotations.
  • Investigate the lifetime of the compiled code and whether it persists across different contexts, potentially requiring manual caching.
  • Evaluate the use of torch.compiler.is_compiling() to guard the compilation process, if necessary.
  • Review the draft implementation in #38775 for a possible solution.

Example

IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...)

This code snippet illustrates how to wrap the native implementation with torch.compile.

Notes

The proposed change aims to address performance issues when calling IR ops inside custom ops. However, the lifetime of the compiled code and potential caching mechanisms need further investigation.

Recommendation

Apply workaround: wrap the native implementation with torch.compile so that IR ops called inside opaque custom ops no longer fall back to the raw native path. Of the options discussed, this is the one the RFC author favors; the compiled-code lifetime question remains open.
