vllm - ✅(Solved) Fix [RFC][vLLM IR]: Automatically compile native impl for IR ops [2 pull requests, 1 participant]

ProExpertProg · 2026-04-01T18:22:11Z

# PR #38775: [vLLM IR] 4/N Compile native implementation

- Repository: vllm-project/vllm
- Author: ProExpertProg
- State: open | merged: False
- Link: https://github.com/vllm-project/vllm/pull/38775

## Description (problem / solution / changelog)

## Purpose

As described in #38744, we need to compile the native implementations of native ops. This is a draft implementation.

## Test Plan

CI, e2e

## Test Result

---

Essential Elements of an Effective PR Description Checklist

- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).

## Changed files

- `docs/design/vllm_ir.md` (added, +626/-0)
- `tests/compile/backend.py` (modified, +15/-2)
- `tests/compile/passes/distributed/test_sequence_parallelism.py` (modified, +13/-16)
- `tests/compile/passes/ir/test_clone_cleanup.py` (added, +370/-0)
- `tests/compile/passes/ir/test_inplace_functionalization.py` (added, +403/-0)
- `tests/compile/passes/test_functionalization.py` (modified, +2/-2)
- `tests/compile/passes/test_fusion.py` (modified, +4/-15)
- `tests/ir/test_compile.py` (added, +167/-0)
- `tests/ir/test_inplace_op.py` (added, +102/-0)
- `tests/ir/test_op.py` (modified, +80/-10)
- `tests/kernels/ir/test_layernorm.py` (modified, +11/-6)
- `tests/test_config.py` (modified, +6/-3)
- `vllm/_aiter_ops.py` (modified, +1/-99)
- `vllm/compilation/backends.py` (modified, +19/-0)
- `vllm/compilation/passes/fusion/allreduce_rms_fusion.py` (modified, +16/-9)
- `vllm/compilation/passes/fusion/matcher_utils.py` (modified, +0/-67)
- `vllm/compilation/passes/fusion/rms_quant_fusion.py` (modified, +26/-12)
- `vllm/compilation/passes/fusion/rocm_aiter_fusion.py` (modified, +30/-16)
- `vllm/compilation/passes/fusion/sequence_parallelism.py` (modified, +11/-9)
- `vllm/compilation/passes/inductor_pass.py` (modified, +4/-0)
- `vllm/compilation/passes/ir/clone_elimination.py` (added, +117/-0)
- `vllm/compilation/passes/ir/inplace_functionalization.py` (added, +98/-0)
- `vllm/compilation/passes/ir/lowering_pass.py` (modified, +7/-35)
- `vllm/compilation/passes/ir/utils.py` (added, +40/-0)
- `vllm/compilation/passes/pass_manager.py` (modified, +7/-1)
- `vllm/config/kernel.py` (modified, +9/-1)
- `vllm/config/vllm.py` (modified, +1/-2)
- `vllm/envs.py` (modified, +3/-7)
- `vllm/ir/op.py` (modified, +290/-16)
- `vllm/ir/ops/__init__.py` (modified, +2/-2)
- `vllm/ir/ops/layernorm.py` (modified, +25/-1)
- `vllm/kernels/__init__.py` (modified, +2/-2)
- `vllm/kernels/aiter_ops.py` (modified, +71/-0)
- `vllm/kernels/oink_ops.py` (modified, +46/-5)
- `vllm/kernels/triton/__init__.py` (added, +3/-0)
- `vllm/kernels/triton/layernorm_batch_invariant.py` (added, +59/-0)
- `vllm/kernels/vllm_c.py` (modified, +28/-0)
- `vllm/model_executor/layers/batch_invariant.py` (modified, +1/-0)
- `vllm/model_executor/layers/layernorm.py` (modified, +10/-213)
- `vllm/platforms/cuda.py` (modified, +10/-3)

---

# PR #38780: [vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops

- Repository: vllm-project/vllm
- Author: wxsIcey
- State: closed | merged: True
- Link: https://github.com/vllm-project/vllm/pull/38780

## Description (problem / solution / changelog)

## Purpose

[vLLM IR][RMSNorm] Port GemmaRMSNorm to vLLM IR Ops

## Test Plan

Qwen3.5-9B functional testing has been conducted on the A100.

## Test Result

---

Essential Elements of an Effective PR Description Checklist

- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan, such as providing test command.
- [ ] The test results, such as pasting the results comparison before and after, or e2e results
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model.
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft in the [Google Doc](https://docs.google.com/document/d/1YyVqrgX4gHTtrstbq8oWUImOyPCKSGnJ7xtTpmXzlRs/edit?tab=t.0).

## Changed files

- `vllm/ir/ops/layernorm.py` (modified, +2/-3)
- `vllm/kernels/aiter_ops.py` (modified, +4/-6)
- `vllm/kernels/vllm_c.py` (modified, +5/-2)
- `vllm/kernels/xpu_ops.py` (modified, +3/-1)
- `vllm/model_executor/layers/layernorm.py` (modified, +11/-55)

## Fix / Workaround

Sometimes we ne


GitHub stats

vllm-project/vllm#38744•Fetched 2026-04-08 02:22:56


Author: ProExpertProg

Participants: ProExpertProg

Assignees: ProExpertProg

Timeline (top): subscribed ×8, mentioned ×7, cross-referenced ×4, labeled ×2


Motivation.

Sometimes we need to call an IR op inside another opaque torch custom op. That means the IR op will be invisible to model-level compilation, and dispatching to the raw native implementation will hurt performance. This problem is not unique to vLLM IR; it happens for CustomOp instances as well, and we currently circumvent it by wrapping forward_native with torch.compile.

Prime examples of this are SiluAndMul and QuantFP8 inside fused_moe. The same mechanism is utilized by the _DecodeConcatQuantFP8 inside the MLA custom op.

Proposed Change.

We wrap the native implementation (or multiple implementations) with a torch.compile decorator. We can do that by setting IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...) (including dynamic shape annotations).

The big question is lifetime: ideally we can set this with the set_priority context and restore it after, but will that persist the compiled code, or will it recompile every time? I guess if torch doesn't cache this, we could cache it manually?
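The manual-caching idea can be sketched without committing to how torch itself caches compiled code. In the sketch below, `IrOpImpl` is a hypothetical stand-in that models only `impl_fn`, `compiled_impl_scope` is an invented helper (it is not the real `set_priority` context), and `compile_fn` is a parameter that would be `torch.compile` in practice. The cache is passed in by the caller, so it can be scoped to one LLM instance rather than shared globally.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass
class IrOpImpl:
    """Hypothetical stand-in for the real class; only impl_fn is modelled."""
    impl_fn: Callable


@contextmanager
def compiled_impl_scope(
    impl: IrOpImpl,
    compile_fn: Callable[[Callable], Callable],
    cache: Dict[Callable, Callable],
) -> Iterator[IrOpImpl]:
    """Swap impl_fn for a compiled wrapper inside the block, then restore it."""
    original = impl.impl_fn
    # Compile at most once per original function for this cache; later
    # entries reuse the same wrapper object instead of wrapping again.
    if original not in cache:
        cache[original] = compile_fn(original)
    impl.impl_fn = cache[original]
    try:
        yield impl
    finally:
        # Restore even if the body raised, so the override cannot leak.
        impl.impl_fn = original
```

Keeping the wrapper object alive in the caller's dictionary means re-entering the scope never calls `compile_fn` twice for the same function. Whether torch would otherwise recompile remains the open question from the RFC; this sketch only sidesteps it.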

We can optionally guard this with torch.compiler.is_compiling() although I think torch.compile already does that for us?

Draft implementation in #38775.

Alternative: just compile once

I worry this would let state escape arbitrarily, so multiple LLM instances with different configs would affect each other.

Alternative: register a `compiled_native` implementation

I think this is worse, because the dispatching logic becomes more complex, and we'd have to dispatch differently in the "global" and "wrapped" regions.

Feedback Period.

4/1 - 4/8

CC List.

@zou3519 @tjtanaa @gmagogsfm @angelayi @bringlein @LucasWilkinson @mgoin

Any Other Things.

No response


extent analysis

TL;DR

Wrap the native implementation of IR ops with torch.compile to improve performance when calling them inside another opaque torch custom op.

Guidance

Consider setting IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...) to compile the native implementation, including dynamic shape annotations.
Investigate the lifetime of the compiled code and whether it persists across different contexts, potentially requiring manual caching.
Evaluate the use of torch.compiler.is_compiling() to guard the compilation process, if necessary.
Review the draft implementation in #38775 for a possible solution.

Example

IrOpImpl.impl_fn = torch.compile(IrOpImpl.impl_fn, ...)

This code snippet illustrates how to wrap the native implementation with torch.compile.

Notes

The proposed change aims to address performance issues when calling IR ops inside custom ops. However, the lifetime of the compiled code and potential caching mechanisms need further investigation.

Recommendation

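Apply workaround: Wrap the native implementation with torch.compile, as the RFC proposes. The RFC argues against both alternatives: compiling once could let state escape, so multiple LLM instances with different configs would affect each other, and registering a separate `compiled_native` implementation makes the dispatching logic more complex.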
