vllm - [Performance]: RMSNorm op in v0.20 IR layer prevents further PyTorch/Triton op fusion

vllm-project/vllm#41804, fetched 2026-05-07 03:32:52

Proposal to improve performance

N/A

Report of performance regression

This was discovered when vLLM-Omni used vLLM's RMSNorm op and observed a performance regression. All discussion below assumes torch.compile is enabled and the platform is an H200 running CUDA.

Background/Our downstream analysis & setup

<details> <summary>Reproducible example (based on the links above)</summary>

Set up vLLM and vLLM-Omni

uv venv -p 3.12
. .venv/bin/activate
uv pip install vllm==0.20.0 --torch-backend=cu130
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e '.[dev]'

Run text-to-image generation with profiling, on main

python examples/offline_inference/text_to_image/text_to_image.py --model Qwen/Qwen-Image --prompt 'Random prompt 1 for benchmarking diffusion models' --negative-prompt 'Negative prompt 1 for benchmarking diffusion models' --width 1536 --height 1536 --num-inference-steps 3 --profiler-config '{
    "profiler": "torch",
    "torch_profiler_dir": "./profiler-0.20.0-cu130",
    "torch_profiler_with_stack": true,
    "torch_profiler_use_gzip": true,
    "torch_profiler_record_shapes": true
  }'
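
The profiling command embeds its profiler settings as a JSON string inside shell quotes. When the same run has to be repeated across several environments, generating the command can be less error-prone than editing the quoted JSON by hand. A minimal sketch, assuming only the flags and JSON keys shown in the command above; the helper name `build_profile_command` is hypothetical and not part of vLLM-Omni:

```python
import json
import shlex


def build_profile_command(profiler_dir: str, steps: int = 3) -> str:
    """Return the text-to-image profiling command as a shell-safe string.

    Hypothetical helper: it only re-assembles the flags and JSON keys
    that appear in the command above.
    """
    profiler_config = {
        "profiler": "torch",
        "torch_profiler_dir": profiler_dir,
        "torch_profiler_with_stack": True,
        "torch_profiler_use_gzip": True,
        "torch_profiler_record_shapes": True,
    }
    argv = [
        "python",
        "examples/offline_inference/text_to_image/text_to_image.py",
        "--model", "Qwen/Qwen-Image",
        "--prompt", "Random prompt 1 for benchmarking diffusion models",
        "--negative-prompt", "Negative prompt 1 for benchmarking diffusion models",
        "--width", "1536",
        "--height", "1536",
        "--num-inference-steps", str(steps),
        "--profiler-config", json.dumps(profiler_config),
    ]
    # shlex.join quotes every argument so the JSON survives the shell intact.
    return shlex.join(argv)


if __name__ == "__main__":
    print(build_profile_command("./profiler-0.20.0-cu130"))
```

Passing a different `profiler_dir` per environment keeps the traces from each vLLM version in separate directories.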

Run the same profiling on v0.19

# in vllm-omni folder
git checkout dd0fa02547aae9f57e1cb1c80b7db50a4161b8d2
uv venv .19 -p 3.12
. .19/bin/activate
uv pip install vllm==0.19.0 --torch-backend=cu129
uv pip install -e '.[dev]'

Run the same profiling with the vllm_c setting, even though torch.compile is on.

cd ..
git clone -b fix-qwen-image-perf-1 https://github.com/fhfuih/vllm-omni.git vllm-omni-patch
cd vllm-omni-patch
uv venv -p 3.12
. .venv/bin/activate
uv pip install vllm==0.20.0 --torch-backend=cu130
# install the patched checkout itself; cloning upstream vllm-omni again here would discard the patch
uv pip install -e '.[dev]'
</details>
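
To compare the runs, it can help to total GPU kernel time per kernel name from each exported trace instead of reading the timeline by eye. A minimal sketch, assuming the traces are Chrome-trace JSON files (optionally gzipped) with a top-level `traceEvents` list whose kernel events carry `cat == "kernel"` and a `dur` in microseconds, as the PyTorch profiler normally exports. The function names are hypothetical, and the file names inside the profiler directories are not specified in this report, so the path is left as a parameter:

```python
import gzip
import json
from collections import defaultdict


def kernel_time_by_name(trace_path: str) -> dict[str, float]:
    """Sum GPU kernel durations (microseconds) per kernel name.

    Assumes a Chrome-trace JSON file: a top-level "traceEvents" list whose
    complete events ("ph" == "X") carry "name", "cat" and "dur" fields.
    """
    opener = gzip.open if trace_path.endswith(".gz") else open
    with opener(trace_path, "rt", encoding="utf-8") as fh:
        events = json.load(fh)["traceEvents"]

    totals: dict[str, float] = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and event.get("cat") == "kernel":
            totals[event["name"]] += event.get("dur", 0.0)
    return dict(totals)


def top_kernels(trace_path: str, n: int = 10) -> list[tuple[str, float]]:
    """Return the n kernels with the largest total duration."""
    totals = kernel_time_by_name(trace_path)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Running `top_kernels` on one trace per environment gives lists that can be compared side by side to see which kernels appear, disappear, or grow.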

TLDR

In vllm v0.19 (with torch 2.10 and CUDA 12.9), RMSNorm becomes several native ops in the torch Dynamo output, which torch Inductor then fuses further.

In vllm v0.20 (with torch 2.11 and CUDA 13.0), RMSNorm is seemingly wrapped in one atomic torch op, which prevents torch from fusing it later, no matter what.
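
For readers unfamiliar with the op, the "several native ops" are the elementary steps of RMSNorm itself. The sketch below writes out the standard RMSNorm definition for a single vector in plain Python, one step per line. It is a reference for the arithmetic only, not vLLM's implementation, and the default `eps` value is an assumption chosen for illustration:

```python
import math


def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """Reference RMSNorm over one vector, one elementary step per line."""
    squared = [v * v for v in x]                    # pointwise: x^2
    mean_sq = sum(squared) / len(x)                 # reduction: mean(x^2)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)        # scalar: rsqrt(mean + eps)
    normed = [v * inv_rms for v in x]               # pointwise: x * rsqrt(...)
    return [n * w for n, w in zip(normed, weight)]  # pointwise: * weight


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Each commented line corresponds to a separate op when the computation is expressed with native tensor operations, which is what gives a compiler the chance to fuse them with each other and with neighbouring ops.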

Suggestion 1

When vLLM IR sets ir_op_priority to native, it wraps those unfused native ops in a single op, which potentially prevents torch Inductor from fusing them and leads to a ~15%-20% performance regression. CUDA graphs below:

vllm v0.19

<img width="1963" height="1205" alt="Image" src="https://github.com/user-attachments/assets/cd47b7f9-b9fd-40f9-aa5f-d70c99fa3260" />

vllm v0.20, with ir_op_priority set to native

<img width="2974" height="813" alt="Image" src="https://github.com/user-attachments/assets/99764533-fd32-407a-9f7d-f7eba5c9d915" />

(The huge pink block above corresponds to the flagged area)

Since this is "the platform defaults (when compiling with Inductor)", I assume the expected behavior is to "not fuse on vLLM's side" and "let torch inductor optimize it". So the current behavior goes against it.
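
The suspected mechanism can be illustrated with a toy model. This is a deliberately simplified sketch and not how torch Inductor is implemented: it assumes a compiler that fuses any run of consecutive visible ops into one kernel, and that treats a wrapper op as a barrier whose inner ops run one kernel each. All names (`Op`, `plan_kernels`, the op labels, the neighbouring `add` and `gelu` ops) are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    # Non-empty for an opaque wrapper: the ops it runs internally, which the
    # toy compiler cannot see and therefore launches one by one.
    hidden: tuple[str, ...] = ()


def plan_kernels(ops: list[Op]) -> list[list[str]]:
    """Greedily fuse consecutive visible ops; treat wrappers as barriers."""
    kernels: list[list[str]] = []
    current: list[str] = []
    for op in ops:
        if not op.hidden:
            current.append(op.name)
            continue
        if current:
            kernels.append(current)
            current = []
        kernels.extend([inner] for inner in op.hidden)
    if current:
        kernels.append(current)
    return kernels


# Hypothetical labels for the elementary RMSNorm steps.
RMS_STEPS = ("pow", "mean", "rsqrt", "mul", "mul_weight")

# RMSNorm visible as primitive ops between two neighbouring pointwise ops.
decomposed = [Op("add"), *(Op(step) for step in RMS_STEPS), Op("gelu")]
# The same computation with RMSNorm hidden behind a single wrapper op.
wrapped = [Op("add"), Op("rms_norm", hidden=RMS_STEPS), Op("gelu")]

if __name__ == "__main__":
    print(len(plan_kernels(decomposed)), "kernel(s) when decomposed")
    print(len(plan_kernels(wrapped)), "kernel(s) when wrapped")
```

In this toy model the decomposed sequence collapses into a single kernel, while the wrapped sequence launches every step separately. Real fusion decisions are more involved, so the sketch only conveys why an opaque boundary can cost performance.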

Suggestion 2 (lower priority, IDK if it is intended)

When vLLM IR sets ir_op_priority to vllm_c, vLLM's version of the fused op is slightly slower than torch Inductor's version.

vllm v0.19 (again)

<img width="972" height="510" alt="Image" src="https://github.com/user-attachments/assets/09163110-346b-41be-844c-702b37b4fd9c" />

~64us

vllm v0.20, with ir_op_priority set to vllm_c

<img width="1340" height="620" alt="Image" src="https://github.com/user-attachments/assets/4523c7cb-9a17-4467-8cf3-d2ad20ceec2b" />

~83us

I don't know if this is intended or the best we can get. But since the "native" version has performance degradation, we have to compare the vllm_c performance and the "old native" performance in production.
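
To put the two measurements on one scale, the per-kernel slowdown can be worked out directly. A small sketch using only the approximate ~64us and ~83us figures quoted above; the helper name is hypothetical:

```python
def relative_slowdown(baseline_us: float, candidate_us: float) -> float:
    """Return how much slower the candidate is, as a fraction of the baseline."""
    return (candidate_us - baseline_us) / baseline_us


inductor_us = 64.0  # v0.19 kernel, approximate value quoted above
vllm_c_us = 83.0    # v0.20 kernel with ir_op_priority = vllm_c, approximate

slowdown = relative_slowdown(inductor_us, vllm_c_us)
print(f"vllm_c kernel is about {slowdown:.0%} slower than the v0.19 kernel")
```

This is a per-kernel figure from single screenshots, so it says nothing by itself about end-to-end latency.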

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
