vllm - [Performance]: RMSNorm op in v0.20 IR layer prevents further PyTorch/Triton op fusion


Proposal to improve performance

N/A

Report of performance regression

This was discovered when vLLM-Omni used vLLM's RMSNorm op and observed a performance regression. All discussion below assumes torch.compile is enabled and the platform is an H200 on the CUDA platform.

Background/Our downstream analysis & setup

<details> <summary>Reproducible example (based on the links above)</summary>

Set up vLLM and vLLM-Omni

uv venv -p 3.12
. .venv/bin/activate
uv pip install vllm==0.20.0 --torch-backend=cu130
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e '.[dev]'

Run text-to-image generation with profiling, on main

python examples/offline_inference/text_to_image/text_to_image.py --model Qwen/Qwen-Image --prompt 'Random prompt 1 for benchmarking diffusion models' --negative-prompt 'Negative prompt 1 for benchmarking diffusion models' --width 1536 --height 1536 --num-inference-steps 3 --profiler-config '{
    "profiler": "torch",
    "torch_profiler_dir": "./profiler-0.20.0-cu130",
    "torch_profiler_with_stack": true,
    "torch_profiler_use_gzip": true,
    "torch_profiler_record_shapes": true
  }'
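If quoting the inline JSON is awkward in your shell, the same command can be assembled programmatically. This is only a sketch: `build_profile_command` is a hypothetical helper (not part of vLLM-Omni), and it simply reproduces the flags and profiler keys used in the command above.

```python
import json
import shlex


def build_profile_command(trace_dir, steps=3, size=1536):
    """Return the argv list for the text-to-image example with torch profiling on.

    Hypothetical helper: it mirrors the flags from the command above and
    serializes the profiler config with json.dumps to avoid manual quoting.
    """
    profiler_config = {
        "profiler": "torch",
        "torch_profiler_dir": trace_dir,
        "torch_profiler_with_stack": True,
        "torch_profiler_use_gzip": True,
        "torch_profiler_record_shapes": True,
    }
    return [
        "python",
        "examples/offline_inference/text_to_image/text_to_image.py",
        "--model", "Qwen/Qwen-Image",
        "--prompt", "Random prompt 1 for benchmarking diffusion models",
        "--negative-prompt", "Negative prompt 1 for benchmarking diffusion models",
        "--width", str(size),
        "--height", str(size),
        "--num-inference-steps", str(steps),
        "--profiler-config", json.dumps(profiler_config),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line; pass the list to subprocess.run to execute it.
    print(shlex.join(build_profile_command("./profiler-0.20.0-cu130")))
```

Using a different `trace_dir` per vLLM version keeps the v0.19 and v0.20 traces apart for later comparison.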

Run the same profiling on v0.19

# in vllm-omni folder
git checkout dd0fa02547aae9f57e1cb1c80b7db50a4161b8d2
uv venv .19 -p 3.12
. .19/bin/activate
uv pip install vllm==0.19.0 --torch-backend=cu129
uv pip install -e '.[dev]'

Run the same profiling with ir_op_priority set to vllm_c even when torch.compile is on (patched branch).

cd ..
git clone -b fix-qwen-image-perf-1 https://github.com/fhfuih/vllm-omni.git vllm-omni-patch
cd vllm-omni-patch
uv venv -p 3.12
. .venv/bin/activate
uv pip install vllm==0.20.0 --torch-backend=cu130
# Install the patched checkout itself; re-cloning upstream vllm-omni here would bypass the patch.
uv pip install -e '.[dev]'

</details>

TLDR

In vllm v0.19 (with torch 2.10 and CUDA 12.9), RMSNorm becomes several native ops in the TorchDynamo output, which Inductor then fuses into a generated kernel.

In vllm v0.20 (with torch 2.11 and CUDA 13.0), RMSNorm appears to be wrapped in a single atomic torch op, which prevents torch from fusing it later, no matter what.
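For readers unfamiliar with the op, here is a plain-Python sketch of the standard RMSNorm formula, `y = x / sqrt(mean(x^2) + eps) * weight`. It is not vLLM's implementation, and `rms_norm_reference` is a name chosen for illustration. Each commented step corresponds roughly to one elementwise or reduction op. When a compiler like Inductor sees these steps individually in the graph, it can fuse them; when they sit behind a single atomic op, it cannot.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6):
    """Reference RMSNorm over one hidden vector (illustrative, not vLLM code)."""
    squares = [v * v for v in x]               # elementwise square
    mean_sq = sum(squares) / len(x)            # reduction: mean over the hidden dim
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)   # add eps, then reciprocal sqrt
    # elementwise scale by 1/rms, then by the learned weight
    return [v * inv_rms * w for v, w in zip(x, weight)]


if __name__ == "__main__":
    print(rms_norm_reference([3.0, 4.0], [1.0, 1.0]))
```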

Suggestion 1

When vLLM IR sets ir_op_priority to native, it wraps those unfused native ops into one op, which appears to stop Inductor from fusing them and leads to a ~15%-20% performance regression. CUDA graphs below:

[Image: vllm v0.19]

[Image: vllm v0.20, with ir_op_priority set to native]

(The huge pink block above corresponds to the flagged area.)

Since native is described as "the platform defaults (when compiling with Inductor)", I assume the expected behavior is to not fuse on vLLM's side and let Inductor optimize it. The current behavior goes against that.
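To compare the two versions numerically rather than by eye, the gzipped profiler traces can be summarized per GPU kernel. The sketch below assumes the traces are in the Chrome trace JSON layout that the torch profiler exports: a top-level `traceEvents` list, with GPU kernels as complete events (`"ph": "X"`) in category `"kernel"` and durations in microseconds. The function names are hypothetical, and the file layout inside the profiler directory is not assumed.

```python
import gzip
import json
from collections import defaultdict


def kernel_time_by_name(trace_path):
    """Sum the durations (microseconds) of GPU kernel events per kernel name."""
    # Traces written with torch_profiler_use_gzip are gzip-compressed JSON.
    opener = gzip.open if str(trace_path).endswith(".gz") else open
    with opener(trace_path, "rt", encoding="utf-8") as f:
        trace = json.load(f)

    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event with a duration; "kernel" is assumed
        # to be the category used for GPU kernels.
        if event.get("ph") == "X" and event.get("cat") == "kernel":
            totals[event["name"]] += event.get("dur", 0.0)
    return dict(totals)


def top_kernels(trace_path, n=10):
    """Return the n kernels with the largest total duration, largest first."""
    totals = kernel_time_by_name(trace_path)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Running `top_kernels` on one trace from each profiler directory should show whether the norm-related work appears as fused Inductor-generated kernels or as separate kernels.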

Suggestion 2 (lower priority, IDK if it is intended)

When vLLM IR sets ir_op_priority to vllm_c, vLLM's version of the fused op is slightly slower than Inductor's version:

vllm v0.19 (again): ~64 us

vllm v0.20, with ir_op_priority set to vllm_c: ~83 us

I don't know whether this is intended or the best we can get. But since the "native" version has a performance degradation, we have to compare the vllm_c performance against the "old native" (v0.19) performance in production.
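As a quick worked calculation on the two kernel timings quoted above (both approximate readings, so the result is approximate too):

```python
def relative_slowdown(baseline_us, candidate_us):
    """Fractional slowdown of the candidate relative to the baseline."""
    return (candidate_us - baseline_us) / baseline_us


if __name__ == "__main__":
    # ~64 us (v0.19, Inductor-fused) vs ~83 us (v0.20, vllm_c)
    print(f"{relative_slowdown(64, 83):.1%}")  # 29.7%
```

This is a per-kernel figure and is separate from the ~15%-20% end-to-end regression reported for the native setting.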

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
