vllm: [RFC][vLLM IR] `rms_norm` weight passing inconsistency

GitHub stats
vllm-project/vllm#39370 · Fetched 2026-04-09 07:51:30
Comments: 0 · Participants: 1 · Timeline events: 13 · Reactions: 0
Timeline (top): mentioned ×3, project_v2_item_status_changed ×3, subscribed ×3, labeled ×2

Problem

In some models, the RMSNorm layer is unweighted (has_weight=False), which is equivalent to a weight tensor of all ones. It can equally be implemented by not applying the weight tensor at all. However, some implementations (vllm_c, aiter) always require the weight tensor.

Always passing the weight tensor does not seem to affect the performance of the native implementation, but it does break TPU support: the weight of an unweighted layer is not a parameter, so the XLA/JAX tensor conversion doesn't port the tensor over to TPU. Making it a parameter causes the torch.compile unit tests to start running backward through the model, which we can mitigate with @torch.inference_mode or torch.no_grad(). That still leaves us passing a weight that we know is all ones and not needed.
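The equivalence between "no weight" and "all-ones weight" can be checked with a small framework-free sketch. This is a plain-Python reference over a single vector, not vLLM's kernel; the function name `rms_norm`, its signature, and the `eps` default are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence


def rms_norm(
    x: Sequence[float],
    weight: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> List[float]:
    """Reference RMSNorm over one vector; weight=None skips the scaling step."""
    # Inverse root-mean-square of the input, stabilised by eps.
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v * inv_rms for v in x]
    if weight is None:
        # Unweighted layer: nothing to multiply by.
        return normed
    return [n * w for n, w in zip(normed, weight)]


x = [1.0, -2.0, 3.0, 0.5]
unweighted = rms_norm(x)
ones_weighted = rms_norm(x, weight=[1.0] * len(x))
```

Both calls produce the same values, so an implementation that accepts `weight=None` can skip both the allocation and the multiply.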

We cannot include weight in supports_args, because we don't want an implementation to be skipped just because it doesn't support weight=None. We also don't want to allocate an all-ones weight at call time, as that might affect eager-mode performance.
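A minimal sketch of why gating on weight in supports_args is undesirable. It assumes supports_args acts as a per-implementation predicate consulted in priority order; the `Impl` class, `select_impl` function, and the predicate signature are hypothetical and are not taken from the vLLM IR code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Impl:
    name: str
    # Hypothetical predicate: does this implementation accept the given weight?
    supports_args: Callable[[Optional[list]], bool]


def select_impl(impls: List[Impl], weight: Optional[list]) -> str:
    """Return the name of the first implementation whose predicate accepts the call."""
    for impl in impls:
        if impl.supports_args(weight):
            return impl.name
    raise RuntimeError("no implementation supports these arguments")


# Priority order: specialised kernel first, native fallback last.
impls = [
    Impl("vllm_c", lambda weight: weight is not None),  # kernel requires a weight
    Impl("native", lambda weight: True),                # accepts weight=None
]

selected_with_weight = select_impl(impls, [1.0, 1.0])
selected_without_weight = select_impl(impls, None)
```

In this sketch an unweighted layer silently falls through to the native implementation, which is the skipping behavior the issue wants to avoid.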

Possible solutions

  • Current solution: pass weight conditionally, based on which implementation will be selected. This does not work for fused ops. Perhaps we can do this for TPU only?
  • Alternative 1: Pass the weight tensor as a parameter and add torch.inference_mode to the torch.compile unit tests.
  • Alternative 2: Introduce a "workspace" concept where implementations are allowed to allocate workspace tensors once (and possibly size them up if needed). The difficulty is that this would happen after Dynamo tracing, because it has to happen after implementation selection during lowering. But this would be the most durable long-term fix, as workspaces are needed elsewhere as well.
  • Alternative 3: Just allocate weight in the implementations that need it and eat the overhead.
  • Alternative 4: Add a has_weight: bool parameter to the IR op. This would pollute the op, in my opinion.
  • Alternative 5: Rewrite the CUDA and AITER kernels to support weight=None. This would work but isn't scalable: what if we run into something similar for other kernels?
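The "workspace" idea in Alternative 2 could look roughly like the following framework-free sketch. It only illustrates one-time allocation of an all-ones weight for a kernel that always requires one; `WorkspaceCache`, `kernel_rms_norm`, and `rms_norm_impl` are hypothetical names, and the sketch does not address the Dynamo tracing and lowering order concern raised above.

```python
import math
from typing import Dict, List, Optional, Tuple


class WorkspaceCache:
    """Hypothetical one-time allocator for buffers, keyed by (name, size)."""

    def __init__(self) -> None:
        self._buffers: Dict[Tuple[str, int], List[float]] = {}
        self.allocations = 0  # counts real allocations, for inspection

    def get(self, name: str, size: int, fill: float) -> List[float]:
        key = (name, size)
        if key not in self._buffers:
            # First request for this size: allocate once and remember it.
            self._buffers[key] = [fill] * size
            self.allocations += 1
        return self._buffers[key]


workspace = WorkspaceCache()


def kernel_rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Stand-in for a kernel that always requires a weight argument."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rms_norm_impl(x: List[float], weight: Optional[List[float]] = None) -> List[float]:
    if weight is None:
        # Reuse the cached all-ones buffer instead of allocating per call.
        weight = workspace.get("rms_norm_ones", len(x), 1.0)
    return kernel_rms_norm(x, weight)


first = rms_norm_impl([3.0, 4.0])
second = rms_norm_impl([1.0, 2.0])
```

Two calls with the same hidden size trigger a single allocation here, which is the overhead difference between this approach and Alternative 3.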

I honestly don't know how to think through all the tradeoffs so would like some input.

CC

@lk-chen @tjtanaa @zou3519

extent analysis

TL;DR

The current workaround passes the weight tensor only when the selected implementation needs it. This keeps TPU support working but does not cover fused ops, so the longer-term design is still open.

Guidance

  • Evaluate the tradeoffs of each proposed solution, focusing on performance, compatibility, and scalability.
  • Consider implementing a conditional weight passing mechanism for TPU-only cases to address the immediate issue.
  • Investigate the feasibility of introducing a "workspace" concept to allow implementations to allocate workspace tensors, potentially providing a long-term fix.
  • Assess the impact of allocating weight in implementations that require it, weighing the overhead against the benefits of simplicity.

Example

The issue itself provides no code snippet; it focuses on high-level design and tradeoff discussion.

Notes

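None of the options is free of cost. Conditional passing does not work for fused ops; Alternative 1 still passes a weight known to be all ones; Alternative 2 has to run after Dynamo tracing; Alternative 3 adds allocation overhead; Alternative 4 widens the op signature; and Alternative 5 does not scale to other kernels. The choice depends on which of these costs the project is most willing to accept.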

Recommendation

Apply a workaround, such as passing the weight tensor conditionally, to address the immediate issue while exploring more comprehensive solutions, like introducing a "workspace" concept, to provide a long-term fix.
