vllm - ✅(Solved) Fix fused_add_rms_norm does not branch on has_weight=False (TODO(luka)); FlashNorm weightless RMSNorm cannot realize a speedup on the GPU path [3 pull requests, 1 participant]

vllm • 2026-05-01 00:20:18

GitHub stats

vllm-project/vllm#41430•Fetched 2026-05-01 05:33:40

Author

fm1320

Participants

fm1320

Timeline (top)

cross-referenced ×2

vllm/model_executor/layers/layernorm.py already supports RMSNorm(..., has_weight=False) (added in #40117 for the Gemma-4 KV-shared k_norm). When has_weight=False is set, RMSNorm.weight becomes a buffer of ones (not a registered Parameter) and forward_native correctly passes weight=None into vllm.ir.ops.rms_norm, which has the if weight is not None: x *= weight branch and skips the per-channel multiply.

However, the fused CUDA kernel path used by forward_cuda (vllm._custom_ops.fused_add_rms_norm and rms_norm) does not branch on has_weight=False. It always multiplies by the weight tensor, even when that tensor is the buffer of ones. The mathematics are preserved (multiply by ones is a no-op) but the kernel work is identical to the weighted case, so any speedup expected from has_weight=False is not realized on the GPU path.

An adjacent TODO(luka) marker at vllm/model_executor/layers/layernorm.py:240 (inside forward_native, where weight=None is already passed correctly) acknowledges the broader weight=None passing inconsistency. The actual shim that prevents a speedup on CUDA today is in vllm/kernels/vllm_c.py:rms_norm, which substitutes a tensor of ones for None and dispatches to torch.ops._C.rms_norm, since the C++ kernel requires a weight tensor.
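
The difference between the two paths can be illustrated with a minimal pure-Python sketch. It is a reference model of the arithmetic only, not vLLM's implementation: `rms_norm` and `rms_norm_with_ones_shim` are hypothetical stand-ins, and plain lists stand in for tensors.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Reference RMSNorm over one vector; skips the multiply when weight is None."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v * inv_rms for v in x]
    if weight is not None:
        # Per-channel multiply: the step a weightless path can elide.
        out = [o * w for o, w in zip(out, weight)]
    return out


def rms_norm_with_ones_shim(x, weight=None, eps=1e-6):
    """Hypothetical stand-in for the shim: replace None with ones, then always multiply."""
    if weight is None:
        weight = [1.0] * len(x)
    return rms_norm(x, weight, eps)


x = [0.5, -1.0, 2.0, 4.0]
weightless = rms_norm(x)
shimmed = rms_norm_with_ones_shim(x)

# Same numbers either way; the shimmed path just performs an extra multiply by 1.0.
assert weightless == shimmed
```

Both calls return identical values, which is why the shim is numerically safe but saves no work.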

Root Cause

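The fused CUDA kernels behind forward_cuda always multiply by a weight tensor, so vLLM does not realize the speedup that weightless RMSNorm already delivers on the equivalent HuggingFace Transformers path (+12.77% end-to-end on the same model and hardware). That is the gain the kernel-side fix in this issue would unlock for vLLM deployments. The companion loader PR (see Companion section below) makes FlashNorm-folded checkpoints loadable in vLLM but cannot deliver the speedup by itself, because forward_cuda runs the kernel as if the weight were a learned tensor.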

Fix / Workaround

We verified empirically that the loader-side integration is wired correctly and the kernel is the remaining gap. With the loader patch from the companion PR applied (so has_weight=False is set on Llama RMSNorms when the source HF config sets flashnorm_folded: true):

vLLM 0.9.2, Colab A100, bf16, 256-token greedy decode, median of 20 trials, enforce_eager=True, tensor_parallel_size=1.
Baseline unsloth/Llama-3.2-1B-Instruct (stock vLLM, no patches): 94.09 tok/s.
Flashified open-machine/Llama-3.2-1B-FlashNorm with has_weight=False set via the loader patch: 94.05 tok/s.

PR fix notes

PR #41431: fix(llama): use weightless RMSNorm for FlashNorm-folded checkpoints (has_weight=False)

Repository: vllm-project/vllm
Author: fm1320
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41431

Description (problem / solution / changelog)

Summary

Adds the same one-line has_weight=False plumbing pattern as PR #40117 (fix(gemma4): use weightless k_norm for KV-shared layers) to the Llama loader so that FlashNorm-folded HuggingFace checkpoints load cleanly. Three call sites in vllm/model_executor/models/llama.py get one extra keyword argument each, gated by a flashnorm_folded flag on the model config.

This is purely a loader-side fix. It unblocks loading the weightless-rmsnorm collection on HuggingFace (Llama, Qwen, SmolLM, etc.; this PR covers the Llama path) but does not deliver any speedup by itself, because forward_cuda does not yet branch on has_weight=False. That kernel-side gap is tracked in a companion issue.

Problem

Loading open-machine/Llama-3.2-1B-FlashNorm (or any other FlashNorm-folded Llama checkpoint with "flashnorm_folded": true in config.json) in stock vLLM fails at engine init:

from vllm import LLM
LLM(model='open-machine/Llama-3.2-1B-FlashNorm', dtype='bfloat16', enforce_eager=True)
# RuntimeError: Engine core initialization failed.

Root cause: vllm/model_executor/models/llama.py always calls RMSNorm(...) with the default has_weight=True, so params_dict expects every per-layer *_layernorm.weight and model.norm.weight tensor. FlashNorm-folded checkpoints either (a) keep those weights as all-ones for HF compatibility or (b) omit them; neither layout loads cleanly today.

Background: FlashNorm and the precedent

FlashNorm (Graef et al., 2024) folds the per-channel RMSNorm weight into the next linear, leaving an RMSNorm whose weight is mathematically all ones. A runtime that recognizes the all-ones case can skip the per-channel multiply and recover an end-to-end speedup.
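
The folding step can be checked numerically with a small pure-Python sketch. It assumes a bias-free linear layer stored as a row-major weight matrix; `fold_norm_weight`, `linear`, and the example values are illustrative and not taken from the paper or from vLLM.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    """Reference RMSNorm; the per-channel multiply is skipped when weight is None."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v * inv_rms for v in x]
    if weight is not None:
        out = [o * w for o, w in zip(out, weight)]
    return out


def linear(W, x):
    """Bias-free linear layer: y = W @ x, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def fold_norm_weight(W, g):
    """Scale column j of W by g[j], i.e. W' = W @ diag(g)."""
    return [[w_ij * g_j for w_ij, g_j in zip(row, g)] for row in W]


x = [0.5, -1.0, 2.0]
g = [1.5, 0.25, -2.0]                      # per-channel RMSNorm weight
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]    # next linear layer

baseline = linear(W, rms_norm(x, g))                       # weighted norm, original W
folded = linear(fold_norm_weight(W, g), rms_norm(x))       # weightless norm, folded W

assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(baseline, folded))
```

After folding, the norm carries no information in its weight, so the only remaining cost is whatever the runtime still spends multiplying by ones.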

vLLM's RMSNorm class at vllm/model_executor/layers/layernorm.py already supports has_weight=False, added in #40117 for Gemma-4 KV-shared k_norm. This PR extends the same pattern to the Llama path.

Proposed change

Three RMSNorm call sites in vllm/model_executor/models/llama.py get one keyword argument each, gated by a flashnorm_folded flag read from the model's config:

flashnorm_folded = getattr(config, "flashnorm_folded", False)

# In LlamaDecoderLayer.__init__:
self.input_layernorm = RMSNorm(
    config.hidden_size, eps=config.rms_norm_eps,
    has_weight=not flashnorm_folded,
)
self.post_attention_layernorm = RMSNorm(
    config.hidden_size, eps=config.rms_norm_eps,
    has_weight=not flashnorm_folded,
)

# In LlamaModel.__init__:
self.norm = RMSNorm(
    config.hidden_size, eps=config.rms_norm_eps,
    has_weight=not flashnorm_folded,
)

The flashnorm_folded flag is read from the HF config via getattr(...) with a False default, so existing non-flashified Llama checkpoints are unaffected. No schema changes anywhere.
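
As a quick illustration of the gating, the following sketch uses `types.SimpleNamespace` as a stand-in for the HF config object. The helper name `norm_has_weight` and the config values are illustrative assumptions, not part of the PR.

```python
from types import SimpleNamespace


def norm_has_weight(config):
    """Return the has_weight value for RMSNorm, defaulting to the weighted path."""
    return not getattr(config, "flashnorm_folded", False)


# A stock config has no flashnorm_folded attribute at all.
stock_config = SimpleNamespace(hidden_size=2048, rms_norm_eps=1e-5)
# A FlashNorm-folded config carries "flashnorm_folded": true in config.json.
folded_config = SimpleNamespace(hidden_size=2048, rms_norm_eps=1e-5, flashnorm_folded=True)

print(norm_has_weight(stock_config))   # True
print(norm_has_weight(folded_config))  # False
```

Because the attribute lookup falls back to False, a checkpoint that never mentions the flag keeps has_weight=True and behaves exactly as before.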

What this PR does and does not do

Does: makes the FlashNorm-folded Llama checkpoints on the HF Hub loadable in vLLM. Engine init succeeds, the model runs, and outputs are numerically equivalent to the unfolded baseline (multiply by ones is a no-op).
Does not: deliver any speedup. The CUDA dispatch through vllm/kernels/vllm_c.py:rms_norm substitutes a ones tensor when weight is None (the C++ kernel _custom_ops.rms_norm requires a weight), and forward_cuda's residual path calls _custom_ops.fused_add_rms_norm which similarly requires weight. An adjacent TODO(luka) at vllm/model_executor/layers/layernorm.py:240 acknowledges the broader weight=None passing inconsistency.

The two pieces are sequenced: this PR is the loader-side compatibility fix; the kernel-side performance fix is described in the companion issue (linked below) and would unlock the actual speedup.

Empirical verification

We tested the runtime equivalent of this source-level patch (a monkey-patch on RMSNorm.__init__ plus a small loader-side filter for the all-ones tensors that the HF-compat checkpoint still carries) on Colab A100, vllm 0.9.2, bf16, 256-token greedy decode, 20 trials, enforce_eager=True, tensor_parallel_size=1:

Model	tok/s
`unsloth/Llama-3.2-1B-Instruct` (stock vLLM, no patches)	94.09
`open-machine/Llama-3.2-1B-FlashNorm` (with `has_weight=False` set on Llama RMSNorms)	94.05

The model loads cleanly, runs, and produces correct outputs. The -0.04% delta confirms that the kernel does the same work in both cases (within measurement noise) and isolates the remaining speedup to the kernel-side fix.
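
The loader-side filter mentioned above is not shown in the PR description. A minimal sketch of what such a filter might look like follows, assuming checkpoint weights arrive as (name, tensor) pairs and using flat lists in place of tensors; `filter_folded_norm_weights` and the suffix list are hypothetical.

```python
NORM_SUFFIXES = ("_layernorm.weight", "model.norm.weight")


def filter_folded_norm_weights(weights, flashnorm_folded):
    """Drop all-ones norm weights that a weightless RMSNorm has no parameter for."""
    for name, tensor in weights:
        if flashnorm_folded and name.endswith(NORM_SUFFIXES):
            # Refuse to silently discard a weight that was not actually folded.
            if any(v != 1.0 for v in tensor):
                raise ValueError(f"{name} is not all ones; checkpoint is not folded")
            continue
        yield name, tensor


checkpoint = [
    ("model.layers.0.input_layernorm.weight", [1.0, 1.0]),
    ("model.layers.0.self_attn.q_proj.weight", [0.3, -0.7]),
    ("model.layers.0.post_attention_layernorm.weight", [1.0, 1.0]),
    ("model.norm.weight", [1.0, 1.0]),
]

kept = [name for name, _ in filter_folded_norm_weights(checkpoint, True)]
print(kept)  # ['model.layers.0.self_attn.q_proj.weight']
```

The all-ones check is a design choice of this sketch: it keeps a mislabeled checkpoint from loading with its learned norm weights thrown away.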

For comparison, the equivalent path in HuggingFace Transformers, where torch's F.rms_norm(weight=None) already implements the weight-null branch, gives +12.77% end-to-end on the same model and hardware. That is the gain the kernel-side companion would close.

Companion issue

The kernel-side fix that actually delivers the speedup is described in: https://github.com/vllm-project/vllm/issues/41430

Together, this PR plus the kernel issue make FlashNorm a first-class option in vLLM deployments.

Other architectures

The same one-line pattern applies to mistral.py, qwen2.py, qwen3.py, phi3.py, and any other RMSNorm-based architecture whose checkpoint can be flashified. Happy to follow up with separate small PRs for those if there is interest; this PR is scoped to Llama for review focus.

References

FlashNorm paper: https://arxiv.org/abs/2407.09577
vLLM PR #40117 (precedent): https://github.com/vllm-project/vllm/pull/40117
vLLM Issue #39370 (related RFC, orthogonal): https://github.com/vllm-project/vllm/issues/39370
llama.cpp Issue #22486 (the equivalent ask in llama.cpp): https://github.com/ggml-org/llama.cpp/issues/22486
HuggingFace Transformers reproducibility for the +12.77% number: https://github.com/OpenMachine-ai/transformer-tricks/blob/main/notebooks/flashNorm_hf_a100.ipynb

Changed files

vllm/model_executor/layers/layernorm.py (modified, +7/-2)
vllm/model_executor/models/llama.py (modified, +27/-3)

PR #14: FlashNorm: realized end-to-end speedup in HuggingFace Transformers (+12.77% on Llama-3.2-1B / A100 / bf16)

Repository: OpenMachine-ai/transformer-tricks
Author: fm1320
State: closed | merged: True
Link: https://github.com/OpenMachine-ai/transformer-tricks/pull/14

Description (problem / solution / changelog)

Summary

Adds a new measurement to the FlashNorm paper: the realized end-to-end speedup of FlashNorm Proposition 1 in stock HuggingFace Transformers, via PyTorch's existing F.rms_norm(weight=None) weightless C++ code path applied as a runtime monkey-patch on LlamaRMSNorm.forward.

Headline: +12.77% on open-machine/Llama-3.2-1B-FlashNorm at bf16, NVIDIA A100 (40 GB), 256-token greedy decode, median of 20 trials with 3 warmup. Same model both variants; the gain isolates kernel choice from folding.

This composes with the decode-profile appendix from #13 (now merged): that one establishes the upper bound at 24.6%; this PR reports how much is realized today via kernel choice alone (roughly half).

Companion upstream contributions (filed; details in the "Other-runtime integrations" section below):

vLLM loader PR: vllm-project/vllm#41431 (fix(llama): use weightless RMSNorm for FlashNorm-folded checkpoints (has_weight=False)).
vLLM kernel issue: vllm-project/vllm#41430 (fused_add_rms_norm does not branch on has_weight=False; needed for the GPU-path speedup).
llama.cpp tracking issue: ggml-org/llama.cpp#22486 (weightless RMSNorm for FlashNorm; runtime kernel exists, loader needs TENSOR_NOT_REQUIRED).

What's in this PR

File	Change
`tex/flashNorm.tex`	One sentence appended to the existing Findings paragraph of `app:rmsnorm-fraction` documenting the realized +12.77% speedup. Plus three small follow-up corrections to the merged content from #13 (see below). Net diff: 3 in-place line modifications, zero new lines. No new sections, no main-body changes.
`notebooks/flashNorm_hf_a100.ipynb`	New 8-cell Colab A100 notebook reproducing the measurement. ~3 min wall-clock.
`notebooks/README.md`	One badge entry for the new notebook.

No new bib entries required.

Bundled corrections to `app:rmsnorm-fraction` (merged in #13)

Three small follow-ups derived from existing reported values (no new measurements):

tab:rmsnorm-fraction 4-bit row: norm fraction 15.01% -> 9.31% (the arithmetically consistent value: 1650 ms / 17720 ms = 0.0931). The norm time and total decode are kept; they are the values the invariance claim of 1640 / 1675 / 1650 ms relies on.
Findings paragraph: the sequence "24.62% -> 8.19% -> 15.01%" becomes "24.62% -> 8.19% -> 9.31%". The "non-monotonic across precisions" narrative still holds; the rebound at 4-bit is smaller than the original table indicated.
Table caption: "A100, fp16 inputs" becomes "A100, bf16 inputs". The notebook loads the model at torch.bfloat16 and the row is labelled bf16.
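
The corrected 4-bit fraction can be re-derived from the two values quoted above. A short check, using only the norm time and total decode time stated for that row:

```python
norm_ms = 1650.0           # 4-bit row, norm time
total_decode_ms = 17720.0  # 4-bit row, total decode time

norm_fraction = norm_ms / total_decode_ms
print(f"{norm_fraction:.4f}")  # 0.0931
```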

Reproducibility

notebooks/flashNorm_hf_a100.ipynb runs end to end on a Colab A100 in ~3 minutes:

Pip-installs the latest transformers and accelerate (same install pattern as flashNorm_decode_profile.ipynb).
Loads open-machine/Llama-3.2-1B-FlashNorm at bf16 with eager attention.
Benchmarks variant A (HF default LlamaRMSNorm.forward, multiplies by the all-ones weight) and variant B (runtime monkey-patch to F.rms_norm(x, shape, None, eps), skips the multiply).
Saves a JSON results file and downloads it in Colab.

No vLLM, no llama.cpp, no version-pinning conflicts with the rest of the Colab kernel.
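
The measurement protocol (3 warmup runs, then the median of 20 timed trials) can be sketched as a small harness. This is not the notebook's code: `median_tokens_per_second` and `fake_generate` are hypothetical, and a real GPU benchmark would also need to synchronize the device before reading the clock.

```python
import statistics
import time


def median_tokens_per_second(generate, n_tokens, trials=20, warmup=3):
    """Run warmup calls, then return the median throughput over the timed trials."""
    for _ in range(warmup):
        generate(n_tokens)
    rates = []
    for _ in range(trials):
        start = time.perf_counter()
        generate(n_tokens)
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return statistics.median(rates)


calls = []


def fake_generate(n_tokens):
    """Stand-in workload; a real run would call the model's greedy decode here."""
    calls.append(n_tokens)
    return sum(i * i for i in range(n_tokens * 100))


rate = median_tokens_per_second(fake_generate, n_tokens=256)
print(f"{rate:.1f} tok/s over {len(calls)} calls")
```

Using the median rather than the mean keeps a single slow trial from skewing the reported throughput.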

Other-runtime integrations (work in progress, not in this PR)

We separately tested the same FlashNorm-folded checkpoint in vLLM and llama.cpp; both are integration-blocked today, with concrete upstream paths to land them. Out of scope for this PR; mentioned here so reviewers see the broader picture.

vLLM: applied the runtime equivalent of the proposed source-level loader fix (has_weight=False plumbing modeled on PR vllm-project/vllm#40117, which added the same pattern for Gemma-4 KV-shared k_norm). With those patches, the FlashNorm-folded checkpoint loads cleanly and runs in vLLM with correct outputs. The speedup itself is not yet realized: the CUDA RMSNorm kernels require a weight tensor (the vllm_c.py impl substitutes ones for None, and _custom_ops.fused_add_rms_norm similarly takes a required weight), so the multiply still happens on the GPU path; an adjacent TODO(luka) at vllm/model_executor/layers/layernorm.py:240 acknowledges this. Measured delta: -0.04% (within noise). Two upstream contributions filed: loader PR at vllm-project/vllm#41431, kernel issue at vllm-project/vllm#41430.
llama.cpp: the converter convert_hf_to_gguf.py correctly recognizes flashnorm_folded: true in the source HF config.json and drops the per-layer norm tensors when writing the GGUF. The runtime then fails to load the resulting GGUF with error loading model: missing tensor 'blk.0.attn_norm.weight'. The runtime model-load path requires the per-layer norm tensors to be present and there is no "treat absent norm as identity" branch. Tracked in upstream issue ggml-org/llama.cpp#22486; we will file a comment there with the empirical reproduction recipe and propose a runtime-side fix.

These integrations will land as separate upstream contributions in those projects' repos. This PR is deliberately scoped to the HF result that is realizable today.

Status

Draft. Open for review.

Changed files

notebooks/README.md (modified, +1/-0)
notebooks/flashNorm_hf_a100.ipynb (added, +222/-0)
tex/flashNorm.tex (modified, +3/-3)

Why it matters

FlashNorm (Graef et al.) folds the per-channel RMSNorm weight into the next linear, leaving an RMSNorm whose weight is mathematically all ones. A runtime that recognizes the all-ones weight (or accepts None) can skip the per-channel multiply and recover an end-to-end speedup. Concretely, on the equivalent path in HuggingFace Transformers (F.rms_norm(x, shape, weight=None, eps), the torch C++ kernel that already implements the weight-null branch this issue asks vLLM to implement):

Hardware: NVIDIA A100 (40 GB), bf16, 256-token greedy decode, 20 trials with 3 warmup, eager attention.
Same model class as is typically deployed via vLLM (Llama-3.2-1B-Instruct, open-machine/Llama-3.2-1B-FlashNorm).
Variant A (default LlamaRMSNorm.forward, multiplies by all-ones): 45.00 tok/s.
Variant B (F.rms_norm(weight=None), skips multiply): 50.75 tok/s.
Delta: +12.77% end-to-end (median of 20 trials).

Empirical confirmation that the kernel is the bottleneck

vLLM 0.9.2, Colab A100, bf16, 256-token greedy decode, median of 20 trials, enforce_eager=True, tensor_parallel_size=1.
Baseline unsloth/Llama-3.2-1B-Instruct (stock vLLM, no patches): 94.09 tok/s.
Flashified open-machine/Llama-3.2-1B-FlashNorm with has_weight=False set via the loader patch: 94.05 tok/s.

Delta: -0.04% (within noise), confirming the model loads and runs with correct outputs but the kernel does the same work in both cases. The fused CUDA kernel multiplies by the buffer of ones unconditionally.

For comparison, the HuggingFace Transformers measurement on the same model and hardware (which routes through torch's C++ kernel that does branch on weight is None) gives +12.77%. The gap between -0.04% and +12.77% is exactly what closing this kernel-side TODO would recover.
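
The two deltas can be recomputed from the reported medians. Note that the rounded HF medians give about +12.78%, so the reported +12.77% was presumably computed from unrounded values; that explanation is an assumption, not something the issue states.

```python
def relative_delta_percent(baseline, variant):
    """Relative change of variant over baseline, in percent."""
    return (variant - baseline) / baseline * 100.0


vllm_delta = relative_delta_percent(94.09, 94.05)  # stock vs. has_weight=False in vLLM
hf_delta = relative_delta_percent(45.00, 50.75)    # variant A vs. variant B in HF

print(f"vLLM: {vllm_delta:+.2f}%")  # vLLM: -0.04%
print(f"HF:   {hf_delta:+.2f}%")    # HF:   +12.78%
```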

Proposed change

In vllm/_custom_ops (the CUDA implementation surface used by forward_cuda):

Add a has_weight=False branch to fused_add_rms_norm (and rms_norm for the non-residual case) that skips the per-channel multiply when the flag is set. Two reasonable implementations:
- (a) Branch in the existing kernel: accept an optional has_weight: bool argument and skip the multiply when false. Simpler if the multiply is a separable step that can be elided cleanly.
- (b) Sibling weightless kernel: ship fused_add_rms_norm_weightless (and rms_norm_weightless) and dispatch from the Python wrapper based on self.has_weight. Cleaner if the multiply is fused into the load/store path of the existing kernel and a separate kernel can be tuned independently.
The CUDA dispatch in vllm/kernels/vllm_c.py:rms_norm (and any sibling fused_add_rms_norm impl) detects weight is None (or a has_weight=False flag) and dispatches to the weightless C++ variant instead of substituting a ones tensor.

A reasonable strategy: start with (b) to avoid risking regression on the weighted path; later evaluate consolidation to (a) if the runtime flag does not measurably regress the weighted kernel.
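
Option (b) can be sketched at the dispatch level in pure Python. This models only the arithmetic and the branching: the function names are hypothetical, lists stand in for tensors, and the functions return new values where the real op may update its arguments in place.

```python
import math


def _inv_rms(x, eps):
    return 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)


def fused_add_rms_norm_weighted(x, residual, weight, eps):
    """Add the residual, normalize, then apply the per-channel weight."""
    new_residual = [a + b for a, b in zip(x, residual)]
    scale = _inv_rms(new_residual, eps)
    return [v * scale * w for v, w in zip(new_residual, weight)], new_residual


def fused_add_rms_norm_weightless(x, residual, eps):
    """Sibling variant: same add and normalize, no per-channel multiply."""
    new_residual = [a + b for a, b in zip(x, residual)]
    scale = _inv_rms(new_residual, eps)
    return [v * scale for v in new_residual], new_residual


def fused_add_rms_norm(x, residual, weight=None, eps=1e-6):
    """Dispatch on weight is None instead of substituting a ones tensor."""
    if weight is None:
        return fused_add_rms_norm_weightless(x, residual, eps)
    return fused_add_rms_norm_weighted(x, residual, weight, eps)


x, residual = [0.5, -1.0, 2.0], [1.0, 1.0, -0.5]
out_none, res_none = fused_add_rms_norm(x, residual)
out_ones, res_ones = fused_add_rms_norm(x, residual, weight=[1.0, 1.0, 1.0])

assert out_none == out_ones and res_none == res_ones
```

Keeping the weighted function untouched is the point of option (b): the new branch cannot change the behavior of existing weighted models.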

Effect on the runtime once this lands

vLLM with both the companion loader PR and this kernel fix uses the weightless code path on the GPU for FlashNorm-folded checkpoints.
forward_cuda skips the per-channel multiply, matching the savings the HF / torch path already delivers.
Expected end-to-end gain on Llama-3.2-1B-FlashNorm at bf16 / A100: in the same neighborhood as the +12.77% HF measurement, composed with vLLM's existing serving advantage.
The TODO(luka) at layernorm.py:240 and the ones-substitution shim in vllm/kernels/vllm_c.py:rms_norm can both be removed.

Companion

A loader-side PR is being submitted separately against vllm/model_executor/models/llama.py (one-line per call site, modeled on PR #40117's Gemma-4 KV-shared k_norm pattern), gated by a flashnorm_folded config flag. It unblocks loading FlashNorm-folded HuggingFace checkpoints (open-machine/Llama-3.2-1B-FlashNorm and the rest of the weightless-rmsnorm HF collection) but does not deliver the speedup by itself. This kernel issue describes the second piece, which is the actual source of the runtime gain.

References

FlashNorm paper: https://arxiv.org/abs/2407.09577
HuggingFace Transformers reproducibility (where the +12.77% is realized today via the equivalent torch path): https://github.com/OpenMachine-ai/transformer-tricks/blob/main/notebooks/flashNorm_hf_a100.ipynb
vLLM PR #40117 (fix(gemma4): use weightless k_norm for KV-shared layers): the precedent for has_weight=False plumbing on the loader side.
vLLM Issue #39370 ([RFC][vLLM IR] rms_norm weight passing inconsistency): orthogonal but informs the kernel API design (whether to use has_weight: bool or weight: Optional[Tensor] as the signaling mechanism).
llama.cpp Issue #22486 (Feature Request: Support weightless RMSNorm for FlashNorm weight folding trick): the equivalent ask in llama.cpp; the runtime kernel there already supports weight-null and only the loader needs TENSOR_NOT_REQUIRED.

I'm happy to discuss the kernel-side approach (option (a) vs option (b) above) or to draft a PR if there is interest from a maintainer who knows the _custom_ops surface well.

Extended analysis

TL;DR

Modify the CUDA kernel in vllm/_custom_ops to add a branch for has_weight=False to skip the per-channel multiply, enabling a speedup for FlashNorm-folded models.

Guidance

Identify the CUDA kernel implementation in vllm/_custom_ops that needs modification, specifically fused_add_rms_norm and rms_norm.
Add a has_weight=False branch to these kernels to skip the per-channel multiply when the flag is set, using either option (a) branching in the existing kernel or option (b) creating a sibling weightless kernel.
Update the CUDA dispatch in vllm/kernels/vllm_c.py:rms_norm to detect weight is None or has_weight=False and dispatch to the weightless C++ variant.
Verify the fix by measuring the end-to-end performance gain on FlashNorm-folded models, expecting a gain similar to the +12.77% measured in HuggingFace Transformers.

Notes

The proposed change requires modifying the CUDA kernel implementation, which may require expertise in CUDA programming and the vLLM codebase. The issue provides two possible implementation options, and the choice between them may depend on the specific requirements and constraints of the vLLM project.

Recommendation

Add a has_weight=False path to the CUDA kernels, preferring option (b), a sibling weightless kernel, as the safer and more straightforward approach. This enables the speedup for FlashNorm-folded models without risking a regression on the weighted path.
