vllm - 💡(How to fix) Fix [Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0)

vllm2026-05-11 15:14:53

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

vLLM's RMSNorm CUDA kernel (csrc/layernorm_kernels.cu) ignores the weight dtype and always multiplies in FP32, violating the Python reference specification in vllm/ir/ops/layernorm.py. This regression was introduced in v0.20.0 (commit 4d51588e23, PR #40860) and affects all subsequent versions including the current main branch.

Error Message

For models with RMSNorm in multiple layers (e.g., attention Q/K normalization), the error compounds:

Root Cause

The bug was introduced in commit 4d51588e23 (DeepSeek V4 feature PR #40860, April 26, 2026). This PR added FP8 quantization support and modified both the FP8 quantization kernels and the regular RMSNorm kernels.

In the FP8 quantization context, the FP32 multiplication is actually correct because the fused kernel must match the precision of the unfused composite operation (RMSNorm → BF16 cast → FP8 quant).

However, the same pattern was applied to the regular RMSNorm kernel where it's incorrect because it violates the specification that multiplication should happen in weight dtype.

This appears to be an unintended side effect rather than a deliberate design change.

Fix Action

Fix

Revert the regular RMSNorm kernel to v0.19.1 behavior:

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

The same fix is needed in fused_add_rms_norm_kernel (line ~182):

 for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }

Important: Keep the FP32 cast in layernorm_quant_kernels.cu — it's correct there for FP8 quantization.

Code Example

// csrc/layernorm_kernels.cu (lines 77-82, v0.20.0+)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  float w = static_cast<float>(src2.val[j]);  // ❌ Always FP32
  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
}

---

# vllm/ir/ops/layernorm.py (lines 19-21)
if weight is not None:
    x = x.to(weight.dtype) * weight  # ← Multiply in weight.dtype
return x.to(orig_dtype)

---

// csrc/layernorm_kernels.cu (lines 77-82, v0.19.1)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];  // ✅ Respects weight dtype
}

---

Max divergence from reference: ~0.03 to 0.06

---

import torch
from vllm import _custom_ops as ops

def reference_rmsnorm(x, weight, eps=1e-6):
    """Reference matching vLLM IR ops spec."""
    orig_dtype = x.dtype
    x_fp32 = x.to(torch.float32)
    variance = x_fp32.pow(2).mean(-1, keepdim=True)
    x_normed = x_fp32 * torch.rsqrt(variance + eps)
    result = x_normed.to(weight.dtype) * weight  # Multiply in weight.dtype
    return result.to(orig_dtype)

# Test with BF16 weights
x = torch.randn(48, 128, dtype=torch.bfloat16, device='cuda')
weight = torch.randn(128, dtype=torch.bfloat16, device='cuda')

out = torch.empty_like(x)
ops.rms_norm(out, x, weight, 1e-6)

ref = reference_rmsnorm(x, weight)
diff = (out.float() - ref.float()).abs().max()

print(f"Max diff: {diff:.6e}")
# v0.19.1: ~0.0 (correct)
# v0.20.0+: ~0.03 (bug)

---

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

---

for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }

RAW_BUFFERClick to expand / collapse

Your current environment

This is a kernel implementation bug affecting all platforms with CUDA. Environment details are not relevant as the issue is in the source code itself (csrc/layernorm_kernels.cu).

Affected versions: v0.20.0, v0.20.1, v0.20.2, and current main branch
Last working version: v0.19.1

🐛 Describe the bug

Summary

Actual Behavior (v0.20.0+)

The kernel casts weight to FP32 and performs multiplication in FP32:

// csrc/layernorm_kernels.cu (lines 77-82, v0.20.0+)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  float w = static_cast<float>(src2.val[j]);  // ❌ Always FP32
  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
}

Expected Behavior (Specification)

The Python reference in vllm/ir/ops/layernorm.py specifies that multiplication should happen in weight dtype, not FP32:

# vllm/ir/ops/layernorm.py (lines 19-21)
if weight is not None:
    x = x.to(weight.dtype) * weight  # ← Multiply in weight.dtype
return x.to(orig_dtype)

v0.19.1 and earlier correctly respected the weight dtype:

// csrc/layernorm_kernels.cu (lines 77-82, v0.19.1)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];  // ✅ Respects weight dtype
}

Version Timeline

Version	Release Date	Status	Behavior
v0.19.1	April 15, 2026	✅ Correct	Respects weight dtype
v0.20.0	April 27, 2026	❌ Bug	Always uses FP32
v0.20.1	May 4, 2026	❌ Bug	Always uses FP32
v0.20.2	May 5, 2026	❌ Bug	Always uses FP32
main	Current	❌ Bug	Always uses FP32

Root Cause

However, the same pattern was applied to the regular RMSNorm kernel where it's incorrect because it violates the specification that multiplication should happen in weight dtype.

This appears to be an unintended side effect rather than a deliberate design change.

Impact

Precision Mismatch

When using BF16 weights (the common case for modern LLMs), the kernel:

Expected behavior: Convert normalized values to BF16, multiply in BF16
Actual behavior: Multiply in FP32, then convert to BF16

This causes measurable numerical divergence:

Max divergence from reference: ~0.03 to 0.06

Cumulative Effects

For models with RMSNorm in multiple layers (e.g., attention Q/K normalization), the error compounds:

Single layer: divergence = 0.03
20 layers: divergence = 124 (exponential growth)
Final output match rate: ~2%

This can significantly affect model accuracy, especially for models that rely on precise normalization (e.g., LLaDA2 with Q/K normalization).

Reproduction

import torch
from vllm import _custom_ops as ops

def reference_rmsnorm(x, weight, eps=1e-6):
    """Reference matching vLLM IR ops spec."""
    orig_dtype = x.dtype
    x_fp32 = x.to(torch.float32)
    variance = x_fp32.pow(2).mean(-1, keepdim=True)
    x_normed = x_fp32 * torch.rsqrt(variance + eps)
    result = x_normed.to(weight.dtype) * weight  # Multiply in weight.dtype
    return result.to(orig_dtype)

# Test with BF16 weights
x = torch.randn(48, 128, dtype=torch.bfloat16, device='cuda')
weight = torch.randn(128, dtype=torch.bfloat16, device='cuda')

out = torch.empty_like(x)
ops.rms_norm(out, x, weight, 1e-6)

ref = reference_rmsnorm(x, weight)
diff = (out.float() - ref.float()).abs().max()

print(f"Max diff: {diff:.6e}")
# v0.19.1: ~0.0 (correct)
# v0.20.0+: ~0.03 (bug)

Test results:

Version	Max Diff	Status
v0.19.1	~0.0	✅ Matches specification
v0.20.0	~0.03	❌ Violates specification
v0.20.1	~0.03	❌ Violates specification
v0.20.2	~0.03	❌ Violates specification

Fix

Revert the regular RMSNorm kernel to v0.19.1 behavior:

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

The same fix is needed in fused_add_rms_norm_kernel (line ~182):

 for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }

Important: Keep the FP32 cast in layernorm_quant_kernels.cu — it's correct there for FP8 quantization.

Additional Context

Python reference specification: vllm/ir/ops/layernorm.py
Introducing commit: 4d51588e23 (DeepSeek V4 Rebased #40860, April 26, 2026)
Last good release: v0.19.1 (April 15, 2026)
Python IR ops remain correct: Only the CUDA kernel is affected; the Python reference in vllm/ir/ops/layernorm.py correctly follows the specification

This is a regression that breaks the contract between the IR ops specification and the kernel implementation. The fix restores specification compliance without affecting the FP8 quantization features added in v0.20.0.

Before submitting a new issue...

I have searched for relevant issues, and asked the chatbot at the documentation page
I have tested this bug across multiple vLLM versions (v0.19.1, v0.20.0, v0.20.1) to confirm it's a regression
I have identified the root cause commit and analyzed the code change

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#inference speed #output truncation #response parsing #generation error #database connection

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix

Code Example

Your current environment

🐛 Describe the bug

Summary

Actual Behavior (v0.20.0+)

Expected Behavior (Specification)

Version Timeline

Root Cause

Impact

Precision Mismatch

Cumulative Effects

Reproduction

Fix

Additional Context

Before submitting a new issue...

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0)

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Fix Action

Fix

Code Example

Your current environment

🐛 Describe the bug

Summary

Actual Behavior (v0.20.0+)

Expected Behavior (Specification)

Version Timeline

Root Cause

Impact

Precision Mismatch

Cumulative Effects

Reproduction

Fix

Additional Context

Before submitting a new issue...

Still need to ship something?

RELATED_DISCOVERY

TRENDING