vllm - 💡(How to fix) Fix [Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0)

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

vLLM's RMSNorm CUDA kernel (csrc/layernorm_kernels.cu) ignores the weight dtype and always multiplies in FP32, violating the Python reference specification in vllm/ir/ops/layernorm.py. This regression was introduced in v0.20.0 (commit 4d51588e23, PR #40860) and affects all subsequent versions including the current main branch.

Error Message

For models with RMSNorm in multiple layers (e.g., attention Q/K normalization), the error compounds:

Root Cause

The bug was introduced in commit 4d51588e23 (DeepSeek V4 feature PR #40860, April 26, 2026). This PR added FP8 quantization support and modified both the FP8 quantization kernels and the regular RMSNorm kernels.

In the FP8 quantization context, the FP32 multiplication is actually correct because the fused kernel must match the precision of the unfused composite operation (RMSNorm → BF16 cast → FP8 quant).

However, the same pattern was applied to the regular RMSNorm kernel where it's incorrect because it violates the specification that multiplication should happen in weight dtype.

This appears to be an unintended side effect rather than a deliberate design change.

Fix Action

Fix

Revert the regular RMSNorm kernel to v0.19.1 behavior:

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

The same fix is needed in fused_add_rms_norm_kernel (line ~182):

 for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }

Important: Keep the FP32 cast in layernorm_quant_kernels.cu — it's correct there for FP8 quantization.

Code Example

// csrc/layernorm_kernels.cu (lines 77-82, v0.20.0+)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  float w = static_cast<float>(src2.val[j]);  // ❌ Always FP32
  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
}

---

# vllm/ir/ops/layernorm.py (lines 19-21)
if weight is not None:
    x = x.to(weight.dtype) * weight  # ← Multiply in weight.dtype
return x.to(orig_dtype)

---

// csrc/layernorm_kernels.cu (lines 77-82, v0.19.1)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];  // ✅ Respects weight dtype
}

---

Max divergence from reference: ~0.03 to 0.06

---

import torch
from vllm import _custom_ops as ops

def reference_rmsnorm(x, weight, eps=1e-6):
    """Reference matching vLLM IR ops spec."""
    orig_dtype = x.dtype
    x_fp32 = x.to(torch.float32)
    variance = x_fp32.pow(2).mean(-1, keepdim=True)
    x_normed = x_fp32 * torch.rsqrt(variance + eps)
    result = x_normed.to(weight.dtype) * weight  # Multiply in weight.dtype
    return result.to(orig_dtype)

# Test with BF16 weights
x = torch.randn(48, 128, dtype=torch.bfloat16, device='cuda')
weight = torch.randn(128, dtype=torch.bfloat16, device='cuda')

out = torch.empty_like(x)
ops.rms_norm(out, x, weight, 1e-6)

ref = reference_rmsnorm(x, weight)
diff = (out.float() - ref.float()).abs().max()

print(f"Max diff: {diff:.6e}")
# v0.19.1: ~0.0 (correct)
# v0.20.0+: ~0.03 (bug)

---

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

---

for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }
RAW_BUFFERClick to expand / collapse

Your current environment

This is a kernel implementation bug affecting all platforms with CUDA. Environment details are not relevant as the issue is in the source code itself (csrc/layernorm_kernels.cu).

Affected versions: v0.20.0, v0.20.1, v0.20.2, and current main branch
Last working version: v0.19.1

🐛 Describe the bug

Summary

vLLM's RMSNorm CUDA kernel (csrc/layernorm_kernels.cu) ignores the weight dtype and always multiplies in FP32, violating the Python reference specification in vllm/ir/ops/layernorm.py. This regression was introduced in v0.20.0 (commit 4d51588e23, PR #40860) and affects all subsequent versions including the current main branch.

Actual Behavior (v0.20.0+)

The kernel casts weight to FP32 and performs multiplication in FP32:

// csrc/layernorm_kernels.cu (lines 77-82, v0.20.0+)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  float w = static_cast<float>(src2.val[j]);  // ❌ Always FP32
  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
}

Expected Behavior (Specification)

The Python reference in vllm/ir/ops/layernorm.py specifies that multiplication should happen in weight dtype, not FP32:

# vllm/ir/ops/layernorm.py (lines 19-21)
if weight is not None:
    x = x.to(weight.dtype) * weight  # ← Multiply in weight.dtype
return x.to(orig_dtype)

v0.19.1 and earlier correctly respected the weight dtype:

// csrc/layernorm_kernels.cu (lines 77-82, v0.19.1)
#pragma unroll
for (int j = 0; j < VEC_SIZE; j++) {
  float x = static_cast<float>(src1.val[j]);
  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];  // ✅ Respects weight dtype
}

Version Timeline

VersionRelease DateStatusBehavior
v0.19.1April 15, 2026✅ CorrectRespects weight dtype
v0.20.0April 27, 2026❌ BugAlways uses FP32
v0.20.1May 4, 2026❌ BugAlways uses FP32
v0.20.2May 5, 2026❌ BugAlways uses FP32
mainCurrent❌ BugAlways uses FP32

Root Cause

The bug was introduced in commit 4d51588e23 (DeepSeek V4 feature PR #40860, April 26, 2026). This PR added FP8 quantization support and modified both the FP8 quantization kernels and the regular RMSNorm kernels.

In the FP8 quantization context, the FP32 multiplication is actually correct because the fused kernel must match the precision of the unfused composite operation (RMSNorm → BF16 cast → FP8 quant).

However, the same pattern was applied to the regular RMSNorm kernel where it's incorrect because it violates the specification that multiplication should happen in weight dtype.

This appears to be an unintended side effect rather than a deliberate design change.

Impact

Precision Mismatch

When using BF16 weights (the common case for modern LLMs), the kernel:

  • Expected behavior: Convert normalized values to BF16, multiply in BF16
  • Actual behavior: Multiply in FP32, then convert to BF16

This causes measurable numerical divergence:

Max divergence from reference: ~0.03 to 0.06

Cumulative Effects

For models with RMSNorm in multiple layers (e.g., attention Q/K normalization), the error compounds:

  • Single layer: divergence = 0.03
  • 20 layers: divergence = 124 (exponential growth)
  • Final output match rate: ~2%

This can significantly affect model accuracy, especially for models that rely on precise normalization (e.g., LLaDA2 with Q/K normalization).

Reproduction

import torch
from vllm import _custom_ops as ops

def reference_rmsnorm(x, weight, eps=1e-6):
    """Reference matching vLLM IR ops spec."""
    orig_dtype = x.dtype
    x_fp32 = x.to(torch.float32)
    variance = x_fp32.pow(2).mean(-1, keepdim=True)
    x_normed = x_fp32 * torch.rsqrt(variance + eps)
    result = x_normed.to(weight.dtype) * weight  # Multiply in weight.dtype
    return result.to(orig_dtype)

# Test with BF16 weights
x = torch.randn(48, 128, dtype=torch.bfloat16, device='cuda')
weight = torch.randn(128, dtype=torch.bfloat16, device='cuda')

out = torch.empty_like(x)
ops.rms_norm(out, x, weight, 1e-6)

ref = reference_rmsnorm(x, weight)
diff = (out.float() - ref.float()).abs().max()

print(f"Max diff: {diff:.6e}")
# v0.19.1: ~0.0 (correct)
# v0.20.0+: ~0.03 (bug)

Test results:

VersionMax DiffStatus
v0.19.1~0.0✅ Matches specification
v0.20.0~0.03❌ Violates specification
v0.20.1~0.03❌ Violates specification
v0.20.2~0.03❌ Violates specification

Fix

Revert the regular RMSNorm kernel to v0.19.1 behavior:

// csrc/layernorm_kernels.cu (line ~77-82)
 #pragma unroll
 for (int j = 0; j < VEC_SIZE; j++) {
   float x = static_cast<float>(src1.val[j]);
-  float w = static_cast<float>(src2.val[j]);
-  dst.val[j] = static_cast<scalar_t>(x * s_variance * w);
+  dst.val[j] = ((scalar_t)(x * s_variance)) * src2.val[j];
 }

The same fix is needed in fused_add_rms_norm_kernel (line ~182):

 for (int idx = threadIdx.x; idx < hidden_size; idx += blockDim.x) {
   float x = (float)residual[blockIdx.x * hidden_size + idx];
-  float w = (float)weight[idx];
-  input[blockIdx.x * input_stride + idx] = (scalar_t)(x * s_variance * w);
+  input[blockIdx.x * input_stride + idx] = ((scalar_t)(x * s_variance)) * weight[idx];
 }

Important: Keep the FP32 cast in layernorm_quant_kernels.cu — it's correct there for FP8 quantization.

Additional Context

  • Python reference specification: vllm/ir/ops/layernorm.py
  • Introducing commit: 4d51588e23 (DeepSeek V4 Rebased #40860, April 26, 2026)
  • Last good release: v0.19.1 (April 15, 2026)
  • Python IR ops remain correct: Only the CUDA kernel is affected; the Python reference in vllm/ir/ops/layernorm.py correctly follows the specification

This is a regression that breaks the contract between the IR ops specification and the kernel implementation. The fix restores specification compliance without affecting the FP8 quantization features added in v0.20.0.


Before submitting a new issue...

  • I have searched for relevant issues, and asked the chatbot at the documentation page
  • I have tested this bug across multiple vLLM versions (v0.19.1, v0.20.0, v0.20.1) to confirm it's a regression
  • I have identified the root cause commit and analyzed the code change

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: RMSNorm kernel ignores weight dtype, always uses FP32 (regression in v0.20.0)