vllm - ✅(Solved) Fix [Bug]: Batch invariant aten::IMPL overrides bypassed under torch.compile for RMSNorm [1 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#39096 (fetched 2026-04-08 03:02:00)
Comments: 2 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, commented ×2, cross-referenced ×1

PR fix notes

PR #38566: [Bugfix][CI] Skip flaky test_eagle test

Description (problem / solution / changelog)

Tentative fix for https://github.com/vllm-project/vllm/issues/31913, do not merge until reviewed and approved locally.

EDIT: I think disabling async with DP>1 is too harsh. Still, I would appreciate any comments with more context to clarify whether this is indeed not needed.

Changed files

  • tests/v1/distributed/test_eagle_dp.py (modified, +8/-1)


Your current environment

Reproducible on L4 (SM89) GPUs in CI. Environment details to be added once reproduced on an L4 node.

🐛 Describe the bug

When VLLM_BATCH_INVARIANT=1 is set with torch.compile active (enforce_eager=False), RMSNorm's batch invariant path is bypassed. This causes batch-size-dependent numerical differences that break speculative decoding determinism on L4 (SM89).

Root Cause

vLLM's batch invariance registers aten::IMPL overrides in enable_batch_invariant_mode() (e.g., aten::mean.dim → deterministic Triton kernel with fixed reduction order). However, under torch.compile with custom_ops=["none"] (the default when Inductor is active):

  1. CustomOp.dispatch_forward() → RMSNorm.enabled() returns False (because default_on() returns False when custom_ops=["none"])
  2. RMSNorm dispatches to forward_native instead of forward_cuda
  3. forward_native calls ir.ops.rms_norm → x.pow(2).mean(dim=-1, keepdim=True) (pure PyTorch ops)
  4. Inductor lowers these to its own IR and generates its own Triton reduction kernels, bypassing the aten::mean.dim override at the dispatch table
  5. The Inductor-generated reduction kernel produces batch-size-dependent results on L4

In contrast, forward_cuda calls rms_norm_batch_invariant() directly, which is deterministic regardless of batch size.
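
A reduction's grouping matters because floating-point addition is not associative: a kernel that splits a row's reduction differently (for example, if its tiling depends on the batch dimension) can round differently. The sketch below is a plain-Python illustration of that effect only. `split_sum` is a hypothetical helper, and nothing here models the actual Inductor-generated kernel or vLLM's override.

```python
def sequential_sum(values):
    """Accumulate left to right, rounding once per addition."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def split_sum(values, split):
    """Reduce values[:split] and values[split:] separately, then combine.

    `split` stands in for a kernel's tiling choice (hypothetical).
    """
    return sequential_sum(values[:split]) + sequential_sum(values[split:])


row = [0.1, 0.2, 0.3]

a = split_sum(row, 2)  # (0.1 + 0.2) + 0.3
b = split_sum(row, 1)  # 0.1 + (0.2 + 0.3)

print(a == b)  # False: same inputs, different grouping, different rounding
```

A batch-invariant kernel avoids this by fixing the reduction order so that it never depends on how many rows are processed together.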

Why this wasn't caught earlier

The e2e batch invariant tests (tests/v1/determinism/test_batch_invariance.py) set enforce_eager=IS_DEVICE_CAPABILITY_BELOW_90, so on L4 (SM89) they run in eager mode → custom_ops=["all"] → forward_cuda is used. The issue was only exposed by test_eagle_dp, which uses enforce_eager=False unconditionally.
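
The gating described above can be written out as a small decision table. This is a simplified sketch based only on the behaviour stated in this issue; the function names are hypothetical and it does not call into vLLM.

```python
def default_custom_ops(enforce_eager):
    """Defaults as described in the issue: eager -> ["all"], compiled -> ["none"]."""
    return ["all"] if enforce_eager else ["none"]


def rmsnorm_path(enforce_eager):
    """Which RMSNorm forward the default custom_ops lead to (simplified)."""
    ops = default_custom_ops(enforce_eager)
    return "forward_cuda" if "all" in ops else "forward_native"


def e2e_test_enforce_eager(sm):
    """Mirrors enforce_eager=IS_DEVICE_CAPABILITY_BELOW_90 in the e2e tests."""
    return sm < 90


L4_SM = 89

# e2e batch invariance test on L4: runs eagerly, so the invariant path is used.
e2e_path = rmsnorm_path(e2e_test_enforce_eager(L4_SM))

# test_eagle_dp on L4: enforce_eager=False unconditionally.
eagle_path = rmsnorm_path(False)

print(e2e_path, eagle_path)
```

On SM89 the two tests therefore exercise different RMSNorm paths, which is why only test_eagle_dp surfaced the problem.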

Minimal repro

import os

# Set before importing vLLM.
os.environ["VLLM_BATCH_INVARIANT"] = "1"

import torch

from vllm.model_executor.layers.batch_invariant import enable_batch_invariant_mode
from vllm.model_executor.layers.layernorm import RMSNorm

enable_batch_invariant_mode()

hidden_size = 4096
eps = 1e-5
norm = RMSNorm(hidden_size, eps=eps).cuda()

# Row 0 of the batch is identical to the single-row input.
torch.manual_seed(42)
x_single = torch.randn(1, hidden_size, dtype=torch.bfloat16, device="cuda")
x_batch = torch.randn(4, hidden_size, dtype=torch.bfloat16, device="cuda")
x_batch[0] = x_single[0]

# forward_cuda path (batch invariant): should match
out_cuda_single = norm.forward_cuda(x_single.clone())
out_cuda_batch = norm.forward_cuda(x_batch.clone())
print("forward_cuda match:", torch.equal(out_cuda_single[0], out_cuda_batch[0]))

# forward_native path (compiled by Inductor): may NOT match on L4.
# A direct call to forward_native would run eagerly, so wrap it in torch.compile.
compiled_native = torch.compile(norm.forward_native)
out_native_single = compiled_native(x_single.clone())
out_native_batch = compiled_native(x_batch.clone())
print("forward_native match:", torch.equal(out_native_single[0], out_native_batch[0]))

On L4, forward_cuda should produce matching results while forward_native may not.
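
The comparison the repro performs (row 0 alone versus row 0 inside a larger batch) can be factored into a reusable check. The sketch below uses plain Python lists and two toy reductions so that it runs without a GPU. All names are hypothetical; with tensors the equality test would be torch.equal, as in the repro.

```python
def is_batch_invariant(fn, row, other_rows):
    """True if row 0's output is bitwise identical alone and inside a batch."""
    alone = fn([row])[0]
    batched = fn([row] + list(other_rows))[0]
    return alone == batched  # exact comparison, no tolerance


def _acc(values):
    """Left-to-right accumulation with one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_order_sum(batch):
    """Toy reduction whose order never depends on the batch size."""
    return [_acc(r) for r in batch]


def batch_dependent_sum(batch):
    """Toy reduction whose grouping changes with batch size (hypothetical)."""
    split = 2 if len(batch) == 1 else 1
    return [_acc(r[:split]) + _acc(r[split:]) for r in batch]


row = [0.1, 0.2, 0.3]
others = [[1.0, 2.0, 3.0]]

fixed_ok = is_batch_invariant(fixed_order_sum, row, others)
dependent_ok = is_batch_invariant(batch_dependent_sum, row, others)
print(fixed_ok, dependent_ok)
```

The check uses exact equality on purpose: batch invariance is a bitwise property, so a tolerance-based comparison would hide the kind of difference described in this issue.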

Current workaround

In PR [#38938], the EAGLE DP test forces +rms_norm via compilation_config={"custom_ops": ["none", "+rms_norm"]} to ensure forward_cuda is used.
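
To see why adding +rms_norm changes the dispatch, the following is a simplified pure-Python model of how a custom_ops list might be resolved. It is an assumption-laden sketch, not vLLM's code: the function names are hypothetical, and the "-name" opt-out form is assumed by symmetry with "+name" rather than taken from this issue.

```python
def default_on(custom_ops):
    """Assumed rule: ops default to on unless "none" is listed without "all"."""
    return "none" not in custom_ops or "all" in custom_ops


def op_enabled(name, custom_ops):
    """Resolve one op: a "+name" entry forces it on, a "-name" entry forces it off."""
    forced_on = f"+{name}" in custom_ops
    forced_off = f"-{name}" in custom_ops
    if forced_on and forced_off:
        raise ValueError(f"{name} is both enabled and disabled")
    return (default_on(custom_ops) or forced_on) and not forced_off


def rmsnorm_forward(custom_ops):
    """Path RMSNorm would take under this simplified model."""
    return "forward_cuda" if op_enabled("rms_norm", custom_ops) else "forward_native"


compiled_default = rmsnorm_forward(["none"])
with_workaround = rmsnorm_forward(["none", "+rms_norm"])
eager_default = rmsnorm_forward(["all"])

print(compiled_default, with_workaround, eager_default)
```

Under this model the workaround re-enables only RMSNorm's custom op, and every other op keeps the "none" default.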


extent analysis

TL;DR

To fix the batch-size-dependent numerical differences issue with RMSNorm on L4 GPUs, ensure that forward_cuda is used instead of forward_native by configuring custom_ops to include +rms_norm.

Guidance

  • Confirm that forward_native is the path at fault: in the minimal repro, compare each path's single-row output with row 0 of its batched output (out_cuda_single[0] vs out_cuda_batch[0], and out_native_single[0] vs out_native_batch[0]). A mismatch on the native path only points to this issue.
  • Configure custom_ops to include +rms_norm to force the use of forward_cuda, as done in PR [#38938].
  • Test the workaround by rerunning the affected workload (for example, the EAGLE DP test) with the modified custom_ops configuration and checking that outputs no longer depend on batch size. The minimal repro calls forward_cuda and forward_native directly, so it is not affected by custom_ops.
  • Consider updating the e2e batch invariant tests to use enforce_eager=False unconditionally, so the compiled path is also covered on SM89 and similar issues are caught in the future.

Example

The workaround from the issue is a compilation_config override:

compilation_config={"custom_ops": ["none", "+rms_norm"]}

This configuration should make RMSNorm dispatch to forward_cuda under torch.compile, so a row's output should be the same whether it is processed alone or as part of a larger batch.

Notes

The issue has been reported on L4 (SM89) GPUs in CI and may not reproduce on other hardware. The workaround would become unnecessary if the forward_native path were changed to produce the same results regardless of batch size.

Recommendation

Apply the workaround by configuring custom_ops to include +rms_norm, so that forward_cuda, which calls rms_norm_batch_invariant() directly, is used under torch.compile. Treat this as a stopgap until a permanent fix is implemented.
