vllm - ✅(Solved) Fix [Bug]: Batch invariant aten::IMPL overrides bypassed under torch.compile for RMSNorm [1 pull requests, 2 comments, 2 participants]

vllm • 2026-04-06 17:59:49


GitHub stats

vllm-project/vllm#39096 • Fetched 2026-04-08 03:02:00


Author

Monishver11

Participants

Monishver11

ProExpertProg

Timeline (top)

mentioned ×3, subscribed ×3, commented ×2, cross-referenced ×1

PR fix notes

PR #38566: [Bugfix][CI] Skip flaky `test_eagle` test

Repository: vllm-project/vllm
Author: NickLucche
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38566

Description (problem / solution / changelog)

Tentative fix for https://github.com/vllm-project/vllm/issues/31913, do not merge until reviewed and approved locally.

EDIT: I think disabling async with DP>1 is too harsh, still I would appreciate any comment with more context to clarify this is indeed not needed.

Changed files

tests/v1/distributed/test_eagle_dp.py (modified, +8/-1)

Your current environment

Reproducible on L4 (SM89) GPUs in CI. Environment details to be added once reproduced on an L4 node.

🐛 Describe the bug

When VLLM_BATCH_INVARIANT=1 is set with torch.compile active (enforce_eager=False), RMSNorm's batch invariant path is bypassed. This causes batch-size-dependent numerical differences that break speculative decoding determinism on L4 (SM89).

Root Cause

vLLM's batch invariance registers aten::IMPL overrides in enable_batch_invariant_mode() (e.g., aten::mean.dim → deterministic Triton kernel with fixed reduction order). However, under torch.compile with custom_ops=["none"] (the default when Inductor is active):

1. CustomOp.dispatch_forward() → RMSNorm.enabled() returns False (because default_on() returns False when custom_ops=["none"])
2. RMSNorm dispatches to forward_native instead of forward_cuda
3. forward_native calls ir.ops.rms_norm → x.pow(2).mean(dim=-1, keepdim=True) (pure PyTorch ops)
4. Inductor lowers these to its own IR and generates its own Triton reduction kernels, bypassing the aten::mean.dim override at the dispatch table
5. The Inductor-generated reduction kernel produces batch-size-dependent results on L4

In contrast, forward_cuda calls rms_norm_batch_invariant() directly, which is deterministic regardless of batch size.
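
The batch-size dependence ultimately comes from floating-point addition not being associative: a reduction that splits the same row into different partial sums can round differently. The sketch below is plain Python, not the Inductor or vLLM kernel. The function chunked_sum and its chunk_size parameter are hypothetical stand-ins for a kernel's reduction split, which the issue does not describe in detail.

```python
def chunked_sum(values, chunk_size):
    """Sum `values` left to right within chunks, then sum the partials.

    `chunk_size` is a hypothetical stand-in for a kernel's reduction split.
    Plain loops are used instead of the built-in sum() so that the
    additions happen in exactly the order written here.
    """
    partials = []
    for start in range(0, len(values), chunk_size):
        acc = 0.0
        for v in values[start:start + chunk_size]:
            acc += v
        partials.append(acc)

    total = 0.0
    for p in partials:
        total += p
    return total


row = [1e16, 1.0, 1.0, 1.0]

one_pass = chunked_sum(row, chunk_size=4)    # ((1e16 + 1) + 1) + 1
two_chunks = chunked_sum(row, chunk_size=2)  # (1e16 + 1) + (1 + 1)

print(one_pass == two_chunks)  # False: same inputs, different rounding
```

A batch-invariant kernel avoids this by fixing the reduction order for each row, so the result for a row does not depend on how many other rows share the batch.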

Why this wasn't caught earlier

The e2e batch invariant tests (tests/v1/determinism/test_batch_invariance.py) set enforce_eager=IS_DEVICE_CAPABILITY_BELOW_90, so on L4 (SM89) they run in eager mode → custom_ops=["all"] → forward_cuda is used. The issue was only exposed by test_eagle_dp which uses enforce_eager=False unconditionally.
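
The coverage gap can be traced with a small model of the settings involved. This is an illustrative sketch, not vLLM code: rmsnorm_path and the two test helpers are hypothetical names, and they encode only what the issue states (eager mode implies custom_ops=["all"] and forward_cuda, compiled mode implies custom_ops=["none"] and forward_native).

```python
def rmsnorm_path(enforce_eager: bool) -> str:
    """Hypothetical helper: which RMSNorm path runs, per the issue's description."""
    custom_ops = ["all"] if enforce_eager else ["none"]
    return "forward_cuda" if "all" in custom_ops else "forward_native"


def batch_invariance_test_path(device_capability: int) -> str:
    # test_batch_invariance.py sets enforce_eager=IS_DEVICE_CAPABILITY_BELOW_90
    return rmsnorm_path(enforce_eager=device_capability < 90)


def eagle_dp_test_path(device_capability: int) -> str:
    # test_eagle_dp uses enforce_eager=False unconditionally
    return rmsnorm_path(enforce_eager=False)


for cap in (89, 90):
    print(cap, batch_invariance_test_path(cap), eagle_dp_test_path(cap))
# 89 forward_cuda forward_native
# 90 forward_native forward_native
```

Under these assumptions, on SM89 only test_eagle_dp reaches forward_native, which matches where the failure surfaced.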

Minimal repro

import torch
import os

os.environ["VLLM_BATCH_INVARIANT"] = "1"

from vllm.model_executor.layers.batch_invariant import enable_batch_invariant_mode
from vllm.model_executor.layers.layernorm import RMSNorm

enable_batch_invariant_mode()

hidden_size = 4096
eps = 1e-5
norm = RMSNorm(hidden_size, eps=eps).cuda()

torch.manual_seed(42)
x_single = torch.randn(1, hidden_size, dtype=torch.bfloat16, device="cuda")
x_batch = torch.randn(4, hidden_size, dtype=torch.bfloat16, device="cuda")
x_batch[0] = x_single[0]

# forward_cuda path (batch invariant) — should match
out_cuda_single = norm.forward_cuda(x_single.clone())
out_cuda_batch = norm.forward_cuda(x_batch.clone())
print("forward_cuda match:", torch.equal(out_cuda_single[0], out_cuda_batch[0]))

# forward_native path (compiled by Inductor) — may NOT match on L4
out_native_single = norm.forward_native(x_single.clone())
out_native_batch = norm.forward_native(x_batch.clone())
print("forward_native match:", torch.equal(out_native_single[0], out_native_batch[0]))

On L4, forward_cuda should produce matching results while forward_native may not.

Current workaround

In PR [#38938], the EAGLE DP test forces +rms_norm via compilation_config={"custom_ops": ["none", "+rms_norm"]} to ensure forward_cuda is used.
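
As a rough model of why adding +rms_norm restores the batch-invariant path, the sketch below assumes that a per-op "+name" entry takes precedence over the global "none" default. The function op_enabled is hypothetical and is not vLLM's CustomOp implementation; the precedence rule is an assumption inferred from the workaround.

```python
def op_enabled(name: str, custom_ops: list[str]) -> bool:
    """Hypothetical resolution of one custom op against a custom_ops list."""
    if f"+{name}" in custom_ops:
        return True  # explicit opt-in wins over the global default (assumed)
    return "all" in custom_ops  # otherwise follow the global "all"/"none" switch


def forward_path(custom_ops: list[str]) -> str:
    """Map the resolved setting to the RMSNorm method named in the issue."""
    return "forward_cuda" if op_enabled("rms_norm", custom_ops) else "forward_native"


print(forward_path(["none"]))               # forward_native
print(forward_path(["none", "+rms_norm"]))  # forward_cuda
print(forward_path(["all"]))                # forward_cuda
```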

Related

[#31913](https://github.com/vllm-project/vllm/issues/31913): flaky test_eagle_dp
[pytorch/pytorch#170563](https://github.com/pytorch/pytorch/issues/170563)
[#38566](https://github.com/vllm-project/vllm/pull/38566): workaround PR
[#32992](https://github.com/vllm-project/vllm/issues/32992): torch.compile + batch invariance

extent analysis

TL;DR

To work around batch-size-dependent RMSNorm results on L4 GPUs under torch.compile, add +rms_norm to custom_ops so that RMSNorm dispatches to forward_cuda instead of forward_native.

Guidance

Confirm that forward_native is the culprit by running the minimal repro: compare out_cuda_single[0] with out_cuda_batch[0], and out_native_single[0] with out_native_batch[0]. On L4, only the forward_native pair is expected to differ.
Add +rms_norm to custom_ops to force the use of forward_cuda, as done in PR [#38938].
Test the workaround by running the affected workload with the modified compilation_config and checking that the same input yields identical output at different batch sizes. The minimal repro calls forward_cuda and forward_native directly, so the custom_ops setting does not change its result.
Consider also running the e2e batch invariant tests with enforce_eager=False on devices below SM90, so that the compiled path is covered and similar issues are caught earlier.

Example

The workaround is applied through the compilation config, as in the EAGLE DP test:

compilation_config={"custom_ops": ["none", "+rms_norm"]}

This configuration should make RMSNorm dispatch to forward_cuda under torch.compile, so that a given input row produces the same output regardless of batch size.

Notes

The issue has been reported on L4 (SM89) GPUs and may not reproduce on other hardware. The workaround would become unnecessary if the forward_native path were changed to produce batch-invariant results under torch.compile.

Recommendation

Apply the workaround by adding +rms_norm to custom_ops so that forward_cuda is used and RMSNorm results stay batch-invariant. Treat it as a stopgap: until a permanent fix lands, forward_native under torch.compile still bypasses the aten::IMPL overrides.
