vllm - ✅(Solved) Fix [RFC][vLLM IR]: Batch Invariance Dispatching in vLLM IR [1 pull requests, 4 comments, 4 participants]

vllm · 2026-04-22 15:48:42

vllm-project/vllm#40628•Fetched 2026-04-23 07:23:46


This RFC discusses how batch invariance should be handled within the vLLM IR dispatching model. Currently toggled via VLLM_BATCH_INVARIANT=1 and primarily affecting rms_norm, the question is: should vLLM IR own batch-invariant dispatching as a first-class concern, or should it be handled outside the IR?

Fix / Workaround

The vLLM IR design introduces a unified dispatching and kernel registration mechanism to replace CustomOp. A key motivation for this replacement is that CustomOp dispatching was opaque and hard to reason about. A clear example: @yewentao256 did not realize that forward_native (the torch-native implementation) was used by default in batch-invariant mode (and I had no idea what was happening either). As a result, the batch-invariant Triton kernel was never invoked, because it only existed inside forward_cuda and forward_hip. Multiple levels of initialization and dispatching made the actual execution path difficult to trace.

forward_cuda / forward_hip — dispatching between the standard CUDA/HIP custom kernel and batch invariant Triton kernel (also AITER in forward_hip)
forward_native — used by default during compilation; batch-invariant dispatching is absent here, so it is silently skipped. Instead we relied on Inductor to generate a batch-invariant Triton kernel.
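The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not vLLM's CustomOp: the class name ToyRMSNorm, the string return values, and the rule that compilation selects forward_native are all assumptions made for illustration.

```python
class ToyRMSNorm:
    """Toy stand-in for a CustomOp-style layer (hypothetical, not vLLM code)."""

    def __init__(self, batch_invariant: bool, compiling: bool):
        self.batch_invariant = batch_invariant
        self.compiling = compiling

    def forward_native(self, x):
        # No batch-invariant branch here: the flag is never consulted.
        return "torch-native"

    def forward_cuda(self, x):
        # The batch-invariant Triton kernel is only reachable from this path.
        if self.batch_invariant:
            return "triton-batch-invariant"
        return "cuda-custom-kernel"

    def forward(self, x):
        # Assumed rule for this sketch: compilation selects forward_native.
        if self.compiling:
            return self.forward_native(x)
        return self.forward_cuda(x)


layer = ToyRMSNorm(batch_invariant=True, compiling=True)
# Batch invariance was requested, yet the Triton branch is never reached.
print(layer.forward(None))
```

Nothing in forward_native signals that the flag was ignored, which is why the actual execution path was hard to trace.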

PR fix notes

PR #36816: [DO NOT MERGE][vLLM IR] 2/N batch-invariant-aware dispatching and rms_norm

Repository: vllm-project/vllm
Author: ProExpertProg
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36816

Description (problem / solution / changelog)

Purpose

This PR adds batch-invariant-aware kernel dispatching infrastructure to vLLM IR and plugs the batch-invariant Triton kernel as an implementation of the rms_norm op.

Key changes:

Extended IrOp to support batch_invariant flag on implementations, allowing kernel selection based on VLLM_BATCH_INVARIANT mode
Registered batch-invariant Triton kernel for rms_norm (vllm.kernels.triton.layernorm_batch_invariant)
Updated lowering pass to filter implementations by batch-invariance requirements
Added platform capability detection for batch-invariant kernel support (SM 9.0+)

How it works:

IR ops are batch-invariant by default; ops with reductions (has_reduction, e.g. rms_norm) are not batch-invariant by default
Native implementations are always batch-invariant
Implementations explicitly opt-in via batch_invariant=True parameter
Kernel selection automatically chooses batch-invariant implementations when VLLM_BATCH_INVARIANT=1
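Why reductions get special treatment can be seen from plain floating-point arithmetic: addition is not associative, so a kernel that splits a reduction differently (for example, depending on batch size) can produce a slightly different result for the same row. The sketch below is pure Python and does not model any actual vLLM kernel; the helper names are hypothetical.

```python
def sequential_sum(values):
    """Accumulate left to right, as a single-pass reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


def split_sum(values, split):
    """Reduce two partitions independently, then combine the partial sums."""
    return sequential_sum(values[:split]) + sequential_sum(values[split:])


values = [0.1, 0.2, 0.3]
one_pass = sequential_sum(values)   # ((0.1 + 0.2) + 0.3)
two_part = split_sum(values, 1)     # 0.1 + (0.2 + 0.3)

# Same inputs, different reduction order, different bits.
print(one_pass == two_part, abs(one_pass - two_part))
```

An implementation marked batch_invariant=True is, in effect, promising that its reduction order does not depend on how many other rows share the batch.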

Test Plan

# Unit tests
pytest tests/ir/test_op.py -v -k batch_invariant
pytest tests/kernels/ir/test_layernorm.py -v
pytest tests/compile/passes/ir/test_lowering.py -v

# Batch invariance tests
pytest -s -v tests/v1/determinism/test_batch_invariance.py

# E2E: lm_eval, vllm bench latency

Test Result

# H100
$ pytest tests/v1/determinism/test_batch_invariance.py 
tests/v1/determinism/test_batch_invariance.py ............. [100%]
============ 13 passed, 34 warnings in 488.50s (0:08:08) ============
$ pytest tests/ir/ tests/kernels/ir/ tests/compile/passes/ir/ 
tests/ir/test_op.py ....................... [  5%]
tests/kernels/ir/test_layernorm.py .............................................................................................................................................................................................. [ 46%]
.................................................................................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.................................... [ 95%]
.................. [ 99%]
tests/compile/passes/ir/test_lowering.py ... [100%]
============ 351 passed, 108 skipped, 19 warnings in 143.29s (0:02:23) ============

# B200
$ pytest tests/v1/determinism/test_batch_invariance.py 
tests/v1/determinism/test_batch_invariance.py ............. [100%]
============ 13 passed, 34 warnings in 559.89s (0:09:19) ============

$ pytest tests/ir/ tests/kernels/ir/ tests/compile/passes/ir/ 
tests/ir/test_op.py ....................... [  5%]
tests/kernels/ir/test_layernorm.py ...................................................................................................................................................................... [ 41%]
.........................................................................................................ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss [ 84%]
ssssssssssss...................................................... [ 99%] 
tests/compile/passes/ir/test_lowering.py ... [100%]
============ 351 passed, 108 skipped, 19 warnings in 161.71s (0:02:41) ============

lm_eval

B200

main

$ VLLM_BATCH_INVARIANT=0 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8469	±	0.0099
		strict-match	5	exact_match	↑	0.8886	±	0.0087

$ VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8582	±	0.0096
		strict-match	5	exact_match	↑	0.8976	±	0.0083

$ VLLM_BATCH_INVARIANT=0 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN --enforce-eager
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8537	±	0.0097
		strict-match	5	exact_match	↑	0.8969	±	0.0084

$ VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN --enforce-eager
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8506	±	0.0098
		strict-match	5	exact_match	↑	0.8969	±	0.0084

PR

$ VLLM_BATCH_INVARIANT=0 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8476	±	0.0099
		strict-match	5	exact_match	↑	0.8916	±	0.0086

$ VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8582	±	0.0096
		strict-match	5	exact_match	↑	0.8976	±	0.0083

$ VLLM_BATCH_INVARIANT=0 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN --enforce-eager
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8529	±	0.0098
		strict-match	5	exact_match	↑	0.8984	±	0.0083

$ VLLM_BATCH_INVARIANT=1 vllm serve Qwen/Qwen3-30B-A3B --attention-backend=TRITON_ATTN --enforce-eager
local-completions ({'pretrained': 'Qwen/Qwen3-30B-A3B', 'base_url': 'http://0.0.0.0:8000/v1/completions', 'num_concurrent': 50, 'max_retries': 3}), gen_kwargs: ({}), limit: None, num_fewshot: 5, batch_size: auto

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
gsm8k	3	flexible-extract	5	exact_match	↑	0.8506	±	0.0098
		strict-match	5	exact_match	↑	0.8969	±	0.0084

Latency

B200

All were run with vllm bench latency --attention-backend=TRITON_ATTN (TP=1).

Configuration	main [s]	PR [s]
Qwen/Qwen3-0.6B
VLLM_BATCH_INVARIANT=0	0.288	0.278
VLLM_BATCH_INVARIANT=0 --enforce-eager	2.157	2.150
VLLM_BATCH_INVARIANT=1	0.493	0.490
VLLM_BATCH_INVARIANT=1 --enforce-eager	3.057	4.602

nvidia/Llama-3.3-70B-Instruct-NVFP4
VLLM_BATCH_INVARIANT=0	1.612	1.620
VLLM_BATCH_INVARIANT=1	11.35	11.34

H100

TBD


Changed files

.buildkite/test_areas/misc.yaml (modified, +5/-5)
tests/compile/passes/ir/test_lowering.py (modified, +2/-0)
tests/ir/test_op.py (modified, +80/-10)
tests/kernels/ir/test_layernorm.py (modified, +11/-6)
tests/v1/determinism/test_batch_invariance.py (modified, +16/-10)
vllm/compilation/passes/ir/lowering_pass.py (modified, +7/-2)
vllm/compilation/passes/pass_manager.py (modified, +3/-3)
vllm/config/kernel.py (modified, +6/-1)
vllm/config/vllm.py (modified, +7/-3)
vllm/forward_context.py (modified, +2/-0)
vllm/ir/op.py (modified, +47/-6)
vllm/ir/ops/layernorm.py (modified, +1/-1)
vllm/kernels/__init__.py (modified, +2/-2)
vllm/kernels/triton/__init__.py (added, +3/-0)
vllm/kernels/triton/layernorm_batch_invariant.py (added, +29/-0)
vllm/model_executor/layers/batch_invariant.py (modified, +1/-0)
vllm/model_executor/layers/layernorm.py (modified, +7/-18)
vllm/platforms/cuda.py (modified, +9/-2)


Related: #32358
Status: Under Discussion
Written with: Claude

Summary

Background

Under CustomOp, batch invariance for rms_norm intersects with three execution paths:

forward_cuda / forward_hip — dispatching between the standard CUDA/HIP custom kernel and batch invariant Triton kernel (also AITER in forward_hip)
forward_native — used by default during compilation; batch-invariant dispatching is absent here, so it is silently skipped. Instead we relied on Inductor to generate a batch-invariant Triton kernel.

Under vLLM IR, the forward_* methods are eliminated and replaced with a single dispatching mechanism. How to represent and route batch-invariant execution within that model is an open question.

Proposed Alternatives

Two decisions are embedded in this question:

When compiling the model, do we want to dispatch batch-invariant rms_norm to native torch ops, or always use the batch-invariant Triton kernel?
If we do dispatch to native torch ops (Options 1-4), how should this dispatch happen?

Options 1-4 are different ways to dispatch to the native implementation; Option 5 always dispatches to the batch-invariant Triton kernel.

Option 1: vLLM IR handles batch-invariant dispatching natively (original plan, PR #36816)

Batch invariance is a first-class property in the IR. Each implementation declares whether it is batch invariant, and IrOp.dispatch filters implementations accordingly when the batch_invariant flag is set to true.
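A minimal sketch of this filtering idea, assuming a toy registry: ToyIrOp, ToyImpl, and the dispatch signature below are hypothetical stand-ins and do not reproduce the IrOp API from PR #36816.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyImpl:
    fn: Callable
    batch_invariant: bool = False  # implementations opt in explicitly


@dataclass
class ToyIrOp:
    name: str
    impls: dict = field(default_factory=dict)

    def register_impl(self, name, fn, batch_invariant=False):
        self.impls[name] = ToyImpl(fn, batch_invariant)

    def dispatch(self, priority, batch_invariant):
        """Return the first registered impl name that satisfies the mode."""
        for name in priority:
            impl = self.impls.get(name)
            if impl is None:
                continue  # not registered on this platform
            if batch_invariant and not impl.batch_invariant:
                continue  # filtered out by the batch-invariance requirement
            return name
        raise LookupError(f"no usable implementation for {self.name}")


rms_norm = ToyIrOp("rms_norm")
rms_norm.register_impl("vllm_c", lambda x: x)
rms_norm.register_impl("triton_batch_invariant", lambda x: x, batch_invariant=True)
rms_norm.register_impl("native", lambda x: x, batch_invariant=True)

priority = ["vllm_c", "triton_batch_invariant", "native"]
print(rms_norm.dispatch(priority, batch_invariant=False))
print(rms_norm.dispatch(priority, batch_invariant=True))
```

The priority list stays the same in both modes; only the filter changes which entry wins. The cost is that the dispatcher itself has to know about the flag.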

Option 2: Control batch-invariant kernels via vLLM IR priority list (my new preference)

Batch invariance is not encoded in the IR itself. Instead, when VLLM_BATCH_INVARIANT=1, the KernelConfig initialization logic adjusts the priority list for affected ops to prefer batch-invariant implementations:

class CudaPlatform:
  ...

  @classmethod
  def get_default_ir_op_priority(cls, vllm_config: VllmConfig) -> IrOpPriorityConfig:
    # Native used by default when compiling,
    # use vllm_c kernels where available when no codegen
    cc = vllm_config.compilation_config
    using_inductor = cc.backend == "inductor" and cc.mode != CompilationMode.NONE
    default = ["native"] if using_inductor else ["vllm_c", "native"]

    if using_inductor:
      rms_norm = ["native"]
    elif envs.VLLM_BATCH_INVARIANT:
      rms_norm = ["triton_batch_invariant", "native"]
    else:
      rms_norm = ["vllm_c", "native"]

    return IrOpPriorityConfig.with_default(default, rms_norm=rms_norm)

The IR dispatches according to whatever priority list is active and remains unaware of batch invariance as a concept.
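The branch logic in the snippet above can be exercised without vLLM. The sketch below mirrors only the rms_norm branches as plain Python; pick_rms_norm_priority and first_available are hypothetical helpers, and the assumption that dispatch falls through to the next registered name in the list is made for illustration.

```python
def pick_rms_norm_priority(using_inductor: bool, batch_invariant: bool) -> list:
    """Mirror of the rms_norm branches in the Option 2 snippet."""
    if using_inductor:
        return ["native"]
    if batch_invariant:
        return ["triton_batch_invariant", "native"]
    return ["vllm_c", "native"]


def first_available(priority, registered):
    """Assumed dispatch rule: take the first name that is registered."""
    for name in priority:
        if name in registered:
            return name
    raise LookupError("no registered implementation in priority list")


for using_inductor in (True, False):
    for batch_invariant in (True, False):
        chosen = pick_rms_norm_priority(using_inductor, batch_invariant)
        print(using_inductor, batch_invariant, chosen)

# A platform that registers only "native" falls through to it.
print(first_available(pick_rms_norm_priority(False, True), {"native"}))
```

Note that when Inductor is in use, the list is ["native"] regardless of VLLM_BATCH_INVARIANT, so the flag only changes the outcome on the non-Inductor path.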

Option 3: Separate rms_norm_batch_invariant IR op

Define a distinct IR op vllm.ir.ops.rms_norm_batch_invariant with its own semantics declaration and implementation registry. Layer code selects which op to invoke based on whether batch invariance is required. This makes the path separate and even visible in the FX graph, but that also means more duplication and manual effort.

Option 4: Separate dispatching mechanism outside vLLM IR

When VLLM_BATCH_INVARIANT=1, execution bypasses IR dispatching entirely and dispatches directly between the batch-invariant Triton kernel and the native implementation. The IR is not involved in this path. This does not allow OOT extensibility or custom compile passes, but it does keep the benefit of native performance. The native implementation has to be duplicated because the IR op is not used, although we could manually invoke ir.ops.rms_norm.impls["native"].

Option 5: Batch-invariant path always calls the Triton kernel (Wentao's preference)

When batch invariance is required, always dispatch directly to the batch-invariant Triton kernel, with no IR involvement and no priority list manipulation. This matches current behavior on A100 and earlier GPUs, where eager execution is already forced.

B200 performance results for Option 5, comparing the triton_batch_invariant and native implementations (measured before #40408 and #40413, which close roughly half of that gap). Llama was the easiest model to demonstrate this on, but the effect applies even more to Qwen and DeepSeek, which have more norms per layer (some temporary confounding issues are currently out of scope for this RFC):

# triton_batch_invariant
$ VLLM_BATCH_INVARIANT=1 vllm bench latency --model=RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 --attention-backend=TRITON_ATTN --ir-op-priority.rms_norm=triton_batch_invariant --ir-op-priority.fused_add_rms_norm=triton_batch_invariant -cc.pass_config.fuse_norm_quant=False
Avg latency: 0.43808772719154754 seconds
10% percentile latency: 0.43741614669561385 seconds
25% percentile latency: 0.43829121452290565 seconds
50% percentile latency: 0.4384383929427713 seconds
75% percentile latency: 0.4386132051004097 seconds
90% percentile latency: 0.43903175396844746 seconds
99% percentile latency: 0.43941899626515807 seconds

# native
$ VLLM_BATCH_INVARIANT=1 vllm bench latency --model=RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 --attention-backend=TRITON_ATTN --ir-op-priority.rms_norm=native --ir-op-priority.fused_add_rms_norm=native -cc.pass_config.fuse_norm_quant=False
Avg latency: 0.3949168597658475 seconds
10% percentile latency: 0.3945025376044214 seconds
25% percentile latency: 0.39466362388338894 seconds
50% percentile latency: 0.3952090779785067 seconds
75% percentile latency: 0.3955157147720456 seconds
90% percentile latency: 0.3957540170755237 seconds
99% percentile latency: 0.3962868914008141 seconds
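For reference, the relative gap between the two runs can be computed directly from the average latencies reported above. The variable names are illustrative only.

```python
# Average latencies in seconds, copied from the two benchmark runs above.
triton_avg = 0.43808772719154754
native_avg = 0.3949168597658475

# Relative slowdown of triton_batch_invariant versus native, in percent.
slowdown_pct = (triton_avg / native_avg - 1.0) * 100.0
print(f"triton_batch_invariant is {slowdown_pct:.1f}% slower than native")
```

This works out to roughly 11%.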

Discussion

The core trade-off is between visible, explicit logic in layer/config code and a clean, unified IR dispatching model with superior performance.

Options 4 and 5 are the simplest from an IR perspective: no batch-invariance-related logic enters the core dispatch mechanism. However, both place the batch-invariant path outside the IR entirely, making it a second-class citizen. OOT platforms (that don’t support the Triton kernel) would need to implement their own separate routing logic and override the whole layer rather than using the standard register_impl interface.

Option 5 is the simplest, but it carries the classic downsides of a custom kernel: lower performance portability, no portability beyond Triton kernels, no automatic fusion, and higher maintenance cost. Option 4 recovers the benefits of torch.compile kernel generation (including 5% E2E perf) but still lacks OOT extensibility and IR benefits (fusion passes and sequence parallelism/async TP, although those don’t currently work with batch invariance anyway).

Option 3 adds explicit IR visibility but nearly duplicates the rms_norm op declaration, creates maintenance burden, and requires conditional logic in the RMSNorm layer — complexity that grows if other ops ever need batch-invariant variants.

Option 1 (current plan) keeps the batch-invariant path fully inside the IR, giving it access to unified dispatching, OOT register_impl, and Inductor fusions. The cost is that the IR dispatch model must understand the concept of batch invariance, adding a layer of reasoning to the core.

Option 2 splits the difference: batch-invariant kernels are still registered through the IR and benefit from unified dispatch and OOT extensibility, but the IR itself has no knowledge of batch invariance — it just follows the active priority list. The batch-invariant override logic lives in CudaPlatform.get_default_ir_op_priority, which is explicit and easy to audit. The main downside relative to Option 1 is that there is no enforcement that a registered implementation actually satisfies batch invariance — correctness depends on the platform providing the right priority list.

Recommendation

Option 1 was the original plan and is the most principled long-term design. However, Option 2 is likely a better starting point: it does not preclude migrating to Option 1 in the future, keeps the IR dispatch model slightly simpler while the IR is still stabilizing, and avoids encoding a concept that currently only affects a single op (rms_norm). No other ops are currently known to require batch-invariant variants, so building first-class IR support for this now is premature. If batch invariance becomes relevant to additional ops, or Inductor fusions on the batch-invariant path prove impactful, migrating to Option 1 remains straightforward.

Feedback from RL post-training users and OOT platform maintainers would be especially valuable.

Feedback Period.

4/22 - 4/24. This is a high-priority blocker for vLLM IR enablement.

CC List.

@yewentao256 @simon-mo @zhuohan123 @houseroad @tjtanaa @zou3519 @gmagogsfm @angelayi @PaulZhang12 @tlrmchlsmth

extent analysis

TL;DR

Implement Option 2, which controls batch-invariant kernels via the vLLM IR priority list, as a starting point to handle batch invariance in the vLLM IR dispatching model.

Guidance

Review the trade-offs between the different options for handling batch invariance, considering factors such as performance, maintainability, and extensibility.
Implement Option 2 by adjusting the priority list for affected ops to prefer batch-invariant implementations when VLLM_BATCH_INVARIANT=1.
Verify that the batch-invariant path is correctly routed through the IR dispatching mechanism and that OOT platforms can extend this logic using the standard register_impl interface.
Monitor the performance impact of this approach and reassess the decision if batch invariance becomes relevant to additional ops or if Inductor fusions on the batch-invariant path prove impactful.

Example

class CudaPlatform:
  ...

  @classmethod
  def get_default_ir_op_priority(cls, vllm_config: VllmConfig) -> IrOpPriorityConfig:
    # ...
    if envs.VLLM_BATCH_INVARIANT:
      rms_norm = ["triton_batch_invariant", "native"]
    # ...

Notes

This approach may not be suitable if batch invariance becomes a first-class concern for multiple ops, in which case migrating to Option 1 may be necessary. Additionally, the correctness of this approach depends on the platform providing the correct priority list.

Recommendation

Apply Option 2 as a starting point, as it balances the need for a clean IR dispatching model with the requirement for batch-invariant kernels, while keeping the IR dispatch model slightly simpler. This approach can be revisited if necessary, and migrating to Option 1 remains a viable future option.
