vllm - [Performance]: FlashInfer AR+RMSNorm fusion cap is ~2x too high on multimem NVLink Hopper (TP8): it displaces the faster multimem all-reduce in the 256KB-512KB band

vllm, 2026-05-30 21:39:47

On a multimem/NVSwitch sm90 node at TP8 (default settings), AllReduceFusionPass fuses all_reduce + residual + RMSNorm into flashinfer's one-shot trtllm AR whenever the message is below FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8] = 0.5 MB. But above 256 KB the unfused all-reduce uses vLLM's fast multimem_all_reduce (NVLink multicast), which is faster than flashinfer's one-shot AR. So in the 256 KB–512 KB band the pass trades the fast multimem AR for a slower fused AR — a net regression:

−1.9 to −4.3 µs/op (microbench, CUDA-graph timed)
−3.4 % / −7.4 % decode throughput at batch 96 / 128 (Qwen3-30B-A3B, TP8)

Fix: cap the fusion at the size where the fast unfused AR takes over (256 KB here), gated on multimem availability.

Root Cause

cuda_communicator.all_reduce sends the unfused AR to vLLM's custom one-shot kernel below CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8] = 262144 B and to multimem_all_reduce at/above it (the NCCL-symm-mem and flashinfer-regular-AR branches are off by default). AllReduceFusionPass emits flashinfer's one-shot trtllm AR+RMSNorm for any message below FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8] = 0.5 MB. So in 256 KB–512 KB the pass replaces the fast multimem AR with the slower flashinfer one-shot; the ~2 µs RMSNorm saving doesn't offset flashinfer being ~2× slower than multimem there.
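
The overlap described above can be checked with a small sketch. It uses only the two thresholds quoted in this report; the function names are hypothetical stand-ins, not vLLM APIs, and the inclusive upper bound on the fusion check is an assumption inferred from the 512 KB row being affected in the measurements.

```python
# Thresholds quoted in the report for sm90 / TP8.
CUSTOM_AR_MAX_BYTES = 262144                  # CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8]
FI_FUSION_MAX_BYTES = int(0.5 * 1024 * 1024)  # FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8]


def unfused_backend(nbytes: int) -> str:
    """Backend the unfused all-reduce would take under default settings."""
    # Custom one-shot below the threshold, multimem at/above it (per the report).
    return "custom_one_shot" if nbytes < CUSTOM_AR_MAX_BYTES else "multimem"


def is_fused(nbytes: int) -> bool:
    """Hypothetical stand-in for the fusion pass's size check.

    Assumption: the bound is inclusive, since the 512 KB case is reported as fused.
    """
    return nbytes <= FI_FUSION_MAX_BYTES


def displaces_multimem(nbytes: int) -> bool:
    """True when fusion replaces an all-reduce that would have used multimem."""
    return is_fused(nbytes) and unfused_backend(nbytes) == "multimem"


if __name__ == "__main__":
    for kb in (192, 256, 384, 512, 768):
        print(f"{kb} KB -> displaces multimem: {displaces_multimem(kb * 1024)}")
```

Under these assumptions only message sizes from 256 KB through 512 KB fall into the regressing band.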

Fix Action

Fixed by PR: [Perf] Gate AR+RMSNorm fusion cap on the active fast all-reduce threshold (https://github.com/vllm-project/vllm/pull/44080)

Proposal to improve performance

Environment

8×H200 (sm_90), NVLink/NVSwitch, TP8, bf16. vLLM 0.1.dev1916, flashinfer 0.6.11.post2. Defaults in effect: VLLM_ALLREDUCE_USE_SYMM_MEM=1, VLLM_USE_NCCL_SYMM_MEM=0, VLLM_ALLREDUCE_USE_FLASHINFER=0.

Evidence

1. Per-op (microbench, CUDA-graph timed, default config, H=2048). delta = unfused − fused µs/op; >0 = fusion wins:

tokens	bytes	fused (µs/op)	unfused (µs/op, AR backend)	delta (µs/op)
48	192 KB	10.4	11.2 (custom)	+0.86
64	256 KB	12.1	12.1 (multimem)	+0.05
96	384 KB	15.0	13.1 (multimem)	−1.87
128	512 KB	17.9	13.6 (multimem)	−4.30

Crossover is byte-governed, constant at 256 KB across hidden sizes 2048/4096/7168 (token thresholds 64/32/18).
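
The token thresholds follow directly from the byte crossover. A minimal sketch of that arithmetic, assuming bf16 activations (2 bytes per element, as in the Environment section); the helper name is illustrative only:

```python
BF16_BYTES = 2                 # bytes per element, bf16
CROSSOVER_BYTES = 256 * 1024   # crossover reported above


def token_threshold(hidden_size: int,
                    crossover_bytes: int = CROSSOVER_BYTES,
                    elem_bytes: int = BF16_BYTES) -> int:
    """Largest token count whose [tokens, hidden_size] message fits in crossover_bytes."""
    return crossover_bytes // (hidden_size * elem_bytes)


thresholds = {h: token_threshold(h) for h in (2048, 4096, 7168)}
print(thresholds)  # {2048: 64, 4096: 32, 7168: 18}
```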

2. The regression is specific to multimem. With VLLM_ALLREDUCE_USE_SYMM_MEM=0 (so the unfused AR stays on the custom one-shot kernel), the 256–384 KB band flips back to a win for fusion (64 tokens: +3.79 µs, 96 tokens: +1.48 µs). A small −1 µs residual remains only at 512 KB.

3. End-to-end (decode throughput, cap 0.5 vs 0.25 MB; median of 6, 256 steps):

batch	cap=0.5 MB	cap=0.25 MB	Δ
64	10584	10655	+0.7 % (noise)
96	14421	14912	+3.4 %
128	17425	18713	+7.4 %

The gain is localized to the 256 KB–512 KB band (≈0 at batch ≤ 64) and is well outside per-batch noise (~0.2–0.3 %).
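
For reference, the Δ column can be reproduced from the two throughput columns as a relative gain of the 0.25 MB cap over the 0.5 MB cap. This is only a sketch of the arithmetic; the numbers are copied from the table and the helper name is illustrative.

```python
def pct_gain(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# batch -> (throughput at cap=0.5 MB, throughput at cap=0.25 MB), from the table
rows = {64: (10584, 10655), 96: (14421, 14912), 128: (17425, 18713)}

gains = {batch: round(pct_gain(*pair), 1) for batch, pair in rows.items()}
print(gains)  # {64: 0.7, 96: 3.4, 128: 7.4}
```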

Proposed fix

Cap the fusion at the active fast-AR threshold: min(current_cap, CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8]) when multimem is on, or the ~128 KiB NCCL-symm-mem bound when VLLM_USE_NCCL_SYMM_MEM=1. Gate the cap on availability of the fast path and leave it unchanged on non-multimem nodes, where fusion wins up to ~384 KB. Keep the cap expressed in bytes/MB rather than tokens. Thresholds are config-overridable (#23722). Precedent: #42409, #37756.

Config caveat

With the non-default VLLM_USE_NCCL_SYMM_MEM=1, NCCL-symm-mem wins from ~128 KiB, moving the crossover lower — hence keying the cap to the active fast-AR threshold rather than a fixed value.
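
The proposed cap, including the NCCL-symm-mem caveat, might look roughly like the sketch below. It is not the code from the linked PR: the function and parameter names are hypothetical, the two "active" flags stand in for whatever availability checks vLLM actually performs, and the 128 KiB value is the approximate bound quoted above.

```python
KiB = 1024


def effective_fusion_cap_bytes(
    current_cap_bytes: int,
    multimem_active: bool,
    nccl_symm_mem_active: bool = False,
    custom_ar_max_bytes: int = 262144,           # CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8]
    nccl_symm_mem_min_bytes: int = 128 * KiB,    # approximate bound from the report
) -> int:
    """Clamp the fusion cap to the size where a faster unfused AR takes over."""
    if nccl_symm_mem_active:
        # Non-default config: NCCL symm-mem wins earlier, so the cap moves lower.
        return min(current_cap_bytes, nccl_symm_mem_min_bytes)
    if multimem_active:
        # Default on multimem nodes: multimem takes over at the custom-AR threshold.
        return min(current_cap_bytes, custom_ar_max_bytes)
    # No fast unfused path: leave the existing cap untouched.
    return current_cap_bytes


if __name__ == "__main__":
    half_mb = 512 * KiB
    print(effective_fusion_cap_bytes(half_mb, multimem_active=True))    # 262144
    print(effective_fusion_cap_bytes(half_mb, multimem_active=False))   # 524288
```

Taking the minimum with the existing cap means a smaller, user-overridden cap is never raised by this logic.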
