vllm [Performance]: FlashInfer AR+RMSNorm fusion cap is ~2x too high on multimem NVLink Hopper (TP8): it displaces the faster multimem all-reduce in the 256KB-512KB band (1 pull request)


On a multimem/NVSwitch sm90 node at TP8 (default settings), AllReduceFusionPass fuses all_reduce + residual + RMSNorm into flashinfer's one-shot trtllm AR whenever the message is below FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8] = 0.5 MB. But above 256 KB the unfused all-reduce uses vLLM's fast multimem_all_reduce (NVLink multicast), which is faster than flashinfer's one-shot AR. So in the 256 KB–512 KB band the pass trades the fast multimem AR for a slower fused AR — a net regression:

  • −1.9 to −4.3 µs/op (microbench, CUDA-graph timed)
  • −3.4 % / −7.4 % decode throughput at batch 96 / 128 (Qwen3-30B-A3B, TP8)

Fix: cap the fusion at the size where the fast unfused AR takes over (256 KB here), gated on multimem availability.

Root Cause

cuda_communicator.all_reduce sends the unfused AR to vLLM's custom one-shot kernel below CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8] = 262144 B and to multimem_all_reduce at/above it (the NCCL-symm-mem and flashinfer-regular-AR branches are off by default). AllReduceFusionPass emits flashinfer's one-shot trtllm AR+RMSNorm for any message below FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8] = 0.5 MB. So in 256 KB–512 KB the pass replaces the fast multimem AR with the slower flashinfer one-shot; the ~2 µs RMSNorm saving doesn't offset flashinfer being ~2× slower than multimem there.
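The routing described above can be sketched as a small model. This is not vLLM's code: the function names are hypothetical, the two thresholds are the values quoted in this issue, 0.5 MB is assumed to mean 512 KiB, and the inclusive upper bound on the fusion check is an assumption (the 512 KB measurements in this issue suggest fusion still fires at that size).

```python
KIB = 1024

# Thresholds quoted in the issue for sm90 at TP8.
CUSTOM_AR_MAX_BYTES = 262144     # CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8]
FI_FUSION_MAX_BYTES = 512 * KIB  # FI_ALLREDUCE_FUSION_MAX_SIZE_MB[90][8] = 0.5 MB (assumed MiB)


def unfused_backend(nbytes: int) -> str:
    """Backend the unfused all-reduce would pick under default settings."""
    return "custom_one_shot" if nbytes < CUSTOM_AR_MAX_BYTES else "multimem"


def fusion_fires(nbytes: int) -> bool:
    """Whether the fusion pass would emit the fused flashinfer op.

    The inclusive comparison is an assumption, not taken from vLLM source.
    """
    return nbytes <= FI_FUSION_MAX_BYTES


def fusion_displaces_multimem(nbytes: int) -> bool:
    """True when fusion replaces an all-reduce that would have used multimem."""
    return fusion_fires(nbytes) and unfused_backend(nbytes) == "multimem"


for kib in (192, 256, 384, 512, 768):
    n = kib * KIB
    print(f"{kib:>4} KiB  unfused={unfused_backend(n):<16} "
          f"fused={fusion_fires(n)!s:<5} displaced={fusion_displaces_multimem(n)}")
```

In this model, every size from 256 KiB up to the fusion cap is flagged as displaced, which is the band the issue identifies as the regression.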

Fix Action

Fixed


Proposal to improve performance


Environment

8×H200 (sm_90), NVLink/NVSwitch, TP8, bf16. vLLM 0.1.dev1916, flashinfer 0.6.11.post2. Defaults VLLM_ALLREDUCE_USE_SYMM_MEM=1, VLLM_USE_NCCL_SYMM_MEM=0, VLLM_ALLREDUCE_USE_FLASHINFER=0.


Evidence

1. Per-op (microbench, CUDA-graph timed, default config, H=2048). delta = unfused − fused µs/op; >0 = fusion wins:

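| tokens | bytes | fused (µs/op) | unfused (µs/op, AR backend) | delta (µs/op) |
|---|---|---|---|---|
| 48 | 192 KB | 10.4 | 11.2 (custom) | +0.86 |
| 64 | 256 KB | 12.1 | 12.1 (multimem) | +0.05 |
| 96 | 384 KB | 15.0 | 13.1 (multimem) | −1.87 |
| 128 | 512 KB | 17.9 | 13.6 (multimem) | −4.30 |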

Crossover is byte-governed, constant at 256 KB across hidden sizes 2048/4096/7168 (token thresholds 64/32/18).
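The token thresholds follow directly from the message size, which is tokens × hidden size × bytes per element. A minimal sketch of that arithmetic, assuming bf16 (2 bytes per element, as in the stated environment); the helper names are illustrative only:

```python
KIB = 1024
CROSSOVER_BYTES = 256 * KIB  # crossover reported in the issue
BF16_BYTES = 2               # bytes per element for bf16


def message_bytes(tokens: int, hidden: int, elem_bytes: int = BF16_BYTES) -> int:
    """Size of the all-reduce message for a [tokens, hidden] activation."""
    return tokens * hidden * elem_bytes


def crossover_tokens(hidden: int, elem_bytes: int = BF16_BYTES) -> int:
    """Largest whole token count whose message does not exceed the crossover size."""
    return CROSSOVER_BYTES // (hidden * elem_bytes)


for hidden in (2048, 4096, 7168):
    print(hidden, crossover_tokens(hidden))
```

For H=2048 this also reproduces the byte column of the microbenchmark: 48, 64, 96 and 128 tokens map to 192, 256, 384 and 512 KB.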

2. The regression is specifically multimem. With VLLM_ALLREDUCE_USE_SYMM_MEM=0 (unfused AR stays custom one-shot), the 256–384 KB band flips back to a win (N=64 +3.79, N=96 +1.48 µs). (A small −1 µs residual remains only at 512 KB.)

3. End-to-end (decode throughput, cap 0.5 vs 0.25 MB; median of 6, 256 steps):

| batch | cap=0.5 | cap=0.25 | Δ |
|---|---|---|---|
| 64 | 10584 | 10655 | +0.7 % (noise) |
| 96 | 14421 | 14912 | +3.4 % |
| 128 | 17425 | 18713 | +7.4 % |

Localized to the band (≈0 at batch ≤64), well outside per-batch noise (~0.2–0.3 %).
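The Δ column is the relative gain of cap=0.25 over cap=0.5. A short check of that arithmetic, using only the throughput figures reported above (the variable and function names are illustrative):

```python
# Decode throughput per batch size: (cap=0.5 MB, cap=0.25 MB), as reported in the issue.
throughput = {
    64: (10584, 10655),
    96: (14421, 14912),
    128: (17425, 18713),
}


def relative_gain_pct(baseline: float, candidate: float) -> float:
    """Percentage change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


gains = {batch: round(relative_gain_pct(*pair), 1) for batch, pair in throughput.items()}
for batch, gain in gains.items():
    print(f"batch {batch}: {gain:+.1f} %")
```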

Proposed fix

Cap the fusion at the active fast-AR threshold: min(current_cap, CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8]) when multimem is on, or the ~128 KiB NCCL-symm-mem bound when VLLM_USE_NCCL_SYMM_MEM=1. Gate the lower cap on multimem availability and leave the cap unchanged on non-multimem nodes, where fusion wins up to ~384 KB. Keep the cap expressed in bytes/MB, since the crossover is byte-governed. Thresholds are config-overridable (#23722). Precedent: #42409, #37756.
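One way to read this proposal is as a single cap-selection function. The sketch below is not the actual patch: the function name, its parameters, and the precedence of the NCCL-symm-mem branch over the multimem branch are all assumptions, and the 128 KiB constant is the approximate bound quoted in this issue.

```python
KIB = 1024

CUSTOM_AR_MAX_BYTES = 262144      # CUSTOM_ALL_REDUCE_MAX_SIZES["9.0"][8], from the issue
NCCL_SYMM_MEM_BYTES = 128 * KIB   # approximate bound quoted in the issue (assumed exact here)


def effective_fusion_cap_bytes(
    current_cap_bytes: int,
    multimem_available: bool,
    nccl_symm_mem_enabled: bool = False,
) -> int:
    """Hypothetical helper: clamp the fusion cap to the active fast-AR threshold."""
    if nccl_symm_mem_enabled:
        # Non-default VLLM_USE_NCCL_SYMM_MEM=1: the fast path starts lower.
        return min(current_cap_bytes, NCCL_SYMM_MEM_BYTES)
    if multimem_available:
        # Default path on multimem NVLink nodes: multimem takes over at 256 KiB.
        return min(current_cap_bytes, CUSTOM_AR_MAX_BYTES)
    # Non-multimem nodes: leave the existing cap untouched.
    return current_cap_bytes


current = 512 * KIB  # 0.5 MB cap, assumed to mean 512 KiB
print(effective_fusion_cap_bytes(current, multimem_available=True))
print(effective_fusion_cap_bytes(current, multimem_available=False))
print(effective_fusion_cap_bytes(current, True, nccl_symm_mem_enabled=True))
```

Taking the minimum rather than overwriting the cap means a user-configured threshold that is already lower than the fast-AR bound is left as set.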

Config caveat

With the non-default VLLM_USE_NCCL_SYMM_MEM=1, NCCL-symm-mem wins from ~128 KiB, moving the crossover lower — hence keying the cap to the active fast-AR threshold rather than a fixed value.

