vllm - ✅(Solved) Fix [Performance]: Deepseek performance regressing with norm fusion enabled [2 pull requests, 1 comments, 1 participants]

vllm2026-03-23 01:04:25

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37832•Fetched 2026-04-08 01:17:50

View on GitHub

Comments

Participants

Timeline

Reactions

Author

tianrengao

Participants

tianrengao

Assignees

zou3519

Timeline (top)

labeled ×2added_to_project_v2 ×1assigned ×1commented ×1

While profiling DeepSeek-R1 671B FP8 with Inductor compilation (O2), we observed that the default fuse_norm_quant=true configuration is slower than fuse_norm_quant=false. Disabling the norm fusion pass consistently recovers ~1 ms/step TPOT across TP=4 and TP=8 configurations.

The root cause is enable_norm_fusion() in vllm/config/vllm.py, which activates RMSNormQuantFusionPass for FP8 models where rms_norm is native (decomposed by Inductor). The pass replaces Inductor-decomposed norm ops with an opaque CUDA extern rms_norm_per_block_quant, which introduces a redundant no-op kernel that Inductor cannot eliminate.

Root Cause

PR fix notes

PR #37960: [Performance] Fused rmsnorm fp8 quant for DeepSeek

Repository: vllm-project/vllm
Author: tianrengao
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37960

Description (problem / solution / changelog)

fixes: https://github.com/vllm-project/vllm/issues/37832

Fused Triton kernel: Combines RMSNorm + FP8 per-block quantization into a single kernel for q_b_proj, eliminating the intermediate DRAM round-trip between norm and quant.
Native aten decomposition: Replaces the opaque CUDA CustomOp with native aten ops for norm-adjacent linears, enabling Inductor to automatically fuse normalize + quantize (requires companion PyTorch PR #178237 to fix Reduction+Pointwise fusion).

Summary: Introduces a custom Triton kernel that fuses RMSNorm and per-block FP8 quantization into a single kernel for norm-adjacent linear layers, along with selective native quantization that enables Inductor to fuse normalize + quantize operations. Together these reduce E2E latency significantly on DeepSeek-R1 671B FP8 with TP=8. see:

Companion PyTorch PR: [inductor] Enable Reduction+Pointwise fusion when reduction output has opaque consumers (fixes Inductor scheduler to allow fusing Reduction epilogues when the output feeds an opaque consumer like cutlass_scaled_mm).

Problem

Under torch.compile with Inductor, DeepSeek-R1 FP8 inference has a performance gap between RMSNorm and per-block FP8 quantization:

The QuantFP8 CustomOp is opaque to Inductor, preventing fusion with the preceding RMSNorm. This forces an unnecessary DRAM round-trip (write normalized output, read it back for quantization).
Even when quantization uses native aten ops (Inductor-visible), the variance reduction (rnumel=hidden_dim) and quant reduction (rnumel=128) have different iteration spaces. Inductor scheduler requires rnumel1 == rnumel2 (simd.py:1337) and MixOrderReduction requires no producer-consumer relationship (scheduler.py:296). Neither can fuse them.

Solution (3 layers)

1. Selective native quantization (fp8.py + fp8_utils.py)

For norm-adjacent linear layers (q_a_proj, q_b_proj, kv_a_proj, kv_b_proj), replace the opaque QuantFP8 CustomOp with pure aten ops (_quantize_input_native). Non-norm-adjacent layers (o_proj, MoE) keep the optimized CUDA CustomOp.

Key technique: transposed-amax produces column-major scales without .contiguous() copy — compute amax on permute(1,0,2) view so output [G,B] contiguous = [B,G] col-major via .t() view.

2. Fused RMSNorm+FP8Quant Triton kernel (fp8_utils.py)

_fused_rmsnorm_fp8_quant_kernel — a two-pass persistent reduction:

Pass 1: Read x, compute variance = mean(x^2) per row
Pass 2: Read x again (L1 cache hit), normalize * weight, per-group amax -> scale -> FP8 cast, write column-major scales directly

Registered as vllm::fused_rmsnorm_fp8_quant custom_op. Replaces 2 Inductor kernels (triton_red + triton_per) with 1 custom kernel.

Applied to q_b_proj (preceded by q_a_layernorm). kv_b_proj cannot be fused because kv_c_normed is stored in the KV cache.

3. Model integration (deepseek_v2.py + mla.py)

_set_fused_norm() stores RMSNorm weight/eps on the linear ops quant op. The MLA forward skips standalone q_a_layernorm when the downstream q_b_proj has fused norm (norm is done inside the fused kernel).

E2E Codegen Verification

DeepSeek-R1 671B FP8, TP=8, compiled graph (rank0):

fused_rmsnorm_fp8_quant calls: 6 (q_b_proj across layers) per_token_group_fp8_quant calls: 2 (non-norm-adjacent, CustomOp) triton_per (native quant): 9 (kv_b, q_a, kv_a) Scale layout: (s72, 12) strides (1, s72) — column-major confirmed

Benchmark

DeepSeek-R1 671B FP8, TP=8, BS=1, input=128, output=8 (vllm bench latency): rate = 15, num requests=600 Baseline (all CUDA CustomOp): 98.15 ms This PR (fused Triton kernel): 95.91 ms (-2.24 ms, -2.3%)

Numerical Correctness

BS=1: q_diff=0.0 vs float32 reference (exact match) BS=8: q_diff=0.0 (Triton reduction ordering, within FP8 precision) BS=64: q_diff=0.0 (within FP8 precision bounds)

The fused kernel operates in float32 throughout (no intermediate bf16 truncation), giving equal or better precision than separate norm->quant.

Test Plan:

Codegen verification (quick, ~5 min compile)

VLLM_SKIP_NATIVE_QUANT=0 TORCH_LOGS=output_code \
vllm bench latency --model deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --compilation-config '"'"'{"custom_ops": ["-rms_norm"]}'"'"' \
  --input-len 128 --output-len 2 --batch-size 1 \
  --num-iters 1 --num-iters-warmup 0 2>&1 | \
  grep "fused_rmsnorm_fp8_quant"
# Expected: 6 fused_rmsnorm_fp8_quant calls in compiled graph

A/B benchmark

for SKIP in 1 0; do
  echo "=== SKIP=$SKIP ==="
  VLLM_SKIP_NATIVE_QUANT=$SKIP \
  vllm bench latency --model deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 8 \
    --compilation-config '"'"'{"custom_ops": ["-rms_norm"]}'"'"' \
    --input-len 128 --output-len 8 --batch-size 1 \
    --num-iters 3 --num-iters-warmup 1
done
# SKIP=1 (baseline) vs SKIP=0 (fused): expect ~2% improvement

Standalone numerical correctness test

python test_fused_rmsnorm_quant.py
# Tests fused kernel vs float32 reference across BS=1/8/64
# and hidden_dim=512/1536/7168

NOTE: Requires companion PyTorch PR for Inductor scheduler fix (choices.py + scheduler.py). Without it, the native quant path for kv_b/q_a/kv_a will produce extra triton_poi kernels. ' 2>&1

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

vllm/config/compilation.py (modified, +18/-0)
vllm/model_executor/layers/mla.py (modified, +11/-1)
vllm/model_executor/layers/quantization/fp8.py (modified, +14/-0)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +207/-4)
vllm/model_executor/models/deepseek_v2.py (modified, +37/-0)

Code Example

# The no-op kernel body (from Inductor codegen):
def triton_poi_fused__to_copy_empty_like_permute_to_2(in_out_ptr0, xnumel, XBLOCK):
    tmp0 = tl.load(in_out_ptr0 + (x0), xmask).to(tl.float32)  # load bf16 -> f32
    tmp1 = tmp0.to(tl.float32)                                   # f32 -> f32 (no-op)
    tmp2 = tmp1.to(tl.float32)                                   # f32 -> f32 (no-op)
    tl.store(in_out_ptr0 + (x0), tmp2, xmask)                   # store f32 -> bf16

---

# Step 1: NO-OP identity kernel (bf16 -> f32 -> bf16, wastes ~224 MB bandwidth)
triton_poi_fused__to_copy_empty_like_permute_to_2.run(buf22, ...)

# Step 2: opaque CUDA extern
torch.ops._C.rms_norm_per_block_quant.default(buf20, buf22, arg6_1, ...)

# Step 3: matmul
torch.ops._C.cutlass_scaled_mm.default(buf27, buf20, ...)

---

# Step 1: Inductor triton_red -- data prep absorbed into norm
triton_red_fused__to_copy_add_mean_mul_per_token_group_fp8_quant_permute_pow_rsqrt_2.run(
    buf30, buf9, arg5_1, arg10_1, buf34, buf56, s72, 7168, ...)

# Step 2: separate quant (efficient CUDA op)
torch.ops._C.per_token_group_fp8_quant.default(buf34, buf32, ...)

# Step 3: matmul
torch.ops._C.cutlass_scaled_mm.default(buf38, buf32, ...)

---

fuse_norm_quant=True                         fuse_norm_quant=False

 allreduce output (bf16)                      allreduce output (bf16)
        |                                            |
        v                                            |  residual (bf16)
 +---------------------+                             |        |
 | triton_poi (NO-OP)  |  <-- EXTRA kernel           v        v
 | bf16->f32->bf16     |                     +---------------------+
 | 224 MB wasted       |                     | triton_red           |
 +----------+----------+                     | bf16->f32 + residual |  <-- ALL IN ONE
            |                                 | + pow->mean->rsqrt   |
            v  residual (bf16)               | (no extra kernel)    |
 +---------------------+                     +----------+----------+
 | rms_norm_per_block   |                                |
 | _quant (opaque)     |                                 v
 | reads buf22 AGAIN   |  <-- redundant        +--------------------+
 | norm + fp8 quant    |      read             | per_token_group    |
 +----------+----------+                       | _fp8_quant         |
            |                                   +----------+---------+
            v                                              v
 +---------------------+                     +---------------------+
 | cutlass_scaled_mm   |                     | cutlass_scaled_mm   |
 +---------------------+                     +---------------------+

 3 kernels + no-op waste                      3 kernels, no waste

---

def enable_norm_fusion(cfg: "VllmConfig") -> bool:
    """Enable only when rms_norm custom op is active (opaque to Inductor)."""
    return cfg.compilation_config.is_custom_op_enabled("rms_norm")

---

Model: DeepSeek-R1 671B FP8 (deepseek-ai/DeepSeek-R1)
Hardware: 8x NVIDIA B200 (TP=8), Driver 580.82.07
vLLM: 0.16.0 (pip, from /home/tianren/.conda/envs/vllm_profile/)
PyTorch: 2.12.0a0+gitb05b2d3 (local build, from /data/users/tianren/pytorch/)
CUDA: 12.9, Triton: 3.5.1
Python: 3.12.12
Optimization level: O2 (default)

RAW_BUFFERClick to expand / collapse

Proposal to improve performance

Summary

Environment

Model: DeepSeek-R1 671B FP8 (deepseek-ai/DeepSeek-R1)
Hardware: 8x NVIDIA B200 (TP=8), Driver 580.82.07
vLLM: 0.16.0
PyTorch: 2.12.0a0+gitb05b2d3 (local build)
CUDA: 12.9, Triton: 3.5.1
Python: 3.12.12
Optimization level: O2 (default)

Why the opaque extern hurts performance

Inductor + norm fusion true (dafault) 38us <img width="1175" height="237" alt="Image" src="https://github.com/user-attachments/assets/01ffa0bb-fed2-41a1-9f48-c87687cbf415" />

Inductor + norm fusion false 28 us <img width="1285" height="242" alt="Image" src="https://github.com/user-attachments/assets/9d9140ef-0143-4171-a316-c92ffb3fa009" />

The fusion pass replacement creates a no-op identity kernel(~5-10us per no-op). The RMSNormQuantFusionPass replacement function does input = input.to(dtype=self.model_dtype) ("just to be safe"), which combined with the pattern matching starting from f32 input (MatcherRMSNorm.inputs() uses empty_f32 for native mode), produces a bf16->f32->bf16 round-trip in the graph:

# The no-op kernel body (from Inductor codegen):
def triton_poi_fused__to_copy_empty_like_permute_to_2(in_out_ptr0, xnumel, XBLOCK):
    tmp0 = tl.load(in_out_ptr0 + (x0), xmask).to(tl.float32)  # load bf16 -> f32
    tmp1 = tmp0.to(tl.float32)                                   # f32 -> f32 (no-op)
    tmp2 = tmp1.to(tl.float32)                                   # f32 -> f32 (no-op)
    tl.store(in_out_ptr0 + (x0), tmp2, xmask)                   # store f32 -> bf16

Inductor cannot DCE this kernel because rms_norm_per_block_quant is opaque -- it can't prove the preceding _to_copy is safe to eliminate.

Codegen Evidence

Side-by-side from the same DeepSeek-R1 graph partition (TP=8, B200x8, TORCH_LOGS=output_code):

`fuse_norm_quant=True` (buggy default) -- extra no-op kernel

# Step 1: NO-OP identity kernel (bf16 -> f32 -> bf16, wastes ~224 MB bandwidth)
triton_poi_fused__to_copy_empty_like_permute_to_2.run(buf22, ...)

# Step 2: opaque CUDA extern
torch.ops._C.rms_norm_per_block_quant.default(buf20, buf22, arg6_1, ...)

# Step 3: matmul
torch.ops._C.cutlass_scaled_mm.default(buf27, buf20, ...)

`fuse_norm_quant=False` (fix) -- no-op eliminated

# Step 1: Inductor triton_red -- data prep absorbed into norm
triton_red_fused__to_copy_add_mean_mul_per_token_group_fp8_quant_permute_pow_rsqrt_2.run(
    buf30, buf9, arg5_1, arg10_1, buf34, buf56, s72, 7168, ...)

# Step 2: separate quant (efficient CUDA op)
torch.ops._C.per_token_group_fp8_quant.default(buf34, buf32, ...)

# Step 3: matmul
torch.ops._C.cutlass_scaled_mm.default(buf38, buf32, ...)

Data flow

fuse_norm_quant=True                         fuse_norm_quant=False

 allreduce output (bf16)                      allreduce output (bf16)
        |                                            |
        v                                            |  residual (bf16)
 +---------------------+                             |        |
 | triton_poi (NO-OP)  |  <-- EXTRA kernel           v        v
 | bf16->f32->bf16     |                     +---------------------+
 | 224 MB wasted       |                     | triton_red           |
 +----------+----------+                     | bf16->f32 + residual |  <-- ALL IN ONE
            |                                 | + pow->mean->rsqrt   |
            v  residual (bf16)               | (no extra kernel)    |
 +---------------------+                     +----------+----------+
 | rms_norm_per_block   |                                |
 | _quant (opaque)     |                                 v
 | reads buf22 AGAIN   |  <-- redundant        +--------------------+
 | norm + fp8 quant    |      read             | per_token_group    |
 +----------+----------+                       | _fp8_quant         |
            |                                   +----------+---------+
            v                                              v
 +---------------------+                     +---------------------+
 | cutlass_scaled_mm   |                     | cutlass_scaled_mm   |
 +---------------------+                     +---------------------+

 3 kernels + no-op waste                      3 kernels, no waste

Across all codegen: 80 rms_norm_per_block_quant calls (fuse=True) -> 0 (fuse=False). All non-norm ops (allreduce, MoE, matmul) are identical.

Suggested Fix

For short term, we suggest just turn off norm fusion for deepseek and allow inductor to take over the fusion, until the no-op is eliminated from the graph. This gives 1ms back. When rms_norm is native, Inductor handles fusion better, it absorbs data preparation into the norm kernel and eliminates the no-op.

Another short term fix is on vllm side, to eliminate the permute and dtype conversion nodes after fusion is done, because inductor is not able to remove these kernels and they become a no-op kernel.

def enable_norm_fusion(cfg: "VllmConfig") -> bool:
    """Enable only when rms_norm custom op is active (opaque to Inductor)."""
    return cfg.compilation_config.is_custom_op_enabled("rms_norm")

For longer term, inductor should enable rmsnorm + fp8 quant fusion to further improve the perf. We are currently working on this in inductor

Impact

Affects all FP8 models served with Inductor (O1/O2/O3) where rms_norm is not explicitly enabled as a custom op -- the default for most FP8 deployments.

Report of performance regression

Benchmark Results

TP=8 (8x B200, DeepSeek-R1 671B FP8, rate=5)

Configuration	Run 1	Run 2	Run 3	Avg TPOT (ms)
Inductor, fuse_norm_quant=true	22.19	22.52	--	22.36
Inductor, fuse_norm_quant=false	--	21.54	21.17	21.36
Delta (true -> false)				-1.00

TP=4 (4x B200, 2 independent sessions)

Configuration	Session 1 Avg	Session 2 Avg
fuse_norm_quant=true	31.58	32.53
fuse_norm_quant=false	30.57	31.52
Delta	-1.01	-1.00

Consistent ~1.0 ms/step improvement across TP=4, TP=8, and multiple sessions.

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

Model: DeepSeek-R1 671B FP8 (deepseek-ai/DeepSeek-R1)
Hardware: 8x NVIDIA B200 (TP=8), Driver 580.82.07
vLLM: 0.16.0 (pip, from /home/tianren/.conda/envs/vllm_profile/)
PyTorch: 2.12.0a0+gitb05b2d3 (local build, from /data/users/tianren/pytorch/)
CUDA: 12.9, Triton: 3.5.1
Python: 3.12.12
Optimization level: O2 (default)

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To fix the performance regression issue, we need to disable norm fusion for DeepSeek-R1 models. Here are the steps:

Set fuse_norm_quant to False in the model configuration.
Update the enable_norm_fusion function in vllm/config/vllm.py to return False when rms_norm is native.

Example code:

def enable_norm_fusion(cfg: "VllmConfig") -> bool:
    """Enable only when rms_norm custom op is active (opaque to Inductor)."""
    # Return False when rms_norm is native
    return False

Alternatively, you can set fuse_norm_quant to False when creating the model:

model_config = {
    # ... other configurations ...
    "fuse_norm_quant": False
}

Verification

To verify that the fix worked, you can run the benchmark tests again and check the average TPOT (ms) values. The values should be lower than before, indicating an improvement in performance.

Example verification code:

# Run benchmark tests
run_benchmark_tests(model_config)

# Print average TPOT (ms) values
print("Average TPOT (ms) values:")
print("TP=8:", avg_tpot_tp8)
print("TP=4:", avg_tpot_tp4)

Extra Tips

Make sure to update the vllm library to the latest version to ensure that the fix is applied correctly.
If you are using a custom rms_norm implementation, you may need to modify it to work with the updated enable_norm_fusion function.
Keep an eye on future updates to the inductor library, as they may include further improvements to the performance of FP8 models.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#optimization #tokenizer error #prompt formatting #chain error #conversation history

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.