vllm - [Performance]: Triton fusion for Qwen2/3-MoE shared-expert gate (Qwen2MoeMLP/Qwen3MoeMLP)

Proposal to improve performance

In vllm/model_executor/models/qwen2_moe.py:119-120 (also reused by qwen3_next.py via re-export of Qwen2MoeMLP) and the byte-identical duplicate in qwen3_moe.py:131-132, the shared-expert gate runs as:

if self.expert_gate is not None:
    out = F.sigmoid(self.expert_gate(x)[0]) * out

For Qwen3-Next configs (shared_expert_gate = ReplicatedLinear(H, 1), so weight.shape == [1, H]), this expands at runtime into three back-to-back memory-bound GPU kernels (skinny [N,H]×[H,1] GEMM → sigmoid on [N,1] → broadcast multiply [N,1]·[N,H]) plus two HBM-resident [N,1] intermediates (the GEMM output and the sigmoid output). A single Triton kernel that processes one row at a time (load x_row, load weight, compute sigmoid(dot), load out_row, store out_row * gate) collapses them into one pass and removes both intermediates.
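To make the data flow concrete, here is a minimal plain-Python model of the two paths. It is not the Triton kernel from the patch (which is not shown in this issue); the function names `gate_unfused` and `gate_fused` are hypothetical, and nested lists stand in for tensors. It only illustrates that the row-at-a-time pass computes the same result without materializing the two [N,1] intermediates.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate_unfused(x, weight, out):
    """Three passes, mirroring GEMM -> sigmoid -> broadcast multiply."""
    # Pass 1: skinny GEMM [N, H] x [H, 1] -> logits, an [N, 1] intermediate.
    logits = [sum(xi * wi for xi, wi in zip(row, weight)) for row in x]
    # Pass 2: elementwise sigmoid -> gate, a second [N, 1] intermediate.
    gate = [sigmoid(z) for z in logits]
    # Pass 3: broadcast multiply [N, 1] * [N, H].
    return [[g * o for o in row] for g, row in zip(gate, out)]


def gate_fused(x, weight, out):
    """One pass per row; the gate value stays in a local variable."""
    result = []
    for x_row, out_row in zip(x, out):
        g = sigmoid(sum(xi * wi for xi, wi in zip(x_row, weight)))
        result.append([g * o for o in out_row])
    return result


# Tiny example with N=2 rows and H=3 hidden features (illustrative values).
x = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
weight = [0.1, 0.2, -0.3]
out = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

fused = gate_fused(x, weight, out)
unfused = gate_unfused(x, weight, out)
assert all(
    math.isclose(a, b)
    for fr, ur in zip(fused, unfused)
    for a, b in zip(fr, ur)
)
```

In a real kernel each loop iteration would map to one program instance, and the comparison at the end is the shape a correctness test against the PyTorch reference would take.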

I have a working patch (Triton kernel + Python wrapper with a shape-guard fallback + parametrized correctness test) and a full A/B sweep on MI355x — numbers in the "Misc" section. Before opening a PR, I'd like to confirm maintainers want this fusion to land alongside the existing FSE work, given the points below.

Relationship to existing FSE work

This is complementary to #39280 (AITER FSE) rather than overlapping. FSE (VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS, default False) bypasses the call site at the MoE-block level when the user is on ROCm + AITER and has explicitly opted in. For every other configuration — NVIDIA, ROCm with AITER off, ROCm + AITER + default, and all qwen3_moe.py configs (FSE PR did not touch the duplicate gate code) — the slow 3-kernel path runs today. The proposed Triton fusion is shape-guarded and falls back to the PyTorch reference, so it is a no-op when FSE is enabled; the two never execute simultaneously.
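The shape guard mentioned above could look roughly like the sketch below. The actual guard conditions in the patch are not given in this issue, so everything here is an assumption derived from the shapes stated earlier (x and out are [N, H], the gate weight is [1, H]); `can_use_fused_gate` and `pick_gate_path` are hypothetical names, and a real wrapper would likely also check dtype, device, and contiguity.

```python
from typing import Sequence


def can_use_fused_gate(
    x_shape: Sequence[int],
    weight_shape: Sequence[int],
    out_shape: Sequence[int],
) -> bool:
    """Assumed guard: 2-D inputs, a [1, H] gate weight, and out matching x."""
    if len(x_shape) != 2 or len(weight_shape) != 2 or len(out_shape) != 2:
        return False
    n, h = x_shape
    return tuple(weight_shape) == (1, h) and tuple(out_shape) == (n, h)


def pick_gate_path(
    x_shape: Sequence[int],
    weight_shape: Sequence[int],
    out_shape: Sequence[int],
) -> str:
    """Return which path a wrapper would take; 'reference' is the PyTorch fallback."""
    if can_use_fused_gate(x_shape, weight_shape, out_shape):
        return "fused"
    return "reference"


# Qwen3-Next-like shape from the issue (K=2048) takes the fused path.
assert pick_gate_path((1024, 2048), (1, 2048), (1024, 2048)) == "fused"
# A gate with more than one output column falls back to the reference path.
assert pick_gate_path((1024, 2048), (4, 2048), (1024, 2048)) == "reference"
```

Keeping the guard as a pure function of shapes makes the fallback easy to cover in a parametrized test without needing a GPU.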

One related open PR worth flagging: #37800 (@ChuanLi1101, ready label) proposes flipping VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS to True by default on ROCm. If it lands, the audience for this fusion narrows to NVIDIA + qwen3_moe.py + explicit opt-out — still meaningful coverage. I'd appreciate guidance on whether that gap is worth closing.

Report of performance regression

Not a regression; this is a forward-looking optimization. A/B numbers in the "Misc discussion on performance" section below.

Misc discussion on performance

Setup: single MI355x, Qwen3-Next-80B-A3B-Instruct-FP8, vLLM 0.19.1, TP=1, AITER on, FSE not enabled. Three workload shapes (balanced: ISL=OSL=1024; decode-heavy: ISL=1024, OSL=8192; prefill-heavy: ISL=8192, OSL=1024), three concurrencies per workload (16, 32, 64), one run per cell.

Output throughput Δ per cell (fused vs baseline):

| Workload | CONC=16 | CONC=32 | CONC=64 | mean |
| --- | --- | --- | --- | --- |
| balanced | +2.69% | +7.09% | +4.18% | +4.65% |
| decode-heavy | +5.99% | +6.96% | +6.12% | +6.36% |
| prefill-heavy | +3.87% | +12.68% | +14.32% | +10.29% |
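The "mean" column is consistent with a plain arithmetic mean over the three concurrency levels. The short check below recomputes it from the per-cell deltas quoted above; the dictionary layout is just an illustration, and since each cell is a single run, these means carry no variance estimate.

```python
from statistics import mean

# Output-throughput deltas in percent, per workload, at CONC=16/32/64.
deltas = {
    "balanced": [2.69, 7.09, 4.18],
    "decode-heavy": [5.99, 6.96, 6.12],
    "prefill-heavy": [3.87, 12.68, 14.32],
}

# Arithmetic mean per workload, rounded to two decimals as in the table.
row_means = {name: round(mean(vals), 2) for name, vals in deltas.items()}

for name, value in row_means.items():
    print(f"{name}: +{value:.2f}%")
# balanced: +4.65%
# decode-heavy: +6.36%
# prefill-heavy: +10.29%
```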

Kernel-only microbench (bf16, MI355x) shows a 1.22–1.36× speedup on real Qwen3-Next shapes (N ∈ {1024, 7177, 8192}, K=2048). Detailed Pareto plots and per-cell numbers are in a benchmark HTML report, available on request.

Happy to open the PR as soon as we agree on direction. cc @sighingnow @vadiklyutiy (qwen models codeowners), @mgoin @pavanimajety @zyongye (fused_moe codeowners), @tpopp @dllehr-amd (FSE authors), @ChuanLi1101 (#37800 author).
