vllm [Performance]: NVFP4 MoE on SM120: no env override to select backend (FLASHINFER_CUTLASS vs MARLIN) [1 comment, 2 participants]

GitHub stats
vllm-project/vllm#38971 (fetched 2026-04-08 02:44:42)
Comments: 1, Participants: 2, Timeline events: 4, Reactions: 0
Timeline (top): closed ×1, commented ×1, labeled ×1, subscribed ×1

Proposal to improve performance

Add an env variable (e.g. VLLM_NVFP4_MOE_BACKEND) to allow users to override the NVFP4 MoE backend selection. Currently the backend is auto-selected with no override possible.
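
One way such an override could be resolved is sketched below. This is a minimal illustration, not vLLM's implementation: `VLLM_NVFP4_MOE_BACKEND` is the name proposed in this issue and does not exist, `resolve_backend` is a hypothetical helper, and the assumption that auto-selection takes the first entry of a priority-ordered list is made only for the example. The backend names are copied from the v0.19.0 log line quoted in this issue.

```python
import os

# Backend names copied from the v0.19.0 log line quoted in this issue.
KNOWN_BACKENDS = (
    "FLASHINFER_TRTLLM",
    "FLASHINFER_CUTEDSL",
    "FLASHINFER_CUTLASS",
    "VLLM_CUTLASS",
    "MARLIN",
)

# Hypothetical variable name proposed in the issue; vLLM does not define it.
ENV_VAR = "VLLM_NVFP4_MOE_BACKEND"


def resolve_backend(supported, env=None):
    """Pick a backend: a valid override wins, otherwise the first supported entry.

    `supported` is assumed to be a priority-ordered list of the backends that
    work on the current hardware (an assumption made for this sketch).
    """
    env = os.environ if env is None else env
    if not supported:
        raise ValueError("no supported NVFP4 MoE backend on this hardware")

    override = env.get(ENV_VAR, "").strip().upper()
    if not override:
        # No override set: keep the automatic choice.
        return supported[0]
    if override not in KNOWN_BACKENDS:
        raise ValueError(f"{ENV_VAR}={override!r} is not one of {KNOWN_BACKENDS}")
    if override not in supported:
        raise ValueError(f"{ENV_VAR}={override!r} is not supported on this hardware")
    return override
```

With this shape, an unset variable preserves today's behavior, and an invalid or unsupported value fails loudly at startup instead of silently falling back.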

Report of performance regression

Since v0.19.0, NVFP4 MoE models on SM120 (RTX PRO 6000) use FLASHINFER_CUTLASS by default. In v0.17.1, Marlin was the automatic fallback since FLASHINFER_CUTLASS did not support SM120 at that time.

Single-user throughput on SM120 with Nemotron 3 Super (NVFP4, 120B):

  • Marlin (v0.17.1 default): ~92 tok/s
  • FLASHINFER_CUTLASS (v0.19.0 default): ~74 tok/s
  • Regression: ~20-25%

Currently there is no way to override the MoE backend:

  • VLLM_NVFP4_MOE_BACKEND does not exist
  • VLLM_NVFP4_GEMM_BACKEND=MARLIN crashes with MoE models

Log output (v0.19.0):

Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:
['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN']
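
To confirm which backend a given run picked, the selection line can be extracted from the server log. The sketch below assumes the log wording matches the line quoted above; `selected_backend` is a hypothetical helper name, and the pattern would need adjusting if the message format changes between releases.

```python
import re

# Matches the selection message as quoted in this issue (v0.19.0 wording).
LOG_PATTERN = re.compile(r"Using '([A-Z0-9_]+)' NvFp4 MoE backend")


def selected_backend(log_text):
    """Return the backend name from the first matching log line, or None."""
    match = LOG_PATTERN.search(log_text)
    return match.group(1) if match else None


sample = (
    "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:\n"
    "['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', "
    "'VLLM_CUTLASS', 'MARLIN']"
)
print(selected_backend(sample))  # prints FLASHINFER_CUTLASS
```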

KV cache comparison: Marlin vs FLASHINFER_CUTLASS

Tested both backends on the same hardware with identical parameters (--gpu-memory-utilization 0.95, --max-num-seqs 512):

Metric | v0.17.1 (Marlin) | v0.19.0 (FLASHINFER_CUTLASS)
--- | --- | ---
Available KV cache | 16.79 GiB | 18.37 GiB
KV cache tokens | 732,160 | 798,720
Max concurrency @262K | 14.27x | 15.62x
Single-user tok/s | ~92 | ~74

FLASHINFER_CUTLASS provides ~9% more KV cache but ~20% lower single-user throughput. An env override would let users choose the right tradeoff for their workload.
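
The percentages quoted above follow directly from the reported figures. A short check, using only numbers from this issue (`relative_change` is a helper defined here for illustration):

```python
def relative_change(old, new):
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# Figures reported in this issue: v0.17.1 (Marlin) vs v0.19.0 (FLASHINFER_CUTLASS).
kv_gib = relative_change(16.79, 18.37)         # available KV cache, GiB
kv_tokens = relative_change(732_160, 798_720)  # KV cache tokens
throughput = relative_change(92, 74)           # single-user tok/s (approximate inputs)

print(f"KV cache (GiB):    {kv_gib:+.1f}%")      # +9.4%
print(f"KV cache (tokens): {kv_tokens:+.1f}%")   # +9.1%
print(f"Single-user tok/s: {throughput:+.1f}%")  # -19.6%
```

Because the throughput inputs are themselves approximate (~92 and ~74 tok/s), the roughly 20% figure should be read as an estimate rather than an exact measurement.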

extent analysis

TL;DR

Adding an environment variable VLLM_NVFP4_MOE_BACKEND to override the NVFP4 MoE backend selection could help mitigate the performance regression.

Guidance

  • Introduce the proposed environment variable VLLM_NVFP4_MOE_BACKEND to allow users to manually select the MoE backend, potentially opting for Marlin over FLASHINFER_CUTLASS for better performance in certain workloads.
  • Verify the impact of this change by comparing single-user throughput with different backends, using metrics such as tokens per second.
  • Consider testing with various hardware configurations and parameters (e.g., --gpu-memory-utilization, --max-num-seqs) to understand the tradeoffs between KV cache availability and throughput.
  • Evaluate the effectiveness of the override by monitoring performance metrics and adjusting the backend selection as needed.
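
The throughput comparison suggested above can be scripted with a small timing harness. The sketch below is generic and makes no vLLM calls: `generate` is a placeholder for whatever function issues one request and returns the number of tokens it produced, and `tokens_per_second` is a hypothetical helper name.

```python
import time


def tokens_per_second(generate, runs=3, timer=time.perf_counter):
    """Average throughput over `runs` calls.

    `generate` must be a zero-argument callable that performs one request and
    returns the number of generated tokens. `timer` is injectable for testing.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = timer()
        total_tokens += generate()
        total_seconds += timer() - start
    if total_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / total_seconds
```

Running the same harness once per backend, with identical prompts and server flags, keeps the comparison limited to the backend change.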

Example

The issue provides no code snippet. The override is proposed as an environment variable, VLLM_NVFP4_MOE_BACKEND; an equivalent configuration option or command-line argument would serve the same purpose of letting users specify their preferred MoE backend.

Notes

The performance regression appears to be specific to the SM120 (RTX PRO 6000) hardware and the NVFP4 MoE models. The proposed solution may not apply to other hardware configurations or model types.

Recommendation

Request the feature: VLLM_NVFP4_MOE_BACKEND does not exist yet, so this is a proposed change to vLLM rather than a workaround users can apply today. If added, the override would let users on SM120 choose between Marlin and FLASHINFER_CUTLASS depending on whether single-user throughput or KV cache capacity matters more for their workload.
