vllm - 💡(How to fix) Fix [Performance]: NVFP4 MoE on SM120: no env override to select backend (FLASHINFER_CUTLASS vs MARLIN) [1 comments, 2 participants]

vllm · 2026-04-04 07:00:53


GitHub stats

vllm-project/vllm#38971•Fetched 2026-04-08 02:44:42


Author

mmeyer-datendo

Participants

mmeyer-datendo

robertgshaw2-redhat

Timeline (top)

closed ×1, commented ×1, labeled ×1, subscribed ×1

Proposal to improve performance

Add an env variable (e.g. VLLM_NVFP4_MOE_BACKEND) to allow users to override the NVFP4 MoE backend selection. Currently the backend is auto-selected with no override possible.
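A minimal sketch of what such an override might look like. Everything here is hypothetical: VLLM_NVFP4_MOE_BACKEND does not exist in vLLM (per this issue), and `select_nvfp4_moe_backend` is an illustrative helper, not vLLM's actual selection code. The backend names are copied from the v0.19.0 log line quoted in this issue; treating that list as a priority order where the first supported entry wins is an assumption made for illustration.

```python
import os

# Backend names as printed in the v0.19.0 log line quoted in this issue.
NVFP4_MOE_BACKENDS = [
    "FLASHINFER_TRTLLM",
    "FLASHINFER_CUTEDSL",
    "FLASHINFER_CUTLASS",
    "VLLM_CUTLASS",
    "MARLIN",
]

# Hypothetical variable name proposed in the issue; vLLM does not read it today.
ENV_VAR = "VLLM_NVFP4_MOE_BACKEND"


def select_nvfp4_moe_backend(supported, environ=None):
    """Return the backend to use, honoring an optional env override.

    `supported` is the set of backend names usable on the current device.
    Without an override, the first supported name in NVFP4_MOE_BACKENDS wins
    (an assumption of this sketch, not confirmed vLLM behavior).
    """
    environ = os.environ if environ is None else environ
    requested = environ.get(ENV_VAR, "").strip().upper()
    if requested:
        if requested not in NVFP4_MOE_BACKENDS:
            raise ValueError(
                f"{ENV_VAR}={requested!r} is not one of {NVFP4_MOE_BACKENDS}"
            )
        if requested not in supported:
            raise ValueError(
                f"{ENV_VAR}={requested!r} is not supported on this device"
            )
        return requested
    # No override: fall back to automatic selection.
    for name in NVFP4_MOE_BACKENDS:
        if name in supported:
            return name
    raise RuntimeError("no NVFP4 MoE backend is supported on this device")
```

Failing loudly on an unknown or unsupported value, rather than silently falling back, would make a misconfigured override visible at startup.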

Report of performance regression

Since v0.19.0, NVFP4 MoE models on SM120 (RTX PRO 6000) use FLASHINFER_CUTLASS by default. In v0.17.1, Marlin was the automatic fallback since FLASHINFER_CUTLASS did not support SM120 at that time.

Single-user throughput on SM120 with Nemotron 3 Super (NVFP4, 120B):

Marlin (v0.17.1 default): ~92 tok/s
FLASHINFER_CUTLASS (v0.19.0 default): ~74 tok/s
Regression: ~20-25%
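
The ~20-25% range follows from which backend is taken as the baseline. A quick check using only the two approximate figures above (the helper name `relative_change` is illustrative):

```python
marlin_tps = 92.0   # ~tok/s, v0.17.1 (Marlin), as reported in the issue
cutlass_tps = 74.0  # ~tok/s, v0.19.0 (FLASHINFER_CUTLASS), as reported


def relative_change(new, old):
    """Fractional change going from `old` to `new`."""
    return (new - old) / old


# Slowdown measured against the old Marlin default.
slowdown = relative_change(cutlass_tps, marlin_tps)
# Speedup Marlin would give, measured against the new default.
speedup = relative_change(marlin_tps, cutlass_tps)

print(f"FLASHINFER_CUTLASS vs Marlin: {slowdown:+.1%}")  # -19.6%
print(f"Marlin vs FLASHINFER_CUTLASS: {speedup:+.1%}")   # +24.3%
```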

Currently there is no way to override the MoE backend:

VLLM_NVFP4_MOE_BACKEND does not exist
VLLM_NVFP4_GEMM_BACKEND=MARLIN crashes with MoE models

Log output (v0.19.0):

Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends:
['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN']

KV cache comparison: Marlin vs FLASHINFER_CUTLASS

Tested both backends on the same hardware with identical parameters (--gpu-memory-utilization 0.95, --max-num-seqs 512):

| Metric | v0.17.1 (Marlin) | v0.19.0 (FLASHINFER_CUTLASS) |
|---|---|---|
| Available KV cache | 16.79 GiB | 18.37 GiB |
| KV cache tokens | 732,160 | 798,720 |
| Max concurrency @262K | 14.27x | 15.62x |
| Single-user tok/s | ~92 | ~74 |

FLASHINFER_CUTLASS provides ~9% more KV cache but ~20% lower single-user throughput. An env override would let users choose the right tradeoff for their workload.
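
The ~9% capacity figure can be reproduced from the reported numbers. This sketch covers only the three capacity rows; the variable and function names are illustrative.

```python
# (Marlin on v0.17.1, FLASHINFER_CUTLASS on v0.19.0), values from the issue.
capacity = {
    "kv_cache_gib": (16.79, 18.37),
    "kv_cache_tokens": (732_160, 798_720),
    "max_concurrency_262k": (14.27, 15.62),
}


def pct_change(marlin, cutlass):
    """Percent change going from the Marlin value to the CUTLASS value."""
    return (cutlass - marlin) / marlin * 100.0


deltas = {name: round(pct_change(*pair), 1) for name, pair in capacity.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
# kv_cache_gib: +9.4%
# kv_cache_tokens: +9.1%
# max_concurrency_262k: +9.5%
```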

Misc discussion on performance

No response


extent analysis

TL;DR

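The issue proposes an environment variable, VLLM_NVFP4_MOE_BACKEND, to override NVFP4 MoE backend selection. It does not exist as of v0.19.0. If added, users on SM120 could select Marlin again and recover the ~20% single-user throughput lost to FLASHINFER_CUTLASS.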

Guidance

Introduce the proposed VLLM_NVFP4_MOE_BACKEND environment variable so users can select the MoE backend manually, for example Marlin instead of FLASHINFER_CUTLASS when single-user throughput matters more than KV cache capacity.
Verify the change by measuring single-user throughput (tokens per second) with each backend on the same hardware and with identical parameters.
Test across hardware configurations and parameters (e.g., --gpu-memory-utilization, --max-num-seqs) to quantify the tradeoff between available KV cache and throughput.
Keep monitoring these metrics after switching, and revisit the backend choice if the workload changes.
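
The throughput comparison above could be scripted along these lines. This is a sketch under stated assumptions: `measure_tok_per_s` is a hypothetical helper, and `generate` stands in for whatever client call sends one request to the server and returns the number of generated tokens (the issue does not say how its numbers were measured).

```python
import statistics
import time


def measure_tok_per_s(generate, runs=5, clock=time.perf_counter):
    """Median tokens/second over `runs` sequential single-user requests.

    `generate` is a caller-supplied callable that performs one request and
    returns the number of tokens it produced (placeholder, not a vLLM API).
    """
    rates = []
    for _ in range(runs):
        start = clock()
        n_tokens = generate()
        elapsed = clock() - start
        rates.append(n_tokens / elapsed)
    # The median is less sensitive to a slow warm-up run than the mean.
    return statistics.median(rates)
```

Running the same script once per backend, on the same machine and with the same prompt and output length, keeps the two numbers comparable.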

Example

The issue provides no code snippet. The override could be exposed as the proposed VLLM_NVFP4_MOE_BACKEND environment variable, or alternatively as a configuration option or command-line argument, letting users specify their preferred MoE backend.

Notes

The regression was reported on SM120 (RTX PRO 6000) with an NVFP4 MoE model (Nemotron 3 Super, 120B). It may not reproduce on other hardware or model types, so the proposed override may not help there.

Recommendation

Add the VLLM_NVFP4_MOE_BACKEND environment variable to vLLM so users can override the default MoE backend and choose the one that fits their workload. This is a feature request rather than a workaround available today: in v0.19.0 the variable does not exist, and VLLM_NVFP4_GEMM_BACKEND=MARLIN crashes with MoE models.


#latency issue #model loading #dependency error #configuration error #environment variable
