PR fix notes

PR #38985: [feat] Support modelopt_mixed for Turing and Ampere via Marlin

Repository: vllm-project/vllm
Author: ir1ka
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38985

Description (problem / solution / changelog)

Leverages Marlin kernels to enable modelopt_mixed quantization support, extending compatibility to NVIDIA Turing and Ampere architectures.

Due to limitations in Marlin, tensor dimensions must be aligned; however, the output dimensions of certain layers in the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model were misaligned. Consequently, zero-padding was applied. As a precautionary measure, zero-padding was restricted exclusively to layers utilizing FP8 per-tensor quantization.

@jinzhen-lin Could you please take a look at this PR when you have a moment?

#38776

Purpose

Support model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for Turing or Ampere.

Test Plan

Test model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4. Validated implementation using nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4, ensuring functional correctness and performance stability on 4x RTX3090.

Command: vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --served-model-name Nemotron-3-Super --tensor-parallel-size 4 --enable-expert-parallel --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser nemotron_v3 --kv-cache-memory-bytes 3G --max-model-len auto --async-scheduling --enable-prefix-caching --enable-chunked-prefill --max-num-seqs 4

Test Result

The vllm serve command launched successfully, and performance tests appear normal.

vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:32 [loggers.py:259] Engine 000: Avg prompt throughput: 2.9 tokens/s, Avg generation throughput: 46.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO:     127.0.0.1:35860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO:     127.0.0.1:46700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:18:42 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:18:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Previously, the model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 failed to launch because the modelopt_mixed constraint restricted deployment to SM89 or newer architectures only.

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

</details>

Changed files

csrc/quantization/marlin/awq_marlin_repack.cu (modified, +6/-4)
csrc/quantization/marlin/gptq_marlin_repack.cu (modified, +6/-4)
vllm/model_executor/kernels/linear/scaled_mm/cutlass.py (modified, +4/-0)
vllm/model_executor/layers/quantization/modelopt.py (modified, +13/-1)
vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py (modified, +66/-11)

🚀 The feature, motivation and pitch

Only support sm89 or above in modelopt_mixed, and the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 is modelopt_mixed. The modelopt_mixed need FP8ScaledMMLinearKernel impl. I reviewed the code and noted that Marlin FP8 already supports per-tensor FP8 matrix multiplication, implemented by extending it to a per-channel approach. Consequently, we can also leverage Marlin FP8 to implement an FP8ScaledMMLinearKernel to provide support for Turing and Ampere architectures.

Alternatives

No response

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Implement an FP8ScaledMMLinearKernel using Marlin FP8 to support Turing and Ampere architectures in the modelopt_mixed feature.

Guidance

Leverage Marlin FP8's per-tensor FP8 matrix multiplication support to implement the FP8ScaledMMLinearKernel.
Extend Marlin FP8's implementation to a per-channel approach to provide the required support.
Focus on supporting sm89 or above in modelopt_mixed to ensure compatibility with the target architectures.
Review the existing code to identify areas where the FP8ScaledMMLinearKernel implementation can be integrated.

Example

No explicit code example can be provided without more context, but the implementation should involve extending Marlin FP8's matrix multiplication capabilities to support the FP8ScaledMMLinearKernel interface.

Notes

The solution relies on the availability of Marlin FP8's per-tensor FP8 matrix multiplication implementation and its extensibility to a per-channel approach. Uncertainty remains about the specific requirements and constraints of the FP8ScaledMMLinearKernel interface.

Recommendation

Apply workaround: Implement the FP8ScaledMMLinearKernel using Marlin FP8 to provide support for the required architectures, as a direct upgrade to a fixed version is not implied in the issue.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere [1 pull requests, 2 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

PR fix notes

PR #38985: [feat] Support modelopt_mixed for Turing and Ampere via Marlin

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

🚀 The feature, motivation and pitch

Alternatives

Additional context

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere [1 pull requests, 2 comments, 1 participants]

Recommended Tools

GitHub issue graph ai analysis

PR fix notes

PR #38985: [feat] Support modelopt_mixed for Turing and Ampere via Marlin

Description (problem / solution / changelog)

Purpose

Test Plan

Test Result

Changed files

🚀 The feature, motivation and pitch

Alternatives

Additional context

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING