vllm - ✅(Solved) Fix [Feature]: support nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for turing and ampere [1 pull requests, 2 comments, 1 participants]

GitHub stats
vllm-project/vllm#38776 (fetched 2026-04-08 02:22:45)
Comments: 2 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×2, cross-referenced ×1, labeled ×1, subscribed ×1

PR fix notes

PR #38985: [feat] Support modelopt_mixed for Turing and Ampere via Marlin

Description (problem / solution / changelog)

Leverages Marlin kernels to enable modelopt_mixed quantization support, extending compatibility to NVIDIA Turing and Ampere architectures.

Due to limitations in Marlin, tensor dimensions must be aligned; however, the output dimensions of certain layers in the nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 model were misaligned. Consequently, zero-padding was applied. As a precautionary measure, zero-padding was restricted exclusively to layers utilizing FP8 per-tensor quantization.
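
The padding idea can be illustrated with a minimal NumPy sketch. It is not the PR's code: the alignment value of 64 and the helper names `pad_output_dim` and `padded_matmul` are assumptions chosen for illustration. The sketch zero-pads the output dimension of a weight up to the next aligned size, runs the matmul, and slices the result back to the original width.

```python
import numpy as np

ALIGN_N = 64  # assumed alignment for the output dimension; illustrative only


def pad_output_dim(weight: np.ndarray, align: int = ALIGN_N) -> np.ndarray:
    """Zero-pad the output (last) dimension of a [K, N] weight to a multiple of `align`."""
    k, n = weight.shape
    padded_n = -(-n // align) * align  # round N up to the next multiple of align
    if padded_n == n:
        return weight
    padded = np.zeros((k, padded_n), dtype=weight.dtype)
    padded[:, :n] = weight  # original columns first, zero columns after
    return padded


def padded_matmul(x: np.ndarray, weight: np.ndarray, align: int = ALIGN_N) -> np.ndarray:
    """Multiply by the padded weight, then drop the columns produced by the padding."""
    n = weight.shape[1]
    out = x @ pad_output_dim(weight, align)
    return out[:, :n]


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128))
w = rng.standard_normal((128, 100))  # N = 100 is not a multiple of 64

reference = x @ w
result = padded_matmul(x, w)
```

Because the added columns are all zero, they only produce extra output columns that are sliced away, so the visible result matches the unpadded matmul. With a single per-tensor scale, the padded columns also need no additional scale entries.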

@jinzhen-lin Could you please take a look at this PR when you have a moment?

#38776

Purpose

Support model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 for Turing or Ampere.

Test Plan

Validated the implementation with nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 on 4x RTX 3090, checking functional correctness and performance stability.

Command: vllm serve nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 --served-model-name Nemotron-3-Super --tensor-parallel-size 4 --enable-expert-parallel --trust-remote-code --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser nemotron_v3 --kv-cache-memory-bytes 3G --max-model-len auto --async-scheduling --enable-prefix-caching --enable-chunked-prefill --max-num-seqs 4

Test Result

The vllm serve command launched successfully, and performance tests appear normal.

vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:32 [loggers.py:259] Engine 000: Avg prompt throughput: 2.9 tokens/s, Avg generation throughput: 46.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.6%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO:     127.0.0.1:35860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:42 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:15:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO:     127.0.0.1:46700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:18:42 [loggers.py:259] Engine 000: Avg prompt throughput: 2.2 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-nemotron-3-super-1  | (APIServer pid=1) INFO 04-04 13:18:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Previously, the model nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 failed to launch because the modelopt_mixed constraint restricted deployment to SM89 or newer architectures only.
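
The capability gate described here can be sketched as follows. This is a hypothetical illustration, not vLLM's dispatch code: the `Capability` type, the `pick_fp8_backend` function, and the backend labels are all invented for the example. It assumes the usual compute-capability numbers (Turing is SM75, Ampere is SM80/SM86, and the previous lower bound was SM89).

```python
from typing import NamedTuple


class Capability(NamedTuple):
    """CUDA compute capability, e.g. (8, 6) for an RTX 3090."""
    major: int
    minor: int

    def as_int(self) -> int:
        return self.major * 10 + self.minor


def pick_fp8_backend(cap: Capability) -> str:
    """Return a hypothetical backend label for FP8 scaled matmul on this GPU."""
    sm = cap.as_int()
    if sm >= 89:
        return "native_fp8"   # hardware FP8 path, the only one allowed before the change
    if sm >= 75:
        return "marlin_fp8"   # Turing/Ampere fallback via Marlin, per the PR description
    raise ValueError(f"SM{sm} is not covered by this sketch")


rtx3090_backend = pick_fp8_backend(Capability(8, 6))
```

Under the old behaviour, the second branch did not exist, so an SM86 device such as the RTX 3090 would have been rejected.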



Changed files

  • csrc/quantization/marlin/awq_marlin_repack.cu (modified, +6/-4)
  • csrc/quantization/marlin/gptq_marlin_repack.cu (modified, +6/-4)
  • vllm/model_executor/kernels/linear/scaled_mm/cutlass.py (modified, +4/-0)
  • vllm/model_executor/layers/quantization/modelopt.py (modified, +13/-1)
  • vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py (modified, +66/-11)

🚀 The feature, motivation and pitch

modelopt_mixed currently supports only SM89 or above, and nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 uses modelopt_mixed. modelopt_mixed needs an FP8ScaledMMLinearKernel implementation. I reviewed the code and noted that Marlin FP8 already supports per-tensor FP8 matrix multiplication, which it implements by expanding the per-tensor scale to per-channel scales. Consequently, we can use Marlin FP8 to implement an FP8ScaledMMLinearKernel that supports Turing and Ampere architectures.
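
The per-tensor to per-channel step mentioned above can be sketched in NumPy. This is a conceptual illustration only, not Marlin or vLLM code: the function names are hypothetical, and small integers stored as float64 stand in for FP8 weight values because NumPy has no FP8 dtype. The point is that broadcasting one scalar scale into a vector with one entry per output channel leaves the result unchanged.

```python
import numpy as np


def expand_per_tensor_scale(scale: float, out_features: int) -> np.ndarray:
    """Repeat a single per-tensor scale once per output channel."""
    return np.full((out_features,), scale, dtype=np.float64)


def per_channel_scaled_matmul(x: np.ndarray, w_q: np.ndarray,
                              channel_scales: np.ndarray) -> np.ndarray:
    """Matmul against quantized weights [K, N], then scale each output column."""
    return (x @ w_q) * channel_scales


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))
# Stand-in for FP8 weight values: small integers stored as float64.
w_q = rng.integers(-8, 8, size=(16, 32)).astype(np.float64)
per_tensor_scale = 0.05

channel_scales = expand_per_tensor_scale(per_tensor_scale, w_q.shape[1])
per_channel_out = per_channel_scaled_matmul(x, w_q, channel_scales)
per_tensor_out = x @ (w_q * per_tensor_scale)  # reference: dequantize, then multiply
```

A kernel that only understands per-channel scales can therefore serve per-tensor checkpoints, at the cost of storing N copies of the same scale.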

extent analysis

TL;DR

Implement an FP8ScaledMMLinearKernel using Marlin FP8 to support Turing and Ampere architectures in the modelopt_mixed feature.

Guidance

  • Reuse Marlin FP8's existing per-tensor FP8 matrix multiplication to implement the FP8ScaledMMLinearKernel.
  • Note that Marlin FP8 handles per-tensor scales by expanding them to per-channel scales, so no new scaling scheme is required.
  • Lift the SM89-or-above restriction in modelopt_mixed for this path so that Turing and Ampere GPUs, which fall below SM89, are covered.
  • Review the existing code to identify areas where the FP8ScaledMMLinearKernel implementation can be integrated.

Example

No explicit code example can be provided without more context, but the implementation should involve extending Marlin FP8's matrix multiplication capabilities to support the FP8ScaledMMLinearKernel interface.

Notes

The solution relies on the availability of Marlin FP8's per-tensor FP8 matrix multiplication implementation and its extensibility to a per-channel approach. Uncertainty remains about the specific requirements and constraints of the FP8ScaledMMLinearKernel interface.

Recommendation

Apply workaround: Implement the FP8ScaledMMLinearKernel using Marlin FP8 to provide support for the required architectures, as PR #38985 does. The issue itself does not name a released version that contains the fix.
