vllm - ✅(Solved) Fix [RFC]: Add Nunchaku SVDQuant W4A4 quantization backend [1 pull requests, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#37908 · Fetched 2026-04-08 01:22:43
Comments: 3 · Participants: 2 · Timeline: 12 · Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×3, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #1986: [Feature] Integrate Nunchaku SVDQuant W4A4 for diffusion models

Description (problem / solution / changelog)

Summary

  • Integrate Nunchaku as a quantization backend for diffusion transformers, enabling W4A4 inference with SVD low-rank correction.
  • Verified on Z-Image-Turbo: ~2.2x speedup over BF16 on RTX 5090 with comparable image quality.

Motivation

SVDQuant (W4A4) provides significant inference speedup and reduced memory footprint for DiT models. Nunchaku's PTX-optimized kernels are community-proven (FLUX, Qwen-Image) and lightweight enough to integrate as an optional backend.

The main blocker during integration was weight key mapping: Nunchaku checkpoints use diffusers-style naming while vLLM models use different conventions, and the naming is not standardized across models (Z-Image: w13/w2, Flux: linear_in/linear_out, QwenImage: no remap needed). This mapping must currently be hardcoded per-model in load_weights, which is the primary effort when adding new model support.
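
A per-model mapping table of this kind can be sketched as follows. This is a minimal illustration, not the PR's code: `KEY_REMAP` and `remap_key` are hypothetical names, the Z-Image entries use the `net.0.proj` to `w13` and `net.2` to `w2` renames mentioned in this PR, and the checkpoint key in the usage note is made up. Flux is left out because the source-side key names are not given here.

```python
# Hypothetical per-model table: diffusers-style key fragment -> vLLM key fragment.
KEY_REMAP = {
    "z_image": {".net.0.proj.": ".w13.", ".net.2.": ".w2."},
    "qwen_image": {},  # no remap needed
}


def remap_key(model, name):
    """Translate one checkpoint key; unknown models and keys pass through unchanged."""
    for old, new in KEY_REMAP.get(model, {}).items():
        if old in name:
            return name.replace(old, new)
    return name
```

For example, `remap_key("z_image", "layers.0.feed_forward.net.0.proj.qweight")` yields a key ending in `feed_forward.w13.qweight`, while any `qwen_image` key is returned as-is.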

Additionally, Nunchaku's weight format is highly optimized (tiled/interleaved MMA layout via PTX assembly), so the glue code (weight packing, activation swap, shape calculations) is tightly coupled to Nunchaku's internal layout. This means weight-level manipulation (e.g. row-swapping for SwiGLU convention) is not possible — we handle this via runtime output swap instead.
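
The runtime output swap can be pictured with plain Python lists standing in for tensors. This is only a sketch under stated assumptions: `swap_halves` and `silu_and_mul` are illustrative helpers, not functions from the PR, and `silu_and_mul` merely mimics the [activation ; linear] convention described for vLLM's SiluAndMul.

```python
import math


def silu_and_mul(row):
    """[activation ; linear] convention: silu(first half) * second half."""
    d = len(row) // 2
    return [(g / (1.0 + math.exp(-g))) * u for g, u in zip(row[:d], row[d:])]


def swap_halves(row):
    """Swap the two halves of a merged projection output along the last dimension."""
    d = len(row) // 2
    return row[d:] + row[:d]


# Merged projection output in diffusers order [linear ; activation].
out = [10.0, 20.0, 0.0, 1.0]
fixed = swap_halves(out)      # now [activation ; linear]
result = silu_and_mul(fixed)  # gate values 0.0 and 1.0 scale 10.0 and 20.0
```

Swapping the output rather than the packed weight rows leaves the tiled weight layout untouched, at the cost of one extra reorder per forward pass.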

Changes

  • NunchakuConfig / NunchakuLinearMethod (svdq_nunchaku.py): vLLM quantization plugin with W4A4 GEMM + SVD low-rank correction. Quantizes QKV, MergedColumnParallel, and RowParallel layers; leaves ReplicatedLinear (adaLN, embedders) unquantized.
  • Gated-activation output swap: Nunchaku checkpoints (from diffusers) store merged gate+up weights in diffusers order [linear ; activation], while vLLM's SiluAndMul expects [activation ; linear]. Applied automatically at runtime in NunchakuLinearMethod.apply() for all MergedColumnParallelLinear layers.
  • DiffusionNunchakuConfig (nunchaku.py): Per-model weight key mapping table for translating diffusers-style naming to vLLM conventions.
  • Z-Image support: key remapping (net.0.proj → w13, net.2 → w2) in load_weights, fixed stacked_params_mapping substring collision (.w1 falsely matching .w13).
  • Example script text_to_image_quant.py.
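
The substring collision in the Z-Image item above is easy to reproduce. The sketch below shows one possible guard, matching whole dotted components instead of raw substrings; `has_component` is a hypothetical helper and the key name is illustrative, so this is not necessarily how the PR fixed it.

```python
def has_component(name, fragment):
    """True if `fragment` (e.g. ".w1") is a whole dotted component of `name`."""
    return fragment.strip(".") in name.split(".")


name = "layers.0.feed_forward.w13.qweight"
naive = ".w1" in name                # substring test: collides with ".w13"
strict = has_component(name, ".w1")  # component test: no collision
```

With the substring test, a merged `w13` weight would be routed into the `w1` shard branch; the component test only matches keys that really contain a `w1` segment.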

Quantized Model

Quality Comparison (RTX 5090, seed=42, Z-Image-Turbo 1024x1024)

| BF16 (13.4s) | Nunchaku W4A4 nvfp4 (6.0s, 2.2x faster) |
| --- | --- |
| <img width="400" alt="bf16" src="https://github.com/user-attachments/assets/3a04d63e-3a6d-42d8-a5d0-b8328a904501" /> | <img width="400" alt="quant" src="https://github.com/user-attachments/assets/1778d30a-e46b-461c-bdd4-95acd7516e5a" /> |

Follow-up Plans

  • Auto-infer rank/precision from safetensors file metadata (currently must be specified manually via --rank / --precision). Nunchaku checkpoints embed quantization_config (including rank, group_size, method) and model_class in safetensors metadata — the same mechanism Nunchaku's own from_pretrained uses. This would eliminate the need for users to specify these parameters manually.
  • Auto key mapping: derive weight name mapping from Nunchaku model metadata on meta device, eliminating per-model hardcoding
  • CI/CD tests: unit tests for weight loading, key remapping, and E2E inference
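
For the first follow-up item, reading the metadata needs only the standard library, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header whose optional `__metadata__` entry maps strings to strings. The sketch below assumes that `quantization_config` is stored there as a JSON string with `rank`, `group_size`, and `method` at its top level; that layout and the function names are assumptions, not confirmed details of Nunchaku checkpoints.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the __metadata__ dict from a safetensors header (empty if absent)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})


def infer_quant_params(path):
    """Pull rank/group_size/method out of an assumed JSON-encoded quantization_config."""
    raw = read_safetensors_metadata(path).get("quantization_config")
    if raw is None:
        return None  # not a quantized checkpoint, or metadata missing
    cfg = json.loads(raw)
    return {key: cfg.get(key) for key in ("rank", "group_size", "method")}
```

Returning `None` when the metadata is missing would let a caller fall back to the explicit --rank / --precision flags.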

Test Plan

  • E2E quantized Z-Image-Turbo inference (RTX 5090, RTX 5060 Ti)
  • BF16 vs quantized visual quality comparison (same seed, same GPU)
  • CPU offload compatibility verified

Closes #507

Changed files

  • examples/offline_inference/text_to_image/text_to_image_quant.py (added, +252/-0)
  • vllm_omni/diffusion/layers/quantization/__init__.py (added, +11/-0)
  • vllm_omni/diffusion/layers/quantization/svdq_nunchaku.py (added, +661/-0)
  • vllm_omni/diffusion/model_loader/diffusers_loader.py (modified, +5/-1)
  • vllm_omni/diffusion/models/z_image/z_image_transformer.py (modified, +39/-10)
  • vllm_omni/diffusion/quantization/__init__.py (modified, +3/-0)
  • vllm_omni/diffusion/quantization/base.py (modified, +23/-0)
  • vllm_omni/diffusion/quantization/nunchaku.py (added, +106/-0)
RAW_BUFFER

Motivation.

SVDQuant (W4A4 with low-rank correction) is currently the only practical quantization method for diffusion transformers, delivering 2x+ speedup with minimal quality loss. It is implemented by the Nunchaku library, which provides custom CUDA kernels for W4A4 GEMM with fused low-rank projection.
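
For readers unfamiliar with the method, the low-rank correction can be summarized as below. The notation follows the usual description of SVDQuant rather than anything stated in this issue: $\lambda$ is a per-channel smoothing factor, $L_1 L_2$ is a rank-$r$ branch kept in higher precision, $R$ is the residual, and $Q(\cdot)$ is 4-bit quantization. Presumably smooth_factor, proj_down, and proj_up in the proposal correspond to $\lambda$, $L_1$, and $L_2$, but that mapping is an assumption.

```latex
\hat{X} = X \,\mathrm{diag}(\lambda)^{-1}, \qquad
\hat{W} = \mathrm{diag}(\lambda)\, W, \qquad
\hat{W} = L_1 L_2 + R

XW = \hat{X}\hat{W} \approx \hat{X} L_1 L_2 + Q(\hat{X})\, Q(R)
```

Only the residual product runs as a W4A4 GEMM; the low-rank term is what the fused kernels compute alongside it.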

We are working on Nunchaku integration in vllm-omni (PR #1986). During review, the vllm-omni maintainer raised the concern that implementing this entirely on the omni side involves invasive changes to model pipeline files, and asked us to identify which parts can be reused from vLLM upstream to keep the omni-side integration lightweight.

After analysis, the core quantization backend — QuantizationConfig + LinearMethodBase implementing create_weights() / apply() / process_weights_after_loading() — follows the exact same pattern as existing external kernel integrations like Marlin (for GPTQ/AWQ) and DeepGEMM (for FP8). This part has no diffusion-specific logic and can live in vLLM upstream. Placing it here would allow vllm-omni to import and use it directly, keeping only diffusion-specific glue code (key mapping, activation order handling) on the omni side.

Proposed Change.

Add vllm/model_executor/layers/quantization/svdq_nunchaku.py containing:

  • NunchakuConfig(QuantizationConfig) — Configuration class storing rank, precision (int4/nvfp4), act_unsigned. Implements standard interface: get_name(), get_supported_act_dtypes(), get_min_capability(), get_quant_method().

  • NunchakuLinearMethod(LinearMethodBase) — Linear method that:

    • create_weights(): Creates SVDQuant parameters (qweight, wscales, proj_down, proj_up, smooth_factor, smooth_factor_orig, and nvfp4-specific wcscales/wtscale) with proper TP sharding
    • apply(): Calls Nunchaku's CUDA kernels (svdq_quantize_w4a4_act_fuse_lora_cuda, svdq_gemm_w4a4_cuda) for W4A4 forward pass
    • process_weights_after_loading(): Materializes meta-device parameters, computes alpha scaling for nvfp4
  • Dependency: nunchaku is a soft dependency (lazy import with availability flag), same pattern as marlin, deep_gemm, etc.

  • No model changes: This PR only adds the quantization backend. Model-level integration (key mapping, checkpoint format handling) is done downstream in vllm-omni.
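
The soft-dependency pattern mentioned in the list above can be sketched in a few lines. `optional_import` and `require_nunchaku` are hypothetical names chosen for illustration; only the module name `nunchaku` comes from the proposal.

```python
import importlib


def optional_import(name):
    """Return (module, True) if `name` imports cleanly, else (None, False)."""
    try:
        return importlib.import_module(name), True
    except ImportError:
        return None, False


_nunchaku = None
HAS_NUNCHAKU = None  # resolved lazily on first use


def require_nunchaku():
    """Import nunchaku on demand and fail with a clear message if it is missing."""
    global _nunchaku, HAS_NUNCHAKU
    if HAS_NUNCHAKU is None:
        _nunchaku, HAS_NUNCHAKU = optional_import("nunchaku")
    if not HAS_NUNCHAKU:
        raise ImportError("nunchaku is required for SVDQuant W4A4 but is not installed")
    return _nunchaku
```

Deferring the import this way keeps the backend module importable on machines without Nunchaku; the error only surfaces when a quantized layer is actually constructed.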

The companion PR in vllm-omni (#1986) would then be simplified to only contain diffusion-specific logic:

  • Weight key mapping (diffusers naming → vLLM model naming)
  • SwiGLU activation order handling (diffusers convention vs vLLM convention)
  • Model-level integration (Z-Image, with Flux/HunyuanImage planned)
  • Example scripts

Feedback Period.

2 weeks

CC List.

@lishunyang12 @ZJY0516 @Isotr0py

Any Other Things.

For reference, the quantized checkpoint is available at:

Image quality comparison (BF16 vs W4A4, same seed, RTX 5090) can be found in PR #1986.

extent analysis

Fix Plan

To integrate the SVDQuant quantization method into vLLM, we need to create a quantization backend that can be reused across different models. Here are the steps to achieve this:

  • Create a new file svdq_nunchaku.py in the vllm/model_executor/layers/quantization directory.
  • Define a NunchakuConfig class that inherits from QuantizationConfig and implements the standard interface:
    • get_name()
    • get_supported_act_dtypes()
    • get_min_capability()
    • get_quant_method()
  • Define a NunchakuLinearMethod class that inherits from LinearMethodBase and implements the following methods:
    • create_weights(): Creates SVDQuant parameters with proper TP sharding.
    • apply(): Calls Nunchaku's CUDA kernels for W4A4 forward pass.
    • process_weights_after_loading(): Materializes meta-device parameters and computes alpha scaling for nvfp4.
  • Add a soft dependency on the nunchaku library using lazy import with an availability flag.

Example Code

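# svdq_nunchaku.py
# Illustrative skeleton only: _create_svdq_params, _apply_svdq and
# _process_weights_after_loading are placeholders, not existing APIs.

import torch
from vllm.model_executor.layers.linear import LinearBase, LinearMethodBase
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig


class NunchakuConfig(QuantizationConfig):
    def __init__(self, rank, precision, act_unsigned):
        super().__init__()
        self.rank = rank
        self.precision = precision  # "int4" or "nvfp4"
        self.act_unsigned = act_unsigned

    def get_name(self):
        return "SVDQ_Nunchaku"

    def get_supported_act_dtypes(self):
        return [torch.float16]

    @classmethod
    def get_min_capability(cls):
        # vLLM expects an integer compute capability (80 means SM 8.0).
        return 80

    @staticmethod
    def get_config_filenames():
        # Also abstract in the base class; no extra config files assumed here.
        return []

    @classmethod
    def from_config(cls, config):
        # Also abstract in the base class; these key names are assumptions.
        return cls(config["rank"], config["precision"], config.get("act_unsigned", False))

    def get_quant_method(self, layer, prefix):
        # Return a quantize method for linear layers; None leaves a layer untouched.
        if isinstance(layer, LinearBase):
            return NunchakuLinearMethod(self)
        return None


class NunchakuLinearMethod(LinearMethodBase):
    def __init__(self, quant_config):
        self.quant_config = quant_config

    def create_weights(self, layer, input_size_per_partition, output_partition_sizes,
                       input_size, output_size, params_dtype, **extra_weight_attrs):
        # Register SVDQuant parameters on `layer` with proper TP sharding:
        # qweight, wscales, proj_down, proj_up, smooth_factor, smooth_factor_orig,
        # plus wcscales / wtscale for nvfp4.
        self._create_svdq_params(layer, input_size_per_partition,
                                 sum(output_partition_sizes), params_dtype,
                                 **extra_weight_attrs)

    def apply(self, layer, x, bias=None):
        # Call Nunchaku's CUDA kernels for the W4A4 forward pass.
        return self._apply_svdq(layer, x, bias)

    def process_weights_after_loading(self, layer):
        # Materialize meta-device parameters and compute alpha scaling for nvfp4.
        self._process_weights_after_loading(layer)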

Verification

To verify the fix, run the NunchakuLinearMethod class on a sample input and weight tensor. You can also check the output of the …
