vllm - 💡(How to fix) Fix [RFC]: Unified ModelOpt Quantization in vLLM [1 comments, 2 participants]

vllm · 2026-04-18 00:52:57

GitHub stats

vllm-project/vllm#40182•Fetched 2026-04-18 05:52:09

Author

juhi10071998

Participants

juhi10071998

robertgshaw2-redhat

Timeline (top)

mentioned ×3 · subscribed ×3 · commented ×1 · labeled ×1

This issue proposes implementing a bridge that routes ModelOpt FP8 checkpoints (FP8_PER_CHANNEL_PER_TOKEN and FP8_BLOCK) through vLLM's existing Compressed-Tensors (CT) infrastructure, rather than keeping them on an isolated native path. This mirrors what SGLang PR #19101 did.

Motivation.

Summary

Background

This issue aims to expand vLLM's ModelOpt quantization support from the current limited implementation to encompass the full suite of NVIDIA Model Optimizer quantization recipes. Rather than building separate implementations for each format, we propose a unified architecture that bridges ModelOpt checkpoints to vLLM's existing compressed-tensors (llm-compressor) infrastructure, enabling code reuse, kernel sharing, and consistent performance across quantization methods.

The bridge: detect at load time that a checkpoint is from ModelOpt, convert its config into CT format, and reuse all existing CT infrastructure.

Root Tensor and Config Mismatch

The changes need to be made at two levels:

- Detect that the config comes from ModelOpt and translate it to CT style.
- Adapt the tensors. The .safetensors files that ModelOpt produces may not align with the architecture, weights, and params created by the Compressed-Tensors path, so tensor shapes must be updated before they are copied into the params; otherwise the load fails with assertion errors.

Proposed Change.

Implementation Details

Phase 1 - Config Building phase

1. Scheme Detection

modelopt_scheme.py (NEW): ModelOptQuantizationScheme enum and detect_modelopt_quantization_scheme(config).

Detection uses quant_cfg / recipe first, then falls back to quant_algo (including ModelOpt-specific values like fp8_pb_wo for block).
Schemes are tagged with uses_ct_bridge() and quant_method() for routing and CLI (--quantization modelopt_fp8).

2. Config Routing

weight_utils.get_quant_config() (HF and file-based paths): after resolving the quant config dict, run scheme detection. If the scheme is bridge-supported, route it to modelopt_config_to_compressed_tensors; otherwise create the ModelOpt config the existing way.

model_config._parse_modelopt_quant_config() uses the same scheme detection so quant_method is set correctly for bridge vs native.

3. Bridge Config Builders

modelopt_ct_bridge.py (NEW):

We might have to define a modelopt_config_to_compressed_tensors builder for every scheme we wish to route, so this needs to be done in a scalable manner.

For instance, SGLang's PR converts only FP8_PER_CHANNEL_PER_TOKEN (FP8PCPT), using the following:

modelopt_config_to_compressed_tensors_config() — builds a CompressedTensorsConfig for per-channel per-token FP8 (W8A8, channel strategy, dynamic per-token activations).

CT configs can set _modelopt_bridge=True where needed (e.g. for KV scale handling and 1D→2D scale reshape only when loading ModelOpt checkpoints).
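As a rough illustration of what such a builder might produce, the sketch below translates a ModelOpt-style config dict into a CT-style config dict for per-channel, per-token FP8. The function name, the input layout (a `quantization` block holding `quant_algo` and `exclude_modules`), and the CT field names are assumptions made for illustration; the builder described above returns a CompressedTensorsConfig object rather than a plain dict.

```python
def build_ct_config_for_fp8_pcpt(modelopt_config):
    """Hypothetical translator: ModelOpt FP8 per-channel/per-token -> CT-style dict."""
    # Assumption: quant_algo lives under "quantization" or at the top level.
    quant = modelopt_config.get("quantization", modelopt_config)
    algo = str(quant.get("quant_algo", "")).upper()
    if algo != "FP8_PER_CHANNEL_PER_TOKEN":
        raise ValueError(f"unsupported quant_algo for this builder: {algo!r}")

    fp8 = {"num_bits": 8, "type": "float", "symmetric": True}
    return {
        "quant_method": "compressed-tensors",
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                # W8: static per-channel weight scales.
                "weights": {**fp8, "strategy": "channel", "dynamic": False},
                # A8: dynamic per-token activation scales.
                "input_activations": {**fp8, "strategy": "token", "dynamic": True},
            }
        },
        "ignore": list(quant.get("exclude_modules", [])),
        # Marker from the RFC: enables ModelOpt-only handling at load time.
        "_modelopt_bridge": True,
    }


example = {
    "quantization": {
        "quant_algo": "FP8_PER_CHANNEL_PER_TOKEN",
        "exclude_modules": ["lm_head"],
    }
}
ct_config = build_ct_config_for_fp8_pcpt(example)
```

Raising on an unexpected `quant_algo` keeps each builder responsible for exactly one scheme, which makes it easier to notice when a recipe changes upstream.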

Phase 2 (Model Build) happens once at startup, for the empty model skeleton. No weights yet — just allocating parameter slots and assigning a linear method object to each layer that knows how to forward.

Phase 3 (Weight Loading) happens after the skeleton is built. It reads the checkpoint file tensor-by-tensor and copies values into the parameter slots allocated in Phase 2. This is where we need to make changes since the .safetensors are coming from ModelOpt while the param slots are created by CT.

Checkpoint Format Adaptation (ModelOpt vs CT)

Scales: ModelOpt sometimes uses different shapes/dtypes than CT:

4D block scales (32, 1, 112, 1) vs 2D (32, 112)
Scalar k_scale / v_scale not present in CT block path

Weight loader transform (for FP8 block): when loading a ModelOpt FP8 block checkpoint:

Remap keys: *.weight_scale → *.weight_scale_inv
Squeeze 4D scale tensors to 2D
Skip k_scale / v_scale for the block path
BlockQuantScaleParameter also squeezes 4D→2D on load so any load path is robust
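The three block-path steps above (remap, squeeze, skip) can be sketched as a small generator. To keep it runnable without torch, this version works on (name, shape) pairs instead of real tensors; the function name and the example tensor names are hypothetical, and a real loader would apply the same steps to the tensors themselves.

```python
def normalize_block_entries(entries):
    """Yield (name, shape) pairs adapted from ModelOpt FP8-block to CT layout.

    Shapes stand in for tensors so the sketch runs without torch.
    """
    for name, shape in entries:
        # k_scale / v_scale have no parameter slot on the CT block path.
        if name.endswith((".k_scale", ".v_scale")):
            continue
        if name.endswith(".weight_scale"):
            # The CT block path expects the key *.weight_scale_inv.
            name = name + "_inv"
            # 4D block scales such as (32, 1, 112, 1) become 2D (32, 112).
            if len(shape) == 4 and shape[1] == 1 and shape[3] == 1:
                shape = (shape[0], shape[2])
        yield name, shape


checkpoint = [
    ("model.layers.0.mlp.down_proj.weight", (4096, 14336)),
    ("model.layers.0.mlp.down_proj.weight_scale", (32, 1, 112, 1)),
    ("model.layers.0.self_attn.k_proj.k_scale", ()),
]
normalized = list(normalize_block_entries(checkpoint))
```

Only the two singleton axes at positions 1 and 3 are dropped, rather than every size-1 axis, so a scale whose leading dimension happens to be 1 still ends up 2D.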

We may need different transformations for different schemes, depending on what ModelOpt produces and what CT expects, before the weights can be loaded into the created graph structure.

The underlying question is how to make this step generic yet customizable, without scattering changes everywhere. This is one observation I drew from the corresponding SGLang PR.
The core reason for these transforms is that a shape mismatch can occur when the loaded weights are copied into the params, and there is no single path to that weight-copying step.
We could reach it through ColumnParallelLinear / RowParallelLinear, but the source is the same and each param inherits from the BaseParam class.
- [Option A] At the source level, when iterating weights (the param shape is not known here, so this relies on a name-based heuristic)
- [Option B] At the param level, where both the desired param shape and the loaded weight shape are available
  - [Option B1] Do it in the BaseParam class and call it at the top of every param load method
  - [Option B2] Do it as needed in each param (how the current SGLang PR implements it)
Ideally, only what is really needed, such as the RowParallelLinear transform, should be pushed past this step. We should avoid changing where the kernel call happens, or the kernel itself, since these mismatches are expected to be resolved by the time we reach that point.

Proposed Call Flow

Phase 3 — Weight Loading (Options diverge here)

  ▼ DefaultModelLoader.load_weights(model, model_config)

  OPTION A (iterator)
    if is_modelopt_bridge_checkpoint(model_config):
        weights_gen = normalize_modelopt_bridge_weights_iterator(
            weights_gen, model_config)
    For each (name, tensor):
        skip .k_scale / .v_scale  (no CT parameter slot)
        block only: rename .weight_scale → .weight_scale_inv
        if tensor.dim()==4:  squeeze → 2D
        if tensor.dim()==1 and name.endswith("weight_scale"):
            tensor = tensor.unsqueeze(-1)   # (N,) → (N,1)
        if tensor dtype != float32:  cast
        yield name, tensor
    All downstream load methods receive clean (N,1) tensors

  OPTION B1 (base class, unconditional)
    No iterator wrapping.
    _normalize_loaded_weight() added to BasevLLMParameter.
    Called at top of every load method in _ColumnvLLMParameter.
    Guard is implicit: only fires when param.shape[1]==1.

  OPTION B2 (subclass override, flag-gated)
    No iterator wrapping.
    ChannelQuantScaleParameter overrides load methods.
    Reshape only when _allow_1d_scale_reshape flag is set.
    Flag set in create_weights from _modelopt_bridge.
    Matches SGLang PR #19101 approach.
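To make the flag-gated variant (Option B2) more concrete, here is a minimal sketch, assuming a stand-in parameter class that holds nested lists instead of tensors. `ChannelScaleSlot`, `shape_of`, and the `load` method are hypothetical names used only for illustration; they are not the vLLM or SGLang classes.

```python
def shape_of(value):
    """Shape of a rectangular nested list, e.g. [[1.0], [2.0]] -> (2, 1)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


class ChannelScaleSlot:
    """Stand-in for a per-channel scale parameter with an (N, 1) slot."""

    def __init__(self, num_channels, allow_1d_scale_reshape=False):
        self.shape = (num_channels, 1)
        # In the RFC this flag would be derived from _modelopt_bridge.
        self.allow_1d_scale_reshape = allow_1d_scale_reshape
        self.data = None

    def load(self, loaded):
        # Reshape (N,) -> (N, 1) only when the bridge flag allows it.
        if self.allow_1d_scale_reshape and shape_of(loaded) == (self.shape[0],):
            loaded = [[value] for value in loaded]
        # Same role as the shape assertion before the copy.
        if shape_of(loaded) != self.shape:
            raise ValueError(
                f"shape mismatch: slot {self.shape}, loaded {shape_of(loaded)}"
            )
        self.data = loaded


bridge_slot = ChannelScaleSlot(2, allow_1d_scale_reshape=True)
bridge_slot.load([0.5, 0.25])  # accepted: reshaped to (2, 1)
```

With the flag left at its default, the same 1D input raises, so checkpoints that do not come through the bridge keep the strict shape check.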

Phase 4 — Per-Tensor Load (all four paths)

  PATH A: ColumnParallelLinear → load_column_parallel_weight [parameter.py:148]
    narrow (N,1) → (shard,1) → assert (shard,1)==(shard,1) ✓

  PATH B: QKVParallelLinear → load_qkv_weight [parameter.py:178]
    narrow param slot → (1024,1) for Q
    narrow loaded   → (1024,1) for this tp_rank's Q
    assert ✓

  PATH C: MergedColumnParallelLinear → load_merged_column_weight [parameter.py:156]
    narrow → assert ✓

  PATH D: RowParallelLinear → weight_loader_v2 [linear.py:1502]  ← #### THIS MODIFICATION IS STILL NEEDED
    BEFORE: param.load_row_parallel_weight → _assert_and_load
              assert (N,1)==(N,)  CRASH
    AFTER:
      if isinstance(param, ChannelQuantScaleParameter):
          # Output dim NOT sharded in RowParallel — no narrow needed
          # Only fix needed: 1D→2D (Option A: already done; Options B1/B2: normalize here)
          loaded_weight = param._normalize_loaded_weight(loaded_weight)
          param.load_row_parallel_weight(loaded_weight)
          return
      param.load_row_parallel_weight(loaded_weight)  # all other params

Option Comparison

	Option A — Iterator	Option B1 — Base Class	Option B2 — Subclass (SGLang-style)
Normalization point	Before name matching	Inside every load method	Inside `ChannelQuantScaleParameter` only
Guard	`is_modelopt_bridge_checkpoint()`	`param.shape[1]==1` condition	`_allow_1d_scale_reshape` flag
Base class touched	No	Yes — adds method	No
Flag needed	No	No	Yes — propagated from `_modelopt_bridge`
Future schemes	Add rules to iterator	Auto-covered	Add override to relevant param class
Matches SGLang PR	Partially	No	Yes

One thing no option avoids: The linear.py:1502 fix for RowParallelLinear. The output dimension is not sharded in RowParallel, so no TP narrow is needed — but the 1D→2D reshape must be handled explicitly before _assert_and_load.
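The difference between the column-parallel and row-parallel cases can be sketched as follows, again with nested lists standing in for tensors. Both helper names are hypothetical; the sketch only mirrors the reasoning above (reshape always, narrow only when the output dimension is sharded) and is not the vLLM code path.

```python
def to_column_vector(scale):
    """Reshape a 1D per-channel scale (N,) to (N, 1); leave 2D input as is."""
    if scale and not isinstance(scale[0], list):
        return [[value] for value in scale]
    return scale


def prepare_channel_scale(loaded, tp_rank, tp_size, row_parallel):
    """Return the per-channel scale rows that this tensor-parallel rank loads."""
    loaded = to_column_vector(loaded)
    if row_parallel:
        # Output channels are not sharded in RowParallel: keep all N rows.
        return loaded
    # Column-parallel: each rank owns a contiguous slice of output channels.
    shard = len(loaded) // tp_size
    return loaded[tp_rank * shard:(tp_rank + 1) * shard]


scale_1d = [1.0, 2.0, 3.0, 4.0]
column_rows = prepare_channel_scale(scale_1d, tp_rank=1, tp_size=2, row_parallel=False)
row_rows = prepare_channel_scale(scale_1d, tp_rank=1, tp_size=2, row_parallel=True)
```

Here `column_rows` holds only the second half of the channels for rank 1, while `row_rows` keeps all four, each as a one-element row.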

Open Questions / Discussion Points

1. 1-way bridge (ModelOpt → CT only)

The current design proposal assumes that the bridge is one-way: whatever infrastructure we add will convert the ModelOpt config and checkpoint to match the CT-style handling in vLLM. How do we scope the other direction? Confirm it as an explicit non-goal.

2. Per-scheme translation overhead

Every new ModelOpt recipe needs its own bridge function. Maintenance scales linearly (~20-30 lines per scheme). Risk of drift if ModelOpt changes a recipe.

Alternatives:

Generic config translator (may miss edge cases)
Shared schema registry between ModelOpt and vLLM
Accept the cost (1-2 new recipes per year)
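One way to keep per-scheme translators from spreading across the codebase, whichever alternative is chosen, is a small registry keyed by scheme. The sketch below is an assumption about how that could look: the registry, the decorator, and the placeholder return values are all hypothetical, and the `quant_algo` strings are taken from the scheme names mentioned in this RFC.

```python
BRIDGE_TRANSLATORS = {}


def register_bridge(quant_algo):
    """Decorator that registers a translator for one ModelOpt quant_algo."""
    def decorator(fn):
        BRIDGE_TRANSLATORS[quant_algo] = fn
        return fn
    return decorator


@register_bridge("FP8_PER_CHANNEL_PER_TOKEN")
def _translate_fp8_pcpt(quant):
    # Placeholder summary; a real translator would build a full CT config.
    return {"weights": "channel", "input_activations": "token"}


@register_bridge("FP8_PB_WO")
def _translate_fp8_block(quant):
    # Placeholder summary for the block scheme.
    return {"weights": "block", "input_activations": None}


def translate_or_none(quant):
    """Return a CT-style summary for bridged schemes, else None (native path)."""
    algo = str(quant.get("quant_algo", "")).upper()
    translator = BRIDGE_TRANSLATORS.get(algo)
    return translator(quant) if translator else None


bridged = translate_or_none({"quant_algo": "fp8_pb_wo"})
native = translate_or_none({"quant_algo": "SOME_OTHER_ALGO"})
```

Adding a recipe then means adding one decorated function, and any scheme without a registered translator falls back to the existing native path.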

3. Scope: backward or forward-looking?

Backward: support existing ModelOpt checkpoints that do not have a vLLM path yet (the kernels may already be supported).
Forward: make the bridge the default path and deprecate the native methods? More generally, where should future development go: the ModelOpt path or the CT path?

4. Unified IR as longer-term direction

The bridge is a translator. A more ambitious design: a unified intermediate representation that all quant formats translate into, feeding one execution backend. Cleaner but multi-quarter refactor.

Feedback Period.

No response

CC List.

@Edwardf0t1 @pavanimajety @sychen52


extent analysis

TL;DR

To implement a bridge that routes ModelOpt FP8 checkpoints through vLLM's existing Compressed-Tensors infrastructure, focus on detecting the ModelOpt quantization scheme, translating the config to CT format, and reusing CT infrastructure.

Guidance

Identify the ModelOpt quantization scheme using the detect_modelopt_quantization_scheme function and route it to the appropriate config builder.
Implement a scalable way to define modelopt_config_to_compressed_tensors for each scheme, such as using a dictionary or a registry.
Handle tensor shape mismatches between ModelOpt and CT by implementing a weight loader transform, such as squeezing 4D scale tensors to 2D.
Consider implementing a generic config translator or a shared schema registry to reduce maintenance overhead.

Example

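# modelopt_scheme.py
from enum import Enum

class ModelOptQuantizationScheme(Enum):
    FP8_PER_CHANNEL_PER_TOKEN = 1
    FP8_BLOCK = 2

def detect_modelopt_quantization_scheme(config):
    """Return the matching scheme, or None if the config is not recognized."""
    # Assumption: quant_algo sits at the top level or under "quantization".
    # The quant_cfg / recipe check that the RFC runs first is omitted here.
    quant = config.get("quantization", config)
    algo = str(quant.get("quant_algo", "")).upper()
    if algo == "FP8_PER_CHANNEL_PER_TOKEN":
        return ModelOptQuantizationScheme.FP8_PER_CHANNEL_PER_TOKEN
    if algo == "FP8_PB_WO":  # ModelOpt-specific value for block FP8
        return ModelOptQuantizationScheme.FP8_BLOCK
    return None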

Notes

The proposed solution assumes a one-way bridge from ModelOpt to CT, so the reverse direction (CT to ModelOpt) should be confirmed as an explicit non-goal.
The maintenance overhead of adding new bridge functions for each ModelOpt recipe should be considered, and alternatives such as a generic config translator or a shared schema registry should be explored.

Recommendation

Apply a workaround by implementing a weight loader transform to handle tensor shape mismatches, and consider implementing a generic config translator or a shared schema registry to reduce maintenance overhead. This approach allows for a more scalable and maintainable solution.
