vllm: [RFC]: Unified ModelOpt Quantization in vLLM [1 comment, 2 participants]

vllm-project/vllm#40182 (fetched 2026-04-18 05:52:09)

Motivation.

Summary

This issue proposes implementing a bridge that routes ModelOpt FP8 checkpoints (FP8_PER_CHANNEL_PER_TOKEN and FP8_BLOCK) through vLLM's existing Compressed-Tensors (CT) infrastructure, rather than keeping them on an isolated native path. This mirrors what SGLang PR #19101 did.


Background

This issue aims to expand vLLM's ModelOpt quantization support from the current limited implementation to encompass the full suite of NVIDIA Model Optimizer quantization recipes. Rather than building separate implementations for each format, we propose a unified architecture that bridges ModelOpt checkpoints to vLLM's existing compressed-tensors (llm-compressor) infrastructure, enabling code reuse, kernel sharing, and consistent performance across quantization methods.

The bridge: detect at load time that a checkpoint is from ModelOpt, convert its config into CT format, and reuse all existing CT infrastructure.


Root Tensor and Config Mismatch

Changes are needed at two levels:

  1. Detect that the config comes from ModelOpt and translate it to CT style.
  2. The .safetensors files that ModelOpt produces may not align with the parameters created by the Compressed-Tensors path, so tensor shapes must be adjusted before they are copied into the params; otherwise the load hits assertion errors.

Proposed Change.

Implementation Details

Phase 1 - Config Building phase

1. Scheme Detection

modelopt_scheme.py (NEW): ModelOptQuantizationScheme enum and detect_modelopt_quantization_scheme(config).

  • Detection uses quant_cfg / recipe first, then falls back to quant_algo (including ModelOpt-specific values like fp8_pb_wo for block).
  • Schemes are tagged with uses_ct_bridge() and quant_method() for routing and CLI (--quantization modelopt_fp8).
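
The detection order described above can be sketched as follows. This is a minimal illustration, not the proposed modelopt_scheme.py: the NATIVE member, the lookup tables, and the assumption that `recipe` and `quant_algo` are plain strings are all hypothetical. Inspecting `quant_cfg` is left out because the RFC does not describe its structure, and only uses_ct_bridge() is shown.

```python
from enum import Enum
from typing import Any, Mapping


class ModelOptQuantizationScheme(Enum):
    FP8_PER_CHANNEL_PER_TOKEN = "fp8_per_channel_per_token"
    FP8_BLOCK = "fp8_block"
    # Hypothetical catch-all: anything else stays on the existing native path.
    NATIVE = "native"

    def uses_ct_bridge(self) -> bool:
        return self is not ModelOptQuantizationScheme.NATIVE


# Assumed spellings; matched case-insensitively.
_RECIPE_TO_SCHEME = {
    "fp8_per_channel_per_token": ModelOptQuantizationScheme.FP8_PER_CHANNEL_PER_TOKEN,
    "fp8_block": ModelOptQuantizationScheme.FP8_BLOCK,
}
_ALGO_TO_SCHEME = {
    "fp8_per_channel_per_token": ModelOptQuantizationScheme.FP8_PER_CHANNEL_PER_TOKEN,
    "fp8_pb_wo": ModelOptQuantizationScheme.FP8_BLOCK,  # ModelOpt name for block
}


def detect_modelopt_quantization_scheme(
    config: Mapping[str, Any],
) -> ModelOptQuantizationScheme:
    """Check the recipe first, then fall back to quant_algo."""
    recipe = config.get("recipe")
    if isinstance(recipe, str):
        scheme = _RECIPE_TO_SCHEME.get(recipe.lower())
        if scheme is not None:
            return scheme
    algo = config.get("quant_algo")
    if isinstance(algo, str):
        return _ALGO_TO_SCHEME.get(algo.lower(), ModelOptQuantizationScheme.NATIVE)
    return ModelOptQuantizationScheme.NATIVE
```

Returning a catch-all member rather than None keeps the routing code free of None checks; whether the real enum would do this is an open design choice.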

2. Config Routing

weight_utils.get_quant_config() (HF and file-based paths): after resolving the quant config dict, run scheme detection. If the scheme is bridge-supported, route it to modelopt_config_to_compressed_tensors; otherwise create the ModelOpt config the existing way.

model_config._parse_modelopt_quant_config() uses the same scheme detection so quant_method is set correctly for bridge vs native.


3. Bridge Config Builders

modelopt_ct_bridge.py (NEW):

We might have to define modelopt_config_to_compressed_tensors for every scheme we wish to route, so this needs to be done in a scalable manner.

For instance, the following function in SGLang's PR converts only FP8 per-channel per-token (FP8PCPT):

  • modelopt_config_to_compressed_tensors_config() — builds a CompressedTensorsConfig for per-channel per-token FP8 (W8A8, channel strategy, dynamic per-token activations).

CT configs can set _modelopt_bridge=True where needed (e.g. for KV scale handling and 1D→2D scale reshape only when loading ModelOpt checkpoints).
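
One way to keep per-scheme builders scalable is a small registry keyed by scheme name, with a fallback to the native path when no builder exists. The sketch below is an assumption-laden illustration: register_bridge_builder, the string scheme keys, and the plain-dict output are hypothetical stand-ins, and the dict is not the real CompressedTensorsConfig layout. Only the W8A8 / channel / dynamic per-token settings come from the description above.

```python
from typing import Any, Callable, Dict, Mapping, Optional

ConfigDict = Dict[str, Any]
Builder = Callable[[Mapping[str, Any]], ConfigDict]

# Hypothetical registry: scheme name -> function that builds a CT-style config.
_BRIDGE_BUILDERS: Dict[str, Builder] = {}


def register_bridge_builder(scheme: str) -> Callable[[Builder], Builder]:
    """Decorator that registers one builder per scheme name."""
    def decorator(fn: Builder) -> Builder:
        if scheme in _BRIDGE_BUILDERS:
            raise ValueError(f"builder already registered for {scheme!r}")
        _BRIDGE_BUILDERS[scheme] = fn
        return fn
    return decorator


@register_bridge_builder("fp8_per_channel_per_token")
def _build_fp8_pcpt(modelopt_cfg: Mapping[str, Any]) -> ConfigDict:
    # W8A8: per-channel static weights, dynamic per-token activations.
    return {
        "weights": {"num_bits": 8, "type": "float",
                    "strategy": "channel", "dynamic": False},
        "input_activations": {"num_bits": 8, "type": "float",
                              "strategy": "token", "dynamic": True},
        "_modelopt_bridge": True,  # marks the config as bridge-created
    }


def modelopt_config_to_compressed_tensors(
    scheme: str, modelopt_cfg: Mapping[str, Any]
) -> Optional[ConfigDict]:
    """Return a CT-style config, or None so the caller keeps the native path."""
    builder = _BRIDGE_BUILDERS.get(scheme)
    return builder(modelopt_cfg) if builder is not None else None
```

With this shape, adding a scheme means adding one decorated function, and the routing code in get_quant_config() does not change.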


Phase 2 (Model Build) happens once at startup, for the empty model skeleton. No weights yet — just allocating parameter slots and assigning a linear method object to each layer that knows how to forward.


Phase 3 (Weight Loading) happens after the skeleton is built. It reads the checkpoint file tensor-by-tensor and copies values into the parameter slots allocated in Phase 2. This is where we need to make changes since the .safetensors are coming from ModelOpt while the param slots are created by CT.

Checkpoint Format Adaptation (ModelOpt vs CT)

Scales: ModelOpt sometimes uses different shapes/dtypes than CT:

  • 4D block scales (32, 1, 112, 1) vs 2D (32, 112)
  • Scalar k_scale / v_scale not present in CT block path

Weight loader transform (for FP8 block): when loading a ModelOpt FP8 block checkpoint:

  • Remap keys: *.weight_scale → *.weight_scale_inv
  • Squeeze 4D scale tensors to 2D
  • Skip k_scale / v_scale for the block path
  • BlockQuantScaleParameter also squeezes 4D→2D on load so any load path is robust
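
The block-path rules above can be expressed on names and shapes alone, which makes them easy to test without a tensor library. This is a hedged sketch: plan_block_scale_load is a hypothetical helper, and the (rows, 1, cols, 1) layout is assumed from the single (32, 1, 112, 1) example. A real loader would apply the returned shape with the tensor library's reshape.

```python
from typing import Optional, Tuple

Shape = Tuple[int, ...]


def plan_block_scale_load(name: str, shape: Shape) -> Optional[Tuple[str, Shape]]:
    """Return (new_name, new_shape) for one checkpoint entry, or None to skip it."""
    # Scalar KV-cache scales have no slot on the CT block path.
    if name.endswith((".k_scale", ".v_scale")):
        return None
    if name.endswith(".weight_scale"):
        # CT's block path expects the key *.weight_scale_inv.
        name = name + "_inv"
        if len(shape) == 4:
            rows, one_a, cols, one_b = shape
            if one_a != 1 or one_b != 1:
                raise ValueError(f"unexpected 4D scale layout: {shape}")
            shape = (rows, cols)  # squeeze the singleton dims: 4D -> 2D
    return name, shape
```

For example, `plan_block_scale_load("layers.0.mlp.down_proj.weight_scale", (32, 1, 112, 1))` yields the renamed key with shape `(32, 112)`, while ordinary weights pass through unchanged.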

We may have to apply different transformations for different schemes, depending on what ModelOpt produces and what CT expects, so that the weights finally load into the created graph structure.

  • The underlying question is how to make this step generic enough yet customizable, without scattering changes everywhere. This is one observation I drew from the corresponding SGLang PR.
  • The core reason for these transforms is that when the loaded weights are copied into the params, there can be a shape mismatch. There is also no single path to this weight-copying step.
  • We could reach it through ColumnParallelLinear / RowParallelLinear, but the source is the same and each param inherits from the BaseParam class. The options are:
      - [Option A] At the source level, when iterating weights (the param shape is not known here, so this relies on a name-based heuristic).
      - [Option B] At the param level, where we have both the desired param shape and the loaded weight shape.
          - [Option B1] Do it in the BaseParam class and call it at the top of every param load method.
          - [Option B2] Do it as needed in each param (how the current SGLang PR implements it).
  • Ideally, only what is really needed, such as the RowParallelLinear transforms, should be pushed past this step. Otherwise we should avoid changes where the kernel call happens or at the kernel level, since the expectation is that these mismatches are already fixed by the time we reach that point.

Proposed Call Flow

Phase 3 — Weight Loading (Options diverge here)

         ▼ DefaultModelLoader.load_weights(model, model_config)

  ┌──── OPTION A (iterator) ────
  │ if is_modelopt_bridge_checkpoint(model_config):
  │     weights_gen = normalize_modelopt_bridge_weights_iterator(
  │                       weights_gen, model_config)
  │     For each (name, tensor):
  │         skip .k_scale / .v_scale  (no CT parameter slot)
  │         block only: rename .weight_scale → .weight_scale_inv
  │         if tensor.dim()==4:  squeeze → 2D
  │         if tensor.dim()==1 and name.endswith("weight_scale"):
  │             tensor = tensor.unsqueeze(-1)   # (N,) → (N,1)
  │         if tensor dtype != float32:  cast
  │         yield name, tensor
  │ All downstream load methods receive clean (N,1) tensors
  └────

  ┌──── OPTION B1 (base class, unconditional) ────
  │ No iterator wrapping.
  │ _normalize_loaded_weight() added to BasevLLMParameter.
  │ Called at top of every load method in _ColumnvLLMParameter.
  │ Guard is implicit: only fires when param.shape[1]==1.
  └────

  ┌──── OPTION B2 (subclass override, flag-gated) ────
  │ No iterator wrapping.
  │ ChannelQuantScaleParameter overrides load methods.
  │ Reshape only when _allow_1d_scale_reshape flag is set.
  │ Flag set in create_weights from _modelopt_bridge.
  │ Matches SGLang PR #19101 approach.
  └────

Phase 4 — Per-Tensor Load (all four paths)

  PATH A: ColumnParallelLinear → load_column_parallel_weight [parameter.py:148]
    narrow (N,1) → (shard,1) → assert (shard,1)==(shard,1) ✓

  PATH B: QKVParallelLinear → load_qkv_weight [parameter.py:178]
    narrow param slot → (1024,1) for Q
    narrow loaded   → (1024,1) for this tp_rank's Q
    assert ✓

  PATH C: MergedColumnParallelLinear → load_merged_column_weight [parameter.py:156]
    narrow → assert ✓

  PATH D: RowParallelLinear → weight_loader_v2 [linear.py:1502]  ← THIS MODIFICATION IS STILL NEEDED
    BEFORE: param.load_row_parallel_weight → _assert_and_load
              assert (N,1)==(N,) → CRASH
    AFTER:
      if isinstance(param, ChannelQuantScaleParameter):
          # Output dim NOT sharded in RowParallel — no narrow needed
          # Only fix needed: 1D→2D (Option A: already done; Options B1/B2: normalize here)
          loaded_weight = param._normalize_loaded_weight(loaded_weight)
          param.load_row_parallel_weight(loaded_weight)
          return
      param.load_row_parallel_weight(loaded_weight)  # all other params

Option Comparison

| | Option A (iterator) | Option B1 (base class) | Option B2 (subclass, SGLang-style) |
|---|---|---|---|
| Normalization point | Before name matching | Inside every load method | Inside ChannelQuantScaleParameter only |
| Guard | is_modelopt_bridge_checkpoint() | param.shape[1]==1 condition | _allow_1d_scale_reshape flag |
| Base class touched | No | Yes (adds method) | No |
| Flag needed | No | No | Yes (propagated from _modelopt_bridge) |
| Future schemes | Add rules to iterator | Auto-covered | Add override to relevant param class |
| Matches SGLang PR | Partially | No | Yes |

One thing no option avoids: The linear.py:1502 fix for RowParallelLinear. The output dimension is not sharded in RowParallel, so no TP narrow is needed — but the 1D→2D reshape must be handled explicitly before _assert_and_load.
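
The RowParallelLinear mismatch can be reproduced on shapes alone. The sketch below assumes a B1-style implicit guard; normalize_loaded_shape and assert_and_load_shape are hypothetical stand-ins for _normalize_loaded_weight and _assert_and_load, not the vLLM implementations.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def normalize_loaded_shape(param_shape: Shape, loaded_shape: Shape) -> Shape:
    """Promote a 1D scale (N,) to (N, 1) when the param slot is a 2D column."""
    is_column_slot = len(param_shape) == 2 and param_shape[1] == 1
    if is_column_slot and len(loaded_shape) == 1:
        return (loaded_shape[0], 1)
    return loaded_shape  # every other param is left untouched


def assert_and_load_shape(param_shape: Shape, loaded_shape: Shape) -> None:
    """Stand-in for the shape check that runs before the copy."""
    assert param_shape == loaded_shape, (
        f"shape mismatch: param {param_shape} vs loaded {loaded_shape}"
    )


param_shape, loaded_shape = (4096, 1), (4096,)
# Without normalization the check fails: (4096, 1) != (4096,).
fixed = normalize_loaded_shape(param_shape, loaded_shape)
assert_and_load_shape(param_shape, fixed)  # passes after the 1D -> 2D reshape
```

Because the output dimension is not sharded in RowParallel, the normalized shape already equals the param shape and no narrowing step is involved.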


Open Questions / Discussion Points

1. 1-way bridge (ModelOpt → CT only)

The current proposal assumes the bridge is one-way: whatever infrastructure we add converts the ModelOpt config and checkpoint to match CT-style handling in vLLM. How do we scope the other direction (CT → ModelOpt)? We should confirm it as an explicit non-goal.

2. Per-scheme translation overhead

Every new ModelOpt recipe needs its own bridge function. Maintenance scales linearly (~20-30 lines per scheme). Risk of drift if ModelOpt changes a recipe.

Alternatives:

  • Generic config translator (may miss edge cases)
  • Shared schema registry between ModelOpt and vLLM
  • Accept the cost (1-2 new recipes per year)

3. Scope: backward or forward-looking?

  • Backward: support existing ModelOpt checkpoints that do not have a vLLM path yet (though the kernels are perhaps already supported).
  • Forward: make the bridge the default path and deprecate native methods? Or, more broadly, where should development go: the ModelOpt path or the CT path?

4. Unified IR as longer-term direction

The bridge is a translator. A more ambitious design: a unified intermediate representation that all quant formats translate into, feeding one execution backend. Cleaner but multi-quarter refactor.

Feedback Period.

No response

CC List.

@Edwardf0t1 @pavanimajety @sychen52

Any Other Things.

No response


extent analysis

TL;DR

To implement a bridge that routes ModelOpt FP8 checkpoints through vLLM's existing Compressed-Tensors infrastructure, focus on detecting the ModelOpt quantization scheme, translating the config to CT format, and reusing CT infrastructure.

Guidance

  • Identify the ModelOpt quantization scheme using detect_modelopt_quantization_scheme function and route it to the appropriate config builder.
  • Implement a scalable way to define modelopt_config_to_compressed_tensors for each scheme, such as using a dictionary or a registry.
  • Handle tensor shape mismatches between ModelOpt and CT by implementing a weight loader transform, such as squeezing 4D scale tensors to 2D.
  • Consider implementing a generic config translator or a shared schema registry to reduce maintenance overhead.

Example

# modelopt_scheme.py
from enum import Enum

class ModelOptQuantizationScheme(Enum):
    FP8_PER_CHANNEL_PER_TOKEN = 1
    FP8_BLOCK = 2

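def detect_modelopt_quantization_scheme(config):
    # Per the RFC: check quant_cfg / recipe first, then fall back to quant_algo
    # (including ModelOpt-specific values such as fp8_pb_wo for block).
    raise NotImplementedError("detection logic is left to the implementation")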

Notes

  • The proposed solution assumes a one-way bridge from ModelOpt to CT; the reverse direction (CT to ModelOpt) should be confirmed as an explicit non-goal.
  • The maintenance overhead of adding new bridge functions for each ModelOpt recipe should be considered, and alternatives such as a generic config translator or a shared schema registry should be explored.

Recommendation

Apply a workaround by implementing a weight loader transform to handle tensor shape mismatches, and consider implementing a generic config translator or a shared schema registry to reduce maintenance overhead. This approach allows for a more scalable and maintainable solution.
