vllm: [Feature]: granitemoe loader should accept FP8_DYNAMIC expert weight_scale tensors


🚀 The feature, motivation and pitch

Summary

The GraniteMoeForCausalLM weight loader in vllm/model_executor/models/granitemoe.py does not route .block_sparse_moe.input_linear.weight_scale or .block_sparse_moe.output_linear.weight_scale tensors into FusedMoE expert slots. Its expert branches in load_weights only match on .weight; tensors ending in .weight_scale fall through to a generic pass-through that has no mapping for them, so loading raises KeyError on params_dict[name] in _load_weights once it tries to look up the unrecognized name. As a result, FP8-quantized Granite MoE (non-hybrid) checkpoints cannot be served.

This issue is scoped to the FP8_DYNAMIC scheme (compressed-tensors weights.strategy="channel", per-output-row weight scales; dynamic per-token activation scales computed at runtime; no input_scale tensor on disk). Producer-side tooling (e.g. llm-compressor's GraniteMoeHybridParallelExpertsLinear helper, adapted for the plain GraniteMoeParallelExperts module) emits fused expert weight scales with shape [num_experts, 2*intermediate_size, 1] on input_linear and [num_experts, hidden_size, 1] on output_linear, which matches exactly what the existing CompressedTensorsW8A8Fp8MoEMethod channel-strategy branch expects for w13_weight_scale / w2_weight_scale. FP8_BLOCK (2D block-shaped weight scales) is explicitly out of scope here; see "Follow-ups".
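To make the scale shapes above concrete, here is a minimal pure-Python sketch of per-output-row scale computation on a fused expert tensor. It is an illustration only, not llm-compressor's implementation: nested lists stand in for tensors, the helper name per_row_scales is hypothetical, and the symmetric amax / 448 rule (448 being the largest finite FP8 E4M3 value) is an assumption about how the scale is derived.

```python
# Largest finite magnitude representable in FP8 E4M3.
FP8_E4M3_MAX = 448.0


def per_row_scales(fused_weight):
    """Hypothetical helper: one scale per output row of each expert.

    fused_weight is a nested list shaped [num_experts][rows][cols];
    the result is shaped [num_experts][rows][1], matching the trailing
    singleton dimension described in the issue.
    """
    scales = []
    for expert in fused_weight:
        expert_scales = []
        for row in expert:
            amax = max(abs(v) for v in row)
            # Assumed symmetric rule; real tooling may clamp tiny values.
            expert_scales.append([amax / FP8_E4M3_MAX])
        scales.append(expert_scales)
    return scales


# One expert, two output rows, two input columns.
fused = [[[448.0, -2.0], [0.5, -896.0]]]
scales = per_row_scales(fused)
```

For input_linear the row count is 2*intermediate_size and for output_linear it is hidden_size, which is where the two shapes quoted above come from.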

Motivation

With a producer-side helper that reshapes the 3D fused expert tensor to 2D for quantization and then restores 3D on save (analogous to llm-compressor's existing GraniteMoeHybridParallelExpertsLinear for the hybrid architecture), compressed-tensors emits input_linear.weight_scale and output_linear.weight_scale with the correct 3D shapes listed above. The sibling loader for the hybrid variant (granitemoehybrid.py) already handles both .weight and .weight_scale on the fused-expert path. The plain granitemoe.py loader is the only piece missing. Without this, FP8 quantization of Granite MoE models (e.g. the G5 family: 20B / 120B / 230B) has no working serving path in vLLM — any attempt to load such a checkpoint will fail at the expert-scale load step.

Current behavior

vllm/model_executor/models/granitemoe.py load_weights():

if n.endswith(".block_sparse_moe.input_linear.weight"):
    for e in range(p.size(0)):
        w1_name = n.replace(".block_sparse_moe.input_linear.weight",
                            f".block_sparse_moe.experts.{e}.w1.weight")
        ...
        w1_param, w3_param = p[e].chunk(2, dim=0)
        new_weights[w1_name] = w1_param
        new_weights[w3_name] = w3_param
elif n.endswith(".block_sparse_moe.output_linear.weight"):
    ...

Only .weight matches on the expert branches. A tensor ending in .weight_scale falls into the else branch at line 481 (new_weights[n] = p), is handed off to _load_weights, and there:

  1. stacked_params_mapping (for q_proj/k_proj/v_proj) doesn't match.
  2. expert_params_mapping (from fused_moe_make_expert_params_mapping with ckpt_{gate,down,up}_proj_name = "w1"/"w2"/"w3") searches for f"experts.{id}.w1." / w2. / w3. substrings. The unprocessed name ...input_linear.weight_scale contains none of these, so no mapping matches.
  3. The fallthrough calls maybe_remap_kv_scale_name(name, params_dict), which only handles .k_scale / .v_scale / .q_scale / .kv_scale suffixes; for .weight_scale it returns the name unchanged.
  4. params_dict[name] then raises KeyError because model.layers.N.block_sparse_moe.input_linear.weight_scale is not a registered parameter.

So the current behavior on an FP8-with-expert-scales checkpoint is a hard KeyError at load time, not a silent skip.
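The four steps above can be reproduced with a small stand-in for the dispatcher. This is a simplified model written for this issue, not the vLLM source: simplified_dispatch and the contents of params_dict are hypothetical, and the kv-scale step is reduced to a suffix check.

```python
KV_SCALE_SUFFIXES = (".k_scale", ".v_scale", ".q_scale", ".kv_scale")


def simplified_dispatch(name, params_dict, num_experts):
    """Hypothetical stand-in for the lookup order described above."""
    # Step 1: stacked attention projections.
    if any(proj in name for proj in ("q_proj", "k_proj", "v_proj")):
        return "stacked"
    # Step 2: per-expert names containing experts.{id}.w1. / w2. / w3.
    for expert_id in range(num_experts):
        for ckpt_name in ("w1", "w2", "w3"):
            if f"experts.{expert_id}.{ckpt_name}." in name:
                return "expert"
    # Step 3: only kv-style scale suffixes are remapped.
    if name.endswith(KV_SCALE_SUFFIXES):
        return "kv_scale"
    # Step 4: direct lookup; unknown names raise KeyError here.
    return params_dict[name]


# Illustrative registered parameters (names assumed for the sketch).
params_dict = {
    "model.layers.0.block_sparse_moe.experts.w13_weight_scale": object(),
    "model.layers.0.block_sparse_moe.experts.w2_weight_scale": object(),
}

fused_name = "model.layers.0.block_sparse_moe.input_linear.weight_scale"
try:
    simplified_dispatch(fused_name, params_dict, num_experts=2)
    outcome = "loaded"
except KeyError as exc:
    outcome = f"KeyError: {exc}"
```

The fused name fails at the final lookup, while a per-expert name such as ...experts.1.w3.weight_scale would be caught by the expert-mapping step.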

The fix pattern is straightforward: extend the expert branches in load_weights to also match .weight_scale, split the 3D tensor per expert the same way the weight is split, and rename to per-expert per-shard names (e.g. ...experts.{e}.w1.weight_scale). Those rewritten names will then be matched by expert_params_mapping in _load_weights via the w1. / w2. / w3. substring search, which routes them into the correct w13_weight_scale / w2_weight_scale FusedMoE slots without any further changes to _load_weights.

Proposed behavior

Extend the two fused-expert branches in granitemoe.py::GraniteMoeModel.load_weights to match both .weight and .weight_scale:

  • input_linear: change if n.endswith(".block_sparse_moe.input_linear.weight"): to also accept .weight_scale. For each expert e, split p[e] with chunk(2, dim=0) and store the two halves under ...experts.{e}.w1.weight_scale / ...experts.{e}.w3.weight_scale in new_weights. The existing n.replace(".block_sparse_moe.input_linear.weight", f".block_sparse_moe.experts.{e}.w1.weight") call already works for both suffixes: the replaced substring ends at .weight, so on a .weight_scale name the trailing _scale is left in place and the result carries the correct suffix. The hybrid loader already relies on the same subtlety.
  • output_linear: same pattern, for ...experts.{e}.w2.weight_scale.
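The two bullets above can be sketched as follows. This is a minimal illustration, not the proposed patch: nested lists stand in for tensors, list slicing into halves stands in for chunk(2, dim=0), and rewrite_fused_experts is a hypothetical name. The w3 and w2 target names follow the naming given in the bullets.

```python
INPUT_WEIGHT = ".block_sparse_moe.input_linear.weight"
OUTPUT_WEIGHT = ".block_sparse_moe.output_linear.weight"


def rewrite_fused_experts(weights):
    """Hypothetical sketch: split fused expert tensors into per-expert names.

    weights maps checkpoint names to nested lists whose first dimension
    is the expert index.
    """
    new_weights = {}
    for n, p in weights.items():
        if n.endswith((INPUT_WEIGHT, INPUT_WEIGHT + "_scale")):
            for e in range(len(p)):
                # Replacing the ".weight" stem keeps any "_scale" tail.
                w1_name = n.replace(
                    INPUT_WEIGHT, f".block_sparse_moe.experts.{e}.w1.weight")
                w3_name = n.replace(
                    INPUT_WEIGHT, f".block_sparse_moe.experts.{e}.w3.weight")
                half = len(p[e]) // 2  # stand-in for chunk(2, dim=0)
                new_weights[w1_name] = p[e][:half]
                new_weights[w3_name] = p[e][half:]
        elif n.endswith((OUTPUT_WEIGHT, OUTPUT_WEIGHT + "_scale")):
            for e in range(len(p)):
                w2_name = n.replace(
                    OUTPUT_WEIGHT, f".block_sparse_moe.experts.{e}.w2.weight")
                new_weights[w2_name] = p[e]
        else:
            new_weights[n] = p
    return new_weights


weights = {
    "model.layers.0.block_sparse_moe.input_linear.weight_scale": [
        [[0.1], [0.2], [0.3], [0.4]],
        [[0.5], [0.6], [0.7], [0.8]],
    ],
    "model.layers.0.block_sparse_moe.output_linear.weight_scale": [
        [[1.0], [2.0]],
        [[3.0], [4.0]],
    ],
    "model.norm.weight": [1.0],
}
rewritten = rewrite_fused_experts(weights)
```

Each rewritten key now contains an experts.{e}.w1. / w2. / w3. substring, which is the form the existing dispatcher searches for.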

Downstream, _load_weights's existing expert_params_mapping dispatcher matches on .w1. / .w2. / .w3. substrings and routes the rewritten names into w13_weight_scale / w2_weight_scale FusedMoE slots via the per-scheme method class (CompressedTensorsW8A8Fp8MoEMethod). No changes are needed in _load_weights itself or in FusedMoE.

This differs slightly from the hybrid loader's structure — hybrid's fused-expert branch calls _load_expert directly with manually-constructed experts.w13_weight_scale names, while the plain loader's two-stage structure (rewrite into new_weights, then dispatch in _load_weights) hands the per-expert per-shard names through the existing Mixtral-style dispatcher. Both end up in the same FusedMoE slot; the plain loader's fix is a minimal extension of its existing branches.

Acceptance

  • A compressed-tensors checkpoint for a GraniteMoeForCausalLM model with quantization_config.format = "float-quantized", 8-bit float weights (weights.num_bits=8, weights.type="float", per-channel weights.strategy="channel"), and dynamic per-token activation scales (input_activations.dynamic=true), whose experts carry 3D input_linear.weight_scale of shape [num_experts, 2*intermediate_size, 1] and output_linear.weight_scale of shape [num_experts, hidden_size, 1], loads cleanly in vLLM.
  • Output token quality matches a reference path (e.g. the same model served with attention-only FP8 quantization, or the BF16 baseline ± expected FP8 drift).
  • No warnings about unused weights or missing scale parameters.
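A pre-flight check for the first acceptance criterion might look like the sketch below. Both helper names are hypothetical, and the caller is assumed to have already extracted the weights and input_activations sections from quantization_config and the tensor shapes from the checkpoint; how those are located in a real checkpoint is not covered here.

```python
def is_fp8_dynamic(weights_cfg, input_activations_cfg):
    """Hypothetical check of the scheme fields named in the acceptance list."""
    return (
        weights_cfg.get("num_bits") == 8
        and weights_cfg.get("type") == "float"
        and weights_cfg.get("strategy") == "channel"
        and input_activations_cfg.get("dynamic") is True
    )


def scale_shape_errors(shapes, num_experts, hidden_size, intermediate_size):
    """Return a list of mismatches between actual and expected scale shapes."""
    expected = {
        "input_linear.weight_scale": (num_experts, 2 * intermediate_size, 1),
        "output_linear.weight_scale": (num_experts, hidden_size, 1),
    }
    errors = []
    for key, want in expected.items():
        got = shapes.get(key)
        if got != want:
            errors.append(f"{key}: expected {want}, got {got}")
    return errors


# Toy dimensions chosen only for illustration.
scheme_ok = is_fp8_dynamic(
    {"num_bits": 8, "type": "float", "strategy": "channel"},
    {"dynamic": True},
)
shape_errors = scale_shape_errors(
    {
        "input_linear.weight_scale": (4, 16, 1),
        "output_linear.weight_scale": (4, 6, 1),
    },
    num_experts=4,
    hidden_size=6,
    intermediate_size=8,
)
```

An empty shape_errors list together with scheme_ok being true indicates the checkpoint is of the kind this issue targets; it says nothing about output quality, which the second criterion covers.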

Scope / non-goals

  • In scope: FP8_DYNAMIC only — per-output-row weight scale tensors sliced per expert the same way the weight is (p[e].chunk(2, dim=0) for input_linear, p[e] for output_linear). FP8_DYNAMIC has no on-disk input_scale; activation scales are computed per-token at runtime, so no activation-scale plumbing is needed.
  • Out of scope (tracked as follow-ups):
    • FP8_BLOCK — 2D block-shaped weight scales (after producer-side 2D flatten, or an equivalent 3D layout). The hybrid loader's per-row chunk(2, dim=0) slicing does not apply cleanly to block-scale shapes; this needs its own design and will be filed as a separate issue (or rolled into Issue 2 during implementation, since block scales are closer in shape to FP4 sidecar tensors than to per-channel FP8 scales).
    • FP4 / W4A16-packed sidecars (weight_packed, weight_global_scale, weight_shape, weight_zero_point, weight_g_idx) — tracked in Issues 2 and 3.
    • Static FP8 activation quantization (FP8_STATIC with an on-disk input_scale) — would require a separate branch that does not exist in the hybrid loader today either.

Alternatives

No response

Additional context

Affected files

  • vllm/model_executor/models/granitemoe.py

Reference

  • Mirror implementation already in vllm/model_executor/models/granitemoehybrid.py load_weights().

Follow-ups

  • FP8_BLOCK expert support for Granite MoE loaders. To be filed once this issue lands.

