vllm - [Bug] FusedMoE `_load_per_tensor_weight_scale` rejects shape `(1,)` per-tensor scales

Summary

FusedMoE._load_per_tensor_weight_scale at vllm/model_executor/layers/fused_moe/layer.py:573 does param_data[expert_id][idx] = loaded_weight to install per-expert, per-(w1|w3)-shard scalar scales. When the on-disk per-tensor scale is stored as a length-1 tensor (shape (1,)) rather than a 0-D scalar (shape ()), the assignment fails:

RuntimeError: output with shape [] doesn't match the broadcast shape [1]

This trips during weight load before the engine reaches forward — server init fails.

Trigger

compressed-tensors quantization schemes that produce per-tensor weight_global_scale / input_global_scale tensors emit them as shape (1,) by default (the torch.tensor([x]) form rather than torch.tensor(x)). On NVFP4 MoE artifacts produced by llm-compressor with the standard NVFP4 preset, every routed expert has shape-(1,) *_global_scale entries. For our DeepSeek-V4-Flash artifact, 256 routed experts × 43 main layers × 6 scale tensors each (w1/w2/w3 × {weight_global_scale, input_global_scale}) comes to 66,048 (~66,000) affected tensors.
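As a sanity check on the ~66,000 figure, here is a minimal sketch that multiplies out the counts quoted above. It assumes each routed expert carries both global scales for each of w1, w2, and w3; the constant names are illustrative only.

```python
# Counts quoted in the text; the per-expert breakdown is an assumption.
ROUTED_EXPERTS = 256
MAIN_LAYERS = 43
PROJECTIONS = ("w1", "w2", "w3")
SCALE_KINDS = ("weight_global_scale", "input_global_scale")

# One global scale of each kind per projection.
scales_per_expert = len(PROJECTIONS) * len(SCALE_KINDS)
affected = ROUTED_EXPERTS * MAIN_LAYERS * scales_per_expert

print(scales_per_expert, affected)
```

Under that assumption the product lands within rounding distance of the ~66K tensors the workaround script reports touching.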

Reproducer

Load any compressed-tensors NVFP4 MoE artifact whose *_global_scale tensors are saved as shape (1,). The snippet below simulates the assignment at the loader site in isolation:

import torch
import torch.nn as nn

# Simulates the loader site
num_experts, num_shards = 4, 2
param_data = nn.Parameter(
    torch.empty(num_experts, num_shards, dtype=torch.float32),
    requires_grad=False,
).data

loaded_weight = torch.tensor([0.5])  # shape (1,), llm-compressor default
expert_id, idx = 0, 0

param_data[expert_id][idx] = loaded_weight  # RuntimeError

RuntimeError: output with shape [] doesn't match the broadcast shape [1]
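The shape difference behind the error can be isolated in a few lines. This sketch uses Tensor.copy_ instead of indexed assignment, because indexed assignment may normalize leading size-1 dimensions of the assigned value in some PyTorch versions, whereas copy_ only broadcasts the source to the destination shape. The variable names are illustrative and not taken from vLLM.

```python
import torch

as_vector = torch.tensor([0.5])  # shape (1,): length-1 tensor
as_scalar = torch.tensor(0.5)    # shape (): 0-D scalar
print(tuple(as_vector.shape), tuple(as_scalar.shape))

# A 0-D view into one (expert, shard) slot of a 2-D scale buffer.
slot = torch.zeros(4, 2)[0][0]

slot.copy_(as_scalar)  # shapes match, so this succeeds

message = None
try:
    slot.copy_(as_vector)  # a (1,) source cannot broadcast into a () destination
except RuntimeError as exc:
    message = str(exc)
print(message)
```

If the reproducer above does not fail on a given PyTorch build, this copy_ form is a useful cross-check that the shapes themselves are the problem.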

Proposed fix

The simplest fix is to coerce loaded_weight to a 0-D scalar at the loader site. Any of the following three options works:

# Option A: squeeze all singleton dims before assignment
param_data[expert_id][idx] = loaded_weight.squeeze()

# Option B: explicit shape-coerce to 0-D via view
param_data[expert_id][idx] = loaded_weight.view([])

# Option C: use copy_ on the scalar slot
param_data[expert_id][idx].copy_(loaded_weight.reshape(()))

Option A is the most permissive (accepts both (1,) and (1, 1) etc. as the broadcast-equivalent of a scalar) and was what we used artifact-side as a workaround.
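A stricter variant of the same idea is sketched below, assuming it is acceptable to reject anything that is not a single element. `coerce_per_tensor_scale` is a hypothetical helper, not an existing vLLM function. Unlike a bare squeeze(), it raises if a multi-element tensor reaches the per-tensor path instead of passing it through unchanged.

```python
import torch


def coerce_per_tensor_scale(loaded_weight: torch.Tensor) -> torch.Tensor:
    """Return a 0-D tensor for any single-element scale (hypothetical helper)."""
    if loaded_weight.numel() != 1:
        raise ValueError(
            f"expected a single-element per-tensor scale, got shape "
            f"{tuple(loaded_weight.shape)}"
        )
    # reshape(()) accepts (), (1,), (1, 1), ... as long as numel() == 1.
    return loaded_weight.reshape(())


# Stand-in for the (num_experts, num_shards) scale buffer at the loader site.
param_data = torch.zeros(4, 2)
expert_id, idx = 0, 0

for raw in (torch.tensor(0.5), torch.tensor([0.5]), torch.tensor([[0.5]])):
    param_data[expert_id][idx].copy_(coerce_per_tensor_scale(raw))

print(param_data[expert_id][idx])
```

The explicit numel() check trades some permissiveness for an early, readable error when a checkpoint holds something other than a per-tensor scale.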

The _load_per_channel_weight_scale path at line 632 may have a similar issue depending on broadcast semantics — worth a sweep.

Workaround we use today

Artifact-side: scripts/squeeze_global_scales.py walks all weight_global_scale / input_global_scale tensors in the safetensors index and squeezes shape (1,) → (). Atomic .tmp + rename per shard. ~66K tensors touched for V4-Flash; runs in ~5 min on EXT4. The script is at https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/scripts/squeeze_global_scales.py.
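The atomic .tmp + rename step can be sketched with the standard library alone. This is not the linked script: `rewrite_atomically` and its `transform` callback are hypothetical, and the demo rewrites a small placeholder file rather than a real safetensors shard. A real shard would need a streaming rewrite instead of reading the whole file into memory.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def rewrite_atomically(path: Path, transform: Callable[[bytes], bytes]) -> None:
    """Write transformed contents to '<name>.tmp', then rename over the original."""
    tmp_path = path.with_name(path.name + ".tmp")
    data = transform(path.read_bytes())
    with open(tmp_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes are on disk before the rename
    # os.replace swaps the file in a single step when both paths share a filesystem.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as workdir:
    shard = Path(workdir) / "shard-00001.bin"
    shard.write_bytes(b"scale:[0.5]")
    rewrite_atomically(shard, lambda raw: raw.replace(b"[0.5]", b"0.5"))
    result = shard.read_bytes()
    leftovers = sorted(p.name for p in Path(workdir).iterdir())

print(result, leftovers)
```

Writing to a sibling temporary file first means an interrupted run leaves either the old shard or the new one in place, never a half-written file under the final name.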

This unblocked the artifact's serve smoke and benchmarks (GSM8K 0.9181 strict / 0.9515 flexible-extract, beating RedHat's published 0.910 on the same recipe family). But it's clearly a vLLM-side bug — every compressed-tensors NVFP4 MoE artifact built with the standard preset will hit this.

Related

  • compressed-tensors could also be fixed to save per-tensor *_global_scale tensors as 0-D scalars from the start. But the safer-by-default fix is loader-side tolerance — there are likely existing artifacts in the wild with shape (1,).
  • Same code family as vLLM PRs #43248, #43288, #43290 (this org's session) — defensive shape/type handling on the compressed-tensors load path.
