vllm - 💡(How to fix) Fix [Bug] FusedMoE `_load_per_tensor_weight_scale` rejects shape `(1,)` per-tensor scales

vllm, 2026-05-21 08:59:19

FusedMoE._load_per_tensor_weight_scale at vllm/model_executor/layers/fused_moe/layer.py:573 does param_data[expert_id][idx] = loaded_weight to install per-expert, per-(w1|w3)-shard scalar scales. When the on-disk per-tensor scale is stored as a length-1 tensor (shape (1,)) rather than a 0-D scalar (shape ()), the assignment fails:

RuntimeError: output with shape [] doesn't match the broadcast shape [1]

The error is raised during weight loading, before the engine reaches its first forward pass, so server initialization fails.

Trigger

compressed-tensors quantization schemes that produce per-tensor weight_global_scale / input_global_scale tensors emit them as shape (1,) by default (the torch.tensor([x]) form rather than torch.tensor(x)). On NVFP4 MoE artifacts produced by llm-compressor with the standard NVFP4 preset, every routed expert has shape-(1,) *_global_scale entries. Our DeepSeek-V4-Flash artifact has 256 routed experts × 43 main layers × 6 scale tensors each (w1/w2/w3 × {weight_global_scale, input_global_scale}), which comes to 66,048 (~66,000) affected tensors.
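To check whether a given artifact is affected without loading any tensor data, the shard headers can be inspected directly. The following is a minimal sketch, assuming the standard safetensors file layout (an 8-byte little-endian header length followed by a JSON header). The function names are hypothetical and are not part of vLLM, compressed-tensors, or llm-compressor.

```python
import json
import struct
from pathlib import Path

# Suffixes taken from the tensor names discussed in this issue.
SCALE_SUFFIXES = ("weight_global_scale", "input_global_scale")


def read_safetensors_header(path: Path) -> dict:
    """Read only the JSON header of a .safetensors shard (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def find_length_one_scales(path: Path) -> list[str]:
    """Return names of *_global_scale tensors stored with shape [1]."""
    header = read_safetensors_header(path)
    return sorted(
        name
        for name, info in header.items()
        if name != "__metadata__"  # optional free-form metadata entry
        and name.endswith(SCALE_SUFFIXES)
        and info["shape"] == [1]
    )
```

Running find_length_one_scales over every shard of an artifact and summing the lengths of the results gives the number of tensors that would need coercion.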

Reproducer

Load any compressed-tensors NVFP4 MoE artifact whose *_global_scale tensors are saved as shape (1,):

import torch
import torch.nn as nn

# Simulates the loader site
num_experts, num_shards = 4, 2
param_data = nn.Parameter(
    torch.empty(num_experts, num_shards, dtype=torch.float32),
    requires_grad=False,
).data

loaded_weight = torch.tensor([0.5])  # shape (1,), llm-compressor default
expert_id, idx = 0, 0

param_data[expert_id][idx] = loaded_weight  # RuntimeError

RuntimeError: output with shape [] doesn't match the broadcast shape [1]

Proposed fix

The simplest fix is to coerce loaded_weight to a 0-D scalar at the loader site. Any of the three options below works:

# Option A: squeeze all singleton dims before assignment
param_data[expert_id][idx] = loaded_weight.squeeze()

# Option B: explicit shape coercion to 0-D via view
param_data[expert_id][idx] = loaded_weight.view([])

# Option C: use copy_ on the scalar slot
param_data[expert_id][idx].copy_(loaded_weight.reshape(()))

Option A is the most permissive (accepts both (1,) and (1, 1) etc. as the broadcast-equivalent of a scalar) and was what we used artifact-side as a workaround.
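A stricter variant of the options above is sketched below: it accepts any single-element tensor, whether shaped (), (1,) or (1, 1), and rejects anything larger with an explicit message instead of relying on broadcasting. This is only an illustration under stated assumptions. coerce_per_tensor_scale is a hypothetical helper name, not an existing vLLM function, and param_data here is a stand-in for the real parameter storage.

```python
import torch


def coerce_per_tensor_scale(loaded_weight: torch.Tensor) -> torch.Tensor:
    """Return a 0-D tensor for a single-element scale (hypothetical helper)."""
    if loaded_weight.numel() != 1:
        raise ValueError(
            "expected a per-tensor scale with exactly one element, "
            f"got shape {tuple(loaded_weight.shape)}"
        )
    # reshape(()) yields a 0-D tensor for any single-element input.
    return loaded_weight.reshape(())


# Stand-in for the loader site: one scalar slot per (expert, shard).
num_experts, num_shards = 4, 2
param_data = torch.zeros(num_experts, num_shards, dtype=torch.float32)

# The same value stored in three different on-disk shapes.
for idx, loaded_weight in enumerate(
    (torch.tensor(0.5), torch.tensor([0.5]), torch.tensor([[0.5]]))
):
    param_data[idx][0] = coerce_per_tensor_scale(loaded_weight)
```

Compared with a bare squeeze(), the explicit numel() check turns a genuinely mis-shaped scale into a clear error at load time rather than a broadcast failure further down.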

The _load_per_channel_weight_scale path at line 632 may have a similar issue depending on broadcast semantics — worth a sweep.

Workaround we use today

Artifact-side: scripts/squeeze_global_scales.py walks all weight_global_scale / input_global_scale tensors in the safetensors index and squeezes shape (1,) → (). Each shard is written to a .tmp file and then renamed, so the replacement is atomic per shard. About 66K tensors are touched for V4-Flash, and the run takes ~5 min on EXT4. The script is at https://github.com/canada-quant/dsv4-flash-nvfp4-fp8-mtp/blob/main/scripts/squeeze_global_scales.py.

This unblocked the artifact's serve smoke and benchmarks (GSM8K 0.9181 strict / 0.9515 flexible-extract, beating RedHat's published 0.910 on the same recipe family). But it's clearly a vLLM-side bug — every compressed-tensors NVFP4 MoE artifact built with the standard preset will hit this.
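For readers who want the general shape of such an artifact-side pass, here is a rough sketch. It is a separate illustration and not a copy of the linked script, whose implementation may differ. It assumes the standard safetensors layout (8-byte little-endian header length, JSON header, then raw tensor bytes) and relies on the fact that shapes (1,) and () both describe a single element, so only the header needs to change. The function name is hypothetical.

```python
import json
import os
import struct
from pathlib import Path

SCALE_SUFFIXES = ("weight_global_scale", "input_global_scale")


def squeeze_scale_shapes_in_shard(path: Path) -> int:
    """Rewrite shape [1] -> [] for *_global_scale entries in one shard.

    Returns the number of header entries changed. Tensor bytes are copied
    through untouched; data offsets are relative to the end of the header,
    so they stay valid when the header length changes.
    """
    with open(path, "rb") as src:
        (header_len,) = struct.unpack("<Q", src.read(8))
        header = json.loads(src.read(header_len))

        changed = 0
        for name, info in header.items():
            if name == "__metadata__":
                continue
            if name.endswith(SCALE_SUFFIXES) and info["shape"] == [1]:
                info["shape"] = []  # still one element, same byte size
                changed += 1
        if changed == 0:
            return 0

        new_header = json.dumps(header, separators=(",", ":")).encode("utf-8")
        # Trailing spaces are valid JSON whitespace; pad to a multiple of 8.
        new_header += b" " * (-len(new_header) % 8)

        tmp_path = path.with_name(path.name + ".tmp")
        with open(tmp_path, "wb") as dst:
            dst.write(struct.pack("<Q", len(new_header)))
            dst.write(new_header)
            while chunk := src.read(1 << 20):  # stream the tensor bytes
                dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())

    # Atomic swap: readers see either the old shard or the new one.
    os.replace(tmp_path, path)
    return changed
```

Calling the function a second time on the same shard returns 0, which makes the pass safe to re-run after an interruption.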

Related

compressed-tensors could also be fixed to save per-tensor *_global_scale tensors as 0-D scalars from the start. But the safer-by-default fix is loader-side tolerance, since there are likely existing artifacts in the wild with shape (1,).

Same code family as vLLM PRs #43248, #43288, #43290 (this org's session): defensive shape/type handling on the compressed-tensors load path.
