vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Pro TP=16 fails fp8 block-shape check on shared_experts.down_proj — contradicts the official recipe

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

TL;DR

The official recipes.vllm.ai page for DeepSeek-V4-Pro (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro?features=spec_decoding%2Creasoning&hardware=h100) advertises a --tensor-parallel-size 16 configuration (the "2× H100 · Multi-Node TEP · FP8" tab). However, on a freshly-cloned/installed vLLM 0.20.x with DeepSeek-V4-Pro's published checkpoint, the recipe's --tensor-parallel-size 16 cannot finish worker init: every worker raises:

ValueError: Weight input_size_per_partition = 192 is not divisible by weight quantization block_k = 128.

…while the same recipe at --tensor-parallel-size 8 (the "1× H100" tab) initializes fine. So the recipe page advertises a config that the shipped V4 model code path cannot run.

Reproduction

Use the recipe verbatim, only flipping the TP flag:

Works (1× H100 recipe):

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  --enable-expert-parallel --tensor-parallel-size 8 \
  --max-model-len 800000 --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 --max-num-batched-tokens 512 \
  --no-enable-flashinfer-autotune \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --speculative_config '{"method":"mtp","num_speculative_tokens":2}' \
  --reasoning-parser deepseek_v4

Fails (2× H100 recipe — same flags + nnodes/TP swap):

vllm serve deepseek-ai/DeepSeek-V4-Pro
... [identical flags] ...
-cc.pass_config.fuse_allreduce_rms=False
--tensor-parallel-size 16 --nnodes 2 --node-rank 0 --master-addr $HEAD_IP
...

The TP=16 command raises during WorkerProc.load_model() on every rank.

Expected behavior

The TP=16 recipe — being the headline production config on the recipe page — should construct the model and proceed to weight loading.

Actual behavior — full failing trace

... vllm/v1/worker/gpu_model_runner.py:4793 in load_model
... vllm/model_executor/models/deepseek_v4.py:1539 in __init__
... vllm/model_executor/models/deepseek_v4.py:1267 in __init__ (make_layers)
... vllm/model_executor/models/utils.py:646 in make_layers
... vllm/model_executor/models/deepseek_v4.py:1269 in <lambda> (DeepseekV4DecoderLayer)
... vllm/model_executor/models/deepseek_v4.py:1116 in __init__
... vllm/model_executor/models/deepseek_v4.py:784 in __init__   ← shared_experts construction
... vllm/model_executor/models/deepseek_v4.py:95 in __init__    ← DeepseekV4MLP.down_proj
... vllm/model_executor/layers/linear.py:1462 in __init__
... vllm/model_executor/layers/quantization/fp8.py:333 in create_weights
... vllm/model_executor/layers/quantization/utils/fp8_utils.py:1126 in validate_fp8_block_shape
ValueError: Weight input_size_per_partition = 192 is not divisible by weight quantization block_k = 128.

Root cause — the contradiction lives in the source

Per the published V4-Pro config.json:

    moe_intermediate_size = ...
    n_shared_experts = ...
    → shared_experts.intermediate_size = moe_intermediate_size * n_shared_experts = 3072

vllm/model_executor/models/deepseek_v4.py:784 constructs the shared experts as a DeepseekV4MLP, whose down_proj is a plain RowParallelLinear with TP-split input:

    self.shared_experts = DeepseekV4MLP(
        hidden_size=config.hidden_size,
        intermediate_size=intermediate_size,   # 3072
        ...
        # ← is_sequence_parallel is NOT passed, so it defaults to False
        prefix=f"{prefix}.shared_experts",
    )

With is_sequence_parallel=False, DeepseekV4MLP.down_proj (deepseek_v4.py:95) uses standard TP-row-parallel: input_size_per_partition = 3072 / TP. The fp8 block-quant validator (fp8_utils.py:1126) then enforces input_size_per_partition % 128 == 0. The arithmetic:

┌─────┬───────────┬───────┬──────────────────────────────┐
│ TP  │ 3072 / TP │ % 128 │ Result                       │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 8   │ 384       │ 0     │ ✅ passes                    │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 16  │ 192       │ 64    │ ❌ raises                    │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 12  │ 256       │ 0     │ ✅ would pass (but uncommon) │
└─────┴───────────┴───────┴──────────────────────────────┘
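The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch that only mirrors the divisibility rule quoted in the error message; it does not call vLLM, and `check_tp` is a hypothetical helper name. The values 3072 and 128 are taken from the issue text.

```python
# Values taken from the issue: the shared-expert intermediate size and the
# fp8 block_k reported in the ValueError.
SHARED_INTERMEDIATE_SIZE = 3072
BLOCK_K = 128


def check_tp(tp_size, full_size=SHARED_INTERMEDIATE_SIZE, block_k=BLOCK_K):
    """Return (per-rank input size, remainder mod block_k, passes) for a TP degree.

    Hypothetical helper: it mirrors the rule input_size_per_partition % block_k == 0
    for a row-parallel layer whose input dimension is split evenly across ranks.
    """
    if full_size % tp_size != 0:
        raise ValueError(f"{full_size} is not evenly divisible by TP={tp_size}")
    per_rank = full_size // tp_size
    remainder = per_rank % block_k
    return per_rank, remainder, remainder == 0


for tp in (8, 16, 12):
    per_rank, remainder, ok = check_tp(tp)
    status = "passes" if ok else "raises"
    print(f"TP={tp:>2}: 3072 / TP = {per_rank:>3}, % 128 = {remainder:>2} -> {status}")
```

Running the loop over other TP degrees is a quick way to see which splits keep the per-rank width a multiple of 128 before launching a multi-node job.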

So TP=16 is a hard-coded mathematical impossibility for V4-Pro's shared experts under the current deepseek_v4.py, regardless of whether you're on 1 node or 2 nodes — the per-rank input dim does not change with node topology.

DeepseekV4MLP.__init__ (deepseek_v4.py:78) actually has the escape hatch built in:

    def __init__(self, ..., is_sequence_parallel: bool = False, ...):
        self.gate_up_proj = MergedColumnParallelLinear(..., disable_tp=is_sequence_parallel, ...)
        self.down_proj = RowParallelLinear(..., disable_tp=is_sequence_parallel, ...)

…but the DeepseekV4MoE parent (line 784) never threads is_sequence_parallel=True down. There is also no read of any pass_config.enable_sp / parallel_config.enable_sequence_parallel at this construction site.
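To see why the escape hatch would sidestep the check, the sketch below models only the size arithmetic, assuming that `disable_tp=True` means the layer keeps its full input dimension on every rank (replication) instead of splitting it. `down_proj_input_per_rank` is a hypothetical helper, not a vLLM function.

```python
BLOCK_K = 128  # fp8 block_k from the error message


def down_proj_input_per_rank(full_size, tp_size, disable_tp):
    """Per-rank input width of down_proj.

    Assumption: with disable_tp=True the weight is replicated, so each rank
    sees the full input dimension; otherwise it is split evenly across ranks.
    """
    return full_size if disable_tp else full_size // tp_size


for disable_tp in (False, True):
    width = down_proj_input_per_rank(3072, 16, disable_tp)
    aligned = width % BLOCK_K == 0
    print(f"TP=16, disable_tp={disable_tp}: per-rank width {width}, aligned={aligned}")
```

Under that assumption the replicated layer passes at any TP degree, because 3072 is itself a multiple of 128. The trade-off is that every rank then holds a full copy of the shared-expert weights.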

Why the recipe contradicts the code

Three working hypotheses, any of which would explain the gap:

  1. Recipe was generated from a config template, never end-to-end validated on this exact wheel + checkpoint combination.
  2. Recipe was tested against a private/branch wheel that has the is_sequence_parallel plumbing through shared_experts (which would also let pass_config.enable_sp=True actually take effect on this layer).
  3. A different V4-Pro checkpoint variant (with moe_intermediate_size * n_shared_experts divisible by 128 at TP=16) was used during recipe verification.

It would be useful to know which of (1)/(2)/(3) is the case, because the user-facing contract changes drastically.

Suggested fix

Thread is_sequence_parallel (or read the global pass_config) into shared_experts construction:

--- a/vllm/model_executor/models/deepseek_v4.py
+++ b/vllm/model_executor/models/deepseek_v4.py
@@ -780,6 +780,11 @@ class DeepseekV4MoE(nn.Module):
         if config.n_shared_experts is None:
             self.shared_experts = None
         else:
             intermediate_size = config.moe_intermediate_size * config.n_shared_experts
+            # When sequence parallelism is on (or when per-rank intermediate_size
+            # would not satisfy fp8 block_k alignment), replicate the shared
+            # experts instead of TP-splitting them. This is the same hatch
+            # DeepseekV4MLP already exposes for routed expert weights.
+            shared_sp = vllm_config.compilation_config.pass_config.enable_sp
             self.shared_experts = DeepseekV4MLP(
                 hidden_size=config.hidden_size,
                 intermediate_size=intermediate_size,
                 hidden_act=config.hidden_act,
                 swiglu_limit=self.swiglu_limit,
                 quant_config=quant_config,
                 reduce_results=self.use_mega_moe,
+                is_sequence_parallel=shared_sp,
                 prefix=f"{prefix}.shared_experts",
             )

Or — if upstream prefers — make the validator skip the row-parallel divisibility check for shared_experts when an alternative kernel path is available.

Workaround

--tensor-parallel-size 8 works (single-node 8× H100, ~640 GB HBM, fits V4-Pro). For larger TP, currently the only option is to monkey-patch DeepseekV4MoE.__init__ to pass is_sequence_parallel=True to shared_experts.
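One possible shape for such a monkey-patch is sketched below. It is an untested pattern demonstrated on a stand-in class, not on vLLM itself: `force_kwargs_for_prefix` and `FakeMLP` are hypothetical names, and the sketch assumes the real DeepseekV4MLP receives `prefix` and `is_sequence_parallel` as keyword arguments, as in the construction site quoted above. Instead of editing DeepseekV4MoE.__init__, it wraps the MLP constructor and forces the flag only when the prefix ends in `.shared_experts`.

```python
import functools


def force_kwargs_for_prefix(cls, suffix, **forced):
    """Wrap cls.__init__ so that calls whose `prefix` keyword ends with
    `suffix` receive the `forced` keyword arguments.

    Returns the original __init__ so the patch can be undone.
    Assumes the patched arguments are always passed by keyword.
    """
    original_init = cls.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        if str(kwargs.get("prefix", "")).endswith(suffix):
            kwargs.update(forced)
        original_init(self, *args, **kwargs)

    cls.__init__ = patched_init
    return original_init


class FakeMLP:
    """Stand-in for DeepseekV4MLP; only the arguments relevant here."""

    def __init__(self, intermediate_size, is_sequence_parallel=False, prefix=""):
        self.intermediate_size = intermediate_size
        self.is_sequence_parallel = is_sequence_parallel
        self.prefix = prefix


force_kwargs_for_prefix(FakeMLP, ".shared_experts", is_sequence_parallel=True)

shared = FakeMLP(intermediate_size=3072, prefix="model.layers.0.mlp.shared_experts")
other = FakeMLP(intermediate_size=3072, prefix="model.layers.0.mlp")
print(shared.is_sequence_parallel, other.is_sequence_parallel)
```

Applied to the real class, a patch like this would have to run in every worker process before the model is constructed. Whether weight loading and the reduction logic behave correctly with replicated shared experts is not verified here.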

Environment

vllm: 0.20.2 (+cu126 build)
torch: 2.9.0
CUDA: 12.6
Hardware: 16-GPU single-host, Ampere-class GPUs (96 GB HBM each)
Model: deepseek-ai/DeepSeek-V4-Pro (default checkpoint)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
