Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

TL;DR

The official recipes.vllm.ai page for DeepSeek-V4-Pro (https://recipes.vllm.ai/deepseek-ai/DeepSeek-V4-Pro?features=spec_decoding%2Creasoning&hardware=h100) advertises a --tensor-parallel-size 16 configuration (the "2× H100 · Multi-Node TEP · FP8" tab). However, on a freshly-cloned/installed vLLM 0.20.x with DeepSeek-V4-Pro's published checkpoint, the recipe's --tensor-parallel-size 16 cannot finish worker init: every worker raises:

ValueError: Weight input_size_per_partition = 192 is not divisible by weight quantization block_k = 128.

…while the same recipe at --tensor-parallel-size 8 (the "1× H100" tab) initializes fine. So the recipe page advertises a config that the shipped V4 model code path cannot run.

Reproduction

Use the recipe verbatim, only flipping the TP flag:

Works (1× H100 recipe):

vllm serve deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code --kv-cache-dtype fp8 --block-size 256 \
  --enable-expert-parallel --tensor-parallel-size 8 \
  --max-model-len 800000 --gpu-memory-utilization 0.95 \
  --max-num-seqs 512 --max-num-batched-tokens 512 \
  --no-enable-flashinfer-autotune \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --speculative_config '{"method":"mtp","num_speculative_tokens":2}' \
  --reasoning-parser deepseek_v4

Fails (2× H100 recipe — same flags + nnodes/TP swap):

vllm serve deepseek-ai/DeepSeek-V4-Pro
... [identical flags] ...
-cc.pass_config.fuse_allreduce_rms=False
--tensor-parallel-size 16 --nnodes 2 --node-rank 0 --master-addr $HEAD_IP
...

The TP=16 command raises in WorkerProc.load_model() on every rank.

Expected behavior

The TP=16 recipe — being the headline production config on the recipe page — should construct the model and proceed to weight loading.

Actual behavior — full failing trace

... vllm/v1/worker/gpu_model_runner.py:4793 in load_model
... vllm/model_executor/models/deepseek_v4.py:1539 in __init__
... vllm/model_executor/models/deepseek_v4.py:1267 in __init__ (make_layers)
... vllm/model_executor/models/utils.py:646 in make_layers
... vllm/model_executor/models/deepseek_v4.py:1269 in <lambda> (DeepseekV4DecoderLayer)
... vllm/model_executor/models/deepseek_v4.py:1116 in __init__
... vllm/model_executor/models/deepseek_v4.py:784 in __init__ ← shared_experts construction
... vllm/model_executor/models/deepseek_v4.py:95 in __init__ ← DeepseekV4MLP.down_proj
... vllm/model_executor/layers/linear.py:1462 in __init__
... vllm/model_executor/layers/quantization/fp8.py:333 in create_weights
... vllm/model_executor/layers/quantization/utils/fp8_utils.py:1126 in validate_fp8_block_shape
ValueError: Weight input_size_per_partition = 192 is not divisible by weight quantization block_k = 128.

Root cause — the contradiction lives in the source

Per the published V4-Pro config.json:

moe_intermediate_size = ...
n_shared_experts = ...
→ shared_experts.intermediate_size = moe_intermediate_size * n_shared_experts = 3072

vllm/model_executor/models/deepseek_v4.py:784 constructs the shared experts as a DeepseekV4MLP whose down_proj is a plain RowParallelLinear with TP-split input:

    self.shared_experts = DeepseekV4MLP(
        hidden_size=config.hidden_size,
        intermediate_size=intermediate_size,  # 3072
        ...
        # ← is_sequence_parallel is NOT passed, so it defaults to False
        prefix=f"{prefix}.shared_experts",
    )

With is_sequence_parallel=False, DeepseekV4MLP.down_proj (deepseek_v4.py:95) uses standard TP-row-parallel: input_size_per_partition = 3072 / TP. The fp8 block-quant validator (fp8_utils.py:1126) then enforces input_size_per_partition % 128 == 0. The arithmetic:

┌─────┬───────────┬───────┬──────────────────────────────┐
│ TP  │ 3072 / TP │ % 128 │ Result                       │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 8   │ 384       │ 0     │ ✅ passes                    │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 16  │ 192       │ 64    │ ❌ raises                    │
├─────┼───────────┼───────┼──────────────────────────────┤
│ 12  │ 256       │ 0     │ ✅ would pass (but uncommon) │
└─────┴───────────┴───────┴──────────────────────────────┘

So TP=16 cannot pass this check for V4-Pro's shared experts under the current deepseek_v4.py, whether you run on 1 node or 2 nodes: the per-rank input dim depends only on the TP size, not on node topology.
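The divisibility arithmetic above can be reproduced in a few lines. This is a minimal sketch, not vLLM code: `per_rank_check` is a hypothetical helper, and the constants 3072 and 128 are taken from the issue text and the error message.

```python
SHARED_INTERMEDIATE_SIZE = 3072  # moe_intermediate_size * n_shared_experts, per the issue
BLOCK_K = 128                    # block_k from the error message


def per_rank_check(tp, size=SHARED_INTERMEDIATE_SIZE, block_k=BLOCK_K):
    """Return (per-rank input size, remainder mod block_k) for an even TP split."""
    if size % tp != 0:
        raise ValueError(f"{size} does not split evenly across TP={tp}")
    per_rank = size // tp
    return per_rank, per_rank % block_k


for tp in (8, 16, 12):
    per_rank, rem = per_rank_check(tp)
    verdict = "passes" if rem == 0 else "raises"
    print(f"TP={tp:>2}: {per_rank:>4} % {BLOCK_K} = {rem:>2} -> {verdict}")
```

A non-zero remainder corresponds to the case where the fp8 block-shape validator raises.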

DeepseekV4MLP.__init__ (deepseek_v4.py:78) actually has the escape hatch built in:

    def __init__(self, ..., is_sequence_parallel: bool = False, ...):
        self.gate_up_proj = MergedColumnParallelLinear(..., disable_tp=is_sequence_parallel, ...)
        self.down_proj = RowParallelLinear(..., disable_tp=is_sequence_parallel, ...)

…but the DeepseekV4MoE parent (line 784) never threads is_sequence_parallel=True down. There is also no read of any pass_config.enable_sp / parallel_config.enable_sequence_parallel at this construction site.
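To illustrate why the disable_tp hatch would sidestep the check, here is a toy model of the two behaviours. It assumes that disabling TP leaves the full input dimension on every rank; the function names are hypothetical stand-ins, not the vLLM implementation, and only the error message is modelled on the trace above.

```python
def input_size_per_partition(input_size: int, tp_size: int, disable_tp: bool) -> int:
    """Toy model: a row-parallel layer splits its input dim unless TP is disabled."""
    return input_size if disable_tp else input_size // tp_size


def check_block_k(size_per_partition: int, block_k: int = 128) -> None:
    """Stand-in for the fp8 block-shape divisibility check."""
    if size_per_partition % block_k != 0:
        raise ValueError(
            f"Weight input_size_per_partition = {size_per_partition} "
            f"is not divisible by weight quantization block_k = {block_k}."
        )


def down_proj_ok(tp_size: int, disable_tp: bool, intermediate_size: int = 3072) -> bool:
    """True if the toy down_proj would pass the check at this TP size."""
    try:
        check_block_k(input_size_per_partition(intermediate_size, tp_size, disable_tp))
        return True
    except ValueError:
        return False


print(down_proj_ok(16, disable_tp=False))  # split across ranks: 192 per rank
print(down_proj_ok(16, disable_tp=True))   # replicated: 3072 per rank
```

Under this model the replicated layer passes at any TP size, at the cost of holding the full shared-expert weights on every rank.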

Why the recipe contradicts the code

Three working hypotheses, any of which would explain the gap:

1. The recipe was generated from a config template and never validated end-to-end on this exact wheel + checkpoint combination.
2. The recipe was tested against a private/branch wheel that has the is_sequence_parallel plumbing through shared_experts (which would also let pass_config.enable_sp=True actually take effect on this layer).
3. A different V4-Pro checkpoint variant (with moe_intermediate_size * n_shared_experts divisible by 128 at TP=16) was used during recipe verification.

It would be useful to know which of (1)/(2)/(3) is the case, because the user-facing contract changes drastically.

Suggested fix

Thread is_sequence_parallel (or read the global pass_config) into shared_experts construction:

--- a/vllm/model_executor/models/deepseek_v4.py
+++ b/vllm/model_executor/models/deepseek_v4.py
@@ -780,13 +780,19 @@ class DeepseekV4MoE(nn.Module):
         if config.n_shared_experts is None:
             self.shared_experts = None
         else:
             intermediate_size = config.moe_intermediate_size * config.n_shared_experts
+            # When sequence parallelism is on (or when per-rank intermediate_size
+            # would not satisfy fp8 block_k alignment), replicate the shared
+            # experts instead of TP-splitting them; same hatch DeepseekV4MLP
+            # already exposes for routed expert weights.
+            shared_sp = vllm_config.compilation_config.pass_config.enable_sp
             self.shared_experts = DeepseekV4MLP(
                 hidden_size=config.hidden_size,
                 intermediate_size=intermediate_size,
                 hidden_act=config.hidden_act,
                 swiglu_limit=self.swiglu_limit,
                 quant_config=quant_config,
                 reduce_results=self.use_mega_moe,
+                is_sequence_parallel=shared_sp,
                 prefix=f"{prefix}.shared_experts",
             )

Or — if upstream prefers — make the validator skip the row-parallel divisibility check for shared_experts when an alternative kernel path is available.

Workaround

--tensor-parallel-size 8 works (single-node 8× H100, ~640 GB HBM, fits V4-Pro). For larger TP, the only current option is to monkey-patch DeepseekV4MoE.__init__ so that it passes is_sequence_parallel=True to shared_experts.
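Before launching, a quick pre-flight check can list which TP sizes keep the shared-expert down_proj aligned. This is a rough sketch under the numbers reported in this issue (width 3072, block_k 128); `aligned_tp_sizes` is a hypothetical helper, and it tests only this one constraint, so other layers may impose their own limits on TP.

```python
def aligned_tp_sizes(shared_intermediate_size: int,
                     max_tp: int = 16,
                     block_k: int = 128) -> list:
    """TP sizes whose even split of the input dim is a multiple of block_k."""
    return [
        tp
        for tp in range(1, max_tp + 1)
        if shared_intermediate_size % tp == 0
        and (shared_intermediate_size // tp) % block_k == 0
    ]


# 3072 is the shared_experts.intermediate_size quoted in this issue.
candidates = aligned_tp_sizes(3072)
print(candidates)
print(16 in candidates)
```

For a width of 3072, TP=16 is absent from the list, which matches the failure reported here.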

Environment

vllm: 0.20.2 (+cu126 build)
torch: 2.9.0
CUDA: 12.6
Hardware: 16-GPU single-host, Ampere-class GPUs (96 GB HBM each)
Model: deepseek-ai/DeepSeek-V4-Pro (default checkpoint)

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


vllm - 💡(How to fix) Fix [Bug]: DeepSeek-V4-Pro TP=16 fails fp8 block-shape check on shared_experts.down_proj — contradicts the official recipe
