[Info][DSV4-Pro][MTP] V4-Pro MTP acceptance ~1.82% on vLLM mainline reproduces LMSYS day-zero ~1.19 accept length: `opt_in_features` classification correctly reflects current MTP head capability

vllm, 2026-05-22 23:55:57

Your current environment

vLLM mainline @ 39910f2b25 (2026-05-22)
+ PR #42209 (NVFP4 MoE support for DSV4) — now merged
+ 4 local DSV4 patches (#43248, #43288, #43290, #43319)
8× NVIDIA B300 SXM6 AC (sm_103a, 288 GB HBM3e each)

Purpose

This is an informational issue, not a bug report. Filed to consolidate the upstream evidence around current V4-Pro MTP draft-acceptance capability, since multiple users (us included) hit "MTP retains and fires but acceptance is much lower than expected" and the right answer involves trusting the official recipe YAML's opt_in_features classification rather than chasing a non-existent loader bug.

Measurement

Setup:

Artifact: NVFP4 V4-Pro (canada-quant/DeepSeek-V4-Pro-NVFP4-FP8-MTP)
Strategy: upstream-default single_node_tep at TP=8 + EP
Flags: --moe-backend flashinfer_trtllm, --attention_config.use_fp4_indexer_cache=True, --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}', --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
Workload: 20 chat-style prompts

Counters are summed across all engine="<n>" labels in /metrics:

Metric	Value
Draft tokens emitted	13,180
Tokens accepted	240
Per-token acceptance rate	1.82%
Equivalent accept length (N=2)	1.036
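
The two derived rows can be re-computed from the raw counters. The sketch below sums a Prometheus counter over every label set in a /metrics text dump and derives the acceptance rate and accept length. The metric names and the per-engine split in the sample text are assumptions for illustration (only the totals come from this issue), so check the names your vLLM build actually exports. The helper names are hypothetical. The accept-length formula assumes every draft step proposes exactly N tokens, so the number of draft steps is draft tokens / N.

```python
import re

# One Prometheus sample line: metric name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

# Assumed metric names; verify against your own /metrics output.
DRAFT = "vllm:spec_decode_num_draft_tokens_total"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"


def sum_counter(metrics_text, name):
    """Sum one counter across all label sets, e.g. every engine="<n>"."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_LINE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


def acceptance_stats(draft_tokens, accepted_tokens, n_spec):
    """Return (per-token acceptance rate, accept length)."""
    rate = accepted_tokens / draft_tokens
    draft_steps = draft_tokens / n_spec  # assumes n_spec tokens per step
    accept_length = 1 + accepted_tokens / draft_steps
    return rate, accept_length


# Made-up per-engine split; totals match the table (13,180 and 240).
SAMPLE = """
# TYPE vllm:spec_decode_num_draft_tokens_total counter
vllm:spec_decode_num_draft_tokens_total{engine="0"} 6600.0
vllm:spec_decode_num_draft_tokens_total{engine="1"} 6580.0
vllm:spec_decode_num_accepted_tokens_total{engine="0"} 130.0
vllm:spec_decode_num_accepted_tokens_total{engine="1"} 110.0
"""

drafted = sum_counter(SAMPLE, DRAFT)
accepted = sum_counter(SAMPLE, ACCEPTED)
rate, accept_length = acceptance_stats(drafted, accepted, n_spec=2)
print(f"{rate:.2%}")           # 1.82%
print(f"{accept_length:.3f}")  # 1.036
```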

Comparison to upstream sources

LMSYS day-zero V4-Pro blog (officially-blessed vllm/vllm-openai:deepseekv4-cu130 docker image, built from zyongye's PR fork):

MTP-3 on B200 Pro (accept ~1.19). The per-position breakdown is heavily skewed -- positions 0/1/2 accept 2226 / 354 / 55 tokens respectively -- so the spec path looks like it is mostly accepting only position 0. This suggests the MTP path may not be hitting full effectiveness on Pro; we did not investigate further.
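
The skew in the quoted per-position counts can be made concrete with a little arithmetic. The sketch below uses only the three counts from the quote; the variable names are illustrative. The "conditional" ratio assumes a draft token at position k can be accepted only if position k-1 was accepted, which is the usual prefix-acceptance rule in speculative decoding but is an assumption about this setup.

```python
# Accepted-token counts at draft positions 0, 1, 2 (from the LMSYS quote).
accepted_per_pos = [2226, 354, 55]

total = sum(accepted_per_pos)

# Share of all accepted draft tokens contributed by each position.
shares = [count / total for count in accepted_per_pos]

# Acceptance at position k relative to position k-1 (assumes prefix acceptance).
conditional = [
    accepted_per_pos[k] / accepted_per_pos[k - 1]
    for k in range(1, len(accepted_per_pos))
]

print(total)                               # 2635
print([f"{s:.1%}" for s in shares])        # ['84.5%', '13.4%', '2.1%']
print([f"{c:.1%}" for c in conditional])   # ['15.9%', '15.5%']
```

Under that assumption, about 84.5% of accepted draft tokens come from position 0, and each later position retains only about 16% of the previous one.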

LMSYS's accept length of 1.19 at N=3 translates to a per-token acceptance of ~6.3% ((1.19 - 1) / 3). Our 1.82% at N=2 (accept length 1.036) is in the same low-single-digit regime but about 3.5x lower. Two structural factors might explain why ours is lower than LMSYS's:

Our artifact dequantizes mtp.0.{e_proj, h_proj}.weight from FP8 to BF16 at conversion time (workaround for ReplicatedLinear + Fp8Config MTP-loader gap on mainline — see our findings). LMSYS's setup uses the native FP8 versions via the fork's loader path.
Different prompts: our chat workload is 20 short chat prompts; LMSYS's was likely a broader mix.

But the order of magnitude matches across both partners' deployments: V4-Pro MTP currently produces accept length ~1.0-1.2, not the 2.0-3.0 the V4-Flash MTP head produces. The vLLM recipe YAML correctly classifies this as opt_in_features, not part of the default V4-Pro deployment.
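
The conversion between accept length and per-token acceptance used in the comparison above can be sketched as follows. The function name is hypothetical, and the formula assumes accept length is defined as 1 + (accepted draft tokens per draft step), with N draft tokens proposed at every step.

```python
def per_token_acceptance(accept_length, n_spec):
    """Per-token acceptance rate implied by an accept length at N=n_spec."""
    return (accept_length - 1) / n_spec


lmsys = per_token_acceptance(1.19, 3)   # LMSYS day-zero, MTP-3
ours = per_token_acceptance(1.036, 2)   # this issue, MTP-2 (rounded length)

print(f"{lmsys:.1%}")  # 6.3%
print(f"{ours:.1%}")   # 1.8%
```

The second figure comes out as 1.8% rather than 1.82% only because it starts from the rounded accept length of 1.036 instead of the raw counters.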

Reproducibility

Full reproduction recipe + raw counter dumps at:

Source repo: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp
Backend×format matrix: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp/blob/main/docs/findings/backend_format_matrix.md
Upstream MTP classification evidence trail: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp/blob/main/docs/findings/upstream_mtp_classification.md
Raw bench JSON: https://github.com/canada-quant/dsv4-pro-nvfp4-fp8-mtp/blob/main/docs/benchmarks/matrix/mtp_A_nvfp4_flashinfer_mtp_2026_05_22.json

Asks

Optional: link this issue from the recipe YAML's opt_in_features block so users who hit "why is MTP acceptance so low" land here directly.
Optional: a note in the V4-Pro section of the vLLM model docs that V4-Pro MTP is currently weak compared to V3.2 / V4-Flash MTP — sets correct expectations.

No code change requested. Filing for collective-knowledge purposes so subsequent V4-Pro deployers don't re-derive the same investigation.

Cc'ing @zyongye @WoosukKwon since the MTP path discussion has been on PR #40760 and related.
