vllm - ✅(Solved) Fix DeepSeek-V4 MTP2 GB200 throughput regression likely tied to FP32->FP4 cvt path (#41015) [1 pull requests, 1 comments, 2 participants]

vllm-project/vllm#41603, fetched 2026-05-05 05:44:47

I'm seeing a large DeepSeek-V4-Pro MTP2 throughput regression on GB200 between a pre-merge PR container and newer vLLM images. A one-off test that reverts the FP32->FP4 cvt path from #41015 on top of the nightly recovers most of the lost throughput, so #41015 looks like the current lead suspect for this workload.

PR fix notes

PR #103: feat(vllm): vllm gb200 dsv4 recipes

Description (problem / solution / changelog)

Draft PR for the vLLM GB200 v0.20.0 branch.

Summary:

  • Adds a self-contained lm-eval benchmark runner for GSM8K-style evals against an OpenAI-compatible chat endpoint.
  • Keeps the existing SGLang-dependent gsm8k runner untouched; this new path uses python3 -m lm_eval --model local-chat-completions and does not require SGLang or an InferenceX workspace mount.
  • Bundles the GSM8K task YAML, score thresholds, and score validator under src/srtctl/benchmarks/scripts/lm-eval/.
  • Updates only recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-1p4d-dep8-tp8-c256-c512-offload.yaml to run lm-eval with the bundled GSM8K task and VALIDATE_EVAL_SCORES=true.

Validation:

  • Local smoke: launched a small Qwen/Qwen2.5-0.5B-Instruct OpenAI-compatible chat endpoint and ran the bundled script with EVAL_LIMIT=2, EVAL_NUM_FEWSHOT=0, EVAL_CONC=1; it produced meta_env.json, results_*.json, and samples_*.jsonl successfully.
  • bash -n src/srtctl/benchmarks/scripts/lm-eval/bench.sh passed.
  • python3 -m py_compile src/srtctl/benchmarks/lm_eval.py src/srtctl/cli/do_sweep.py src/srtctl/benchmarks/scripts/lm-eval/validate_scores.py passed.
  • Focused tests passed: tests/test_benchmarks.py::TestLMEvalRunner, TestRunPostEval, and TestScriptsExist.
  • UV_DEFAULT_INDEX=https://pypi.org/simple make check passed (635 passed, 2 skipped, 6 deselected).
  • The known ty diagnostic in src/srtctl/core/validation.py is still emitted under the existing || true Makefile behavior.
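
For orientation, the smoke-test settings above could translate into an lm-eval command line roughly as sketched below. This is not the contents of bench.sh: only `python3 -m lm_eval --model local-chat-completions` comes from the PR description. The helper name `build_lm_eval_argv`, the `EVAL_MODEL` and `EVAL_BASE_URL` variables, the endpoint URL, the `gsm8k` task name, and the mapping from `EVAL_LIMIT`, `EVAL_NUM_FEWSHOT`, and `EVAL_CONC` to flags are all assumptions made for illustration.

```python
import os


def build_lm_eval_argv(env=None):
    """Build an lm-eval argv for an OpenAI-compatible chat endpoint.

    Hypothetical helper: the env-var-to-flag mapping is an assumption and
    may differ from what the bundled bench.sh actually does.
    """
    env = os.environ if env is None else env
    # EVAL_MODEL and EVAL_BASE_URL are made-up names for this sketch.
    model = env.get("EVAL_MODEL", "Qwen/Qwen2.5-0.5B-Instruct")
    base_url = env.get(
        "EVAL_BASE_URL", "http://localhost:8000/v1/chat/completions"
    )
    model_args = ",".join(
        [
            f"model={model}",
            f"base_url={base_url}",
            f"num_concurrent={env.get('EVAL_CONC', '1')}",
        ]
    )
    argv = [
        "python3", "-m", "lm_eval",
        "--model", "local-chat-completions",
        "--model_args", model_args,
        "--tasks", "gsm8k",  # assumed task name; the PR bundles its own YAML
        "--num_fewshot", env.get("EVAL_NUM_FEWSHOT", "0"),
        "--apply_chat_template",
    ]
    limit = env.get("EVAL_LIMIT")
    if limit:
        # Only cap the number of samples when a limit was requested.
        argv += ["--limit", limit]
    return argv


if __name__ == "__main__":
    smoke_env = {"EVAL_LIMIT": "2", "EVAL_NUM_FEWSHOT": "0", "EVAL_CONC": "1"}
    print(" ".join(build_lm_eval_argv(smoke_env)))
```

Building the argv as a list rather than a shell string keeps the arguments unambiguous if the command is later passed to `subprocess.run`.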

Changed files

  • Makefile (modified, +18/-4)
  • configs/patches/vllm-container-deps-one-sided.sh (added, +17/-0)
  • configs/patches/vllm-container-deps-pr41015-fp4-fix.sh (added, +8/-0)
  • configs/patches/vllm_numa_bind_hash_fix.py (modified, +26/-1)
  • configs/patches/vllm_nvlink_one_sided_bf16_fix_v20.py (added, +382/-0)
  • configs/patches/vllm_revert_pr41015_fp4_cvt.py (added, +189/-0)
  • issue-41603/README.md (added, +22/-0)
  • issue-41603/good-pr-container-15908/benchmark-rollup.csv (added, +2/-0)
  • issue-41603/good-pr-container-15908/recipe.lock.yaml (added, +2666/-0)
  • issue-41603/nightly-a749-15902/benchmark-rollup.csv (added, +2/-0)
  • issue-41603/nightly-a749-15902/recipe.lock.yaml (added, +2678/-0)
  • issue-41603/nightly-a749-revert-pr41015-15975/benchmark-rollup.csv (added, +2/-0)
  • issue-41603/nightly-a749-revert-pr41015-15975/recipe.lock.yaml (added, +2679/-0)
  • issue-41603/official-v0201-15963/benchmark-rollup.csv (added, +2/-0)
  • issue-41603/official-v0201-15963/recipe.lock.yaml (added, +2574/-0)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/agg-gb200-low-latency-mtp2.yaml (added, +85/-0)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-high-tpt-megamoe-mtp2.yaml (renamed, +41/-21)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-high-tpt-megamoe.yaml (renamed, +19/-12)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-latency-mtp2.yaml (added, +125/-0)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-latency.yaml (renamed, +12/-7)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-middle-curve.yaml (renamed, +11/-5)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-max-tpt-megamoe.yaml (renamed, +19/-12)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve-megamoe-mtp2.yaml (added, +144/-0)
  • recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve-megamoe.yaml (renamed, +19/-16)
  • src/srtctl/backends/vllm.py (modified, +10/-1)
  • src/srtctl/benchmarks/__init__.py (modified, +2/-0)
  • src/srtctl/benchmarks/lm_eval.py (added, +56/-0)
  • src/srtctl/benchmarks/scripts/lm-eval/bench.sh (added, +380/-0)
  • src/srtctl/benchmarks/scripts/lm-eval/gsm8k.yaml (added, +51/-0)
  • src/srtctl/benchmarks/scripts/lm-eval/thresholds.json (added, +3/-0)
  • src/srtctl/benchmarks/scripts/lm-eval/validate_scores.py (added, +87/-0)
  • src/srtctl/cli/do_sweep.py (modified, +130/-2)
  • src/srtctl/cli/submit.py (modified, +3/-1)
  • src/srtctl/core/schema.py (modified, +1/-0)
  • tests/test_benchmarks.py (modified, +163/-0)
  • tests/test_vllm_cli_args.py (added, +14/-0)

Summary

I'm seeing a large DeepSeek-V4-Pro MTP2 throughput regression on GB200 between a pre-merge PR container and newer vLLM images. A one-off test that reverts the FP32->FP4 cvt path from #41015 on top of the nightly recovers most of the lost throughput, so #41015 looks like the current lead suspect for this workload.

Config

  • Model: deepseek-ai/DeepSeek-V4-Pro
  • Backend: Dynamo + vLLM, disaggregated serving
  • Hardware: GB200, 24 GPUs total
  • Layout: 2P/1D high-throughput MegaMOE
    • Prefill: TP8 / EP8 / dp_attention=true / 2 workers = 16 GPUs
    • Decode: TP8 / EP8 / dp_attention=true / 1 worker = 8 GPUs
  • Workload: ISL 8192 / OSL 1024, concurrency 1024, 10,240 completed requests
  • Features: MTP2 speculative decoding, FP4 indexer cache
  • Same srt-slurm recipe/knobs for the main comparison; only the container or patch changed.
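
As a quick sanity check on the layout above, the GPU count can be reproduced with a few lines of Python. This is a minimal sketch that assumes one GPU per TP rank (with EP sharing the same GPUs); the `WorkerGroup` class is a hypothetical name, and the nominal token volume assumes every request uses exactly the stated ISL and OSL, which real runs may not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerGroup:
    """One role (prefill or decode) in the disaggregated layout."""

    role: str
    tensor_parallel: int
    workers: int

    @property
    def gpus(self) -> int:
        # Assumption: one GPU per TP rank; EP reuses the same GPUs.
        return self.tensor_parallel * self.workers


layout = [
    WorkerGroup("prefill", tensor_parallel=8, workers=2),
    WorkerGroup("decode", tensor_parallel=8, workers=1),
]
total_gpus = sum(group.gpus for group in layout)

# Nominal token volume if every request is exactly ISL in and OSL out.
ISL, OSL, REQUESTS = 8192, 1024, 10_240
nominal_tokens = REQUESTS * (ISL + OSL)

print(total_gpus)      # 24
print(nominal_tokens)  # 94371840
```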

Metrics below use InferenceX conventions:

  • total tok/s/GPU = (input tokens + output tokens) / benchmark duration / 24 GPUs
  • tok/s/user = 1000 / median TPOT_ms
  • AL (derived) = 1 + accepted_tokens / (drafted_token_positions / 2) for MTP2. The vLLM log field is named Drafted; in these runs it appears to count drafted token positions, so I divide by num_speculative_tokens=2 to infer draft iterations.
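
The three definitions above can be written out as a short Python sketch. The function names are illustrative and not part of InferenceX or vLLM; the sample inputs are the values reported for SA job 15975 in the Results section.

```python
def total_tok_per_s_per_gpu(input_tokens, output_tokens, duration_s, num_gpus=24):
    """Total (input + output) token throughput, normalised per GPU."""
    return (input_tokens + output_tokens) / duration_s / num_gpus


def tok_per_s_per_user(median_tpot_ms):
    """Per-user decode rate derived from median time per output token."""
    return 1000.0 / median_tpot_ms


def derived_al(accepted_tokens, drafted_token_positions, num_speculative_tokens=2):
    """Acceptance length, treating 'Drafted' as drafted token positions."""
    inferred_drafts = drafted_token_positions / num_speculative_tokens
    return 1.0 + accepted_tokens / inferred_drafts


# SA job 15975 (nightly + revert of #41015), values taken from the Results tables.
print(round(tok_per_s_per_user(45.97), 2))         # 21.75
print(round(derived_al(6_531_010, 9_586_020), 3))  # 2.363
print(round(175_827 / 24))                         # 7326 total tok/s/GPU
```

If the `Drafted` field turned out to count draft iterations rather than token positions, `num_speculative_tokens=1` would give the corresponding AL without further changes.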

Reproducer and artifacts

Results

| Run | State | total tok/s/GPU | tok/s/user | median TPOT | AL (derived) | total tok/s | Notes |
|---|---|---|---|---|---|---|---|
| PR container: vLLM 0.20.1rc1.dev38+g61c3a50f4 | completed | ~7,489 | ~22.9 | ~43.7 ms | 2.358 | ~179,747 | Good reference run from pre-merge PR container; SA job 15908 |
| Official vllm/vllm-openai:v0.20.1-ubuntu2404 | completed | 5,860 | 16.71 | 59.86 ms | 2.359 | ~140,634 | InferenceX/GHA run; SA job 15970; still substantially below good reference |
| Official vllm/vllm-openai:v0.20.1-ubuntu2404 | completed | 2,669 | 36.62 | 27.31 ms | 2.366 | 64,051 | SA job 15963; bad high-throughput result with very high median TTFT (~74.2 s) |
| Nightly: vLLM 0.20.1rc1.dev91+ga749a33d8 | completed | 5,305 | 15.0 | 66.83 ms | 2.360 | 127,326 | Regression with same recipe; SA job 15902 |
| Nightly + revert #41015 FP32->FP4 cvt path | completed | 7,326 | 21.75 | 45.97 ms | 2.363 | 175,827 | Recovers most of the regression; SA job 15975 |

The #41015 revert is the only test that brought total throughput back near the pre-merge PR container. It recovered roughly 92% of the lost total-token throughput versus the nightly regression.
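
The recovery figure can be reproduced from the reported throughput numbers. The sketch below uses a hypothetical helper, `recovered_fraction`, and takes the pre-merge PR container (job 15908) as the good reference and the nightly (job 15902) as the regressed baseline; note that the reference values are themselves reported as approximate.

```python
def recovered_fraction(good, regressed, patched):
    """Share of the good-to-regressed gap that the patched run wins back."""
    lost = good - regressed
    if lost <= 0:
        raise ValueError("regressed run must be slower than the good reference")
    return (patched - regressed) / lost


# total tok/s: PR container 15908, nightly 15902, nightly + revert 15975
total = recovered_fraction(179_747, 127_326, 175_827)
# the same comparison on total tok/s/GPU
per_gpu = recovered_fraction(7_489, 5_305, 7_326)

print(f"{total:.1%}")    # 92.5%
print(f"{per_gpu:.1%}")  # 92.5%
```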

AL source counters, summed across decode worker SpecDecoding metrics log lines:

| Run/job | Accepted tokens | Drafted token positions | Inferred drafts | AL (derived) |
|---|---|---|---|---|
| PR container / 15908 | 6,521,768 | 9,604,386 | 4,802,193 | 2.358 |
| Official 0.20.1 GHA / 15970 | 6,523,602 | 9,600,424 | 4,800,212 | 2.359 |
| Official 0.20.1 SA / 15963 | 6,537,747 | 9,572,138 | 4,786,069 | 2.366 |
| Nightly a749 / 15902 | 6,525,561 | 9,596,488 | 4,798,244 | 2.360 |
| Nightly a749 + revert #41015 / 15975 | 6,531,010 | 9,586,020 | 4,793,010 | 2.363 |

AL is essentially flat across these runs, including the regression and the #41015-revert recovery, so the current evidence does not point to acceptance length as the direct cause of the throughput drop.
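
To make "essentially flat" concrete, the following sketch recomputes AL from the counters above and compares its spread with the spread in total tok/s/GPU across the same five runs. The dictionary layout and the `spread` helper are illustrative choices; the numbers are copied from the two results tables.

```python
# job id -> (accepted tokens, drafted token positions, total tok/s/GPU)
runs = {
    "15908": (6_521_768, 9_604_386, 7_489),
    "15970": (6_523_602, 9_600_424, 5_860),
    "15963": (6_537_747, 9_572_138, 2_669),
    "15902": (6_525_561, 9_596_488, 5_305),
    "15975": (6_531_010, 9_586_020, 7_326),
}

NUM_SPECULATIVE_TOKENS = 2  # MTP2


def spread(values):
    """Relative gap between the largest and smallest value."""
    return (max(values) - min(values)) / max(values)


# AL = 1 + accepted / (drafted positions / num_speculative_tokens)
al = {
    job: 1.0 + accepted / (drafted / NUM_SPECULATIVE_TOKENS)
    for job, (accepted, drafted, _) in runs.items()
}
throughput = [tps for _, _, tps in runs.values()]

for job, value in al.items():
    print(job, round(value, 3))

print(f"AL spread: {spread(list(al.values())):.1%}")        # 0.3%
print(f"throughput spread: {spread(throughput):.1%}")       # 64.4%
```

AL varies by well under one percent while throughput varies by tens of percent, which is consistent with the observation that acceptance length does not explain the drop.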

Request

Could someone familiar with the #41015 FP32->FP4 cvt path check whether it is expected to affect the DSV4 FP4/indexer + MTP2 path on GB200? This may be GB200/DeepSeek-V4/MTP2 specific. The lockfiles and benchmark rollups linked above should make the exact repro configs easier to inspect.

Extent analysis

TL;DR

Reverting the FP32->FP4 conversion path from #41015 may help recover the lost throughput in DeepSeek-V4-Pro MTP2 on GB200.

Guidance

  • Investigate the FP32->FP4 conversion path in #41015 to determine its impact on the DSV4 FP4/indexer + MTP2 path on GB200.
  • Review the lockfiles and benchmark rollups provided to understand the exact repro configs.
  • Consider testing the #41015 revert on other hardware or models to determine whether the regression is specific to GB200/DeepSeek-V4/MTP2.
  • Analyze the AL source counters to confirm that acceptance length is not the direct cause of the throughput drop.

Example

No code snippet is provided as the issue does not require a code change, but rather an investigation into the FP32->FP4 conversion path.

Notes

The regression may be specific to GB200 with DeepSeek-V4-Pro and MTP2; this has not been confirmed. Reverting the #41015 FP32->FP4 cvt path recovers most of the lost throughput, which makes that change the lead suspect, but a single revert test does not establish it as the root cause. Further investigation is needed to determine the exact cause of the throughput regression.

Recommendation

Apply the workaround of reverting the #41015 FP32->FP4 cvt path, as it has been shown to recover most of the lost throughput in the provided test results.
