vllm - ✅(Solved) Fix DeepSeek-V4 MTP2 GB200 throughput regression likely tied to FP32->FP4 cvt path (#41015) [1 pull request, 1 comment, 2 participants]

alec-flowers · 2026-05-04T04:22:53Z




Issue: vllm-project/vllm#41603 (fetched 2026-05-05 05:44:47)


Participants

alec-flowers

ywang96


I'm seeing a large DeepSeek-V4-Pro MTP2 throughput regression on GB200 between a pre-merge PR container and newer vLLM images. A one-off test that reverts the FP32->FP4 cvt path from #41015 on top of the nightly recovers most of the lost throughput, so #41015 looks like the current lead suspect for this workload.

PR fix notes

PR #103: feat(vllm): vllm gb200 dsv4 recipes

Repository: NVIDIA/srt-slurm
Author: alec-flowers
State: open | merged: False
Link: https://github.com/NVIDIA/srt-slurm/pull/103

Description (problem / solution / changelog)

Draft PR for the vLLM GB200 v0.20.0 branch.

Summary:

- Adds a self-contained `lm-eval` benchmark runner for GSM8K-style evals against an OpenAI-compatible chat endpoint.
- Keeps the existing SGLang-dependent `gsm8k` runner untouched; this new path uses `python3 -m lm_eval --model local-chat-completions` and does not require SGLang or an InferenceX workspace mount.
- Bundles the GSM8K task YAML, score thresholds, and score validator under `src/srtctl/benchmarks/scripts/lm-eval/`.
- Updates only `recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-1p4d-dep8-tp8-c256-c512-offload.yaml` to run `lm-eval` with the bundled GSM8K task and `VALIDATE_EVAL_SCORES=true`.

Validation:

- Local smoke: launched a small `Qwen/Qwen2.5-0.5B-Instruct` OpenAI-compatible chat endpoint and ran the bundled script with `EVAL_LIMIT=2`, `EVAL_NUM_FEWSHOT=0`, `EVAL_CONC=1`; it produced `meta_env.json`, `results_*.json`, and `samples_*.jsonl` successfully.
- `bash -n src/srtctl/benchmarks/scripts/lm-eval/bench.sh` passed.
- `python3 -m py_compile src/srtctl/benchmarks/lm_eval.py src/srtctl/cli/do_sweep.py src/srtctl/benchmarks/scripts/lm-eval/validate_scores.py` passed.
- Focused tests passed: `tests/test_benchmarks.py::TestLMEvalRunner`, `TestRunPostEval`, and `TestScriptsExist`.
- `UV_DEFAULT_INDEX=https://pypi.org/simple make check` passed (`635 passed, 2 skipped, 6 deselected`).
- The known `ty` diagnostic in `src/srtctl/core/validation.py` is still emitted under the existing `|| true` Makefile behavior.

Changed files

Makefile (modified, +18/-4)
configs/patches/vllm-container-deps-one-sided.sh (added, +17/-0)
configs/patches/vllm-container-deps-pr41015-fp4-fix.sh (added, +8/-0)
configs/patches/vllm_numa_bind_hash_fix.py (modified, +26/-1)
configs/patches/vllm_nvlink_one_sided_bf16_fix_v20.py (added, +382/-0)
configs/patches/vllm_revert_pr41015_fp4_cvt.py (added, +189/-0)
issue-41603/README.md (added, +22/-0)
issue-41603/good-pr-container-15908/benchmark-rollup.csv (added, +2/-0)
issue-41603/good-pr-container-15908/recipe.lock.yaml (added, +2666/-0)
issue-41603/nightly-a749-15902/benchmark-rollup.csv (added, +2/-0)
issue-41603/nightly-a749-15902/recipe.lock.yaml (added, +2678/-0)
issue-41603/nightly-a749-revert-pr41015-15975/benchmark-rollup.csv (added, +2/-0)
issue-41603/nightly-a749-revert-pr41015-15975/recipe.lock.yaml (added, +2679/-0)
issue-41603/official-v0201-15963/benchmark-rollup.csv (added, +2/-0)
issue-41603/official-v0201-15963/recipe.lock.yaml (added, +2574/-0)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/agg-gb200-low-latency-mtp2.yaml (added, +85/-0)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-high-tpt-megamoe-mtp2.yaml (renamed, +41/-21)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-high-tpt-megamoe.yaml (renamed, +19/-12)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-latency-mtp2.yaml (added, +125/-0)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-latency.yaml (renamed, +12/-7)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-low-middle-curve.yaml (renamed, +11/-5)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-max-tpt-megamoe.yaml (renamed, +19/-12)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve-megamoe-mtp2.yaml (added, +144/-0)
recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-mid-curve-megamoe.yaml (renamed, +19/-16)
src/srtctl/backends/vllm.py (modified, +10/-1)
src/srtctl/benchmarks/__init__.py (modified, +2/-0)
src/srtctl/benchmarks/lm_eval.py (added, +56/-0)
src/srtctl/benchmarks/scripts/lm-eval/bench.sh (added, +380/-0)
src/srtctl/benchmarks/scripts/lm-eval/gsm8k.yaml (added, +51/-0)
src/srtctl/benchmarks/scripts/lm-eval/thresholds.json (added, +3/-0)
src/srtctl/benchmarks/scripts/lm-eval/validate_scores.py (added, +87/-0)
src/srtctl/cli/do_sweep.py (modified, +130/-2)
src/srtctl/cli/submit.py (modified, +3/-1)
src/srtctl/core/schema.py (modified, +1/-0)
tests/test_benchmarks.py (modified, +163/-0)
tests/test_vllm_cli_args.py (added, +14/-0)


Summary

Config

Model: deepseek-ai/DeepSeek-V4-Pro
Backend: Dynamo + vLLM, disaggregated serving
Hardware: GB200, 24 GPUs total
Layout: 2P/1D high-throughput MegaMOE
- Prefill: TP8 / EP8 / dp_attention=true / 2 workers = 16 GPUs
- Decode: TP8 / EP8 / dp_attention=true / 1 worker = 8 GPUs
Workload: ISL 8192 / OSL 1024, concurrency 1024, 10,240 completed requests
Features: MTP2 speculative decoding, FP4 indexer cache
Same srt-slurm recipe/knobs for the main comparison; only the container or patch changed.

Metrics below use InferenceX conventions:

total tok/s/GPU = (input tokens + output tokens) / benchmark duration / 24 GPUs
tok/s/user = 1000 / median TPOT_ms
AL (derived) = 1 + accepted_tokens / (drafted_token_positions / 2) for MTP2. The vLLM log field is named Drafted; in these runs it appears to count drafted token positions, so I divide by num_speculative_tokens=2 to infer draft iterations.
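The three conventions above can be written as small helpers. This is only a sketch of the stated formulas: the function and parameter names are illustrative and are not taken from InferenceX or vLLM, and the `drafted_positions / num_speculative_tokens` step relies on the same assumption stated above, namely that the `Drafted` log field counts drafted token positions.

```python
def total_tok_per_s_per_gpu(input_tokens: int, output_tokens: int,
                            duration_s: float, num_gpus: int = 24) -> float:
    """(input tokens + output tokens) / benchmark duration / GPU count."""
    return (input_tokens + output_tokens) / duration_s / num_gpus


def tok_per_s_per_user(median_tpot_ms: float) -> float:
    """Per-user decode rate implied by the median time per output token."""
    return 1000.0 / median_tpot_ms


def derived_al(accepted_tokens: int, drafted_positions: int,
               num_speculative_tokens: int = 2) -> float:
    """Derived acceptance length for MTP.

    Assumes drafted_positions counts token positions, so dividing by
    num_speculative_tokens gives the number of draft iterations.
    """
    inferred_drafts = drafted_positions / num_speculative_tokens
    return 1.0 + accepted_tokens / inferred_drafts


# Values taken from the runs reported in this issue.
print(round(tok_per_s_per_user(59.86), 2))          # official v0.20.1, job 15970
print(round(derived_al(6_521_768, 9_604_386), 3))   # PR container, job 15908
```

The benchmark duration is not listed in the issue, so `total_tok_per_s_per_gpu` is shown without a worked value.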

Reproducer and artifacts

srt-slurm reproducer PR: https://github.com/NVIDIA/srt-slurm/pull/103
Primary recipe: https://github.com/NVIDIA/srt-slurm/blob/aflowers/vllm-gb200-v0.20.0/recipes/vllm/deepseek-v4-pro/GB200/8k1k/disagg-gb200-high-tpt-megamoe-mtp2.yaml
Lockfiles and rollups for the runs below: https://github.com/NVIDIA/srt-slurm/tree/aflowers/vllm-gb200-v0.20.0/issue-41603
Local recovery patch used for the Nightly + revert #41015 row:
- https://github.com/NVIDIA/srt-slurm/blob/aflowers/vllm-gb200-v0.20.0/configs/patches/vllm-container-deps-pr41015-fp4-fix.sh
- https://github.com/NVIDIA/srt-slurm/blob/aflowers/vllm-gb200-v0.20.0/configs/patches/vllm_revert_pr41015_fp4_cvt.py

Results

| Run | State | total tok/s/GPU | tok/s/user | median TPOT | AL (derived) | total tok/s | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PR container: `vLLM 0.20.1rc1.dev38+g61c3a50f4` | completed | ~7,489 | ~22.9 | ~43.7 ms | 2.358 | ~179,747 | Good reference run from pre-merge PR container; SA job 15908 |
| Official `vllm/vllm-openai:v0.20.1-ubuntu2404` | completed | 5,860 | 16.71 | 59.86 ms | 2.359 | ~140,634 | InferenceX/GHA run; SA job 15970; still substantially below good reference |
| Official `vllm/vllm-openai:v0.20.1-ubuntu2404` | completed | 2,669 | 36.62 | 27.31 ms | 2.366 | 64,051 | SA job 15963; bad high-throughput result with very high median TTFT (~74.2 s) |
| Nightly: `vLLM 0.20.1rc1.dev91+ga749a33d8` | completed | 5,305 | 15.0 | 66.83 ms | 2.360 | 127,326 | Regression with same recipe; SA job 15902 |
| Nightly + revert #41015 FP32->FP4 cvt path | completed | 7,326 | 21.75 | 45.97 ms | 2.363 | 175,827 | Recovers most of the regression; SA job 15975 |

The #41015 revert is the only test that brought total throughput back near the pre-merge PR container. It recovered roughly 92% of the lost total-token throughput versus the nightly regression.
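The recovery figure can be reproduced from the total tok/s column. The sketch below assumes "lost throughput" means the gap between the PR-container reference and the nightly run; `recovered_fraction` is a hypothetical helper name used only for illustration.

```python
def recovered_fraction(reference: float, regressed: float, patched: float) -> float:
    """Share of the reference-to-regressed gap that the patched run wins back."""
    lost = reference - regressed
    if lost <= 0:
        raise ValueError("reference throughput must exceed regressed throughput")
    return (patched - regressed) / lost


# total tok/s values from the results table in this issue
REFERENCE = 179_747       # PR container, SA job 15908
NIGHTLY = 127_326         # nightly a749, SA job 15902
NIGHTLY_REVERT = 175_827  # nightly + revert #41015, SA job 15975

fraction = recovered_fraction(REFERENCE, NIGHTLY, NIGHTLY_REVERT)
print(f"recovered {fraction:.1%} of the lost total tok/s")
```

Using the per-GPU column instead gives essentially the same ratio, since every run is divided by the same 24 GPUs.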

AL source counters, summed across decode worker SpecDecoding metrics log lines:

| Run/job | Accepted tokens | Drafted token positions | Inferred drafts | AL (derived) |
| --- | --- | --- | --- | --- |
| PR container / 15908 | 6,521,768 | 9,604,386 | 4,802,193 | 2.358 |
| Official 0.20.1 GHA / 15970 | 6,523,602 | 9,600,424 | 4,800,212 | 2.359 |
| Official 0.20.1 SA / 15963 | 6,537,747 | 9,572,138 | 4,786,069 | 2.366 |
| Nightly a749 / 15902 | 6,525,561 | 9,596,488 | 4,798,244 | 2.360 |
| Nightly a749 + revert #41015 / 15975 | 6,531,010 | 9,586,020 | 4,793,010 | 2.363 |

AL is essentially flat across these runs, including the regression and the #41015-revert recovery, so the current evidence does not point to acceptance length as the direct cause of the throughput drop.
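As a cross-check on the "essentially flat" claim, the derived AL can be recomputed from the summed counters and compared across runs. This is a sketch under the same assumption as the metric definition (the `Drafted` field counts token positions, with `num_speculative_tokens=2`); the dictionary layout and names are illustrative only.

```python
NUM_SPECULATIVE_TOKENS = 2  # MTP2

# (accepted tokens, drafted token positions) per run, from the counters in this issue
COUNTERS = {
    "PR container / 15908": (6_521_768, 9_604_386),
    "Official 0.20.1 GHA / 15970": (6_523_602, 9_600_424),
    "Official 0.20.1 SA / 15963": (6_537_747, 9_572_138),
    "Nightly a749 / 15902": (6_525_561, 9_596_488),
    "Nightly a749 + revert #41015 / 15975": (6_531_010, 9_586_020),
}


def derived_al(accepted: int, drafted_positions: int,
               k: int = NUM_SPECULATIVE_TOKENS) -> float:
    """1 + accepted tokens per inferred draft iteration."""
    return 1.0 + accepted / (drafted_positions / k)


al_by_run = {run: derived_al(a, d) for run, (a, d) in COUNTERS.items()}
spread = max(al_by_run.values()) - min(al_by_run.values())

for run, al in al_by_run.items():
    print(f"{run}: AL = {al:.3f}")
print(f"max - min AL spread = {spread:.4f}")
```

The spread between the highest and lowest AL is under 0.01, which is far too small to account for a throughput gap of roughly 30% between the nightly and the reference run.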

Request

Could someone familiar with the #41015 FP32->FP4 cvt path check whether it is expected to affect the DSV4 FP4/indexer + MTP2 path on GB200? This may be GB200/DeepSeek-V4/MTP2 specific. The lockfiles and benchmark rollups linked above should make the exact repro configs easier to inspect.

extent analysis

TL;DR

Reverting the FP32->FP4 conversion path from #41015 may help recover the lost throughput in DeepSeek-V4-Pro MTP2 on GB200.

Guidance

Investigate the FP32->FP4 conversion path in #41015 to determine its impact on the DSV4 FP4/indexer + MTP2 path on GB200.
Review the lockfiles and benchmark rollups provided to understand the exact repro configs.
Consider testing the #41015 revert on other hardware or models to determine whether the regression is specific to GB200/DeepSeek-V4/MTP2.
Analyze the AL source counters to confirm that acceptance length is not the direct cause of the throughput drop.

Example

No code snippet is provided as the issue does not require a code change, but rather an investigation into the FP32->FP4 conversion path.

Notes

The regression may be specific to GB200 with DeepSeek-V4-Pro and MTP2. #41015 is the lead suspect because reverting its FP32->FP4 cvt path recovers most of the throughput, but it has not been confirmed as the root cause. Further investigation is needed to determine the exact cause of the regression.

Recommendation

Apply the workaround of reverting the #41015 FP32->FP4 cvt path, as it has been shown to recover most of the lost throughput in the provided test results.
