vllm: [Bug]: v0.20 latency and throughput regression on MoE models [1 comment, 2 participants]

GitHub stats
vllm-project/vllm#41306 · Fetched 2026-04-30 06:18:55
Comments: 1 · Participants: 2 · Timeline: 3 · Reactions: 0
Timeline (top): commented ×1, labeled ×1, subscribed ×1


Your current environment

  • vLLM v0.19.0 vs v0.20.0 (vllm/vllm-openai:v0.19.0 and vllm/vllm-openai:v0.20.0)
  • 8× NVIDIA H200 (p5en.48xlarge)

🐛 Describe the bug

v0.20.0 introduces a latency and throughput regression on MoE models compared to v0.19.0. Dense models are not affected.

300 prompts, max-concurrency=1, sonnet (input=4096, output=512):

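| Model | Type | Params | v0.19 TPOT | v0.20 TPOT | Delta | v0.19 TTFT | v0.20 TTFT | Delta | v0.19 tput | v0.20 tput | Delta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | Dense | 8B | 2.85ms | 2.87ms | +0.7% ✅ | 46.2ms | 47.3ms | +2.4% ✅ | 340 tok/s | 338 tok/s | -0.6% ✅ |
| Llama-3.1-70B | Dense | 70B | 7.56ms | 7.59ms | +0.4% ✅ | 160.6ms | 162.6ms | +1.2% ✅ | 125 tok/s | 125 tok/s | 0% ✅ |
| Mixtral-8x7B | MoE | 46B | 2.66ms | 3.22ms | +21% | 55.2ms | 87.7ms | +59% | 306 tok/s | 248 tok/s | -19% |
| DeepSeek-V2-Chat | MoE+MLA | 236B | 8.73ms | 9.14ms | +4.7% | 135.4ms | 167.0ms | +23% | 111 tok/s | 106 tok/s | -4.5% |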
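
The Delta columns can be re-derived from the raw numbers. The following is a minimal Python sketch that does so; the dictionary layout and the names `RESULTS`, `pct_delta`, and `summarize` are illustrative and not part of vLLM or its benchmark tooling. Only the measurements themselves come from the table.

```python
# Measurements copied from the table in the issue: (v0.19, v0.20) per metric.
# The structure and names here are illustrative, not a vLLM output format.
RESULTS = {
    "Llama-3.1-8B": {
        "tpot_ms": (2.85, 2.87), "ttft_ms": (46.2, 47.3), "tput_tok_s": (340, 338)},
    "Llama-3.1-70B": {
        "tpot_ms": (7.56, 7.59), "ttft_ms": (160.6, 162.6), "tput_tok_s": (125, 125)},
    "Mixtral-8x7B": {
        "tpot_ms": (2.66, 3.22), "ttft_ms": (55.2, 87.7), "tput_tok_s": (306, 248)},
    "DeepSeek-V2-Chat": {
        "tpot_ms": (8.73, 9.14), "ttft_ms": (135.4, 167.0), "tput_tok_s": (111, 106)},
}


def pct_delta(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def summarize(results: dict) -> dict:
    """Return per-model, per-metric percentage deltas rounded to one decimal."""
    return {
        model: {metric: round(pct_delta(*pair), 1) for metric, pair in metrics.items()}
        for model, metrics in results.items()
    }


if __name__ == "__main__":
    for model, deltas in summarize(RESULTS).items():
        print(f"{model}: {deltas}")
```

The computed values agree with the rounded deltas reported in the table: the dense models stay within a few percent on every metric, while Mixtral-8x7B moves by roughly +21% TPOT, +59% TTFT, and -19% throughput.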

Reproduction

Mixtral-8x7B (clearest regression)

# Start server
docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Run benchmark
vllm bench serve \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --backend vllm \
  --dataset-name sonnet \
  --dataset-path sonnet.txt \
  --sonnet-input-len 4096 \
  --sonnet-output-len 512 \
  --num-prompts 300 \
  --max-concurrency 1

# Stop and repeat with vllm/vllm-openai:v0.20.0

DeepSeek-V2-Chat (236B)

docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model deepseek-ai/DeepSeek-V2-Chat \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Same benchmark command with --model deepseek-ai/DeepSeek-V2-Chat

Control: Llama-3.1-70B (no regression)

docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Same benchmark command with --model meta-llama/Llama-3.1-70B-Instruct

The regression appears to originate from the MoE runner refactor in v0.20 (PluggableLayer, modular MoEPrepareAndFinalize, DefaultMoERunner split/recombine) introducing additional CPU dispatch overhead between GPU kernel launches.
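
To put a rough size on this hypothesis, the Mixtral-8x7B TPOT gap can be divided across the model's decoder layers. The sketch below is back-of-envelope arithmetic only. It assumes 32 decoder layers (the value in the public Mixtral-8x7B configuration, not stated in this issue) and that any added overhead is spread evenly across layers, which the issue does not establish. The function name is illustrative.

```python
# Assumption: Mixtral-8x7B has 32 decoder layers, each with one MoE block.
# This figure comes from the public model config, not from the issue.
MIXTRAL_NUM_LAYERS = 32


def extra_per_layer_us(tpot_old_ms: float, tpot_new_ms: float, num_layers: int) -> float:
    """Added time per layer per decode step, in microseconds,
    if the whole TPOT gap were spread evenly across layers."""
    return (tpot_new_ms - tpot_old_ms) * 1000.0 / num_layers


if __name__ == "__main__":
    # TPOT values for Mixtral-8x7B taken from the results table (v0.19 vs v0.20).
    budget = extra_per_layer_us(2.66, 3.22, MIXTRAL_NUM_LAYERS)
    print(f"~{budget:.1f} us of extra time per layer per decode step")
```

Under these assumptions the 0.56 ms gap works out to about 17.5 µs per layer per decode step. That number is only a budget to compare against profiler traces; it does not by itself confirm that CPU dispatch is the cause.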

extent analysis

TL;DR

The issue attributes the v0.20.0 latency and throughput regression to the MoE runner refactor, which it suspects of adding CPU dispatch overhead. If that is confirmed, the likely fix is to revert or rework the refactor.

Guidance

  • Investigate the MoE runner refactor in v0.20.0 and identify the specific changes that introduced the additional CPU dispatch overhead.
  • Consider reverting or modifying the PluggableLayer, modular MoEPrepareAndFinalize, and DefaultMoERunner changes to reduce the overhead.
  • Run benchmarks with different compilation configurations to see if the cudagraph_mode or compile_sizes have an impact on the regression.
  • Compare the performance of v0.19.0 and v0.20.0 with different models, such as Llama-3.1-70B, to understand the scope of the regression.
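
The version and compilation-config comparison suggested above can be laid out as a small command matrix. The following Python sketch only builds the `docker run` command strings; it does not start any containers. The image tags and server flags are copied from the reproduction steps. The `PIECEWISE` and `NONE` values for `cudagraph_mode` are assumptions about what the installed vLLM version accepts and should be checked against its documentation. The function names are illustrative.

```python
import itertools
import json
import shlex

# Image tags from the issue's reproduction steps.
IMAGES = ["vllm/vllm-openai:v0.19.0", "vllm/vllm-openai:v0.20.0"]

# "FULL_AND_PIECEWISE" is the mode used in the issue. The other two values
# are assumptions; verify them against the vLLM version under test.
CUDAGRAPH_MODES = ["FULL_AND_PIECEWISE", "PIECEWISE", "NONE"]


def server_command(image: str, model: str, cudagraph_mode: str,
                   compile_sizes=(1, 2, 4, 8)) -> str:
    """Build one shell-quoted docker command mirroring the issue's flags."""
    config = json.dumps(
        {"compile_sizes": list(compile_sizes), "cudagraph_mode": cudagraph_mode},
        separators=(",", ":"),
    )
    args = [
        "docker", "run", "-d", "--name", "vllm_test", "--gpus", "all",
        "--network", "host", "--shm-size=16g", image,
        "--model", model,
        "--tensor-parallel-size", "8",
        "--max-model-len", "16384",
        "--max-num-seqs", "16",
        "--dtype", "bfloat16",
        "--compilation-config", config,
    ]
    return shlex.join(args)


def command_matrix(model: str) -> list:
    """One command per (image, cudagraph_mode) pair."""
    return [server_command(image, model, mode)
            for image, mode in itertools.product(IMAGES, CUDAGRAPH_MODES)]


if __name__ == "__main__":
    for cmd in command_matrix("mistralai/Mixtral-8x7B-Instruct-v0.1"):
        print(cmd)
```

Each printed command would be run in turn, followed by the same `vllm bench serve` invocation from the reproduction steps, with the container stopped and removed between runs. If the gap between v0.19.0 and v0.20.0 changes with `cudagraph_mode`, that points toward launch or dispatch overhead rather than the kernels themselves.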

Notes

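The regression appears to be specific to MoE models: both dense Llama models stay within about 2.4% of their v0.19 numbers on every metric, while Mixtral-8x7B and DeepSeek-V2-Chat regress on all three. The suspected cause is the MoE runner refactor, but this has not been confirmed, and further investigation is needed to pin down the exact change and develop a fix.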

Recommendation

As a workaround, stay on (or roll back to) v0.19.0 for MoE deployments until the issue is resolved. On Mixtral-8x7B the measured regression is +21% TPOT, +59% TTFT, and -19% throughput.
