vllm - 💡(How to fix) Fix [Bug]: v0.20 latency and throughput regression on MoE models [1 comments, 2 participants]

Your current environment

vLLM v0.19.0 vs v0.20.0 (vllm/vllm-openai:v0.19.0 and vllm/vllm-openai:v0.20.0)
8× NVIDIA H200 (p5en.48xlarge)

🐛 Describe the bug

v0.20.0 introduces a latency and throughput regression on MoE models compared to v0.19.0. Dense models are not affected.

300 prompts, max-concurrency=1, sonnet (input=4096, output=512):

Model	Type	Params	TPOT v0.19	TPOT v0.20	TPOT delta	TTFT v0.19	TTFT v0.20	TTFT delta	Throughput v0.19	Throughput v0.20	Throughput delta
Llama-3.1-8B	Dense	8B	2.85 ms	2.87 ms	+0.7% ✅	46.2 ms	47.3 ms	+2.4% ✅	340 tok/s	338 tok/s	-0.6% ✅
Llama-3.1-70B	Dense	70B	7.56 ms	7.59 ms	+0.4% ✅	160.6 ms	162.6 ms	+1.2% ✅	125 tok/s	125 tok/s	0% ✅
Mixtral-8x7B	MoE	46B	2.66 ms	3.22 ms	+21% ❌	55.2 ms	87.7 ms	+59% ❌	306 tok/s	248 tok/s	-19% ❌
DeepSeek-V2-Chat	MoE+MLA	236B	8.73 ms	9.14 ms	+4.7% ❌	135.4 ms	167.0 ms	+23% ❌	111 tok/s	106 tok/s	-4.5% ❌
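
The delta columns are relative changes from v0.19 to v0.20, i.e. (v0.20 - v0.19) / v0.19. A minimal sketch of that arithmetic, assuming awk is available; the helper name pct_delta is hypothetical and not part of vLLM or its benchmark tooling:

```shell
# pct_delta V19 V20: print the relative change from V19 to V20 as a signed percentage.
# Hypothetical helper for illustration only.
pct_delta() {
  awk -v v19="$1" -v v20="$2" 'BEGIN { printf "%+.1f%%\n", (v20 - v19) / v19 * 100 }'
}

# Mixtral-8x7B figures taken from the table.
pct_delta 2.66 3.22   # TPOT (ms)
pct_delta 55.2 87.7   # TTFT (ms)
pct_delta 306 248     # throughput (tok/s)
```

The three calls should print values that round to the +21%, +59%, and -19% reported for Mixtral-8x7B. For latency metrics a positive delta is a regression; for throughput a negative delta is.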

Reproduction

Mixtral-8x7B (clearest regression)

# Start server
docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Run benchmark
vllm bench serve \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --backend vllm \
  --dataset-name sonnet \
  --dataset-path sonnet.txt \
  --sonnet-input-len 4096 \
  --sonnet-output-len 512 \
  --num-prompts 300 \
  --max-concurrency 1

# Stop and repeat with vllm/vllm-openai:v0.20.0

DeepSeek-V2-Chat (236B)

docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model deepseek-ai/DeepSeek-V2-Chat \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Same benchmark command with --model deepseek-ai/DeepSeek-V2-Chat

Control: Llama-3.1-70B (no regression)

docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
  vllm/vllm-openai:v0.19.0 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 16384 \
  --max-num-seqs 16 \
  --dtype bfloat16 \
  --compilation-config '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Same benchmark command with --model meta-llama/Llama-3.1-70B-Instruct
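
The manual "stop and repeat" step can be scripted so that both images see exactly the same flags. The following is a sketch, not a tested harness: the flags are copied from the commands above, while the helper names (run, bench_one), the DRY_RUN and EXTRA_ARGS variables, the default port 8000, and polling /health for readiness are assumptions added for illustration.

```shell
#!/usr/bin/env bash
# A/B sketch: print (or run) the issue's server and benchmark commands for both images.
# Assumptions not stated in the issue: the server listens on the default port 8000
# and readiness is detected by polling /health.

MODEL="${MODEL:-mistralai/Mixtral-8x7B-Instruct-v0.1}"
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands; 0 = execute them
EXTRA_ARGS="${EXTRA_ARGS:-}"   # e.g. "--gpu-memory-utilization 0.85" for DeepSeek-V2-Chat
COMPILE_CFG='{"compile_sizes":[1,2,4,8],"cudagraph_mode":"FULL_AND_PIECEWISE"}'

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Start the server for one image tag, benchmark it, then remove the container.
bench_one() {
  local tag="$1"
  # EXTRA_ARGS is intentionally unquoted so it splits into separate flags.
  run docker run -d --name vllm_test --gpus all --network host --shm-size=16g \
    "vllm/vllm-openai:${tag}" \
    --model "$MODEL" --tensor-parallel-size 8 --max-model-len 16384 \
    --max-num-seqs 16 --dtype bfloat16 $EXTRA_ARGS \
    --compilation-config "$COMPILE_CFG"

  # Wait until the server answers its health check (skipped in dry-run mode).
  if [ "$DRY_RUN" != "1" ]; then
    until curl -sf http://localhost:8000/health >/dev/null; do sleep 10; done
  fi

  run vllm bench serve --model "$MODEL" --backend vllm \
    --dataset-name sonnet --dataset-path sonnet.txt \
    --sonnet-input-len 4096 --sonnet-output-len 512 \
    --num-prompts 300 --max-concurrency 1

  run docker rm -f vllm_test
}

for tag in v0.19.0 v0.20.0; do
  bench_one "$tag"
done
```

By default the script only prints the commands so they can be reviewed first; setting DRY_RUN=0 executes them. Setting MODEL (and EXTRA_ARGS where needed) switches to the DeepSeek-V2-Chat or Llama-3.1-70B runs.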

The regression appears to originate from the MoE runner refactor in v0.20 (PluggableLayer, modular MoEPrepareAndFinalize, DefaultMoERunner split/recombine) introducing additional CPU dispatch overhead between GPU kernel launches.

extent analysis

TL;DR

The reporter attributes the v0.20.0 regression to the MoE runner refactor, which appears to add CPU dispatch overhead between GPU kernel launches. If that is confirmed, the most likely fix is to revert or rework the parts of the refactor responsible for the overhead.

Guidance

Investigate the MoE runner refactor in v0.20.0 and identify the specific changes that introduced the additional CPU dispatch overhead.
Consider reverting or modifying the PluggableLayer, modular MoEPrepareAndFinalize, and DefaultMoERunner changes to reduce the overhead.
Run benchmarks with different compilation configurations to see if the cudagraph_mode or compile_sizes have an impact on the regression.
The issue already shows that dense models (Llama-3.1-8B, Llama-3.1-70B) are unaffected; compare v0.19.0 and v0.20.0 on additional MoE models to understand the scope of the regression.

Example

No fix snippet is available: the issue is tied to a specific version change and still requires investigation of the MoE runner refactor.
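
The compilation-config comparison suggested under Guidance can still be sketched. The helper below only generates the --compilation-config values to try; compile_cfg is a hypothetical name, and the list of cudagraph_mode values is an assumption that should be checked against the CompilationConfig documentation of each release before use.

```shell
# compile_cfg MODE: emit the --compilation-config JSON used in the issue,
# with cudagraph_mode swapped for MODE. Hypothetical helper for illustration.
compile_cfg() {
  printf '{"compile_sizes":[1,2,4,8],"cudagraph_mode":"%s"}\n' "$1"
}

# Modes to compare; FULL_AND_PIECEWISE is the setting from the issue.
# The other values are assumed to be accepted by both v0.19.0 and v0.20.0.
for mode in NONE PIECEWISE FULL_AND_PIECEWISE; do
  echo "--compilation-config '$(compile_cfg "$mode")'"
done
```

Each printed flag replaces the --compilation-config argument in the server command, with the benchmark rerun on both images. If the gap between versions widens with CUDA graphs disabled and narrows with them enabled, that would be consistent with the CPU dispatch hypothesis, though it would not prove it.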

Notes

The regression appears to be specific to MoE models, and the cause is likely related to the changes in the MoE runner refactor. Further investigation is needed to determine the exact cause and develop a fix.

Recommendation

For MoE workloads, stay on (or roll back to) v0.19.0 until the issue is resolved: on Mixtral-8x7B the regression reaches +21% TPOT, +59% TTFT, and -19% throughput. Dense models can move to v0.20.0 without a measurable penalty in these benchmarks.
