vllm - 💡(How to fix) Fix [Bug]: DP replica under-utilization with Qwen3-8B (tp=4, dp=2) on 8x A100 PCIe [1 participants]

Code Example

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

---

vllm serve /home/skl/mkx/model/Qwen2.5-Coder-7B-Instruct \
  --host 127.0.0.1 \
  --port 10872 \
  --served-model-name Qwen2.5-Coder-7B-Instruct \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

Your current environment

vLLM version: 0.16.0
CUDA driver: 550.54.15
CUDA version: 12.4
GPUs: 8 x NVIDIA A100-PCIE-40GB
Interconnect: PCIe-only (no NVLink)
NCCL version reported by vLLM: 2.27.5
Python: 3.10
OS: Ubuntu Linux

🐛 Describe the bug

Model

Problematic model: Qwen3-8B
Comparison model: Qwen2.5-Coder-7B-Instruct

Command

For Qwen3-8B:

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

For Qwen2.5-Coder-7B-Instruct:

vllm serve /home/skl/mkx/model/Qwen2.5-Coder-7B-Instruct \
  --host 127.0.0.1 \
  --port 10872 \
  --served-model-name Qwen2.5-Coder-7B-Instruct \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

vLLM logs show that api_server_count defaults to data_parallel_size (2) in both cases.

What I observe

When serving Qwen3-8B with tp=4, dp=2 on 8 A100 PCIe GPUs:

All 8 GPUs load model weights into memory successfully. The two TP groups appear to be: DP replica 0: local_rank 0,1,2,3 DP replica 1: local_rank 4,5,6,7 However, during inference, only GPUs 4-7 show high utilization, while GPUs 0-3 remain near 0% utilization, even though they keep almost full memory allocated. From nvidia-smi, GPUs 0-3 are mostly idle, while GPUs 4-7 are ~85%-98% utilized. This makes it look like only one DP replica is actually doing work.

For Qwen2.5-Coder-7B-Instruct under the same tp=4, dp=2 setup, GPU utilization appears more distributed in practice, even though the HTTP request logs may still mostly show a single API server process handling requests.

Why this seems suspicious

This does not look like a simple model-loading failure:

For Qwen3-8B, all 8 GPUs allocate nearly full memory. vLLM logs show both TP groups initialized successfully. NCCL initializes successfully. The issue appears at runtime / scheduling / execution time, not at startup.

This also makes me suspect the issue is not solely explained by api_server_count=2, because the same default happens for Qwen2.5-Coder-7B-Instruct, but the runtime GPU utilization behavior is different.

Relevant log details

For Qwen3-8B:

Resolved architecture: Qwen3ForCausalLM Using max model len 40960 tensor_parallel_size=4 data_parallel_size=2 api_server_count=2 world_size=4 rank=0 local_rank=0/4 ... indicating two separate TP groups Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs SymmMemCommunicator: Device capability 8.0 not supported

For Qwen2.5-Coder-7B-Instruct:

Resolved architecture: Qwen2ForCausalLM Using max model len 32768 same tensor_parallel_size=4 same data_parallel_size=2 same api_server_count=2

One notable difference is that Qwen3-8B runs with max model len 40960, while Qwen2.5-Coder-7B-Instruct runs with 32768.

Expected behavior

With tp=4, dp=2 on 8 GPUs, I expect both DP replicas to participate in serving requests under load, so that both GPU groups (0-3 and 4-7) show meaningful compute utilization.

Actual behavior

For Qwen3-8B, only one 4-GPU group seems to execute inference workloads, while the other 4-GPU group remains memory-resident but mostly idle.

Additional notes This is on PCIe-only A100s, so I understand the custom all-reduce warning is expected and probably not the root cause. The issue seems model-dependent, because I do not see the same behavior as clearly with Qwen2.5-Coder-7B-Instruct under the same topology and launch flags. I would like to know whether this is: expected behavior due to scheduler / queueing with this workload, a known issue with Qwen3 on vLLM 0.16.0, related to V1 engine / cudagraph / long-context behavior, or an actual DP scheduling bug.

Questions

Is this a known issue for Qwen3 models on vLLM 0.16.0 with tp=4, dp=2? Should a single API server process still be able to drive both DP replicas evenly in this setup? Is there any known interaction between: Qwen3 max_model_len=40960 V1 engine cudagraph capture and DP scheduling / replica utilization? Would you recommend testing with: --api-server-count 1 --enforce-eager smaller --max-model-len or a newer vLLM version first? Minimal reproduction

On a machine with 8 x A100-PCIE-40GB, launch Qwen3-8B with:

vllm serve /path/to/Qwen3-8B --tensor-parallel-size 4 --data-parallel-size 2

Then send repeated /v1/chat/completions requests.

Observed:

all 8 GPUs hold model memory only one 4-GPU replica shows sustained compute utilization the other 4 GPUs stay mostly idle

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The issue with uneven GPU utilization in Qwen3-8B model serving can be addressed by testing with a different configuration, such as adjusting the --api-server-count or --max-model-len parameters.

Guidance

Investigate the impact of --api-server-count 1 on GPU utilization to determine if the issue is related to the default value of api_server_count being set to data_parallel_size.
Test with a smaller --max-model-len to see if the issue is related to the large model length of 40960.
Consider upgrading to a newer version of vLLM to ensure that any known issues with DP scheduling or replica utilization have been addressed.
Verify that the issue is not specific to the Qwen3 model by testing with other models and configurations.

Example

No specific code example is provided, but the commands to test the suggested configurations could be:

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --api-server-count 1

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --max-model-len 32768

Notes

The issue appears to be model-dependent and may be related to the large model length or the default value of api_server_count. Further testing and investigation are needed to determine the root cause and the most effective solution.

Recommendation

Apply a workaround by testing with a different configuration, such as --api-server-count 1 or a smaller --max-model-len, to determine if this resolves the issue with uneven GPU utilization.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: DP replica under-utilization with Qwen3-8B (tp=4, dp=2) on 8x A100 PCIe [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Model

Command

Why this seems suspicious

Relevant log details

Expected behavior

Actual behavior

Questions

Observed:

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: DP replica under-utilization with Qwen3-8B (tp=4, dp=2) on 8x A100 PCIe [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Model

Command

Why this seems suspicious

Relevant log details

Expected behavior

Actual behavior

Questions

Observed:

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

FAQ

Expected behavior

Still need to ship something?

RELATED_DISCOVERY

TRENDING