vllm - 💡(How to fix) Fix [Bug]: DP replica under-utilization with Qwen3-8B (tp=4, dp=2) on 8x A100 PCIe [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39384Fetched 2026-04-10 03:40:56
View on GitHub
Comments
0
Participants
1
Timeline
4
Reactions
1
Author
Participants
Timeline (top)
subscribed ×2labeled ×1unsubscribed ×1

Root Cause

This also makes me suspect the issue is not solely explained by api_server_count=2, because the same default happens for Qwen2.5-Coder-7B-Instruct, but the runtime GPU utilization behavior is different.

Code Example

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

---

vllm serve /home/skl/mkx/model/Qwen2.5-Coder-7B-Instruct \
  --host 127.0.0.1 \
  --port 10872 \
  --served-model-name Qwen2.5-Coder-7B-Instruct \
  --tensor-parallel-size 4 \
  --data-parallel-size 2
RAW_BUFFERClick to expand / collapse

Your current environment

  • vLLM version: 0.16.0
  • CUDA driver: 550.54.15
  • CUDA version: 12.4
  • GPUs: 8 x NVIDIA A100-PCIE-40GB
  • Interconnect: PCIe-only (no NVLink)
  • NCCL version reported by vLLM: 2.27.5
  • Python: 3.10
  • OS: Ubuntu Linux

🐛 Describe the bug

Model

  • Problematic model: Qwen3-8B
  • Comparison model: Qwen2.5-Coder-7B-Instruct

Command

For Qwen3-8B:

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

For Qwen2.5-Coder-7B-Instruct:

vllm serve /home/skl/mkx/model/Qwen2.5-Coder-7B-Instruct \
  --host 127.0.0.1 \
  --port 10872 \
  --served-model-name Qwen2.5-Coder-7B-Instruct \
  --tensor-parallel-size 4 \
  --data-parallel-size 2

vLLM logs show that api_server_count defaults to data_parallel_size (2) in both cases.

What I observe

When serving Qwen3-8B with tp=4, dp=2 on 8 A100 PCIe GPUs:

All 8 GPUs load model weights into memory successfully. The two TP groups appear to be: DP replica 0: local_rank 0,1,2,3 DP replica 1: local_rank 4,5,6,7 However, during inference, only GPUs 4-7 show high utilization, while GPUs 0-3 remain near 0% utilization, even though they keep almost full memory allocated. From nvidia-smi, GPUs 0-3 are mostly idle, while GPUs 4-7 are ~85%-98% utilized. This makes it look like only one DP replica is actually doing work.

For Qwen2.5-Coder-7B-Instruct under the same tp=4, dp=2 setup, GPU utilization appears more distributed in practice, even though the HTTP request logs may still mostly show a single API server process handling requests.

Why this seems suspicious

This does not look like a simple model-loading failure:

For Qwen3-8B, all 8 GPUs allocate nearly full memory. vLLM logs show both TP groups initialized successfully. NCCL initializes successfully. The issue appears at runtime / scheduling / execution time, not at startup.

This also makes me suspect the issue is not solely explained by api_server_count=2, because the same default happens for Qwen2.5-Coder-7B-Instruct, but the runtime GPU utilization behavior is different.

Relevant log details

For Qwen3-8B:

Resolved architecture: Qwen3ForCausalLM Using max model len 40960 tensor_parallel_size=4 data_parallel_size=2 api_server_count=2 world_size=4 rank=0 local_rank=0/4 ... indicating two separate TP groups Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs SymmMemCommunicator: Device capability 8.0 not supported

For Qwen2.5-Coder-7B-Instruct:

Resolved architecture: Qwen2ForCausalLM Using max model len 32768 same tensor_parallel_size=4 same data_parallel_size=2 same api_server_count=2

One notable difference is that Qwen3-8B runs with max model len 40960, while Qwen2.5-Coder-7B-Instruct runs with 32768.

Expected behavior

With tp=4, dp=2 on 8 GPUs, I expect both DP replicas to participate in serving requests under load, so that both GPU groups (0-3 and 4-7) show meaningful compute utilization.

Actual behavior

For Qwen3-8B, only one 4-GPU group seems to execute inference workloads, while the other 4-GPU group remains memory-resident but mostly idle.

Additional notes This is on PCIe-only A100s, so I understand the custom all-reduce warning is expected and probably not the root cause. The issue seems model-dependent, because I do not see the same behavior as clearly with Qwen2.5-Coder-7B-Instruct under the same topology and launch flags. I would like to know whether this is: expected behavior due to scheduler / queueing with this workload, a known issue with Qwen3 on vLLM 0.16.0, related to V1 engine / cudagraph / long-context behavior, or an actual DP scheduling bug.

Questions

Is this a known issue for Qwen3 models on vLLM 0.16.0 with tp=4, dp=2? Should a single API server process still be able to drive both DP replicas evenly in this setup? Is there any known interaction between: Qwen3 max_model_len=40960 V1 engine cudagraph capture and DP scheduling / replica utilization? Would you recommend testing with: --api-server-count 1 --enforce-eager smaller --max-model-len or a newer vLLM version first? Minimal reproduction

On a machine with 8 x A100-PCIE-40GB, launch Qwen3-8B with:

vllm serve /path/to/Qwen3-8B --tensor-parallel-size 4 --data-parallel-size 2

Then send repeated /v1/chat/completions requests.

Observed:

all 8 GPUs hold model memory only one 4-GPU replica shows sustained compute utilization the other 4 GPUs stay mostly idle

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The issue with uneven GPU utilization in Qwen3-8B model serving can be addressed by testing with a different configuration, such as adjusting the --api-server-count or --max-model-len parameters.

Guidance

  • Investigate the impact of --api-server-count 1 on GPU utilization to determine if the issue is related to the default value of api_server_count being set to data_parallel_size.
  • Test with a smaller --max-model-len to see if the issue is related to the large model length of 40960.
  • Consider upgrading to a newer version of vLLM to ensure that any known issues with DP scheduling or replica utilization have been addressed.
  • Verify that the issue is not specific to the Qwen3 model by testing with other models and configurations.

Example

No specific code example is provided, but the commands to test the suggested configurations could be:

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --api-server-count 1

or

vllm serve /home/skl/mkx/model/Qwen3-8B \
  --host 127.0.0.1 \
  --port 10842 \
  --served-model-name Qwen3-8B \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --max-model-len 32768

Notes

The issue appears to be model-dependent and may be related to the large model length or the default value of api_server_count. Further testing and investigation are needed to determine the root cause and the most effective solution.

Recommendation

Apply a workaround by testing with a different configuration, such as --api-server-count 1 or a smaller --max-model-len, to determine if this resolves the issue with uneven GPU utilization.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

With tp=4, dp=2 on 8 GPUs, I expect both DP replicas to participate in serving requests under load, so that both GPU groups (0-3 and 4-7) show meaningful compute utilization.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING