vllm - ✅(Solved) Fix [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes [1 pull requests, 6 comments, 3 participants]

vllm2026-03-09 09:53:42

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#36481•Fetched 2026-04-08 00:36:40

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×6mentioned ×5subscribed ×5closed ×1

PR fix notes

PR #2760: Some fixes for custom allreduce kernels

Repository: vllm-project/vllm
Author: hanzhi713
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/2760

Description (problem / solution / changelog)

Recently, there are some reports of stuck generation or garbage text when custom all reduce is enabled. While I didn't manage to reproduce any issues on A30 and A100, I did find some potentially unsafe synchronizations and I attempt to fix them here.

When using the signal flag, GPUs are writing to different bytes of the same 8-byte signal. Although the writes are strong writes, they are not considered morally strong according to the CUDA memory model because they didn't overlap completely. Hence, they are still considered as data races.
When using 2-stage allreduce or half butterfly allreduce, a __threadfence_system or a release-acquire pattern is needed to absolutely guarantee the visibility of other devices' write on the current device, which is missing from the current implementation.

Related issues: #2788 (garbage output when upgrading vllm from 0.2.7 -> 0.3.0) #2742 (garbage output. Solved when disable_custom_all_reduce=True) #2731 (one person reported that disable_custom_all_reduce=True solve generation hanging)

In this PR, I made the following changes

Use 8 uint32 instead of 8 bytes per signal per device.
simplify synchronization by only syncing blocks of the same index across GPUs. Some changes are made to ensure that each thread's read only depends on the writes from the thread of the same id in other devices.
add a __threadfence_system guarantee visibility of other device's writes when using 2-stage or half butterfly allreduce. Note that this adds a few microseconds of overhead.
removed support for more than two PCIe-only GPUs because performance improvement is small.
add additional p2p checks to avoid buggy driver/hardware P2P support. Might fix #2728
add check for device count when running P2P test. Should fix #2795
<s>disable custom allreduce by default by setting the argument to True. User can explicitly opt-in by setting it to False.</s>

Changed files

csrc/custom_all_reduce.cu (modified, +5/-5)
csrc/custom_all_reduce.cuh (modified, +75/-152)
csrc/custom_all_reduce_test.cu (modified, +108/-76)
vllm/config.py (modified, +0/-9)
vllm/entrypoints/llm.py (modified, +1/-1)
vllm/model_executor/parallel_utils/custom_all_reduce.py (modified, +43/-7)

Code Example

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: e, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 1c, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 4090, pciBusID: 20, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 937.82  21.55  21.61  21.52
     1  21.78 948.35  21.56  21.52
     2  21.26  21.53 946.13  21.28
     3  21.36  21.46  21.38 949.25
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 942.36  26.33  26.33  26.33
     1  26.33 968.00  26.32  26.33
     2  26.33  26.33 964.02  26.33
     3  26.33  26.33  26.33 967.92
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 944.66  29.64  29.48  29.53
     1  29.82 948.81  29.36  29.50
     2  29.75  29.83 949.13  28.96
     3  29.79  29.83  28.99 948.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 944.26  52.05  52.05  52.06
     1  52.01 945.39  52.06  52.07
     2  52.06  52.04 945.96  52.07
     3  52.07  52.07  52.06 945.19

---

mpirun --allow-run-as-root -np 4 ./custom_all_reduce_test                                                 Rank 0 done, nGPUs:4, data size(KB) 1, 512, 36, my time:10.29us, nccl time:9.14us
average abs diffs: nccl: 0.00202151 me: 0.00150494
Rank 0 done, nGPUs:4, data size(KB) 2, 512, 36, my time:10.35us, nccl time:8.73us
average abs diffs: nccl: 0.00195802 me: 0.00147666
Rank 0 done, nGPUs:4, data size(KB) 4, 512, 36, my time:10.40us, nccl time:9.42us
average abs diffs: nccl: 0.00201413 me: 0.00147393
Rank 0 done, nGPUs:4, data size(KB) 8, 512, 36, my time:10.77us, nccl time:9.76us
average abs diffs: nccl: 0.00201263 me: 0.00146741
Rank 0 done, nGPUs:4, data size(KB) 16, 512, 36, my time:11.38us, nccl time:10.66us
average abs diffs: nccl: 0.00197639 me: 0.00144941
Rank 0 done, nGPUs:4, data size(KB) 32, 512, 36, my time:12.54us, nccl time:11.23us
average abs diffs: nccl: 0.00197297 me: 0.00143931
Rank 0 done, nGPUs:4, data size(KB) 64, 512, 36, my time:13.97us, nccl time:13.96us
average abs diffs: nccl: 0.00197241 me: 0.00144246
Rank 0 done, nGPUs:4, data size(KB) 128, 512, 36, my time:17.50us, nccl time:21.77us
average abs diffs: nccl: 0.00198034 me: 0.00144287
Rank 0 done, nGPUs:4, data size(KB) 256, 512, 36, my time:25.83us, nccl time:36.71us
average abs diffs: nccl: 0.00198367 me: 0.00144506
Rank 0 done, nGPUs:4, data size(KB) 512, 512, 36, my time:42.93us, nccl time:65.09us
average abs diffs: nccl: 0.00198434 me: 0.0014479
Rank 0 done, nGPUs:4, data size(KB) 1024, 512, 36, my time:81.08us, nccl time:90.22us
average abs diffs: nccl: 0.00198447 me: 0.00144652
Rank 0 done, nGPUs:4, data size(KB) 2048, 512, 36, my time:156.45us, nccl time:150.93us
average abs diffs: nccl: 0.00198394 me: 0.0014453
Rank 0 done, nGPUs:4, data size(KB) 4096, 512, 36, my time:304.69us, nccl time:261.71us
average abs diffs: nccl: 0.00198257 me: 0.001444
Rank 0 done, nGPUs:4, data size(KB) 8192, 512, 36, my time:598.21us, nccl time:505.03us
average abs diffs: nccl: 0.00198188 me: 0.0014442
Rank 0 done, nGPUs:4, data size(KB) 16384, 512, 36, my time:1171.32us, nccl time:995.06us
average abs diffs: nccl: 0.00198109 me: 0.00144337

---

The output of `python collect_env.py`

RAW_BUFFERClick to expand / collapse

Proposal to improve performance

@hanzhi713 , in your previous reply in https://github.com/vllm-project/vllm/issues/4770 and https://github.com/vllm-project/vllm/pull/2760 you mentioned that for GPUs connected via PCIe (not crossing CPU) , Custom allreduce should perform better than NCCL. However, I'm now encountering an issue where the 2-stage TP4 bandwidth is worse than NCCL. Could you help take a look?

1、It's more performant than NCCL when either there are only two PCIe GPUs (they can be connected to the PCIe root complex directly or with a PCIe switch), or there are multiple PCIe GPUs connected to the same PCIe switch. 2、However, in your case you have a PCIe switch connecting to each group of 4 GPUs. Given the much better switching performance, my implementation may work and provide performance improvements.

Hardware: 4× RTX 4090 (connected via PCIe 4.0, not crossing CPU sockets)
Driver: Open-source driver (used to force-enable P2P, as 4090 does not support P2P by default; the modification successfully enables direct P2P communication between GPUs)
P2P bandwidth: Verified that peer-to-peer transfers achieve near PCIe 4.0 x16 theoretical bandwidth (~32 GB/s). The topology and bandwidth tests confirm this（Unidirectional P2P bandwidth = 26.33GB/s）.

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 9, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: e, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 1c, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 4090, pciBusID: 20, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 937.82  21.55  21.61  21.52
     1  21.78 948.35  21.56  21.52
     2  21.26  21.53 946.13  21.28
     3  21.36  21.46  21.38 949.25
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3
     0 942.36  26.33  26.33  26.33
     1  26.33 968.00  26.32  26.33
     2  26.33  26.33 964.02  26.33
     3  26.33  26.33  26.33 967.92
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 944.66  29.64  29.48  29.53
     1  29.82 948.81  29.36  29.50
     2  29.75  29.83 949.13  28.96
     3  29.79  29.83  28.99 948.98
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3
     0 944.26  52.05  52.05  52.06
     1  52.01 945.39  52.06  52.07
     2  52.06  52.04 945.96  52.07
     3  52.07  52.07  52.06 945.19

Observation: When running with TP4, I compared the performance of vLLM's custom allreduce (2-stage) against NCCL allreduce( I modified the code to force the use of the 2-stage algorithm for all message sizes for this test.). At a message size of 16 MB, the custom allreduce achieves a bus bandwidth of only ~21.49 GB/s, whereas NCCL reaches ~25.29 GB/s, which is close to the P2P peak. This gap persists for larger data sizes.

mpirun --allow-run-as-root -np 4 ./custom_all_reduce_test                                                 Rank 0 done, nGPUs:4, data size(KB) 1, 512, 36, my time:10.29us, nccl time:9.14us
average abs diffs: nccl: 0.00202151 me: 0.00150494
Rank 0 done, nGPUs:4, data size(KB) 2, 512, 36, my time:10.35us, nccl time:8.73us
average abs diffs: nccl: 0.00195802 me: 0.00147666
Rank 0 done, nGPUs:4, data size(KB) 4, 512, 36, my time:10.40us, nccl time:9.42us
average abs diffs: nccl: 0.00201413 me: 0.00147393
Rank 0 done, nGPUs:4, data size(KB) 8, 512, 36, my time:10.77us, nccl time:9.76us
average abs diffs: nccl: 0.00201263 me: 0.00146741
Rank 0 done, nGPUs:4, data size(KB) 16, 512, 36, my time:11.38us, nccl time:10.66us
average abs diffs: nccl: 0.00197639 me: 0.00144941
Rank 0 done, nGPUs:4, data size(KB) 32, 512, 36, my time:12.54us, nccl time:11.23us
average abs diffs: nccl: 0.00197297 me: 0.00143931
Rank 0 done, nGPUs:4, data size(KB) 64, 512, 36, my time:13.97us, nccl time:13.96us
average abs diffs: nccl: 0.00197241 me: 0.00144246
Rank 0 done, nGPUs:4, data size(KB) 128, 512, 36, my time:17.50us, nccl time:21.77us
average abs diffs: nccl: 0.00198034 me: 0.00144287
Rank 0 done, nGPUs:4, data size(KB) 256, 512, 36, my time:25.83us, nccl time:36.71us
average abs diffs: nccl: 0.00198367 me: 0.00144506
Rank 0 done, nGPUs:4, data size(KB) 512, 512, 36, my time:42.93us, nccl time:65.09us
average abs diffs: nccl: 0.00198434 me: 0.0014479
Rank 0 done, nGPUs:4, data size(KB) 1024, 512, 36, my time:81.08us, nccl time:90.22us
average abs diffs: nccl: 0.00198447 me: 0.00144652
Rank 0 done, nGPUs:4, data size(KB) 2048, 512, 36, my time:156.45us, nccl time:150.93us
average abs diffs: nccl: 0.00198394 me: 0.0014453
Rank 0 done, nGPUs:4, data size(KB) 4096, 512, 36, my time:304.69us, nccl time:261.71us
average abs diffs: nccl: 0.00198257 me: 0.001444
Rank 0 done, nGPUs:4, data size(KB) 8192, 512, 36, my time:598.21us, nccl time:505.03us
average abs diffs: nccl: 0.00198188 me: 0.0014442
Rank 0 done, nGPUs:4, data size(KB) 16384, 512, 36, my time:1171.32us, nccl time:995.06us
average abs diffs: nccl: 0.00198109 me: 0.00144337

Question: Could you please provide some clues or hints on why the custom 2-stage allreduce is slower than NCCL in this scenario? Is there any known limitation, or is there any implementation deficiency compared to NCCL that leads to worse performance?

Any guidance or pointers would be greatly appreciated. Thanks!

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the performance regression of the custom 2-stage allreduce compared to NCCL, consider the following steps:

Optimize GPU-GPU communication: Ensure that the custom allreduce implementation is optimized for GPU-GPU communication, taking into account the specific hardware configuration (e.g., PCIe 4.0, RTX 4090).
Minimize synchronization overhead: Reduce synchronization overhead between GPUs by using asynchronous communication primitives and minimizing the number of synchronization points.
Improve data transfer efficiency: Optimize data transfer between GPUs by using techniques such as data pipelining, overlapping computation and communication, and minimizing data copies.

Example code snippet to optimize GPU-GPU communication:

// Use CUDA streams to overlap computation and communication
cudaStream_t streams[4];
for (int i = 0; i < 4; i++) {
    cudaStreamCreate(&streams[i]);
}

// Asynchronous data transfer between GPUs
cudaMemcpyAsync(peer_buf, send_buf, size, cudaMemcpyDefault, streams[0]);
cudaMemcpyAsync(recv_buf, peer_buf, size, cudaMemcpyDefault, streams[1]);

// Overlap computation and communication
kernel<<<grid, block, 0, streams[2]>>>(data);
cudaMemcpyAsync(peer_buf, send_buf, size, cudaMemcpyDefault, streams[3]);

Verification

To verify the fix, run the custom allreduce test with the optimized implementation and compare the performance results with NCCL. Monitor the bus bandwidth and latency to ensure that the custom implementation achieves similar or better performance than NCCL.

Example command to run the test:

mpirun --allow-run-as-root -np 4 ./custom_all_reduce_test

Extra Tips

Use profiling tools (e.g., NVIDIA Nsight Systems, CUDA Profiler) to identify performance bottlenecks in the custom allreduce implementation.
Consider using collective communication libraries (e.g., NCCL, MPI) that are optimized for specific hardware configurations.
Ensure that the custom implementation is correctly handling errors and exceptions to prevent performance regressions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #latency issue #model loading #dependency error #configuration error #environment variable

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - ✅(Solved) Fix [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes [1 pull requests, 6 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

PR fix notes

PR #2760: Some fixes for custom allreduce kernels

Description (problem / solution / changelog)

Changed files

Code Example

Proposal to improve performance

Report of performance regression

Misc discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - ✅(Solved) Fix [Performance]: 2-stage custom allreduce (TP4) bandwidth lagging behind NCCL for large message sizes [1 pull requests, 6 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

PR fix notes

PR #2760: Some fixes for custom allreduce kernels

Description (problem / solution / changelog)

Changed files

Code Example

Proposal to improve performance

Report of performance regression

Misc discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING