vllm - 💡(How to fix) Fix [Bug]: Pipeline Parallelism Blocked on V1 Engine (Multi-Node PP) [1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41864Fetched 2026-05-07 03:32:18
View on GitHub
Comments
1
Participants
2
Timeline
4
Reactions
0
Author
Participants
Timeline (top)
commented ×1labeled ×1mentioned ×1subscribed ×1

Multi-node pipeline parallelism (PP=2/TP=8 across 2 nodes) fails with AssertionError: collective_rpc should not be called on follower node. The V1 engine's MultiprocessExecutor only supports shared-memory IPC (single-node). Multi-node PP requires Ray or a new distributed executor.

Error Message

AssertionError: collective_rpc should not be called on follower node

Source: /usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py:352

Root Cause

Multi-node pipeline parallelism (PP=2/TP=8 across 2 nodes) fails with AssertionError: collective_rpc should not be called on follower node. The V1 engine's MultiprocessExecutor only supports shared-memory IPC (single-node). Multi-node PP requires Ray or a new distributed executor.

Fix Action

Fix / Workaround

Attempted Workarounds

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

---

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

---

# Node 1 (head):
vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

# Node 2 (worker — launched via Ray or manual):
# Same command with appropriate rank assignment

---

AssertionError: collective_rpc should not be called on follower node
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Summary

Multi-node pipeline parallelism (PP=2/TP=8 across 2 nodes) fails with AssertionError: collective_rpc should not be called on follower node. The V1 engine's MultiprocessExecutor only supports shared-memory IPC (single-node). Multi-node PP requires Ray or a new distributed executor.

Severity

Medium — blocks an important scaling strategy. SGLang/LMSYS show PP4+TP8 beats TP=32 by 30% for cross-node MoE serving.

Environment

ComponentVersion
vLLM0.20.0
PyTorch2.11.0+cu130
CUDA13.0.3
NVIDIA Driver580.105.08
NCCL2.30.4-1

Hardware

  • Instance: 2× AWS p5en.48xlarge
  • GPU: 16× NVIDIA H200
  • Network: 16× EFA per node

collect_env.py Output

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

Steps to Reproduce

# Node 1 (head):
vllm serve deepseek-ai/DeepSeek-V3-0324 \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000

# Node 2 (worker — launched via Ray or manual):
# Same command with appropriate rank assignment

Error

AssertionError: collective_rpc should not be called on follower node

Source: /usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py:352

What Works

  • NCCL distributed init: world_size=16, all ranks created ✓
  • Model loading: Both nodes load weights (38.72 GiB/GPU) ✓
  • Layer partitioning: [31, 30] layers across 2 PP stages ✓
  • MoE backend selection: DEEPGEMM ✓
  • Attention backend: FLASH_ATTN_MLA ✓

What Fails

The follower node attempts collective_rpc during KV cache initialization. The V1 engine's MultiprocessExecutor initializes rpc_broadcast_mq (shared-memory message queue) only on the leader node. On the follower, this is None, causing the assertion.

Attempted Workarounds

  1. Both nodes as servers: Same assertion during KV cache init
  2. Headless worker approach: Same error — headless still creates EngineCore calling collective_rpc
  3. Ray (in image tag -ray): Ray is installed but the V1 engine path doesn't use it for PP

Theoretical Performance Benefit

For DeepSeek-V3 with 61 MoE layers:

  • EP approach: 61 × all-to-all per forward pass
  • PP=2 approach: 1 activation transfer between stages (replacing 30-31 all-to-alls)

LMSYS/SGLang measured PP4+TP8 beating TP=32 by 30% for large MoE models.

Suggested Fix

Options:

  1. Support multi-node PP in V1 engine via TCP/gRPC-based collective_rpc (not shm-only)
  2. Add Ray-based PP executor to V1 engine (Ray is already in the image)
  3. Document that PP is single-node only in v0.20.0

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Pipeline Parallelism Blocked on V1 Engine (Multi-Node PP) [1 comments, 2 participants]