vllm - 💡(How to fix) Fix [Bug]: Sync EPLB rearrangement hangs indefinitely with DP8 + EP on B200 [1 participants]

arpera · 2026-04-04T15:46:52Z

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Feb  3 2026, 22:51:04) [Clang 21.1.4 ] (64-bit runtime)
Python platform              : Linux-6.8.0-94-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.61
GPU models and configuration :
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version        : 580.126.09
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
Model name:                           INTEL(R) XEON(R) PLATINUM 8570
CPU(s):                               224
Socket(s):                            2
Core(s) per socket:                   56
Thread(s) per core:                   2
NUMA node(s):                         2

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.2.6
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] torch==2.10.0
[pip3] transformers==4.57.5
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.18.2rc1.dev78+g2021f494a (git sha: 2021f494a)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  GPU0-GPU7: All NV18 interconnected (8x NVIDIA B200)

</details>

🐛 Describe the bug

Sync EPLB rearrangement hangs indefinitely during serving with DP8 + expert parallel on 8xB200, causing all EngineCore processes to stall.

Timeline from logs:

- Steps 75-99: EPLB runs normally, balancedness ~0.46
- Step 99 (18:07:19): `Rearranging experts sync mode ...` -- rearrangement starts
- 18:08:20 (+60s): All 8 EngineCore processes report `No available shared memory broadcast block found in 60 seconds`
- This repeats every 60 seconds for ~11 minutes
- 18:18:30: All EngineCore processes crash with `RuntimeError: cancelled` from `shm_broadcast.py:677`

The rearrangement never completes -- it deadlocks on the NCCL collective inside rearrange(). The first rearrangement during model loading (profile mode) works fine (3.80s). The hang occurs on the first real rearrangement triggered by serving load.
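
As a quick check on the timeline, the gaps between the logged timestamps can be computed directly. This is a minimal sketch that assumes all three timestamps fall on the same day; the event labels are illustrative and are not log strings.

```python
from datetime import datetime

# Timestamps copied from the timeline in the report (same day assumed).
EVENTS = {
    "rearrangement starts": "18:07:19",
    "first shm broadcast warning": "18:08:20",
    "EngineCore crash": "18:18:30",
}


def seconds_since_start(events):
    """Return seconds elapsed from the first listed event to each event."""
    parsed = {name: datetime.strptime(ts, "%H:%M:%S") for name, ts in events.items()}
    start = next(iter(parsed.values()))  # dicts preserve insertion order
    return {name: int((t - start).total_seconds()) for name, t in parsed.items()}


elapsed = seconds_since_start(EVENTS)
for name, secs in elapsed.items():
    print(f"{name}: +{secs}s")
```

The crash lands 671 seconds (about 11 minutes) after the rearrangement starts, which matches the "~11 minutes" of repeated warnings in the timeline.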

Note: the balancedness before the hang is very poor (~0.46). Not sure if that's related.

Server command:

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --port 8000 -tp 1 -pp 1 -dp 8 \
  --enable-expert-parallel --language-model-only \
  --reasoning-parser qwen3 --stream-interval 100 \
  --enable-eplb \
  --eplb-config '{"num_redundant_experts": 32, "window_size": 100, "step_interval": 100, "log_balancedness": true, "log_balancedness_interval": 1}' \
  --gpu-memory-utilization 0.80

Benchmark command (triggers the hang):

vllm bench serve \
  --backend vllm --model nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --port 8000 --endpoint /v1/completions \
  --dataset-name random --random-input 8192 --random-output 1 \
  --max-concurrency 128 --num-prompt 128 --ignore-eos --temperature 0.0

Key observations:

- Profile rearrangement during startup completes fine (3.80s)
- The hang occurs on the first real rearrangement at step 100 (after the first 25 steps of actual serving traffic, since the initial step is set to 75)
- All 8 GPU workers are at 100% utilization during the hang (busy-spinning on NCCL?)
- Memory is not the issue (~17 GB free per GPU)
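
The step arithmetic in the second observation can be written out as follows. This sketch assumes, based only on the report, that the step counter starts at 75 and that a rearrangement fires once it reaches `step_interval`; the function name is hypothetical and is not part of vLLM.

```python
def serving_steps_before_first_rearrangement(initial_step: int, step_interval: int) -> int:
    """Steps of real traffic before the counter reaches step_interval.

    Assumption (from the report, not verified against vLLM source): the
    counter starts at initial_step and rearrangement triggers at step_interval.
    """
    if step_interval <= 0:
        raise ValueError("step_interval must be positive")
    if not 0 <= initial_step < step_interval:
        raise ValueError("initial_step must be in [0, step_interval)")
    return step_interval - initial_step


steps = serving_steps_before_first_rearrangement(initial_step=75, step_interval=100)
print(steps)  # 25
```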

Related: #32478 (EPLB hangs in several cases) -- that issue covers async EPLB + DeepEP/specific backends. This is sync EPLB with standard NCCL on B200.

Full server log attached as [sync_eplb_failure_log.txt](https://github.com/user-attachments/files/26481621/sync_eplb_failure_log.txt).

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

No confirmed fix is identified. The report points to a deadlock in the NCCL collective inside rearrange(), so the next steps are to investigate that collective and to test whether changes to the EPLB configuration or NCCL settings avoid the hang.

Guidance

- Investigate the NCCL collective inside rearrange() to find the cause of the deadlock. All 8 GPU workers sit at 100% utilization during the hang and about 17 GB is free per GPU, so memory exhaustion looks unlikely.
- Review the EPLB configuration, particularly num_redundant_experts, window_size, and step_interval, to see whether different values avoid the hang.
- Try lower --max-concurrency and --num-prompt values in the benchmark command to reduce load on the GPU workers, and check whether the hang still reproduces.
- Examine the attached sync_eplb_failure_log.txt for additional clues about the cause of the hang.
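
One way to start on the log is to locate the three messages quoted in the timeline. The sketch below counts them. The marker strings come from the report, while the function name, the labels, and the sample lines are hypothetical; the real log may format these lines differently.

```python
from collections import Counter
from typing import Iterable

# Substrings quoted in the bug report's timeline.
MARKERS = {
    "rearrange_start": "Rearranging experts",
    "shm_timeout": "No available shared memory broadcast block found",
    "cancelled": "RuntimeError: cancelled",
}


def count_markers(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each marker substring."""
    counts = Counter({label: 0 for label in MARKERS})
    for line in lines:
        for label, needle in MARKERS.items():
            if needle in line:
                counts[label] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "Rearranging experts sync mode ...",
    "No available shared memory broadcast block found in 60 seconds",
    "No available shared memory broadcast block found in 60 seconds",
    "RuntimeError: cancelled",
]
counts = count_markers(sample)
print(dict(counts))
```

To scan the attached file instead of the sample, pass an open file handle, for example `count_markers(open("sync_eplb_failure_log.txt"))`. A rearrangement start that is never followed by further progress, only by timeouts, would match the pattern described in the report.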

Example

No specific code snippet is provided, as the issue is related to a complex system configuration and requires a more in-depth investigation.

Notes

The hang occurs only during the first real rearrangement at step 100, after the initial profile rearrangement completes successfully. The poor balancedness before the hang (~0.46) may be related to the issue, but its impact is unclear.

Recommendation

The root cause is not yet clear and needs further investigation. Until it is identified, try adjusting the EPLB configuration or NCCL settings as a possible workaround; neither is confirmed to prevent the deadlock.
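
For experiments with the EPLB configuration, variants of the `--eplb-config` value can be generated from the reported settings. This sketch only builds the JSON strings. The keys and base values are taken from the server command in the report; the helper name is hypothetical, and the alternative `step_interval` value is an arbitrary example that is not known to avoid the hang.

```python
import json

# Base values copied from the --eplb-config argument in the report.
BASE_EPLB_CONFIG = {
    "num_redundant_experts": 32,
    "window_size": 100,
    "step_interval": 100,
    "log_balancedness": True,
    "log_balancedness_interval": 1,
}


def eplb_config_variants(base: dict, key: str, values: list) -> list[str]:
    """Return one JSON string per value, overriding a single key of base."""
    return [json.dumps({**base, key: value}) for value in values]


# 1000 is an arbitrary illustrative value, not a recommendation.
variants = eplb_config_variants(BASE_EPLB_CONFIG, "step_interval", [100, 1000])
for variant in variants:
    print(f"--eplb-config '{variant}'")
```

Changing one key at a time keeps each run comparable with the original failing configuration.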
