vllm - [Bug]: DeepSeek-R1 hang on 8xB200 after NCCL Initialization

GitHub stats
vllm-project/vllm#40604 (fetched 2026-04-23 07:24:02)
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): added_to_project_v2 ×1, labeled ×1, project_v2_item_status_changed ×1, renamed ×1


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-63,128-191    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-63,128-191    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-63,128-191    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    0-63,128-191    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    NODE    NODE    SYS     64-127,192-255  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    PIX     SYS     64-127,192-255  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX     NODE    SYS     64-127,192-255  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    NODE    NODE    SYS     64-127,192-255  1               N/A
NIC0    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC1    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC3    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE    SYS
NIC5    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    NODE    NODE    SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      PIX     NODE    NODE    SYS
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    PIX      X      NODE    NODE    SYS
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      NODE    SYS
NIC9    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS
NIC10   NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_11
  NIC6: mlx5_12
  NIC7: mlx5_13
  NIC8: mlx5_14
  NIC9: mlx5_15
  NIC10: mlx5_bond_0

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=void
NVIDIA_REQUIRE_CUDA=cuda>=13.1 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576 brand=unknown,driver>=580,driver<581 brand=grid,driver>=580,driver<581 brand=tesla,driver>=580,driver<581 brand=nvidia,driver>=580,driver<581 brand=quadro,driver>=580,driver<581 brand=quadrortx,driver>=580,driver<581 
brand=nvidiartx,driver>=580,driver<581 brand=vapps,driver>=580,driver<581 brand=vpc,driver>=580,driver<581 brand=vcs,driver>=580,driver<581 brand=vws,driver>=580,driver<581 brand=cloudgaming,driver>=580,driver<581
NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=13.1.1
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

I am using vLLM 0.19.0 on 8xB200 and trying to run DeepSeek-R1.

Startup never completes. The API server keeps waiting for the local core engine process, with output like:

[Gloo] Rank [Gloo] Rank 6 is connected to [Gloo] Rank 57 is connected to  peer ranks. 7Expected number of connected peer ranks is :  peer ranks. 7Expected number of connected peer ranks is : 77
 is connected to
7 peer ranks. Expected number of connected peer ranks is : 7
(Worker pid=13612) ================================================================================
(Worker pid=13612) [2026-04-22 09:45:39] FlashInfer API Logging - System Information
(Worker pid=13612) ================================================================================
(Worker pid=13612) FlashInfer version: 0.6.6
(Worker pid=13612) CUDA toolkit version: 13.0
(Worker pid=13612) cuDNN version: 91501
(Worker pid=13612) Number of GPUs: 8
(Worker pid=13612)   GPU 0: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 1: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 2: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 3: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 4: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 5: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 6: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612)   GPU 7: NVIDIA B200
(Worker pid=13612)     Compute capability: 10.0 (SM100)
(Worker pid=13612) PyTorch version: 2.10.0+cu130
(Worker pid=13612) ================================================================================
(Worker pid=13612)
(Worker pid=13610) ================================================================================
(Worker pid=13610) [2026-04-22 09:45:39] FlashInfer API Logging - System Information
(Worker pid=13610) ================================================================================
(Worker pid=13610) FlashInfer version: 0.6.6
(Worker pid=13610) CUDA toolkit version: 13.0
(Worker pid=13610) cuDNN version: 91501
(Worker pid=13610) Number of GPUs: 8
(Worker pid=13610)   GPU 0: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 1: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 2: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 3: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 4: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 5: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 6: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610)   GPU 7: NVIDIA B200
(Worker pid=13610)     Compute capability: 10.0 (SM100)
(Worker pid=13610) PyTorch version: 2.10.0+cu130
(Worker pid=13610) ================================================================================
(Worker pid=13610)
(Worker pid=13611) ================================================================================
(Worker pid=13611) [2026-04-22 09:45:39] FlashInfer API Logging - System Information
(Worker pid=13611) ================================================================================
(Worker pid=13611) FlashInfer version: 0.6.6
(Worker pid=13611) CUDA toolkit version: 13.0
(Worker pid=13611) cuDNN version: 91501
(Worker pid=13611) Number of GPUs: 8
(Worker pid=13611)   GPU 0: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 1: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 2: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 3: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 4: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 5: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 6: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611)   GPU 7: NVIDIA B200
(Worker pid=13611)     Compute capability: 10.0 (SM100)
(Worker pid=13611) PyTorch version: 2.10.0+cu130
(Worker pid=13611) ================================================================================
(Worker pid=13611)
(Worker pid=13605) DEBUG 04-22 09:45:40 [utils/nccl.py:34] Found nccl from library libnccl.so.2
(Worker pid=13605) INFO 04-22 09:45:40 [distributed/device_communicators/pynccl.py:111] vLLM is using nccl==2.28.9
(APIServer pid=13192) DEBUG 04-22 09:45:43 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=13192) DEBUG 04-22 09:45:53 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=13192) DEBUG 04-22 09:46:03 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.

The server can stay stuck at this point for an hour without making progress.
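
One way to see where each process stopped is to group a log like the one above by PID and keep the last message per process. The sketch below is a minimal illustration, not part of vLLM: `summarize` and `PREFIX` are hypothetical names, and the sample lines are abbreviated from the log in this report.

```python
import re

# Abbreviated lines copied from the log in this report.
LOG = """\
(Worker pid=13612) FlashInfer version: 0.6.6
(Worker pid=13610) FlashInfer version: 0.6.6
(Worker pid=13611) FlashInfer version: 0.6.6
(Worker pid=13605) INFO 04-22 09:45:40 [distributed/device_communicators/pynccl.py:111] vLLM is using nccl==2.28.9
(APIServer pid=13192) DEBUG 04-22 09:45:43 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=13192) DEBUG 04-22 09:45:53 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=13192) DEBUG 04-22 09:46:03 [v1/engine/utils.py:1047] Waiting for 1 local, 0 remote core engine proc(s) to start.
"""

# Matches the "(Role pid=N)" prefix used in the log above.
PREFIX = re.compile(r"^\((?P<role>\w+) pid=(?P<pid>\d+)\)\s?(?P<msg>.*)$")


def summarize(log_text):
    """Return the last non-empty message per (role, pid) and the number of wait messages."""
    last = {}
    waits = 0
    for line in log_text.splitlines():
        match = PREFIX.match(line)
        if not match:
            continue  # skip lines without a prefix, such as interleaved Gloo output
        msg = match.group("msg").strip()
        if not msg:
            continue  # skip prefix-only lines
        last[(match.group("role"), int(match.group("pid")))] = msg
        if "Waiting for" in msg and "core engine proc" in msg:
            waits += 1
    return last, waits


last_messages, wait_count = summarize(LOG)
for (role, pid), msg in sorted(last_messages.items()):
    print(f"{role} {pid}: {msg}")
print(f"wait messages: {wait_count}")
```

On these sample lines, only Worker pid=13605 reaches the NCCL message, and the API server logs the wait message three times. Running the same grouping over the full log would show whether the other workers ever get past the FlashInfer banner.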

Could you please help me check?

I start the server with:

FLASHINFER_LOGLEVEL=3 FLASHINFER_JIT_VERBOSE=1 VLLM_LOGGING_LEVEL=DEBUG MOE_CAP_PROFILING_ONLY=1 vllm serve   --model deepseek-ai/DeepSeek-R1 --port 8000 --tensor-parallel-size 8 --reasoning-parser deepseek_r1 --trust-remote-code       --max-num-batched-tokens 131072 --max-num-seqs 1

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

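One possible cause is a CUDA version mismatch: the container environment reports CUDA_VERSION=13.1.1, while FlashInfer reports CUDA toolkit 13.0 and PyTorch is built as 2.10.0+cu130. The report does not confirm this. The log only shows the API server waiting for the local core engine process after a worker loads NCCL 2.28.9.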

Guidance

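  • Compare the CUDA versions in the report: CUDA_VERSION=13.1.1 in the container environment, CUDA toolkit 13.0 as reported by FlashInfer, and PyTorch 2.10.0+cu130.
  • Check the NVIDIA_VISIBLE_DEVICES environment variable. The report shows NVIDIA_VISIBLE_DEVICES=void, yet each worker enumerates all 8 B200 GPUs, so device visibility is probably not the cause.
  • Confirm that --tensor-parallel-size matches the number of available GPUs. The reported command uses --tensor-parallel-size 8 on 8 GPUs, so this already matches.
  • If the versions turn out to be incompatible, consider aligning the CUDA toolkit with the version vLLM was built against.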

Example

The issue itself includes no code to fix, since the problem appears to be related to environment configuration rather than code.
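
As a rough aid for the version comparison, the sketch below normalizes the three CUDA version strings quoted in this report and flags whether they differ. The names `REPORTED`, `cuda_major_minor`, and `find_mismatch` are hypothetical, and the parsing of the `+cuNNN` suffix assumes the last digit is the minor version. A difference is only a hint to investigate; it does not prove the versions are incompatible.

```python
import re

# Version strings copied from this report.
REPORTED = {
    "CUDA_VERSION (container env)": "13.1.1",
    "FlashInfer CUDA toolkit": "13.0",
    "PyTorch build": "2.10.0+cu130",
}


def cuda_major_minor(value):
    """Extract the CUDA (major, minor) pair from a reported version string."""
    tag = re.search(r"\+cu(\d+)$", value)
    if tag:
        digits = tag.group(1)  # assumption: "130" means CUDA 13.0
        return int(digits[:-1]), int(digits[-1])
    parts = value.split(".")
    return int(parts[0]), int(parts[1])


def find_mismatch(reported):
    """Return the parsed versions and whether they are not all identical."""
    versions = {name: cuda_major_minor(v) for name, v in reported.items()}
    return versions, len(set(versions.values())) > 1


versions, mismatch = find_mismatch(REPORTED)
for name, (major, minor) in versions.items():
    print(f"{name}: CUDA {major}.{minor}")
print("versions differ" if mismatch else "versions agree")
```

With the values from this report, the container environment parses as CUDA 13.1 while FlashInfer and PyTorch both parse as CUDA 13.0.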

Notes

The issue may be specific to the DeepSeek-R1 model or the vLLM version being used. Further debugging may be required to determine the root cause.
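
For the further debugging mentioned here, one option is to repeat the reported launch with NCCL's own logging enabled. The sketch below only builds the environment and prints the equivalent shell command; it does not start the server. `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are NCCL environment variables rather than vLLM flags, the helper names are hypothetical, and nothing in the report shows that this output will reveal the cause.

```python
import os
import shlex

# Command and environment prefixes copied from the report.
SERVE_CMD = [
    "vllm", "serve",
    "--model", "deepseek-ai/DeepSeek-R1",
    "--port", "8000",
    "--tensor-parallel-size", "8",
    "--reasoning-parser", "deepseek_r1",
    "--trust-remote-code",
    "--max-num-batched-tokens", "131072",
    "--max-num-seqs", "1",
]
REPORT_ENV = {
    "FLASHINFER_LOGLEVEL": "3",
    "FLASHINFER_JIT_VERBOSE": "1",
    "VLLM_LOGGING_LEVEL": "DEBUG",
    "MOE_CAP_PROFILING_ONLY": "1",
}
# Assumption: extra NCCL logging is wanted for the initialization phase.
DEBUG_ENV = {"NCCL_DEBUG": "INFO", "NCCL_DEBUG_SUBSYS": "INIT,NET"}


def build_env(base=None):
    """Copy the base environment and add the report's and the debug variables."""
    env = dict(os.environ if base is None else base)
    env.update(REPORT_ENV)
    env.update(DEBUG_ENV)
    return env


def as_shell(cmd, extra_env):
    """Render the command with VAR=value prefixes, as typed in a shell."""
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in extra_env.items())
    return f"{prefix} {shlex.join(cmd)}"


print(as_shell(SERVE_CMD, {**REPORT_ENV, **DEBUG_ENV}))
# To launch for real: subprocess.run(SERVE_CMD, env=build_env(), check=True)
```

Keeping the original flags unchanged means any difference in behavior comes from the added logging only, which makes the new output easier to compare with the log in the report.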

Recommendation

As a first step, align the CUDA toolkit version with the one vLLM requires and confirm that the NVIDIA_VISIBLE_DEVICES environment variable is set as intended. This workaround is unverified: the issue had no comments and no confirmed fix when it was fetched.
