vllm - [Bug]: Engine hangs indefinitely during model weight loading for nvidia/Qwen3.5-397B-A17B-NVFP4 on Blackwell GPUs (RTX PRO 6000) with TP=4

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, May  4 2026, 09:06:35) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.12.68+-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 1: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 2: NVIDIA RTX PRO 6000 Blackwell Server Edition
GPU 3: NVIDIA RTX PRO 6000 Blackwell Server Edition

Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Byte Order:                              Little Endian
CPU(s):                                  192
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9B45
Thread(s) per core:                      2
Core(s) per socket:                      96
Socket(s):                               1

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.7.0
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.1
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     NODE    NODE    0-191   0               N/A
GPU1    PIX      X      NODE    NODE    0-191   0               N/A
GPU2    NODE    NODE     X      PIX     0-191   0               N/A
GPU3    NODE    NODE    PIX      X      0-191   0               N/A

==============================
     Environment Variables
==============================
NCCL_P2P_LEVEL=PHB
NCCL_DEBUG=TRACE
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=13.0.2
VLLM_DEEP_GEMM_WARMUP=skip
VLLM_USE_DEEP_GEMM=0
VLLM_ENABLE_CUDA_COMPATIBILITY=0
OMP_NUM_THREADS=1
VLLM_LOGGING_LEVEL=DEBUG
</details>

🐛 Describe the bug

vLLM v0.20.1 hangs indefinitely during model weight loading when serving nvidia/Qwen3.5-397B-A17B-NVFP4 on 4x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs (compute capability 12.0, 95 GB each) with --tensor-parallel-size 4.

The engine starts correctly: NCCL initializes, all 4 workers connect via P2P/CUMEM, memory is allocated (75.98 GB per GPU), and weight loading begins across 11 safetensors shards. However, the process hangs indefinitely after enumerating the safetensors files: no progress, no error, no OOM. The APIServer loops forever, logging "Waiting for 1 local, 0 remote core engine proc(s) to start."

The model files are fully downloaded and cached on the PVC. There is no OOM or CUDA error.
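One way to back up the claim that the cached shards are intact is to validate each file's safetensors header without loading any tensors. The following is a minimal sketch, not part of vLLM: the function names and the directory argument are hypothetical, and it relies only on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes).

```python
import json
import struct
from pathlib import Path


def check_safetensors_shard(path: Path) -> dict:
    """Validate one .safetensors file by reading only its header.

    Layout: 8-byte little-endian header length N, then N bytes of JSON
    mapping tensor names to dtype/shape/data_offsets, then the raw data.
    """
    size = path.stat().st_size
    with path.open("rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError(f"{path.name}: file shorter than 8 bytes")
        (header_len,) = struct.unpack("<Q", prefix)
        if 8 + header_len > size:
            raise ValueError(f"{path.name}: header length exceeds file size")
        header = json.loads(f.read(header_len))

    # "__metadata__" is an optional free-form entry, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    data_end = max((v["data_offsets"][1] for v in tensors.values()), default=0)
    expected = 8 + header_len + data_end
    if expected != size:
        raise ValueError(f"{path.name}: expected {expected} bytes, found {size}")
    return {"file": path.name, "tensors": len(tensors), "bytes": size}


def check_model_dir(model_dir: str) -> list:
    """Check every shard in a directory (model_dir is a hypothetical path)."""
    shards = sorted(Path(model_dir).glob("*.safetensors"))
    return [check_safetensors_shard(p) for p in shards]
```

Running `check_model_dir` against the snapshot directory on the PVC should return 11 entries if every shard is complete; a truncated download raises `ValueError` instead. This only rules out damaged files and says nothing about why the loader stalls.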

Reproducer

Launch vLLM with the following command (using Docker or Kubernetes):

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.8 \
    --served-model-name Qwen/Qwen3.5-397B-A17B \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

With env vars:

NCCL_P2P_LEVEL=PHB
OMP_NUM_THREADS=1
VLLM_DEEP_GEMM_WARMUP=skip
VLLM_USE_DEEP_GEMM=0

Image: vllm/vllm-openai:v0.20.1
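Because the failure mode is a silent hang, it can help to run the reproducer under a wall-clock limit so that a CI job or debugging session does not block forever. The sketch below is an assumption-laden helper, not part of vLLM: `run_bounded`, `REPRO_ENV`, and `REPRO_ARGV` are hypothetical names, while the command-line flags and environment variables are copied from the reproducer above.

```python
import os
import subprocess
from typing import Optional

# Environment variables and command taken verbatim from the reproducer.
REPRO_ENV = {
    "NCCL_P2P_LEVEL": "PHB",
    "OMP_NUM_THREADS": "1",
    "VLLM_DEEP_GEMM_WARMUP": "skip",
    "VLLM_USE_DEEP_GEMM": "0",
}
REPRO_ARGV = [
    "vllm", "serve", "nvidia/Qwen3.5-397B-A17B-NVFP4",
    "--trust-remote-code",
    "--tensor-parallel-size", "4",
    "--kv-cache-dtype", "fp8",
    "--gpu-memory-utilization", "0.8",
    "--served-model-name", "Qwen/Qwen3.5-397B-A17B",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "qwen3_coder",
    "--reasoning-parser", "qwen3",
]


def run_bounded(argv, extra_env, timeout_s: float) -> Optional[int]:
    """Run argv for at most timeout_s seconds.

    Returns the exit code, or None if the process was still running at the
    deadline (in which case it is terminated and reaped).
    """
    env = {**os.environ, **extra_env}
    proc = subprocess.Popen(argv, env=env)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()
        return None
```

A call such as `run_bounded(REPRO_ARGV, REPRO_ENV, 1800)` inside the container would stop the attempt after 30 minutes. A `None` result does not by itself distinguish a hang from a healthy server that is still running, so it needs to be read together with the logs.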

Observed behavior

The last meaningful log output is:

(Worker_TP0 pid=46) INFO 05-08 11:11:01 [v1/worker/gpu_model_runner.py:4777] Starting to load model nvidia/Qwen3.5-397B-A17B-NVFP4...
(Worker_TP0 pid=46) INFO 05-08 11:11:02 [platforms/cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=46) INFO 05-08 11:11:02 [model_executor/.../mamba/gdn_linear_attn.py:153] Using Triton/FLA GDN prefill kernel
(Worker_TP0 pid=46) DEBUG 05-08 11:11:02 [model_executor/.../oracle/nvfp4.py:283] NvFp4 MoE backend 'FLASHINFER_TRTLLM' does not support the deployment configuration since kernel does not support current device cuda.
(Worker_TP0 pid=46) DEBUG 05-08 11:11:02 [model_executor/.../oracle/nvfp4.py:283] NvFp4 MoE backend 'FLASHINFER_CUTEDSL' does not support the deployment configuration since kernel does not support current device cuda.
(Worker_TP0 pid=46) DEBUG 05-08 11:11:02 [model_executor/.../oracle/nvfp4.py:283] NvFp4 MoE backend 'FLASHINFER_CUTEDSL_BATCHED' does not support the deployment configuration since kernel does not support current device cuda.
(Worker_TP0 pid=46) INFO 05-08 11:11:02 [model_executor/.../oracle/nvfp4.py:280] Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend out of potential backends: ['FLASHINFER_TRTLLM', 'FLASHINFER_CUTEDSL', 'FLASHINFER_CUTEDSL_BATCHED', 'FLASHINFER_CUTLASS', 'VLLM_CUTLASS', 'MARLIN', 'EMULATION'].
(Worker_TP0 pid=46) INFO 05-08 11:11:03 [platforms/cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(Worker_TP0 pid=46) DEBUG 05-08 11:11:07 [model_executor/model_loader/base_loader.py:63] Loading weights on cuda ...
(Worker_TP0 pid=46) DEBUG 05-08 11:11:09 [model_executor/model_loader/weight_utils.py:591] Using model weights format [['model-00009-of-00011.safetensors', ...]]

After this, no further log output from any worker. The APIServer then loops indefinitely:

(APIServer pid=7) DEBUG 05-08 11:36:09 [v1/engine/utils.py:1168] Waiting for 1 local, 0 remote core engine proc(s) to start.
(APIServer pid=7) DEBUG 05-08 11:36:19 [v1/engine/utils.py:1168] Waiting for 1 local, 0 remote core engine proc(s) to start.
... (repeats every 10 seconds indefinitely)
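The stall can be measured from the logs themselves by comparing each worker's last timestamp with the newest line from any process. The following is a rough sketch that assumes the line prefix shown in the excerpts above ("(Name pid=N) LEVEL MM-DD HH:MM:SS"); the helper names and the 300-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime

# Matches the prefix used in the logs above, for example:
# "(Worker_TP0 pid=46) INFO 05-08 11:11:01 [...] message"
LINE_RE = re.compile(
    r"^\((?P<proc>\w+) pid=\d+\) (?P<level>[A-Z]+) "
    r"(?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
)


def last_seen(lines):
    """Return {process name: datetime of its most recent log line}."""
    seen = {}
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            # The logs carry no year, so a placeholder year is prepended;
            # only differences between timestamps are used.
            seen[m["proc"]] = datetime.strptime(
                "2000-" + m["ts"], "%Y-%m-%d %H:%M:%S"
            )
    return seen


def silent_workers(lines, threshold_s=300):
    """Return {worker: seconds since its last line} for stalled workers."""
    seen = last_seen(lines)
    if not seen:
        return {}
    newest = max(seen.values())
    gaps = {
        proc: (newest - ts).total_seconds()
        for proc, ts in seen.items()
        if proc.startswith("Worker")
    }
    return {proc: gap for proc, gap in gaps.items() if gap >= threshold_s}
```

Applied to the excerpts above, Worker_TP0 last logged at 11:11:09 and the APIServer at 11:36:19, so the function reports a gap of 1510 seconds (25 minutes 10 seconds) for Worker_TP0.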

Additional context

  • GPU topology: PIX between GPU0-GPU1 and GPU2-GPU3, NODE between the two pairs (PCIe-only, no NVLink)
  • NCCL init completes successfully on all 4 ranks with P2P/CUMEM channels established
  • SymmMemCommunicator not available for compute capability 12.0 (Blackwell)
  • Custom allreduce disabled: "not supported on more than two PCIe-only GPUs"
  • NVFP4 quantization is experimental per the log warning: "Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future."
  • The first 3 of the 7 candidate NvFp4 MoE backends rejected the device (FLASHINFER_TRTLLM, FLASHINFER_CUTEDSL, and FLASHINFER_CUTEDSL_BATCHED all report "kernel does not support current device cuda"), so vLLM fell back to the fourth candidate, FLASHINFER_CUTLASS
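The backend fallback recorded in the log is consistent with a first-match walk over a priority list. The sketch below only illustrates that pattern and is not vLLM's actual selection code: `select_backend` and the `is_supported` predicate are hypothetical, while the candidate names and their order come from the log line.

```python
# Candidate order copied from the "out of potential backends" log line.
CANDIDATES = [
    "FLASHINFER_TRTLLM",
    "FLASHINFER_CUTEDSL",
    "FLASHINFER_CUTEDSL_BATCHED",
    "FLASHINFER_CUTLASS",
    "VLLM_CUTLASS",
    "MARLIN",
    "EMULATION",
]

# Backends that logged "kernel does not support current device cuda".
REJECTED_IN_LOG = {
    "FLASHINFER_TRTLLM",
    "FLASHINFER_CUTEDSL",
    "FLASHINFER_CUTEDSL_BATCHED",
}


def select_backend(candidates, is_supported):
    """Return (first supported backend, list of backends rejected before it)."""
    rejected = []
    for name in candidates:
        if is_supported(name):
            return name, rejected
        rejected.append(name)
    raise RuntimeError("no NvFp4 MoE backend supports this device")


chosen, rejected = select_backend(
    CANDIDATES, lambda name: name not in REJECTED_IN_LOG
)
print(chosen)  # FLASHINFER_CUTLASS
```

Under this reading, VLLM_CUTLASS, MARLIN, and EMULATION were never probed in this run, so the log does not show whether any of them would load the model on compute capability 12.0.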
