vllm - [Bug]: EP Deadlock with Hybrid GDN/Mamba Architecture (Qwen3.5)

vllm, 2026-05-06 19:40:36

vllm-project/vllm#41862 (fetched 2026-05-07 03:32:19)

Author: pbelevich

All Expert Parallelism (EP) configurations deadlock on Qwen3.5-397B-A17B-FP8 during CUDA graph capture or the profile forward pass. Worker rank 0 completes torch.compile, but the other ranks hang indefinitely after the Mamba page size initialization step.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

</details>

🐛 Describe the bug

Summary

Severity

High — blocks all EP configurations for the Qwen3.5 model family.

Environment

Component	Version
vLLM	0.20.0
PyTorch	2.11.0+cu130
CUDA	13.0.3
NVIDIA Driver	580.105.08
NCCL	2.30.4-1

Hardware

Instance: 1× AWS p5en.48xlarge
GPU: 8× NVIDIA H200 (141 GiB each)

`collect_env.py` Output

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

Steps to Reproduce

Config 1: DP=8 EP=8 with CUDA Graphs

VLLM_USE_DEEP_GEMM=0 VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --language-model-only \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

Config 2: DP=8 EP=8 with enforce-eager

VLLM_USE_DEEP_GEMM=0 VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --trust-remote-code \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --language-model-only \
  --enforce-eager \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

Both configurations deadlock.

Observed Behavior

With CUDA Graphs (Config 1):

1. All 8 ranks load model successfully (56.3 GiB/GPU, 35-40s)
2. All ranks reach "Setting attention block size to 2096 tokens" (Mamba page setup)
3. Rank 0 (Worker_DP0_EP0) completes torch.compile: torch.compile took 60.76 s in total
4. Ranks 1-7: No further log output. Stuck forever.
5. EngineCore processes report: No available shared memory broadcast block found in 60 seconds

With enforce-eager (Config 2):

1. Same as Config 1 through step 2
2. Rank 0 logs dp_utils.py:28 Using CPU all reduce to synchronize DP padding between ranks
3. All ranks reach Mamba page size messages
4. All 8 EngineCore processes stuck in shm_broadcast wait
5. No torch.compile occurs (eager mode), but still deadlocks
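
Since the stuck ranks print nothing further, one generic way to see where each process is blocked is to have Python dump every thread's stack after a timeout. The sketch below uses only the standard-library `faulthandler` module. The helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical, and how such a hook would be wired into vLLM worker processes is not covered by this report and is left as an assumption.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, file=sys.stderr):
    """Dump all thread stacks to `file` every `timeout_s` seconds until disarmed.

    `file` must be backed by a real file descriptor, because faulthandler
    writes to the descriptor directly; an in-memory StringIO will not work.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog():
    """Cancel the pending dump once the guarded section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

The intended use is to arm the watchdog before a step that may hang (for example, the profile forward pass) and disarm it afterwards. If the step never returns, the periodic dumps show which call each thread is waiting in.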

Contrast: Working Configuration

TP=8 (same model, same node) with --gdn-prefill-backend triton works perfectly:

VLLM_USE_DEEP_GEMM=0 VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --language-model-only \
  --gdn-prefill-backend triton \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

This configuration achieves 12.79 ms P50 TPOT with CUDA graphs.

Root Cause Hypothesis

EP requires all ranks to participate in all-to-all communication during the profile/first-forward pass. The hybrid Mamba/GDN layers have state that doesn't participate in expert routing (they're not MoE layers), but their execution is interleaved with MoE layers in the model graph. The profile forward pass likely:

1. Attempts all-to-all for MoE layers (collective)
2. Hits Mamba/GDN state initialization (per-rank)
3. The ordering mismatch between ranks causes a deadlock
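
The hypothesized failure mode can be illustrated with a toy model that has nothing to do with vLLM internals: each "rank" is a thread, and each collective is a `threading.Barrier` that every rank must reach. The function name `run_ranks` and the operation names `moe_all2all` and `dp_sync` are made up for this sketch. It only shows why ranks that issue collectives in different orders block each other; it does not confirm that this is what happens in the reported runs.

```python
import threading


def run_ranks(schedules, timeout=1.0):
    """Run one thread per rank; each rank executes its collectives in order.

    Each collective name maps to a barrier sized to the world. A rank that
    waits longer than `timeout` is recorded as hung (in a real job it would
    block forever). Returns the set of hung rank indices.
    """
    world = len(schedules)
    names = {name for sched in schedules for name in sched}
    barriers = {name: threading.Barrier(world) for name in names}
    hung = set()
    lock = threading.Lock()

    def worker(rank, sched):
        for name in sched:
            try:
                barriers[name].wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # Either this rank timed out or a peer on the same barrier did.
                with lock:
                    hung.add(rank)
                return

    threads = [
        threading.Thread(target=worker, args=(rank, sched))
        for rank, sched in enumerate(schedules)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hung


# All ranks agree on the order: nobody hangs.
matched = [["moe_all2all", "dp_sync"] for _ in range(4)]
# Rank 0 enters the two collectives in the opposite order: every rank stalls.
mismatched = [["dp_sync", "moe_all2all"]] + [["moe_all2all", "dp_sync"] for _ in range(3)]
```

With these schedules, `run_ranks(matched)` returns an empty set, while `run_ranks(mismatched)` reports all four ranks as hung: rank 0 waits at one barrier and ranks 1-3 wait at the other, so neither barrier ever fills.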

Model Architecture Notes

Qwen3.5-397B-A17B-FP8:

60 layers, interleaved: standard attention + GDN linear attention + Mamba SSM
512 experts, top-10 routing, ~17B active
GQA: 32 query heads, 2 KV heads, head_dim=256
Mamba page size: 2096 tokens (0.58% padding)
Uses AgRsAll2AllManager (allgather_reducescatter) as default EP backend
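
To make the EP=8 numbers concrete, the sketch below computes which experts a rank would own if the 512 experts were split evenly and contiguously across 8 ranks. That placement is an assumption made for illustration (the placement vLLM actually uses may differ), and `local_experts` is a hypothetical helper name.

```python
def local_experts(rank, num_experts=512, ep_size=8):
    """Return the expert indices owned by `rank` under an even, contiguous split."""
    if not 0 <= rank < ep_size:
        raise ValueError(f"rank must be in [0, {ep_size}), got {rank}")
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)
```

Under this assumption each rank holds 64 experts, so a token routed to 10 experts will usually need experts that live on other ranks, which is why every rank has to take part in the MoE collective.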

Expected Behavior

Either:

1. EP should work with hybrid GDN/Mamba architectures, or
2. vLLM should detect the incompatibility and refuse to start with a clear error: "Expert parallelism is not supported for models with GDN/Mamba hybrid layers. Use --tensor-parallel-size instead."
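
A fail-fast check of the kind requested in the second option might look like the sketch below. Everything in it is hypothetical: `LaunchConfig`, `check_expert_parallel_support`, and the layer-type labels are invented for illustration and do not correspond to vLLM's actual configuration classes. Only the error text is taken from this report.

```python
from dataclasses import dataclass, field

# Hypothetical labels for layer kinds that would be treated as incompatible with EP.
HYBRID_LAYER_TYPES = frozenset({"gdn", "mamba"})


@dataclass
class LaunchConfig:
    """Hypothetical stand-in for the parsed launch options and model layout."""
    enable_expert_parallel: bool
    layer_types: list = field(default_factory=list)


def check_expert_parallel_support(cfg):
    """Raise at startup instead of hanging later in the profile forward pass."""
    if not cfg.enable_expert_parallel:
        return
    if HYBRID_LAYER_TYPES.intersection(cfg.layer_types):
        raise ValueError(
            "Expert parallelism is not supported for models with GDN/Mamba "
            "hybrid layers. Use --tensor-parallel-size instead."
        )
```

Running such a check before any worker is spawned would turn the silent hang into an immediate, actionable error.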

cc: @nkumaraws
