vllm - 💡(How to fix) Fix [Bug]: NIXL Disagg Does Not Support GDN Attention (Qwen3.5 Hybrid) [1 comments, 2 participants]

vllm • 2026-05-06 19:33:41

vllm-project/vllm#41860•Fetched 2026-05-07 03:32:21

Author

pbelevich

Participants

pbelevich

ZhanqiuHu

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

The NIXL KV connector (NixlConnector) cannot transfer the GDN (Gated DeltaNet) attention state used by Qwen3.5 models. The 3-read conv transfer path only supports Mamba2 models, which blocks disaggregated serving for the entire Qwen3.5 family.

Error Message

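ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

This is the first of three errors in the chain. It and the second error (a RuntimeError about the conv state layout) have fixes; the third, a NotImplementedError for mamba_type='gdn_attention', is the actual blocker.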

Root Cause

Qwen3.5's GDN (Gated DeltaNet) attention layers use a state representation that differs from both the standard attention KV cache and the Mamba2 SSM state. The NIXL connector's 3-read conv transfer implementation handles only the first two:

- Standard attention KV cache ✓
- Mamba2 SSM state ✓
- GDN attention state ✗

GDN layers report mamba_type='gdn_attention', which the 3-read path rejects. The error text attributes the limitation to a temporal shape of (intermediate_size // tp, state_size), from which the receiving end cannot reconstruct intermediate_size.

Fix Action

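Fix / Workaround

The first two errors in the chain have fixes: pass `--no-disable-hybrid-kv-cache-manager` for the ValueError, and set `VLLM_SSM_CONV_STATE_LAYOUT=DS` for the RuntimeError. The third error has no workaround as of vLLM 0.20.0: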

Error 3 (HARD BLOCKER — no workaround):

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

</details>

🐛 Describe the bug

Summary

Severity

Medium — blocks disaggregated serving for a major new model family (Qwen3.5).

Environment

| Component | Version |
| --- | --- |
| vLLM | 0.20.0 |
| NIXL | 1.0.1 |
| PyTorch | 2.11.0+cu130 |
| CUDA | 13.0.3 |

Hardware

- 2× AWS p5en.48xlarge (8×H200 each)
- 16× EFA per node

`collect_env.py` Output

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

Steps to Reproduce

Deploy Qwen3.5-397B-A17B-FP8 with NIXL KV transfer:

export VLLM_USE_DEEP_GEMM=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_NIXL_SIDE_CHANNEL_HOST=0.0.0.0
export VLLM_NIXL_SIDE_CHANNEL_PORT=5557
export VLLM_SSM_CONV_STATE_LAYOUT=DS

KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --language-model-only \
  --gdn-prefill-backend triton \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --kv-transfer-config "$KV_CONFIG" \
  --host 0.0.0.0 \
  --port 8100

Model loads, torch.compile completes, CUDA graphs captured.
Crash during KV cache initialization.
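
The KV_CONFIG value in the steps above is hand-written JSON inside shell quotes, which is easy to break when editing. A minimal sketch, assuming vLLM is launched from a Python wrapper, builds the same string with `json.dumps`. The helper name `build_kv_config` is hypothetical; only the keys and values are taken from the command above.

```python
import json
import shlex


def build_kv_config(backends=("LIBFABRIC",)):
    """Return the --kv-transfer-config JSON used in the reproduction steps.

    build_kv_config is a hypothetical helper; only the keys and values
    come from the issue.
    """
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_load_failure_policy": "fail",
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    # Compact separators reproduce the exact string from the issue.
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    kv_config = build_kv_config()
    # shlex.quote makes the value safe to paste into a shell command line.
    print("--kv-transfer-config", shlex.quote(kv_config))
```

Generating the string this way rules out quoting mistakes as a cause of launch failures, but it does not change the outcome of the reproduction.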

Error Progression

Two errors had to be resolved before hitting the hard blocker:

Error 1 (fixed with `--no-disable-hybrid-kv-cache-manager`):

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

Error 2 (fixed with `VLLM_SSM_CONV_STATE_LAYOUT=DS`):

RuntimeError: 3-read Mamba conv transfer requires DS conv state layout. Set VLLM_SSM_CONV_STATE_LAYOUT=DS

Error 3 (HARD BLOCKER — no workaround):

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.
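
The progression above can be captured as a small lookup table, which may help when scanning launch logs for the same failure chain. This is a sketch only: `KNOWN_ERRORS` and `triage` are hypothetical names, not part of vLLM, and the message fragments and remedies are copied from the three errors listed here.

```python
# Hypothetical triage table: the message fragments and remedies are taken
# from the error progression in this issue; nothing here is a vLLM API.
KNOWN_ERRORS = [
    (
        "Hybrid KV cache manager is disabled",
        "pass --no-disable-hybrid-kv-cache-manager to vllm serve",
    ),
    (
        "requires DS conv state layout",
        "export VLLM_SSM_CONV_STATE_LAYOUT=DS before launching",
    ),
    (
        "3-read conv transfer only supports Mamba2 models",
        None,  # hard blocker: no workaround reported for vLLM 0.20.0
    ),
]


def triage(log_line):
    """Return (matched, remedy) for one log line from a failed launch."""
    for fragment, remedy in KNOWN_ERRORS:
        if fragment in log_line:
            return True, remedy
    return False, None
```

A result of `(True, None)` means the line matches the known blocker, for which the issue reports no remedy.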

Root Cause

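The NIXL connector's 3-read conv transfer implementation handles:

- Standard attention KV cache ✓
- Mamba2 SSM state ✓
- GDN attention state ✗

GDN layers report mamba_type='gdn_attention', which the 3-read path rejects. The error text attributes the limitation to a temporal shape of (intermediate_size // tp, state_size), from which the receiving end cannot reconstruct intermediate_size.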

Impact

All Qwen3.5 models that use hybrid GDN attention are affected:

- Qwen3.5-397B-A17B-FP8
- Qwen3.5-122B-A10B (if hybrid GDN)
- Any future model using GDN attention

Without disaggregated serving, these models can only scale out as independent replicas, with no separation of prefill and decode across instances.

Suggested Fix

Implement a GDN-specific state transfer path in NixlConnector that:

1. Serializes the full GDN state tensor (including intermediate_size metadata)
2. Handles the TP-sharded (intermediate_size // tp, state_size) shape
3. Reconstructs on the consumer side with knowledge of the original intermediate_size

Alternatively, make the connector forward the raw state bytes with shape metadata, without requiring shape reconstruction logic.
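
The "raw bytes with shape metadata" alternative can be illustrated with a CPU-only framing sketch. It assumes a flat float32 shard and a JSON header; the names `pack_state` and `unpack_state` and the header layout are hypothetical and are not how NixlConnector moves GPU memory. The point is only that the producer sends intermediate_size and tp explicitly, so the consumer never has to infer them from the sharded shape.

```python
import json
import struct
from array import array


def pack_state(shard, rows, cols, intermediate_size, tp):
    """Prefix a flat float32 state shard with a JSON header describing it."""
    if rows * cols != len(shard):
        raise ValueError("shape does not match payload length")
    header = json.dumps({
        "shape": [rows, cols],                   # TP-sharded shape on the producer
        "dtype": "float32",
        "intermediate_size": intermediate_size,  # unsharded size, sent explicitly
        "tp": tp,
    }).encode("utf-8")
    # array("f") stores 4-byte floats in native byte order (assumes both
    # ends of the transfer use the same architecture).
    payload = array("f", shard).tobytes()
    # 4-byte little-endian header length, then header, then raw state bytes.
    return struct.pack("<I", len(header)) + header + payload


def unpack_state(blob):
    """Recover the header and the float32 values on the consumer side."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    values = array("f")
    values.frombytes(blob[4 + header_len:])
    return header, values.tolist()
```

For example, a shard with shape (2, 3) from a layer with intermediate_size 16 at tp 8 round-trips through `unpack_state(pack_state(...))` with its header intact, and the consumer reads intermediate_size from the header rather than deriving it.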

cc: @nkumaraws

