vllm - 💡(How to fix) Fix [Bug]: NIXL Disagg Does Not Support GDN Attention (Qwen3.5 Hybrid) [1 comments, 2 participants]

GitHub stats
vllm-project/vllm#41860 · Fetched 2026-05-07 03:32:21
Comments: 1 · Participants: 2 · Timeline events: 5 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

The NIXL KV connector (NixlConnector) cannot transfer the GDN (Gated DeltaNet) attention state used by Qwen3.5 models. Its 3-read conv transfer path only supports Mamba2 models, which blocks disaggregated serving for the entire Qwen3.5 family.

Error Message

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

Root Cause

Qwen3.5's GDN (Gated DeltaNet) attention layers use a state representation that differs from both the standard attention KV cache and the Mamba2 SSM state. The NIXL connector's 3-read conv transfer implementation only handles:

  • Standard attention KV cache ✓
  • Mamba2 SSM state ✓
  • GDN attention state ✗

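GDN state is reported with mamba_type='gdn_attention' and is rejected by the same check that excludes Mamba1, whose temporal shape (intermediate_size // tp, state_size) cannot be used to reconstruct intermediate_size on the receiving end (per the error message).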

Fix / Workaround

No workaround is known for the final error (Error 3 in the report's error progression); it is a hard blocker:

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA H200
Nvidia driver version        : 580.126.09
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  192
On-line CPU(s) list:                     0-191
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) Platinum 8488C
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      48
Socket(s):                               2
Stepping:                                8
BogoMIPS:                                4800.00
L1d cache:                               4.5 MiB (96 instances)
L1i cache:                               3 MiB (96 instances)
L2 cache:                                192 MiB (96 instances)
L3 cache:                                210 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-47,96-143
NUMA node1 CPU(s):                       48-95,144-191

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  GPU0: NVIDIA H200 (141 GiB)
  8 GPUs per node connected via NVLink (NV18)
  Cross-node: 16× EFA v2 (100 Gbps each, 1.6 Tbps total)

==============================
     Environment Variables
==============================
CUDA_VERSION=13.0.3
NCCL_VERSION=2.28.3-1
TORCH_CUDA_ARCH_LIST=8.0 8.6 8.9 9.0 10.0
VLLM_TARGET_DEVICE=cuda
CUDA_HOME=/usr/local/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

Summary

The NIXL KV connector (NixlConnector) cannot transfer the GDN (Gated DeltaNet) attention state used by Qwen3.5 models. Its 3-read conv transfer path only supports Mamba2 models, which blocks disaggregated serving for the entire Qwen3.5 family.

Severity

Medium — blocks disaggregated serving for a major new model family (Qwen3.5).

Environment

| Component | Version      |
|-----------|--------------|
| vLLM      | 0.20.0       |
| NIXL      | 1.0.1        |
| PyTorch   | 2.11.0+cu130 |
| CUDA      | 13.0.3       |

Hardware

  • 2× AWS p5en.48xlarge (8×H200 each)
  • 16× EFA per node

collect_env.py Output

PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
OS                           : Ubuntu 24.04.4 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
CMake version                : version 3.28.3
Python version               : 3.12.3 (64-bit runtime)
Python platform              : Linux-6.12.64-87.122.amzn2023.x86_64-x86_64-with-glibc2.39
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU models and configuration : GPU 0-7: NVIDIA H200 (141 GiB each)
Nvidia driver version        : 580.126.09
vLLM Version                 : 0.20.0
vLLM Build Flags             : CUDA Archs: 8.0 8.6 8.9 9.0 10.0; ROCm: Disabled; XPU: Disabled
NCCL_VERSION                 : 2.28.3-1
TORCH_CUDA_ARCH_LIST         : 8.0 8.6 8.9 9.0 10.0

[pip3] flashinfer-python==0.6.8.post1
[pip3] torch==2.11.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvshmem-cu13==3.4.5

Steps to Reproduce

  1. Deploy Qwen3.5-397B-A17B-FP8 with NIXL KV transfer:
export VLLM_USE_DEEP_GEMM=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_NIXL_SIDE_CHANNEL_HOST=0.0.0.0
export VLLM_NIXL_SIDE_CHANNEL_PORT=5557
export VLLM_SSM_CONV_STATE_LAYOUT=DS

KV_CONFIG='{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_load_failure_policy":"fail","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --language-model-only \
  --gdn-prefill-backend triton \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --no-disable-hybrid-kv-cache-manager \
  --kv-transfer-config "$KV_CONFIG" \
  --host 0.0.0.0 \
  --port 8100
  2. The model loads, torch.compile completes, and CUDA graphs are captured.
  3. The server crashes during KV cache initialization.
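
The single-line JSON passed to --kv-transfer-config is easy to misquote by hand. As a minimal sketch, the same string can be generated from a dictionary. The field names and values are copied from the report; the helper name build_kv_transfer_config is hypothetical and not part of vLLM.

```python
import json
import shlex


def build_kv_transfer_config(backends):
    """Return the --kv-transfer-config JSON string used in the report.

    Hypothetical helper: only the keys and values come from the report.
    """
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_load_failure_policy": "fail",
        "kv_connector_extra_config": {"backends": list(backends)},
    }
    # Compact separators reproduce the single-line form shown in the report.
    return json.dumps(config, separators=(",", ":"))


kv_config = build_kv_transfer_config(["LIBFABRIC"])
# shlex.quote makes the JSON safe to paste into a shell command line.
print("--kv-transfer-config", shlex.quote(kv_config))
```

Generating the string this way keeps the quoting consistent when the backend list or role changes between the prefill and decode instances.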

Error Progression

Two errors had to be resolved before hitting the third, a hard blocker:

Error 1 (fixed with --no-disable-hybrid-kv-cache-manager):

ValueError: Hybrid KV cache manager is disabled but failed to convert the KV cache specs to one unified type.

Error 2 (fixed with VLLM_SSM_CONV_STATE_LAYOUT=DS):

RuntimeError: 3-read Mamba conv transfer requires DS conv state layout. Set VLLM_SSM_CONV_STATE_LAYOUT=DS

Error 3 (HARD BLOCKER — no workaround):

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.
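
The two resolvable errors above come down to one CLI flag and one environment variable, so they can be caught before a long model load. The following is a rough preflight sketch, assuming the launch arguments and environment are available as a list and a dict; the function name preflight_problems is hypothetical, and the check does nothing for Error 3.

```python
def preflight_problems(env, argv):
    """Return human-readable problems matching Errors 1 and 2 in the report."""
    problems = []
    # Error 1 in the report was resolved by passing this flag.
    if "--no-disable-hybrid-kv-cache-manager" not in argv:
        problems.append("missing --no-disable-hybrid-kv-cache-manager")
    # Error 2 in the report was resolved by setting this variable to DS.
    if env.get("VLLM_SSM_CONV_STATE_LAYOUT") != "DS":
        problems.append("VLLM_SSM_CONV_STATE_LAYOUT must be DS")
    return problems


good = preflight_problems(
    {"VLLM_SSM_CONV_STATE_LAYOUT": "DS"},
    ["vllm", "serve", "--no-disable-hybrid-kv-cache-manager"],
)
bad = preflight_problems({}, ["vllm", "serve"])
print(good, bad)
```

In a real launcher, os.environ and sys.argv could be passed in place of the literal values used here.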

Root Cause

Qwen3.5's GDN (Gated DeltaNet) attention layers use a state representation that differs from both the standard attention KV cache and the Mamba2 SSM state. The NIXL connector's 3-read conv transfer implementation only handles:

  • Standard attention KV cache ✓
  • Mamba2 SSM state ✓
  • GDN attention state ✗

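GDN state is reported with mamba_type='gdn_attention' and is rejected by the same check that excludes Mamba1, whose temporal shape (intermediate_size // tp, state_size) cannot be used to reconstruct intermediate_size on the receiving end (per the error message).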
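
The failure mode can be pictured as a dispatch on the state type that has no branch for GDN. The toy model below is not vLLM's code: the set name, the function name, and the 'mamba2' string are assumptions made for illustration, and only 'gdn_attention' and the error text come from the report.

```python
# Hypothetical stand-in for the connector's per-state-type check.
# The "mamba2" label is an assumed value, not taken from vLLM.
SUPPORTED_CONV_TRANSFER_TYPES = {"mamba2"}


def select_conv_transfer(mamba_type):
    """Pick a transfer strategy, mirroring the failure reported in the issue."""
    if mamba_type in SUPPORTED_CONV_TRANSFER_TYPES:
        return "3-read conv transfer"
    # Any other state type, including GDN, has no transfer path.
    raise NotImplementedError(
        "3-read conv transfer only supports Mamba2 models, "
        f"got mamba_type={mamba_type!r}."
    )


print(select_conv_transfer("mamba2"))
try:
    select_conv_transfer("gdn_attention")
except NotImplementedError as exc:
    print("rejected:", exc)
```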

Impact

All Qwen3.5 models are affected:

  • Qwen3.5-397B-A17B-FP8
  • Qwen3.5-122B-A10B (if hybrid GDN)
  • Any future model using GDN attention

Without disaggregated serving, these models can only scale via independent replicas, with no separation of prefill and decode across instances.

Suggested Fix

Implement a GDN-specific state transfer path in NixlConnector that:

  1. Serializes the full GDN state tensor (including intermediate_size metadata)
  2. Handles the TP-sharded (intermediate_size // tp, state_size) shape
  3. Reconstructs on the consumer side with knowledge of the original intermediate_size

Alternatively, make the connector forward the raw state bytes with shape metadata, without requiring shape reconstruction logic.
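
The alternative of forwarding raw bytes with shape metadata can be sketched with the standard library alone. This is an illustration of the idea, not a proposal for NixlConnector's wire format: pack_state, unpack_state, the header layout, and the example sizes are all hypothetical, and how the metadata would actually travel between instances is outside this sketch.

```python
import json
import struct


def pack_state(payload, shape, dtype, extra=None):
    """Prefix raw state bytes with a length-delimited JSON header."""
    header = json.dumps(
        {"shape": list(shape), "dtype": dtype, "extra": extra or {}}
    ).encode("utf-8")
    # 4-byte big-endian header length, then the header, then the raw payload.
    return struct.pack(">I", len(header)) + header + payload


def unpack_state(buf):
    """Inverse of pack_state: return (metadata dict, raw payload bytes)."""
    (header_len,) = struct.unpack_from(">I", buf, 0)
    meta = json.loads(buf[4 : 4 + header_len].decode("utf-8"))
    return meta, bytes(buf[4 + header_len :])


# Illustrative values only: one TP shard of shape (intermediate_size // tp, state_size).
intermediate_size, tp, state_size = 8, 2, 4
shard_shape = (intermediate_size // tp, state_size)
payload = bytes(range(shard_shape[0] * shard_shape[1]))  # one byte per element
blob = pack_state(
    payload, shard_shape, "uint8",
    {"intermediate_size": intermediate_size, "tp": tp},
)
meta, raw = unpack_state(blob)
print(meta["shape"], meta["extra"]["intermediate_size"], raw == payload)
```

Because the original intermediate_size travels in the header, the consumer does not need to infer it from the sharded shape, which is the reconstruction step the error message says is not possible.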

cc: @nkumaraws

