vllm - 💡(How to fix) Fix [Feature]: Support PD disaggregation / KV transfer for hybrid SSM/GDN models such as Qwen3.5-397B-A17B-W8A8

🚀 The feature, motivation and pitch

Feature request

I would like to ask whether vLLM plans to support PD disaggregation / KV transfer for hybrid SSM/GDN models, such as Qwen3.5-397B-A17B-W8A8, and whether there is an expected release version or roadmap for this support.

Background

Qwen3.5-397B-A17B-W8A8 is a hybrid attention model. Its text_config.layer_types contains both linear_attention and full_attention layers.

The layer pattern is roughly:

linear_attention
linear_attention
linear_attention
full_attention
...

In my environment, the KV cache groups are split as follows:

group[0]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 0,4,8,...
group[1]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 1,5,9,...
group[2]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 2,6,10,...
group[3]: FullAttentionSpec, self_attn/full_attention layers 3,7,11,...

So this model contains both:
MambaSpec / GDN / linear_attention state cache
FullAttentionSpec / full_attention KV cache
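
To make the grouping above concrete, here is a minimal sketch that rebuilds the same layer-to-group assignment from a repeating 3:1 pattern. It only mimics the observed output; it is not how vLLM builds its cache groups. The names `PATTERN`, `group_layers`, and `spec_name` are hypothetical, and the pattern itself is an assumption based on the layer list and the layer counts given under Environment.

```python
from collections import defaultdict

# Assumption: three linear_attention layers followed by one full_attention
# layer, repeated over 60 text layers (45 linear + 15 full).
PATTERN = ["linear_attention"] * 3 + ["full_attention"]
NUM_LAYERS = 60
layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]


def group_layers(types, period=4):
    """Bucket layer indices by their position inside the repeating pattern."""
    groups = defaultdict(list)
    for idx in range(len(types)):
        groups[idx % period].append(idx)
    return dict(groups)


def spec_name(kind):
    """Hypothetical mapping from layer type to the spec name seen in the logs."""
    return "MambaSpec" if kind == "linear_attention" else "FullAttentionSpec"


groups = group_layers(layer_types)
for gid, layers in sorted(groups.items()):
    kind = layer_types[layers[0]]
    print(f"group[{gid}]: {spec_name(kind)}, layers {layers[:3]}...")
```

Under these assumptions, groups 0 to 2 hold only linear_attention layers and group 3 holds only full_attention layers, which matches the split reported above.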

Issue with MooncakeConnector

When using MooncakeConnector for PD disaggregation, Mooncake currently obtains the attention backend from the first layer:

backend = get_current_attn_backend(vllm_config)

Since the first layer of Qwen3.5-397B-A17B-W8A8 is linear_attention, this returns:

GDNAttentionBackend

Then Mooncake passes this backend into a non-Mamba TransferTopology:

TransferTopology(
    ...
    is_mamba=False,
    attn_backends=[backend],
)

This leads to an invalid combination:

is_mamba=False
attn_backends=[GDNAttentionBackend]

For the FullAttentionSpec group, Mooncake still uses GDNAttentionBackend, and later calls:

attn_backend.get_kv_cache_shape(...)

However, GDNAttentionBackend is an SSM backend and uses MambaSpec / state cache, not standard KV cache. Therefore it does not have a valid standard KV cache shape.
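
The failure mode can be reproduced in isolation with toy classes. The sketch below is not vLLM code: `FullAttentionBackendStub`, `GDNBackendStub`, and `first_layer_backend` are hypothetical stand-ins, the KV cache shape layout is purely illustrative, and the stub raising `NotImplementedError` is an assumption about how a state-cache backend might respond to a standard KV shape query.

```python
class FullAttentionBackendStub:
    """Hypothetical stand-in for a standard attention backend."""

    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        # Illustrative layout only; real backends define their own shapes.
        return (2, num_blocks, block_size, num_kv_heads, head_size)


class GDNBackendStub:
    """Hypothetical stand-in for an SSM-style backend that keeps a state cache."""

    @staticmethod
    def get_kv_cache_shape(*args, **kwargs):
        # Assumption: a state-cache backend cannot answer a KV shape query.
        raise NotImplementedError("no standard KV cache shape for a state cache")


# Per-layer backends for one period of the pattern: 3 linear + 1 full attention.
layer_backends = [GDNBackendStub] * 3 + [FullAttentionBackendStub]


def first_layer_backend(backends):
    """Mimic choosing one global backend from the first layer."""
    return backends[0]


backend = first_layer_backend(layer_backends)
try:
    backend.get_kv_cache_shape(8, 16, 2, 128)
    outcome = "ok"
except NotImplementedError as exc:
    outcome = f"failed: {exc}"
print(outcome)
```

Because the first layer is a linear_attention layer, the globally chosen backend is the state-cache stub, and the shape query meant for the full-attention group fails.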

Issue with NixlConnector

I also tried NixlConnector. NIXL can detect MambaSpec and enters the SSM conv state transfer path. After setting:

export VLLM_SSM_CONV_STATE_LAYOUT=DS

it proceeds further, but then fails with:

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.

So it seems that the current NIXL SSM conv transfer supports Mamba2, but not gdn_attention.
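
A small pre-flight check along the following lines would surface this limitation before startup rather than during transfer setup. This is a hypothetical sketch: `SUPPORTED_CONV_TRANSFER_TYPES` and `unsupported_mamba_types` are not vLLM names, and the supported set is an assumption inferred only from the error message above.

```python
# Assumption: inferred from the error message, only Mamba2 is supported
# by the 3-read conv transfer path.
SUPPORTED_CONV_TRANSFER_TYPES = {"mamba2"}


def unsupported_mamba_types(mamba_types, supported=SUPPORTED_CONV_TRANSFER_TYPES):
    """Return the sorted mamba_type values that are not in the supported set."""
    return sorted(set(mamba_types) - set(supported))


# mamba_type of the three MambaSpec groups reported for this model.
model_mamba_types = ["gdn_attention"] * 3
missing = unsupported_mamba_types(model_mamba_types)
if missing:
    print(f"conv state transfer not supported for: {missing}")
```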

Questions

1. Is PD disaggregation / KV transfer for hybrid SSM/GDN models such as Qwen3.5-397B-A17B-W8A8 planned in vLLM?
2. Is gdn_attention support planned for MooncakeConnector or NixlConnector?
3. Is there an expected version or roadmap for this support?
4. Is there any recommended workaround for running PD disaggregation with models that contain both MambaSpec and FullAttentionSpec cache groups?
5. Should Mooncake be updated to choose attention backends per cache group instead of using get_current_attn_backend() from the first layer?

Expected behavior

For hybrid models, the connector should handle cache groups separately, for example:

MambaSpec / gdn_attention groups:
    use GDN / SSM state transfer logic

FullAttentionSpec groups:
    use normal full-attention KV transfer logic with a non-SSM backend

It should not use the first layer backend globally for all cache groups.
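
The per-group handling described here could look roughly like the sketch below. It is an illustration of the requested behavior, not a proposal for the actual connector code: `CacheGroup`, `transfer_state`, `transfer_kv`, `HANDLERS`, and `plan_transfers` are all hypothetical names, and the handlers return labels instead of moving any data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CacheGroup:
    """Hypothetical, simplified view of one KV cache group."""
    spec: str                      # "MambaSpec" or "FullAttentionSpec"
    mamba_type: Optional[str]      # e.g. "gdn_attention"; None for full attention
    layers: Tuple[int, ...]


def transfer_state(group):
    """Placeholder for SSM / GDN state transfer logic."""
    return f"state transfer ({group.mamba_type})"


def transfer_kv(group):
    """Placeholder for standard full-attention KV transfer logic."""
    return "standard KV transfer"


# Dispatch on the spec of each group, not on the backend of the first layer.
HANDLERS = {"MambaSpec": transfer_state, "FullAttentionSpec": transfer_kv}


def plan_transfers(groups):
    """Pick a transfer path for every cache group independently."""
    plan = []
    for group in groups:
        handler = HANDLERS.get(group.spec)
        if handler is None:
            raise ValueError(f"no transfer path for spec {group.spec!r}")
        plan.append(handler(group))
    return plan


groups = [
    CacheGroup("MambaSpec", "gdn_attention", (0, 4, 8)),
    CacheGroup("MambaSpec", "gdn_attention", (1, 5, 9)),
    CacheGroup("MambaSpec", "gdn_attention", (2, 6, 10)),
    CacheGroup("FullAttentionSpec", None, (3, 7, 11)),
]
plan = plan_transfers(groups)
print(plan)
```

Keying the dispatch on the cache spec means a model whose first layer is linear_attention no longer forces an SSM backend onto the full-attention group.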

Environment

vLLM version: 0.20.0
Model: Qwen3.5-397B-A17B-W8A8
Connectors tested:
    MooncakeConnector
    NixlConnector
Model structure:
    60 text layers
    45 linear_attention layers
    15 full_attention layers
    MambaSpec groups with mamba_type='gdn_attention'
    FullAttentionSpec group for full_attention layers

It would be very helpful to know whether this is already on the roadmap and, if so, which vLLM release is expected to include support for gdn_attention state transfer in PD disaggregation.

### Alternatives



I tried the following alternatives:

1. MooncakeConnector

Mooncake currently gets the attention backend from the first layer through `get_current_attn_backend(vllm_config)`. Since this model starts with a `linear_attention` layer, Mooncake gets `GDNAttentionBackend`. This backend is then used even for the `FullAttentionSpec` cache group, which causes the standard KV cache shape path to call into a GDN/SSM backend.

2. NixlConnector

NIXL can detect `MambaSpec` and reaches the SSM conv state transfer path. However, it currently fails for this model because the 3-read conv transfer only supports Mamba2, while this model uses `mamba_type='gdn_attention'`.

3. SimpleCPUOffloadConnector

This may be usable only as a functional workaround to verify the transfer path, but it is not a suitable performance solution because it likely requires CPU-side staging/copying instead of high-performance device-memory transfer.

Currently, I do not have a production-ready workaround for PD disaggregation with this hybrid `MambaSpec` + `FullAttentionSpec` model.
