vllm - [Feature]: Support PD disaggregation / KV transfer for hybrid SSM/GDN models such as Qwen3.5-397B-A17B-W8A8




🚀 The feature, motivation and pitch

Feature request

I would like to ask whether vLLM plans to support PD disaggregation / KV transfer for hybrid SSM/GDN models, such as Qwen3.5-397B-A17B-W8A8, and whether there is an expected release version or roadmap for this support.

Background

Qwen3.5-397B-A17B-W8A8 is a hybrid attention model. Its text_config.layer_types contains both linear_attention and full_attention layers.

The layer pattern is roughly:

linear_attention
linear_attention
linear_attention
full_attention
...

In my environment, the KV cache groups are split as follows:

group[0]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 0,4,8,...
group[1]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 1,5,9,...
group[2]: MambaSpec, mamba_type='gdn_attention', linear_attn layers 2,6,10,...
group[3]: FullAttentionSpec, self_attn/full_attention layers 3,7,11,...

So this model contains both:
MambaSpec / GDN / linear_attention state cache
FullAttentionSpec / full_attention KV cache
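The group split listed above is consistent with assigning each layer to a group by its position in the repeating four-layer pattern. The following is a minimal sketch of that arithmetic, assuming a strict 3:1 repetition over the 60 text layers reported in the Environment section. The `PATTERN` constant and the `describe` helper are illustrative only; vLLM builds its KV cache groups internally and may derive them differently.

```python
from collections import defaultdict

NUM_LAYERS = 60  # text layer count reported in the Environment section
# Assumption: the pattern repeats strictly; the issue only says "roughly".
PATTERN = ["linear_attention", "linear_attention", "linear_attention", "full_attention"]

layer_types = [PATTERN[i % len(PATTERN)] for i in range(NUM_LAYERS)]

# Group layers by their offset inside the repeating pattern.
groups = defaultdict(list)
for idx in range(NUM_LAYERS):
    groups[idx % len(PATTERN)].append(idx)


def describe(group_id):
    """Return (spec label, first three layer indices) for one group."""
    kind = layer_types[groups[group_id][0]]
    spec = "MambaSpec(gdn_attention)" if kind == "linear_attention" else "FullAttentionSpec"
    return spec, groups[group_id][:3]


for gid in sorted(groups):
    print(f"group[{gid}]:", *describe(gid))
```

Under this assumption the counts come out to 45 `linear_attention` layers and 15 `full_attention` layers, which matches the model structure reported in the Environment section.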

Issue with MooncakeConnector

When using MooncakeConnector for PD disaggregation, Mooncake currently obtains the attention backend from the first layer:

backend = get_current_attn_backend(vllm_config)

Since the first layer of Qwen3.5-397B-A17B-W8A8 is linear_attention, this returns:

GDNAttentionBackend

Then Mooncake passes this backend into a non-Mamba TransferTopology:

TransferTopology(
    ...
    is_mamba=False,
    attn_backends=[backend],
)

This leads to an invalid combination:

is_mamba=False
attn_backends=[GDNAttentionBackend]

For the FullAttentionSpec group, Mooncake still uses GDNAttentionBackend, and later calls:

attn_backend.get_kv_cache_shape(...)

However, GDNAttentionBackend is an SSM backend and uses MambaSpec / state cache, not standard KV cache. Therefore it does not have a valid standard KV cache shape.
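The failure mode described above can be reproduced in isolation with stand-in classes. This is a toy sketch, not vLLM code: `GDNBackendStub`, `FullAttentionBackendStub`, and both helper functions are hypothetical, and the stub raising `NotImplementedError` is an assumption about how "no valid standard KV cache shape" might surface. The real `GDNAttentionBackend` may fail differently.

```python
class FullAttentionBackendStub:
    """Hypothetical stand-in for a standard attention backend."""

    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        # Illustrative layout only; real backends define their own shape.
        return (2, num_blocks, block_size, num_kv_heads, head_size)


class GDNBackendStub:
    """Hypothetical stand-in for an SSM/GDN backend that keeps a state cache."""

    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size):
        # Assumption: a state-cache backend cannot answer this question.
        raise NotImplementedError("state cache backend has no standard KV cache shape")


# One backend per position in the repeating layer pattern.
layer_backends = [GDNBackendStub, GDNBackendStub, GDNBackendStub, FullAttentionBackendStub]
group_specs = ["MambaSpec", "MambaSpec", "MambaSpec", "FullAttentionSpec"]


def first_layer_backend(backends):
    """Mimic picking one backend globally from the first layer."""
    return backends[0]


def kv_shapes_for_full_attention(backend, specs):
    """Ask the single global backend for the KV shape of every full-attention group."""
    return {
        gid: backend.get_kv_cache_shape(8, 16, 2, 128)
        for gid, spec in enumerate(specs)
        if spec == "FullAttentionSpec"
    }


global_backend = first_layer_backend(layer_backends)
try:
    kv_shapes_for_full_attention(global_backend, group_specs)
    failed = False
except NotImplementedError as exc:
    failed = True
    print(f"global backend {global_backend.__name__} failed: {exc}")
```

Because the first layer is `linear_attention`, the globally selected backend is the GDN stub, and the request for the `FullAttentionSpec` group's shape fails even though a suitable backend exists for that group.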

Issue with NixlConnector

I also tried NixlConnector. NIXL can detect MambaSpec and enters the SSM conv state transfer path. After setting:

export VLLM_SSM_CONV_STATE_LAYOUT=DS

it proceeds further, but then fails with:

NotImplementedError: 3-read conv transfer only supports Mamba2 models, got mamba_type='gdn_attention'.
Mamba1 SSM temporal shape is (intermediate_size // tp, state_size) which cannot be used to reconstruct intermediate_size.

So it seems that the current NIXL SSM conv transfer supports Mamba2, but not gdn_attention.

Questions

1. Is PD disaggregation / KV transfer for hybrid SSM/GDN models such as Qwen3.5-397B-A17B-W8A8 planned in vLLM?
2. Is gdn_attention support planned for MooncakeConnector or NixlConnector?
3. Is there an expected version or roadmap for this support?
4. Is there any recommended workaround for running PD disaggregation with models that contain both MambaSpec and FullAttentionSpec cache groups?
5. Should Mooncake be updated to choose attention backends per cache group instead of using get_current_attn_backend() from the first layer?

Expected behavior

For hybrid models, the connector should handle cache groups separately, for example:

MambaSpec / gdn_attention groups:
    use GDN / SSM state transfer logic

FullAttentionSpec groups:
    use normal full-attention KV transfer logic with a non-SSM backend

It should not use the first layer backend globally for all cache groups.
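The per-group handling requested above could look roughly like the following dispatch. This is a sketch of the idea only, under stated assumptions: `CacheGroup`, `transfer_ssm_state`, `transfer_kv_blocks`, and `plan_transfers` are hypothetical names that do not exist in vLLM, and the handlers return descriptive strings instead of moving any data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheGroup:
    """Hypothetical, simplified view of one KV cache group."""
    group_id: int
    spec: str                         # "MambaSpec" or "FullAttentionSpec"
    mamba_type: Optional[str] = None  # e.g. "gdn_attention" for MambaSpec groups


def transfer_ssm_state(group):
    # Placeholder for GDN / SSM state transfer logic.
    return f"group[{group.group_id}]: SSM state transfer ({group.mamba_type})"


def transfer_kv_blocks(group):
    # Placeholder for standard KV transfer with a non-SSM backend.
    return f"group[{group.group_id}]: full-attention KV transfer"


def plan_transfers(groups):
    """Choose a transfer path per cache group instead of one global backend."""
    plan = []
    for group in groups:
        if group.spec == "MambaSpec":
            plan.append(transfer_ssm_state(group))
        elif group.spec == "FullAttentionSpec":
            plan.append(transfer_kv_blocks(group))
        else:
            raise ValueError(f"unsupported cache spec: {group.spec}")
    return plan


# The four groups reported for this model.
groups = [
    CacheGroup(0, "MambaSpec", "gdn_attention"),
    CacheGroup(1, "MambaSpec", "gdn_attention"),
    CacheGroup(2, "MambaSpec", "gdn_attention"),
    CacheGroup(3, "FullAttentionSpec"),
]

for line in plan_transfers(groups):
    print(line)
```

The design point is that the decision is keyed on each group's spec, so the `FullAttentionSpec` group never sees an SSM backend regardless of which layer type comes first in the model.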

Environment

vLLM version: 0.20.0
Model: Qwen3.5-397B-A17B-W8A8
Connectors tested:
    MooncakeConnector
    NixlConnector
Model structure:
    60 text layers
    45 linear_attention layers
    15 full_attention layers
    MambaSpec groups with mamba_type='gdn_attention'
    FullAttentionSpec group for full_attention layers

It would be very helpful to know whether this is already on the roadmap and, if so, which vLLM release is expected to include support for gdn_attention state transfer in PD disaggregation.

### Alternatives



I tried the following alternatives:

1. MooncakeConnector

Mooncake currently gets the attention backend from the first layer through `get_current_attn_backend(vllm_config)`. Since this model starts with a `linear_attention` layer, Mooncake gets `GDNAttentionBackend`. This backend is then used even for the `FullAttentionSpec` cache group, which causes the standard KV cache shape path to call into a GDN/SSM backend.

2. NixlConnector

NIXL can detect `MambaSpec` and reaches the SSM conv state transfer path. However, it currently fails for this model because the 3-read conv transfer only supports Mamba2, while this model uses `mamba_type='gdn_attention'`.

3. SimpleCPUOffloadConnector

This may be usable only as a functional workaround to verify the transfer path, but it is not a suitable performance solution because it likely requires CPU-side staging/copying instead of high-performance device-memory transfer.

Currently, I do not have a production-ready workaround for PD disaggregation with this hybrid `MambaSpec` + `FullAttentionSpec` model.
