vllm - 💡(How to fix) Fix [Feature]: Batch-invariant support for GDN_ATTN (Qwen3-Next / Qwen3.6 hybrid Mamba+GDN MoE models)



Your current environment

<details> <summary>Versions</summary>
  • vLLM: 0.21.0 (Docker image vllm/vllm-openai:v0.21.0) and nightly (digest sha256:d1bd760bf6630f67378206c7945afb6ab9bc046064a51fe421461e91261dcd7b, pulled 2026-05-18)
  • PyTorch: 2.11.0+cu130
  • CUDA: 13.0
  • GPU: NVIDIA A100-SXM4-80GB (compute capability 8.0, SM80)
  • Model: cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit (Qwen3-Next-style hybrid Mamba + Gated-Delta-Net + softmax-attention MoE; quantization: compressed-tensors)
  • TP: 1
</details>

🐛 Describe the bug / 🛠 Feature request

Setting VLLM_BATCH_INVARIANT=1 on a model that contains GDN (Gated-Delta-Net) linear-attention layers causes engine startup to abort with:

RuntimeError: VLLM batch_invariant mode is not supported for GDN_ATTN.

Source: vllm/v1/attention/selector.py:154 in _cached_get_mamba_attn_backend.

This is a hard incompatibility — no fallback, no partial mode. It blocks reproducibility work for all Qwen3-Next / Qwen3.6-style models (and any other hybrid Mamba/GDN architecture).
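Until the linear-attention path is covered, a launcher script could catch this combination before the engine starts. The following is a minimal sketch, not part of vLLM: `has_linear_attention` and `preflight_warning` are hypothetical helper names, and the code assumes the model's `config.json` follows the Hugging Face convention of a `layer_types` list (possibly nested under `text_config`) in which GDN layers are tagged `"linear_attention"`. Check the actual config of your model before relying on it.

```python
from typing import Optional


def has_linear_attention(config: dict) -> bool:
    """Return True if the config lists at least one linear-attention layer.

    Assumption: layers are described by a `layer_types` list, either at the
    top level or nested under `text_config`.
    """
    text_config = config.get("text_config") or config
    layer_types = text_config.get("layer_types") or []
    return "linear_attention" in layer_types


def preflight_warning(config: dict, env: dict) -> Optional[str]:
    """Return a warning if batch-invariant mode is requested for a GDN model."""
    if env.get("VLLM_BATCH_INVARIANT") == "1" and has_linear_attention(config):
        return (
            "Model contains linear-attention (GDN) layers; "
            "VLLM_BATCH_INVARIANT=1 is expected to abort engine startup."
        )
    return None


# Usage with an inline, made-up config fragment (not the real model config).
example_config = {
    "text_config": {
        "layer_types": ["linear_attention", "linear_attention", "full_attention"]
    }
}
print(preflight_warning(example_config, {"VLLM_BATCH_INVARIANT": "1"}))
```

In practice the `config` dict would come from `json.load` on the model's `config.json` and `env` from `os.environ`; both are passed in explicitly here so the check stays easy to test.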

Reproduction

docker run --rm --gpus all --ipc host \
  -e HUGGING_FACE_HUB_TOKEN=... \
  -e VLLM_BATCH_INVARIANT=1 \
  vllm/vllm-openai:v0.21.0 \
  --model cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit \
  --trust-remote-code \
  --max-model-len 20480

Both v0.21.0 and nightly (May 2026) fail with the same error. The check is triggered as soon as the engine selects the Mamba/GDN attention backend during init, before any AWQ-kernel logic runs — so it is independent of --quantization, --attention-backend, VLLM_ATTENTION_BACKEND (unrecognized in 0.21.0), and other workaround knobs.
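The ordering argument above can be illustrated with a toy model. This is a sketch under stated assumptions, not vLLM's actual initialization code: `toy_engine_init` and its two phases are invented for illustration, and only the error string is taken from the report. It shows why a guard that fires during backend selection cannot be influenced by anything configured in a later phase.

```python
GDN_ERROR = "VLLM batch_invariant mode is not supported for GDN_ATTN."


def toy_engine_init(layer_kinds, batch_invariant, quantization=None):
    """Toy model of the init ordering described in the report (hypothetical)."""
    steps = []
    # Phase 1: pick an attention backend per layer kind. The guard lives here.
    for kind in layer_kinds:
        if kind == "gdn" and batch_invariant:
            raise RuntimeError(GDN_ERROR)
        steps.append(f"backend:{kind}")
    # Phase 2: quantization setup runs only after every backend was selected.
    steps.append(f"quant:{quantization}")
    return steps


# The quantization setting never changes the outcome when the guard fires.
for quant in (None, "awq", "compressed-tensors"):
    try:
        toy_engine_init(["gdn", "full"], batch_invariant=True, quantization=quant)
    except RuntimeError as exc:
        print(f"quantization={quant!r}: {exc}")
```

With `batch_invariant=False` the same call returns the full list of steps, which matches the observation that the failure depends only on the batch-invariant flag and the presence of GDN layers.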

Related

  • #42456 — added SM80 batch-invariant support for the regular attention path. This issue requests the analogous coverage for the linear-attention (Mamba/GDN) path.
  • #29581 — closed; addressed batch-invariance for AWQ-Marlin kernels on the non-hybrid Qwen3-30B-A3B. Different layer family (no GDN); doesn't cover this case.
  • #32992 — closed; B200 / Blackwell + torch.compile. Different hardware.

Happy to test patches on A100 + Qwen3.6-A3B if helpful.
