vllm - [Bug] turboquant_4bit_nc KV cache fails unify_kv_cache_spec_page_size assertion on hybrid Qwen3_5 (GDN + attention) architecture [2 comments, 1 participant]

GitHub stats: vllm-project/vllm#41560 (fetched 2026-05-04 04:58:52)
Comments: 2 · Participants: 1 · Timeline events: 4 (commented ×2, closed ×1, cross-referenced ×1) · Reactions: 0

Summary

--kv-cache-dtype turboquant_4bit_nc causes a deterministic AssertionError at engine init when used with hybrid Qwen3_5ForConditionalGeneration models (Qwen3.6 27B family). The turboquant page-size bytes do not satisfy the mamba/GDN-aligned max page-size constraint after vLLM's internal padding step in unify_kv_cache_spec_page_size.

The crash is consistent and unrecoverable — no combination of --max-model-len, --gpu-memory-utilization, or --max-num-seqs resolves it. The vLLM build used is v0.20.0-tq-hybrid-v3, which specifically targets hybrid architectures.

Reproduction

python -m vllm.entrypoints.openai.api_server \
  --model Intel/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round \
  --kv-cache-dtype turboquant_4bit_nc \
  --max-model-len 96000 \
  --gpu-memory-utilization 0.97 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE","cudagraph_capture_sizes":[1]}' \
  --no-scheduler-reserve-full-isl \
  --language-model-only \
  --skip-mm-profiling \
  --trust-remote-code

Crash:

AssertionError
  File "vllm/v1/core/kv_cache_utils.py", line 1030, in unify_kv_cache_spec_page_size

Stack context: Model weights load successfully (~16.6 GiB / 118 s). The engine enters profile_cudagraph_memory → _init_minimal_kv_cache_for_profiling, then crashes at the page-size unification check.

The crash occurs before any KV blocks are allocated, so there is no partial-init state to recover from.

Affected configuration

  • Model: Intel/Qwen3.6-27B-int4-AutoRound (Qwen3_5ForConditionalGeneration)
  • Architecture: Hybrid GDN (Gated DeltaNet) + full attention layers interleaved
  • KV dtype: turboquant_4bit_nc
  • vLLM build: v0.20.0-tq-hybrid-v3 (internal build targeting hybrid arch support)

The Qwen3_5ForConditionalGeneration architecture interleaves DeltaNet/GDN blocks with standard attention blocks. Each layer type produces a different KV cache spec geometry. The mamba/GDN layers use a different page-size than attention layers.

Suspected root cause

unify_kv_cache_spec_page_size (line 1030 in vllm/v1/core/kv_cache_utils.py) walks all layer KV specs and asserts that turboquant_4bit_nc page-size bytes are compatible with the max page-size across all specs after a 0.06% padding step. On a hybrid model, the GDN/mamba layer page geometry differs from attention layer page geometry. The turboquant_4bit_nc page-size calculation does not account for the hybrid page-size variance after padding, causing the assertion to fail.
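To make the suspected failure mode concrete, here is a minimal sketch of a page-size unification check. It is a simplified model, not vLLM's actual implementation: the LayerSpec class, the unify function, the fixed per-page overhead, and every byte count are hypothetical, chosen only to show how a page size that does not scale linearly with block size can break a "grow every layer to the max page size" step.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical stand-in for a per-layer KV cache spec."""
    block_size: int          # tokens per page
    bytes_per_token: int     # payload bytes per token
    page_overhead: int = 0   # fixed per-page bytes (assumed metadata/padding)

    @property
    def page_size_bytes(self) -> int:
        return self.block_size * self.bytes_per_token + self.page_overhead


def unify(specs: dict[str, LayerSpec]) -> dict[str, LayerSpec]:
    """Grow each layer's block size so all layers share the max page size."""
    max_page = max(s.page_size_bytes for s in specs.values())
    unified = {}
    for name, spec in specs.items():
        if spec.page_size_bytes == max_page:
            unified[name] = spec
            continue
        if max_page % spec.page_size_bytes != 0:
            raise NotImplementedError(f"{name}: page size does not divide {max_page}")
        ratio = max_page // spec.page_size_bytes
        new_spec = replace(spec, block_size=spec.block_size * ratio)
        # Holds only when page size scales linearly with block size.
        assert new_spec.page_size_bytes == max_page, name
        unified[name] = new_spec
    return unified


# Linear page size: 16 * 512 = 8192 bytes grows cleanly to 32768.
ok = unify({"attn": LayerSpec(16, 512), "gdn": LayerSpec(1, 32768)})

# Non-linear page size: 16 * 256 + 4096 = 8192 divides 32768, but scaling the
# block size by 4 yields 64 * 256 + 4096 = 20480, so the assertion fails.
bad = {"attn": LayerSpec(16, 256, page_overhead=4096), "gdn": LayerSpec(1, 32768)}
```

Calling unify(bad) raises AssertionError in this model. Whether turboquant_4bit_nc fails for this reason or for a different one is not confirmed by the issue; the sketch only illustrates one way the check can trip.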

With fp8_e5m2 or auto (bfloat16) KV dtype, this assertion passes cleanly on the same model.

Suggested fix paths

  1. Adjust turboquant_4bit_nc page-size calculation to be compatible with the unified max page-size including the padding step on hybrid architectures
  2. Skip the unify_kv_cache_spec_page_size assertion for non-attention KV slots (GDN/mamba slots have a different geometry that should not be held to the attention page-size constraint)
  3. Document that turboquant_4bit_nc is incompatible with hybrid GDN+attention architectures until this is resolved
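Fix path 1 could take roughly the following shape: derive the block size from the target page size and carry the remainder as explicit padding, so the padded page always equals the unified maximum. This is a sketch under assumptions, not the project's fix: fit_to_page, its parameters, and the sample numbers are hypothetical.

```python
def fit_to_page(max_page_bytes: int, bytes_per_token: int,
                block_multiple: int = 1) -> tuple[int, int]:
    """Return (block_size, padding_bytes) so the padded page is max_page_bytes.

    block_multiple is an assumed alignment requirement on the block size.
    """
    if bytes_per_token <= 0 or block_multiple <= 0:
        raise ValueError("bytes_per_token and block_multiple must be positive")
    # Largest token count that fits, rounded down to the required multiple.
    block_size = (max_page_bytes // bytes_per_token) // block_multiple * block_multiple
    if block_size == 0:
        raise ValueError("max_page_bytes is too small for a single block")
    padding = max_page_bytes - block_size * bytes_per_token
    return block_size, padding


block_size, padding = fit_to_page(32768, 260)
# 126 tokens * 260 bytes = 32760 payload bytes, plus 8 bytes of padding.
```

The design choice here is to treat padding as an output of the calculation rather than a fixed percentage, which makes the equality with the maximum page size hold by construction.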

Related issues

  • #38041 — Related: _reshape_kv_cache AssertionError on Qwen3.5 hybrid arch with V2 runner (different assertion site, same hybrid arch root class)

Environment

  • vLLM version: 0.20.0 (image harbor.focuscell.org/infra/vllm-openai:v0.20.0-tq-hybrid-v3)
  • Hardware: NVIDIA RTX 3090 24 GiB (Ampere, SM 86)
  • Model: Intel/Qwen3.6-27B-int4-AutoRound (Qwen3_5ForConditionalGeneration)
  • OS: Talos Linux, CUDA 12.4

extent analysis

TL;DR

The most likely fix is to adjust the turboquant_4bit_nc page-size calculation to account for the hybrid page-size variance after padding in unify_kv_cache_spec_page_size.

Guidance

  • Review the unify_kv_cache_spec_page_size function in vllm/v1/core/kv_cache_utils.py to understand the page-size unification logic and how it fails for hybrid architectures.
  • Consider implementing one of the suggested fix paths: adjusting the turboquant_4bit_nc page-size calculation, skipping the assertion for non-attention KV slots, or documenting the incompatibility.
  • Verify the fix by running the reproduction command with the modified code and checking if the AssertionError is resolved.
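For the verification step, one option is to capture the server log and check it for the specific traceback frame reported in the issue. The helper below is a hypothetical sketch (hit_page_size_assertion is not part of vLLM); it matches only the file and function names quoted in the crash output, not the line number, since that can shift between builds.

```python
import re

# Matches the traceback frame quoted in the issue, ignoring the line number.
_FRAME = re.compile(
    r'File ".*kv_cache_utils\.py", line \d+, in unify_kv_cache_spec_page_size'
)


def hit_page_size_assertion(log_text: str) -> bool:
    """True if the log shows the AssertionError from the page-size check."""
    return "AssertionError" in log_text and _FRAME.search(log_text) is not None


crash_log = (
    "AssertionError\n"
    '  File "vllm/v1/core/kv_cache_utils.py", line 1030, '
    "in unify_kv_cache_spec_page_size\n"
)
```

A run with the modified code passes this check when hit_page_size_assertion returns False for the captured log and the server proceeds past engine init.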

Example

No code snippet is provided as the issue requires a deeper understanding of the unify_kv_cache_spec_page_size function and the turboquant_4bit_nc page-size calculation.

Notes

The issue is specific to the turboquant_4bit_nc KV dtype and hybrid GDN+attention architectures, and the fix may not be applicable to other configurations.

Recommendation

Apply a workaround by using a different KV dtype, such as fp8_e5m2 or auto (bfloat16), which are known to work with the same model, until a permanent fix is implemented.
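As an illustration of the workaround, the sketch below rebuilds the reproduction command with a different KV cache dtype. The flags and values are copied from the reproduction command in the issue; the server_args helper and the KNOWN_GOOD_KV_DTYPES tuple are hypothetical names, and the tuple lists only the two dtypes the report says pass the assertion on this model.

```python
# KV cache dtypes reported in the issue as passing on the same model.
KNOWN_GOOD_KV_DTYPES = ("fp8_e5m2", "auto")


def server_args(kv_cache_dtype: str = "fp8_e5m2") -> list[str]:
    """Build the reproduction command line with a substituted KV cache dtype."""
    if kv_cache_dtype not in KNOWN_GOOD_KV_DTYPES:
        raise ValueError(f"untested KV cache dtype: {kv_cache_dtype}")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", "Intel/Qwen3.6-27B-int4-AutoRound",
        "--quantization", "auto_round",
        "--kv-cache-dtype", kv_cache_dtype,
        "--max-model-len", "96000",
        "--gpu-memory-utilization", "0.97",
        "--compilation-config",
        '{"cudagraph_mode":"PIECEWISE","cudagraph_capture_sizes":[1]}',
        "--no-scheduler-reserve-full-isl",
        "--language-model-only",
        "--skip-mm-profiling",
        "--trust-remote-code",
    ]
```

The resulting list can be passed to subprocess.run. Memory headroom will differ from the 4-bit cache, so --max-model-len may need to be lowered on a 24 GiB card.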
