vllm - [Bug] turboquant_4bit_nc KV cache fails unify_kv_cache_spec_page_size assertion on hybrid Qwen3_5 (GDN + attention) architecture [2 comments, 1 participant]

vllm, 2026-05-03 15:14:52

GitHub stats

vllm-project/vllm#41560 • Fetched 2026-05-04 04:58:52

Author

seantechco

Participants

seantechco

Timeline (top)

commented ×2, closed ×1, cross-referenced ×1

--kv-cache-dtype turboquant_4bit_nc causes a deterministic AssertionError at engine init when used with hybrid Qwen3_5ForConditionalGeneration models (Qwen3.6 27B family). The turboquant page-size bytes do not satisfy the mamba/GDN-aligned max page-size constraint after vLLM's internal padding step in unify_kv_cache_spec_page_size.

The crash is consistent and unrecoverable — no combination of --max-model-len, --gpu-memory-utilization, or --max-num-seqs resolves it. The vLLM build used is v0.20.0-tq-hybrid-v3, which specifically targets hybrid architectures.

Reproduction

python -m vllm.entrypoints.openai.api_server \
  --model Intel/Qwen3.6-27B-int4-AutoRound \
  --quantization auto_round \
  --kv-cache-dtype turboquant_4bit_nc \
  --max-model-len 96000 \
  --gpu-memory-utilization 0.97 \
  --compilation-config '{"cudagraph_mode":"PIECEWISE","cudagraph_capture_sizes":[1]}' \
  --no-scheduler-reserve-full-isl \
  --language-model-only \
  --skip-mm-profiling \
  --trust-remote-code

Crash:

AssertionError
  File "vllm/v1/core/kv_cache_utils.py", line 1030, in unify_kv_cache_spec_page_size

Stack context: Model weights load successfully (~16.6 GiB / 118 s). Engine enters profile_cudagraph_memory → _init_minimal_kv_cache_for_profiling, then crashes at the page-size unification check.

The crash occurs before any KV blocks are allocated, so there is no partial-init state to recover from.

Affected configuration

Model: Intel/Qwen3.6-27B-int4-AutoRound (Qwen3_5ForConditionalGeneration)
Architecture: Hybrid GDN (Gated DeltaNet) + full attention layers interleaved
KV dtype: turboquant_4bit_nc
vLLM build: v0.20.0-tq-hybrid-v3 (internal build targeting hybrid arch support)

The Qwen3_5ForConditionalGeneration architecture interleaves DeltaNet/GDN blocks with standard attention blocks. Each layer type produces a different KV cache spec geometry. The mamba/GDN layers use a different page-size than attention layers.

Suspected root cause

unify_kv_cache_spec_page_size (line 1030 in vllm/v1/core/kv_cache_utils.py) walks all layer KV specs and asserts that turboquant_4bit_nc page-size bytes are compatible with the max page-size across all specs after a 0.06% padding step. On a hybrid model, the GDN/mamba layer page geometry differs from attention layer page geometry. The turboquant_4bit_nc page-size calculation does not account for the hybrid page-size variance after padding, causing the assertion to fail.

With fp8_e5m2 or auto (bfloat16) KV dtype, this assertion passes cleanly on the same model.
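The kind of check described above can be illustrated with a toy model. This is a minimal sketch, not vLLM's code: it assumes that "compatible" means the largest page size is an integer multiple of every smaller one, which the issue does not confirm. The function name, layer names, and byte counts are all hypothetical and were chosen only to show a passing and a failing case.

```python
from typing import Dict, List


def incompatible_layers(page_sizes: Dict[str, int]) -> List[str]:
    """Return layer names whose page size does not evenly divide the largest one.

    Assumption: page sizes can only be unified when the maximum page size
    is an integer multiple of each smaller page size.
    """
    if not page_sizes:
        return []
    max_page = max(page_sizes.values())
    return [name for name, size in page_sizes.items() if max_page % size != 0]


# Hypothetical byte counts, not measured from any real model.
aligned = {"gdn.0": 8192, "attn.3": 2048}      # 8192 is a multiple of 2048
misaligned = {"gdn.0": 8192, "attn.3": 2176}   # 8192 is not a multiple of 2176

print(incompatible_layers(aligned))
print(incompatible_layers(misaligned))
```

Under this assumption, a KV dtype whose per-page byte count is not a clean fraction of the GDN/mamba page size would be reported as incompatible, while dtypes with aligned byte counts would pass.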

Suggested fix paths

Adjust turboquant_4bit_nc page-size calculation to be compatible with the unified max page-size including the padding step on hybrid architectures
Skip the unify_kv_cache_spec_page_size assertion for non-attention KV slots (GDN/mamba slots have a different geometry that should not be held to the attention page-size constraint)
Document that turboquant_4bit_nc is incompatible with hybrid GDN+attention architectures until this is resolved
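One hypothetical way to approach the first fix path is to pad the smaller page size up to the nearest value that divides the maximum page size evenly. The sketch below again assumes that divisibility is what the assertion requires, which is not confirmed by the issue. The helper name `pad_to_divisor` and the numbers are illustrative only and do not come from vLLM.

```python
def pad_to_divisor(page_size: int, max_page_size: int) -> int:
    """Return the smallest size >= page_size that divides max_page_size evenly.

    Hypothetical helper: it models padding a page so that the maximum
    page size becomes an integer multiple of it.
    """
    if page_size <= 0 or page_size > max_page_size:
        raise ValueError("page_size must be in the range 1..max_page_size")
    # Try the largest number of pages per max page first; the first count
    # that divides max_page_size gives the smallest valid padded size.
    for pages_per_max in range(max_page_size // page_size, 0, -1):
        if max_page_size % pages_per_max == 0:
            return max_page_size // pages_per_max
    return max_page_size  # not reached: pages_per_max == 1 always divides


# Hypothetical byte counts, not measured from any real model.
print(pad_to_divisor(2048, 8192))  # already aligned, stays 2048
print(pad_to_divisor(2176, 8192))  # padded up to 4096
```

The second call shows the cost of this approach: when valid divisors are sparse, a slightly misaligned page can be padded to nearly twice its size, so a real fix would need to weigh the memory overhead.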

Related issues

#38041 — Related: _reshape_kv_cache AssertionError on Qwen3.5 hybrid arch with V2 runner (different assertion site, same hybrid arch root class)

Environment

vLLM version: 0.20.0 (image harbor.focuscell.org/infra/vllm-openai:v0.20.0-tq-hybrid-v3)
Hardware: NVIDIA RTX 3090 24 GiB (Ampere, SM 86)
Model: Intel/Qwen3.6-27B-int4-AutoRound (Qwen3_5ForConditionalGeneration)
OS: Talos Linux, CUDA 12.4

Internal references

Our investigation: https://git.developerdojo.org/HardMagic/inference-tuning/-/work_items/57

extent analysis

TL;DR

The most likely fix is to adjust the turboquant_4bit_nc page-size calculation to account for the hybrid page-size variance after padding in unify_kv_cache_spec_page_size.

Guidance

Review the unify_kv_cache_spec_page_size function in vllm/v1/core/kv_cache_utils.py to understand the page-size unification logic and how it fails for hybrid architectures.
Consider implementing one of the suggested fix paths: adjusting the turboquant_4bit_nc page-size calculation, skipping the assertion for non-attention KV slots, or documenting the incompatibility.
Verify the fix by running the reproduction command with the modified code and checking if the AssertionError is resolved.
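The verification step can be partly automated by scanning the server log for the crash signature reported in this issue. This is a minimal sketch: the function name is hypothetical, and it only matches the two strings that appear in the reported traceback, so it will not recognize the failure if a later build changes the message or the function name.

```python
def hit_page_size_assertion(log_text: str) -> bool:
    """Return True if the log contains the crash signature from this issue."""
    return (
        "AssertionError" in log_text
        and "unify_kv_cache_spec_page_size" in log_text
    )


# Sample text copied from the traceback in the report.
sample = (
    "AssertionError\n"
    '  File "vllm/v1/core/kv_cache_utils.py", line 1030, '
    "in unify_kv_cache_spec_page_size\n"
)
print(hit_page_size_assertion(sample))
print(hit_page_size_assertion("engine started without errors"))
```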

Example

No patch is provided here, because a correct fix requires a deeper understanding of the unify_kv_cache_spec_page_size function and the turboquant_4bit_nc page-size calculation.

Notes

The issue is specific to the turboquant_4bit_nc KV dtype and hybrid GDN+attention architectures, and the fix may not be applicable to other configurations.

Recommendation

Apply a workaround by using a different KV dtype, such as fp8_e5m2 or auto (bfloat16), which are known to work with the same model, until a permanent fix is implemented.
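The workaround amounts to swapping the value of --kv-cache-dtype in the reproduction command. The sketch below does that on an argument list. The helper name is hypothetical, the argument list is shortened from the reproduction command, and the helper only handles the space-separated `--kv-cache-dtype VALUE` form used in the report.

```python
from typing import List


def with_kv_cache_dtype(argv: List[str], dtype: str) -> List[str]:
    """Return a copy of argv with the --kv-cache-dtype value replaced or appended."""
    out = list(argv)
    flag = "--kv-cache-dtype"
    if flag in out:
        idx = out.index(flag)
        if idx + 1 < len(out):
            out[idx + 1] = dtype   # replace the existing value
        else:
            out.append(dtype)      # flag was last with no value
    else:
        out += [flag, dtype]       # flag absent, so add it
    return out


# Shortened from the reproduction command in the report.
repro = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "Intel/Qwen3.6-27B-int4-AutoRound",
    "--quantization", "auto_round",
    "--kv-cache-dtype", "turboquant_4bit_nc",
    "--max-model-len", "96000",
]
print(" ".join(with_kv_cache_dtype(repro, "fp8_e5m2")))
```

Passing "auto" instead of "fp8_e5m2" gives the other configuration the report describes as working on the same model.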
