vllm [Bug]: AssertionError at kv_cache_utils.py:1042 when a dense draft model is paired with a hybrid-attention main model (DeltaNet+SWA); fails in unify_kv_cache_spec_page_size

vllm · 2026-05-26 00:27:41

Pairing a dense draft model (LocoOperator/LocoOperator-4B) with a hybrid-attention main model (Qwen/Qwen3-Coder-Next-80B-A3B, DeltaNet + sliding-window attention) fails at engine init with a bare AssertionError in unify_kv_cache_spec_page_size — the dense draft's page_size_bytes doesn't equal the hybrid main's max_page_size.

Stack trace (live, deterministic across 3 boot attempts)

File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5951, in profile_cudagraph_memory
    self._init_minimal_kv_cache_for_profiling()
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5870, in _init_minimal_kv_cache_for_profiling
    kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1654, in get_kv_cache_groups
    kv_cache_spec = unify_kv_cache_spec_page_size(kv_cache_spec)
File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1042, in unify_kv_cache_spec_page_size
    assert new_spec.page_size_bytes == max_page_size
AssertionError
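
For readers unfamiliar with this kind of check, here is a toy Python model of a page-size unification step. It is a sketch under stated assumptions, not vLLM's implementation: the classes ToyAttentionSpec and ToyFixedStateSpec, the function unify_page_size, and the page-size formula are all hypothetical stand-ins. It only illustrates one way an equality assertion like the one in the traceback can fail: when a spec's page size does not scale with block_size, enlarging the block cannot bring it up to the maximum. It does not claim to identify the actual cause in this issue.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyAttentionSpec:
    """Hypothetical per-layer spec whose page size scales with block_size."""
    block_size: int
    num_kv_heads: int
    head_size: int
    dtype_bytes: int

    @property
    def page_size_bytes(self) -> int:
        # Assumed formula: one K and one V tensor per page.
        return (2 * self.block_size * self.num_kv_heads
                * self.head_size * self.dtype_bytes)


@dataclass(frozen=True)
class ToyFixedStateSpec:
    """Hypothetical spec whose page size ignores block_size."""
    block_size: int
    state_bytes: int

    @property
    def page_size_bytes(self) -> int:
        return self.state_bytes


def unify_page_size(specs: dict) -> dict:
    """Grow block_size of smaller specs so every page matches the largest."""
    max_page_size = max(s.page_size_bytes for s in specs.values())
    unified = {}
    for name, spec in specs.items():
        if spec.page_size_bytes == max_page_size:
            unified[name] = spec
            continue
        if max_page_size % spec.page_size_bytes != 0:
            raise ValueError(f"{name}: page size does not divide the maximum")
        ratio = max_page_size // spec.page_size_bytes
        new_spec = replace(spec, block_size=spec.block_size * ratio)
        # Bare assertion, mirroring the shape of the check in the traceback.
        assert new_spec.page_size_bytes == max_page_size
        unified[name] = new_spec
    return unified


main = ToyAttentionSpec(block_size=16, num_kv_heads=8, head_size=128, dtype_bytes=1)
draft = ToyAttentionSpec(block_size=16, num_kv_heads=4, head_size=128, dtype_bytes=1)
ok = unify_page_size({"main": main, "draft": draft})

fixed = ToyFixedStateSpec(block_size=16, state_bytes=16384)
try:
    unify_page_size({"main": main, "fixed": fixed})
    failed = False
except AssertionError:
    failed = True
```

In the first call the smaller attention spec doubles its block_size and the pages match. In the second call the fixed-size spec cannot grow, so the bare assertion fires with no message, which is the same symptom operators see here.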

Reproduction

Main: Qwen/Qwen3-Coder-Next-80B-A3B-NVFP4 (DeltaNet + SWA hybrid attention, MoE 3B-active / 80B total)
Draft: LocoOperator/LocoOperator-4B (4B dense)

Args added to an otherwise-working spec:

--speculative-config '{"model": "/draft", "num_speculative_tokens": 3}'

Hardware: 1× RTX PRO 6000 Blackwell (SM_120, 96 GiB), TP=1, --kv-cache-dtype fp8_e4m3
Image: vllm/vllm-openai:v0.20.2-cu129-ubuntu2404

Failure reproduces deterministically — 3 boot attempts ~80s apart, identical traceback, weights load to ~60% before the engine hits the assertion.
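
When scripting the reproduction, the JSON value for --speculative-config is easy to misquote in a shell. The sketch below builds the same flag shown above from a Python dict using only the standard library. The helper name speculative_config_args is hypothetical, and the snippet only formats arguments; it does not launch vLLM.

```python
import json
import shlex


def speculative_config_args(model: str, num_speculative_tokens: int) -> list:
    """Return the flag and its JSON value as separate argv entries."""
    config = {
        "model": model,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return ["--speculative-config", json.dumps(config)]


args = speculative_config_args("/draft", 3)

# shlex.join quotes the JSON so it can be pasted into a POSIX shell.
rendered = shlex.join(args)
print(rendered)
# prints: --speculative-config '{"model": "/draft", "num_speculative_tokens": 3}'
```

Passing the list form directly to a process launcher avoids shell quoting entirely; the rendered string is only useful for copy-pasting into a command line.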

Secondary observation

The assertion at kv_cache_utils.py:1042 has no message attached: it is written as assert x == y, with no trailing , f"..." clause. Operators get a bare AssertionError with no expected-vs-actual values. Even a minimal message, such as assert new_spec.page_size_bytes == max_page_size, f"draft spec page_size_bytes={new_spec.page_size_bytes} != main max_page_size={max_page_size}", would save significant git-archaeology time.
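
To make the difference concrete, here is a small standalone Python sketch of the suggested message. The function check_page_size and the byte values are hypothetical and chosen only for illustration; they are not taken from vLLM or from this failure.

```python
def check_page_size(page_size_bytes: int, max_page_size: int) -> None:
    """Assert equality and report both values when the check fails."""
    assert page_size_bytes == max_page_size, (
        f"draft spec page_size_bytes={page_size_bytes} "
        f"!= main max_page_size={max_page_size}"
    )


try:
    # Illustrative values only.
    check_page_size(16384, 32768)
    message = None
except AssertionError as exc:
    message = str(exc)

print(message)
# prints: draft spec page_size_bytes=16384 != main max_page_size=32768
```

One design note: Python drops assert statements when run with the -O flag, so a check that must always run is sometimes written as an explicit raise of ValueError instead. Which form fits this code path is a decision for the maintainers.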

Known historical working config

A fork tracking commit a29a754d1b (early 2026) had this exact pairing serving correctly and producing a ~2× decode TPS lift on the same hardware class. The regression appears to have been introduced after that point: the validation was either added or tightened.

Ask

Is dense-draft + hybrid-attention-main a supported pairing? It worked in the past.
If yes — what config are we missing?
If no — is there a tracked RFC for adding support, and can we contribute a minimal repro?

RFC #42082 — Standardize KV-cache Layouts
Closed RFC #31634 — Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA

Reproduction host is a Blackwell fleet running Qwen3-Coder-Next-80B-A3B as the AGENTIC_CODE primary — restoring this spec-decode path would meaningfully reduce per-user decode latency.
