vllm [Bug]: AssertionError at kv_cache_utils.py:1042 - dense draft model + hybrid-attention main (DeltaNet+SWA) fails in unify_kv_cache_spec_page_size

Summary

Pairing a dense draft model (LocoOperator/LocoOperator-4B) with a hybrid-attention main model (Qwen/Qwen3-Coder-Next-80B-A3B, DeltaNet + sliding-window attention) fails at engine init with a bare AssertionError in unify_kv_cache_spec_page_size — the dense draft's page_size_bytes doesn't equal the hybrid main's max_page_size.

Stack trace (live, deterministic across 3 boot attempts)

File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5951, in profile_cudagraph_memory
    self._init_minimal_kv_cache_for_profiling()
File "/usr/local/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5870, in _init_minimal_kv_cache_for_profiling
    kv_cache_groups = get_kv_cache_groups(self.vllm_config, kv_cache_spec)
File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1654, in get_kv_cache_groups
    kv_cache_spec = unify_kv_cache_spec_page_size(kv_cache_spec)
File "/usr/local/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 1042, in unify_kv_cache_spec_page_size
    assert new_spec.page_size_bytes == max_page_size
AssertionError

Reproduction

  • Main: Qwen/Qwen3-Coder-Next-80B-A3B-NVFP4 (DeltaNet + SWA hybrid attention, MoE 3B-active / 80B total)
  • Draft: LocoOperator/LocoOperator-4B (4B dense)
  • Args added to an otherwise-working spec:
    --speculative-config '{"model": "/draft", "num_speculative_tokens": 3}'
  • Hardware: 1× RTX PRO 6000 Blackwell (SM_120, 96 GiB), TP=1, --kv-cache-dtype fp8_e4m3
  • Image: vllm/vllm-openai:v0.20.2-cu129-ubuntu2404
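The arguments listed above can be assembled programmatically to avoid shell-quoting mistakes in the JSON value. The following is a minimal sketch, not the reporter's actual launch script: the issue does not show the full command, so the `vllm serve <model>` form and the `--tensor-parallel-size` spelling of TP=1 are assumptions, and `/draft` is presumably a path mounted inside the container.

```python
import json
import shlex

# Speculative-decoding settings exactly as quoted in the report.
speculative_config = {"model": "/draft", "num_speculative_tokens": 3}

# Assumed command layout; only --speculative-config and --kv-cache-dtype
# appear verbatim in the issue.
args = [
    "vllm", "serve", "Qwen/Qwen3-Coder-Next-80B-A3B-NVFP4",
    "--tensor-parallel-size", "1",
    "--kv-cache-dtype", "fp8_e4m3",
    "--speculative-config", json.dumps(speculative_config),
]

# shlex.join quotes the JSON argument so it survives a POSIX shell intact.
command = shlex.join(args)
print(command)
```

Removing the last two list entries gives the baseline without speculative decoding, which is the configuration the report describes as otherwise working.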

The failure reproduces deterministically: three boot attempts about 80 s apart produced identical tracebacks, with weights loading to roughly 60% before the engine hits the assertion.

Secondary observation

The assertion at kv_cache_utils.py:1042 has no message attached (it is a plain assert x == y with no , f"..." clause), so operators get a bare AssertionError with no expected-vs-actual values. Even a minimal message, such as assert new_spec.page_size_bytes == max_page_size, f"draft spec page_size_bytes={new_spec.page_size_bytes} != main max_page_size={max_page_size}", would save significant git-archaeology time.
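To illustrate the difference a message makes, here is a toy sketch of an assertion that reports both values. It is not vLLM's code: `ToySpec`, `scale_to_page_size`, and the `fixed_bytes` field are hypothetical names invented for this example, and `fixed_bytes` is only a convenient way to make the toy fail, not a claim about why the real pairing mismatches.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySpec:
    """Hypothetical stand-in for a per-layer KV cache spec (not vLLM's class)."""
    block_size: int       # tokens per page
    bytes_per_token: int  # bytes of KV state per token
    fixed_bytes: int = 0  # hypothetical per-page bytes that ignore block_size

    @property
    def page_size_bytes(self) -> int:
        return self.block_size * self.bytes_per_token + self.fixed_bytes


def scale_to_page_size(spec: ToySpec, max_page_size: int) -> ToySpec:
    """Grow block_size so the page matches max_page_size, or fail loudly."""
    ratio = max_page_size // spec.page_size_bytes
    new_spec = replace(spec, block_size=spec.block_size * ratio)
    # Same check as in the traceback, plus the message proposed in the issue.
    assert new_spec.page_size_bytes == max_page_size, (
        f"draft spec page_size_bytes={new_spec.page_size_bytes} "
        f"!= main max_page_size={max_page_size}"
    )
    return new_spec
```

With a spec whose page size scales linearly, such as `ToySpec(16, 1024)` against a 65536-byte target, the call succeeds. With `ToySpec(16, 1024, fixed_bytes=512)` it raises an AssertionError whose text names both sizes, which is the information missing from the bare traceback.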

Known historical working config

A fork tracking commit a29a754d1b (early 2026) had this exact pairing serving correctly and producing a ~2× decode TPS lift on the same hardware class. The regression therefore appears to have been introduced after that commit, most likely because the validation was either added or tightened.

Ask

  1. Is dense-draft + hybrid-attention-main a supported pairing? It worked in the past.
  2. If yes — what config are we missing?
  3. If no — is there a tracked RFC for adding support, and can we contribute a minimal repro?

Related

  • RFC #42082 — Standardize KV-cache Layouts
  • Closed RFC #31634 — Decouple page_size_bytes calculation in AttentionSpec for TPU/RPA

Reproduction host is a Blackwell fleet running Qwen3-Coder-Next-80B-A3B as the AGENTIC_CODE primary — restoring this spec-decode path would meaningfully reduce per-user decode latency.
