vllm - ✅ (Solved) Fix V2 model runner crashes on Qwen3.5 mixed attention (linear + full) [1 pull request, 1 comment, 2 participants]

vllm-project/vllm#38041 · Fetched 2026-04-08 01:26:56

When enabling VLLM_USE_V2_MODEL_RUNNER=1 with Qwen3.5 models (Qwen3_5ForConditionalGeneration / Qwen3_5ForCausalLM), the engine crashes during KV cache initialization with an AssertionError in _reshape_kv_cache.

Error Message

(EngineCore_DP0 pid=220) ERROR [core.py:1100]
  File ".../vllm/v1/worker/gpu/attn_utils.py", line 112, in _reshape_kv_cache
    assert isinstance(kv_cache_spec, AttentionSpec)
AssertionError

Root Cause

Qwen3.5 uses a hybrid architecture with both full attention (GQA) and linear attention (GDN/Mamba-like) layers. The layer_types config alternates between "linear_attention" and "full_attention":

"layer_types": ["linear_attention", "linear_attention", "linear_attention", "full_attention", ...]

The linear attention layers produce a KV cache spec that is not an AttentionSpec (it's a recurrent state spec). The V2 model runner's _reshape_kv_cache in attn_utils.py assumes all specs are AttentionSpec and crashes.
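
The failure mode can be reproduced without vLLM. The sketch below is only an illustration: `AttentionSpec` and `RecurrentStateSpec` are hypothetical stand-in classes, not vLLM's real types, and the mapping from `layer_types` to spec types is an assumption. It shows why a single-type assertion breaks on a hybrid layer list while a type dispatch does not.

```python
from dataclasses import dataclass


@dataclass
class AttentionSpec:
    """Stand-in for a paged KV cache spec (hypothetical, not vLLM's class)."""
    block_size: int


@dataclass
class RecurrentStateSpec:
    """Stand-in for the recurrent state spec of a linear attention layer."""
    state_shape: tuple


def spec_for(layer_type: str):
    # Assumed mapping for illustration: full attention -> paged KV cache,
    # linear attention -> fixed-size recurrent state.
    if layer_type == "full_attention":
        return AttentionSpec(block_size=16)
    return RecurrentStateSpec(state_shape=(4, 8))


def reshape_strict(spec):
    # Mirrors the failing pattern: every spec is assumed to be an AttentionSpec.
    assert isinstance(spec, AttentionSpec)
    return "paged"


def reshape_by_type(spec):
    # Dispatch on the spec type instead of asserting a single type.
    if isinstance(spec, AttentionSpec):
        return "paged"
    if isinstance(spec, RecurrentStateSpec):
        return "state"
    raise NotImplementedError(type(spec).__name__)


layer_types = ["linear_attention"] * 3 + ["full_attention"]
specs = [spec_for(t) for t in layer_types]

try:
    [reshape_strict(s) for s in specs]
    strict_ok = True
except AssertionError:
    strict_ok = False

kinds = [reshape_by_type(s) for s in specs]
print(strict_ok, kinds)
```

The strict version fails on the first linear attention layer, which matches the crash at startup rather than at inference time.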

Fix Action

Fixed

PR fix notes

PR #38081: [Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5)

Description (problem / solution / changelog)

Summary

  • Fix V2 model runner (VLLM_USE_V2_MODEL_RUNNER=1) crash on hybrid attention models like Qwen3.5
  • Root cause: _reshape_kv_cache() in attn_utils.py only handled AttentionSpec, but Qwen3.5's linear attention (Gated DeltaNet) layers produce MambaSpec, causing AssertionError at startup
  • Fix: Port MambaSpec handling from V1 model runner's _reshape_kv_cache_tensors() to V2's _reshape_kv_cache(), using the same torch.as_strided approach for state tensor reshaping
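
The strided-view idea behind that approach can be sketched with plain index arithmetic. The code below is not the PR's code: the function names, the single shared dtype, and the assumption that state tensors are packed back to back inside each page are all illustrative. It computes the `size`, `stride`, and `storage_offset` that a strided view would need for each state tensor.

```python
from math import prod


def contiguous_strides(shape):
    """Row-major strides, in elements, for a tensor of the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def state_layouts(num_blocks, page_size_bytes, state_shapes, dtype_size):
    """Compute (size, stride, storage_offset) per state tensor.

    Assumes all states share one dtype of `dtype_size` bytes and are packed
    back to back inside each page of `page_size_bytes` bytes.
    """
    assert page_size_bytes % dtype_size == 0
    elems_per_page = page_size_bytes // dtype_size
    layouts = []
    offset = 0  # element offset of this state inside a page
    for shape in state_shapes:
        size = (num_blocks, *shape)
        inner = contiguous_strides(shape)
        # Moving to the next block jumps a whole page, not just this state.
        stride = (elems_per_page, *inner)
        layouts.append((size, stride, offset))
        offset += prod(shape)
    assert offset <= elems_per_page, "states do not fit in one page"
    return layouts


layouts = state_layouts(
    num_blocks=3,
    page_size_bytes=64,             # 32 two-byte elements per page
    state_shapes=[(2, 4), (4, 6)],  # two hypothetical state tensors
    dtype_size=2,
)
for layout in layouts:
    print(layout)
```

Each tuple corresponds to the `size`, `stride`, and `storage_offset` arguments of `torch.as_strided`, applied to the raw buffer viewed in the state dtype. The key point is that the block dimension strides by a full page, so several state tensors can share one page without copying.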

Fixes #38041

Before Fix (crash log)

<details><summary>V2 model runner crashes with AssertionError on Qwen3.5</summary>
$ VLLM_USE_V2_MODEL_RUNNER=1 python -m vllm.entrypoints.openai.api_server \
    --model /ssd1/models/Qwen3.5-35B-A3B --trust-remote-code \
    --tensor-parallel-size 2 --dtype float16 --max-model-len 4096

(Worker pid=73635) INFO [gpu_worker.py:272] Using V2 Model Runner
(Worker_TP0 pid=73635) INFO [model_runner.py:266] Loading model from scratch...
(Worker_TP0 pid=73635) INFO [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
...
(Worker_TP0 pid=73635) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/gpu/attn_utils.py", line 166, in init_kv_cache
    kv_caches = _reshape_kv_cache(
  File "vllm/v1/worker/gpu/attn_utils.py", line 122, in _reshape_kv_cache
    assert isinstance(kv_cache_spec, AttentionSpec)
AssertionError

RuntimeError: Engine core initialization failed.
</details>

After Fix (successful run)

<details><summary>V2 model runner successfully loads and serves Qwen3.5</summary>
$ VLLM_USE_V2_MODEL_RUNNER=1 python -m vllm.entrypoints.openai.api_server \
    --model /ssd1/models/Qwen3.5-35B-A3B --trust-remote-code \
    --tensor-parallel-size 2 --dtype float16 --max-model-len 4096

(APIServer) INFO [model.py:541] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(Worker) INFO [gpu_worker.py:272] Using V2 Model Runner
(Worker_TP0) INFO [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
(Worker_TP0) INFO [gpu_worker.py:436] Available KV cache memory: 36.24 GiB
(EngineCore) INFO [kv_cache_utils.py:1319] GPU KV cache size: 949,344 tokens
(APIServer) INFO: Application startup complete.
(APIServer) INFO: Uvicorn running on http://0.0.0.0:8562

$ curl http://localhost:8562/v1/chat/completions \
    -d '{"model":"/ssd1/models/Qwen3.5-35B-A3B","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

{"id":"chatcmpl-aef24845d232b4b0","object":"chat.completion","created":1774424213,
 "model":"/ssd1/models/Qwen3.5-35B-A3B",
 "choices":[{"index":0,"message":{"role":"assistant","content":"Thinking Process:\n\n1. ..."},
 "finish_reason":"length"}],
 "usage":{"prompt_tokens":19,"total_tokens":69,"completion_tokens":50}}
</details>

Test plan

  • Verified V2 model runner starts successfully with Qwen3.5-35B-A3B (TP=2, float16)
  • Verified inference produces valid output via /v1/chat/completions
  • pre-commit checks passed (ruff check, ruff format, mypy, typos, SPDX headers)
  • Not duplicating any existing PR (verified via gh pr list --search)

Notes

  • This is AI-assisted work (Claude). All changes reviewed by human.
  • The V1 model runner (gpu_model_runner.py) already handles both AttentionSpec and MambaSpec correctly. This PR aligns V2 model runner behavior with V1.
  • Only 1 file changed: vllm/v1/worker/gpu/attn_utils.py

Changed files

  • vllm/v1/worker/gpu/attn_utils.py (modified, +54/-26)

Environment

  • vLLM version: 0.17.2.dev0+g95c0f928c.d20260313 (nightly)
  • GPU: NVIDIA GH200 480GB
  • Model: Qwen3.5 9B (loaded via VL wrapper with language_model_only=True)
  • Config: FP8 online quantization, 262K context, chunked prefill, eager mode

Steps to Reproduce

  1. Load any Qwen3.5 model (e.g., 9B or 27B)
  2. Set VLLM_USE_V2_MODEL_RUNNER=1
  3. Start the engine:

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-7B",
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    enforce_eager=True,
)

Notes

  • Without VLLM_USE_V2_MODEL_RUNNER, the model loads and serves correctly with the default V1 model runner.
  • A related issue: the V1 engine's unify_kv_cache_spec_page_size in kv_cache_utils.py also fails with NotImplementedError when loading text-only Qwen3_5ForCausalLM, because linear and full attention layers have incompatible page sizes (linear attention layers have block_size=None). Loading through the VL wrapper (Qwen3_5ForConditionalGeneration with language_model_only=True) works around this.
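
The page-size problem in the second note can be pictured with a toy model. This is not vLLM's `unify_kv_cache_spec_page_size`: `ToySpec` and `unify_page_size` are hypothetical names, and the rule shown (scale `block_size` so every layer reaches the largest page size) is a simplifying assumption. It only illustrates why a layer with `block_size=None` cannot be unified this way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySpec:
    name: str
    page_size_bytes: int
    block_size: Optional[int]  # None models a layer with no token-block size


def unify_page_size(specs):
    """Scale block sizes so that every spec uses the largest page size."""
    target = max(s.page_size_bytes for s in specs)
    unified = []
    for s in specs:
        if s.page_size_bytes == target:
            unified.append(s)
            continue
        if s.block_size is None or target % s.page_size_bytes != 0:
            # No block size to scale, or the sizes do not divide evenly.
            raise NotImplementedError(f"cannot unify page size of {s.name}")
        ratio = target // s.page_size_bytes
        unified.append(ToySpec(s.name, target, s.block_size * ratio))
    return unified


# Two attention-style layers: the smaller page is scaled up.
unified = unify_page_size([
    ToySpec("full_a", 8192, 16),
    ToySpec("full_b", 4096, 16),
])
print(unified)

# A hybrid mix where the smaller layer has no block size to scale.
try:
    unify_page_size([
        ToySpec("full", 8192, 16),
        ToySpec("linear", 4096, None),
    ])
    hybrid_error = None
except NotImplementedError as exc:
    hybrid_error = str(exc)
print(hybrid_error)
```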

Expected Behavior

The V2 model runner should handle mixed attention architectures by skipping or appropriately handling non-AttentionSpec KV cache entries in _reshape_kv_cache.

extent analysis

Fix Plan

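To fix the issue, _reshape_kv_cache in attn_utils.py must handle KV cache specs that are not AttentionSpec. Skipping such entries is not enough: the linear attention layers still need their state tensors, so a skipped layer would only move the failure from startup to run time. The merged fix (PR #38081) instead ports the MambaSpec handling from the V1 model runner. The steps are:

  • Replace the unconditional assert with a branch on the type of kv_cache_spec.
  • For AttentionSpec, keep the existing reshape logic.
  • For MambaSpec, build the state tensors as strided views over the raw buffer (the V1 runner uses torch.as_strided for this).

Example code (a simplified sketch; the argument list and the two helper functions are hypothetical, not the actual vLLM code):

from vllm.v1.kv_cache_interface import AttentionSpec, MambaSpec

def _reshape_kv_cache(kv_cache_spec, raw_tensor):
    if isinstance(kv_cache_spec, AttentionSpec):
        # Existing path: view the raw buffer in the attention KV cache shape.
        return _reshape_attention_cache(kv_cache_spec, raw_tensor)  # hypothetical helper
    if isinstance(kv_cache_spec, MambaSpec):
        # New path: one strided view per recurrent state tensor.
        return _reshape_state_cache(kv_cache_spec, raw_tensor)  # hypothetical helper
    # Fail loudly on spec types that are still unsupported.
    raise NotImplementedError(
        f"Unsupported KV cache spec: {type(kv_cache_spec).__name__}"
    )

Do not wrap the assertion in a try-except block instead. Assertions are stripped when Python runs with -O, and catching AssertionError only hides the unsupported layer rather than allocating its state.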

Verification

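To verify the fix, set VLLM_USE_V2_MODEL_RUNNER=1 before the engine is created, then construct the LLM. Constructing LLM already starts the engine and initializes the KV cache, so no separate start call is needed:

import os

# Must be set before the engine is created.
os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-7B",
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    enforce_eager=True,
)

# Run a short generation to confirm the engine serves requests.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

If the fix is correct, the engine starts without the AssertionError and the script prints generated text.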

Extra Tips

  • Test the fix with more than one hybrid model and configuration (for example, different tensor-parallel sizes), and confirm that attention-only models still load.
  • To confirm that a model is hybrid, inspect layer_types in its config: it should contain both "linear_attention" and "full_attention".
  • The PR changes only vllm/v1/worker/gpu/attn_utils.py, so loading through the VL wrapper should need no additional changes.
