vllm - ✅(Solved) Fix V2 model runner crashes on Qwen3.5 mixed attention (linear + full) [1 pull requests, 1 comments, 2 participants]

vllm • 2026-03-24 21:44:00

GitHub stats

vllm-project/vllm#38041 • Fetched 2026-04-08 01:26:56

Author

bhaktatejas922

Participants

bhaktatejas922

ZJY0516

Timeline (top)

referenced ×4, subscribed ×2, commented ×1, cross-referenced ×1

When enabling VLLM_USE_V2_MODEL_RUNNER=1 with Qwen3.5 models (Qwen3_5ForConditionalGeneration / Qwen3_5ForCausalLM), the engine crashes during KV cache initialization with an AssertionError in _reshape_kv_cache.

Error Message

(EngineCore_DP0 pid=220) ERROR [core.py:1100]
  File ".../vllm/v1/worker/gpu/attn_utils.py", line 112, in _reshape_kv_cache
    assert isinstance(kv_cache_spec, AttentionSpec)
AssertionError

Root Cause

Qwen3.5 uses a hybrid architecture with both full attention (GQA) and linear attention (GDN/Mamba-like) layers. The layer_types config alternates between "linear_attention" and "full_attention":

"layer_types": ["linear_attention", "linear_attention", "linear_attention", "full_attention", ...]

The linear attention layers produce a KV cache spec that is not an AttentionSpec (it's a recurrent state spec). The V2 model runner's _reshape_kv_cache in attn_utils.py assumes all specs are AttentionSpec and crashes.
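To make the hybrid layout concrete, here is a minimal sketch that summarizes a layer_types list. The 8-layer list is a hypothetical example that repeats the four entries quoted above (the real config is elided after them), and summarize_layer_types is an illustrative helper, not part of vLLM.

```python
from collections import Counter


def summarize_layer_types(layer_types):
    """Count each layer type and locate the full-attention layers."""
    counts = Counter(layer_types)
    full_idx = [i for i, t in enumerate(layer_types) if t == "full_attention"]
    # A model is hybrid when more than one layer type is present.
    is_hybrid = len(counts) > 1
    return counts, full_idx, is_hybrid


# Hypothetical 8-layer config: the quoted 3:1 pattern, assumed to repeat.
layer_types = (["linear_attention"] * 3 + ["full_attention"]) * 2

counts, full_idx, is_hybrid = summarize_layer_types(layer_types)
print(dict(counts), full_idx, is_hybrid)
```

Only the layers at the full-attention indices would carry an attention-style KV cache spec. The remaining layers carry a recurrent state spec, which is what trips the assertion.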

Fix Action

Fix proposed

Proposed fix: PR [Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5) (https://github.com/vllm-project/vllm/pull/38081). The PR notes below list it as open and not yet merged.

PR fix notes

PR #38081: [Bugfix] Fix V2 model runner crash on hybrid attention models (Qwen3.5)

Repository: vllm-project/vllm
Author: Lidang-Jiang
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38081

Description (problem / solution / changelog)

Summary

Fix V2 model runner (VLLM_USE_V2_MODEL_RUNNER=1) crash on hybrid attention models like Qwen3.5
Root cause: _reshape_kv_cache() in attn_utils.py only handled AttentionSpec, but Qwen3.5's linear attention (Gated DeltaNet) layers produce MambaSpec, causing AssertionError at startup
Fix: Port MambaSpec handling from V1 model runner's _reshape_kv_cache_tensors() to V2's _reshape_kv_cache(), using the same torch.as_strided approach for state tensor reshaping

Fixes #38041
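The strided-view idea behind the fix can be sketched without torch. The following computes the size, stride, and storage offset that one might pass to torch.as_strided so that several per-block state tensors share one flat buffer. It assumes a single dtype and uses hypothetical shapes and page size; plan_state_views and StateView are illustrative names, not vLLM code.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class StateView:
    """Size, stride, and offset (all in elements) of one strided view."""
    size: tuple
    stride: tuple
    storage_offset: int


def contiguous_strides(shape):
    """Row-major strides, in elements, for a dense tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def plan_state_views(num_blocks, state_shapes, page_size_elems):
    """Lay out several per-block state tensors inside one flat buffer.

    Each block owns `page_size_elems` consecutive elements; the state
    tensors are packed back to back at the start of each block's page.
    """
    assert sum(prod(s) for s in state_shapes) <= page_size_elems
    views = []
    offset = 0
    for shape in state_shapes:
        views.append(StateView(
            size=(num_blocks, *shape),
            # Stepping one block jumps a whole page, not one dense tensor.
            stride=(page_size_elems, *contiguous_strides(shape)),
            storage_offset=offset,
        ))
        offset += prod(shape)
    return views


views = plan_state_views(
    num_blocks=4,
    state_shapes=[(3, 8), (2, 4, 4)],  # hypothetical conv and recurrent state shapes
    page_size_elems=64,
)
for view in views:
    print(view)
```

The leading stride equals the page size rather than the tensor's own element count, which is why a plain reshape of the raw buffer would not work here.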

Before Fix (crash log)

<details><summary>V2 model runner crashes with AssertionError on Qwen3.5</summary>

$ VLLM_USE_V2_MODEL_RUNNER=1 python -m vllm.entrypoints.openai.api_server \
    --model /ssd1/models/Qwen3.5-35B-A3B --trust-remote-code \
    --tensor-parallel-size 2 --dtype float16 --max-model-len 4096

(Worker pid=73635) INFO [gpu_worker.py:272] Using V2 Model Runner
(Worker_TP0 pid=73635) INFO [model_runner.py:266] Loading model from scratch...
(Worker_TP0 pid=73635) INFO [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
...
(Worker_TP0 pid=73635) ERROR [multiproc_executor.py:949]
  File "vllm/v1/worker/gpu/attn_utils.py", line 166, in init_kv_cache
    kv_caches = _reshape_kv_cache(
  File "vllm/v1/worker/gpu/attn_utils.py", line 122, in _reshape_kv_cache
    assert isinstance(kv_cache_spec, AttentionSpec)
AssertionError

RuntimeError: Engine core initialization failed.

</details>

After Fix (successful run)

<details><summary>V2 model runner successfully loads and serves Qwen3.5</summary>

$ VLLM_USE_V2_MODEL_RUNNER=1 python -m vllm.entrypoints.openai.api_server \
    --model /ssd1/models/Qwen3.5-35B-A3B --trust-remote-code \
    --tensor-parallel-size 2 --dtype float16 --max-model-len 4096

(APIServer) INFO [model.py:541] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(Worker) INFO [gpu_worker.py:272] Using V2 Model Runner
(Worker_TP0) INFO [qwen3_next.py:202] Using Triton/FLA GDN prefill kernel
(Worker_TP0) INFO [gpu_worker.py:436] Available KV cache memory: 36.24 GiB
(EngineCore) INFO [kv_cache_utils.py:1319] GPU KV cache size: 949,344 tokens
(APIServer) INFO: Application startup complete.
(APIServer) INFO: Uvicorn running on http://0.0.0.0:8562

$ curl http://localhost:8562/v1/chat/completions \
    -d '{"model":"/ssd1/models/Qwen3.5-35B-A3B","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

{"id":"chatcmpl-aef24845d232b4b0","object":"chat.completion","created":1774424213,
 "model":"/ssd1/models/Qwen3.5-35B-A3B",
 "choices":[{"index":0,"message":{"role":"assistant","content":"Thinking Process:\n\n1. ..."},
 "finish_reason":"length"}],
 "usage":{"prompt_tokens":19,"total_tokens":69,"completion_tokens":50}}

</details>
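The same smoke test can be scripted. This is a minimal sketch using only the Python standard library, with the port and model path taken from the log above. The explicit Content-Type header is an assumption added here (the curl command in the log does not show one), and build_chat_request and send are illustrative helpers.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=50):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see lead-in
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    base_url="http://localhost:8562",
    model="/ssd1/models/Qwen3.5-35B-A3B",
    prompt="Hello",
)
print(request.full_url)
# reply = send(request)  # requires the server from the log above to be running
```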

Test plan

Verified V2 model runner starts successfully with Qwen3.5-35B-A3B (TP=2, float16)
Verified inference produces valid output via /v1/chat/completions
pre-commit checks passed (ruff check, ruff format, mypy, typos, SPDX headers)
Not duplicating any existing PR (verified via gh pr list --search)

Notes

This is AI-assisted work (Claude). All changes reviewed by human.
The V1 model runner (gpu_model_runner.py) already handles both AttentionSpec and MambaSpec correctly. This PR aligns V2 model runner behavior with V1.
Only 1 file changed: vllm/v1/worker/gpu/attn_utils.py

Changed files

vllm/v1/worker/gpu/attn_utils.py (modified, +54/-26)

Description

Environment

vLLM version: 0.17.2.dev0+g95c0f928c.d20260313 (nightly)
GPU: NVIDIA GH200 480GB
Model: Qwen3.5 9B (loaded via VL wrapper with language_model_only=True)
Config: FP8 online quantization, 262K context, chunked prefill, eager mode

Error

(EngineCore_DP0 pid=220) ERROR [core.py:1100]
  File ".../vllm/v1/worker/gpu/attn_utils.py", line 112, in _reshape_kv_cache
    assert isinstance(kv_cache_spec, AttentionSpec)
AssertionError

Root Cause

"layer_types": ["linear_attention", "linear_attention", "linear_attention", "full_attention", ...]

Steps to Reproduce

Load any Qwen3.5 model (e.g., 9B or 27B)
Set VLLM_USE_V2_MODEL_RUNNER=1
Start the engine

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-7B",
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    enforce_eager=True,
)

Notes

Without VLLM_USE_V2_MODEL_RUNNER, the model loads and serves correctly with the default V1 model runner.
A related issue: the V1 engine's unify_kv_cache_spec_page_size in kv_cache_utils.py also fails with NotImplementedError when loading text-only Qwen3_5ForCausalLM, because linear and full attention layers have incompatible page sizes (linear attention layers have block_size=None). Loading through the VL wrapper (Qwen3_5ForConditionalGeneration with language_model_only=True) works around this.

Expected Behavior

The V2 model runner should handle mixed attention architectures by skipping or appropriately handling non-AttentionSpec KV cache entries in _reshape_kv_cache.

extent analysis

Fix Plan

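To fix the issue, _reshape_kv_cache in attn_utils.py must stop assuming that every KV cache entry is an AttentionSpec. The steps:

Check the type of kv_cache_spec instead of asserting that it is an AttentionSpec.
If kv_cache_spec is not an AttentionSpec, handle it according to its type. PR #38081 reshapes MambaSpec entries into state tensors; merely skipping them avoids the assertion but would presumably leave the linear attention layers without state tensors.

Example code (the signature is illustrative, not the actual one in attn_utils.py):

import logging

from vllm.v1.kv_cache_interface import AttentionSpec

logger = logging.getLogger(__name__)

def _reshape_kv_cache(self, kv_cache_spec):
    if not isinstance(kv_cache_spec, AttentionSpec):
        # Stopgap only: this avoids the crash but allocates nothing for
        # the layer. PR #38081 reshapes MambaSpec entries instead.
        logger.warning("Skipping non-AttentionSpec KV cache entry: %s", kv_cache_spec)
        return
    # Rest of the function remains the same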

Alternatively, you can catch the AssertionError, although this is fragile: assert statements are removed when Python runs with -O, so the explicit isinstance check above is preferable:

def _reshape_kv_cache(self, kv_cache_spec):
    try:
        assert isinstance(kv_cache_spec, AttentionSpec)
    except AssertionError:
        # Reached only when assertions are enabled (Python run without -O).
        # Skipping is a stopgap; the entry gets no cache tensors.
        print(f"Skipping non-AttentionSpec KV cache entry: {kv_cache_spec}")
        return
    # Rest of the function remains the same
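A third option, closer in spirit to what the PR describes, is to dispatch on the spec type and fail loudly on anything unknown. The sketch below uses hypothetical stand-in classes and layer names rather than vLLM's real AttentionSpec and MambaSpec, and the attention shape shown is one illustrative layout, since the real one depends on the attention backend.

```python
from dataclasses import dataclass


@dataclass
class AttentionSpecStub:
    """Hypothetical stand-in for a paged attention KV cache spec."""
    block_size: int
    num_kv_heads: int
    head_size: int


@dataclass
class MambaSpecStub:
    """Hypothetical stand-in for a recurrent state spec."""
    shapes: tuple


def plan_cache_shapes(spec, num_blocks):
    """Return the tensor shapes to allocate for one layer's cache."""
    if isinstance(spec, AttentionSpecStub):
        # One K and one V plane per block; the real layout is backend-specific.
        return [(2, num_blocks, spec.block_size, spec.num_kv_heads, spec.head_size)]
    if isinstance(spec, MambaSpecStub):
        # One state tensor per declared shape, with a leading block dimension.
        return [(num_blocks, *shape) for shape in spec.shapes]
    raise NotImplementedError(f"Unsupported KV cache spec: {type(spec).__name__}")


specs = {
    "layers.0.linear_attn": MambaSpecStub(shapes=((3, 8), (2, 4, 4))),
    "layers.3.self_attn": AttentionSpecStub(block_size=16, num_kv_heads=2, head_size=128),
}
for name, spec in specs.items():
    print(name, plan_cache_shapes(spec, num_blocks=4))
```

Raising NotImplementedError for unknown types keeps a future spec class from being dropped silently, which is the main risk of the skip-and-continue variants.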

Verification

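To verify the fix, you can run the following code:

import os

# Set the flag before vLLM is imported so that the engine sees it.
os.environ["VLLM_USE_V2_MODEL_RUNNER"] = "1"

from vllm import LLM, SamplingParams

# Constructing LLM initializes the engine, including the KV cache;
# this is where the AssertionError was raised before the fix.
llm = LLM(
    model="Qwen/Qwen3.5-7B",
    trust_remote_code=True,
    quantization="fp8",
    max_model_len=32768,
    enforce_eager=True,
)

# A short generation confirms that the model also serves requests.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

If the fix is correct, the engine initializes without the AssertionError in _reshape_kv_cache and the script prints generated text.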

Extra Tips

Test the fix with both hybrid models (such as Qwen3.5) and models that use only full attention, to confirm that the existing AttentionSpec path is unchanged.
To confirm that a model is hybrid, check whether its layer_types config contains both "full_attention" and "linear_attention".
According to the PR notes, only vllm/v1/worker/gpu/attn_utils.py changes, so loading through the VL wrapper should need no separate update.
