vllm - [Bug]: The hit rate of prefix caching in Qwen3.5 35BA3B is very low, always less than 0.1% [1 pull request, 1 comment, 2 participants]

vllm · 2026-03-09 12:32:34


GitHub stats

vllm-project/vllm#36493 • Fetched 2026-04-08 00:36:34

Author

piekey1994

Participants

haosdent

piekey1994

Timeline (top)

referenced ×5, subscribed ×3, cross-referenced ×2, commented ×1

Fix Action

Fix proposed (the PR was open and not merged when this page was fetched)

Proposed fix in PR: [WIP] [Hybrid][GDN] Enable prefix caching 'all' mode for Qwen3.5/Qwen3Next (https://github.com/vllm-project/vllm/pull/36649)

PR fix notes

PR #36649: [WIP] [Hybrid][GDN] Enable prefix caching 'all' mode for Qwen3.5/Qwen3Next

Repository: vllm-project/vllm
Author: haosdent
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36649

Description (problem / solution / changelog)

Purpose

Fixes #36493

Enable block-level GDN (Gated Delta Net) state caching for Qwen3.5 and Qwen3Next hybrid models, fixing near-0% prefix cache hit rates.

Previously, these models fell back to mamba_cache_mode="align" (caching only 1-2 tail blocks), causing the HybridKVCacheCoordinator to report 0% hits since Mamba/GDN null blocks were never cached. This PR implements full "all" mode support following the Mamba2 pattern:

Expose per-chunk intermediate states (h tensor) from the GDN Triton kernel in float32 via return_intermediate_states
Add "all" mode block-index metadata to GDNAttentionMetadata and its builder
Save GDN SSM + conv1d states at block boundaries during prefill, with correct decode block-boundary handling
Set mamba_chunk_size=64 in Qwen3.5/Qwen3Next configs for proper block alignment
Add SupportsMambaPrefixCaching interface to all Qwen3.5/Qwen3Next model classes (CausalLM + ConditionalGeneration)
Auto-set mamba_ssm_cache_dtype=float32 when mamba_cache_mode=all to prevent precision loss in cached intermediate states
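A recurrent state can only be reused at a position where it was saved, so with block-level caching the reusable prefix is rounded down to a block boundary. The sketch below illustrates that rounding. It assumes boundaries fall at multiples of the chunk size of 64 set by the PR; the actual cache block size in vLLM may be a larger multiple, and `cacheable_prefix_len` is a hypothetical helper, not a vLLM function.

```python
def cacheable_prefix_len(num_tokens: int, block_size: int) -> int:
    """Largest multiple of block_size that does not exceed num_tokens."""
    return (num_tokens // block_size) * block_size


# mamba_chunk_size from the PR description; the real cache block size
# may be a different multiple of this value (assumption for illustration).
CHUNK = 64

# Hypothetical prompt lengths, chosen only to show the rounding.
aligned_1100 = cacheable_prefix_len(1100, CHUNK)
aligned_3001 = cacheable_prefix_len(3001, CHUNK)

print(aligned_1100, aligned_3001)
```

Both figures reported in this PR's test results, 1088 cached tokens and the split point at token 2880, are exact multiples of 64, which is consistent with this picture.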

Test Plan

Test Result

E2E: `Qwen/Qwen3.5-35B-A3B`

vllm serve Qwen/Qwen3.5-35B-A3B \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --enable-chunked-prefill \
  --language-model-only \
  --enforce-eager \
  --max-model-len 8192

Run	Prefill time	Cached tokens	Tokens match
Run 1 (fill cache)	37s	0	-
Run 2 (use cache)	2.6s (14x faster)	1088	Yes

Kernel-level: GDN intermediate state split correctness

Verified that processing tokens [0..3001] in one pass produces the same output as splitting at token 2880 (block boundary) and resuming from the cached h state:

Test	Result
Kernel split via `final_state`	PASS (max diff = 0.0)
Kernel split via `h[chunk_idx]` (float32)	PASS (max diff = 0.0)
Kernel split via `h[chunk_idx]` (bfloat16 roundtrip)	FAIL (max diff = 2.0) - motivates the float32 requirement
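The split test can be illustrated with a toy recurrence. This is a sketch under stated assumptions, not the GDN kernel: the state is a single scalar updated as h = decay * h + x, Python floats stand in for the float32 state, and an IEEE half-precision round trip (via `struct`) stands in for the bfloat16 round trip. The token values and the `scan` and `to_half` helpers are made up for illustration.

```python
import struct


def to_half(x: float) -> float:
    """Round a float through IEEE half precision (stand-in for a lossy cache dtype)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scan(tokens, state=0.0, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t; returns (per-token states, final state)."""
    outputs = []
    for x in tokens:
        state = decay * state + x
        outputs.append(state)
    return outputs, state


def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic inputs: 3002 tokens, split at 2880 as in the kernel-level test.
tokens = [((i * 37) % 101) / 101.0 for i in range(3002)]
split = 2880

full_out, _ = scan(tokens)              # one pass over everything
_, cached = scan(tokens[:split])        # state saved at the block boundary

exact_out, _ = scan(tokens[split:], state=cached)           # resume, full precision
lossy_out, _ = scan(tokens[split:], state=to_half(cached))  # resume after lossy round trip

exact_diff = max_diff(full_out[split:], exact_out)
lossy_diff = max_diff(full_out[split:], lossy_out)
print(exact_diff, lossy_diff)
```

Resuming from the exactly saved state repeats the same arithmetic as the single pass, so the results match bit for bit. Rounding the saved state first introduces an error that every later token inherits, which is the effect the float32 requirement guards against.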

Changed files

vllm/model_executor/layers/fla/ops/chunk.py (modified, +22/-5)
vllm/model_executor/layers/fla/ops/chunk_delta_h.py (modified, +2/-1)
vllm/model_executor/models/config.py (modified, +14/-0)
vllm/model_executor/models/qwen3_5.py (modified, +5/-8)
vllm/model_executor/models/qwen3_next.py (modified, +255/-59)
vllm/transformers_utils/configs/qwen3_5.py (modified, +2/-1)
vllm/transformers_utils/configs/qwen3_5_moe.py (modified, +2/-1)
vllm/transformers_utils/configs/qwen3_next.py (modified, +2/-1)
vllm/v1/attention/backends/gdn_attn.py (modified, +99/-8)


Your current environment

vLLM 0.17.0, 2× H800, CUDA 12.9.86, running the vllm/vllm-openai:v0.17.0 Docker image

🐛 Describe the bug

I used an intent-recognition test set in which the first 20% of the characters of every prompt are identical. On Qwen3-30B-A3B, the prefix caching hit rate was about 20%. On Qwen3.5-35B-A3B, it is 0% most of the time and only occasionally reaches 0.1%. My startup script:

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 8889 \
--language-model-only \
--max-num-seqs 128 \
--enable-prefix-caching \
--max-model-len auto
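To see why a hit rate of roughly 20% is the natural expectation for such a test set, here is a toy model of block-level prefix caching. It is a sketch, not vLLM's implementation: the block size of 16, the prompt lengths, and the `hit_rate` helper are assumptions made for illustration.

```python
def hit_rate(prompts, block_size=16):
    """Fraction of prompt tokens served from a toy block-level prefix cache."""
    cache = set()
    hits = queries = 0
    for tokens in prompts:
        queries += len(tokens)
        matched = True  # a block can only hit if all earlier blocks hit
        for b in range(len(tokens) // block_size):
            # The key covers the whole prefix up to the end of this block.
            key = tuple(tokens[: (b + 1) * block_size])
            if matched and key in cache:
                hits += block_size
            else:
                matched = False
                cache.add(key)
    return hits / queries


# 50 synthetic prompts of 800 tokens: 160 shared tokens (20%) + 640 unique ones.
shared = list(range(160))
prompts = [shared + [10_000 + i * 1_000 + j for j in range(640)] for i in range(50)]

rate = hit_rate(prompts)
print(f"{rate:.3f}")
```

The first prompt fills the cache and every later prompt reuses the 160 shared tokens, so the rate approaches the shared fraction. A measured rate near 0% on the same workload therefore points at the caching path, not at the data.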


extent analysis

Fix Plan

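According to PR #36649, the low hit rate is not caused by --max-model-len or --max-num-seqs. Qwen3.5 is a hybrid model, and without the PR it falls back to mamba_cache_mode="align", which caches only 1-2 tail blocks, so the HybridKVCacheCoordinator reports almost no hits.

Run a build that includes PR #36649 (still open and marked WIP when this page was fetched).
Start the server with --enable-prefix-caching and --mamba-cache-mode all.
Keep --max-num-seqs and --max-model-len at the values your workload needs; changing them is not expected to raise the hit rate.

Example command (the original startup script plus the cache-mode option from the PR's test run; that run also used --enable-chunked-prefill and --enforce-eager, and the PR does not state whether they are required):

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 8889 \
--language-model-only \
--max-num-seqs 128 \
--enable-prefix-caching \
--mamba-cache-mode all \
--max-model-len auto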

Verification

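To verify the fix, send the same long prompt twice and compare the two runs, as in the PR's end-to-end test: the second run should report a non-zero number of cached tokens and a much shorter prefill time. Over a longer run, monitor the prefix cache hit rate in the server logs or metrics. For a test set whose prompts share their first 20%, a rate near 20% is the expected outcome.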
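One way to track the hit rate over time is to derive it from the server's Prometheus-style metrics text. The sketch below parses such text and divides hits by queries. The counter names `vllm:prefix_cache_queries_total` and `vllm:prefix_cache_hits_total` are assumptions to check against the output of your own server's metrics endpoint, and the sample text is made up for illustration.

```python
def parse_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of one metric in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


def prefix_hit_rate(
    metrics_text: str,
    queries_name: str = "vllm:prefix_cache_queries_total",  # assumed name
    hits_name: str = "vllm:prefix_cache_hits_total",        # assumed name
) -> float:
    queries = parse_counter(metrics_text, queries_name)
    hits = parse_counter(metrics_text, hits_name)
    return hits / queries if queries else 0.0


# Made-up sample; real output comes from the server's metrics endpoint.
sample = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="default"} 40000.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="default"} 7840.0
"""

rate = prefix_hit_rate(sample)
print(f"prefix cache hit rate: {rate:.1%}")
```

Because both values are cumulative counters, comparing two scrapes taken before and after a test run gives the hit rate for that run alone.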

Extra Tips

Test the change in a controlled environment before rolling it out to production; the PR is still marked WIP.
--max-num-seqs and --max-model-len affect memory use and throughput, but the PR attributes the low hit rate to the mamba_cache_mode="align" fallback, not to these values.
Refer to the vLLM documentation for more information on prefix caching and configuration options.
