vllm - ✅ (Solved) Fix [Bug]: The hit rate of prefix caching in Qwen3.5 35BA3B is very low, always less than 0.1% [1 pull request, 1 comment, 2 participants]

GitHub stats

vllm-project/vllm#36493 (fetched 2026-04-08 00:36:34)

  • Comments: 1
  • Participants: 2
  • Timeline: 12
  • Reactions: 3
  • Timeline (top): referenced ×5, subscribed ×3, cross-referenced ×2, commented ×1

Fix Action

Fixed

PR fix notes

PR #36649: [WIP] [Hybrid][GDN] Enable prefix caching 'all' mode for Qwen3.5/Qwen3Next

Description (problem / solution / changelog)

Purpose

Fixes #36493

Enable block-level GDN (Gated Delta Net) state caching for Qwen3.5 and Qwen3Next hybrid models, fixing near-0% prefix cache hit rates.

Previously, these models fell back to mamba_cache_mode="align" (caching only 1-2 tail blocks), causing the HybridKVCacheCoordinator to report 0% hits since Mamba/GDN null blocks were never cached. This PR implements full "all" mode support following the Mamba2 pattern:

  • Expose per-chunk intermediate states (h tensor) from the GDN Triton kernel in float32 via return_intermediate_states
  • Add "all" mode block-index metadata to GDNAttentionMetadata and its builder
  • Save GDN SSM + conv1d states at block boundaries during prefill, with correct decode block-boundary handling
  • Set mamba_chunk_size=64 in Qwen3.5/Qwen3Next configs for proper block alignment
  • Add SupportsMambaPrefixCaching interface to all Qwen3.5/Qwen3Next model classes (CausalLM + ConditionalGeneration)
  • Auto-set mamba_ssm_cache_dtype=float32 when mamba_cache_mode=all to prevent precision loss in cached intermediate states
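
To see why tail-only caching produces near-0% hits for prompts that share only their opening portion, the following toy model may help. It is a sketch only, not vLLM's HybridKVCacheCoordinator: the function names, the 64-token block size (borrowed from mamba_chunk_size=64 above), and the assumption that "align" keeps a state snapshot only at the last full block boundary are simplifications made for illustration.

```python
# Toy model (hypothetical, not vLLM code): which block boundaries hold a
# reusable recurrent-state snapshot under each cache mode.
BLOCK = 64  # assumed block size in tokens, for illustration only


def cached_boundaries(num_tokens: int, mode: str) -> set[int]:
    """Return the block boundaries (in blocks) that have a saved state."""
    n_blocks = num_tokens // BLOCK
    if mode == "all":
        # A snapshot is kept at every full block boundary.
        return set(range(1, n_blocks + 1))
    if mode == "align":
        # Simplification: only the last full block boundary keeps a snapshot.
        return {n_blocks} if n_blocks > 0 else set()
    raise ValueError(f"unknown mode: {mode}")


def reusable_tokens(shared_prefix_tokens: int, boundaries: set[int]) -> int:
    """Longest cached boundary that lies entirely inside the shared prefix."""
    usable = [b for b in boundaries if b * BLOCK <= shared_prefix_tokens]
    return max(usable, default=0) * BLOCK


first_request_tokens = 1000  # first prompt fills the cache
shared_prefix_tokens = 200   # second prompt shares only the first 20%

hit_align = reusable_tokens(
    shared_prefix_tokens, cached_boundaries(first_request_tokens, "align")
)
hit_all = reusable_tokens(
    shared_prefix_tokens, cached_boundaries(first_request_tokens, "all")
)

print("align:", hit_align)  # 0: the only snapshot sits at token 960
print("all:  ", hit_all)    # 192: three full blocks fit inside the prefix
```

In this simplified picture, a workload like the one in the issue (prompts that agree on roughly their first 20%) never lands on the single tail snapshot, which is consistent with the reported 0% hit rate.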

Test Plan

Test Result

E2E: Qwen/Qwen3.5-35B-A3B

vllm serve Qwen/Qwen3.5-35B-A3B \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --enable-chunked-prefill \
  --language-model-only \
  --enforce-eager \
  --max-model-len 8192

| Run | Prefill time | Cached tokens | Tokens match |
|---|---|---|---|
| Run 1 (fill cache) | 37s | 0 | - |
| Run 2 (use cache) | 2.6s (14x faster) | 1088 | Yes |

Kernel-level: GDN intermediate state split correctness

Verified that processing tokens [0..3001] in one pass produces the same output as splitting at token 2880 (block boundary) and resuming from the cached h state:

| Test | Result |
|---|---|
| Kernel split via final_state | PASS (max diff = 0.0) |
| Kernel split via h[chunk_idx] (float32) | PASS (max diff = 0.0) |
| Kernel split via h[chunk_idx] (bfloat16 roundtrip) | FAIL (max diff = 2.0), which motivates the float32 requirement |
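
The split test above relies on a general property of recurrent state: resuming from an exactly saved state reproduces the single-pass result, while a lossy round trip of that state does not. The sketch below illustrates this with a toy scalar recurrence in pure Python. It is not the GDN Triton kernel; the recurrence, the decay value, the synthetic inputs, and the truncation-based bfloat16 emulation are all assumptions chosen for illustration.

```python
import struct


def to_bf16_and_back(x: float) -> float:
    """Emulate a bfloat16 round trip by keeping only the top 16 bits of the
    float32 encoding (truncation, not round-to-nearest)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def run(tokens, h=0.0, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t, emitting h_t per step."""
    outputs = []
    for x in tokens:
        h = decay * h + x
        outputs.append(h)
    return outputs, h


def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Synthetic inputs covering positions 0..3001, split at 2880 as in the test.
tokens = [((i * 37) % 101) / 101.0 for i in range(3002)]
split = 2880

full_out, _ = run(tokens)                    # single pass
_, h_saved = run(tokens[:split])             # state at the block boundary
tail_exact, _ = run(tokens[split:], h=h_saved)
tail_lossy, _ = run(tokens[split:], h=to_bf16_and_back(h_saved))

exact_diff = max_diff(full_out[split:], tail_exact)
lossy_diff = max_diff(full_out[split:], tail_lossy)

print("resume from exact state:  max diff =", exact_diff)
print("resume from bf16 roundtrip: max diff =", lossy_diff)
```

Resuming from the exact state repeats the same arithmetic in the same order, so the difference is zero. The truncated state starts the tail from a slightly different value, so the outputs diverge. The size of that divergence in this toy says nothing about the 2.0 reported for the real kernel.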

Changed files

  • vllm/model_executor/layers/fla/ops/chunk.py (modified, +22/-5)
  • vllm/model_executor/layers/fla/ops/chunk_delta_h.py (modified, +2/-1)
  • vllm/model_executor/models/config.py (modified, +14/-0)
  • vllm/model_executor/models/qwen3_5.py (modified, +5/-8)
  • vllm/model_executor/models/qwen3_next.py (modified, +255/-59)
  • vllm/transformers_utils/configs/qwen3_5.py (modified, +2/-1)
  • vllm/transformers_utils/configs/qwen3_5_moe.py (modified, +2/-1)
  • vllm/transformers_utils/configs/qwen3_next.py (modified, +2/-1)
  • vllm/v1/attention/backends/gdn_attn.py (modified, +99/-8)

Your current environment

vLLM 0.17.0, 2x H800, CUDA 12.9.86, running the vllm/vllm-openai:v0.17.0 Docker image.

🐛 Describe the bug

I used an intent-recognition test set in which roughly the first 20% of the characters of each prompt are identical. On Qwen3-30B-A3B, the prefix caching hit rate was about 20%. On Qwen3.5-35B-A3B, it is 0% most of the time and only occasionally reaches 0.1%. My startup script:

vllm serve /data/models/Qwen3.5-35B-A3B \
  --host 0.0.0.0 \
  --served-model-name default \
  --port 8889 \
  --language-model-only \
  --max-num-seqs 128 \
  --enable-prefix-caching \
  --max-model-len auto


extent analysis

Fix Plan

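According to PR #36649, the near-0% hit rate does not come from --max-model-len or --max-num-seqs. Qwen3.5 fell back to mamba_cache_mode="align", which caches only the last 1-2 blocks of GDN state, so the hybrid cache coordinator found almost no reusable prefix. The fix is to run a vLLM build that includes PR #36649 (marked [WIP] in its title) and enable block-level GDN state caching:

  • Keep --enable-prefix-caching.
  • Add --mamba-cache-mode all.
  • The PR's E2E test also passed --enable-chunked-prefill, --enforce-eager, and --max-model-len 8192. The PR notes do not say whether these are required.

Example, based on the original startup script with the one added flag:

vllm serve /data/models/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--served-model-name default \
--port 8889 \
--language-model-only \
--max-num-seqs 128 \
--enable-prefix-caching \
--mamba-cache-mode all \
--max-model-len auto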

Verification

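To verify the fix, send the same long prompt twice and compare the two runs, as in the PR's E2E test: the second run should report a non-zero cached-token count (1088 in the PR) and a much shorter prefill time (2.6s versus 37s in the PR). The prefix cache hit rate reported in the server logs should also rise well above the 0% to 0.1% seen in the issue.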
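
One way to compute the hit rate from numbers rather than reading log lines is to parse the server's Prometheus /metrics output. The helper below is a minimal sketch. The counter names vllm:prefix_cache_hits_total and vllm:prefix_cache_queries_total are assumptions about what the running vLLM version exposes and should be checked against its actual /metrics output. The sample text and its values are made up.

```python
def prefix_cache_hit_rate(
    metrics_text: str,
    hits_name: str = "vllm:prefix_cache_hits_total",        # assumed name
    queries_name: str = "vllm:prefix_cache_queries_total",  # assumed name
) -> float:
    """Return hits / queries from Prometheus text-format metrics.

    Values are summed across label sets; returns 0.0 if no queries are seen.
    """
    totals = {hits_name: 0.0, queries_name: 0.0}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    queries = totals[queries_name]
    return totals[hits_name] / queries if queries else 0.0


# Made-up sample in Prometheus text format, for illustration only.
sample_metrics = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="default"} 2176.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="default"} 1088.0
"""

sample_rate = prefix_cache_hit_rate(sample_metrics)
print(f"prefix cache hit rate: {sample_rate:.1%}")
```

In practice the text would come from an HTTP GET against the server's /metrics endpoint (port 8889 in the startup script above), fetched before and after a test run so that the difference reflects only that run.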

Extra Tips

  • Test the change in a controlled environment before applying it to production.
  • With mamba_cache_mode=all, the PR auto-sets mamba_ssm_cache_dtype=float32. Its kernel test shows that a bfloat16 round trip of the cached state fails (max diff = 2.0), so overriding this to a lower precision is likely to change outputs.
  • Refer to the vLLM documentation for more information on prefix caching and configuration options.
