vllm: [Bug]: Prefix cache align-mode has a 0% cache hit rate for Qwen3.6-35B-A3B [1 pull request]

Fix Action

Fixed


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.15.0-119-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H800
GPU 1: NVIDIA H800
GPU 2: NVIDIA H800
GPU 3: NVIDIA H800
GPU 4: NVIDIA H800
GPU 5: NVIDIA H800
GPU 6: NVIDIA H800
GPU 7: NVIDIA H800

Nvidia driver version        : 580.95.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           Intel(R) Xeon(R) Platinum 8468
BIOS Model name:                      Intel(R) Xeon(R) Platinum 8468  CPU @ 2.1GHz
BIOS CPU family:                      179
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   48
Socket(s):                            2
Stepping:                             8
CPU(s) scaling MHz:                   26%
CPU max MHz:                          3800.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popc
</details>

🐛 Describe the bug

I tested Qwen/Qwen3.6-35B-A3B on H800 GPUs with a total input length of 8,000 tokens and an output length of 100 tokens per request. Notably, max-num-batched-tokens was set to 8192, which is larger than the 8,000-token input. The launch commands, benchmark command, and results are below.
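One way to read the note about max-num-batched-tokens: with one request in flight, an 8,000-token prompt fits inside a single 8,192-token budget, so under a simplified model the prompt is prefilled in one step rather than in several chunks. The sketch below only shows that arithmetic. The helper `prefill_chunks` is hypothetical and is not a vLLM API, and whether single-step prefill is related to the align-mode misses is not established in this report.

```python
import math


def prefill_chunks(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Steps needed to prefill one prompt under a simplified model.

    Assumption: the prompt is the only request being scheduled and each
    step processes at most max_num_batched_tokens prompt tokens. The real
    scheduler may behave differently.
    """
    return math.ceil(prompt_tokens / max_num_batched_tokens)


# Values from this report: 8,000 input tokens, budget of 8,192 tokens.
print(prefill_chunks(8000, 8192))
```

Under the same model, a prompt one token over the budget (8,193 tokens) would need two steps.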

Launch commands

no-apc

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --no-enable-prefix-caching \
  --mamba-cache-mode all \
  --gdn-prefill-backend triton

align-mode

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --mamba-cache-mode align \
  --gdn-prefill-backend triton

all-mode

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --gdn-prefill-backend triton

Since all mode currently requires the Triton GDN prefill backend, I used --gdn-prefill-backend triton for all runs to keep the comparison consistent.

Benchmark command

vllm bench serve Qwen/Qwen3.6-35B-A3B \
  --dataset-name prefix_repetition \
  --num-prompts 100 \
  --base-url http://localhost:8000 \
  --ignore-eos \
  --max-concurrency 1 \
  --prefix-repetition-prefix-len 6000 \
  --prefix-repetition-suffix-len 2000 \
  --prefix-repetition-num-prefixes 20 \
  --prefix-repetition-output-len 100
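As a rough sanity check on what hit rate this benchmark could reach, the following sketch estimates an upper bound from the command's parameters. It assumes a token-level hit rate (cached tokens divided by queried tokens), nominal prompt lengths, suffixes that never hit, no eviction, and that the first request for each distinct prefix always misses. The function name is hypothetical, and block-size rounding is ignored.

```python
def max_token_hit_rate(num_prompts: int, num_prefixes: int,
                       prefix_len: int, suffix_len: int) -> float:
    """Upper bound on the token-level prefix cache hit rate.

    Assumptions: every prompt is prefix_len + suffix_len tokens, only the
    shared prefix can be served from cache, and the first prompt that
    uses each distinct prefix is a cold miss.
    """
    cold_prompts = min(num_prefixes, num_prompts)
    hit_tokens = (num_prompts - cold_prompts) * prefix_len
    total_tokens = num_prompts * (prefix_len + suffix_len)
    return hit_tokens / total_tokens


# Parameters taken from the benchmark command above.
rate = max_token_hit_rate(num_prompts=100, num_prefixes=20,
                          prefix_len=6000, suffix_len=2000)
print(f"{rate:.1%}")
```

Under these assumptions the bound is 480,000 of 800,000 tokens, or 60%, so a hit rate somewhat below 60% would be consistent with a working prefix cache, while 0% would not.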

Results

| Mode | Hit rate | Output token throughput |
| --- | --- | --- |
| no-apc | - | 140.49 tok/s |
| align-mode | 0% | 137.18 tok/s |
| all-mode | 54.4% | 125.17 tok/s |
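To put the throughput numbers in relative terms, here is a minimal sketch that computes each mode's change against the no-apc baseline. The figures are copied from the results above. The per-request time is only an approximation, assuming that with --max-concurrency 1 and 100 output tokens per request, output throughput is roughly 100 divided by the mean request time.

```python
BASELINE_TOK_S = 140.49  # no-apc
RESULTS_TOK_S = {"align-mode": 137.18, "all-mode": 125.17}
OUTPUT_LEN = 100  # --prefix-repetition-output-len


def relative_change(value: float, baseline: float) -> float:
    """Fractional change of value relative to baseline (negative = slower)."""
    return (value - baseline) / baseline


def approx_seconds_per_request(tok_per_s: float, output_len: int) -> float:
    """Approximate end-to-end time per request at concurrency 1."""
    return output_len / tok_per_s


for mode, tput in RESULTS_TOK_S.items():
    change = relative_change(tput, BASELINE_TOK_S)
    secs = approx_seconds_per_request(tput, OUTPUT_LEN)
    print(f"{mode}: {change:+.1%} vs no-apc, ~{secs:.2f} s/request")
```

By this calculation align mode is about 2.4% slower than no-apc and all mode is about 10.9% slower, despite all mode's 54.4% hit rate.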

Question

In this benchmark, align mode has a 0% hit rate. Although all mode does get cache hits, its overall throughput is still lower than running without prefix caching. What would be the recommended way to improve the cache hit rate and avoid the throughput regression?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
