vllm [Bug]: Prefix cache align-mode has a 0% cache hit rate for Qwen3.6-35B-A3B (fixed; 1 linked pull request)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.15.0-119-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H800
GPU 1: NVIDIA H800
GPU 2: NVIDIA H800
GPU 3: NVIDIA H800
GPU 4: NVIDIA H800
GPU 5: NVIDIA H800
GPU 6: NVIDIA H800
GPU 7: NVIDIA H800

Nvidia driver version        : 580.95.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           Intel(R) Xeon(R) Platinum 8468
BIOS Model name:                      Intel(R) Xeon(R) Platinum 8468  CPU @ 2.1GHz
BIOS CPU family:                      179
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   48
Socket(s):                            2
Stepping:                             8
CPU(s) scaling MHz:                   26%
CPU max MHz:                          3800.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popc

</details>

🐛 Describe the bug

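I tested Qwen/Qwen3.6-35B-A3B on H800 GPUs (tensor parallel size 2) with a total input length of 8,000 tokens (a 6,000-token repeated prefix plus a 2,000-token suffix) and an output length of 100 tokens. Note that max-num-batched-tokens was set to 8192, slightly above the total input length. The results are as follows.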

Launch commands

no-apc

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --no-enable-prefix-caching \
  --mamba-cache-mode all \
  --gdn-prefill-backend triton

align-mode

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --mamba-cache-mode align \
  --gdn-prefill-backend triton

all-mode

vllm serve Qwen/Qwen3.6-35B-A3B \
  -tp 2 \
  --host 0.0.0.0 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --mamba-cache-mode all \
  --gdn-prefill-backend triton

Since all mode currently requires the Triton GDN prefill backend, I used --gdn-prefill-backend triton for all runs to keep the comparison consistent.

Benchmark command

vllm bench serve Qwen/Qwen3.6-35B-A3B \
  --dataset-name prefix_repetition \
  --num-prompts 100 \
  --base-url http://localhost:8000 \
  --ignore-eos \
  --max-concurrency 1 \
  --prefix-repetition-prefix-len 6000 \
  --prefix-repetition-suffix-len 2000 \
  --prefix-repetition-num-prefixes 20 \
  --prefix-repetition-output-len 100
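
As a rough sanity check on the hit rate this workload can reach at all, the sketch below computes a best-case token-level hit rate. It assumes each prompt is one 6,000-token prefix followed by a unique 2,000-token suffix, that every prefix is requested at least once, and that only the first request for each prefix misses. These are assumptions about how prefix_repetition builds prompts, not confirmed behaviour, and block-size rounding is ignored. The function name is hypothetical.

```python
def best_case_hit_rate(num_prompts, num_prefixes, prefix_len, suffix_len):
    """Upper bound on the fraction of prompt tokens served from the prefix cache.

    Assumption: the first request for each distinct prefix is a full miss,
    and every later request reuses exactly prefix_len cached tokens.
    """
    total_tokens = num_prompts * (prefix_len + suffix_len)
    cold_requests = min(num_prefixes, num_prompts)  # first sighting of each prefix
    warm_requests = num_prompts - cold_requests     # requests that can reuse a prefix
    return warm_requests * prefix_len / total_tokens


# Values taken from the benchmark command above.
rate = best_case_hit_rate(num_prompts=100, num_prefixes=20,
                          prefix_len=6000, suffix_len=2000)
print(f"{rate:.1%}")  # prints 60.0%
```

Under these assumptions the ceiling is 60%, so a hit rate in the mid-50s would be close to the best case, while 0% would mean that no prefix is reused at all.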

Results

| Mode | Hit rate | Output token throughput (tok/s) |
|---|---|---|
| no-apc | - | 140.49 |
| align-mode | 0% | 137.18 |
| all-mode | 54.4% | 125.17 |
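
To make the size of the regression explicit, the throughput figures above can be expressed relative to the no-apc baseline. This is a minimal sketch that only restates the reported numbers; the helper name is hypothetical.

```python
# Output token throughput (tok/s) as reported in the results above.
baseline = 140.49  # no-apc
results = {"align-mode": 137.18, "all-mode": 125.17}


def relative_change(value, reference):
    """Fractional change of value with respect to reference."""
    return (value - reference) / reference


for mode, throughput in results.items():
    print(f"{mode}: {relative_change(throughput, baseline):+.1%}")
# prints:
# align-mode: -2.4%
# all-mode: -10.9%
```

So align mode is about 2.4% slower than running without prefix caching, and all mode is about 10.9% slower despite its 54.4% hit rate.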

Question

In this benchmark, align mode has a 0% hit rate. Although all mode does get cache hits, its overall throughput is still lower than running without prefix caching. What would be the recommended way to improve the cache hit rate and avoid the throughput regression?

