vllm - 💡(How to fix) Fix [Bug]: Prefix-cache 0% hit on re-sent request — DeepSeek-V4-Flash hybrid groups lose all first-block cache keys on every request reassignment (DSv4 variant of #32802)

Your current environment

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, May  4 2026, 09:06:35) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-100-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA H200 NVL
GPU 1: NVIDIA H200 NVL

Nvidia driver version        : 595.71.05
cuDNN version                : Could not collect

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU(s):                                  60
Model name:                              AMD EPYC-Turin Processor
Virtualization:                          AMD-V
Hypervisor vendor:                       KVM
Virtualization type:                     full

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.7.0
[pip3] triton==3.6.0

==============================
         vLLM Info
==============================
vLLM Version                 : 0.20.1
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
      GPU0   GPU1   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0   X    NV6    0-59           0               N/A
GPU1   NV6  X      0-59           0               N/A

==============================
     Environment Variables
==============================
VLLM_USE_DEEP_GEMM=1
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=13.0.2
VLLM_ENABLE_CUDA_COMPATIBILITY=0
VLLM_ENGINE_READY_TIMEOUT_S=3600
VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0
VLLM_RPC_TIMEOUT=600000
PYTORCH_NVML_BASED_CUDA_CHECK=1

🐛 Describe the bug

Summary

For DeepSeek-V4-Flash, the chat-template prelude makes the first block of every request hash to the same value (256 tokens of identical model-side framing precede any user content). For each of the 5 KV-cache groups in the hybrid coordinator, this produces a single-storage entry in BlockHashToBlockMap._cache keyed by BlockHashWithGroupId(shared_hash, group_id) — 5 entries total per "first-block-of-request" position.

When a new request runs get_new_blocks → _maybe_evict_cached_block, the physical blocks being reassigned have their old hash entries removed from _cache. Because the first-block entries are single-storage (one block per group per cached request), each reassignment that touches such a block immediately destroys the cache key. The next lookup for the shared first-block hash returns None, the hybrid coordinator's intersection cascade collapses hit_length to 0 across every group, and a cold prefill follows — even though the pool is <10% utilised and num_preemptions_total=0.

This is the same family as #32802 (GPT-OSS hybrid), but the mechanism is "single-storage entries destroyed on reassignment", not EAGLE spiral block-drop. PR #33524 special-cased the simple 1-Full+1-SWA topology; for DSv4-Flash with 5 KV-cache groups in 4 attention-groups, the bug surfaces through this much-simpler eviction path.

This bug is not currently tracked by the DeepSeek V4 roadmap (reviewed 2026-05-14). The roadmap's KV-cache section covers only offloading; no item addresses cache-coordinator correctness for multi-group hybrid models.

Environment

  • vLLM image: vllm/vllm-openai:v0.20.1@sha256:9eff9734a30b6713a8566217d36f8277630fd2d31cec7f0a0292835901a23aa4 (from SemiAnalysisAI/InferenceX#1222 recipe)
  • Model: deepseek-ai/DeepSeek-V4-Flash
  • GPU: 2× H200 SXM 141 GB
  • Flags: --tensor-parallel-size 2 --enable-expert-parallel --kv-cache-dtype fp8 --block-size 256 --enable-prefix-caching --gpu-memory-utilization 0.95 --max-num-seqs 512 --max-num-batched-tokens 4096
  • cache_config_info reports: num_gpu_blocks=54774 (≈14M tokens pool), sliding_window=128, prefix_caching_hash_algo=sha256

Reproduction (deterministic, runs in <5 minutes)

Synthetic 4-request test, all via direct vLLM /v1/chat/completions. PROMPT_A and PROMPT_B share only the chat-template prelude; user-content diverges from the first user-message token.

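import hashlib
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"

def long_prompt(seed_phrase, target_tokens=250_000):
    # Deterministic filler: each chunk embeds an MD5 of (seed, index).
    chunks = []
    for i in range(target_tokens // 50):
        h = hashlib.md5(f"{seed_phrase}-{i}".encode()).hexdigest()
        chunks.append(f"Section {i} [{h}]: This is filler text with seed {seed_phrase}, iteration {i}, hash {h}, padding " + "x"*100)
    return "\n".join(chunks)

PROMPT_A = long_prompt("alpha", 250_000)
PROMPT_B = long_prompt("bravo", 250_000)

def read_counter(name):
    # Sum every sample of one counter on /metrics. The "vllm:" prefix passed by
    # send() is an assumption; adjust it to match your server's /metrics output.
    text = urllib.request.urlopen(f"{BASE_URL}/metrics").read().decode()
    return sum(float(line.rsplit(" ", 1)[1])
               for line in text.splitlines() if line.startswith(name))

def send(prompt, label):
    body = {
        "model": "deepseek-ai/DeepSeek-V4-Flash",
        "messages": [{"role": "user", "content": prompt + "\n\nReply 'ok' only."}],
        "max_tokens": 3, "temperature": 0.0, "stream": False,
        "chat_template_kwargs": {"enable_thinking": False},
    }
    # Snapshot the counters, POST the request, time it, then report the deltas.
    hits0 = read_counter("vllm:prefix_cache_hits_total")
    queries0 = read_counter("vllm:prefix_cache_queries_total")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    wall = time.time() - start
    hits = read_counter("vllm:prefix_cache_hits_total") - hits0
    queries = read_counter("vllm:prefix_cache_queries_total") - queries0
    print(f"{label}: {wall:.0f}s  {hits:.0f} / {queries:.0f}")

send(PROMPT_A, "A.1"); time.sleep(1)
send(PROMPT_A, "A.2"); time.sleep(1)   # should be ~100% hit
send(PROMPT_B, "B");   time.sleep(1)   # different prefix, cold
send(PROMPT_A, "A.3")                   # should be ~100% hit, IS 0%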

Observed (consistent across many runs)

Step  Wall   hits / queries      hit-rate
A.1   ~50s   0 / 373,943         0.0%
A.2   ~1s    373,760 / 373,943   100.0%
B     ~43s   0 / 378,175         0.0%
A.3   ~50s   0 / 373,943         0.0%

KV-pool utilisation stayed at <10% throughout, num_preemptions_total=0. Pool has ~54k blocks; A and B together use a small fraction.
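A quick back-of-the-envelope check of the "small fraction" claim, using only figures quoted in this report. It is a rough sketch: it counts only full-attention blocks at block_size=256 for one prompt and ignores the sliding-window groups, which is an assumption about how to approximate usage, not how vLLM accounts for blocks.

```python
# Figures quoted in this report.
NUM_GPU_BLOCKS = 54_774   # cache_config_info: num_gpu_blocks
BLOCK_SIZE = 256          # --block-size
PROMPT_TOKENS = 373_943   # queries counted for PROMPT_A

pool_tokens = NUM_GPU_BLOCKS * BLOCK_SIZE

# Ceiling division: a partially filled last block still occupies a block.
full_attn_blocks = -(-PROMPT_TOKENS // BLOCK_SIZE)
share = full_attn_blocks / NUM_GPU_BLOCKS

print(f"pool capacity: {pool_tokens:,} tokens")
print(f"full-attention blocks for one prompt: {full_attn_blocks}")
print(f"share of pool: {share:.1%}")
```

Under these assumptions one prompt needs well under a tenth of the pool, which is consistent with the reported utilisation.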

Mechanism (verified via runtime patches)

I added DIAG logging to four files (kv_cache_coordinator.py, block_pool.py, single_type_kv_cache_manager.py, kv_cache_utils.py) — bind-mounted over the official v0.20.1 image, no rebuild. Findings:

1. DSv4-Flash hybrid topology

gidx=0  MLAAttentionSpec        block_size=256  group_ids=[0]    manager=FullAttentionManager
gidx=1  SlidingWindowMLASpec    block_size=64   group_ids=[1,2]  manager=SlidingWindowManager
gidx=2  SlidingWindowMLASpec    block_size=4    group_ids=[3]    manager=SlidingWindowManager
gidx=3  SlidingWindowMLASpec    block_size=8    group_ids=[4]    manager=SlidingWindowManager

Five KV groups in four attention-groups: one full-attention MLA group plus three SWA-MLA groups with different (block_size, sliding_window) tuples.

2. The MLA group is the one returning 0 at A.3

HybridKVCacheCoordinator.find_longest_cache_hit log during A.3:

DIAG-ENTRY max_cache_hit_length=373942 num_groups=5 num_attn_groups=4 first_hash[:8]=ea70661868b2dcbb
DIAG-LOOKUP iter=1 gidx=0 type=MLAAttentionSpec      block_size=256 max_in=373942 hit_out=0 blocks=0
DIAG-LOOKUP iter=1 gidx=1 type=SlidingWindowMLASpec  block_size=64  max_in=0      hit_out=0 blocks=0
DIAG-LOOKUP iter=1 gidx=2 type=SlidingWindowMLASpec  block_size=4   max_in=0      hit_out=0 blocks=0
DIAG-LOOKUP iter=1 gidx=3 type=SlidingWindowMLASpec  block_size=8   max_in=0      hit_out=0 blocks=0
DIAG-CONVERGED iter=2 final_hit=0

The MLA group (gidx=0) immediately returns hit_out=0. curr_hit_length collapses to 0 and propagates to every subsequent group via max_in=0. The hybrid intersection cascade is a consequence, not the cause; SWA groups are not the actor.
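The cascade in the log above can be pictured with a toy model. This is a simplified sketch, not the code in kv_cache_coordinator.py: `cascade` and `cached_up_to` are hypothetical names, and the value 373_760 for the sliding-window groups is only a placeholder for "these groups might still have had cached entries".

```python
from typing import Callable, List, Tuple

# A toy group maps "longest length it is offered" to "longest prefix it can serve".
Lookup = Callable[[int], int]

def cascade(max_len: int, groups: List[Tuple[str, Lookup]]) -> int:
    """Run the groups in order; each one can only shrink the running hit length."""
    hit = max_len
    for name, lookup in groups:
        out = lookup(hit)
        print(f"{name}: max_in={hit} hit_out={out}")
        hit = min(hit, out)
    return hit

def cached_up_to(n: int) -> Lookup:
    """Toy lookup for a group that has the first n tokens cached."""
    return lambda max_in: min(max_in, n)

groups = [
    ("gidx=0 full-attn", cached_up_to(0)),        # first-block key is missing
    ("gidx=1 swa", cached_up_to(373_760)),
    ("gidx=2 swa", cached_up_to(373_760)),
    ("gidx=3 swa", cached_up_to(373_760)),
]

final_hit = cascade(373_942, groups)
print("final_hit =", final_hit)
```

Because the first group returns 0, every later group is offered max_in=0 and the final hit length is 0, whatever those groups had cached.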

3. Block hashes are stable

NONE_HASH is process-global and unchanged across requests. first_hash[:8] is identical across A.1, A.2, A.3, and B, since the chat-template prelude is identical across all four. reset_prefix_cache is never called. Hash drift or salt instability is not the issue.
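To illustrate why two prompts that share only the prelude agree on the first block hash and nothing after it, here is a toy chained hash. It is a sketch under stated assumptions: the seed value, the 4-token block size, and the `repr`-based serialization are invented for the example and do not match vLLM's actual hashing.

```python
import hashlib

BLOCK_SIZE = 4  # toy value; the report uses 256

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash full blocks: each hash covers the parent hash plus the block's tokens."""
    hashes = []
    parent = b"toy-none-hash"  # stand-in for a process-global seed
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent.hex()[:16])
    return hashes

prelude = [1, 2, 3, 4]  # identical framing tokens, exactly one full block
prompt_a = prelude + [10, 11, 12, 13, 14, 15, 16, 17]
prompt_b = prelude + [20, 21, 22, 23, 24, 25, 26, 27]

ha = block_hashes(prompt_a)
hb = block_hashes(prompt_b)
print("first block equal: ", ha[0] == hb[0])
print("second block equal:", ha[1] == hb[1])
```

The first hash matches because both the seed and the first block's tokens match; every later hash differs because the chain carries the divergent content forward.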

4. The actual eviction trail

Instrumented BlockHashToBlockMap.insert and .pop with stack traces. For the first-block hash (5 keys total, one per group_id):

10:07:54  INSERT (5 keys, all "existing=new")  ← A.1 caches its 5 first-block entries
10:08:05  POP    block_id=11321                 ← B's get_new_blocks → _maybe_evict_cached_block
10:08:05  POP    block_id=11257                    pops blocks whose hash matches A.1's first-block keys
10:08:06  POP    block_id=11799
10:08:06  POP    block_id=11691
10:08:46  INSERT (5 keys, all "existing=new")  ← B's first-block entries (fresh, since A's were gone)
10:08:58  POP    block_id=5908                  ← A.3's get_new_blocks evicts B's entries
10:08:58  POP    block_id=5972
10:08:58  POP    block_id=4647
10:08:58  POP    block_id=6760

Each first-block cache entry is single-storage (one block_id per (hash, group_id)-keyed entry). When get_new_blocks reassigns a physical block, _maybe_evict_cached_block calls BlockHashToBlockMap.pop(block.block_hash, block.block_id) — which removes the single entry from _cache entirely. There's no second block_id to fall back on.
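The effect of popping a single-storage entry can be reproduced with a toy map. This is a minimal sketch and not vLLM's BlockHashToBlockMap: `ToyHashToBlockMap` is a hypothetical class, and the block ids 100 to 104 are made up for the example.

```python
class ToyHashToBlockMap:
    """Toy map from a cache key to the block ids that hold that content."""

    def __init__(self):
        self._cache = {}  # key -> set of block ids

    def insert(self, key, block_id):
        self._cache.setdefault(key, set()).add(block_id)

    def pop(self, key, block_id):
        ids = self._cache.get(key)
        if ids is None or block_id not in ids:
            return False
        ids.remove(block_id)
        if not ids:  # that was the last block for this key: the key disappears
            del self._cache[key]
        return True

    def lookup(self, key):
        ids = self._cache.get(key)
        return next(iter(ids)) if ids else None

shared_hash = "ea70661868b2dcbb"
keys = [(shared_hash, gid) for gid in range(5)]

cache = ToyHashToBlockMap()
for key, block_id in zip(keys, range(100, 105)):  # hypothetical block ids
    cache.insert(key, block_id)

hit_before = cache.lookup(keys[0])
cache.pop(keys[0], 100)  # block 100 is reassigned to a new request
hit_after = cache.lookup(keys[0])
print(hit_before, hit_after)
```

With one block per key, a single pop is enough to turn a hit into a miss. A key backed by two blocks would survive the same pop.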

The five POPs at 10:08:05–06 are 5 distinct keys (one per group_id), not 5 entries in one dict. My earlier 8-byte logging clipped before the group_id portion of BlockHashWithGroupId, making them look identical. They're actually (shared_hash, gid=0), (shared_hash, gid=1), etc. — each single-storage, each destroyed on the corresponding block's reassignment.
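The logging pitfall is easy to reproduce in isolation. The sketch below assumes, purely for illustration, that a per-group key is the hash bytes followed by a 4-byte group id; `toy_key` is a hypothetical helper and the real layout of BlockHashWithGroupId may differ.

```python
import hashlib

def toy_key(block_hash: bytes, group_id: int) -> bytes:
    # Assumed layout for illustration: hash bytes followed by a 4-byte group id.
    return block_hash + group_id.to_bytes(4, "big")

shared_hash = hashlib.sha256(b"chat-template prelude").digest()
keys = [toy_key(shared_hash, gid) for gid in range(5)]

clipped = {k[:8].hex() for k in keys}  # what an 8-byte log prefix shows
full = {k.hex() for k in keys}         # what the map actually stores

print(len(clipped), "distinct value(s) in the clipped log")
print(len(full), "distinct key(s) in the map")
```

Logging the whole key, or the group id alongside the hash prefix, avoids reading five distinct keys as one.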

5. Why B's re-caching doesn't save A.3

B's prefill re-populates the same 5 cache keys at 10:08:46 with B's freshly allocated block_ids. Those entries should make A.3's lookup hit, because B's first-block content is identical to A's (both share the chat template). But by the time A.3 invokes find_longest_cache_hit, A.3's own get_new_blocks has already evicted those same blocks (vLLM allocates "scratch" blocks for the expected-cold prefill before or during lookup). The 5 POPs at 10:08:58 show this: A.3's allocation walks the free queue, picks up the blocks recently freed by B (now near the front of the LRU queue), and _maybe_evict_cached_block strips them from _cache. By the time the coordinator's lookup runs, the keys are gone.

6. STM-LOOKUP timeline confirms it

09:31:07  STM-LOOKUP class=FullAttn hash=ea7066186... lookup=HIT   ← A.2 finds A's first-block
09:31:09  STM-LOOKUP class=FullAttn hash=ea7066186... lookup=MISS  ← B starts (same hash, expected miss, will re-cache)
09:31:54  STM-LOOKUP class=FullAttn hash=ea7066186... lookup=MISS  ← A.3 should hit B's re-cache; doesn't

Cause one-liner

Every block in a warm pool is cached (single-storage entry in BlockHashToBlockMap._cache). get_new_blocks reassigns blocks from the LRU-ordered free queue; each reassignment removes the block's hash entry from _cache via _maybe_evict_cached_block. For DSv4-Flash where every request shares the first-block hash (chat-template prelude), this means request N+1's allocation destroys request N's first-block cache entries. Request N+2 misses, even though physically the prelude content is computed and resident in some block.

Attempted workarounds (both failed)

I tried two patches before opening this issue. Both failed, and the negative results are worth documenting:

Attempt 1: popleft_n prefers block_hash is None

Idea: prefer "uncached" blocks (no cache entry to destroy) over LRU front.

Result: A.3 still 0% hit. Every call logged "uncached pool exhausted, may evict cached". After A.1's prefill, cache_full_blocks runs on every block — there are no block_hash is None blocks left in a warm pool. "Cached vs uncached" is the wrong dimension.
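The idea behind Attempt 1, and why it degenerates in a warm pool, can be shown with a toy selection function. This is not the patch that was tested: `ToyBlock` and `pick_blocks` are hypothetical names, and the sketch only mirrors the description above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyBlock:
    block_id: int
    block_hash: Optional[str] = None  # None means "no cache entry to destroy"

def pick_blocks(free_queue: deque, n: int):
    """Prefer uncached blocks; fall back to queue-front order when they run out."""
    uncached = [b for b in free_queue if b.block_hash is None][:n]
    chosen_ids = {b.block_id for b in uncached}
    fallback = [b for b in free_queue if b.block_id not in chosen_ids][: n - len(uncached)]
    chosen = uncached + fallback
    for b in chosen:
        free_queue.remove(b)
    return chosen, len(fallback)  # second value: how many cached blocks get evicted

# Warm pool: every free block already carries a hash.
warm = deque(ToyBlock(i, f"h{i}") for i in range(6))
chosen, evicting = pick_blocks(warm, 3)
print([b.block_id for b in chosen], "cached blocks evicted:", evicting)
```

When every free block has a hash, the preference never applies and the function behaves like plain front-of-queue allocation, which matches the "uncached pool exhausted" log line.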

Attempt 2: get_new_blocks prefers dict-storage entries with N>1

Idea: prefer blocks whose _cache[block.block_hash] is a dict with multiple entries (popping leaves the key alive). Defer single-storage entries (whose pop destroys the key).

Result: A.3 still 0% hit. Every call logged safe=0. There are essentially no dict-storage entries with N>1 in the pool — almost all blocks have unique content hashes (single-storage). The shared-prefix entries, while semantically "valuable", land in different keys per group_id (via make_block_hash_with_group_id) — so each is single-storage too.

Both attempts confirm that the bug is structural to the design — single-storage cache entries are destroyed on physical reassignment, and there's no clean criterion at allocation time to avoid it.

Suggested fix direction

The fix needs architectural change, not a one-line tweak. Sketch options, ordered from cheapest to most thorough:

  1. Targeted special-case in HybridKVCacheCoordinator.find_longest_cache_hit for the DSv4-Flash topology. When the lookup encounters a multi-group hybrid model with enable_prefix_caching=True, and the MLA group's first-block lookup returns 0, do NOT propagate max_in=0 to subsequent groups — let each group return its own hit-length independently. The engine would still cold-prefill the MLA group's first block but could reuse cached entries in the SWA / c4 / c128 groups (if those happen to have survived). Surface symptom mitigation, not root cause.

  2. Defer eviction of cached entries until pool pressure exists. Currently _maybe_evict_cached_block runs unconditionally in get_new_blocks — even when the pool has tons of free space. A guard like if num_free_blocks_after_alloc < threshold: ... would keep cache entries intact when the pool is mostly empty. Risk: doesn't address pressure cases.

  3. Pin "prefix-prefix" blocks (the first few blocks of every request). Make them reference-counted at the cache level, not just the request level: each cache key holds a refcount, and _maybe_evict_cached_block decrements rather than removes. Blocks that physically must be reassigned would only have their cache entry removed when the refcount hits zero. Substantial scheduler-touching change.

  4. Two-tier KV pool: separate working-set blocks from cached-only blocks. Cached-only blocks would only be reclaimed under pressure. This is the textbook fix and would solve the entire family of "prefix evicted with free space everywhere" bugs.
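As a rough illustration of option 4, the sketch below keeps never-cached blocks and cached-but-idle blocks in separate queues and reclaims cached blocks only when the first queue is empty. `TwoTierPool` and its methods are hypothetical and omit everything a real pool needs (reference counts, per-group keys, moving a block out of the idle tier on a cache hit). The sketch also assumes a tier of never-cached blocks exists, which Attempt 1 suggests is not the case in a warm pool today, so a real design would have to say how that tier stays non-empty.

```python
from collections import deque

class TwoTierPool:
    """Toy pool: never-cached blocks are handed out before cached-but-idle ones."""

    def __init__(self, num_blocks: int):
        self.fresh = deque(range(num_blocks))  # blocks with no cache entry
        self.cached_idle = deque()             # (block_id, key), oldest first
        self.cache = {}                        # key -> block_id

    def allocate(self) -> int:
        if self.fresh:
            return self.fresh.popleft()
        # Pressure: reclaim the cached block that has been idle the longest.
        block_id, key = self.cached_idle.popleft()
        del self.cache[key]
        return block_id

    def release_cached(self, block_id: int, key) -> None:
        """Request finished: keep the block's content addressable by key."""
        self.cache[key] = block_id
        self.cached_idle.append((block_id, key))

    def lookup(self, key):
        return self.cache.get(key)

pool = TwoTierPool(num_blocks=4)
first = pool.allocate()
pool.release_cached(first, ("shared_hash", 0))

others = [pool.allocate() for _ in range(3)]  # fresh blocks cover these
survives = pool.lookup(("shared_hash", 0))

pool.allocate()                                # no fresh blocks left: eviction
evicted = pool.lookup(("shared_hash", 0))
print(survives, evicted)
```

The cached entry survives as long as fresh blocks can satisfy allocations and is dropped only once the pool is actually under pressure.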

PR #33524's author already noted that a general fix is hard because "SWA attn and Mamba-style attn do not follow the downward-closed property". DSv4-Flash adds the "every request's first block collides on hash" dimension on top.

I'm happy to test patches on this setup — the repro is deterministic, the DIAG patches are minimal bind-mounts, and the 2× H200 box can churn through experiments in <5 minutes per round.

Roadmap context

This bug is not currently tracked by the DeepSeek V4 roadmap (reviewed 2026-05-14). The roadmap's KV-cache section covers only offloading (CPU / distributed); no item addresses cache-coordinator correctness for multi-group hybrid models. "Model Runner V2 Integration" (open, @WoosukKwon) and "MTP optimizations" are the only roadmap items that might touch this code path incidentally — but neither is scoped to fix prefix-cache correctness for multi-group hybrid models.

Related

  • #32802 (GPT-OSS hybrid + EAGLE) — closed by PR #33524 for GPT-OSS specifically. Author explicitly deferred a structural fix for "more complicated models with multiple attention groups". DSv4-Flash is such a model, with a different concrete mechanism but the same end result.
  • PR #33524 — "[Fix] prefix cache hit rate == 0 bug with gpt-oss style models", merged 2026-02-02. Special-cases 1-Full+1-SWA topology only.
  • PR #33270 — referenced in #33524 as an earlier workaround attempt.
  • #40902 (V4 roadmap) — does not currently cover this bug.
  • #38182 (MTP last-block-drop) — disabled MTP, this bug stayed.
  • #37164 (TOCTOU race) / #39146 (KV block corruption) — separate concurrency bugs we see in the same deployment; out of scope here.
