vllm - [Performance]: Prefix cache hit lower on vLLM than on other inference stacks

GitHub stats
vllm-project/vllm#38194 (fetched 2026-04-08 01:31:47)
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 1
Timeline (top): labeled ×1, subscribed ×1


Report of performance regression

Hi, our prefix cache hit rate is only 60-80% on vLLM running DeepSeek V3.2, but it is consistently 95% with the DeepSeek API. Could there be a problem with prefix caching?

Our workload is multi-turn conversation in an AI character app at high concurrency.

Your current environment (if you think it is necessary)

VLLM_USE_FLASHINFER_MOE_FP4=1 \
vllm serve nvidia/DeepSeek-V3.2-NVFP4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --max-model-len 32768 \
    --enable-prefix-caching

env_info.txt


extent analysis

Fix Plan

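vLLM's prefix cache is not a separate store. Cached prefixes live in the same KV cache block pool that serves running requests, and blocks that no running request references are evicted least-recently-used first when new requests need memory. At high concurrency, a conversation's blocks can be evicted before its next turn arrives, which is a likely cause of the 60-80% hit rate. The fix is to give the KV cache more room and to keep prompt prefixes stable between turns.

Steps to Fix

  • Do not add --prefix-cache-size or --prefix-cache-eviction-policy to the vllm serve command; these do not appear among the documented vllm serve options. The prefix cache capacity is the KV cache capacity, which --gpu-memory-utilization controls.
  • Eviction is built in, so there is no custom policy to implement. Reduce eviction pressure instead: raise --gpu-memory-utilization if the GPUs have headroom, and keep the system prompt and earlier turns byte-identical across requests so that each new turn extends an already cached prefix.

Example Code

VLLM_USE_FLASHINFER_MOE_FP4=1 \
vllm serve nvidia/DeepSeek-V3.2-NVFP4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.95

The value 0.95 is illustrative (the default is 0.9); choose it from the memory headroom actually available on the GPUs.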

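An application-level cache such as cachetools cannot stand in for vLLM's KV cache, because the cached state is GPU memory managed by the engine. The snippet below only illustrates the least-recently-used eviction pattern in Python; calculate_prefix_cache is a hypothetical stand-in for the expensive computation.

from cachetools import LRUCache

# Toy cache: holds at most 10000 entries and evicts the least recently used one.
cache = LRUCache(maxsize=10000)

def calculate_prefix_cache(key):
    # Hypothetical stand-in for an expensive prefill computation.
    return hash(key)

def get_prefix_cache(key):
    # Compute and store the value on a miss, then return the cached value.
    if key not in cache:
        cache[key] = calculate_prefix_cache(key)
    return cache[key]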
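
Why high concurrency lowers the hit rate can be illustrated with a toy simulation. The sketch below is not vLLM's block manager. It assumes a single LRU pool of fixed-size blocks, sessions served round-robin that resend their whole history each turn, and prefix matching that stops at the first missing block. The names simulate_hit_rate, blocks_per_turn, and capacity_blocks are hypothetical.

```python
from collections import OrderedDict


def simulate_hit_rate(num_sessions, turns, blocks_per_turn, capacity_blocks):
    """Return hits / lookups for round-robin multi-turn sessions on an LRU pool."""
    pool = OrderedDict()  # block key -> None, ordered from oldest to newest use
    hits = 0
    lookups = 0
    for turn in range(turns):
        for session in range(num_sessions):
            prefix_ok = True  # becomes False at the first missing block
            # The request carries every block of the conversation so far.
            for block in range((turn + 1) * blocks_per_turn):
                key = (session, block)
                lookups += 1
                if prefix_ok and key in pool:
                    hits += 1
                else:
                    prefix_ok = False
                    pool[key] = None  # (re)compute and cache the block
                pool.move_to_end(key)  # mark as most recently used
                while len(pool) > capacity_blocks:
                    pool.popitem(last=False)  # evict the least recently used
    return hits / lookups
```

When capacity_blocks covers every session, the hit rate is limited only by the new blocks each turn adds. Once capacity_blocks falls below the combined working set, blocks are evicted between turns and the hit rate drops, which matches the pattern described in the report.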

Verification

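Verify the change by comparing the prefix cache hit rate before and after under the same traffic. vLLM exposes Prometheus metrics at /metrics on the serving port (30000 here), which Prometheus can scrape and Grafana can chart. The counters are cumulative since server start, so compare rates over a time window rather than lifetime totals, and check that the vLLM figure and the DeepSeek API figure measure the same thing (for example, cached prompt tokens over total prompt tokens).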
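
A windowed hit rate can be computed from two scrapes of the metrics endpoint. The sketch below parses the Prometheus text format directly. The counter names vllm:prefix_cache_hits_total and vllm:prefix_cache_queries_total are assumptions that may differ between vLLM versions, so check them against the actual /metrics output. The parser also assumes sample lines carry no trailing timestamp.

```python
HITS = "vllm:prefix_cache_hits_total"        # assumed metric name
QUERIES = "vllm:prefix_cache_queries_total"  # assumed metric name


def read_counters(metrics_text, hits_name=HITS, queries_name=QUERIES):
    """Sum both counters across all label sets in one scrape."""
    totals = {hits_name: 0.0, queries_name: 0.0}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in totals:
            totals[name] += float(value)
    return totals[hits_name], totals[queries_name]


def hit_rate_between(before_text, after_text):
    """Hit rate over the interval between two scrapes, or None if idle."""
    hits_0, queries_0 = read_counters(before_text)
    hits_1, queries_1 = read_counters(after_text)
    delta_queries = queries_1 - queries_0
    if delta_queries <= 0:
        return None  # no traffic, or the server restarted in between
    return (hits_1 - hits_0) / delta_queries
```

The two texts can be fetched with any HTTP client from http://<host>:30000/metrics a few minutes apart, once before and once after the configuration change, under comparable load.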

Extra Tips

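  • Monitor the cache hit ratio together with KV cache usage. If usage stays near 100% while the hit ratio falls, the cache is capacity-bound: give the KV cache more memory with --gpu-memory-utilization or lower the concurrency per server.
  • redis or memcached cannot extend the prefix cache on their own, since vLLM does not read KV blocks from them directly. For larger workloads, consider KV cache offloading through a vLLM KV connector, for example with a separate project such as LMCache.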
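
A rough way to judge whether the cache is capacity-bound is to compare the working set of active conversations with the KV cache size in tokens. The sketch below is back-of-the-envelope arithmetic under stated assumptions: all sessions share one system prompt that is stored once, every session has the same context length, and the function and parameter names are hypothetical.

```python
def working_set_tokens(active_sessions, avg_context_tokens, shared_prefix_tokens=0):
    """Tokens that must stay cached for every active session to hit fully."""
    if shared_prefix_tokens > avg_context_tokens:
        raise ValueError("shared prefix cannot exceed the average context")
    unique_per_session = avg_context_tokens - shared_prefix_tokens
    # The shared prefix is counted once; the rest is per session.
    return shared_prefix_tokens + active_sessions * unique_per_session


def resident_fraction(active_sessions, avg_context_tokens, kv_cache_tokens,
                      shared_prefix_tokens=0):
    """Upper bound on the share of the working set that fits in the KV cache."""
    needed = working_set_tokens(
        active_sessions, avg_context_tokens, shared_prefix_tokens
    )
    if needed == 0:
        return 1.0
    return min(1.0, kv_cache_tokens / needed)
```

With hypothetical numbers, resident_fraction(100, 8000, 301000, shared_prefix_tokens=2000) evaluates to 0.5: only half of the working set can stay resident, so a hit rate well below 95% would be expected regardless of eviction details. The real KV cache size in tokens should be taken from the running server rather than guessed.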
