vllm - 💡(How to fix) Fix [Performance]: Prefix cache hit lower on vLLM than on other inference stacks [1 participants]

vllm • 2026-03-26 06:23:03


GitHub stats

vllm-project/vllm#38194 • Fetched 2026-04-08 01:31:47


Author

baoskee

Participants

baoskee

Timeline (top)

labeled ×1, subscribed ×1

Code Example

VLLM_USE_FLASHINFER_MOE_FP4=1 \
vllm serve nvidia/DeepSeek-V3.2-NVFP4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --max-model-len 32768 \
    --enable-prefix-caching


Report of performance regression

Hi, our prefix cache hit rate is only 60-80% on vLLM running DeepSeek V3.2, but it is consistently 95% with the DeepSeek API. Could there be a problem with prefix caching?

Our workload is multi-turn conversation in an AI character app at high concurrency.

Your current environment (if you think it is necessary)

VLLM_USE_FLASHINFER_MOE_FP4=1 \
vllm serve nvidia/DeepSeek-V3.2-NVFP4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --max-model-len 32768 \
    --enable-prefix-caching

env_info.txt


extent analysis

Fix Plan

The proposed fix is to tune prefix caching for high-concurrency workloads. It is unverified: the issue has a single participant and no confirmed resolution on this page.

Steps to Fix

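Update the vllm serve command to include the --prefix-cache-size and --prefix-cache-eviction-policy flags, but only after confirming with vllm serve --help that your installed version accepts them. Neither flag appears in the original report, and they may not exist in vLLM.
If eviction turns out to be the bottleneck, consider a custom prefix cache eviction policy for high concurrency. This is likely to require changes to vLLM itself rather than a configuration option.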
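To see why concurrency and cache capacity interact, the following is a toy simulation, not vLLM's implementation. It assumes round-robin conversations, a plain LRU block cache, and that a request can only reuse blocks up to its first missing block. The function name and all parameters are hypothetical.

```python
from collections import OrderedDict


def simulate_hit_rate(num_conversations, turns, blocks_per_turn, capacity_blocks):
    """Toy model of block-level prefix caching under LRU eviction.

    Each turn, every conversation sends a request covering its whole
    history so far. Blocks count as hits only while the prefix is intact.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    queries = 0
    for turn in range(1, turns + 1):
        for conv in range(num_conversations):
            prefix_intact = True
            for block in range(turn * blocks_per_turn):
                key = (conv, block)
                queries += 1
                if prefix_intact and key in cache:
                    hits += 1
                else:
                    # First miss ends reuse for the rest of this request.
                    prefix_intact = False
                    cache[key] = None
                cache.move_to_end(key)  # mark as most recently used
                while len(cache) > capacity_blocks:
                    cache.popitem(last=False)  # evict least recently used
    return hits / queries if queries else 0.0


if __name__ == "__main__":
    for capacity in (4, 16, 1000):
        rate = simulate_hit_rate(4, 5, 2, capacity)
        print(f"capacity={capacity:5d} blocks -> hit rate {rate:.2f}")
```

In this model the hit rate collapses once the combined working set of all active conversations exceeds the cache capacity, which is one plausible explanation for a lower hit rate at high concurrency.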

Example Code

# NOTE: the last two flags are unverified; check `vllm serve --help` before use.
VLLM_USE_FLASHINFER_MOE_FP4=1 \
vllm serve nvidia/DeepSeek-V3.2-NVFP4 \
    --tensor-parallel-size 8 \
    --quantization modelopt_fp4 \
    --host 0.0.0.0 \
    --port 30000 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --prefix-cache-size 10000 \
    --prefix-cache-eviction-policy lru

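Alternatively, the LRU eviction idea can be illustrated in Python with the cachetools library. This is an application-level cache; it does not replace or configure vLLM's internal KV prefix cache:

from cachetools import LRUCache

# Holds at most 10,000 entries; the least recently used entry is evicted first.
cache = LRUCache(maxsize=10000)

def calculate_prefix_cache(key):
    # Hypothetical placeholder: the original snippet left this function undefined.
    raise NotImplementedError("supply the computation to cache")

def get_prefix_cache(key):
    # Return the cached value, computing and storing it on a miss.
    if key not in cache:
        cache[key] = calculate_prefix_cache(key)
    return cache[key]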

Verification

Verify the fix by comparing the prefix cache hit rate before and after the change under the same workload. vLLM exposes Prometheus-format metrics, so you can scrape them with Prometheus and chart them in Grafana.
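As a quick check without a full Prometheus setup, the hit ratio can be computed directly from the text of the server's metrics endpoint. This is a minimal sketch: the counter names below are assumptions and should be checked against what your server actually reports, and the URL assumes the port 30000 used in the serve command above.

```python
import urllib.request

# Assumed counter names; confirm them in your server's /metrics output.
HITS = "vllm:prefix_cache_hits_total"
QUERIES = "vllm:prefix_cache_queries_total"


def fetch_metrics(url="http://localhost:30000/metrics"):
    """Download the Prometheus text exposition from the server."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def read_counter(metrics_text, name):
    """Sum every sample of `name`; return None if it never appears."""
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            metric = line[: line.index("{")]
            rest = line[line.rindex("}") + 1 :]
        else:
            metric, _, rest = line.partition(" ")
        fields = rest.split()
        if metric == name and fields:
            total += float(fields[0])
            found = True
    return total if found else None


def hit_ratio(metrics_text, hits_name=HITS, queries_name=QUERIES):
    """Return hits / queries, or None if the counters are missing or zero."""
    hits = read_counter(metrics_text, hits_name)
    queries = read_counter(metrics_text, queries_name)
    if hits is None or not queries:
        return None
    return hits / queries
```

Counters are cumulative since server start, so for a before/after comparison take two snapshots and divide the differences rather than the raw totals.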

Extra Tips

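Monitor the cache hit ratio while adjusting cache capacity. Confirm that --prefix-cache-size exists in your vLLM version before relying on it.
Redis or memcached can cache application-level results for larger workloads, but they are separate from vLLM's KV prefix cache and will not change its hit rate.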
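Before tuning the server, it can help to confirm that consecutive requests in a conversation really share a long identical prefix, since anything that changes early tokens (for example a timestamp in the system prompt) prevents reuse. The sketch below estimates the reusable share of a request from token ID lists. The function names are hypothetical, and block_size=16 is an assumption about the cache's block granularity.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_fraction(prev_tokens, next_tokens, block_size=16):
    """Share of next_tokens covered by full blocks shared with prev_tokens.

    Only whole blocks of an identical prefix are counted as reusable.
    """
    if not next_tokens:
        return 0.0
    shared = shared_prefix_len(prev_tokens, next_tokens)
    full_blocks = shared // block_size
    return full_blocks * block_size / len(next_tokens)


if __name__ == "__main__":
    turn_1 = list(range(40))
    turn_2 = turn_1 + [99] * 8          # history kept intact, new tokens appended
    turn_2_shifted = [7] + turn_1       # one changed token at the very start
    print(reusable_fraction(turn_1, turn_2))
    print(reusable_fraction(turn_1, turn_2_shifted))
```

If the reusable fraction measured on real traffic is already well below the expected hit rate, the gap is more likely caused by prompt construction than by the server's cache.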
