vllm - ✅(Solved) Fix [RFC] Tail-Optimized LRU (T-LRU): Reducing Tail Latency via Conversation-Aware KV Cache Eviction [1 pull requests, 1 participants]

vllm-project/vllm#37823 (fetched 2026-04-08 01:17:54)

We propose Tail-Optimized LRU (T-LRU), a lightweight modification to vLLM's existing LRU prefix-cache eviction policy that reduces P95 tail Time-to-First-Token (TTFT) by up to 27.4% on real conversation workloads, with no overhead during normal (cache-hit) operation and no change to API surface.

The idea and full analysis appear in our paper:

Tail-Optimized Caching for LLM Inference. Wenxin Zhang, Ciamac C. Moallemi, Tianyi Peng (Columbia Business School). NeurIPS 2025. arXiv: https://arxiv.org/abs/2510.15152

Fix Action

Fixed

PR fix notes

PR #37825: [Core] Add Tail-Optimized LRU (T-LRU) KV cache eviction policy

Description (problem / solution / changelog)

This PR implements the T-LRU policy proposed in #37823 and our NeurIPS 2025 paper (https://arxiv.org/abs/2510.15152).

Purpose

Implements the Tail-Optimized LRU (T-LRU) KV cache eviction policy proposed in issue #37823 and our NeurIPS 2025 paper (arXiv:2510.15152).

T-LRU is a two-queue extension of vLLM's LRU prefix cache eviction policy that reduces tail (P95/P99) TTFT by preferentially evicting KV cache blocks that are provably safe to evict without violating a user-specified latency SLA. It is fully backward-compatible: when --tlru-xi-tokens is not set, behavior is identical to standard LRU.

For a request with conversation history H blocks, estimated next-query length Q_hat blocks, and latency budget xi (converted from tokens to blocks), the TEL-safe cap is B = max(0, H + Q_hat - xi). Blocks at positions B..H-1 (the suffix/tail) can be evicted without pushing the next turn's recomputation cost above xi. T-LRU routes these blocks to a dedicated tel_safe_queue and drains it before the normal LRU queue.
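
To make the cap concrete, here is a minimal Python sketch of the arithmetic above. The function name tel_safe_split and the example numbers are illustrative assumptions, not identifiers or values taken from the PR; all quantities are in blocks.

```python
def tel_safe_split(h: int, q_hat: int, xi: int) -> tuple[int, range]:
    """Return the cap B and the positions of the TEL-safe suffix.

    h, q_hat and xi are all measured in blocks. Hypothetical helper,
    not part of vLLM.
    """
    b = max(0, h + q_hat - xi)
    # B can exceed H when q_hat alone is larger than xi; in that case
    # no block is TEL-safe, so clamp the start of the suffix to H.
    start = min(b, h)
    return b, range(start, h)


# Example: 600 history blocks, 13 estimated query blocks, budget of 256 blocks.
b, safe = tel_safe_split(h=600, q_hat=13, xi=256)
print(b, len(safe))  # 357 243
```

In this example, keeping the first 357 blocks leaves 243 history blocks plus 13 query blocks to recompute, which is exactly the budget of 256. A short conversation (say h=100 with the same q_hat and xi) gets B = 0, so all of its blocks count as TEL-safe.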

Changes

  • vllm/v1/core/kv_cache_utils.py: add is_tel_safe: bool to KVCacheBlock
  • vllm/v1/core/block_pool.py: add tel_safe_queue, free_blocks_tlru(), modify get_new_blocks() to drain tel_safe_queue first
  • vllm/v1/core/single_type_kv_cache_manager.py: route freed blocks to free_blocks_tlru() when T-LRU is enabled
  • vllm/v1/core/kv_cache_coordinator.py: forward tlru_xi_blocks, tlru_qhat_blocks to BlockPool
  • vllm/v1/core/kv_cache_manager.py: convert token params to block params
  • vllm/v1/core/sched/scheduler.py: read T-LRU params from CacheConfig
  • vllm/config/cache.py: add tlru_xi_tokens, tlru_qhat_tokens to CacheConfig (excluded from compute_hash)
  • vllm/engine/arg_utils.py: expose --tlru-xi-tokens, --tlru-qhat-tokens CLI flags
  • tests/v1/core/test_tlru_eviction.py: 15 unit tests covering routing logic, queue priority, config wiring
  • docs/design/tlru_caching.md: user-facing documentation
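
The token-to-block conversion listed for kv_cache_manager.py could look roughly like the sketch below. The helper name, the block size of 16 tokens, and the rounding directions (xi rounded down so the budget is never overstated, Q_hat rounded up so the query is never understated) are assumptions made for illustration; the PR may choose differently.

```python
from typing import Optional


def tokens_to_blocks(
    xi_tokens: Optional[int], qhat_tokens: int, block_size: int
) -> Optional[tuple[int, int]]:
    """Convert T-LRU token parameters to block units (hypothetical helper).

    Returns None when xi_tokens is unset, mirroring the statement that
    T-LRU is disabled unless --tlru-xi-tokens is given.
    """
    if xi_tokens is None:
        return None
    xi_blocks = xi_tokens // block_size           # floor (assumed)
    qhat_blocks = -(-qhat_tokens // block_size)   # ceiling (assumed)
    return xi_blocks, qhat_blocks


print(tokens_to_blocks(4096, 200, block_size=16))  # (256, 13)
print(tokens_to_blocks(None, 200, block_size=16))  # None
```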

Test Plan

# CPU-only unit tests (no GPU required):
pytest tests/v1/core/test_tlru_eviction.py -v

# Full KV cache regression tests (requires GPU):
pytest tests/v1/core/test_kv_cache_utils.py -m cpu_test

GPU test results pending — will update

## Changed files

- `docs/design/tlru_caching.md` (added, +92/-0)
- `tests/v1/core/test_tlru_eviction.py` (added, +321/-0)
- `vllm/config/cache.py` (modified, +17/-0)
- `vllm/engine/arg_utils.py` (modified, +2/-0)
- `vllm/v1/core/block_pool.py` (modified, +93/-6)
- `vllm/v1/core/kv_cache_coordinator.py` (modified, +29/-12)
- `vllm/v1/core/kv_cache_manager.py` (modified, +14/-0)
- `vllm/v1/core/kv_cache_utils.py` (modified, +6/-0)
- `vllm/v1/core/sched/scheduler.py` (modified, +2/-0)
- `vllm/v1/core/single_type_kv_cache_manager.py` (modified, +9/-2)

Code Example

vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200

---

# Branch: feature/tlru-eviction-policy on github.com/wenxinzhang0/vllm
vllm serve meta-llama/Llama-3-8B \
  --enable-prefix-caching \
  --tlru-xi-tokens 4096 \
  --tlru-qhat-tokens 200

Motivation.

Motivation

vLLM's current eviction policy is LRU. LRU maximizes cache-hit rate but is conversation-length blind: it treats a block from a 5-turn, 10 000-token conversation identically to a block from a 1-turn, 100-token conversation. This creates an avoidable source of tail latency.

Key Insight

For a conversation with history H blocks and estimated next-query length Q_hat blocks, keeping the first B = max(0, H + Q_hat - xi) blocks cached is enough to hold the next turn's recomputation to xi, the SLA threshold, assuming the next query is no longer than Q_hat. Any block beyond this cap is TEL-safe (Tail Excess Latency safe): it can be evicted without causing an SLO violation for that conversation.

Proposed Change.

Algorithm

T-LRU is a two-queue extension of LRU.

When a request completes and its blocks are freed:

  1. Compute B = max(0, H + Q_hat - xi) for the request.
  2. The last H - B blocks (suffix/tail) are TEL-safe -> append to tel_safe_queue.
  3. The first B blocks (prefix/head) are TEL-unsafe -> append to the existing LRU free queue as before.

When new blocks are needed:

  1. Drain tel_safe_queue first.
  2. Fall back to the normal LRU queue for any remaining demand.

xi and Q_hat are tunable parameters exposed as CLI flags:
vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200
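
The free and allocate steps of the algorithm can be traced with a small self-contained simulation. It uses plain collections.deque objects and string labels as stand-ins for blocks; none of the function or variable names below come from the vLLM code base, and xi and Q_hat are given in blocks.

```python
from collections import deque

free_queue = deque()      # stand-in for the normal LRU free queue
tel_safe_queue = deque()  # stand-in for the TEL-safe queue


def free_request(blocks, q_hat, xi):
    """Route one finished request's blocks (in position order) to the queues."""
    b = max(0, len(blocks) + q_hat - xi)
    for pos, block in enumerate(blocks):
        (free_queue if pos < b else tel_safe_queue).append(block)


def take(n):
    """Pick n blocks to reuse: TEL-safe blocks first, then LRU order."""
    out = []
    while len(out) < n and tel_safe_queue:
        out.append(tel_safe_queue.popleft())
    while len(out) < n and free_queue:
        out.append(free_queue.popleft())
    return out


# A long conversation finishes first, a short one second (xi=4, Q_hat=1).
free_request([f"long{i}" for i in range(6)], q_hat=1, xi=4)
free_request([f"short{i}" for i in range(2)], q_hat=1, xi=4)
reused = take(6)
print(reused)
# ['long3', 'long4', 'long5', 'short0', 'short1', 'long0']
```

In this toy model, a single FIFO queue with the same insertion order would hand back all six blocks of the long conversation. With two queues, that conversation keeps two of its three prefix blocks (long1 and long2) under the same demand.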

Implementation

All changes are confined to the v1 KV cache stack. No changes to the scheduler, attention kernels, or serving API.

  • vllm/v1/core/kv_cache_utils.py: Added is_tel_safe: bool = False field to KVCacheBlock
  • vllm/v1/core/block_pool.py: Added tel_safe_queue, free_blocks_tlru(), modified get_new_blocks()
  • vllm/v1/core/single_type_kv_cache_manager.py: free() routes to free_blocks_tlru() when T-LRU is enabled
  • vllm/v1/core/kv_cache_coordinator.py: Forwards params to BlockPool
  • tests/v1/core/test_tlru_eviction.py: 15 unit tests

Experimental Results (WildChat dataset)

T-LRU reduces P95 TTFT by up to 27.4% vs LRU and closes 25-79% of the gap to the clairvoyant offline optimum. Results on ShareGPT are consistent.

Feedback Period.

At least one week.

CC List.

No response

Any Other Things.

Design Decisions & Open Questions

  1. Default for Q_hat (tlru-qhat-tokens): We default to 200 tokens.
  2. tel_safe_queue vs FreeKVCacheBlockQueue pointer: The current separate-deque approach is simpler.
  3. Interaction with prefix caching: TEL-safe blocks are evicted before LRU pool blocks, which is intentional.

How to Try It

# Branch: feature/tlru-eviction-policy on github.com/wenxinzhang0/vllm
vllm serve meta-llama/Llama-3-8B \
  --enable-prefix-caching \
  --tlru-xi-tokens 4096 \
  --tlru-qhat-tokens 200

extent analysis

Fix Plan

To implement the Tail-Optimized LRU (T-LRU) eviction policy, follow these steps:

  • Update the KVCacheBlock class in vllm/v1/core/kv_cache_utils.py to include an is_tel_safe field:

```python
class KVCacheBlock:
    # existing fields...
    is_tel_safe: bool = False
```

  • Modify the BlockPool class in vllm/v1/core/block_pool.py to include a tel_safe_queue and implement the free_blocks_tlru method. The sketch below is simplified and may differ from the PR: it uses plain deques, assumes the blocks of one request are passed in position order, and takes xi and qhat in blocks:

```python
from collections import deque


class BlockPool:
    # existing fields and methods...
    def __init__(self):
        # Per-instance queues (a class-level deque would be shared by all pools)
        self.free_queue = deque()      # normal LRU free queue
        self.tel_safe_queue = deque()  # TEL-safe blocks, evicted first

    def free_blocks_tlru(self, blocks, xi, qhat):
        h = len(blocks)  # conversation history length in blocks
        b = max(0, h + qhat - xi)
        for position, block in enumerate(blocks):
            if position < b:
                # TEL-unsafe block, append to LRU free queue
                self.free_queue.append(block)
            else:
                # TEL-safe block, append to tel_safe_queue
                self.tel_safe_queue.append(block)

    def get_new_blocks(self, num_blocks):
        allocated = []
        # Drain tel_safe_queue first
        while self.tel_safe_queue and len(allocated) < num_blocks:
            allocated.append(self.tel_safe_queue.popleft())
        # Fall back to normal LRU queue for remaining demand
        while self.free_queue and len(allocated) < num_blocks:
            allocated.append(self.free_queue.popleft())
        return allocated
```

  • Update the SingleTypeKVCacheManager class in vllm/v1/core/single_type_kv_cache_manager.py to route free calls to free_blocks_tlru when T-LRU is enabled:

```python
class SingleTypeKVCacheManager:
    # existing fields and methods...
    def free(self, blocks):
        if self.use_tlru:
            self.block_pool.free_blocks_tlru(blocks, self.xi, self.qhat)
        else:
            # Existing LRU eviction logic (placeholder)
            pass
```

  • Expose xi and qhat as tunable parameters through CLI flags:

```bash
vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200
```

Verification

To verify the implementation, run the provided unit tests in tests/v1/core/test_tlru_eviction.py and measure the P95 Time-to-First-Token (TTFT) reduction on real conversation workloads.
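
For the latency measurement, a nearest-rank percentile over collected TTFT samples is enough for a first comparison. The sketch below is one way to compute it; the function names are hypothetical and the sample values are made-up placeholders, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_reduction(baseline_ttft, tlru_ttft):
    """Relative P95 TTFT reduction versus the baseline (0.1 means 10%)."""
    base = percentile(baseline_ttft, 95)
    return (base - percentile(tlru_ttft, 95)) / base


# Placeholder TTFT samples in seconds, for illustration only.
lru = [0.2] * 18 + [1.0, 2.0]
tlru = [0.2] * 18 + [0.8, 1.5]
print(percentile(lru, 95), percentile(tlru, 95))  # 1.0 0.8
```

Running both policies against the same request trace and comparing the two P95 values keeps the comparison apples to apples; real runs need far more than 20 samples for a stable tail estimate.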

Extra Tips

  • The default value for qhat (tlru-qhat-tokens) is set to 200 tokens, but this
