vllm - ✅(Solved) Fix [RFC] Tail-Optimized LRU (T-LRU): Reducing Tail Latency via Conversation-Aware KV Cache Eviction [1 pull requests, 1 participants]

vllm 2026-03-22 20:29:23

GitHub stats

vllm-project/vllm#37823•Fetched 2026-04-08 01:17:54

Author

wenxinzhang0

Participants

wenxinzhang0

Timeline (top)

cross-referenced ×1, labeled ×1

We propose Tail-Optimized LRU (T-LRU), a lightweight modification to vLLM's existing LRU prefix-cache eviction policy that reduces P95 tail Time-to-First-Token (TTFT) by up to 27.4% on real conversation workloads, with no overhead during normal (cache-hit) operation and no change to API surface.

The idea and full analysis appear in our paper:

Tail-Optimized Caching for LLM Inference Wenxin Zhang, Ciamac C. Moallemi, Tianyi Peng Columbia Business School NeurIPS 2025. arXiv: https://arxiv.org/abs/2510.15152

Fix Action

Proposed fix (PR open, not yet merged): [Core] Add Tail-Optimized LRU (T-LRU) KV cache eviction policy (https://github.com/vllm-project/vllm/pull/37825)

PR fix notes

PR #37825: [Core] Add Tail-Optimized LRU (T-LRU) KV cache eviction policy

Repository: vllm-project/vllm
Author: wenxinzhang0
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37825

Description (problem / solution / changelog)

This PR implements the T-LRU policy proposed in #37823 and our NeurIPS 2025 paper (https://arxiv.org/abs/2510.15152).

Purpose

Implements the Tail-Optimized LRU (T-LRU) KV cache eviction policy proposed in issue #37823 and our NeurIPS 2025 paper (arXiv:2510.15152).

T-LRU is a two-queue extension of vLLM's LRU prefix cache eviction policy that reduces tail (P95/P99) TTFT by preferentially evicting KV cache blocks that are provably safe to evict without violating a user-specified latency SLA. It is fully backward-compatible: when --tlru-xi-tokens is not set, behavior is identical to standard LRU.

For a request with conversation history H blocks and estimated next-query length Q_hat blocks, the TEL-safe cap is B = max(0, H + Q_hat - xi). Blocks at positions B..H-1 (the suffix/tail) can be evicted without pushing the next turn's recomputation cost above xi tokens. T-LRU routes these blocks to a dedicated tel_safe_queue and drains it before the normal LRU queue.
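The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the PR's code: the function name `tel_safe_split` is hypothetical, all three quantities are assumed to be expressed in blocks already, and clamping the cap to H is an assumption made here so the TEL-safe range never has negative length.

```python
def tel_safe_split(history_blocks: int, qhat_blocks: int, xi_blocks: int) -> tuple[int, int]:
    """Return (cap, num_tel_safe) for one request; all arguments are in blocks."""
    # TEL-safe cap B = max(0, H + Q_hat - xi), clamped to H so that the
    # TEL-safe range B..H-1 is never negative (an assumption of this sketch).
    cap = min(history_blocks, max(0, history_blocks + qhat_blocks - xi_blocks))
    return cap, history_blocks - cap


# Long conversation: only the head must stay cached.
print(tel_safe_split(history_blocks=625, qhat_blocks=13, xi_blocks=256))  # (382, 243)
# Short conversation: H + Q_hat <= xi, so every block is TEL-safe.
print(tel_safe_split(history_blocks=7, qhat_blocks=13, xi_blocks=256))    # (0, 7)
```

The example values are illustrative only; they are not taken from the paper or the PR.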

Changes

vllm/v1/core/kv_cache_utils.py: add is_tel_safe: bool to KVCacheBlock
vllm/v1/core/block_pool.py: add tel_safe_queue, free_blocks_tlru(), modify get_new_blocks() to drain tel_safe_queue first
vllm/v1/core/single_type_kv_cache_manager.py: route freed blocks to free_blocks_tlru() when T-LRU is enabled
vllm/v1/core/kv_cache_coordinator.py: forward tlru_xi_blocks, tlru_qhat_blocks to BlockPool
vllm/v1/core/kv_cache_manager.py: convert token params to block params
vllm/v1/core/sched/scheduler.py: read T-LRU params from CacheConfig
vllm/config/cache.py: add tlru_xi_tokens, tlru_qhat_tokens to CacheConfig (excluded from compute_hash)
vllm/engine/arg_utils.py: expose --tlru-xi-tokens, --tlru-qhat-tokens CLI flags
tests/v1/core/test_tlru_eviction.py: 15 unit tests covering routing logic, queue priority, config wiring
docs/design/tlru_caching.md: user-facing documentation
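The config wiring in this list (token-valued options that are converted to block counts before reaching the block pool) might look roughly like the sketch below. The class `TLRUParams` and the helpers `to_blocks` and `block_params` are hypothetical names, representing "flag not set" as `None` is an assumption, and so is rounding up; the PR description does not say how partial blocks are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TLRUParams:
    """Hypothetical container mirroring the two new CacheConfig fields."""
    tlru_xi_tokens: Optional[int] = None   # unset -> plain LRU (assumed encoding)
    tlru_qhat_tokens: int = 200            # default named in the RFC


def to_blocks(num_tokens: int, block_size: int) -> int:
    """Convert a token count to blocks, rounding up (assumed rounding mode)."""
    return -(-num_tokens // block_size)


def block_params(params: TLRUParams, block_size: int) -> Optional[tuple[int, int]]:
    """Return (xi_blocks, qhat_blocks), or None when T-LRU is disabled."""
    if params.tlru_xi_tokens is None:
        return None
    return (to_blocks(params.tlru_xi_tokens, block_size),
            to_blocks(params.tlru_qhat_tokens, block_size))


print(block_params(TLRUParams(tlru_xi_tokens=4096), block_size=16))  # (256, 13)
print(block_params(TLRUParams(), block_size=16))                     # None
```

The block size of 16 tokens is used here purely as an example value.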

Test Plan

# CPU-only unit tests (no GPU required):
pytest tests/v1/core/test_tlru_eviction.py -v

# Full KV cache regression tests (requires GPU):
pytest tests/v1/core/test_kv_cache_utils.py -m cpu_test

GPU test results pending — will update

## Changed files

- `docs/design/tlru_caching.md` (added, +92/-0)
- `tests/v1/core/test_tlru_eviction.py` (added, +321/-0)
- `vllm/config/cache.py` (modified, +17/-0)
- `vllm/engine/arg_utils.py` (modified, +2/-0)
- `vllm/v1/core/block_pool.py` (modified, +93/-6)
- `vllm/v1/core/kv_cache_coordinator.py` (modified, +29/-12)
- `vllm/v1/core/kv_cache_manager.py` (modified, +14/-0)
- `vllm/v1/core/kv_cache_utils.py` (modified, +6/-0)
- `vllm/v1/core/sched/scheduler.py` (modified, +2/-0)
- `vllm/v1/core/single_type_kv_cache_manager.py` (modified, +9/-2)

Code Example

vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200

---

# Branch: feature/tlru-eviction-policy on github.com/wenxinzhang0/vllm
vllm serve meta-llama/Llama-3-8B \
  --enable-prefix-caching \
  --tlru-xi-tokens 4096 \
  --tlru-qhat-tokens 200

Motivation.

Motivation

vLLM's current eviction policy is LRU. LRU maximizes cache-hit rate but is conversation-length blind: it treats a block from a 5-turn, 10 000-token conversation identically to a block from a 1-turn, 100-token conversation. This creates an avoidable source of tail latency.

Key Insight

For a conversation with history H blocks and estimated next-query length Q_hat blocks, define the cap B = max(0, H + Q_hat - xi), with the latency threshold xi expressed in blocks. As long as the first B blocks stay cached, the next turn has to recompute at most xi blocks, so its TTFT stays within the threshold. Any block beyond this cap (positions B..H-1) is TEL-safe (Tail Excess Latency safe): it can be evicted without causing an SLO violation for that conversation.
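One way to see where the cap comes from, assuming (as the PR description's wording suggests) that the next turn's recomputation cost is the uncached part of the history plus the new query. The symbol c, the number of prefix blocks still cached, is introduced here only for illustration:

```latex
\begin{aligned}
\mathrm{recompute}(c) &= (H - c) + \hat{Q}, \qquad 0 \le c \le H, \\
\mathrm{recompute}(c) \le \xi
  &\iff c \ge H + \hat{Q} - \xi, \\
B &= \max\bigl(0,\; H + \hat{Q} - \xi\bigr).
\end{aligned}
```

Under that cost model, B is the smallest cached prefix that keeps the recomputation within xi whenever H + Q_hat - xi does not exceed H.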

Proposed Change.

Algorithm

T-LRU is a two-queue extension of LRU. When a request completes and its blocks are freed:

1. Compute B = max(0, H + Q_hat - xi) for the request.
2. The last H - B blocks (suffix/tail) are TEL-safe -> append to tel_safe_queue.
3. The first B blocks (prefix/head) are TEL-unsafe -> append to the existing LRU free queue as before.

When new blocks are needed:

1. Drain tel_safe_queue first.
2. Fall back to the normal LRU queue for any remaining demand.

xi and Q_hat are tunable parameters exposed as CLI flags:

vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200
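The two-queue behavior can be tried out with a toy model. The sketch below is not vLLM code: `ToyTLRUPool` is a hypothetical class, integers stand in for KV cache blocks, and reference counts, block hashes, and the real free-queue data structure are all ignored. It only shows the routing on free and the drain order on allocation.

```python
from collections import deque


class ToyTLRUPool:
    """Toy two-queue free pool; integers stand in for KV cache blocks."""

    def __init__(self, xi_blocks, qhat_blocks):
        self.xi = xi_blocks
        self.qhat = qhat_blocks
        self.tel_safe_queue = deque()
        self.lru_queue = deque()

    def free(self, blocks):
        """Route a finished request's blocks (given in prefix order)."""
        h = len(blocks)
        # Cap clamped to h so the tail slice is well defined (sketch assumption).
        cap = min(h, max(0, h + self.qhat - self.xi))
        self.lru_queue.extend(blocks[:cap])        # head: TEL-unsafe
        self.tel_safe_queue.extend(blocks[cap:])   # tail: TEL-safe

    def get_new_blocks(self, n):
        """Take n blocks, draining the TEL-safe queue before the LRU queue."""
        if n > len(self.tel_safe_queue) + len(self.lru_queue):
            raise ValueError("not enough free blocks")
        out = []
        while len(out) < n:
            # An empty deque is falsy, so this falls back to the LRU queue.
            queue = self.tel_safe_queue or self.lru_queue
            out.append(queue.popleft())
        return out


pool = ToyTLRUPool(xi_blocks=4, qhat_blocks=1)
pool.free([0, 1, 2, 3, 4, 5])   # H=6 -> B=3: head [0,1,2], tail [3,4,5]
pool.free([10, 11])             # H=2 -> B=0: both blocks are TEL-safe
print(pool.get_new_blocks(6))   # [3, 4, 5, 10, 11, 0]
```

In this toy run the head blocks of the long conversation are reclaimed last, even though that conversation was freed first, which is the point where T-LRU departs from plain LRU.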

Implementation

All changes are confined to the v1 KV cache stack. No changes to scheduling logic, attention kernels, or the serving API; the scheduler only reads the T-LRU parameters from CacheConfig.

- vllm/v1/core/kv_cache_utils.py: Added is_tel_safe: bool = False field to KVCacheBlock
- vllm/v1/core/block_pool.py: Added tel_safe_queue, free_blocks_tlru(), modified get_new_blocks()
- vllm/v1/core/single_type_kv_cache_manager.py: free() routes to free_blocks_tlru() when T-LRU is enabled
- vllm/v1/core/kv_cache_coordinator.py: Forwards params to BlockPool
- tests/v1/core/test_tlru_eviction.py: 15 unit tests

Experimental Results (WildChat dataset)

T-LRU reduces P95 TTFT by up to 27.4% vs LRU and closes 25-79% of the gap to the clairvoyant offline optimum. Results on ShareGPT are consistent.

Feedback Period.

At least one week.

CC List.

No response

Any Other Things.

Design Decisions & Open Questions

1. Default for Q_hat (tlru-qhat-tokens): We default to 200 tokens.
2. tel_safe_queue vs FreeKVCacheBlockQueue pointer: The current separate-deque approach is simpler.
3. Interaction with prefix caching: TEL-safe blocks are evicted before LRU pool blocks, which is intentional.

How to Try It

# Branch: feature/tlru-eviction-policy on github.com/wenxinzhang0/vllm
vllm serve meta-llama/Llama-3-8B \
  --enable-prefix-caching \
  --tlru-xi-tokens 4096 \
  --tlru-qhat-tokens 200

References

- Zhang, Moallemi, Peng. Tail-Optimized Caching for LLM Inference. NeurIPS 2025. https://arxiv.org/abs/2510.15152
- Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
- vLLM Prefix Caching documentation: https://docs.vllm.ai/en/stable/design/prefix_caching/

extent analysis

Fix Plan

To implement the Tail-Optimized LRU (T-LRU) eviction policy, follow these steps:

*   Update the `KVCacheBlock` class in `vllm/v1/core/kv_cache_utils.py` to include an `is_tel_safe` field:
    ```python
    class KVCacheBlock:
        # existing fields...
        is_tel_safe: bool = False
    ```

*   Modify the `BlockPool` class in `vllm/v1/core/block_pool.py` to include a `tel_safe_queue` and implement the `free_blocks_tlru` method:
    ```python
    from collections import deque


    class BlockPool:
        # existing fields and methods...

        def __init__(self):
            # existing initialization (including self.free_queue)...
            # Per-instance queue; a class-level deque would be shared by all pools.
            self.tel_safe_queue = deque()

        def free_blocks_tlru(self, blocks, xi, qhat):
            # Sketch: assumes `blocks` holds the request's blocks in prefix order
            # and that xi and qhat are already expressed in blocks.
            h = len(blocks)
            b = max(0, h + qhat - xi)
            for position, block in enumerate(blocks):
                if position < b:
                    # TEL-unsafe block, append to LRU free queue
                    self.free_queue.append(block)
                else:
                    # TEL-safe block, append to tel_safe_queue
                    block.is_tel_safe = True
                    self.tel_safe_queue.append(block)

        def get_new_blocks(self, num_blocks):
            allocated = []
            # Drain tel_safe_queue first
            while self.tel_safe_queue and len(allocated) < num_blocks:
                block = self.tel_safe_queue.popleft()
                block.is_tel_safe = False
                allocated.append(block)
            # Fall back to normal LRU queue for remaining demand
            while self.free_queue and len(allocated) < num_blocks:
                allocated.append(self.free_queue.popleft())
            return allocated
    ```

*   Update the `SingleTypeKVCacheManager` class in `vllm/v1/core/single_type_kv_cache_manager.py` to route free calls to `free_blocks_tlru` when T-LRU is enabled:
    ```python
    class SingleTypeKVCacheManager:
        # existing fields and methods...

        def free(self, blocks):
            if self.use_tlru:
                self.block_pool.free_blocks_tlru(blocks, self.xi, self.qhat)
            else:
                # Existing LRU eviction logic
                pass
    ```

*   Expose `xi` and `qhat` as tunable parameters through CLI flags:
    ```bash
    vllm serve <model> --enable-prefix-caching --tlru-xi-tokens 4096 --tlru-qhat-tokens 200
    ```

Verification

To verify the implementation, run the provided unit tests in tests/v1/core/test_tlru_eviction.py and measure the P95 Time-to-First-Token (TTFT) reduction on real conversation workloads.
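For the measurement half of that check, a small helper like the following could be used to compare P95 TTFT between an LRU run and a T-LRU run. It is a sketch under stated assumptions: the function names are hypothetical, the nearest-rank definition of a percentile is one of several in common use, and the sample values are synthetic, not measured results.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def relative_reduction(baseline, candidate):
    """Fractional improvement of candidate over baseline (0.25 means 25% lower)."""
    return (baseline - candidate) / baseline


# Synthetic TTFT samples in seconds, for illustration only (not measured data).
lru_ttft = [0.2] * 18 + [1.0, 2.0]
tlru_ttft = [0.2] * 18 + [0.8, 2.0]

p95_lru = percentile(lru_ttft, 95)
p95_tlru = percentile(tlru_ttft, 95)
print(f"P95 LRU={p95_lru}s  P95 T-LRU={p95_tlru}s  "
      f"reduction={relative_reduction(p95_lru, p95_tlru):.1%}")
```

Both runs should replay the same request trace with the same cache size so that the eviction policy is the only difference between them.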

Extra Tips

The default value for qhat (tlru-qhat-tokens) is set to 200 tokens, but this
