vllm - [Feature] Phase-aware KV cache quantization for reasoning models (58% distortion reduction measured)

vllm 2026-04-09 12:57:54

GitHub stats

vllm-project/vllm#39416•Fetched 2026-04-10 03:40:45

Author

myProjectsRavi

I would like to propose adding per-phase KV cache quantization for reasoning models (DeepSeek-R1, QwQ, o-series). The idea is simple: instead of applying a single kv_cache_dtype to all tokens, allow different bit widths for the think-phase and answer-phase tokens based on measured redundancy.

I have published a paper with a closed-form theorem and empirical validation showing that this approach cuts attention KL divergence by 58% compared to uniform 3-bit quantization on DeepSeek-R1-Distill-Qwen-1.5B, with no additional inference-time compute.

🚀 The feature, motivation and pitch

Summary

Motivation

vLLM currently supports uniform KV cache quantization via --kv-cache-dtype (fp8, fp8_e4m3, etc.). This works well for standard LLMs. But reasoning models have a structural asymmetry that uniform quantization ignores:

Think-phase tokens (~75% of generation): internal scratchpad, variable redundancy
Answer-phase tokens (~25% of generation): final response, different redundancy profile

On distilled reasoning models (1.5B-7B), the answer phase is actually more redundant than the think phase - the opposite of what is reported on the full 671B model. Applying the wrong bit allocation nearly doubles the attention distortion.

Measured Data (DeepSeek-R1-Distill-Qwen-1.5B, n=50, GSM8K)

| Metric | Think Phase | Answer Phase |
| --- | --- | --- |
| Pairwise cosine ρ | 0.463 ± 0.040 | 0.544 ± 0.045 |
| Fraction of tokens | 75.5% | 24.5% |

| Quantization Config | b_think | b_answer | KL Divergence | vs Uniform 3-bit |
| --- | --- | --- | --- | --- |
| Uniform 3-bit | 3 | 3 | 0.00303 | baseline |
| Theory-aligned (4/3) | 4 | 3 | 0.00126 | -58% |
| Anti-aligned (3/4) | 3 | 4 | 0.00234 | -23% |

The theory-aligned allocation follows a simple rule: give fewer bits to the more redundant phase. Which phase that is depends on the model and must be measured, not assumed.
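As a quick check of the arithmetic in the tables above, the following sketch recomputes the "vs Uniform 3-bit" column from the reported KL values. It also derives the token-weighted average bit width each configuration implies from the measured phase fractions, which may be useful context when comparing rows. The function and dictionary names are illustrative and do not come from the paper or from vLLM.

```python
# Values copied from the tables above (DeepSeek-R1-Distill-Qwen-1.5B, n=50, GSM8K).
THINK_FRACTION = 0.755
ANSWER_FRACTION = 0.245

# name -> (b_think, b_answer, reported attention KL divergence)
CONFIGS = {
    "uniform_3bit": (3, 3, 0.00303),
    "theory_aligned": (4, 3, 0.00126),
    "anti_aligned": (3, 4, 0.00234),
}


def relative_kl_change(name: str, baseline: str = "uniform_3bit") -> float:
    """Percent change in KL divergence relative to the baseline config."""
    kl = CONFIGS[name][2]
    base = CONFIGS[baseline][2]
    return 100.0 * (kl - base) / base


def mean_bits_per_token(name: str) -> float:
    """Token-weighted average bit width implied by the phase fractions."""
    b_think, b_answer, _ = CONFIGS[name]
    return THINK_FRACTION * b_think + ANSWER_FRACTION * b_answer


if __name__ == "__main__":
    for name in CONFIGS:
        print(f"{name}: {relative_kl_change(name):+.0f}% KL, "
              f"{mean_bits_per_token(name):.3f} bits/token on average")
```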

Proposed Implementation

The minimal implementation would add two optional parameters:

# In EngineArgs or SamplingParams
kv_cache_dtype_think: str = "auto"   # e.g., "fp8" or "int4"
kv_cache_dtype_answer: str = "auto"  # e.g., "fp8" or "int3"
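To make the intended behaviour of the two parameters concrete, here is a minimal standalone sketch of how they might resolve against the existing uniform setting. `PhaseKVCacheConfig` and `dtype_for` are hypothetical names, not part of vLLM, and the rule that "auto" falls back to the uniform `kv_cache_dtype` is an assumption about how the proposal could behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseKVCacheConfig:
    """Hypothetical container for the proposed options; not a vLLM class."""

    kv_cache_dtype: str = "auto"         # existing uniform setting
    kv_cache_dtype_think: str = "auto"   # proposed per-phase override
    kv_cache_dtype_answer: str = "auto"  # proposed per-phase override

    def dtype_for(self, phase: str) -> str:
        """Return the dtype for 'think' or 'answer'.

        Assumption: a per-phase value of "auto" inherits the uniform dtype.
        """
        if phase == "think":
            override = self.kv_cache_dtype_think
        elif phase == "answer":
            override = self.kv_cache_dtype_answer
        else:
            raise ValueError(f"unknown phase: {phase!r}")
        return self.kv_cache_dtype if override == "auto" else override
```

With this fallback, a configuration that sets neither override behaves exactly like the current single-dtype setup, so the new parameters would stay opt-in.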

The phase boundary can be detected via:

Template markers (preferred): DeepSeek-R1 emits </think> tags. vLLM already parses stop tokens - this is the same mechanism.
Token count heuristic: For models without markers, use a configurable fraction (default 75% think).
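The two detection strategies can be sketched as a single helper. This is an illustration under stated assumptions, not vLLM code: `find_phase_boundary` is a hypothetical name, the `</think>` marker is assumed to map to a single token id supplied by the caller, and the fraction fallback is applied to an already complete token sequence.

```python
from typing import Optional, Sequence


def find_phase_boundary(
    token_ids: Sequence[int],
    end_think_id: Optional[int] = None,
    think_fraction: float = 0.75,
) -> int:
    """Return the index of the first answer-phase token.

    If end_think_id is given and occurs in token_ids, the answer phase starts
    right after its first occurrence. Otherwise the boundary falls back to a
    fixed fraction of the sequence length.
    """
    if not 0.0 <= think_fraction <= 1.0:
        raise ValueError("think_fraction must be in [0, 1]")
    if end_think_id is not None:
        for i, tok in enumerate(token_ids):
            if tok == end_think_id:
                return i + 1
    # Heuristic fallback for models (or outputs) without a marker.
    return int(len(token_ids) * think_fraction)
```

For example, `find_phase_boundary([5, 6, 99, 7, 8], end_think_id=99)` places the boundary at index 3, so tokens 0 to 2 count as think-phase.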
The quantization itself doesn't change - it's still the same per-channel scalar quantization vLLM already uses. The only change is which precision gets applied to which token range in the cache.
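The idea of applying one quantizer at two precisions can be simulated in a few lines of plain Python. The sketch below is not vLLM's kernel: it assumes symmetric per-channel scalar quantization, computes scales separately for each token range, and returns dequantized values so the error is easy to inspect. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows = tokens, columns = channels


def quantize_per_channel(rows: Matrix, bits: int) -> Matrix:
    """Simulate symmetric per-channel quantization (quantize, then dequantize)."""
    if bits < 2:
        raise ValueError("bits must be at least 2 for a symmetric quantizer")
    if not rows:
        return []
    levels = 2 ** (bits - 1) - 1  # 3 for 3-bit, 7 for 4-bit
    n_channels = len(rows[0])
    # One scale per channel, taken from that channel's largest magnitude.
    scales = []
    for c in range(n_channels):
        max_abs = max(abs(row[c]) for row in rows)
        scales.append(max_abs / levels if max_abs > 0 else 1.0)
    return [
        [round(row[c] / scales[c]) * scales[c] for c in range(n_channels)]
        for row in rows
    ]


def quantize_by_phase(
    rows: Matrix, boundary: int, bits_think: int, bits_answer: int
) -> Matrix:
    """Quantize tokens before `boundary` at bits_think, the rest at bits_answer."""
    think = quantize_per_channel(rows[:boundary], bits_think)
    answer = quantize_per_channel(rows[boundary:], bits_answer)
    return think + answer
```

Setting `boundary` to 0 or to `len(rows)` reduces this to uniform quantization, which matches the point that only the token-range-to-precision mapping changes.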

Where in the codebase

Based on my reading of the vLLM source:

vllm/attention/backends/ - The attention backends already handle quantized KV caches. The change would be to support a per-token-range dtype rather than a single dtype for the entire cache.
vllm/worker/cache_engine.py - Cache allocation could accept a phase boundary and allocate the two regions at different precisions.

I am happy to contribute a PR if there's interest. The core logic is about 50 lines of code; the complexity lies in fitting it into vLLM's cache management layer.

References

Paper: Think Less, Store Smarter: Type-Aware KV Cache Quantization (open access, all code included)
Diagnostic tool: github.com/myProjectsRavi/taqg-kv-cache-optimization - measures per-phase ρ on any HuggingFace model in ~1 hour on a T4
TAQG Theorem: b_H ≤ b_L - ⌊log₂((1-ρ_L)/(1-ρ_H) + 1)⌋ - closed-form, direction-agnostic
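The bound can be evaluated directly against the measured values. The sketch below is a literal transcription of the inequality as written above, under the assumption that the subscripts H and L denote the higher-redundancy and lower-redundancy phase respectively; `taqg_max_bits_high` is a hypothetical name and is not taken from the linked repository.

```python
import math


def taqg_max_bits_high(b_low: int, rho_low: float, rho_high: float) -> int:
    """Upper bound on b_H given b_L, per the TAQG inequality as written above.

    Assumption: rho_low belongs to the less redundant phase and rho_high to
    the more redundant one, so rho_low <= rho_high < 1.
    """
    if not (0.0 <= rho_low <= rho_high < 1.0):
        raise ValueError("expected 0 <= rho_low <= rho_high < 1")
    ratio = (1.0 - rho_low) / (1.0 - rho_high)
    return b_low - math.floor(math.log2(ratio + 1.0))


if __name__ == "__main__":
    # Measured values from the issue: think rho = 0.463, answer rho = 0.544.
    print(taqg_max_bits_high(b_low=4, rho_low=0.463, rho_high=0.544))
```

With the think phase at 4 bits, this gives b_H ≤ 3 for the answer phase, which agrees with the theory-aligned (4/3) row in the measured data.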

Alternatives

No response

Additional context

This approach is complementary to existing optimizations (PagedAttention, chunked prefill, speculative decoding). It is a pure precision-allocation improvement - no architectural changes, no retraining, no additional memory overhead. The think/answer split is already semantically meaningful for reasoning models; we are just using it to inform the quantization policy.

I am particularly interested in hearing whether the vLLM team has observed quality differences when quantizing reasoning model KV caches uniformly, since that would directly validate the motivation here.

extent analysis

TL;DR

Implement per-phase KV cache quantization by adding optional parameters kv_cache_dtype_think and kv_cache_dtype_answer to EngineArgs or SamplingParams.

Guidance

Review the proposed implementation and consider adding the two optional parameters to support per-phase KV cache quantization.
Investigate the feasibility of detecting the phase boundary via template markers or token count heuristic.
Examine the attention backends in vllm/attention/backends/ and cache allocation in vllm/worker/cache_engine.py to determine the necessary changes.
Evaluate the potential benefits of per-phase KV cache quantization, in particular the reduction in attention KL divergence reported in the issue.

Example

# In EngineArgs or SamplingParams
kv_cache_dtype_think: str = "fp8"   # e.g., "fp8" or "int4"
kv_cache_dtype_answer: str = "int3"  # e.g., "fp8" or "int3"

Notes

The proposed implementation is complementary to existing optimizations and does not require architectural changes, retraining, or additional memory overhead. However, the effectiveness of per-phase KV cache quantization may depend on the specific model and use case.

Recommendation

Consider implementing the proposed per-phase KV cache quantization. The issue author reports a 58% reduction in attention KL divergence compared to uniform 3-bit quantization on DeepSeek-R1-Distill-Qwen-1.5B (n=50, GSM8K).
