vllm - [Feature] Phase-aware KV cache quantization for reasoning models (58% distortion reduction measured)

vllm-project/vllm#39416 · Fetched 2026-04-10 03:40:45
🚀 The feature, motivation and pitch

Summary

I would like to propose adding per-phase KV cache quantization for reasoning models (DeepSeek-R1, QwQ, o-series). The idea is simple: instead of applying a single kv_cache_dtype to all tokens, allow different bit widths for the think-phase and answer-phase tokens based on measured redundancy.

I have published a paper with a closed-form theorem and empirical validation showing this approach cuts attention KL divergence by 58% compared to uniform 3-bit quantization on DeepSeek-R1-Distill-1.5B, with no additional inference-time compute.

Motivation

vLLM currently supports uniform KV cache quantization via --kv-cache-dtype (fp8, fp8_e4m3, etc.). This works well for standard LLMs. But reasoning models have a structural asymmetry that uniform quantization ignores:

  • Think-phase tokens (~75% of generation): internal scratchpad, variable redundancy
  • Answer-phase tokens (~25% of generation): final response, different redundancy profile

On distilled reasoning models (1.5B-7B), the answer phase is actually more redundant than the think phase - the opposite of what is reported on the full 671B model. Allocating bits in the wrong direction nearly doubles the attention distortion relative to the theory-aligned allocation (KL divergence 0.00234 vs 0.00126 in the measurements below).
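One way to quantify the redundancy discussed here is the mean pairwise cosine similarity between the cached vectors of a phase, which is the ρ reported in the measured data. The issue does not state which vectors are compared (keys, values, per head or per layer), so the sketch below assumes plain lists of floats. The function names and the toy data are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(vectors: Sequence[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy data standing in for per-token cache vectors (hypothetical values).
think_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answer_vectors = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1]]

rho_think = mean_pairwise_cosine(think_vectors)
rho_answer = mean_pairwise_cosine(answer_vectors)
print(f"rho_think={rho_think:.3f} rho_answer={rho_answer:.3f}")
```

A higher ρ means the vectors in that phase point in more similar directions, which is the sense in which a phase is called "more redundant".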

Measured Data (DeepSeek-R1-Distill-Qwen-1.5B, n=50, GSM8K)

| Metric | Think Phase | Answer Phase |
|---|---|---|
| Pairwise cosine ρ | 0.463 ± 0.040 | 0.544 ± 0.045 |
| Fraction of tokens | 75.5% | 24.5% |

| Quantization Config | b_think | b_answer | KL Divergence | vs Uniform 3-bit |
|---|---|---|---|---|
| Uniform 3-bit | 3 | 3 | 0.00303 | baseline |
| Theory-aligned (4/3) | 4 | 3 | 0.00126 | -58% |
| Anti-aligned (3/4) | 3 | 4 | 0.00234 | -23% |

The theory-aligned allocation follows a simple rule: give fewer bits to the more redundant phase. Which phase that is depends on the model and must be measured, not assumed.
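The percentages in the measured data follow directly from the KL values, and the short check below reproduces them. It also computes the token-weighted average bit width of each configuration from the reported token fractions. That average is a derived quantity, not a figure stated in the issue, and all names in the snippet are illustrative.

```python
# Values copied from the measured data above.
KL = {"uniform_3bit": 0.00303, "theory_aligned": 0.00126, "anti_aligned": 0.00234}
BITS = {"uniform_3bit": (3, 3), "theory_aligned": (4, 3), "anti_aligned": (3, 4)}
THINK_FRACTION = 0.755
ANSWER_FRACTION = 0.245


def relative_change_pct(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def average_bits(b_think: int, b_answer: int) -> float:
    """Token-weighted average bit width across the two phases."""
    return THINK_FRACTION * b_think + ANSWER_FRACTION * b_answer


for name, kl in KL.items():
    change = relative_change_pct(kl, KL["uniform_3bit"])
    avg = average_bits(*BITS[name])
    print(f"{name}: KL change {change:+.1f}%, average bits {avg:.3f}")
```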

Proposed Implementation

The minimal implementation would add two optional parameters:

# In EngineArgs or SamplingParams
kv_cache_dtype_think: str = "auto"   # e.g., "fp8" or "int4"
kv_cache_dtype_answer: str = "auto"  # e.g., "fp8" or "int3"

The phase boundary can be detected via:

  1. Template markers (preferred): DeepSeek-R1 emits </think> tags. vLLM already parses stop tokens - this is the same mechanism.
  2. Token count heuristic: For models without markers, use a configurable fraction (default 75% think).

The quantization itself doesn't change - it's still the same per-channel scalar quantization vLLM already uses. The only change is which precision gets applied to which token range in the cache.
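The two boundary-detection options above can be sketched as a single helper. This is a minimal illustration on a list of token IDs, not vLLM code: `find_phase_boundary`, `think_end_id`, and `think_fraction` are hypothetical names, and the marker ID used in the example is made up.

```python
from typing import Optional, Sequence


def find_phase_boundary(
    token_ids: Sequence[int],
    think_end_id: Optional[int],
    think_fraction: float = 0.75,
) -> int:
    """Return the index of the first answer-phase token.

    If the end-of-think marker is present, everything up to and including
    the marker counts as think phase. Otherwise fall back to a fixed
    fraction of the sequence length.
    """
    if think_end_id is not None:
        for i, tok in enumerate(token_ids):
            if tok == think_end_id:
                return i + 1
    return int(len(token_ids) * think_fraction)


# Hypothetical token IDs; 99 stands in for the ID of the </think> marker.
tokens = [5, 6, 99, 7, 8]
boundary = find_phase_boundary(tokens, think_end_id=99)
print("think:", tokens[:boundary], "answer:", tokens[boundary:])
```

The fallback here is applied to a finished sequence. During streaming generation the final length is not known in advance, so a real integration would need some length estimate for the fraction-based rule.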

Where in the codebase

Based on my reading of the vLLM source:

  • vllm/attention/backends/ - The attention backends already handle quantized KV caches. The change would be to support a per-token-range dtype rather than a single dtype for the entire cache.
  • vllm/worker/cache_engine.py - Cache allocation could accept a phase boundary and allocate the two regions at different precisions.
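To make "a per-token-range dtype" concrete, the following pure-Python sketch simulates symmetric per-channel quantization (quantize, then dequantize) with one bit width for the think range and another for the answer range. It only illustrates the idea on nested lists. It does not use or mirror vLLM's cache layout or kernels, and the function names are hypothetical.

```python
from typing import List


def fake_quantize_per_channel(rows: List[List[float]], bits: int) -> List[List[float]]:
    """Symmetric uniform quantize-dequantize with one scale per channel (column)."""
    if not rows:
        return []
    qmax = 2 ** (bits - 1) - 1  # 3 for 3-bit, 7 for 4-bit
    scales = []
    for c in range(len(rows[0])):
        max_abs = max(abs(row[c]) for row in rows)
        scales.append(max_abs / qmax if max_abs > 0 else 1.0)
    return [
        [max(-qmax, min(qmax, round(x / s))) * s for x, s in zip(row, scales)]
        for row in rows
    ]


def phase_aware_fake_quantize(
    rows: List[List[float]], boundary: int, bits_think: int, bits_answer: int
) -> List[List[float]]:
    """Quantize rows[:boundary] at bits_think and rows[boundary:] at bits_answer."""
    think = fake_quantize_per_channel(rows[:boundary], bits_think)
    answer = fake_quantize_per_channel(rows[boundary:], bits_answer)
    return think + answer


# Four tokens with two channels each; the first two tokens are the think phase.
cache = [[1.0, -2.0], [0.5, 0.25], [-1.0, 2.0], [0.3, -0.7]]
for row in phase_aware_fake_quantize(cache, boundary=2, bits_think=4, bits_answer=3):
    print(row)
```

Scales are computed separately for each range in this sketch, so the two regions are independent. Whether a real implementation would share scales across the boundary is a design decision the issue does not address.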

I am happy to contribute a PR if there's interest. The core logic is about 50 lines of code; the complexity is in fitting it into vLLM's cache management layer.

References

  1. Paper: Think Less, Store Smarter: Type-Aware KV Cache Quantization (open access, all code included)
  2. Diagnostic tool: github.com/myProjectsRavi/taqg-kv-cache-optimization - measures per-phase ρ on any HuggingFace model in ~1 hour on a T4
  3. TAQG Theorem: b_H ≤ b_L - ⌊log₂((1-ρ_L)/(1-ρ_H) + 1)⌋ - closed-form, direction-agnostic
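As a worked instance of the TAQG bound, the sketch below evaluates it for the measured ρ values. It assumes that the subscript H denotes the higher-redundancy phase (here the answer phase, ρ = 0.544) and L the lower-redundancy phase (the think phase, ρ = 0.463). That reading is inferred from the rule "give fewer bits to the more redundant phase" and is not spelled out in the issue. The function name is hypothetical.

```python
import math


def taqg_max_bits_high(b_low: int, rho_low: float, rho_high: float) -> int:
    """Upper bound on b_H from b_H <= b_L - floor(log2((1 - rho_L) / (1 - rho_H) + 1))."""
    if not (0.0 <= rho_low < 1.0 and 0.0 <= rho_high < 1.0):
        raise ValueError("rho values must lie in [0, 1)")
    ratio = (1.0 - rho_low) / (1.0 - rho_high)
    return b_low - math.floor(math.log2(ratio + 1.0))


# Measured values from the issue: think rho = 0.463 (L), answer rho = 0.544 (H).
print(taqg_max_bits_high(b_low=4, rho_low=0.463, rho_high=0.544))
```

With these inputs the ratio is about 1.18, the logarithm is about 1.12, and the floor is 1, so the bound evaluates to 3 bits for the answer phase when the think phase uses 4. This matches the theory-aligned (4/3) configuration.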

Alternatives

No response

Additional context

This approach is complementary to existing optimizations (PagedAttention, chunked prefill, speculative decoding). It is a pure precision-allocation improvement - no architectural changes, no retraining, no additional memory overhead. The think/answer split is already semantically meaningful for reasoning models; we are just using it to inform the quantization policy.

I am particularly interested in hearing whether the vLLM team has observed quality differences when quantizing reasoning model KV caches uniformly, since that would directly validate the motivation here.


extent analysis

TL;DR

Implement per-phase KV cache quantization by adding optional parameters kv_cache_dtype_think and kv_cache_dtype_answer to EngineArgs or SamplingParams.

Guidance

  • Review the proposed implementation and consider adding the two optional parameters to support per-phase KV cache quantization.
  • Investigate the feasibility of detecting the phase boundary via template markers or token count heuristic.
  • Examine the attention backends in vllm/attention/backends/ and cache allocation in vllm/worker/cache_engine.py to determine the necessary changes.
  • Evaluate the potential benefits of per-phase KV cache quantization, in particular the reported reduction in attention KL divergence, and check whether it carries over to downstream task accuracy.

Example

# In EngineArgs or SamplingParams
kv_cache_dtype_think: str = "fp8"   # e.g., "fp8" or "int4"
kv_cache_dtype_answer: str = "int3"  # e.g., "fp8" or "int3"

Notes

The proposed implementation is complementary to existing optimizations and does not require architectural changes, retraining, or additional memory overhead. However, the effectiveness of per-phase KV cache quantization may depend on the specific model and use case.

Recommendation

Consider adopting the proposed per-phase KV cache quantization. The issue author reports that it reduces attention KL divergence by 58% compared to uniform 3-bit quantization on DeepSeek-R1-Distill-1.5B (n=50, GSM8K); which phase should receive fewer bits depends on the model and must be measured.
