vllm - ✅(Solved) Fix [Feature]: Add TurboQuant Support for KV Cache Quantization [1 pull requests, 32 comments, 19 participants]

GitHub stats
vllm-project/vllm#38171 (fetched 2026-04-08 01:31:57)
Comments: 32 · Participants: 19 · Timeline: 288 · Reactions: 202

Fix Action

Fixed

PR fix notes

PR #2214: [POC] Add Turbo Quant

Description (problem / solution / changelog)

Purpose

The KV cache is the primary memory bottleneck for long-context multimodal inference in vllm-omni. This PR implements TurboQuant (ICLR 2026), an online vector quantization algorithm that compresses the KV cache to 2–4 bits per dimension; the needle-in-a-haystack results below show no loss in exact-match accuracy at any tested bit-width.

RFC: #2215 | Upstream vLLM: vllm-project/vllm#38171

Results

Needle-in-a-Haystack — Qwen2.5-7B-Instruct (bf16), NVIDIA H200, context 4K–16K

| config | exact match | avg cache GB | vs full |
|---|---|---|---|
| full | 6/6 | 0.510 | 1.0x |
| TurboQuant 2-bit | 6/6 | 0.068 | 7.5x smaller |
| TurboQuant 2.5 | 6/6 | 0.087 | 5.9x smaller |
| TurboQuant 3-bit | 6/6 | 0.106 | 4.8x smaller |
| TurboQuant 3.5 | 6/6 | 0.112 | 4.5x smaller |
| TurboQuant 4-bit | 6/6 | 0.132 | 3.9x smaller |
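
The "vs full" column is the full-precision cache size divided by the quantized cache size. A quick check of that arithmetic using the figures from the results table (the variable and function names are illustrative). Because the sizes in the table are already rounded to three decimals, a recomputed ratio can differ from the reported one in the last digit.

```python
# Average cache sizes in GB, copied from the results table.
cache_gb = {
    "full": 0.510,
    "TurboQuant 2-bit": 0.068,
    "TurboQuant 2.5": 0.087,
    "TurboQuant 3-bit": 0.106,
    "TurboQuant 3.5": 0.112,
    "TurboQuant 4-bit": 0.132,
}


def compression_ratio(config: str) -> float:
    """Return how many times smaller `config` is than the full-precision cache."""
    return cache_gb["full"] / cache_gb[config]


for name in cache_gb:
    print(f"{name:18s} {compression_ratio(name):.2f}x")
```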

Phase Plan

Phase 1 — Algorithm PoC ✅ (this PR)

  • Core algorithm with bit packing and fractional bit-widths (2.5, 3.5)
  • Needle-in-a-haystack: 6/6 exact match across all bit-widths
  • Exported as TurboQuantConfig / TurboQuantState from vllm_omni.quantization
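
To illustrate what "bit packing" means for the Phase 1 items, here is a minimal NumPy sketch that stores four 2-bit codes per byte. The helper names `pack_2bit` and `unpack_2bit` and the bit layout are assumptions for illustration; they are not taken from `turboquant.py`.

```python
import numpy as np

# Code i of each group of four occupies bits [2*i, 2*i + 1] of the byte.
_SHIFTS = np.array([0, 2, 4, 6], dtype=np.uint8)


def pack_2bit(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3) into bytes, four codes per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.size % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    if codes.max(initial=0) > 3:
        raise ValueError("codes must fit in 2 bits")
    groups = codes.reshape(-1, 4)
    return np.bitwise_or.reduce(groups << _SHIFTS, axis=1).astype(np.uint8)


def unpack_2bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_2bit: recover the flat array of 2-bit codes."""
    packed = np.asarray(packed, dtype=np.uint8)
    return ((packed[:, None] >> _SHIFTS) & 0b11).reshape(-1)


codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
packed = pack_2bit(codes)          # 8 codes -> 2 bytes
restored = unpack_2bit(packed)
```

A fractional average such as 2.5 bits could then come from mixing widths across channels (for example, half the channels at 2 bits and half at 3). That split is only an illustrative assumption, not the PR's confirmed layout.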

Phase 2 — vLLM KV Cache Integration

| | Option A: vllm-omni fork | Option B: Upstream vLLM |
|---|---|---|
| Scope | Modify vLLM submodule locally | Contribute to vllm-project/vllm |
| Speed | Fast, test on Qwen3-Omni immediately | Longer review cycle |
| Maintenance | Carries fork diff | No fork, inherits updates |

Both options require: CacheDType enum extension, TurboQuantKVCacheMethod, paged KV block layout changes, attention layer wiring.

Phase 3 — Performance

Triton kernels for fused rotation + quantize. Fused attention on packed data. Stack with FP8 weight quantization.

Test Plan

python tests/diffusion/quantization/test_turboquant_standalone.py          # correctness
python tests/diffusion/quantization/bench_turboquant_needle.py \
  --model Qwen/Qwen2.5-7B-Instruct --contexts 4096 8192 16384             # needle E2E
python tests/diffusion/quantization/bench_turboquant_kvcache.py \
  --model qwen2.5-omni                                                     # storage benchmark

Test Result

  • Correctness: 26/26 pass, MSE matches paper Theorem 1
  • Needle: 6/6 exact match at all bit-widths
  • KV cache: up to 7.5x compression vs FP16

Related: #1854, #1867, #2207

Changed files

  • tests/diffusion/quantization/bench_turboquant_needle.py (added, +281/-0)
  • tests/diffusion/quantization/test_turboquant_standalone.py (added, +189/-0)
  • vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml (modified, +5/-0)
  • vllm_omni/quantization/__init__.py (modified, +6/-0)
  • vllm_omni/quantization/turboquant.py (added, +529/-0)


[Feature]: Add TurboQuant Support for KV Cache Quantization

🚀 The feature, motivation and pitch

TurboQuant is a novel online vector quantization method that achieves near-optimal distortion rates for both MSE and inner product preservation, specifically designed for KV cache compression in LLMs. As described in the paper TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate, this method provides:

  • Provably optimal distortion bounds within a factor of ~2.7 of information-theoretic limits
  • Unbiased inner product estimation - critical for attention mechanisms
  • Online application - no data-dependent preprocessing required
  • Accelerator-friendly design - suitable for real-time AI workloads
  • Significant memory savings - 4-5× compression while maintaining model performance

The paper demonstrates TurboQuant achieving perfect recall on Needle-in-a-Haystack tests at 4× compression and competitive performance on LongBench with just 2.5-3.5 bits per dimension.
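
A minimal NumPy sketch of the "online, no data-dependent preprocessing" idea, assuming a rotate-then-quantize scheme: a fixed random rotation is applied to each vector, then every coordinate is rounded to a small grid. The uniform grid here is a stand-in chosen for brevity, not the method's actual codebook, and all function names are hypothetical. Input vectors are assumed to be nonzero.

```python
import numpy as np


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the result does not depend on QR sign conventions.
    return q * np.sign(np.diag(r))


def _grid(dim: int, bits: int):
    # Rotated unit-vector coordinates have scale ~ 1/sqrt(dim); cover about
    # three standard deviations on each side with 2**bits uniform bins.
    limit = 3.0 / np.sqrt(dim)
    return limit, 2 * limit / (2 ** bits)


def quantize(x: np.ndarray, rotation: np.ndarray, bits: int):
    """Return per-coordinate bin indices and the vector norm."""
    norm = np.linalg.norm(x)
    y = rotation @ (x / norm)
    limit, step = _grid(x.size, bits)
    idx = np.clip(np.floor((y + limit) / step), 0, 2 ** bits - 1)
    return idx.astype(np.uint8), norm


def dequantize(idx: np.ndarray, norm: float, rotation: np.ndarray, bits: int):
    """Reconstruct from bin centers, undo the rotation, restore the norm."""
    limit, step = _grid(idx.size, bits)
    y_hat = (idx + 0.5) * step - limit
    return norm * (rotation.T @ y_hat)


rotation = random_rotation(128, seed=1)
x = np.random.default_rng(2).standard_normal(128)
errors = {}
for bits in (2, 4):
    idx, norm = quantize(x, rotation, bits)
    errors[bits] = np.linalg.norm(x - dequantize(idx, norm, rotation, bits))
```

The rotation is sampled once from a seed and never looks at the data, which is what makes the scheme applicable online.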

This would be a valuable addition to vLLM's quantization portfolio, complementing existing scalar methods (FP8, INT4, etc.) with a vector quantization approach specifically optimized for attention KV caches.

Alternatives

Current vLLM KV cache quantization options include:

  • FP8 quantization (fp8_e4m3, fp8_e5m2) - scalar quantization with element-wise MSE optimization
  • Compressed-tensors KV cache - supports various schemes but primarily scalar-based
  • ModelOpt FP8 - NVIDIA's FP8 implementation

TurboQuant differs by:

  1. Using vector quantization with precomputed, data-independent codebooks instead of scalar quantization
  2. Optimizing for inner product preservation rather than just MSE
  3. Providing theoretical guarantees on distortion bounds
  4. Supporting variable bit-widths (2.5-bit, 3.5-bit) through outlier handling

Additional context

Technical Integration Approach

Based on vLLM's existing quantization framework, TurboQuant integration would require:

1. Extend Cache Configuration

Add "turboquant" to CacheDType literal in vllm/config/cache.py and update the dtype mapping in vllm/utils/torch_utils.py to use integer storage types for codebook indices.
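
As a rough sketch of this step, the snippet below extends a cache dtype literal and resolves it to a storage type. It uses NumPy dtypes so it runs without vLLM; the literal members, the `STORAGE_DTYPE` table and `resolve_storage_dtype` are hypothetical stand-ins, not copies of `vllm/config/cache.py` or `vllm/utils/torch_utils.py`.

```python
from typing import Literal, get_args

import numpy as np

# Hypothetical cache dtype literal with the new "turboquant" member.
CacheDType = Literal["auto", "fp8_e4m3", "fp8_e5m2", "turboquant"]

# Hypothetical name -> storage dtype table. Packed codebook indices are raw
# bytes, so "turboquant" maps to an unsigned 8-bit integer type.
STORAGE_DTYPE = {
    "fp8_e4m3": np.uint8,
    "fp8_e5m2": np.uint8,
    "turboquant": np.uint8,
}


def resolve_storage_dtype(cache_dtype: str, model_dtype) -> np.dtype:
    """Pick the element type used to allocate KV cache blocks."""
    if cache_dtype not in get_args(CacheDType):
        raise ValueError(f"unsupported cache dtype: {cache_dtype!r}")
    if cache_dtype == "auto":
        # "auto" keeps the model's own dtype for the cache.
        return np.dtype(model_dtype)
    return np.dtype(STORAGE_DTYPE[cache_dtype])
```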

2. Create TurboQuantConfig Class

Implement using the registration decorator pattern:

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "turboquant"
    
    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None

This follows the established pattern from vllm/model_executor/layers/quantization/__init__.py.

3. Implement KV Cache Method

Create TurboQuantKVCacheMethod extending BaseKVCacheMethod to:

  • Register codebook parameters instead of scalar scales
  • Handle both MSE-optimized and inner-product-optimized variants
  • Support per-head quantization strategies

4. Update Quantization Detection

Modify is_quantized_kv_cache() to recognize TurboQuant as a quantized format.
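
One possible shape for that check, written as a standalone function. The set-membership approach and the `QUANTIZED_KV_CACHE_DTYPES` name are assumptions for illustration; vLLM's actual `is_quantized_kv_cache()` body is not reproduced here.

```python
# Hypothetical set of cache dtype names that use a quantized storage format.
QUANTIZED_KV_CACHE_DTYPES = frozenset(
    {"fp8", "fp8_e4m3", "fp8_e5m2", "turboquant"}
)


def is_quantized_kv_cache(kv_cache_dtype: str) -> bool:
    """Return True when the KV cache is stored in a quantized format."""
    return kv_cache_dtype in QUANTIZED_KV_CACHE_DTYPES
```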

5. Implement CUDA/Triton Kernels

Develop two key operations:

  • Encode kernel: Quantize K/V tensors to codebook indices for cache storage
  • Decode kernel: Reconstruct K/V tensors from indices before attention computation

These would integrate with the existing attention backend system.
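
A CPU reference for the two operations can help when validating real kernels. The sketch below, in NumPy, encodes keys to nearest-codebook indices and decodes them before computing attention weights. The per-element codebook and all function names are assumptions for illustration, not the kernels proposed here.

```python
import numpy as np


def encode(x: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each element of x to the index of its nearest codebook entry.

    `codebook` must be sorted ascending; the midpoints between neighbouring
    entries are the decision boundaries.
    """
    boundaries = (codebook[:-1] + codebook[1:]) / 2
    return np.searchsorted(boundaries, x).astype(np.uint8)


def decode(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look up the reconstruction value for each stored index."""
    return codebook[indices]


def attention_weights(query: np.ndarray, key_indices: np.ndarray,
                      codebook: np.ndarray) -> np.ndarray:
    """Decode cached keys, then compute softmax(K q / sqrt(d))."""
    keys = decode(key_indices, codebook)          # (num_tokens, head_dim)
    logits = keys @ query / np.sqrt(query.size)
    logits = logits - logits.max()                # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


codebook = np.array([-1.5, -0.5, 0.5, 1.5])
keys = np.array([[-2.0, -0.4, 0.6, 3.0],
                 [0.1, 0.9, -0.9, -0.1]])
key_indices = encode(keys, codebook)
weights = attention_weights(np.ones(4), key_indices, codebook)
```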

6. Memory Management

Update KVCacheSpec calculations to account for:

  • Reduced storage from vector quantization
  • Additional codebook memory overhead
  • Variable compression ratios based on bit-width
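
The three factors above can be combined into a simple sizing estimate. This is a sketch under stated assumptions: the per-vector norm overhead, the helper names and the model shape in the usage lines are illustrative, and it does not mirror vLLM's `KVCacheSpec` API.

```python
import math


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bits_per_value: float,
                             norm_bytes_per_vector: int = 0) -> int:
    """Bytes of K and V stored per token across all layers."""
    # One K vector and one V vector per KV head per layer.
    vectors = 2 * num_layers * num_kv_heads
    # Packed payload for one vector, rounded up to whole bytes.
    payload = math.ceil(head_dim * bits_per_value / 8)
    return vectors * (payload + norm_bytes_per_vector)


def codebook_bytes(num_codebooks: int, bits: int, entry_bytes: int = 2) -> int:
    """Fixed overhead of storing `num_codebooks` tables of 2**bits entries."""
    return num_codebooks * (2 ** bits) * entry_bytes


# Illustrative shape: 28 layers, 4 KV heads, head_dim 128.
fp16_bytes = kv_cache_bytes_per_token(28, 4, 128, bits_per_value=16)
tq_bytes = kv_cache_bytes_per_token(28, 4, 128, bits_per_value=2.5,
                                    norm_bytes_per_vector=2)
ratio = fp16_bytes / tq_bytes
```

Unlike the per-token payload, the codebook cost is fixed, so it is amortized over the whole cache and matters less as context length grows.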


extent analysis

Fix Plan

To integrate TurboQuant into the existing quantization framework, follow these steps:

  1. Extend Cache Configuration: Add "turboquant" to CacheDType in vllm/config/cache.py and update vllm/utils/torch_utils.py to use integer storage types for codebook indices.
  2. Create TurboQuantConfig Class: Implement the TurboQuantConfig class using the registration decorator pattern:

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "turboquant"

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None

  3. Implement KV Cache Method: Create TurboQuantKVCacheMethod extending BaseKVCacheMethod to handle codebook parameters, MSE-optimized and inner-product-optimized variants, and per-head quantization strategies.
  4. Update Quantization Detection: Modify is_quantized_kv_cache() to recognize TurboQuant as a quantized format.
  5. Implement CUDA/Triton Kernels: Develop encode and decode kernels for quantizing K/V tensors to codebook indices and reconstructing K/V tensors from indices.
  6. Memory Management: Update KVCacheSpec calculations to account for reduced storage, codebook memory overhead, and variable compression ratios.

Example Code

```python
# turboquant_config.py
# Import paths follow the proposal and may differ between vLLM versions.
import torch

from vllm.model_executor.layers.attention import Attention
from vllm.model_executor.layers.quantization import QuantizationConfig
from vllm.model_executor.layers.quantization import register_quantization_config

from turboquant_kv_cache_method import TurboQuantKVCacheMethod


@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    # The remaining abstract methods of QuantizationConfig are omitted here.
    def get_name(self) -> str:
        return "turboquant"

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Only attention layers receive a KV cache quantization method.
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None


# turboquant_kv_cache_method.py
import torch

from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    def __init__(self, config: "TurboQuantConfig"):
        super().__init__(config)
        # Initialize codebook parameters and quantization strategy here.

    def quantize(self, kv_tensors: torch.Tensor) -> torch.Tensor:
        # Encode kernel: quantize K/V tensors to codebook indices.
        raise NotImplementedError

    def dequantize(self, indices: torch.Tensor) -> torch.Tensor:
        # Decode kernel: reconstruct K/V tensors from indices.
        raise NotImplementedError
```

Verification

To verify the fix, test the TurboQuant integration with various models and datasets, checking for:

  • Correct quantization and dequantization of K/V tensors
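
One way to phrase such a check is as a round-trip property test: quantize, dequantize, and bound the error. The sketch below does this for a plain uniform quantizer used as a stand-in; it is not the TurboQuant test suite, and the function names are hypothetical.

```python
import numpy as np


def roundtrip(x: np.ndarray, bits: int, limit: float):
    """Quantize x to 2**bits uniform bins on [-limit, limit] and reconstruct."""
    levels = 2 ** bits
    step = 2 * limit / levels
    idx = np.clip(np.floor((x + limit) / step), 0, levels - 1).astype(np.int64)
    return (idx + 0.5) * step - limit, step


def check_roundtrip(bits: int, limit: float = 1.0, seed: int = 0) -> float:
    """Assert shape and error bounds, then return the mean squared error."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-limit, limit, size=(16, 128))   # stand-in K/V block
    x_hat, step = roundtrip(x, bits, limit)
    assert x_hat.shape == x.shape
    # In-range values are never further than half a bin from their center.
    assert np.max(np.abs(x - x_hat)) <= step / 2 + 1e-12
    return float(np.mean((x - x_hat) ** 2))


mse_by_bits = {bits: check_roundtrip(bits) for bits in (2, 3, 4)}
```

The same pattern extends to a real implementation by swapping `roundtrip` for the actual encode and decode calls and replacing the half-bin bound with the distortion bound that implementation is expected to meet.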
