vllm - ✅(Solved) Fix [Feature]: Add TurboQuant Support for KV Cache Quantization [1 pull requests, 32 comments, 19 participants]

tunglinwood · 2026-03-26T01:24:23Z




GitHub stats

vllm-project/vllm#38171 • Fetched 2026-04-08 01:31:57

Timeline events: 288
Reactions: 202
Timeline (top): subscribed ×204, commented ×32, mentioned ×23, cross-referenced ×17

Fix Action

Fixed

Fixed by PR: [POC] TurboQuant: Sub-4-bit KV Cache Quantization for Omni Models (https://github.com/vllm-project/vllm-omni/pull/2214)

PR fix notes

PR #2214: [POC] Add Turbo Quant

Repository: vllm-project/vllm-omni
Author: lishunyang12
State: open | merged: False
Link: https://github.com/vllm-project/vllm-omni/pull/2214

Description (problem / solution / changelog)

Purpose

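KV cache is the primary memory bottleneck for long-context multimodal inference in vllm-omni. This PR implements TurboQuant (https://arxiv.org/abs/2504.19874, ICLR 2026), an online vector quantization algorithm that compresses the KV cache to 2–4 bits. In the needle-in-a-haystack results reported in this PR, every tested bit-width matched the exact-match score of the full-precision cache.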

RFC: #2215 | Upstream vLLM: vllm-project/vllm#38171

Results

Needle-in-a-Haystack — Qwen2.5-7B-Instruct (bf16), NVIDIA H200, context 4K–16K

| config | exact match | avg cache GB | vs full |
|--------|-------------|--------------|---------|
| full | 6/6 | 0.510 | 1.0x |
| TurboQuant 2-bit | 6/6 | 0.068 | 7.5x smaller |
| TurboQuant 2.5 | 6/6 | 0.087 | 5.9x smaller |
| TurboQuant 3-bit | 6/6 | 0.106 | 4.8x smaller |
| TurboQuant 3.5 | 6/6 | 0.112 | 4.5x smaller |
| TurboQuant 4-bit | 6/6 | 0.132 | 3.9x smaller |
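
The "vs full" column is the full-precision cache size divided by the compressed size. As a rough cross-check, the sketch below recomputes that ratio from the rounded GB figures in the table and compares it with the ideal ratio from payload bits alone (16 bits for the bf16 baseline divided by the nominal bit-width). Because the GB values are rounded to three decimals, a recomputed ratio can differ from the table in the last digit. The helper name `compression_report` is hypothetical and not part of the PR.

```python
# Average cache sizes (GB) copied from the results table.
CACHE_GB = {
    "full": 0.510,
    "TurboQuant 2-bit": 0.068,
    "TurboQuant 2.5": 0.087,
    "TurboQuant 3-bit": 0.106,
    "TurboQuant 3.5": 0.112,
    "TurboQuant 4-bit": 0.132,
}

# Nominal bits per stored value for each quantized config.
BITS = {
    "TurboQuant 2-bit": 2.0,
    "TurboQuant 2.5": 2.5,
    "TurboQuant 3-bit": 3.0,
    "TurboQuant 3.5": 3.5,
    "TurboQuant 4-bit": 4.0,
}

BASELINE_BITS = 16  # the baseline cache is bf16


def compression_report(cache_gb, bits, baseline="full"):
    """Return {config: (measured_ratio, ideal_ratio)}."""
    report = {}
    for name, width in bits.items():
        measured = cache_gb[baseline] / cache_gb[name]
        ideal = BASELINE_BITS / width
        report[name] = (measured, ideal)
    return report


if __name__ == "__main__":
    for name, (measured, ideal) in compression_report(CACHE_GB, BITS).items():
        print(f"{name:18s} measured {measured:4.2f}x   ideal {ideal:4.2f}x")
```

In every row the measured ratio sits slightly below the ideal one, which would be consistent with a small amount of per-vector metadata stored alongside the packed codes; the PR text does not say what that metadata is.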

Phase Plan

Phase 1 — Algorithm PoC ✅ (this PR)

Core algorithm with bit packing and fractional bit-widths (2.5, 3.5)
Needle-in-a-haystack: 6/6 exact match across all bit-widths
Exported as TurboQuantConfig / TurboQuantState from vllm_omni.quantization
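
To illustrate what "bit packing" means for sub-byte codes, here is a minimal pure-Python sketch that packs fixed-width integer codes into bytes and unpacks them again. It is not the PR's implementation: the function names `pack_codes` and `unpack_codes` and the little-endian bit order are assumptions made for illustration, and only integer widths are handled (fractional widths such as 2.5 are out of scope here).

```python
def pack_codes(codes, bits):
    """Pack non-negative integers into bytes, `bits` bits each (little-endian)."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for code in codes:
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code {code} does not fit in {bits} bits")
        acc |= code << nbits
        nbits += bits
        while nbits >= 8:          # flush every full byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:                      # final partial byte, zero-padded
        out.append(acc & 0xFF)
    return bytes(out)


def unpack_codes(data, bits, count):
    """Inverse of pack_codes; `count` stops decoding before the padding bits."""
    mask = (1 << bits) - 1
    acc = 0
    nbits = 0
    codes = []
    for byte in data:
        acc |= byte << nbits
        nbits += 8
        while nbits >= bits and len(codes) < count:
            codes.append(acc & mask)
            acc >>= bits
            nbits -= bits
    return codes


if __name__ == "__main__":
    codes = [0, 1, 2, 3, 3, 2, 1, 0]
    packed = pack_codes(codes, bits=2)
    print(len(codes), "codes ->", len(packed), "bytes")
    print(unpack_codes(packed, bits=2, count=len(codes)) == codes)
```

Eight 2-bit codes fit in two bytes, compared with sixteen bytes if each value were stored as bf16.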

Phase 2 — vLLM KV Cache Integration

| | Option A: vllm-omni fork | Option B: Upstream vLLM |
|--|--------------------------|-------------------------|
| Scope | Modify vLLM submodule locally | Contribute to vllm-project/vllm |
| Speed | Fast, test on Qwen3-Omni immediately | Longer review cycle |
| Maintenance | Carries fork diff | No fork, inherits updates |

Both options require: CacheDType enum extension, TurboQuantKVCacheMethod, paged KV block layout changes, attention layer wiring.

Phase 3 — Performance

Triton kernels for fused rotation + quantize. Fused attention on packed data. Stack with FP8 weight quantization.

Test Plan

python tests/diffusion/quantization/test_turboquant_standalone.py          # correctness
python tests/diffusion/quantization/bench_turboquant_needle.py \
  --model Qwen/Qwen2.5-7B-Instruct --contexts 4096 8192 16384             # needle E2E
python tests/diffusion/quantization/bench_turboquant_kvcache.py \
  --model qwen2.5-omni                                                     # storage benchmark

Test Result

Correctness: 26/26 pass, MSE matches paper Theorem 1
Needle: 6/6 exact match at all bit-widths
KV cache: up to 7.5x compression vs FP16

Related: #1854, #1867, #2207

Changed files

tests/diffusion/quantization/bench_turboquant_needle.py (added, +281/-0)
tests/diffusion/quantization/test_turboquant_standalone.py (added, +189/-0)
vllm_omni/model_executor/stage_configs/qwen3_omni_moe.yaml (modified, +5/-0)
vllm_omni/quantization/__init__.py (modified, +6/-0)
vllm_omni/quantization/turboquant.py (added, +529/-0)

Code Example

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "turboquant"
    
    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None


[Feature]: Add TurboQuant Support for KV Cache Quantization

🚀 The feature, motivation and pitch

TurboQuant is a novel online vector quantization method that achieves near-optimal distortion rates for both MSE and inner product preservation, specifically designed for KV cache compression in LLMs. As described in the paper TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (https://arxiv.org/pdf/2504.19874), this method provides:

Provably near-optimal distortion, within a factor of ~2.7 of the information-theoretic lower bound
Unbiased inner product estimation - critical for attention mechanisms
Online application - no data-dependent preprocessing required
Accelerator-friendly design - suitable for real-time AI workloads
Significant memory savings - 4-5× compression while maintaining model performance
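
For a sense of where the ~2.7 factor could come from, the sketch below evaluates an MSE bound of the form upper = (sqrt(3)·pi/2)·4^(-b) against a lower bound of 4^(-b) for unit-norm vectors at b bits per coordinate. This form is an assumption based on our reading of the paper and should be checked against its Theorem 1; the function name `mse_bounds` is hypothetical.

```python
import math

# Assumed constant in the paper's MSE upper bound (verify against Theorem 1).
UPPER_CONSTANT = math.sqrt(3) * math.pi / 2


def mse_bounds(bits):
    """Return (lower, upper) MSE bounds for unit-norm vectors at `bits` bits."""
    lower = 4.0 ** (-bits)            # assumed information-theoretic lower bound
    upper = UPPER_CONSTANT * lower    # assumed TurboQuant upper bound
    return lower, upper


if __name__ == "__main__":
    print(f"gap factor: {UPPER_CONSTANT:.3f}")
    for b in (1, 2, 3, 4):
        lower, upper = mse_bounds(b)
        print(f"b={b}: lower {lower:.4f}, upper {upper:.4f}")
```

Under this assumption the gap between the two bounds is the same constant at every bit-width, and each extra bit cuts both bounds by a factor of four.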

The paper demonstrates TurboQuant achieving perfect recall on Needle-in-a-Haystack tests at 4× compression and competitive performance on LongBench with just 2.5-3.5 bits per dimension.

This would be a valuable addition to vLLM's quantization portfolio, complementing existing scalar methods (FP8, INT4, etc.) with a vector quantization approach specifically optimized for attention KV caches.

Alternatives

Current vLLM KV cache quantization options include:

FP8 quantization (fp8_e4m3, fp8_e5m2) - scalar quantization with element-wise MSE optimization
Compressed-tensors KV cache - supports various schemes but primarily scalar-based
ModelOpt FP8 - NVIDIA's FP8 implementation

TurboQuant differs by:

Using vector quantization with learned codebooks instead of scalar quantization
Optimizing for inner product preservation rather than just MSE
Providing theoretical guarantees on distortion bounds
Supporting variable bit-widths (2.5-bit, 3.5-bit) through outlier handling
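
One way a fractional average such as 2.5 or 3.5 bits can arise is by giving one group of channels (for example, outlier channels) a higher integer bit-width than the rest. The sketch below only does that arithmetic. The 64/64 channel splits are illustrative assumptions, not the split used by the paper or the PR, and `effective_bits` is a hypothetical helper.

```python
def effective_bits(groups):
    """Average bits per channel for groups given as (num_channels, bits) pairs."""
    groups = list(groups)
    total_channels = sum(n for n, _ in groups)
    if total_channels == 0:
        raise ValueError("at least one channel is required")
    total_bits = sum(n * b for n, b in groups)
    return total_bits / total_channels


if __name__ == "__main__":
    # head_dim = 128, split evenly between a higher-precision and a lower-precision group
    print(effective_bits([(64, 3), (64, 2)]))  # 2.5
    print(effective_bits([(64, 4), (64, 3)]))  # 3.5
```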

Additional context

Technical Integration Approach

Based on vLLM's existing quantization framework, TurboQuant integration would require:

1. Extend Cache Configuration

Add "turboquant" to CacheDType literal in vllm/config/cache.py and update the dtype mapping in vllm/utils/torch_utils.py to use integer storage types for codebook indices.
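
A standalone sketch of this step, assuming the cache dtype is a `typing.Literal` of string names. The member list below is an illustrative subset, and `STORAGE_DTYPE` and `storage_dtype_for` are hypothetical stand-ins for the real mapping in `vllm/utils/torch_utils.py`, with dtype names written as plain strings so the example runs without torch.

```python
from typing import Literal, get_args

# Illustrative subset only; the real CacheDType in vllm/config/cache.py
# defines its own members.
CacheDType = Literal["auto", "fp8", "fp8_e4m3", "fp8_e5m2", "turboquant"]

# Hypothetical mapping from cache dtype name to the storage dtype of the raw
# cache buffer. Packed codebook indices are assumed to live in bytes.
STORAGE_DTYPE = {
    "fp8": "uint8",
    "fp8_e4m3": "uint8",
    "fp8_e5m2": "uint8",
    "turboquant": "uint8",
}


def storage_dtype_for(cache_dtype: str, model_dtype: str = "bfloat16") -> str:
    """Resolve the storage dtype name for a given cache dtype name."""
    if cache_dtype not in get_args(CacheDType):
        raise ValueError(f"unsupported cache dtype: {cache_dtype!r}")
    if cache_dtype == "auto":
        return model_dtype  # unquantized cache follows the model dtype
    return STORAGE_DTYPE[cache_dtype]


if __name__ == "__main__":
    print(storage_dtype_for("turboquant"))
    print(storage_dtype_for("auto"))
```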

2. Create TurboQuantConfig Class

Implement using the registration decorator pattern:

@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "turboquant"
    
    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None

This follows the established pattern from vllm/model_executor/layers/quantization/__init__.py.

3. Implement KV Cache Method

Create TurboQuantKVCacheMethod extending BaseKVCacheMethod to:

Register codebook parameters instead of scalar scales
Handle both MSE-optimized and inner-product-optimized variants
Support per-head quantization strategies

4. Update Quantization Detection

Modify is_quantized_kv_cache() to recognize TurboQuant as a quantized format.
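
As a minimal illustration of the kind of check involved, the stand-in below treats FP8 variants and `"turboquant"` as quantized formats. It is a self-contained sketch with a hypothetical name and does not reproduce the body of vLLM's actual `is_quantized_kv_cache()`.

```python
def is_quantized_kv_cache_sketch(kv_cache_dtype: str) -> bool:
    """Hypothetical stand-in: True for FP8 variants and for TurboQuant."""
    return kv_cache_dtype.startswith("fp8") or kv_cache_dtype == "turboquant"


if __name__ == "__main__":
    for name in ("auto", "fp8_e4m3", "turboquant"):
        print(name, is_quantized_kv_cache_sketch(name))
```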

5. Implement CUDA/Triton Kernels

Develop two key operations:

Encode kernel: Quantize K/V tensors to codebook indices for cache storage
Decode kernel: Reconstruct K/V tensors from indices before attention computation

These would integrate with the existing attention backend system.
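
The encode and decode operations can be sketched on the CPU as a nearest-centroid lookup over a sorted scalar codebook. This is a generic illustration, not the proposed kernels: the uniform 2-bit codebook is made up for the example, the function names are hypothetical, and any rotation or normalization applied before quantization is omitted.

```python
import bisect


def make_boundaries(codebook):
    """Decision boundaries: midpoints between neighbouring sorted centroids."""
    return [(a + b) / 2 for a, b in zip(codebook, codebook[1:])]


def encode(values, codebook):
    """Map each value to the index of its nearest centroid."""
    bounds = make_boundaries(codebook)
    # bisect_left counts the boundaries strictly below v, which is the index
    # of the nearest centroid (ties go to the lower index).
    return [bisect.bisect_left(bounds, v) for v in values]


def decode(indices, codebook):
    """Reconstruct approximate values from codebook indices."""
    return [codebook[i] for i in indices]


if __name__ == "__main__":
    codebook = [-1.5, -0.5, 0.5, 1.5]   # illustrative 2-bit codebook
    values = [-2.0, -0.3, 0.2, 0.9, 3.0]
    indices = encode(values, codebook)
    print(indices)
    print(decode(indices, codebook))
```

In a real backend the indices would be bit-packed before being written to the paged cache, and decoding would ideally be fused into the attention kernel rather than materializing full-precision K/V tensors.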

6. Memory Management

Update KVCacheSpec calculations to account for:

Reduced storage from vector quantization
Additional codebook memory overhead
Variable compression ratios based on bit-width
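
A back-of-the-envelope sizing sketch for these calculations is shown below. The model shape (28 layers, 4 KV heads, head dimension 128) is an assumption about Qwen2.5-7B-Instruct, the 2-byte per-vector overhead is purely hypothetical, and `kv_cache_bytes` is not a vLLM function.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   bits_per_value, overhead_bytes_per_vector=0.0):
    """Estimate KV cache size in bytes for one sequence."""
    # One K vector and one V vector per token, layer and KV head.
    vectors = 2 * num_tokens * num_layers * num_kv_heads
    payload = head_dim * bits_per_value / 8  # packed codes per vector
    return vectors * (payload + overhead_bytes_per_vector)


# Assumed Qwen2.5-7B-Instruct shape (check against the model config).
SHAPE = dict(num_layers=28, num_kv_heads=4, head_dim=128)

# Mean of the three benchmark context lengths listed in the PR test plan.
MEAN_TOKENS = (4096 + 8192 + 16384) / 3

if __name__ == "__main__":
    full = kv_cache_bytes(MEAN_TOKENS, bits_per_value=16, **SHAPE)
    two_bit = kv_cache_bytes(MEAN_TOKENS, bits_per_value=2,
                             overhead_bytes_per_vector=2, **SHAPE)
    print(f"bf16 cache:  {full / GIB:.3f} GiB")
    print(f"2-bit cache: {two_bit / GIB:.3f} GiB ({full / two_bit:.2f}x smaller)")
```

With these assumed shape values the bf16 estimate comes out at about 0.51 GiB, in line with the "full" row of the PR results. The overhead term is where codebook or per-vector metadata would enter a real `KVCacheSpec` calculation.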


extent analysis

Fix Plan

To integrate TurboQuant into the existing quantization framework, follow these steps:

1. **Extend Cache Configuration**: Add `"turboquant"` to `CacheDType` in `vllm/config/cache.py` and update `vllm/utils/torch_utils.py` to use integer storage types for codebook indices.
2. **Create TurboQuantConfig Class**: Implement the `TurboQuantConfig` class using the registration decorator pattern:

        @register_quantization_config("turboquant")
        class TurboQuantConfig(QuantizationConfig):
            def get_name(self) -> str:
                return "turboquant"

            def get_quant_method(self, layer: torch.nn.Module, prefix: str):
                if isinstance(layer, Attention):
                    return TurboQuantKVCacheMethod(self)
                return None

3. **Implement KV Cache Method**: Create `TurboQuantKVCacheMethod` extending `BaseKVCacheMethod` to handle codebook parameters, MSE-optimized and inner-product-optimized variants, and per-head quantization strategies.
4. **Update Quantization Detection**: Modify `is_quantized_kv_cache()` to recognize TurboQuant as a quantized format.
5. **Implement CUDA/Triton Kernels**: Develop encode and decode kernels for quantizing K/V tensors to codebook indices and reconstructing K/V tensors from indices.
6. **Memory Management**: Update `KVCacheSpec` calculations to account for reduced storage, codebook memory overhead, and variable compression ratios.

### Example Code
```python
# turboquant_config.py
import torch

from vllm.model_executor.layers.quantization import (
    QuantizationConfig,
    register_quantization_config,
)
# Attention import path as given in the original sketch; it may differ
# between vLLM versions.
from vllm.model_executor.layers.attention import Attention


# Skeleton only: the remaining abstract methods of QuantizationConfig are omitted.
@register_quantization_config("turboquant")
class TurboQuantConfig(QuantizationConfig):
    def get_name(self) -> str:
        return "turboquant"

    def get_quant_method(self, layer: torch.nn.Module, prefix: str):
        # Only attention layers receive a KV cache quantization method.
        if isinstance(layer, Attention):
            return TurboQuantKVCacheMethod(self)
        return None


# turboquant_kv_cache_method.py
# BaseKVCacheMethod lives in the kv_cache submodule, not the package root.
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod


class TurboQuantKVCacheMethod(BaseKVCacheMethod):
    def __init__(self, config: TurboQuantConfig):
        super().__init__(config)
        # Initialize codebook parameters and quantization strategy

    def quantize(self, kv_tensors: torch.Tensor):
        # Implement encode kernel to quantize K/V tensors to codebook indices
        pass

    def dequantize(self, indices: torch.Tensor):
        # Implement decode kernel to reconstruct K/V tensors from indices
        pass
```
Verification

To verify the fix, test the TurboQuant integration with various models and datasets, checking that K/V tensors are quantized and dequantized correctly.
