vllm - 💡(How to fix) Fix [Feature]: Add Rotorquant support [2 comments, 3 participants]

🚀 The feature, motivation and pitch

RotorQuant is a Clifford algebra-based reimagining of TurboQuant (ICLR 2026) for KV cache compression. It replaces the dense d×d random orthogonal rotation matrix used in TurboQuant with lightweight Clifford rotors from Cl(3,0), achieving 10–19× faster quantization on NVIDIA GPUs, 9–31× faster on Apple Silicon, and using 44× fewer parameters — all while matching TurboQuant's attention fidelity on real models.

Paper/Report: https://www.scrya.com/rotorquant/
Code: https://github.com/scrya-com/rotorquant
Related issue: #38171 (TurboQuant support request)

Why this matters for vLLM

KV cache memory is the primary bottleneck for long-context serving. At 8K tokens on Qwen2.5-3B (36 layers), the KV cache is ~289 MB in FP16. RotorQuant compresses this to ~58 MB at 3-bit (5× compression) with 99.0% attention cosine similarity — comparable to TurboQuant (99.5%) but with dramatically lower computational overhead for the rotation step.
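The FP16 figure above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes 2 KV heads and a head dimension of 128 for Qwen2.5-3B (values recalled from the model's published config, not stated in this issue), and the helper name `kv_cache_bytes` is made up for illustration. It counts only the raw quantized payload, so the 3-bit result is a lower bound; the ~58 MB quoted above presumably includes extra per-vector metadata that is not modelled here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Raw KV cache payload in bytes (hypothetical helper, no metadata overhead)."""
    # Factor 2 accounts for storing both keys and values.
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value / 8


MIB = 1024 ** 2
# Assumed Qwen2.5-3B shape: 36 layers, 2 KV heads, head_dim 128, 8K tokens.
fp16 = kv_cache_bytes(36, 2, 128, 8192, bits_per_value=16)
three_bit = kv_cache_bytes(36, 2, 128, 8192, bits_per_value=3)
print(f"FP16:  {fp16 / MIB:.0f} MiB")
print(f"3-bit: {three_bit / MIB:.0f} MiB (payload only)")
```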

The core advantage for a serving engine like vLLM is the quantize/dequantize speed: TurboQuant's rotation requires a 128×128 dense matmul (16,384 FMAs per vector), while RotorQuant's rotor sandwich product uses only ~100 FMAs per vector. This translates directly to lower latency on the KV cache write path.

How RotorQuant works

1. Chunk the d-dimensional KV vector into groups of 3 dimensions (43 groups for d=128; since 128 is not a multiple of 3, the last group has to be padded).
2. Embed each 3D chunk as a Cl(3,0) multivector (8 components).
3. Apply the rotor sandwich product RxR̃ per group to decorrelate coordinates. Each rotor has only 4 non-zero components (scalar + 3 bivectors), making the geometric product extremely sparse.
4. Quantize each coordinate with grade-aware Lloyd-Max scalar quantization (separate codebooks for scalar vs. bivector grades).
5. Apply QJL residual correction (1 bit per dimension) for unbiased inner product estimation, same as TurboQuant Stage 2.
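The Lloyd-Max quantization step can be sketched as a one-dimensional k-means over sample coordinates. This is a minimal illustration under stated assumptions, not the RotorQuant implementation: the function names are hypothetical, the data is synthetic Gaussian, and the grade-aware part (separate codebooks per grade) would simply mean fitting one such codebook per group of coordinates.

```python
import numpy as np


def lloyd_max_codebook(samples, num_bits, iterations=50):
    """Fit a 1D Lloyd-Max codebook (reconstruction levels) to `samples`."""
    levels = 2 ** num_bits
    # Initialise the levels at evenly spaced quantiles of the data.
    centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
    for _ in range(iterations):
        # Decision boundaries sit halfway between neighbouring levels.
        boundaries = (centroids[:-1] + centroids[1:]) / 2.0
        cells = np.searchsorted(boundaries, samples)
        # Each level moves to the mean of the samples assigned to it.
        for k in range(levels):
            members = samples[cells == k]
            if members.size:
                centroids[k] = members.mean()
    return centroids


def encode(values, centroids):
    """Map each value to the index of its nearest reconstruction level."""
    boundaries = (centroids[:-1] + centroids[1:]) / 2.0
    return np.searchsorted(boundaries, values).astype(np.uint8)


def decode(codes, centroids):
    """Look up the reconstruction level for each code."""
    return centroids[codes]


rng = np.random.default_rng(0)
rotated_coords = rng.standard_normal(20_000)  # synthetic stand-in data
codebook = lloyd_max_codebook(rotated_coords, num_bits=3)
codes = encode(rotated_coords, codebook)
mse = float(np.mean((decode(codes, codebook) - rotated_coords) ** 2))
print(f"{len(codebook)} levels, MSE = {mse:.4f}")
```

Because the codebook is fitted offline, the runtime cost per coordinate is a single boundary search, which is why the rotation step dominates the write-path latency discussed above.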

Benchmarks (from official website)

CUDA fused kernel speed (RTX PRO 4000, d=128, 3-bit):

n_vectors	TurboQuant	RotorQuant CUDA	Speedup
1,024	69 μs	6 μs	11×
4,096	132 μs	12 μs	11×
8,192	285 μs	20 μs	14×
16,384	740 μs	39 μs	19×

Real model validation (Qwen2.5-3B-Instruct KV cache):

Context	Bits	Method	Cosine Sim	Top-1	Top-5
2K	3-bit	TurboQuant	0.9906	81.2%	93.8%
2K	3-bit	RotorQuant	0.9903	81.2%	93.8%
4K	4-bit	TurboQuant	0.9880	75.0%	93.8%
4K	4-bit	RotorQuant	0.9874	81.2%	93.8%

RotorQuant matches or exceeds TurboQuant on top-k retrieval accuracy at 4K context, despite slightly lower cosine similarity on synthetic benchmarks. The Clifford rotor decorrelation appears to better preserve directional structure in real attention heads.

Parameter efficiency:

Method	Params (d=128)	Params (d=4096)
TurboQuant	16,399	~16.7M
RotorQuant	372	~11K

Alternatives

TurboQuant (#38171): The baseline this improves upon. Higher raw MSE fidelity on synthetic data, but slower rotation step and 44× more parameters. RotorQuant could be added as a drop-in alternative sharing the same QJL and Lloyd-Max infrastructure.

Additional context

Pope, J.D. (2026). "RotorQuant: Clifford Algebra Vector Quantization for LLM KV Cache Compression." https://www.scrya.com/rotorquant/
Zandieh et al. (2026). "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874


extent analysis

Fix Plan

To integrate RotorQuant into the existing vLLM framework, follow these steps:

1. Replace the dense orthogonal rotation matrix in TurboQuant with the Clifford rotor implementation from RotorQuant.
2. Update the quantization and dequantization logic to use the grade-aware Lloyd-Max scalar quantization.
3. Apply the QJL residual correction for unbiased inner product estimation.

Code Changes

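import numpy as np


class CliffordRotor:
    """Unit rotor in Cl(3,0): one scalar plus three bivector components.

    On a 3D vector the sandwich product R x R~ is an ordinary rotation, so it
    is evaluated with the equivalent unit-quaternion formula. Mapping the
    bivector part onto its dual axis is an assumption of this sketch.
    """

    def __init__(self, scalar, bivectors):
        coeffs = np.concatenate(([scalar], bivectors)).astype(np.float64)
        coeffs /= np.linalg.norm(coeffs)  # a rotor must have unit norm
        self.scalar = coeffs[0]
        self.bivectors = coeffs[1:]

    def apply(self, vector, inverse=False):
        # Sandwich product R x R~; inverse=True uses the reversed rotor.
        axis = -self.bivectors if inverse else self.bivectors
        cross = np.cross(axis, vector)
        return vector + 2.0 * self.scalar * cross + 2.0 * np.cross(axis, cross)


class RotorQuant:
    def __init__(self, num_dimensions, num_bits, seed=0):
        self.num_dimensions = num_dimensions
        self.num_bits = num_bits
        # Ceiling division: d=128 needs 43 groups, the last one zero-padded.
        self.num_groups = -(-num_dimensions // 3)
        rng = np.random.default_rng(seed)
        self.rotors = []
        for _ in range(self.num_groups):
            coeffs = rng.standard_normal(4)
            self.rotors.append(CliffordRotor(coeffs[0], coeffs[1:]))

    def _rotate(self, vector, inverse=False):
        # Pad to a multiple of 3, then rotate each 3D chunk with its own rotor.
        padded = np.zeros(self.num_groups * 3)
        padded[: len(vector)] = vector
        chunks = padded.reshape(self.num_groups, 3)
        rotated = [r.apply(c, inverse) for r, c in zip(self.rotors, chunks)]
        return np.concatenate(rotated)

    def quantize(self, vector):
        rotated = self._rotate(vector)
        # Placeholder: uniform scalar quantizer. The grade-aware Lloyd-Max
        # codebooks described above are not reproduced here.
        levels = 2 ** self.num_bits
        scale = float(np.max(np.abs(rotated))) or 1.0
        codes = np.round((rotated / scale + 1.0) / 2.0 * (levels - 1))
        return codes.astype(np.uint8), scale

    def dequantize(self, codes, scale):
        levels = 2 ** self.num_bits
        rotated = (codes.astype(np.float64) / (levels - 1) * 2.0 - 1.0) * scale
        # Undo the rotation with the reversed rotors and drop the padding.
        # The QJL residual correction is omitted: it relies on stored 1-bit
        # sketches of the residual, not on fresh random noise.
        return self._rotate(rotated, inverse=True)[: self.num_dimensions]


# Example usage
rotor_quant = RotorQuant(128, 3)
vector = np.random.rand(128)
codes, scale = rotor_quant.quantize(vector)
dequantized_vector = rotor_quant.dequantize(codes, scale)
cosine = vector @ dequantized_vector / (
    np.linalg.norm(vector) * np.linalg.norm(dequantized_vector)
)
print(f"cosine similarity: {cosine:.4f}")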

Verification

To verify the RotorQuant implementation, run it and the TurboQuant implementation on the same KV vectors and compare the reconstructions. Measure the cosine similarity between the original and dequantized vectors, and also the attention cosine similarity against the full-precision cache, which the benchmarks above put at about 0.99 for 3-bit.
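One way to run that comparison is sketched below. The metric definitions are assumptions: the cosine similarity is taken between the two softmax attention distributions, and "top-k" is read as "the full-precision top-1 key still appears in the approximate top-k". The RotorQuant report may define these differently, and the function names are illustrative only. The noisy keys stand in for a dequantized cache.

```python
import numpy as np


def attention_weights(query, keys):
    """Softmax attention of one query over all keys (scaled dot product)."""
    scores = keys @ query / np.sqrt(query.shape[0])
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()


def compare_attention(query, keys, keys_hat, top_k=5):
    """Compare attention computed from original vs. reconstructed keys."""
    ref = attention_weights(query, keys)
    approx = attention_weights(query, keys_hat)
    cosine = float(ref @ approx / (np.linalg.norm(ref) * np.linalg.norm(approx)))
    top1_match = bool(ref.argmax() == approx.argmax())
    # Assumed definition: reference top-1 key is inside the approximate top-k.
    topk_match = bool(ref.argmax() in np.argsort(approx)[-top_k:])
    return cosine, top1_match, topk_match


rng = np.random.default_rng(0)
query = rng.standard_normal(128)
keys = rng.standard_normal((256, 128))
# Stand-in for a dequantized cache: original keys plus small noise.
keys_hat = keys + 0.05 * rng.standard_normal(keys.shape)

cosine, top1, top5 = compare_attention(query, keys, keys_hat)
print(f"attention cosine = {cosine:.4f}, top-1 match = {top1}, top-5 match = {top5}")
```

Averaging these three numbers over many queries, heads and layers gives figures comparable in spirit to the Cosine Sim, Top-1 and Top-5 columns above.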

Extra Tips

Initialize the rotors with random values, normalize each one to unit norm so that the sandwich product is a pure rotation, and use ceil(d / 3) rotors for d dimensions (43 for d=128).
Experiment with different numbers of bits for quantization to find the best trade-off between compression and attention fidelity.
