vllm - 💡(How to fix) Fix [Feature]: Add Rotorquant support [2 comments, 3 participants]

GitHub stats
vllm-project/vllm#38291 · Fetched 2026-04-08 01:36:46
Comments: 2 · Participants: 3 · Timeline: 32 · Reactions: 29
Timeline (top): subscribed ×28, commented ×2, labeled ×1, renamed ×1

🚀 The feature, motivation and pitch

RotorQuant is a Clifford algebra-based reimagining of TurboQuant (ICLR 2026) for KV cache compression. It replaces the dense d×d random orthogonal rotation matrix used in TurboQuant with lightweight Clifford rotors from Cl(3,0), achieving 10–19× faster quantization on NVIDIA GPUs, 9–31× faster on Apple Silicon, and using 44× fewer parameters — all while matching TurboQuant's attention fidelity on real models.

Why this matters for vLLM

KV cache memory is the primary bottleneck for long-context serving. At 8K tokens on Qwen2.5-3B (36 layers), the KV cache is ~289 MB in FP16. RotorQuant compresses this to ~58 MB at 3-bit (5× compression) with 99.0% attention cosine similarity — comparable to TurboQuant (99.5%) but with dramatically lower computational overhead for the rotation step.
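The FP16 figure can be checked with back-of-envelope arithmetic. The sketch below assumes Qwen2.5-3B stores 2 KV heads of dimension 128 per layer (grouped-query attention); those two values and the helper name `kv_cache_bytes` are assumptions, not taken from the issue. It counts raw payload bits only and ignores per-vector metadata such as scales, so the 3-bit result is not expected to match the ~58 MB quoted above exactly.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Bytes needed to store keys and values for `tokens` positions."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits / 8


MIB = 2 ** 20
fp16 = kv_cache_bytes(8192, 36, 2, 128, 16)
int3 = kv_cache_bytes(8192, 36, 2, 128, 3)
print(f"FP16:  {fp16 / MIB:.0f} MiB")   # FP16:  288 MiB
print(f"3-bit: {int3 / MIB:.0f} MiB")   # 3-bit: 54 MiB
```

Under these assumptions the FP16 cache comes to 288 MiB, which is in line with the ~289 MB stated in the issue.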

The core advantage for a serving engine like vLLM is the quantize/dequantize speed: TurboQuant's rotation requires a 128×128 dense matmul (16,384 FMAs per vector), while RotorQuant's rotor sandwich product uses only ~100 FMAs per vector. This translates directly to lower latency on the KV cache write path.
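For reference, the dense rotation that the issue attributes to TurboQuant can be sketched in a few lines of NumPy. Drawing the orthogonal matrix from the QR decomposition of a Gaussian matrix is one common construction and an assumption here; the issue does not say how TurboQuant generates it.

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)

# QR of a Gaussian matrix yields a random orthogonal matrix Q (d x d).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
y = Q @ x  # one dense matrix-vector product per KV vector

dense_fmas = d * d  # one multiply-add per matrix entry
print(dense_fmas)                                        # 16384
print(np.isclose(np.linalg.norm(x), np.linalg.norm(y)))  # True
```

The rotation preserves the vector norm, and its cost grows quadratically with the head dimension, which is the overhead RotorQuant aims to avoid.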

How RotorQuant works

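  1. Chunk the d-dimensional KV vector into groups of 3 dimensions (43 groups for d=128, since ⌈128/3⌉ = 43; the last group holds only 2 coordinates and needs padding or separate handling).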
  2. Embed each 3D chunk as a Cl(3,0) multivector (8 components).
  3. Apply rotor sandwich product RxR̃ per group to decorrelate coordinates. Each rotor has only 4 non-zero components (scalar + 3 bivectors), making the geometric product extremely sparse.
  4. Quantize each coordinate with grade-aware Lloyd-Max scalar quantization (separate codebooks for scalar vs. bivector grades).
  5. Apply QJL residual correction (1-bit per dimension) for unbiased inner product estimation, same as TurboQuant Stage 2.
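Steps 1 to 3 can be illustrated with a brute-force Cl(3,0) geometric product. This is a minimal sketch for checking the algebra, not the fused kernel the issue benchmarks: it multiplies all 8×8 component pairs and does not exploit rotor sparsity. The bitmask blade indexing, the zero-padding of the last group, and all function names are assumptions made for illustration.

```python
import numpy as np

# Basis blades of Cl(3,0) indexed by bitmask: bit 0 = e1, bit 1 = e2, bit 2 = e3.
# Index 0 is the scalar, 1/2/4 are vectors, 3/5/6 are bivectors, 7 is e123.
VECTOR_IDX = [1, 2, 4]
BIVECTOR_IDX = [3, 5, 6]  # e12, e13, e23


def blade_sign(a, b):
    """Sign from reordering the basis vectors of blade a times blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1.0 if swaps % 2 else 1.0


def geometric_product(x, y):
    """Full geometric product of two multivectors; basis vectors square to +1."""
    out = np.zeros(8)
    for a in range(8):
        for b in range(8):
            out[a ^ b] += blade_sign(a, b) * x[a] * y[b]
    return out


def make_rotor(rng):
    """Random unit rotor: scalar + 3 bivector components, 4 non-zeros in total."""
    coeffs = rng.standard_normal(4)
    coeffs /= np.linalg.norm(coeffs)
    rotor = np.zeros(8)
    rotor[0] = coeffs[0]
    rotor[BIVECTOR_IDX] = coeffs[1:]
    return rotor


def reverse(rotor):
    """Reversion flips the sign of the bivector part of a rotor."""
    rev = rotor.copy()
    rev[BIVECTOR_IDX] *= -1.0
    return rev


def sandwich(rotor, chunk):
    """Embed a 3D chunk as a multivector, apply R x ~R, read the vector back."""
    x = np.zeros(8)
    x[VECTOR_IDX] = chunk
    out = geometric_product(geometric_product(rotor, x), reverse(rotor))
    return out[VECTOR_IDX]


rng = np.random.default_rng(0)
d = 128
groups = -(-d // 3)                    # ceil(128 / 3) = 43
padded = np.zeros(3 * groups)
padded[:d] = rng.standard_normal(d)    # the last group is zero-padded
chunks = padded.reshape(groups, 3)

rotors = [make_rotor(rng) for _ in range(groups)]
rotated = np.stack([sandwich(r, c) for r, c in zip(rotors, chunks)])
# Sandwiching with the reversed rotor undoes the rotation.
restored = np.stack([sandwich(reverse(r), c) for r, c in zip(rotors, rotated)])
```

Because each rotor has unit norm, every group keeps its length under the sandwich product, and applying the reversed rotor recovers the original chunk up to floating-point error.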

Benchmarks (from official website)

CUDA fused kernel speed (RTX PRO 4000, d=128, 3-bit):

n_vectors  TurboQuant  RotorQuant CUDA  Speedup
1,024      69 μs       6 μs             11×
4,096      132 μs      12 μs            11×
8,192      285 μs      20 μs            14×
16,384     740 μs      39 μs            19×

Real model validation (Qwen2.5-3B-Instruct KV cache):

Context  Bits   Method      Cosine Sim  Top-1  Top-5
2K       3-bit  TurboQuant  0.9906      81.2%  93.8%
2K       3-bit  RotorQuant  0.9903      81.2%  93.8%
4K       4-bit  TurboQuant  0.9880      75.0%  93.8%
4K       4-bit  RotorQuant  0.9874      81.2%  93.8%

RotorQuant matches or exceeds TurboQuant on top-k retrieval accuracy at 4K context, despite slightly lower cosine similarity on synthetic benchmarks. The Clifford rotor decorrelation appears to better preserve directional structure in real attention heads.
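The issue does not define how Top-1 and Top-5 are computed. One plausible reading, sketched below purely as an assumption, is: for each query, check whether the key with the highest exact attention score still appears among the top k scores after the keys are compressed. The additive noise is a stand-in for quantization error, and `topk_agreement` is a hypothetical helper.

```python
import numpy as np


def topk_agreement(query, keys, keys_hat, k):
    """True if the key with the highest exact score is in the approximate top-k."""
    exact = keys @ query
    approx = keys_hat @ query
    best = int(np.argmax(exact))
    topk = np.argsort(approx)[-k:]
    return bool(best in topk)


rng = np.random.default_rng(0)
keys = rng.standard_normal((2048, 128))
keys_hat = keys + 0.05 * rng.standard_normal(keys.shape)  # stand-in for quantization error
queries = rng.standard_normal((16, 128))

top1 = np.mean([topk_agreement(q, keys, keys_hat, 1) for q in queries])
top5 = np.mean([topk_agreement(q, keys, keys_hat, 5) for q in queries])
print(f"top-1: {top1:.1%}, top-5: {top5:.1%}")
```

Under this definition Top-5 can never be lower than Top-1, which is consistent with the table above.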

Parameter efficiency:

Method      Params (d=128)  Params (d=4096)
TurboQuant  16,399          ~16.7M
RotorQuant  372             ~11K

Alternatives

  • TurboQuant (#38171): The baseline this improves upon. Higher raw MSE fidelity on synthetic data, but slower rotation step and 44× more parameters. RotorQuant could be added as a drop-in alternative sharing the same QJL and Lloyd-Max infrastructure.

Additional context

  • Pope, J.D. (2026). "RotorQuant: Clifford Algebra Vector Quantization for LLM KV Cache Compression." https://www.scrya.com/rotorquant/
  • Zandieh et al. (2026). "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To integrate RotorQuant into the existing vLLM framework, follow these steps:

  • Replace the dense orthogonal rotation matrix in TurboQuant with the Clifford rotor implementation from RotorQuant.
  • Update the quantization and dequantization logic to use the grade-aware Lloyd-Max scalar quantization.
  • Apply the QJL residual correction for unbiased inner product estimation.
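The QJL step in the last bullet can be sketched as a sign-of-Gaussian-projection estimator: keep one sign bit per projection of the residual plus the residual norm, then rescale by sqrt(π/2) when estimating an inner product. Whether TurboQuant and RotorQuant use exactly this scaling and projection count is an assumption here; the function names are hypothetical.

```python
import numpy as np


def qjl_encode(residual, S):
    """Keep one sign bit per projection plus the residual norm."""
    return np.sign(S @ residual), np.linalg.norm(residual)


def qjl_inner_product(query, signs, norm, S):
    """Estimate <query, residual> from the stored sign bits."""
    m = S.shape[0]
    return np.sqrt(np.pi / 2) * norm / m * (signs @ (S @ query))


rng = np.random.default_rng(0)
d = 128
S = rng.standard_normal((d, d))          # m = d projections: 1 bit per dimension
residual = 0.1 * rng.standard_normal(d)  # stand-in for a key minus its dequantized value
query = rng.standard_normal(d)

signs, norm = qjl_encode(residual, S)
estimate = qjl_inner_product(query, signs, norm, S)
exact = query @ residual
print(f"exact: {exact:.3f}, estimate: {estimate:.3f}")
```

The estimate is unbiased over the random draw of `S`, but with only d projections a single estimate is noisy; the error shrinks as the number of projections grows.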

Code Changes

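import numpy as np


class CliffordRotor:
    """Unit rotor in Cl(3,0): one scalar plus three bivector components."""

    def __init__(self, scalar, bivectors):
        bivectors = np.asarray(bivectors, dtype=float)
        # Normalize so that R * ~R = 1; only then is R x ~R a pure rotation.
        norm = np.sqrt(scalar ** 2 + bivectors @ bivectors)
        self.scalar = scalar / norm
        self.bivectors = bivectors / norm

    def apply(self, vector, inverse=False):
        # Rotor sandwich product R x ~R on a 3D vector, written in the
        # equivalent unit-quaternion form: x' = x + w*t + u × t, t = 2 (u × x).
        # Mapping bivector components to the axis u fixes a sign convention
        # (an assumption here); negating u gives the inverse rotation.
        u = -self.bivectors if inverse else self.bivectors
        t = 2.0 * np.cross(u, vector)
        return vector + self.scalar * t + np.cross(u, t)


class RotorQuant:
    def __init__(self, num_dimensions, num_bits, seed=0):
        self.num_dimensions = num_dimensions
        self.num_bits = num_bits
        # ceil(d / 3) groups, so d=128 gives 43; the last group is zero-padded.
        self.num_groups = -(-num_dimensions // 3)
        rng = np.random.default_rng(seed)
        self.rotors = [
            CliffordRotor(rng.standard_normal(), rng.standard_normal(3))
            for _ in range(self.num_groups)
        ]
        # Placeholder: uniform levels on [-1, 1]. The issue calls for
        # grade-aware Lloyd-Max codebooks, which are not implemented here.
        self.levels = np.linspace(-1.0, 1.0, 2 ** num_bits)

    def quantize(self, vector):
        # Chunk the (zero-padded) vector into groups of 3 dimensions
        padded = np.zeros(3 * self.num_groups)
        padded[: self.num_dimensions] = vector
        chunks = padded.reshape(self.num_groups, 3)

        # Apply the rotor sandwich product to each chunk
        rotated = np.stack([r.apply(c) for r, c in zip(self.rotors, chunks)])

        # Scale into [-1, 1], then store the index of the nearest level
        scale = max(np.abs(rotated).max(), 1e-12)
        distances = np.abs(rotated.ravel()[:, None] / scale - self.levels)
        return distances.argmin(axis=1).astype(np.uint8), scale

    def dequantize(self, indices, scale):
        # Look up the levels and undo the per-vector scaling
        rotated = (self.levels[indices] * scale).reshape(self.num_groups, 3)

        # Undo the rotation with the reversed rotor, then drop the padding
        chunks = np.stack(
            [r.apply(c, inverse=True) for r, c in zip(self.rotors, rotated)]
        )
        # The QJL residual correction (1 bit per dimension) is omitted here.
        return chunks.ravel()[: self.num_dimensions]


# Example usage
rotor_quant = RotorQuant(128, 3)
vector = np.random.default_rng(1).standard_normal(128)
indices, scale = rotor_quant.quantize(vector)
dequantized_vector = rotor_quant.dequantize(indices, scale)
cosine = vector @ dequantized_vector / (
    np.linalg.norm(vector) * np.linalg.norm(dequantized_vector)
)
print(f"cosine similarity: {cosine:.4f}")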

Verification

To verify the RotorQuant implementation, run the same KV vectors through it and through the TurboQuant implementation, then measure the cosine similarity between the original and dequantized vectors for each. At 3-bit, the issue reports about 0.99 for both methods on Qwen2.5-3B-Instruct, so a much lower value points to a bug.
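A minimal sketch of the cosine-similarity check follows. The uniform round-trip quantizer is only a stand-in so the snippet runs on its own; in practice the round trip would go through the RotorQuant and TurboQuant implementations under test. Both function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def uniform_roundtrip(x, num_bits):
    """Stand-in quantizer: snap to 2**num_bits evenly spaced levels."""
    lo, hi = x.min(), x.max()
    steps = 2 ** num_bits - 1
    codes = np.round((x - lo) / (hi - lo) * steps)
    return codes / steps * (hi - lo) + lo


rng = np.random.default_rng(0)
x = rng.standard_normal(128)
for bits in (2, 3, 4):
    similarity = cosine_similarity(x, uniform_roundtrip(x, bits))
    print(f"{bits}-bit: {similarity:.4f}")
```

Averaging the similarity over many vectors from real KV caches, not a single random vector, gives a number comparable to the tables in the issue.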

Extra Tips

  • Initialize each rotor with random values and normalize it to unit norm so that the sandwich product is a pure rotation; the number of rotors is ⌈d/3⌉ (43 for d=128).
  • Experiment with different numbers of bits for quantization to find the best trade-off between compression and attention fidelity.
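To get a feel for the bit-width trade-off, the sketch below trains a plain 1D Lloyd-Max codebook on Gaussian samples and reports the mean squared error at several bit widths. It is not grade-aware and uses synthetic data, so it only illustrates the trend; the function names and the quantile initialization are assumptions.

```python
import numpy as np


def lloyd_max_levels(samples, num_bits, iters=50):
    """Alternate nearest-level assignment and centroid updates (1D Lloyd-Max)."""
    k = 2 ** num_bits
    levels = np.quantile(samples, (np.arange(k) + 0.5) / k)  # sorted initial guess
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # decision boundaries
        cells = np.searchsorted(edges, samples)  # cell index for each sample
        for j in range(k):
            members = samples[cells == j]
            if members.size:
                levels[j] = members.mean()
    return levels


def snap(samples, levels):
    """Replace each sample by its nearest codebook level."""
    edges = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(edges, samples)]


rng = np.random.default_rng(0)
samples = rng.standard_normal(20000)
mse = {}
for bits in (2, 3, 4):
    levels = lloyd_max_levels(samples, bits)
    mse[bits] = float(np.mean((samples - snap(samples, levels)) ** 2))
    print(f"{bits}-bit MSE: {mse[bits]:.4f}")
```

Each extra bit lowers the error but adds storage per coordinate, so the useful operating point depends on how much attention fidelity the workload can give up.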
