vllm - ✅(Solved) Fix [Bug]: TurboQuant fails on non-power-of-2 head_dim (Phi-2, MSE-K presets) [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#41413 (fetched 2026-05-01 05:33:45)
Comments: 1 · Participants: 2 · Timeline events: 14 · Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, project_v2_item_status_changed ×2, added_to_project_v2 ×1

Error Message

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Root Cause

_build_hadamard_cached(d) doubles H until H.shape[0] >= d then normalizes by sqrt(d). For d=80 it overshoots to 128×128 and divides by sqrt(80) — wrong size, not orthonormal at that size. _ensure_on_device stores it as layer._tq_PiT, and the MSE-K rotation GEMM hits a shape mismatch on q @ PiT.

Fix Action

PR fixes the bug by padding to next_power_of_2(head_dim) in WHT space and slicing back at the I/O boundary. Pow-2 head_dim is byte-identical to upstream (verified with PPL on Qwen3-8B).

cc @vibhavagarwal5

PR fix notes

PR #41414: [Bugfix][Attention][TurboQuant] Pad head_dim to power-of-2 for WHT

Description (problem / solution / changelog)

Purpose

Fix a latent correctness bug in the TurboQuant rotation path for models with non-power-of-2 head_dim. Reproduced on microsoft/phi-2 (head_dim=80) with turboquant_4bit_nc:

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Root cause

_build_hadamard_cached(d) constructs the Sylvester Hadamard by doubling H until H.shape[0] >= d, then normalizes by sqrt(d):

H = torch.tensor([[1.0]])
while H.shape[0] < d:
    H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
return (H / math.sqrt(d)).to(...)

For d=80 the loop overshoots to 128×128 and the result is normalized by 1/sqrt(80) — the matrix is the wrong size and not orthonormal at the constructed size. _ensure_on_device stores this 128×128 tensor as layer._tq_PiT; the MSE-K decode/store kernels then attempt q @ PiT with q at width 80 and PiT at 128×128.
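
The overshoot can be reproduced without torch. The following is a minimal sketch that mirrors the doubling loop quoted above using plain Python lists; the function names are hypothetical and this is not the vLLM implementation. It shows that a request for d=80 yields a 128×128 matrix whose rows have squared norm 128/80 (about 1.6) instead of 1, so the matrix is not orthonormal.

```python
import math


def build_hadamard_like_upstream(d):
    """Doubling loop as quoted above, on nested lists instead of tensors."""
    H = [[1.0]]
    while len(H) < d:
        top = [row + row for row in H]                      # [H,  H]
        bottom = [row + [-x for x in row] for row in H]     # [H, -H]
        H = top + bottom
    scale = math.sqrt(d)  # normalizes by the requested d, not the built size
    return [[x / scale for x in row] for row in H]


def row0_squared_norm(M):
    """Squared norm of the first row; equals 1.0 for an orthonormal matrix."""
    return sum(x * x for x in M[0])


PiT_80 = build_hadamard_like_upstream(80)
PiT_128 = build_hadamard_like_upstream(128)

print(len(PiT_80), len(PiT_80[0]))   # size built for d=80
print(row0_squared_norm(PiT_80))     # close to 128/80
print(row0_squared_norm(PiT_128))    # close to 1.0
```

For a power-of-2 d the built size and the normalizer agree, which is why the bug stays latent on the common head_dim values.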

The bug is path-specific:

  • MSE-K presets (turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc) call the rotation GEMM and crash at engine init.
  • FP8-K (turboquant_k8v4) bypasses the WHT entirely (in-kernel FP8 cast, no rotation), so the broken matrix is built but never multiplied — the model loads but wastes VRAM on the unused buffer.

Summary of changes

  • Add padded_head_dim = next_power_of_2(head_dim) and needs_padding to TurboQuantConfig. Pow-2 head_dim is identity.
  • On the MSE-K path, run the WHT in padded_head_dim space throughout: zero-pad K and V at the kernel-launch boundary, run store/decode/continuation kernels with D=padded_head_dim, and slice the decode output back to head_dim before returning. Padded V columns hold zero quantization indices and contribute nothing to the reduction.
  • FP8-K path is untouched (raw head_dim, no rotation, kernel masks non-pow-2 loads directly).
  • For pow-2 head_dim (the common case: 64, 128, 256 — every current Qwen3, Llama, Mistral target), padded_head_dim == head_dim and every code path reduces to the prior behavior. Byte-counts in key_packed_size / value_packed_size are bitwise-identical.
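
The pad, rotate, slice pattern described in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the PR's code: `next_power_of_2` is named in the PR but its body here is my own, and `wht`, `rotate_padded`, and `unrotate_and_slice` are hypothetical helpers. Quantization is omitted, so the round trip is exact up to floating-point error.

```python
import math


def next_power_of_2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def wht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = math.sqrt(n)
    return [x / scale for x in out]


def rotate_padded(x):
    """Zero-pad x to padded_head_dim, then rotate in the padded space."""
    padded_head_dim = next_power_of_2(len(x))
    return wht(list(x) + [0.0] * (padded_head_dim - len(x)))


def unrotate_and_slice(y, head_dim):
    """The orthonormal WHT is its own inverse; slice back to head_dim afterwards."""
    return wht(y)[:head_dim]


head_dim = 80
x = [float(i % 7) - 3.0 for i in range(head_dim)]
y = rotate_padded(x)                      # width 128
x_back = unrotate_and_slice(y, head_dim)  # width 80 again
max_err = max(abs(a - b) for a, b in zip(x, x_back))
print(len(y), len(x_back), max_err)
```

Because the padded columns are zero on input and the transform preserves norms, the rotated vector carries the same energy as the original 80-wide vector.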

Duplicate-work check

Searched open PRs and issues touching TurboQuant + head_dim / Hadamard / Phi-2 / non-power-of-2 before opening this PR. The closest references are:

  • Tracking issue #40069 (TurboQuant follow-ups) — does not list head_dim padding.
  • PR #39890 (erhan1209) adds new "official" 3-bit/4-bit grouped TQ presets but does not change _build_hadamard_cached or address non-pow-2 dim.
  • PR #40792 (hoseung2) optimizes k8v4 decode with GQA head grouping — orthogonal kernel optimization, unaffected by this fix.

No open PR addresses the rotation-shape mismatch on non-pow-2 head_dim. Issue #41413 was filed alongside this PR.

Test Plan / Results

Tested on AMD MI300X (gfx942), ROCm 7.2, vLLM ROCm 7.2.1 wheels.

Bug reproduction (Phi-2 d=80 + turboquant_4bit_nc):

Command (control + treatment):

python3 -c "
import os; os.environ['VLLM_ROCM_USE_AITER_FP4BMM']='0'
from vllm import LLM, SamplingParams
llm = LLM(model='microsoft/phi-2', dtype='bfloat16',
          kv_cache_dtype='turboquant_4bit_nc', max_model_len=2048,
          gpu_memory_utilization=0.40)
print(llm.generate(['The capital of France is'],
                   SamplingParams(max_tokens=32, temperature=0.0))[0].outputs[0].text)
"
Branch | Result
upstream main c2fb01331 | RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)
this PR | Paris.

No regression — Qwen3-8B (d=128) 3-chunk PPL on wikitext-2-raw/wiki.test.raw @ 8K:

Command:

python3 -c "
import os, math
os.environ['VLLM_ROCM_USE_AITER_FP4BMM']='0'
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen3-8B', dtype='bfloat16', max_model_len=8192,
          kv_cache_dtype='<preset>', gpu_memory_utilization=0.40,
          enable_prefix_caching=False, max_num_batched_tokens=512)
tok = llm.get_tokenizer()
ids = tok.encode(open('wiki.test.raw').read(), add_special_tokens=False)
chunks = [ids[i:i+8191] for i in range(0, len(ids)-8191, 8191)][:3]
total_lp, total_tok = 0.0, 0
sp = SamplingParams(max_tokens=1, temperature=0.0, prompt_logprobs=1)
for ch in chunks:
    o = llm.generate({'prompt_token_ids': ch}, sp, use_tqdm=False)[0]
    for i, lp_dict in enumerate(o.prompt_logprobs[1:]):
        if lp_dict and ch[i+1] in lp_dict:
            total_lp += lp_dict[ch[i+1]].logprob
            total_tok += 1
print(math.exp(-total_lp/total_tok))
"
Preset | upstream main | this PR | Δ
turboquant_k8v4 | 7.8630 | 7.8630 | 0
turboquant_4bit_nc | 7.9041 | 7.9041 | 0

Both PPLs are byte-identical, same token count (24570).
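
The reported token count can be checked against the chunking in the command above. The sketch below assumes every scored position returns a logprob for the actual next token, so no position is skipped by the `if` guard; the `perplexity` helper is a hypothetical name for the final `math.exp(-total_lp/total_tok)` step.

```python
import math

chunk_len = 8191                    # tokens per chunk, from ids[i:i+8191]
num_chunks = 3                      # from the [:3] slice
scored_per_chunk = chunk_len - 1    # prompt_logprobs[1:] skips the first token
total_tok = num_chunks * scored_per_chunk


def perplexity(total_logprob, token_count):
    """exp of the negative mean log-probability per scored token."""
    return math.exp(-total_logprob / token_count)


print(total_tok)  # 24570
```

Matching token counts on both branches means the two PPL values are averaged over the same positions.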

Unit tests on this PR:

python3 -m pytest tests/quantization/test_turboquant.py -v

130/130 passed in 28.34s.

Includes 14 new tests for the non-pow-2 head_dim path:

  • padded_head_dim is identity for pow-2 head_dim (64, 128, 256)
  • non-pow-2 head_dim rounds up correctly (80→128, 96→128, 192→256, 40→64)
  • MSE preset at head_dim=80: key_packed_size=66, value_packed_size=68 (sized to padded 128)
  • FP8 preset at head_dim=80: key_packed_size=80, value_packed_size=44 (head_dim-sized, FP8 path)
  • Store + decode round-trip across {turboquant_k8v4, turboquant_4bit_nc} × {80, 96}: cosine similarity vs the stored V passes the same thresholds as the pow-2 case (>0.95 FP8, >0.85 MSE) and the returned tensor is sliced back to head_dim
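
The packed sizes quoted in the tests above are consistent with a simple bytes-per-vector calculation. The decomposition below is an inference from the arithmetic, not something the source states: it assumes 2 bytes of per-vector metadata for MSE keys, 4 bytes for 4-bit values, and none for FP8 keys. `packed_bytes` is a hypothetical helper, not a vLLM function.

```python
def next_power_of_2(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def packed_bytes(width, bits, overhead_bytes):
    """Bytes for `width` elements at `bits` each, plus assumed per-vector metadata."""
    return (width * bits + 7) // 8 + overhead_bytes


head_dim = 80
padded_head_dim = next_power_of_2(head_dim)

# Overhead values are assumptions chosen to match the numbers in the test list.
mse_key = packed_bytes(padded_head_dim, 4, 2)     # MSE preset, sized to padded width
mse_value = packed_bytes(padded_head_dim, 4, 4)
fp8_key = packed_bytes(head_dim, 8, 0)            # FP8 path, sized to raw head_dim
fp8_value = packed_bytes(head_dim, 4, 4)

print(mse_key, mse_value, fp8_key, fp8_value)  # 66 68 80 44
```

Under these assumptions, padding costs the MSE path 24 extra payload bytes per key at head_dim=80 (64 instead of 40), while the FP8 path keeps head_dim-sized buffers.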

AI assistance

This PR was prepared with AI assistance (Anthropic Claude). Each line of the diff was reviewed by the human submitter, the bug reproduction was run on the human's hardware (AMD MI300X dev cloud), and the no-regression PPL numbers are from runs the human supervised. Commits carry a Co-authored-by: Claude trailer per AGENTS.md.

Fixes #41413.

cc @vibhavagarwal5

Changed files

  • tests/quantization/test_turboquant.py (modified, +160/-3)
  • vllm/model_executor/layers/quantization/turboquant/config.py (modified, +37/-7)
  • vllm/v1/attention/backends/turboquant_attn.py (modified, +40/-13)
  • vllm/v1/attention/ops/triton_turboquant_decode.py (modified, +29/-8)
  • vllm/v1/attention/ops/triton_turboquant_store.py (modified, +34/-12)


Reproduction

from vllm import LLM, SamplingParams
llm = LLM(model="microsoft/phi-2", dtype="bfloat16",
          kv_cache_dtype="turboquant_4bit_nc",
          max_model_len=2048)
llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))

Engine init crashes:

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Affected presets / models

  • Affected: turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc on any non-power-of-2 head_dim
  • Not affected: turboquant_k8v4 — FP8-K bypasses the WHT (in-kernel FP8 cast), so the broken matrix is built but never multiplied. The model loads, but the PiT buffer is wasted VRAM.
  • Models with non-pow-2 head_dim: Phi-2 (d=80) is the canonical example. Most modern LLMs (Qwen3 4B/8B/14B/30B/235B, Llama, Mistral, Gemma 4) use head_dim=128 and are unaffected.

Environment

  • Reproduced on AMD MI300X, ROCm 7.2, vLLM c2fb01331 (main as of 2026-04-30)
  • Should reproduce on any platform — bug is in the platform-independent rotation path

extent analysis

TL;DR

The fix is to pad head_dim to the next power of 2 in WHT space and slice back at the I/O boundary, as implemented in PR #41414.

Guidance

  • Verify that the issue is caused by a non-power-of-2 head_dim by checking the model's configuration.
  • Check if the model is using one of the affected presets (turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc) and if the head_dim is not a power of 2.
  • Apply the fix by padding to next_power_of_2(head_dim) in WHT space and slicing back at the I/O boundary, as described in the PR.
  • Test the fix with a model that has a non-power-of-2 head_dim, such as Phi-2 (d=80).
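
The first two checks in the list above can be scripted. This is a minimal sketch that works on a plain dict of config values; the helper names are hypothetical, and it assumes head_dim is either given explicitly or derivable as hidden_size // num_attention_heads. The example values are chosen to give head_dim=80, as for Phi-2.

```python
AFFECTED_PRESETS = {"turboquant_4bit_nc", "turboquant_3bit_nc", "turboquant_k3v4_nc"}


def head_dim_from_config(cfg):
    """Prefer an explicit head_dim; otherwise derive it from hidden size and head count."""
    if cfg.get("head_dim"):
        return cfg["head_dim"]
    return cfg["hidden_size"] // cfg["num_attention_heads"]


def is_power_of_2(n):
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def hits_rotation_bug(cfg, kv_cache_dtype):
    """True when an MSE-K preset is combined with a non-power-of-2 head_dim."""
    return kv_cache_dtype in AFFECTED_PRESETS and not is_power_of_2(head_dim_from_config(cfg))


phi2_like = {"hidden_size": 2560, "num_attention_heads": 32}   # head_dim = 80
qwen_like = {"head_dim": 128}

print(hits_rotation_bug(phi2_like, "turboquant_4bit_nc"))  # True
print(hits_rotation_bug(phi2_like, "turboquant_k8v4"))     # False
print(hits_rotation_bug(qwen_like, "turboquant_4bit_nc"))  # False
```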

Example

The issue itself does not include a patch snippet. The fix is described in the linked PR and consists of padding head_dim in WHT space and slicing the output back to head_dim.

Notes

The fix should work on any platform, as the bug is in the platform-independent rotation path. However, it's essential to verify that the issue is caused by a non-power-of-2 head_dim and that the model is using one of the affected presets.

Recommendation

Apply the fix from PR #41414: pad to next_power_of_2(head_dim) in WHT space and slice back at the I/O boundary. It addresses the root cause, was verified on Phi-2 (head_dim=80), and leaves power-of-2 head_dim behavior unchanged.
