vllm - ✅(Solved) Fix [Bug]: TurboQuant fails on non-power-of-2 head_dim (Phi-2, MSE-K presets) [1 pull requests, 1 comments, 2 participants]

TheTom · 2026-04-30T20:44:41Z



GitHub stats

vllm-project/vllm#41413 • Fetched 2026-05-01 05:33:45


Author

TheTom

Participants

github-actions[bot]

TheTom

Timeline (top)

mentioned ×4, subscribed ×4, project_v2_item_status_changed ×2, added_to_project_v2 ×1

Error Message

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Root Cause

_build_hadamard_cached(d) doubles H until H.shape[0] >= d, then normalizes by sqrt(d). For d=80 it overshoots to 128×128 and divides by sqrt(80), so the matrix has the wrong size and is not orthonormal at that size (H·Hᵀ = (128/80)·I rather than I). _ensure_on_device stores it as layer._tq_PiT, and the MSE-K rotation GEMM then fails with a shape mismatch when a width-80 operand is multiplied by the 128×128 PiT.

Fix Action

Fix

PR #41414 fixes the bug by padding to next_power_of_2(head_dim) in WHT space and slicing back to head_dim at the I/O boundary. For power-of-2 head_dim the result is identical to upstream (verified with PPL on Qwen3-8B).

cc @vibhavagarwal5

PR fix notes

PR #41414: [Bugfix][Attention][TurboQuant] Pad head_dim to power-of-2 for WHT

Repository: vllm-project/vllm
Author: TheTom
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41414

Description (problem / solution / changelog)

Purpose

Fix a latent correctness bug in the TurboQuant rotation path for models with non-power-of-2 head_dim. Reproduced on microsoft/phi-2 (head_dim=80) with turboquant_4bit_nc:

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Root cause

_build_hadamard_cached(d) constructs the Sylvester Hadamard by doubling H until H.shape[0] >= d, then normalizes by sqrt(d):

H = torch.tensor([[1.0]])
while H.shape[0] < d:
    H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
return (H / math.sqrt(d)).to(...)

For d=80 the loop overshoots to 128×128 and the result is normalized by 1/sqrt(80) — the matrix is the wrong size and not orthonormal at the constructed size. _ensure_on_device stores this 128×128 tensor as layer._tq_PiT; the MSE-K decode/store kernels then attempt q @ PiT with q at width 80 and PiT at 128×128.
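The overshoot and the broken normalization can be reproduced without a GPU. The following is a minimal sketch that mirrors the doubling loop quoted above with nested Python lists instead of torch tensors; the function name `build_hadamard_buggy` is hypothetical and is not the vLLM helper itself.

```python
import math

def build_hadamard_buggy(d):
    """Sylvester doubling as in the quoted loop, normalized by sqrt(d) (hypothetical stand-in)."""
    H = [[1.0]]
    while len(H) < d:
        top = [row + row for row in H]                   # [H,  H]
        bottom = [row + [-x for x in row] for row in H]  # [H, -H]
        H = top + bottom
    scale = 1.0 / math.sqrt(d)  # normalizes by the requested d, not by the built size
    return [[x * scale for x in row] for row in H]

H = build_hadamard_buggy(80)
size = len(H)                            # 128, not 80
row_norm_sq = sum(x * x for x in H[0])   # about 128/80 = 1.6; an orthonormal row would give 1.0
print(size, row_norm_sq)
```

A squared row norm of roughly 1.6 instead of 1.0 shows that, even if the shapes had matched, the rotation would have scaled vectors rather than preserved their length.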

The bug is path-specific:

- MSE-K presets (turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc) call the rotation GEMM and crash at engine init.
- FP8-K (turboquant_k8v4) bypasses the WHT entirely (in-kernel FP8 cast, no rotation), so the broken matrix is built but never multiplied: the model loads but wastes VRAM on the unused buffer.

Summary of changes

- Add padded_head_dim = next_power_of_2(head_dim) and needs_padding to TurboQuantConfig. Pow-2 head_dim is identity.
- On the MSE-K path, run the WHT in padded_head_dim space throughout: zero-pad K and V at the kernel-launch boundary, run store/decode/continuation kernels with D=padded_head_dim, and slice the decode output back to head_dim before returning. Padded V columns hold zero quantization indices and contribute nothing to the reduction.
- FP8-K path is untouched (raw head_dim, no rotation, kernel masks non-pow-2 loads directly).
- For pow-2 head_dim (the common case: 64, 128, 256; every current Qwen3, Llama, Mistral target), padded_head_dim == head_dim and every code path reduces to the prior behavior. Byte-counts in key_packed_size / value_packed_size are bitwise-identical.
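The pad, rotate, slice sequence described in this list can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's kernel code: `next_power_of_2` is named in the PR, but the implementation here is a guess, and `fwht_orthonormal` is a hypothetical CPU stand-in for the Triton rotation.

```python
import math

def next_power_of_2(n):
    """Smallest power of two >= n, for n >= 1 (assumed behavior of the PR's helper)."""
    return 1 << (n - 1).bit_length()

def fwht_orthonormal(x):
    """Fast Walsh-Hadamard transform scaled by 1/sqrt(n); it is its own inverse."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

head_dim = 80
padded_head_dim = next_power_of_2(head_dim)            # 128
x = [float(i % 7) - 3.0 for i in range(head_dim)]      # arbitrary test vector
x_pad = x + [0.0] * (padded_head_dim - head_dim)       # zero-pad at the boundary
rotated = fwht_orthonormal(x_pad)                      # rotate in padded space
restored = fwht_orthonormal(rotated)[:head_dim]        # inverse rotate, slice back
max_err = max(abs(a - b) for a, b in zip(x, restored))
print(padded_head_dim, len(restored), max_err)
```

Because the transform is orthonormal at the padded size, zero padding does not change the vector's energy, and slicing after the inverse rotation recovers the original head_dim values up to floating-point error.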

Duplicate-work check

Searched open PRs and issues touching TurboQuant + head_dim / Hadamard / Phi-2 / non-power-of-2 before opening this PR. The closest references are:

- Tracking issue #40069 (TurboQuant follow-ups): does not list head_dim padding.
- PR #39890 (erhan1209) adds new "official" 3-bit/4-bit grouped TQ presets but does not change _build_hadamard_cached or address non-pow-2 dim.
- PR #40792 (hoseung2) optimizes k8v4 decode with GQA head grouping: an orthogonal kernel optimization, unaffected by this fix.

No open PR addresses the rotation-shape mismatch on non-pow-2 head_dim. Issue #41413 was filed alongside this PR.

Test Plan / Results

Tested on AMD MI300X (gfx942), ROCm 7.2, vLLM ROCm 7.2.1 wheels.

Bug reproduction (Phi-2 d=80 + turboquant_4bit_nc):

Command (control + treatment):

python3 -c "
import os; os.environ['VLLM_ROCM_USE_AITER_FP4BMM']='0'
from vllm import LLM, SamplingParams
llm = LLM(model='microsoft/phi-2', dtype='bfloat16',
          kv_cache_dtype='turboquant_4bit_nc', max_model_len=2048,
          gpu_memory_utilization=0.40)
print(llm.generate(['The capital of France is'],
                   SamplingParams(max_tokens=32, temperature=0.0))[0].outputs[0].text)
"

| Branch | Result |
|---|---|
| upstream main `c2fb01331` | `RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)` |
| this PR | ` Paris.` ✓ |

No regression — Qwen3-8B (d=128) 3-chunk PPL on wikitext-2-raw/wiki.test.raw @ 8K:

Command:

python3 -c "
import os, math
os.environ['VLLM_ROCM_USE_AITER_FP4BMM']='0'
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen3-8B', dtype='bfloat16', max_model_len=8192,
          kv_cache_dtype='<preset>', gpu_memory_utilization=0.40,
          enable_prefix_caching=False, max_num_batched_tokens=512)
tok = llm.get_tokenizer()
ids = tok.encode(open('wiki.test.raw').read(), add_special_tokens=False)
chunks = [ids[i:i+8191] for i in range(0, len(ids)-8191, 8191)][:3]
total_lp, total_tok = 0.0, 0
sp = SamplingParams(max_tokens=1, temperature=0.0, prompt_logprobs=1)
for ch in chunks:
    o = llm.generate({'prompt_token_ids': ch}, sp, use_tqdm=False)[0]
    for i, lp_dict in enumerate(o.prompt_logprobs[1:]):
        if lp_dict and ch[i+1] in lp_dict:
            total_lp += lp_dict[ch[i+1]].logprob
            total_tok += 1
print(math.exp(-total_lp/total_tok))
"

| Preset | upstream main | this PR | Δ |
|---|---|---|---|
| `turboquant_k8v4` | 7.8630 | 7.8630 | 0 |
| `turboquant_4bit_nc` | 7.9041 | 7.9041 | 0 |

For both presets the PPL is identical between upstream main and this PR, with the same token count (24570).

Unit tests on this PR:

python3 -m pytest tests/quantization/test_turboquant.py -v

→ 130/130 passed in 28.34s.

Includes 14 new tests for the non-pow-2 head_dim path:

- padded_head_dim is identity for pow-2 head_dim (64, 128, 256)
- non-pow-2 head_dim rounds up correctly (80→128, 96→128, 192→256, 40→64)
- MSE preset at head_dim=80: key_packed_size=66, value_packed_size=68 (sized to padded 128)
- FP8 preset at head_dim=80: key_packed_size=80, value_packed_size=44 (head_dim-sized, FP8 path)
- Store + decode round-trip across {turboquant_k8v4, turboquant_4bit_nc} × {80, 96}: cosine similarity vs the stored V passes the same thresholds as the pow-2 case (>0.95 FP8, >0.85 MSE) and the returned tensor is sliced back to head_dim
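The shape of the store + decode round-trip check can be sketched on the CPU. Everything below is an assumption made for illustration: the min-max uniform 4-bit quantizer is a stand-in rather than TurboQuant's actual codebook, and `fwht_orthonormal`, `round_trip`, and `cosine` are hypothetical helpers, not functions from tests/quantization/test_turboquant.py.

```python
import math
import random

def fwht_orthonormal(x):
    """Fast Walsh-Hadamard transform scaled by 1/sqrt(n); it is its own inverse."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def round_trip(x, bits=4):
    """Pad, rotate, quantize (stand-in uniform grid), dequantize, rotate back, slice."""
    head_dim = len(x)
    padded = 1 << (head_dim - 1).bit_length()
    rotated = fwht_orthonormal(x + [0.0] * (padded - head_dim))
    lo, hi = min(rotated), max(rotated)
    step = (hi - lo) / ((1 << bits) - 1) if hi > lo else 1.0
    indices = [round((v - lo) / step) for v in rotated]   # "store"
    dequant = [lo + i * step for i in indices]            # "decode"
    return fwht_orthonormal(dequant)[:head_dim]

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

results = {}
for head_dim in (80, 96):
    rng = random.Random(head_dim)                         # fixed seed per size
    x = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
    out = round_trip(x)
    results[head_dim] = (len(out), cosine(x, out))
    print(head_dim, results[head_dim])
```

The two properties checked mirror the test description: the returned vector has length head_dim rather than the padded width, and its cosine similarity with the input stays above the 0.85 threshold quoted for the MSE presets.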

AI assistance

This PR was prepared with AI assistance (Anthropic Claude). Each line of the diff was reviewed by the human submitter, the bug reproduction was run on the human's hardware (AMD MI300X dev cloud), and the no-regression PPL numbers are from runs the human supervised. Commits carry a Co-authored-by: Claude trailer per AGENTS.md.

Fixes #41413.

cc @vibhavagarwal5

Changed files

tests/quantization/test_turboquant.py (modified, +160/-3)
vllm/model_executor/layers/quantization/turboquant/config.py (modified, +37/-7)
vllm/v1/attention/backends/turboquant_attn.py (modified, +40/-13)
vllm/v1/attention/ops/triton_turboquant_decode.py (modified, +29/-8)
vllm/v1/attention/ops/triton_turboquant_store.py (modified, +34/-12)



Reproduction

from vllm import LLM, SamplingParams
llm = LLM(model="microsoft/phi-2", dtype="bfloat16",
          kv_cache_dtype="turboquant_4bit_nc",
          max_model_len=2048)
llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))

Engine init crashes:

File "vllm/v1/attention/backends/turboquant_attn.py", line 380, in do_kv_cache_update
    y = x_hat @ PiT
RuntimeError: mat1 and mat2 shapes cannot be multiplied (16384x80 and 128x128)

Root cause

Affected presets / models

- Affected: turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc on any non-power-of-2 head_dim
- Not affected: turboquant_k8v4. FP8-K bypasses the WHT (in-kernel FP8 cast), so the broken matrix is built but never multiplied. The model loads, but the PiT buffer is wasted VRAM.
- Models with non-pow-2 head_dim: Phi-2 (d=80) is the canonical example. Most modern LLMs (Qwen3 4B/8B/14B/30B/235B, Llama, Mistral, Gemma 4) use head_dim=128 and are unaffected.
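A quick way to tell whether a given model and preset combination falls into the affected set is sketched below. The helper names are hypothetical, and the Phi-2 values (hidden_size=2560, 32 attention heads) are assumed from its published configuration; some models set head_dim explicitly in their config, in which case that value should be used instead of the division.

```python
# Presets that run the rotation GEMM, as listed in this issue.
MSE_K_PRESETS = {"turboquant_4bit_nc", "turboquant_3bit_nc", "turboquant_k3v4_nc"}

def is_power_of_2(n):
    """True for 1, 2, 4, 8, ...; False for anything else."""
    return n > 0 and (n & (n - 1)) == 0

def hits_rotation_bug(head_dim, kv_cache_dtype):
    """Hypothetical check: an MSE-K preset combined with a non-power-of-2 head_dim."""
    return kv_cache_dtype in MSE_K_PRESETS and not is_power_of_2(head_dim)

# Phi-2 (assumed config values): hidden_size=2560, num_attention_heads=32.
phi2_head_dim = 2560 // 32
print(phi2_head_dim)                                           # 80
print(hits_rotation_bug(phi2_head_dim, "turboquant_4bit_nc"))  # True
print(hits_rotation_bug(phi2_head_dim, "turboquant_k8v4"))     # False
print(hits_rotation_bug(128, "turboquant_4bit_nc"))            # False
```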

Environment

- Reproduced on AMD MI300X, ROCm 7.2, vLLM c2fb01331 (main as of 2026-04-30)
- Should reproduce on any platform, since the bug is in the platform-independent rotation path

Fix

PR #41414 fixes the bug by padding to next_power_of_2(head_dim) in WHT space and slicing back to head_dim at the I/O boundary. For power-of-2 head_dim the result is identical to upstream (verified with PPL on Qwen3-8B).

cc @vibhavagarwal5

extent analysis

TL;DR

The proposed fix is to pad head_dim to the next power of 2 in WHT space and slice back to head_dim at the I/O boundary.

Guidance

- Confirm from the model's configuration that head_dim is not a power of 2.
- Confirm that kv_cache_dtype is one of the affected presets (turboquant_4bit_nc, turboquant_3bit_nc, turboquant_k3v4_nc).
- Apply the fix described in PR #41414: pad to next_power_of_2(head_dim) in WHT space and slice back at the I/O boundary.
- Test with a model that has a non-power-of-2 head_dim, such as Phi-2 (d=80).

Example

The fix itself is not reproduced here; see the PR #41414 diff for the WHT padding changes.

Notes

The fix should apply on any platform, because the bug is in the platform-independent rotation path. The reported tests were run on AMD MI300X with ROCm 7.2.

Recommendation

Apply the fix from PR #41414 (padding to next_power_of_2(head_dim) in WHT space and slicing back at the I/O boundary). It addresses the root cause and was verified on Phi-2 (head_dim=80). Note that the PR was still open and unmerged when this page was captured.
