vllm - ✅(Solved) Fix [Tracking issue]: TurboQuant/HIGGS Attention follow-ups [1 pull requests, 1 participants]

vllm • 2026-04-16 21:54:28


GitHub stats

vllm-project/vllm#40069•Fetched 2026-04-17 08:27:23


Author

mgoin

Participants

mgoin

Timeline (top)

cross-referenced ×2, subscribed ×2, labeled ×1, mentioned ×1

Fix Action

Fixed

Proposed fix (PR open, not yet merged at fetch time): [TurboQuant] enable FA3/FA4 for prefill paths (https://github.com/vllm-project/vllm/pull/40092)

PR fix notes

PR #40092: [TurboQuant] enable FA3/FA4 for prefill paths

Repository: vllm-project/vllm
Author: huangzhilin-hzl
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40092

Description (problem / solution / changelog)

Purpose

Resolves part of https://github.com/vllm-project/vllm/issues/40069 (Backend Coverage: extend flash_attn_varlen_func support to FA3/4).

Two issues fixed:

FA version passthrough: TurboQuant prefill paths call flash_attn_varlen_func without the fa_version kwarg, so on Hopper (SM90) the call defaults to FA2 instead of leveraging FA3, and on Blackwell (SM100) it misses FA4 entirely. The standard FlashAttention backend already detects and passes fa_version at init time; this PR aligns TurboQuant to the same pattern.
Mixed-backend assert fix: _get_sliding_window_configs() in flash_attn.py asserts all Attention layers are FlashAttentionImpl. When kv_cache_dtype_skip_layers routes some layers to a different backend (e.g. TurboQuant), this assert fails. Fixed by skipping non-FA layers, since they use their own metadata builders.
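The version passthrough described above can be sketched as follows. This is a minimal illustration, not the PR's code: `pick_fa_version`, `fake_flash_attn_varlen_func`, and `PrefillPath` are hypothetical stand-ins, and the mapping from compute capability to FA version is a simplifying assumption (the real detection may also depend on head size and build support).

```python
def pick_fa_version(sm_major: int) -> int:
    """Hypothetical, simplified mapping from compute capability to FA version."""
    if sm_major >= 10:   # Blackwell (SM100), assumed to use FA4
        return 4
    if sm_major == 9:    # Hopper (SM90), assumed to use FA3
        return 3
    return 2             # older architectures fall back to FA2


def fake_flash_attn_varlen_func(q, k, v, *, fa_version=2):
    """Stand-in for the attention call: omitting fa_version falls back to FA2."""
    return {"fa_version": fa_version, "num_tokens": len(q)}


class PrefillPath:
    """Hypothetical prefill path that detects the FA version once at init."""

    def __init__(self, sm_major: int):
        self.fa_version = pick_fa_version(sm_major)

    def forward(self, q, k, v):
        # The fix: forward the detected version instead of relying on the default.
        return fake_flash_attn_varlen_func(q, k, v, fa_version=self.fa_version)


hopper = PrefillPath(sm_major=9)
result = hopper.forward([0, 1, 2], [0, 1, 2], [0, 1, 2])
```

Detecting the version once at init and reusing it on every call mirrors the pattern the description attributes to the standard FlashAttention backend.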

Test Plan

# 1. Unit tests
python -m pytest tests/quantization/test_turboquant.py -v

# 2. GSM8K correctness eval (all 4 TQ presets)
python -m pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=tests/evals/gsm8k/configs/models-turboquant.txt

# 3. E2E inference with CUDAGraph (no enforce_eager, validates assert fix)
CUDA_VISIBLE_DEVICES=0 HF_HUB_OFFLINE=1 python -c "
from vllm import LLM, SamplingParams
for dtype in ['turboquant_k8v4', 'turboquant_3bit_nc']:
    llm = LLM(model='Qwen/Qwen3-4B', kv_cache_dtype=dtype,
              max_model_len=2048, gpu_memory_utilization=0.5)
    outputs = llm.generate(['What is 2+2?'], SamplingParams(max_tokens=32))
    print(f'{dtype}: {outputs[0].outputs[0].text[:80]}')
    del llm
"

Test Result

Hardware: NVIDIA H20 (SM90 / Hopper)

FA version detection

FA version for head_size=128: 3   (was: unspecified, defaulting to FA2)
FA version for head_size=256: 3

Unit tests

114 passed, 6 failed (pre-existing rotation matrix atol issues, unrelated)

Confirmed pre-existing: same 6 failures on unmodified code via git stash / re-run.

E2E inference with CUDAGraph (enforce_eager=False)

Validates both the FA3 passthrough and the assert fix (AOT schedule path is entered).

Preset	CUDAGraph Capture	Result
k8v4	51 piecewise + 51 full	PASSED
t3nc	51 piecewise + 51 full	PASSED
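The mixed-backend assert fix exercised by this run can be pictured with a small sketch. All classes here are hypothetical stand-ins rather than vLLM's real ones (only the name FlashAttentionImpl is taken from the PR description), and `collect_sliding_window_configs` is an illustrative helper, not the actual `_get_sliding_window_configs()` implementation.

```python
class FlashAttentionImpl:
    """Stand-in for a FlashAttention layer implementation."""

    def __init__(self, sliding_window=None):
        self.sliding_window = sliding_window


class TurboQuantImpl:
    """Hypothetical stand-in for a layer routed to a different backend."""


class Layer:
    """Hypothetical attention layer wrapper holding its backend impl."""

    def __init__(self, impl):
        self.impl = impl


def collect_sliding_window_configs(layers):
    """Collect sliding-window settings, skipping layers that are not FA layers."""
    configs = set()
    for layer in layers:
        if not isinstance(layer.impl, FlashAttentionImpl):
            # Previously an assert would fail here; non-FA layers are now
            # skipped because they use their own metadata builders.
            continue
        configs.add(layer.impl.sliding_window)
    return configs


layers = [
    Layer(FlashAttentionImpl(sliding_window=(127, 0))),
    Layer(TurboQuantImpl()),
    Layer(FlashAttentionImpl(sliding_window=None)),
]
configs = collect_sliding_window_configs(layers)
```

Skipping rather than asserting keeps the FlashAttention metadata path usable when only some layers are routed elsewhere, for example via kv_cache_dtype_skip_layers.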

GSM8K correctness eval (Qwen3-4B, 1319 questions, 5-shot)

Preset	Accuracy	Threshold	Result
k8v4 (FP8 key + 4-bit value)	-	>= 0.80	PASSED
t4nc (4-bit MSE + NC)	-	>= 0.80	PASSED
k3v4nc (3-bit key + 4-bit value + NC)	-	>= 0.78	PASSED
t3nc (3-bit all + NC)	0.7574	>= 0.75	PASSED

Note: t3nc failed in the batch run because zombie processes were still holding GPU memory; it passed when run alone.


Changed files

tests/evals/gsm8k/configs/Qwen3-4B-TQ-k3v4nc.yaml (modified, +1/-1)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-k8v4.yaml (modified, +1/-1)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-t3nc.yaml (modified, +1/-1)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-t4nc.yaml (modified, +1/-1)
vllm/v1/attention/backends/flash_attn.py (modified, +7/-2)
vllm/v1/attention/backends/turboquant_attn.py (modified, +7/-0)

extent analysis

TL;DR

Extend the TurboQuant/HIGGS KV cache attention backend so that its flash_attn_varlen_func calls use FA3/FA4 where available, and explore support for hybrid attention models.

Guidance

Review how the TurboQuant prefill paths call flash_attn_varlen_func and pass the fa_version kwarg, so that FA3 (Hopper, SM90) and FA4 (Blackwell, SM100) are used instead of the FA2 default.
Investigate the feasibility of supporting hybrid attention models, such as Qwen3.5, mamba+attention, or interleaved SWA, and measure their impact on the backend's performance.
Consider adding MLA support through a new attention backend to further improve the system's capabilities.
Evaluate the current presets (e.g., k8v4, t4nc, k3v4nc, t3nc) and perform long-context evaluations to inform the development of new presets and configuration defaults.

Example

No specific code snippet is provided due to the lack of technical details in the issue.

Notes

The provided issue seems to be a high-level overview of the tasks and features to be implemented or improved in the TurboQuant/HIGGS KV cache attention backend. Without more specific technical information, it's challenging to provide a detailed solution or code examples.

Recommendation

Starting point: extend the flash_attn_varlen_func calls to FA3/FA4 (addressed in part by PR #40092), then explore hybrid attention models to improve the backend's performance and capabilities.
