vllm - ✅(Solved) Fix [Tracking issue]: TurboQuant/HIGGS Attention follow-ups [1 pull request, 1 participant]

GitHub stats

vllm-project/vllm#40069 (fetched 2026-04-17 08:27:23)

  • Comments: 0
  • Participants: 1
  • Timeline events: 6
  • Reactions: 1
  • Timeline (top): cross-referenced ×2, subscribed ×2, labeled ×1, mentioned ×1

Fix Action

Fixed

PR fix notes

PR #40092: [TurboQuant] enable FA3/FA4 for prefill paths

Description (problem / solution / changelog)

Purpose

Resolves part of https://github.com/vllm-project/vllm/issues/40069 (Backend Coverage: extend flash_attn_varlen_func support to FA3/4).

Two issues fixed:

  1. FA version passthrough: TurboQuant prefill paths call flash_attn_varlen_func without the fa_version kwarg, so on Hopper (SM90) the call defaults to FA2 instead of leveraging FA3, and on Blackwell (SM100) it misses FA4 entirely. The standard FlashAttention backend already detects and passes fa_version at init time; this PR aligns TurboQuant to the same pattern.

  2. Mixed-backend assert fix: _get_sliding_window_configs() in flash_attn.py asserts all Attention layers are FlashAttentionImpl. When kv_cache_dtype_skip_layers routes some layers to a different backend (e.g. TurboQuant), this assert fails. Fixed by skipping non-FA layers, since they use their own metadata builders.
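The two fixes above can be sketched in isolation. This is a minimal illustration and not the code from the PR: every name below (`pick_fa_version`, `varlen_attention_stub`, `PrefillPathSketch`, the two `*Stub` classes, `collect_sliding_window_configs`) is a hypothetical stand-in, and the SM-to-version mapping is simplified from the description (SM90 uses FA3, SM100 uses FA4, anything else falls back to FA2). The real detection may also depend on other factors such as head size.

```python
from typing import Optional


def pick_fa_version(sm_major: int) -> int:
    """Hypothetical detector: map compute-capability major to an FA version."""
    if sm_major >= 10:   # Blackwell (SM100)
        return 4
    if sm_major == 9:    # Hopper (SM90)
        return 3
    return 2             # older architectures stay on FA2


def varlen_attention_stub(q, k, v, *, fa_version: int = 2) -> int:
    """Stand-in for flash_attn_varlen_func; only reports which version would run."""
    return fa_version


class PrefillPathSketch:
    """Fix 1: detect the FA version once at init and pass it on every call."""

    def __init__(self, sm_major: int) -> None:
        self.fa_version = pick_fa_version(sm_major)

    def prefill(self, q=None, k=None, v=None) -> int:
        # Before the fix the kwarg was omitted, so the default (FA2) was used.
        return varlen_attention_stub(q, k, v, fa_version=self.fa_version)


class FlashAttentionImplStub:
    """Stand-in for a layer handled by the FlashAttention backend."""

    def __init__(self, sliding_window: Optional[int]) -> None:
        self.sliding_window = sliding_window


class TurboQuantImplStub:
    """Stand-in for a layer routed to a different backend."""


def collect_sliding_window_configs(layer_impls) -> set:
    """Fix 2: skip non-FA layers instead of asserting every layer is FA."""
    configs = set()
    for impl in layer_impls:
        if not isinstance(impl, FlashAttentionImplStub):
            continue  # other backends use their own metadata builders
        configs.add(impl.sliding_window)
    return configs
```

With a mixed list such as `[FlashAttentionImplStub(None), TurboQuantImplStub(), FlashAttentionImplStub(4096)]`, the collector returns only the two FlashAttention configurations and no longer fails on the TurboQuant layer.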

Test Plan

# 1. Unit tests
python -m pytest tests/quantization/test_turboquant.py -v

# 2. GSM8K correctness eval (all 4 TQ presets)
python -m pytest -s -v tests/evals/gsm8k/test_gsm8k_correctness.py \
    --config-list-file=tests/evals/gsm8k/configs/models-turboquant.txt

# 3. E2E inference with CUDAGraph (no enforce_eager, validates assert fix)
CUDA_VISIBLE_DEVICES=0 HF_HUB_OFFLINE=1 python -c "
from vllm import LLM, SamplingParams
for dtype in ['turboquant_k8v4', 'turboquant_3bit_nc']:
    llm = LLM(model='Qwen/Qwen3-4B', kv_cache_dtype=dtype,
              max_model_len=2048, gpu_memory_utilization=0.5)
    outputs = llm.generate(['What is 2+2?'], SamplingParams(max_tokens=32))
    print(f'{dtype}: {outputs[0].outputs[0].text[:80]}')
    del llm
"

Test Result

Hardware: NVIDIA H20 (SM90 / Hopper)

FA version detection

FA version for head_size=128: 3   (was: unspecified, defaulting to FA2)
FA version for head_size=256: 3

Unit tests

114 passed, 6 failed (pre-existing rotation matrix atol issues, unrelated)

Confirmed pre-existing: same 6 failures on unmodified code via git stash / re-run.

E2E inference with CUDAGraph (enforce_eager=False)

Validates both the FA3 passthrough and the assert fix (AOT schedule path is entered).

Preset | CUDAGraph Capture | Result
k8v4 | 51 piecewise + 51 full | PASSED
t3nc | 51 piecewise + 51 full | PASSED

GSM8K correctness eval (Qwen3-4B, 1319 questions, 5-shot)

Preset | Accuracy | Threshold | Result
k8v4 (FP8 key + 4-bit value) | - | >= 0.80 | PASSED
t4nc (4-bit MSE + NC) | - | >= 0.80 | PASSED
k3v4nc (3-bit key + 4-bit value + NC) | - | >= 0.78 | PASSED
t3nc (3-bit all + NC) | 0.7574 | >= 0.75 | PASSED

Note: t3nc failed in the batch run because zombie processes were holding GPU memory; it passed when run alone.



Changed files

  • tests/evals/gsm8k/configs/Qwen3-4B-TQ-k3v4nc.yaml (modified, +1/-1)
  • tests/evals/gsm8k/configs/Qwen3-4B-TQ-k8v4.yaml (modified, +1/-1)
  • tests/evals/gsm8k/configs/Qwen3-4B-TQ-t3nc.yaml (modified, +1/-1)
  • tests/evals/gsm8k/configs/Qwen3-4B-TQ-t4nc.yaml (modified, +1/-1)
  • vllm/v1/attention/backends/flash_attn.py (modified, +7/-2)
  • vllm/v1/attention/backends/turboquant_attn.py (modified, +7/-0)

Tracking follow-up work on the TurboQuant/HIGGS KV cache attention backend that initially landed in #38479.

Backend coverage

  • Expand flash_attn_varlen_func to FA3/4, not just FA2
  • Hybrid attention models (e.g. Qwen3.5, mamba+attention, interleaved SWA?)
  • MLA support (through a new attention backend?)

Accuracy

  • Long-context evals across presets (k8v4, t4nc, k3v4nc, t3nc): RULER, NIAH at 32K–1M, LongBench
  • Per-layer sensitivity sweep to inform --kv-cache-dtype-skip-layers defaults
  • Publish recommended config table (quality vs. compression vs. throughput) based on eval results
  • Add new presets as the sweeps suggest (e.g. mixed-bit, per-layer schedules)
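A per-layer sensitivity sweep of the kind listed above could be organized roughly as follows. This is only a sketch under stated assumptions: `sensitivity_sweep` and `toy_evaluate` are hypothetical names, the `evaluate` callback stands in for a real eval run (for example GSM8K or a long-context benchmark) with the given layers left unquantized, and the per-layer penalties are made-up numbers used purely for illustration.

```python
from typing import Callable, Sequence


def sensitivity_sweep(
    num_layers: int,
    evaluate: Callable[[Sequence[int]], float],
) -> list:
    """Rank layers by the accuracy recovered when that one layer is skipped."""
    baseline = evaluate([])  # every layer uses the quantized KV cache
    gains = []
    for layer in range(num_layers):
        # Leave exactly one layer unquantized and measure the change.
        gains.append((layer, evaluate([layer]) - baseline))
    # Most sensitive layers first (largest accuracy recovery).
    return sorted(gains, key=lambda item: item[1], reverse=True)


def toy_evaluate(skip_layers: Sequence[int]) -> float:
    """Toy accuracy model: each quantized layer costs a fixed, made-up penalty."""
    penalty = {0: 0.030, 1: 0.002, 2: 0.010, 3: 0.001}
    lost = sum(p for layer, p in penalty.items() if layer not in skip_layers)
    return 0.80 - lost


ranking = sensitivity_sweep(4, toy_evaluate)
# Candidate layers to leave unquantized, most sensitive first.
skip_candidates = [layer for layer, _ in ranking[:2]]
```

The top of the ranking gives candidates for a `--kv-cache-dtype-skip-layers` default. A single-layer sweep ignores interactions between layers, so a shortlist produced this way would still need to be confirmed with a combined run.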

Feature compatibility

Things currently disabled or unverified with the TurboQuant backend; enable and test:

  • Speculative decoding / Eagle
  • KV connector / disaggregated serving (NIXL, LMCache, Mooncake)

Performance

  • CUDA/cutedsl kernels to replace the triton kernels
  • Validate AMD performance
  • Revisit stream-overlap gating under CUDAGraph
  • FP8 decode path parity on Hopper

cc @vibhavagarwal5

Extended analysis

TL;DR

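This tracking issue lists follow-up work for the TurboQuant/HIGGS KV cache attention backend across backend coverage, accuracy, feature compatibility, and performance. PR #40092 resolves part of the backend-coverage item by passing fa_version to flash_attn_varlen_func in the TurboQuant prefill paths, which enables FA3 on Hopper (SM90) and FA4 on Blackwell (SM100).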

Guidance

  • Check how the TurboQuant prefill paths call flash_attn_varlen_func: without the fa_version kwarg the call defaults to FA2, and PR #40092 aligns it with the standard FlashAttention backend, which detects fa_version at init time.
  • Investigate TurboQuant support for hybrid attention models (e.g. Qwen3.5, mamba+attention, interleaved SWA); mixed-backend setups already exposed an assert in _get_sliding_window_configs() that PR #40092 fixes.
  • Consider MLA support, possibly through a new attention backend.
  • Run long-context evals (RULER, NIAH at 32K–1M, LongBench) across the current presets (k8v4, t4nc, k3v4nc, t3nc) and a per-layer sensitivity sweep to inform new presets and --kv-cache-dtype-skip-layers defaults.

Example

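The issue itself contains no code. The Test Plan in the PR notes for #40092 shows how to exercise the TurboQuant presets through unit tests, the GSM8K correctness eval, and an E2E inference run with CUDAGraph.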

Notes

The issue is a high-level tracking list for the TurboQuant/HIGGS KV cache attention backend rather than a single bug report, so it does not prescribe one fix. The concrete technical details available here come from the PR notes for #40092.

Recommendation

Use the changes in PR #40092 for FA3/FA4 prefill support and the mixed-backend assert fix. Treat the remaining backend coverage, accuracy, feature compatibility, and performance items as follow-up work tracked by this issue.
