vllm - 💡(How to fix) Fix [Bug]: TurboQuant crashes on T4/Turing (SM75) — FlashAttention capability not checked [1 pull requests]

Error Message

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Fix Action

Fixed

Your current environment

vLLM 0.21.0 (pip install)
GPU: Tesla T4 (SM75, Turing) — Kaggle free tier
CUDA: 12.4
OS: Ubuntu 22.04 (Kaggle)
Python: 3.11

🐛 Describe the bug

Running any model with --kv-cache-dtype turboquant_4bit_nc on a T4 GPU crashes immediately during prefill:

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

The TQ backend has a working SDPA fallback path (F.scaled_dot_product_attention) gated behind if not _HAS_FLASH_ATTN, but that path never activates on CUDA. is_flash_attn_varlen_func_available() returns True on all CUDA platforms because it checks package availability, not hardware compute capability.

How to reproduce

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",
    kv_cache_dtype="turboquant_4bit_nc",
    enforce_eager=True,
    max_model_len=2048,
    gpu_memory_utilization=0.5,
)
# Crashes during engine init on T4

Root cause

turboquant_attn.py line 60:

_HAS_FLASH_ATTN = is_flash_attn_varlen_func_available()
if _HAS_FLASH_ATTN:
    from vllm.v1.attention.backends.fa_utils import flash_attn_varlen_func

is_flash_attn_varlen_func_available() (in fa_utils.py) checks whether the flash_attn_varlen_func callable exists in the installed package. On any CUDA system with flash-attn installed, this returns True regardless of GPU architecture. FA2 kernels require SM >= 8.0 (Ampere+). On SM75 (T4/Turing), the kernel launch fails at runtime.
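
One way to add the missing gate is to combine the package check with a compute-capability check. The sketch below is an assumption about how such a gate could look, not the change that was merged. The names fa2_supported, should_use_flash_attn, and current_cuda_capability are hypothetical; only torch.cuda.is_available() and torch.cuda.get_device_capability() are real PyTorch calls. The (8, 0) threshold comes from the SM >= 8.0 requirement stated above.

```python
from typing import Optional, Tuple

# Minimum compute capability for FA2 kernels (SM 8.0, Ampere), per the issue text.
MIN_FA2_CAPABILITY: Tuple[int, int] = (8, 0)


def fa2_supported(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True if a (major, minor) compute capability can run FA2 kernels."""
    return capability is not None and tuple(capability) >= MIN_FA2_CAPABILITY


def should_use_flash_attn(package_available: bool,
                          capability: Optional[Tuple[int, int]]) -> bool:
    """Hypothetical gate: require both the callable and capable hardware."""
    return package_available and fa2_supported(capability)


def current_cuda_capability() -> Optional[Tuple[int, int]]:
    """Probe the active CUDA device; None if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability()
    return (major, minor)


if __name__ == "__main__":
    # Print the detected capability and whether the gate would enable FA2.
    cap = current_cuda_capability()
    print(cap, should_use_flash_attn(True, cap))
```

With a gate of this shape, a T4 (capability (7, 5)) would evaluate to False even when flash-attn is installed, so the existing if not _HAS_FLASH_ATTN branch could take over.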

Expected behavior

TQ should fall back to the SDPA prefill path on GPUs that can't run FA2. The fallback path already exists and works correctly — it just needs the capability gate.
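
The expected behavior can be illustrated with a small simulation. This is a toy sketch with stand-in functions, not vLLM code: flash_prefill, sdpa_prefill, and run_prefill are hypothetical, and the FA2 stub simply raises the error message reported above when given a simulated SM75 device.

```python
from typing import Tuple


def flash_prefill(capability: Tuple[int, int]) -> str:
    """Stand-in for the FA2 path: fails below SM 8.0, like the reported error."""
    if capability < (8, 0):
        raise RuntimeError("FlashAttention only supports Ampere GPUs or newer.")
    return "flash_attn"


def sdpa_prefill(capability: Tuple[int, int]) -> str:
    """Stand-in for the SDPA fallback path (no SM 8.0 requirement in this toy)."""
    return "sdpa"


def run_prefill(has_flash_attn: bool, capability: Tuple[int, int]) -> str:
    """Mirror the `if not _HAS_FLASH_ATTN` gate described in the issue."""
    if not has_flash_attn:
        return sdpa_prefill(capability)
    return flash_prefill(capability)


T4 = (7, 5)  # simulated Tesla T4 compute capability

# Package-only gate: the flag is True on a T4, so the FA2 stub raises.
try:
    run_prefill(has_flash_attn=True, capability=T4)
    package_only_result = "ok"
except RuntimeError as exc:
    package_only_result = str(exc)

# Capability-aware gate: the flag is False on a T4, so the SDPA stub runs.
capability_aware_result = run_prefill(has_flash_attn=T4 >= (8, 0), capability=T4)
```

The only difference between the two runs is how the flag is computed, which matches the claim that the fallback path just needs the capability gate.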

Related issues

  • #41403 — Gate 6 (FA capability gating) is the same class of bug
  • #40069 — TQ/HIGGS tracking issue
