vllm: [Bug]: TurboQuant crashes on T4/Turing (SM75), FlashAttention capability not checked (1 linked pull request)

vllm, 2026-05-25 07:51:29

Error Message

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Root Cause

turboquant_attn.py line 60:

_HAS_FLASH_ATTN = is_flash_attn_varlen_func_available()
if _HAS_FLASH_ATTN:
    from vllm.v1.attention.backends.fa_utils import flash_attn_varlen_func

is_flash_attn_varlen_func_available() (in fa_utils.py) checks whether the flash_attn_varlen_func callable exists in the installed package. On any CUDA system with flash-attn installed, this returns True regardless of GPU architecture. FA2 kernels require SM >= 8.0 (Ampere+). On SM75 (T4/Turing), the kernel launch fails at runtime.
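
The gap described above is that package availability and hardware capability are two separate questions. A minimal sketch of a gate that asks both is shown here. The helper names (fa2_supported, should_use_flash_attn, current_capability) are hypothetical and are not vLLM APIs, and the (8, 0) threshold is taken from the SM >= 8.0 statement above. This is not necessarily how the linked PR implements the check.

```python
from typing import Optional, Tuple

# Assumption from the issue text: FA2 kernels need SM >= 8.0 (Ampere or newer).
FA2_MIN_CAPABILITY = (8, 0)


def fa2_supported(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True when a (major, minor) compute capability can run FA2."""
    return capability is not None and tuple(capability) >= FA2_MIN_CAPABILITY


def should_use_flash_attn(package_available: bool,
                          capability: Optional[Tuple[int, int]]) -> bool:
    """Hypothetical gate: the callable must exist AND the GPU must support it."""
    return package_available and fa2_supported(capability)


def current_capability() -> Optional[Tuple[int, int]]:
    """Best-effort lookup of the active GPU's capability; None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


# A T4 reports (7, 5), so the gate stays closed even with flash-attn installed.
t4_uses_fa = should_use_flash_attn(True, (7, 5))
a100_uses_fa = should_use_flash_attn(True, (8, 0))
```

With a gate of this shape, _HAS_FLASH_ATTN would be False on SM75 and the existing fallback branch would be taken.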

Fix Action

Fixed by PR: fix(turboquant): use SDPA prefill fallback on pre-Ampere GPUs (https://github.com/vllm-project/vllm/pull/43577)

Your current environment

vLLM 0.21.0 (pip install)
GPU: Tesla T4 (SM75, Turing) — Kaggle free tier
CUDA: 12.4
OS: Ubuntu 22.04 (Kaggle)
Python: 3.11

🐛 Describe the bug

Running any model with --kv-cache-dtype turboquant_4bit_nc on a T4 GPU crashes immediately during prefill:

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

The TQ backend has a working SDPA fallback path (F.scaled_dot_product_attention) gated behind if not _HAS_FLASH_ATTN, but it never activates on CUDA: is_flash_attn_varlen_func_available() returns True on all CUDA platforms because it checks package availability, not hardware compute capability.

How to reproduce

from vllm import LLM
llm = LLM(
    model="Qwen/Qwen2.5-0.5B",
    kv_cache_dtype="turboquant_4bit_nc",
    enforce_eager=True,
    max_model_len=2048,
    gpu_memory_utilization=0.5,
)
# Crashes during engine init on T4

Root cause

turboquant_attn.py line 60:

_HAS_FLASH_ATTN = is_flash_attn_varlen_func_available()
if _HAS_FLASH_ATTN:
    from vllm.v1.attention.backends.fa_utils import flash_attn_varlen_func

Expected behavior

TQ should fall back to the SDPA prefill path on GPUs that can't run FA2. The fallback path already exists and works correctly — it just needs the capability gate.
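
For readers unfamiliar with what an SDPA prefill path looks like, the following is an illustrative sketch only, not the TurboQuant backend's actual fallback code. It runs causal attention over packed variable-length sequences with torch.nn.functional.scaled_dot_product_attention, which does not depend on FlashAttention's Ampere-only kernels. The function name sdpa_varlen_prefill and the tensor layout are assumptions made for the example, and it assumes equal query and KV head counts (no GQA).

```python
import torch
import torch.nn.functional as F


def sdpa_varlen_prefill(q, k, v, cu_seqlens):
    """Causal self-attention over packed sequences (hypothetical helper).

    q, k, v: (total_tokens, num_heads, head_dim), sequences concatenated
    along dim 0. cu_seqlens: cumulative boundaries as ints, e.g. [0, 3, 8].
    """
    out = torch.empty_like(q)
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        # SDPA expects (..., seq_len, head_dim), so move heads before tokens.
        qs = q[start:end].transpose(0, 1)
        ks = k[start:end].transpose(0, 1)
        vs = v[start:end].transpose(0, 1)
        o = F.scaled_dot_product_attention(qs, ks, vs, is_causal=True)
        # Back to (seq_len, num_heads, head_dim) in the packed output.
        out[start:end] = o.transpose(0, 1)
    return out


# Two sequences of lengths 3 and 5, 2 heads, head_dim 16, on CPU.
torch.manual_seed(0)
q = torch.randn(8, 2, 16)
k = torch.randn(8, 2, 16)
v = torch.randn(8, 2, 16)
out = sdpa_varlen_prefill(q, k, v, [0, 3, 8])
```

Under the causal mask, the first token of each sequence attends only to itself, so its output equals its own value vector. That gives a quick sanity check that sequences are not leaking into each other.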

Related issues

#41403 — Gate 6 (FA capability gating) is the same class of bug
#40069 — TQ/HIGGS tracking issue
