[RFC]: [V1][Attention] Add an experimental training-free sparse prefill attention backend for long-context workloads (vLLM 0.19.1)

Motivation.

Long-context prefill is often dominated by attention computation. As context lengths grow to 64K, 128K, and beyond, dense prefill attention becomes a major deployment bottleneck even when decoding is already optimized.

We have been experimenting with a training-free sparse prefill attention mechanism called BFLA (Block-Filtered Long-Context Attention) for vLLM-style paged-attention workloads. BFLA is designed as an opt-in runtime attention backend: it does not modify model weights, does not require training or calibration, and falls back to dense attention for unsupported cases.

Reference implementation:

https://github.com/Alicewithrabbit/BFLA

I would like to ask whether the vLLM maintainers would be open to upstreaming this as an experimental V1 Triton attention feature, possibly in several small PRs.

Proposed Change.

Proposal

Add an experimental sparse prefill path to the V1 Triton attention backend.

At a high level, the method works in two stages:

  1. Runtime block-level importance estimation

    During prefill, BFLA partitions the query and KV sequences into coarse blocks. It builds lightweight pooled Q/K representations and estimates causal block importance at runtime. The resulting block-level mask identifies which KV regions should be retained for each query block and KV head.

  2. Tile-level sparse prefill execution

    The coarse block mask is expanded to the Triton attention tile grid. The fused prefill kernel skips dropped KV tiles while preserving exact token-level causal attention inside every retained tile. The sparse path is only used for safe prefill-like workloads; otherwise it falls back to the existing dense Triton attention path.
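
The two stages above can be sketched in plain Python for a single attention head. This is an illustrative toy, not the BFLA kernel or any vLLM code: mean pooling, dot-product block scores, the top-`keep` selection rule, the always-retained diagonal block, and every function name here are assumptions chosen for the example, since the proposal does not specify them.

```python
import math
import random

def mean_pool(x, block):
    """Average consecutive groups of `block` row vectors (last group may be short)."""
    pooled = []
    for start in range(0, len(x), block):
        rows = x[start:start + block]
        pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

def block_mask(q, k, block, keep):
    """Stage 1 (assumed rule): per query block, retain the `keep` best causal KV blocks."""
    pq, pk = mean_pool(q, block), mean_pool(k, block)
    mask = []
    for i, qv in enumerate(pq):
        # Only KV blocks at or before the query block are causally visible.
        scores = {j: sum(a * b for a, b in zip(qv, pk[j])) for j in range(i + 1)}
        chosen = set(sorted(scores, key=scores.get, reverse=True)[:keep])
        chosen.add(i)  # assumption: the diagonal block is always retained
        mask.append([j in chosen for j in range(len(pk))])
    return mask

def sparse_causal_attention(q, k, v, block, mask):
    """Stage 2: skip dropped KV blocks; exact causal softmax inside retained ones."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for t, qt in enumerate(q):
        logits, values = [], []
        for j, kept in enumerate(mask[t // block]):
            if not kept:
                continue  # the whole KV block is skipped for this query block
            for s in range(j * block, min((j + 1) * block, len(k))):
                if s <= t:  # token-level causal mask inside the retained block
                    logits.append(scale * sum(a * b for a, b in zip(qt, k[s])))
                    values.append(v[s])
        m = max(logits)  # subtract the max for a numerically stable softmax
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([sum(wi * vi[c] for wi, vi in zip(w, values)) / z
                    for c in range(len(v[0]))])
    return out

# Toy prefill: 16 tokens, head dimension 4, coarse blocks of 4 tokens.
random.seed(0)
n, d, block = 16, 4, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

mask = block_mask(q, k, block, keep=2)
sparse = sparse_causal_attention(q, k, v, block, mask)

# Keeping every causal block reproduces ordinary dense causal attention.
dense_mask = [[j <= i for j in range(n // block)] for i in range(n // block)]
dense = sparse_causal_attention(q, k, v, block, dense_mask)
```

In this toy, the first two query blocks see every causal KV block, so their outputs match the dense result; later query blocks attend to at most `keep` selected blocks plus the diagonal. A fused Triton kernel would skip tiles rather than loop over tokens, but the retained-tile arithmetic is the same.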

The feature would be disabled by default and enabled only through an explicit experimental flag or environment variable.
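
A minimal sketch of the opt-in gating described above, assuming an environment-variable switch. The variable name `EXAMPLE_ENABLE_SPARSE_PREFILL`, the `min_seq_len` threshold, and both function names are hypothetical placeholders for illustration; none of them is an existing vLLM setting, and the real eligibility checks for "safe prefill-like workloads" are not specified in the proposal.

```python
import os

# Hypothetical variable name used only in this sketch; not an existing vLLM setting.
ENV_VAR = "EXAMPLE_ENABLE_SPARSE_PREFILL"

def sparse_prefill_enabled(environ=None):
    """Return True only when the user has explicitly opted in."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR, "0").strip().lower() in {"1", "true"}

def select_prefill_path(is_prefill, seq_len, min_seq_len=32768, environ=None):
    """Choose "sparse" only for opted-in, eligible requests; otherwise "dense"."""
    if not sparse_prefill_enabled(environ):
        return "dense"  # disabled by default
    if not is_prefill or seq_len < min_seq_len:
        return "dense"  # assumed eligibility check; falls back to the dense path
    return "sparse"
```

With the variable unset, every request takes the dense path, so the default behavior is unchanged.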

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response
