vllm - [RFC]: [V1][Attention] Add an experimental training-free sparse prefill attention backend for long-context workloads (vLLM 0.19.1)

StepCodex · 2026-05-12T12:27:26Z



Motivation

Long-context prefill is often dominated by attention computation. As context lengths grow to 64K, 128K, and beyond, dense prefill attention becomes a major deployment bottleneck even when decoding is already optimized.

We have been experimenting with a training-free sparse prefill attention mechanism called BFLA (Block-Filtered Long-Context Attention) for vLLM-style paged-attention workloads. BFLA is designed as an opt-in runtime attention backend: it does not modify model weights, does not require training or calibration, and falls back to dense attention for unsupported cases.

Reference implementation:

https://github.com/Alicewithrabbit/BFLA

I would like to ask whether the vLLM maintainers would be open to upstreaming this as an experimental V1 Triton attention feature, possibly in several small PRs.

Proposal

Add an experimental sparse prefill path to the V1 Triton attention backend.

At a high level, the method works in two stages:

1. Runtime block-level importance estimation

   During prefill, BFLA partitions the query and KV sequences into coarse blocks. It builds lightweight pooled Q/K representations and estimates causal block importance at runtime. The resulting block-level mask identifies which KV regions should be retained for each query block and KV head.

2. Tile-level sparse prefill execution

   The coarse block mask is expanded to the Triton attention tile grid. The fused prefill kernel skips dropped KV tiles while preserving exact token-level causal attention inside every retained tile. The sparse path is only used for safe prefill-like workloads; otherwise it falls back to the existing dense Triton attention path.
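The two stages above can be illustrated with a small, dependency-free Python sketch. It is not the BFLA implementation or the Triton kernel: it handles a single head, assumes the query and KV sequences have the same length, and uses an assumed selection rule (mean pooling, top-`keep` causal blocks by pooled dot product, diagonal block always retained). The function names `mean_pool`, `block_mask`, `expand_to_tiles`, and `sparse_prefill` are hypothetical.

```python
import math

def mean_pool(x, block):
    """Mean-pool a [seq_len][dim] list of vectors into one vector per block."""
    pooled = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

def block_mask(q, k, block, keep):
    """Stage 1: mask[qb][kb] is True when KV block kb is retained for query block qb.

    Assumed rule (not taken from the RFC): keep the top-`keep` causal KV blocks
    by pooled dot-product score, and always keep the diagonal block.
    """
    pq, pk = mean_pool(q, block), mean_pool(k, block)
    mask = []
    for qb, qv in enumerate(pq):
        # Block-level causality: only KV blocks 0..qb are candidates.
        scores = [(sum(a * b for a, b in zip(qv, pk[kb])), kb) for kb in range(qb + 1)]
        kept = {kb for _, kb in sorted(scores, reverse=True)[:keep]}
        kept.add(qb)
        mask.append([kb in kept for kb in range(len(pk))])
    return mask

def expand_to_tiles(mask, block, tile, seq_len):
    """Stage 2a: expand the block mask to a finer tile grid (assumes block % tile == 0)."""
    n_tiles = math.ceil(seq_len / tile)
    ratio = block // tile
    return [[mask[qt // ratio][kt // ratio] for kt in range(n_tiles)]
            for qt in range(n_tiles)]

def sparse_prefill(q, k, v, tile_mask, tile):
    """Stage 2b: causal attention that skips dropped KV tiles entirely."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        logits, values = [], []
        for kt, keep_tile in enumerate(tile_mask[i // tile]):
            if not keep_tile:
                continue  # dropped tile: no scores or values are touched
            # Exact token-level causal attention inside a retained tile (j <= i).
            for j in range(kt * tile, min((kt + 1) * tile, i + 1)):
                logits.append(scale * sum(a * b for a, b in zip(qi, k[j])))
                values.append(v[j])
        m = max(logits)  # non-empty because the diagonal tile is always retained
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([sum(wi * vj[d] for wi, vj in zip(w, values)) / z
                    for d in range(len(v[0]))])
    return out
```

When `keep` is at least the number of blocks, every causal block is retained and the result matches dense causal attention; smaller values trade accuracy for fewer visited tiles.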

The feature would be disabled by default and enabled only through an explicit experimental flag or environment variable.
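A minimal sketch of the opt-in gating with dense fallback might look like the following. Everything here is hypothetical: the variable name `VLLM_EXPERIMENTAL_SPARSE_PREFILL` is not an existing vLLM setting, and the eligibility checks (`is_prefill`, `min_seq_len`) only stand in for whatever "safe prefill-like" criteria an actual implementation would define.

```python
import os

# Hypothetical flag name, used only for illustration; not a real vLLM variable.
ENV_FLAG = "VLLM_EXPERIMENTAL_SPARSE_PREFILL"

def sparse_prefill_enabled(environ=os.environ):
    """Disabled by default; enabled only by an explicit opt-in value."""
    return environ.get(ENV_FLAG, "0").strip().lower() in {"1", "true"}

def choose_prefill_path(is_prefill, seq_len, min_seq_len=32768, environ=os.environ):
    """Return "sparse" only for opted-in, eligible requests; otherwise "dense".

    The eligibility rule (prefill only, sequence at least `min_seq_len` tokens)
    is an assumption made for this sketch.
    """
    if not sparse_prefill_enabled(environ):
        return "dense"
    if not is_prefill or seq_len < min_seq_len:
        return "dense"  # unsupported case: fall back to the dense path
    return "sparse"
```

Keeping the decision in one function makes the fallback explicit: any request that fails a check takes the existing dense path unchanged.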
