vllm - ✅(Solved) Fix [Bug] flash_attn _get_sliding_window_configs asserts FlashAttentionImpl over all attention layers, breaks any non-FA backend [2 pull requests, 1 participants]

Alberto-Codes · 2026-04-10T17:09:10Z




GitHub stats

vllm-project/vllm#39516•Fetched 2026-04-11 06:13:06


Root Cause

The assertion's intent (gather sliding window configs from FA layers) is correct; the implementation is wrong because it iterates the global layer set instead of restricting to layers in this builder's KV cache group.

Fix / Workaround

The same bug surfaces during the GDN/Mamba code path on hybrid models like Qwen3.5-35B-A3B. cc @vibhavagarwal5 — your perf branch vibhavagarwal5/vllm#7 inherits the unmodified flash_attn.py from main, so anyone testing it will hit this without a local patch.

Defensive (1 line): change the assert to a soft-skip:

for layer in layers.values():
    if not isinstance(layer.impl, FlashAttentionImpl):
        continue
    sliding_window_configs.add(layer.impl.sliding_window)

Strictly more permissive than current behavior, no risk of regressing FA-only models. This is what I patched locally to run my benchmarks (H100 throughput data here).

PR fix notes

PR #7: perf(tq): fuse MSE store ops + inline decode Q cast + cleanup

Repository: vibhavagarwal5/vllm
Author: vibhavagarwal5
State: closed | merged: True
Link: https://github.com/vibhavagarwal5/vllm/pull/7

Description (problem / solution / changelog)

Summary

WHT rotation: Replace QR-decomposed random orthogonal matrices with Walsh-Hadamard Transform + random sign flips for key/query rotation. Drop-in replacement (same D×D matmul), orthonormal + self-inverse, enables future in-kernel butterfly fusion
Fused MSE store: Bucketize/centroid-gather/residual-norm fused into single Triton kernel (_tq_fused_store_mse), eliminating 4 PyTorch kernel launches per layer
In-kernel FP8 cast: FP8 key cast moved from host-side torch.float8_e4m3fn to in-kernel tl.float8e4nv/tl.float8e4b15, removing a separate kernel launch
Value quant dedup: Extracted shared _store_quantized_value Triton JIT helper, deduplicating ~60 lines between FP8 and MSE store kernels
Prefill .tolist() optimization: Single CPU-GPU sync instead of per-request .item() calls in prefill loop
CUDAGraph memory fix: Static NUM_KV_SPLITS=32 reduced estimated memory from 33 GiB → 8.7 GiB
Dead code cleanup: Removed unused loggers, kernel constexprs (NUM_Q_HEADS, PADDED_SLOT, MAX_NUM_BLOCKS, N_CENTROIDS), value_packed_size params, stale QR matrix buffer

Benchmark results (Qwen3-4B, 4× RTX PRO 6000 Blackwell, cudagraphs+compile)

Quality remains constant

Config	K cos	V cos	PPL	NIAH	GSM8K	Invalid
baseline	—	—	1.54	77/77 (100%)	0.900	0.000
turboquant_k8v4	—	—	1.59	77/77 (100%)	0.860	0.000
turboquant_4bit_nc	—	—	1.53	77/77 (100%)	0.840	0.000
turboquant_k3v4_nc	—	—	1.52	77/77 (100%)	0.780	0.000
turboquant_3bit_nc	—	—	1.53	77/77 (100%)	0.720	0.000

Throughput (output tok/s)

Scenario	baseline	k8v4	% base	t4nc	% base	k3v4nc	% base	t3nc	% base
short-decode (128→512)	8977	7113	79%	6397	71%	6206	69%	6114	68%
long-prefill (4096→128)	850	811	95%	766	90%	745	88%	730	86%
mixed (512→512)	6618	5279	80%	4829	73%	4584	69%	4491	68%
high-load (512→128, n=500)	5633	4751	84%	4456	79%	4337	77%	4240	75%
very-long-prefill (8192→64)	233	234	100%	224	96%	220	94%	216	93%
decode-heavy (64→1024)	8304	6521	79%	5887	71%	5650	68%	5430	65%

TPOT avg (ms) — lower is better

Scenario	baseline	k8v4	t4nc	k3v4nc	t3nc
short-decode	11.9	15.0	16.6	17.2	17.5
long-prefill	138.1	135.2	142.4	146.6	149.3
mixed	19.3	23.1	25.3	26.6	27.2
high-load	60.9	65.9	71.1	72.1	73.7
very-long-prefill	241.9	235.2	244.4	250.1	254.5
decode-heavy	12.8	16.4	18.0	18.7	19.5

TTFT avg (ms) — lower is better

Scenario	baseline	k8v4	t4nc	k3v4nc	t3nc
short-decode	305	389	430	461	407
long-prefill	6095	6530	6690	6822	6753
mixed	825	1014	1034	1077	1054
high-load	1872	2124	2141	2188	2198
very-long-prefill	13372	13633	14159	14399	14539
decode-heavy	224	342	293	292	377

Improvement vs pre-optimization baseline (0408 run)

Decode overhead dropped from ~3x baseline to ~1.3-1.5x — a ~55% reduction in the TQ-vs-baseline gap:

Metric	Before (0408)	After (0410)	Improvement
k8v4 short-decode tok/s	40% of baseline	79% of baseline	+39pp
t4nc short-decode tok/s	40% of baseline	71% of baseline	+31pp
k8v4 TPOT overhead	2.96x baseline	1.26x baseline	-57%
k8v4 long-prefill tok/s	72% of baseline	95% of baseline	+23pp
k8v4 very-long-prefill tok/s	86% of baseline	100% of baseline	+14pp
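
The derived k8v4 figures in this table can be re-computed from the raw short-decode numbers above. The sketch below is one reading that reproduces them; it assumes the "-57%" entry is the relative change in the TPOT overhead multiplier (2.96x to 1.26x), which the PR text does not state explicitly.

```python
# Figures copied from the tables above (k8v4, short-decode scenario).
baseline_tok_s = 8977
k8v4_tok_s = 7113
baseline_tpot_ms = 11.9
k8v4_tpot_ms = 15.0
tpot_overhead_before = 2.96  # 0408 run, as reported in the table

# Share of baseline throughput retained by k8v4.
throughput_share = k8v4_tok_s / baseline_tok_s

# TPOT overhead expressed as a multiple of the baseline TPOT.
tpot_overhead_after = k8v4_tpot_ms / baseline_tpot_ms

# Assumed definition: relative change of the overhead multiplier itself.
overhead_change = (tpot_overhead_after - tpot_overhead_before) / tpot_overhead_before

print(f"throughput: {throughput_share:.0%} of baseline")
print(f"TPOT overhead: {tpot_overhead_after:.2f}x baseline")
print(f"overhead change: {overhead_change:.0%}")
```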

Key takeaways

k8v4 (FP8 keys + 4-bit values, ~2x compression): 79-100% of baseline throughput
t4nc (4-bit MSE + 4-bit values, ~3.8x compression): 71-96% of baseline
k3v4nc (3-bit MSE + 4-bit values, ~3.5x compression): 68-94% of baseline
k8v4 long-prefill TPOT is faster than baseline (135.2ms vs 138.1ms) — compressed cache reduces memory bandwidth
WHT rotation: No regression vs QR; consistent +0.5-2.5% improvement from structured Hadamard cache patterns

Test plan

Full perf benchmark (6 scenarios × 5 configs) — no regressions on baseline
All TQ configs produce correct output (k8v4, t4nc, k3v4nc, t3nc)
CUDAGraph capture verified (51 FULL + 51 PIECEWISE graphs)
WHT smoke test: coherent generation across all MSE configs
Quality benchmark (PPL/GSM8K) sanity check

🤖 Generated with Claude Code

Changed files

vllm/config/attention.py (modified, +5/-0)
vllm/model_executor/layers/attention/attention.py (modified, +3/-3)
vllm/model_executor/layers/quantization/turboquant/quantizer.py (modified, +18/-3)
vllm/v1/attention/backends/turboquant_attn.py (modified, +65/-19)
vllm/v1/attention/ops/triton_turboquant_decode.py (modified, +38/-62)
vllm/v1/attention/ops/triton_turboquant_store.py (modified, +239/-154)

PR #38479: [Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity

Repository: vllm-project/vllm
Author: vibhavagarwal5
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38479

Description (problem / solution / changelog)

Summary

TurboQuant adds online KV cache compression to vLLM's v1 attention backend using PolarQuant (WHT rotation + Lloyd-Max scalar quantization) for keys and uniform quantization for values. All quantization happens at store time via fused Triton kernels — no offline calibration, model changes, or weight modifications required. Just set --kv-cache-dtype turboquant_k8v4.

Compression Presets (Qwen3-4B, head_dim=128)

Preset	Key	Value	Slot (bytes)	Compression	GSM8K	NIAH
`turboquant_k8v4`	FP8 (E4M3)	4-bit uniform	196	2.6x	0.860	100%
`turboquant_4bit_nc`	4-bit MSE + NC	4-bit uniform + NC	136	3.8x	0.840	100%
`turboquant_k3v4_nc`	3-bit MSE + NC	4-bit uniform + NC	120	4.3x	0.780	100%
`turboquant_3bit_nc`	3-bit MSE + NC	3-bit uniform + NC	104	4.9x	0.720	100%

Baseline: GSM8K 0.900, NIAH 100%. Measured on Qwen/Qwen3-4B with 5-shot GSM8K (200q) and NIAH (512-32K, 77 probes).
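
The compression column can be checked against the slot sizes with a few lines of arithmetic. This is a sketch under one assumption that the PR text does not spell out: the uncompressed slot holds one key and one value vector of head_dim=128 elements at 2 bytes each (16-bit), that is 512 bytes per token per head.

```python
HEAD_DIM = 128
BYTES_PER_16BIT = 2

# Assumed baseline: one K vector plus one V vector in 16-bit precision.
baseline_slot_bytes = 2 * HEAD_DIM * BYTES_PER_16BIT

# Slot sizes copied from the preset table above.
slot_bytes = {
    "turboquant_k8v4": 196,
    "turboquant_4bit_nc": 136,
    "turboquant_k3v4_nc": 120,
    "turboquant_3bit_nc": 104,
}

ratios = {name: baseline_slot_bytes / size for name, size in slot_bytes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```

Under that assumption the ratios round to the 2.6x, 3.8x, 4.3x and 4.9x values listed in the table.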

Performance (Qwen3-4B, 4x RTX PRO 6000 Blackwell, cudagraphs+compile)

Throughput (output tok/s)

Scenario	Baseline	k8v4	% base	t4nc	% base	k3v4nc	% base	t3nc	% base
short-decode (128→512)	8977	7113	79%	6397	71%	6206	69%	6114	68%
long-prefill (4096→128)	850	811	95%	766	90%	745	88%	730	86%
mixed (512→512)	6618	5279	80%	4829	73%	4584	69%	4491	68%
high-load (512→128, n=500)	5633	4751	84%	4456	79%	4337	77%	4240	75%
very-long-prefill (8192→64)	233	234	100%	224	96%	220	94%	216	93%
decode-heavy (64→1024)	8304	6521	79%	5887	71%	5650	68%	5430	65%

TPOT (ms) — lower is better

Scenario	baseline	k8v4	t4nc	k3v4nc	t3nc
short-decode	11.9	15.0	16.6	17.2	17.5
long-prefill	138.1	135.2	142.4	146.6	149.3
mixed	19.3	23.1	25.3	26.6	27.2
very-long-prefill	241.9	235.2	244.4	250.1	254.5
decode-heavy	12.8	16.4	18.0	18.7	19.5

TTFT (ms) — lower is better

Scenario	baseline	k8v4	t4nc	k3v4nc	t3nc
short-decode	305	389	430	461	407
long-prefill	6095	6530	6690	6822	6753
mixed	825	1014	1034	1077	1054
decode-heavy	224	342	293	292	377

Key Takeaways

k8v4 (FP8 keys + 4-bit values, ~2.6x compression): 79-100% of baseline throughput across all scenarios
t4nc (4-bit MSE + NC, ~3.8x compression): 71-96% of baseline
k8v4 TPOT is faster than baseline on long sequences (135.2ms vs 138.1ms) — compressed cache reduces memory bandwidth pressure
Very-long-prefill at parity — 8K→64 shows 100% of baseline tok/s for k8v4

Technical Innovations

Walsh-Hadamard Transform (WHT) rotation — Replaced QR-decomposed random orthogonal matrices with WHT + random sign flips. Orthonormal, self-inverse (H = H^T = H^{-1}), enabling future in-kernel butterfly fusion. Same D×D matmul API, zero quality regression, consistent +0.5-2.5% improvement from structured Hadamard cache patterns. Continuation-prefill inversion is trivially H @ x (no transpose needed).

Fused MSE store kernel — Bucketize, centroid gather, residual norm, index packing, and value quantization fused into a single Triton kernel (_tq_fused_store_mse), eliminating 4 separate PyTorch kernel launches per layer. Result: +18-21% decode throughput, -10-12% prefill TTFT.

In-kernel FP8 cast — FP8 key cast moved from host-side torch.float8_e4m3fn to in-kernel tl.float8e4nv/tl.float8e4b15, removing a separate kernel launch. Auto-detects SM capability for Ampere vs Hopper FP8 formats.

Compact slot sizes — Slots are rounded to next even number instead of power-of-2, eliminating up to 47% padding waste (t4nc: 136B vs 256B). TQFullAttentionSpec properly overrides real_page_size_bytes with compact TQ slot bytes.

Shared value quant JIT helper — Extracted _store_quantized_value Triton JIT function, deduplicating ~60 lines between FP8 and MSE store kernels for both 3-bit and 4-bit value paths.

Prefill .tolist() optimization — Single CPU-GPU sync via .tolist() instead of per-request .item() calls in the prefill loop.

CUDAGraph memory fix — Static NUM_KV_SPLITS grid dimension (configurable, default 32) enables CUDAGraph capture. Estimated GPU memory reduced from 33 GiB → 8.7 GiB.

Stream overlap — KV store runs on a secondary CUDA stream so it can overlap with the next layer's forward pass (disabled during CUDAGraph capture).
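
To illustrate the WHT rotation item above, here is a minimal pure-Python sketch of an orthonormal Walsh-Hadamard transform combined with random sign flips. It is not the PR's code: the PR describes a D×D matmul, while this sketch uses the butterfly form that the PR mentions only as a future fusion target. The function names and the order of operations (sign flip first, then transform) are assumptions made for illustration.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def rotate(x, signs):
    """Flip signs, then transform: y = H (s * x)."""
    return fwht([s * v for s, v in zip(signs, x)])


def unrotate(y, signs):
    """Invert rotate(): x = s * (H y), because H is self-inverse and s * s = 1."""
    return [s * v for s, v in zip(signs, fwht(y))]


def l2_norm(v):
    return math.sqrt(sum(t * t for t in v))


rng = random.Random(0)
dim = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]

y = rotate(x, signs)
x_back = unrotate(y, signs)

print("norm preserved:", math.isclose(l2_norm(x), l2_norm(y)))
print("round trip ok:", all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(x, x_back)))
```

Note that the plain transform is self-inverse, while the combined rotation with sign flips is inverted by applying the transform and then the same signs.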

Architecture

┌──────────────────────────────────────────────────────────────────┐
│  Store path (Triton)                                            │
│  K → WHT rotation → Lloyd-Max quantize → bit-pack ──┐          │
│  V → uniform quantize → bit-pack ────────────────────┤→ cache   │
│                                                      │          │
│  Decode path (Triton, split-KV)                      │          │
│  cache → unpack K → dequant → Q·K scores ──┐         │          │
│  cache → unpack V → dequant ──→ score·V ───┤→ output │          │
│                                            │         │          │
│  Prefill path (flash_attn_varlen_func)     │         │          │
│  Raw Q, K, V → flash attention → output    │         │          │
│  (continuation decode via TQ decode kernel)│         │          │
└──────────────────────────────────────────────────────────────────┘

Design Decisions

Compact even-aligned slots — slots rounded to next even number (not pow2), eliminating up to 47% memory waste. Hybrid mamba+attention models are out of scope for this PR.
Boundary layer protection — first/last N layers keep FP16 KV cache via kv_cache_dtype_skip_layers to protect embedding-adjacent representations. Also supports skipping "sliding_window" layers and arbitrary layer indices.
TQFullAttentionSpec — proper spec subclass that overrides real_page_size_bytes with TQ slot bytes, with correct merge semantics for uniform-spec models. Passes UniformTypeKVCacheSpecs.is_uniform_type() check as a FullAttentionSpec subclass.
No QJL — intentionally omitted per community consensus (5+ independent groups found it hurts attention quality by amplifying variance through softmax).
Norm correction (NC) — re-normalizes centroid vectors to unit norm before inverse rotation during dequant, fixing quantization-induced norm distortion (~0.8% PPL improvement at 4-bit).
Flash-attention prefill — uses flash_attn_varlen_func for memory-efficient O(N) prefill, with a continuation-decode threshold (128 tokens) routing small chunks directly through the TQ decode kernel.
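
The "up to 47%" figure for compact slots can be reproduced from the preset slot sizes. The sketch below assumes that waste is measured as the share of a power-of-two slot that would be padding; the helper names are hypothetical and the slot sizes are the final values from the preset table, not the pre-rounding sizes, which the PR text does not list.

```python
def round_up_even(n):
    """Round up to the next even number (even inputs are unchanged)."""
    return n + (n % 2)


def round_up_pow2(n):
    """Round up to the next power of two (powers of two are unchanged)."""
    return 1 << (n - 1).bit_length()


# Slot sizes copied from the preset table.
slot_bytes = {
    "turboquant_k8v4": 196,
    "turboquant_4bit_nc": 136,
    "turboquant_k3v4_nc": 120,
    "turboquant_3bit_nc": 104,
}

waste = {}
for name, size in slot_bytes.items():
    compact = round_up_even(size)
    padded = round_up_pow2(size)
    # Fraction of the power-of-two slot that would have been padding.
    waste[name] = (padded - compact) / padded
    print(f"{name}: {compact}B compact vs {padded}B pow2, {waste[name]:.0%} padding avoided")
```

With these inputs the 4-bit preset is the worst case for power-of-two rounding (136B against 256B), which matches the example quoted in the PR.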

Usage

# FP8 keys + 4-bit values (best quality/throughput trade-off)
vllm serve Qwen/Qwen3-4B --kv-cache-dtype turboquant_k8v4

# 4-bit MSE keys + 4-bit values + norm correction (3.8x compression)
vllm serve Qwen/Qwen3-4B --kv-cache-dtype turboquant_4bit_nc

# Maximum compression (4.9x)
vllm serve Qwen/Qwen3-4B --kv-cache-dtype turboquant_3bit_nc

# Skip specific layers (boundary protection)
vllm serve Qwen/Qwen3-4B --kv-cache-dtype turboquant_k8v4 \
  --kv-cache-dtype-skip-layers 0,1,34,35

Scope

Supports full-attention and uniform sliding-window transformer models. Hybrid architectures (mamba+attention, interleaved SWA) are planned for a follow-up PR.

Test Plan

Full perf benchmark (6 scenarios × 5 configs) — no regressions on baseline
All TQ configs produce correct output (k8v4, t4nc, k3v4nc, t3nc)
CUDAGraph capture verified (51 FULL + 51 PIECEWISE graphs)
WHT rotation: coherent generation across all MSE configs
Quality benchmark: GSM8K + NIAH across all presets
Mixed batch (decode+prefill) correct routing
LM Eval harness integration test

Changed files

.buildkite/test_areas/lm_eval.yaml (modified, +10/-0)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-k3v4nc.yaml (added, +5/-0)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-k8v4.yaml (added, +5/-0)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-t3nc.yaml (added, +5/-0)
tests/evals/gsm8k/configs/Qwen3-4B-TQ-t4nc.yaml (added, +5/-0)
tests/evals/gsm8k/configs/models-turboquant.txt (added, +4/-0)
tests/quantization/test_turboquant.py (added, +328/-0)
vllm/config/attention.py (modified, +5/-0)
vllm/config/cache.py (modified, +4/-0)
vllm/engine/arg_utils.py (modified, +18/-0)
vllm/model_executor/layers/attention/attention.py (modified, +48/-0)
vllm/model_executor/layers/quantization/turboquant/__init__.py (added, +13/-0)
vllm/model_executor/layers/quantization/turboquant/centroids.py (added, +89/-0)
vllm/model_executor/layers/quantization/turboquant/config.py (added, +185/-0)
vllm/model_executor/layers/quantization/turboquant/quantizer.py (added, +39/-0)
vllm/platforms/cuda.py (modified, +5/-0)
vllm/utils/torch_utils.py (modified, +4/-0)
vllm/v1/attention/backends/registry.py (modified, +4/-0)
vllm/v1/attention/backends/turboquant_attn.py (added, +775/-0)
vllm/v1/attention/ops/triton_turboquant_decode.py (added, +546/-0)
vllm/v1/attention/ops/triton_turboquant_store.py (added, +381/-0)
vllm/v1/core/single_type_kv_cache_manager.py (modified, +4/-2)
vllm/v1/kv_cache_interface.py (modified, +27/-0)


Your current environment

vLLM nightly 0.19.1rc1.dev188+g8d0f908b9
PyTorch 2.11.0+cu130
H100 80GB, single GPU, TP=1
Qwen3-4B served with --kv-cache-dtype turboquant_3bit_nc (any non-FA backend reproduces — TQ, mamba, GDN, lightning attention)

🐛 Describe the bug

vllm/v1/attention/backends/flash_attn.py:_get_sliding_window_configs iterates all Attention layers in the vllm config and asserts that every layer's impl is a FlashAttentionImpl:

def _get_sliding_window_configs(
    vllm_config: VllmConfig,
) -> set[tuple[int, int] | None]:
    """Get the set of all sliding window configs used in the model."""
    sliding_window_configs: set[tuple[int, int] | None] = set()
    layers = get_layers_from_vllm_config(vllm_config, Attention)
    for layer in layers.values():
        assert isinstance(layer.impl, FlashAttentionImpl)  # ← fires on TQ/mamba/GDN/lightning
        sliding_window_configs.add(layer.impl.sliding_window)
    return sliding_window_configs

This function is called from FlashAttentionMetadataBuilder.build() (flash_attn.py:405). When the model has at least one FlashAttention layer and at least one non-FA layer, the assertion fires and engine init crashes with:

File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 255, in _get_sliding_window_configs
    assert isinstance(layer.impl, FlashAttentionImpl)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
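
The failure mode can be reproduced without vLLM. The following is a minimal sketch with stand-in classes: `Layer` and `OtherBackendImpl` are hypothetical, and `FlashAttentionImpl` here is a local stub, not the real class. Only the loop body mirrors the function quoted above.

```python
class FlashAttentionImpl:
    """Local stub standing in for vLLM's FlashAttentionImpl."""

    def __init__(self, sliding_window=None):
        self.sliding_window = sliding_window


class OtherBackendImpl:
    """Hypothetical stand-in for any non-FA impl (TQ, mamba, GDN, ...)."""


class Layer:
    """Hypothetical stand-in for an Attention layer holding an impl."""

    def __init__(self, impl):
        self.impl = impl


def get_sliding_window_configs_strict(layers):
    # Same loop as the quoted function: asserts on every layer it is given.
    configs = set()
    for layer in layers.values():
        assert isinstance(layer.impl, FlashAttentionImpl)
        configs.add(layer.impl.sliding_window)
    return configs


# A global layer set that mixes one FA layer with one non-FA layer.
layers = {
    "layers.0.attn": Layer(FlashAttentionImpl()),
    "layers.1.attn": Layer(OtherBackendImpl()),
}

try:
    get_sliding_window_configs_strict(layers)
    crashed = False
except AssertionError:
    crashed = True

print("assertion fired:", crashed)
```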

Reproduction

Easiest repro is via the in-flight TurboQuant attention backend (#38479) on its current head, but any non-FA backend that coexists with FA layers triggers it. Steps that reproduced for me:

1. pip install --pre vllm --extra-index-url https://wheels.vllm.ai/nightly
2. Install or overlay any non-FA backend that registers an Attention.impl of a non-FlashAttentionImpl type
3. vllm serve Qwen/Qwen3-4B --kv-cache-dtype turboquant_3bit_nc --port 8000 --tensor-parallel-size 1 --max-model-len 8192 --gpu-memory-utilization 0.85
4. Engine init crashes during the first metadata builder construction

Suggested fix

Two options:

Defensive (1 line): change the assert to a soft-skip:

for layer in layers.values():
    if not isinstance(layer.impl, FlashAttentionImpl):
        continue
    sliding_window_configs.add(layer.impl.sliding_window)

Strictly more permissive than current behavior, no risk of regressing FA-only models. This is what I patched locally to run my benchmarks (H100 throughput data here).

Scoped (correct): restrict the layer iteration to layers in this builder's KV cache group, not the global set. Requires tracing the call site (FlashAttentionMetadataBuilder.build) to figure out which group the builder is responsible for. Cleaner but more invasive — would touch the call site too.
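
A rough sketch of what the scoped variant could look like, using stand-in classes rather than vLLM's own. It assumes the builder can supply the names of the layers in its KV cache group; the `group_layer_names` parameter and the function name are hypothetical, and whether the real builder exposes such a list is exactly the design question raised above.

```python
class FlashAttentionImpl:
    """Local stub standing in for vLLM's FlashAttentionImpl."""

    def __init__(self, sliding_window=None):
        self.sliding_window = sliding_window


class OtherBackendImpl:
    """Hypothetical stand-in for any non-FA impl."""


class Layer:
    """Hypothetical stand-in for an Attention layer holding an impl."""

    def __init__(self, impl):
        self.impl = impl


def get_sliding_window_configs_scoped(layers, group_layer_names):
    """Collect sliding window configs only from the builder's own layers."""
    configs = set()
    for name in group_layer_names:
        impl = layers[name].impl
        # Within an FA group the original invariant still holds, so keep it strict.
        if not isinstance(impl, FlashAttentionImpl):
            raise TypeError(f"{name} is in a FlashAttention group but uses {type(impl).__name__}")
        configs.add(impl.sliding_window)
    return configs


layers = {
    "layers.0.attn": Layer(FlashAttentionImpl(sliding_window=(4095, 0))),
    "layers.1.attn": Layer(OtherBackendImpl()),
    "layers.2.attn": Layer(FlashAttentionImpl()),
}
fa_group = ["layers.0.attn", "layers.2.attn"]  # illustrative group membership

scoped_configs = get_sliding_window_configs_scoped(layers, fa_group)
print(scoped_configs)
```

Unlike the soft-skip, this keeps the check strict inside the group, so a misrouted layer would still be reported instead of silently ignored.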

Happy to send the defensive PR if a maintainer wants the fast fix; the scoped fix probably needs a design call from someone who touched the per-backend layer-set conventions in #35431 and knows whether this function is meant to be global by design or was just never updated for the multi-backend hybrid manager. cc @LucasWilkinson @MatthewBonanni as the #35431 co-authors.

Local workaround until fixed

sed -i 's|assert isinstance(layer.impl, FlashAttentionImpl)|if not isinstance(layer.impl, FlashAttentionImpl): continue|' \
  $(python -c "import vllm,os; print(os.path.dirname(vllm.__file__))")/v1/attention/backends/flash_attn.py
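
Because sed exits successfully even when the pattern matches nothing, a variant that reports how many replacements were made can be safer. The following is a sketch, not part of the original report: `soft_skip_patch` is a hypothetical helper that works on source text, demonstrated here on a small sample string instead of the installed file.

```python
import re

# Matches the assert line, capturing its indentation and any trailing comment.
ASSERT_RE = re.compile(
    r"^(?P<indent>[ \t]*)assert isinstance\(layer\.impl, FlashAttentionImpl\)(?P<rest>.*)$",
    re.MULTILINE,
)


def soft_skip_patch(source):
    """Replace the assert with a soft-skip; return (patched_source, replacements)."""

    def repl(match):
        indent = match.group("indent")
        return (
            f"{indent}if not isinstance(layer.impl, FlashAttentionImpl):\n"
            f"{indent}    continue"
        )

    return ASSERT_RE.subn(repl, source)


sample = (
    "def f(layers):\n"
    "    out = set()\n"
    "    for layer in layers.values():\n"
    "        assert isinstance(layer.impl, FlashAttentionImpl)  # comment\n"
    "        out.add(layer.impl.sliding_window)\n"
    "    return out\n"
)

patched, count = soft_skip_patch(sample)
compile(patched, "<patched>", "exec")  # raises SyntaxError if the patch broke the code
print(f"replacements: {count}")
```

To apply it for real, one would read v1/attention/backends/flash_attn.py under the directory printed by the python -c command above and write the result back only when exactly one replacement was made.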

Before submitting

Searched existing issues — didn't find a match. Closest is the broader hybrid-backend issue family, but no specific report on this assertion.

extent analysis

TL;DR

The most likely fix is to change the assert statement in _get_sliding_window_configs to a soft-skip, allowing the function to continue iterating over layers without crashing when encountering non-FlashAttentionImpl layers.

Guidance

Identify the layers that are causing the assertion to fail by checking the impl type of each layer in the layers dictionary.
Consider implementing a defensive fix by changing the assert statement to a conditional statement that skips layers with non-FlashAttentionImpl implementations.
Alternatively, investigate the possibility of restricting the layer iteration to layers in the builder's KV cache group, which may require tracing the call site and understanding the per-backend layer-set conventions.
Verify the fix by running the vllm serve command with the modified code and checking that the engine initialization completes successfully.
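
For the first guidance step, a small helper can list which layers use which impl type. This is a sketch with stand-in classes; in a real session the `layers` mapping would come from get_layers_from_vllm_config(vllm_config, Attention) as in the function quoted earlier, and `group_layers_by_impl` is a hypothetical name.

```python
from collections import defaultdict


class FlashAttentionImpl:
    """Local stub standing in for vLLM's FlashAttentionImpl."""


class OtherBackendImpl:
    """Hypothetical stand-in for any non-FA impl."""


class Layer:
    """Hypothetical stand-in for an Attention layer holding an impl."""

    def __init__(self, impl):
        self.impl = impl


def group_layers_by_impl(layers):
    """Map each impl class name to the layer names that use it."""
    groups = defaultdict(list)
    for name, layer in layers.items():
        groups[type(layer.impl).__name__].append(name)
    return dict(groups)


layers = {
    "layers.0.attn": Layer(FlashAttentionImpl()),
    "layers.1.attn": Layer(OtherBackendImpl()),
    "layers.2.attn": Layer(FlashAttentionImpl()),
}

impl_groups = group_layers_by_impl(layers)
for impl_name, names in impl_groups.items():
    print(f"{impl_name}: {names}")
```

Any key other than FlashAttentionImpl in the result points at the layers that would trip the assertion.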

Example

for layer in layers.values():
    if not isinstance(layer.impl, FlashAttentionImpl):
        continue
    sliding_window_configs.add(layer.impl.sliding_window)

Notes

The defensive fix is a more straightforward solution, but it may not address the underlying issue of why the function is iterating over all layers instead of just the ones in the builder's KV cache group. The scoped fix requires a deeper understanding of the codebase and the intentions of the original authors.

Recommendation

Apply the defensive workaround by changing the assert statement to a soft-skip, as it is a more permissive and less invasive solution that can be implemented quickly. This fix can be used as a temporary solution until a more comprehensive fix can be developed and tested.
