vllm - ✅(Solved) Fix Automatically select highest priority batch-invariant attention backend [1 pull requests, 4 comments, 4 participants]

vllm · 2026-04-17 19:35:21


vllm-project/vllm#40173•Fetched 2026-04-18 05:52:11


Error Message

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'. Please use --attention-backend or

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix auto-selection for batch-invariant attention backends (https://github.com/vllm-project/vllm/pull/40184)

PR fix notes

PR #40184: [Bugfix] Fix auto-selection for batch-invariant attention backends

Repository: vllm-project/vllm
Author: bedeks
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40184

Description (problem / solution / changelog)

Purpose

Fixes #40173.

This change makes attention backend selection batch-invariance-aware instead of requiring an explicit override. The following changes were made:

adds supports_batch_invariance() to AttentionBackend, defaulting to False
adds is_batch_invariant to AttentionSelectorConfig
updates validate_configuration() to reject backends that do not support batch invariance when VLLM_BATCH_INVARIANT=1
marks the relevant backends (FLASH_ATTN, TRITON_ATTN, FLASH_ATTN_MLA, TRITON_MLA) as batch-invariant-capable
allows early batch-invariant init to defer validation when attention_config.backend is unset, so the selector can auto-pick the highest-priority compatible backend
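
The changes listed above can be pictured with a minimal, self-contained sketch. It is not vLLM's code: `BackendSketch`, `SelectorConfigSketch`, `FlashAttnSketch`, `OtherBackendSketch`, and `pick_backend` are hypothetical stand-ins, and the convention that validation returns a list of rejection reasons is an assumption made for illustration. Only the names `supports_batch_invariance`, `is_batch_invariant`, and `validate_configuration` come from the PR description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorConfigSketch:
    """Hypothetical stand-in for AttentionSelectorConfig."""
    is_batch_invariant: bool = False


class BackendSketch:
    """Hypothetical stand-in for AttentionBackend."""
    name = "BASE"

    @classmethod
    def supports_batch_invariance(cls) -> bool:
        # Default is False, so a backend must opt in explicitly.
        return False

    @classmethod
    def validate_configuration(cls, config: SelectorConfigSketch) -> list[str]:
        # Assumed convention: an empty list means the backend is usable.
        reasons = []
        if config.is_batch_invariant and not cls.supports_batch_invariance():
            reasons.append("batch invariance not supported")
        return reasons


class FlashAttnSketch(BackendSketch):
    name = "FLASH_ATTN"

    @classmethod
    def supports_batch_invariance(cls) -> bool:
        return True


class OtherBackendSketch(BackendSketch):
    # Hypothetical backend that does not opt in.
    name = "OTHER"


def pick_backend(priority_list, config):
    """Return the first backend, in priority order, that passes validation."""
    for backend in priority_list:
        if not backend.validate_configuration(config):
            return backend
    raise RuntimeError("no compatible attention backend")
```

With `is_batch_invariant=True`, `pick_backend([OtherBackendSketch, FlashAttnSketch], ...)` skips the first entry and returns `FlashAttnSketch`; with the default config it returns the first entry unchanged, which mirrors the idea that the priority order is preserved and only incompatible backends are filtered out.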

Test Plan

CUDA_VISIBLE_DEVICES=0 .venv/bin/python -m pytest tests/v1/attention/test_batch_invariant_backend_validation.py -v

VLLM_BATCH_INVARIANT=1 CUDA_VISIBLE_DEVICES=0 .venv/bin/python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-1.7B",
    max_num_seqs=8,
    max_model_len=2048,
    gpu_memory_utilization=0.5,
)

out = llm.generate(
    ["Hello"],
    SamplingParams(max_tokens=8, temperature=0.0, seed=42),
)
print(out[0].outputs[0].text)
PY

Hardware used for the runtime validation:

NVIDIA H100 80GB HBM3
torch.cuda.get_device_capability(0) == (9, 0)
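
The smoke test above only confirms that the engine starts and generates. A complementary check, sketched here under assumptions, is to compare the output for one prompt when it is run alone against its output when it shares a batch with other prompts. The helper name `same_output_alone_and_batched` and the `generate` callable are hypothetical; the commented wrapper shows one way it might be wired to the `LLM` object from the test plan.

```python
from typing import Callable, Sequence


def same_output_alone_and_batched(
    generate: Callable[[Sequence[str]], list[str]],
    prompt: str,
    fillers: Sequence[str],
) -> bool:
    """Return True if `prompt` yields the same text alone and inside a batch.

    `generate` takes a list of prompts and returns one text per prompt,
    in the same order.
    """
    alone = generate([prompt])[0]
    batched = generate([prompt, *fillers])[0]
    return alone == batched


# Possible wiring to the LLM from the test plan (not executed here):
#
# def generate(prompts):
#     params = SamplingParams(max_tokens=8, temperature=0.0, seed=42)
#     outs = llm.generate(list(prompts), params)
#     return [o.outputs[0].text for o in outs]
#
# print(same_output_alone_and_batched(generate, "Hello", ["Hi", "Hey"]))
```

Taking the generator as a callable keeps the comparison logic testable without a GPU; a single matching run is evidence, not proof, of batch invariance.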

Test Result

Before fix (H100):

running the no-backend smoke repro above failed during engine init with:

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'.

After fix (same H100, same command):

engine initialized successfully with no explicit --attention-backend
selector auto-picked a compatible backend:

Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'TRITON_ATTN'].

generation completed successfully

Unit test coverage:

tests/v1/attention/test_batch_invariant_backend_validation.py::test_validate_configuration_rejects_batch_invariant_unsupported_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_validate_configuration_accepts_batch_invariant_supported_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_get_attn_backend_threads_batch_invariance PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_override_envs_for_invariance_allows_auto_selected_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_override_envs_for_invariance_rejects_unsupported_backend PASSED


Changed files

tests/v1/attention/test_batch_invariant_backend_validation.py (added, +196/-0)
vllm/model_executor/layers/batch_invariant.py (modified, +26/-22)
vllm/v1/attention/backend.py (modified, +7/-0)
vllm/v1/attention/backends/flash_attn.py (modified, +4/-0)
vllm/v1/attention/backends/mla/flashattn_mla.py (modified, +4/-0)
vllm/v1/attention/backends/mla/triton_mla.py (modified, +4/-0)
vllm/v1/attention/backends/triton_attn.py (modified, +4/-0)
vllm/v1/attention/selector.py (modified, +5/-1)

Issue description

By default, attention backends are selected according to the attention backend priority list. This can be overridden using --attention-backend (and equivalent CLI/Python flags). However, if batch invariance is enabled (VLLM_BATCH_INVARIANT=1), the batch-invariant initialization requires an explicit selection. If nothing is specified, it produces the following error:

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'. Please use --attention-backend or

Instead, the attention backend selector should be aware of batch invariance and select the highest-priority backend that supports it. This will likely require adding a supports_batch_invariance method to the AttentionBackend class so the selector can query it.

Reviewers: @yewentao256 @MatthewBonanni
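
The behaviour requested in the issue, where an unset backend is no longer an error at early init, can be sketched as follows. This is an illustration only: `check_backend_early` and `SOME_OTHER_BACKEND` are hypothetical names, and the real early-init code in vLLM may be structured differently. The supported list and the wording of the error are taken from the message quoted above.

```python
from typing import Optional

# Backends listed in the error message from the issue.
SUPPORTED_BACKENDS = ["FLASH_ATTN", "TRITON_ATTN", "FLASH_ATTN_MLA", "TRITON_MLA"]


def check_backend_early(backend: Optional[str]) -> bool:
    """Validate an explicitly chosen backend, or defer when none is set.

    Returns True if the backend was validated now, and False if validation
    is deferred so that the selector can auto-pick a compatible backend.
    """
    if backend is None:
        # Nothing chosen yet: leave the decision to the selector.
        return False
    if backend not in SUPPORTED_BACKENDS:
        raise RuntimeError(
            "VLLM batch_invariant mode requires an attention backend in "
            f"{SUPPORTED_BACKENDS}, but got '{backend}'."
        )
    return True


# Usage: an unset backend defers, a supported one passes, others raise.
print(check_backend_early(None))
print(check_backend_early("FLASH_ATTN"))
try:
    check_backend_early("SOME_OTHER_BACKEND")  # hypothetical backend name
except RuntimeError as err:
    print(err)
```

The design point is that an explicit but unsupported choice still fails fast, while the absence of a choice is treated as a request for auto-selection rather than as a misconfiguration.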

extent analysis

TL;DR

Modify the attention backend selector to choose a backend that supports batch invariance when VLLM_BATCH_INVARIANT=1.

Guidance

Locate the AttentionBackend class and add a supports_batch_invariance() method (defaulting to False) so the selector can query it.
Update the attention backend selector to prioritize backends that support batch invariance when VLLM_BATCH_INVARIANT=1.
Verify that the updated selector correctly chooses a compatible backend by testing with VLLM_BATCH_INVARIANT=1 and checking the selected backend.
Consider adding a fallback or default backend that supports batch invariance to handle cases where no priority backend is compatible.

Example

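from dataclasses import dataclass


@dataclass
class AttentionBackend:
    """Toy stand-in for illustration; not vLLM's real AttentionBackend class."""
    name: str
    priority: int  # higher value = preferred (assumed convention for this sketch)
    supports_batch_invariance: bool = False


# Example backends; the PR marks both FLASH_ATTN and TRITON_ATTN as batch-invariant-capable
backends = [
    AttentionBackend('FLASH_ATTN', priority=2, supports_batch_invariance=True),
    AttentionBackend('TRITON_ATTN', priority=1, supports_batch_invariance=True),
]


# Selector example
def select_backend(backends, batch_invariant):
    candidates = backends
    if batch_invariant:
        # Keep only backends that can run in batch-invariant mode
        candidates = [b for b in backends if b.supports_batch_invariance]
    if not candidates:
        # max() on an empty list would raise a less helpful ValueError
        raise RuntimeError('No compatible attention backend found')
    return max(candidates, key=lambda b: b.priority)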

Notes

This solution assumes that the AttentionBackend class can be modified and that the selector can be updated to query the supports_batch_invariance attribute.

Recommendation

Make the attention backend selector choose a backend that supports batch invariance when VLLM_BATCH_INVARIANT=1; this addresses the error directly. Until that change is available, the workaround suggested by the error message is to pass --attention-backend explicitly with one of the listed backends.
