vllm - ✅(Solved) Fix Automatically select highest priority batch-invariant attention backend [1 pull requests, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#40173 · Fetched 2026-04-18 05:52:11
Comments: 4 · Participants: 4 · Timeline events: 19 · Reactions: 0
Timeline (top): mentioned ×5, subscribed ×5, commented ×4, labeled ×2

Error Message

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'. Please use --attention-backend or

Fix Action

Fixed

PR fix notes

PR #40184: [Bugfix] Fix auto-selection for batch-invariant attention backends

Description (problem / solution / changelog)

Purpose

Fixes #40173.

This change makes attention backend selection batch-invariance-aware, so an explicit backend override is no longer required. The following changes were made:

  • adds supports_batch_invariance() to AttentionBackend, defaulting to False
  • adds is_batch_invariant to AttentionSelectorConfig
  • updates validate_configuration() to reject backends that do not support batch invariance when VLLM_BATCH_INVARIANT=1
  • marks the relevant backends (FLASH_ATTN, TRITON_ATTN, FLASH_ATTN_MLA, TRITON_MLA) as batch-invariant-capable
  • allows early batch-invariant init to defer validation when attention_config.backend is unset, so the selector can auto-pick the highest-priority compatible backend
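
The first three bullets can be pictured with a small model. This is a minimal sketch, not the code from the PR: `Backend`, `SelectorConfig`, `OtherBackend`, and the reason string are hypothetical stand-ins, and the assumption that `validate_configuration()` returns a list of rejection reasons is made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectorConfig:
    """Hypothetical stand-in for AttentionSelectorConfig."""
    is_batch_invariant: bool = False


class Backend:
    """Hypothetical stand-in for AttentionBackend."""
    name = "BASE"

    @classmethod
    def supports_batch_invariance(cls) -> bool:
        # Default is False, so every backend has to opt in explicitly.
        return False

    @classmethod
    def validate_configuration(cls, config: SelectorConfig) -> list:
        # Assumed contract: return the reasons this backend cannot be used.
        # An empty list means the backend is acceptable.
        reasons = []
        if config.is_batch_invariant and not cls.supports_batch_invariance():
            reasons.append("batch invariance not supported")
        return reasons


class FlashAttnBackend(Backend):
    name = "FLASH_ATTN"

    @classmethod
    def supports_batch_invariance(cls) -> bool:
        return True


class OtherBackend(Backend):
    # Hypothetical backend that keeps the default and does not opt in.
    name = "OTHER"


invariant = SelectorConfig(is_batch_invariant=True)
print(FlashAttnBackend.validate_configuration(invariant))
print(OtherBackend.validate_configuration(invariant))
```

Defaulting the capability to False means a newly added backend is excluded from batch-invariant runs until someone marks it as capable.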

Test Plan

CUDA_VISIBLE_DEVICES=0 .venv/bin/python -m pytest tests/v1/attention/test_batch_invariant_backend_validation.py -v
VLLM_BATCH_INVARIANT=1 CUDA_VISIBLE_DEVICES=0 .venv/bin/python - <<'PY'
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-1.7B",
    max_num_seqs=8,
    max_model_len=2048,
    gpu_memory_utilization=0.5,
)

out = llm.generate(
    ["Hello"],
    SamplingParams(max_tokens=8, temperature=0.0, seed=42),
)
print(out[0].outputs[0].text)
PY

Hardware used for the runtime validation:

  • NVIDIA H100 80GB HBM3
  • torch.cuda.get_device_capability(0) == (9, 0)

Test Result

Before fix (H100):

  • running the no-backend smoke repro above failed during engine init with:

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'.

After fix (same H100, same command):

  • engine initialized successfully with no explicit --attention-backend
  • selector auto-picked a compatible backend:
Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'TRITON_ATTN'].
  • generation completed successfully
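
The before/after difference comes down to how the early batch-invariant check treats an unset backend. The sketch below models that under stated assumptions: the function names are hypothetical, the message text is adapted from the error quoted above, and `SOME_OTHER_BACKEND` is a made-up name used only to trigger the rejection path.

```python
from typing import Optional

# Backend names taken from the error message quoted in this issue.
SUPPORTED = ("FLASH_ATTN", "TRITON_ATTN", "FLASH_ATTN_MLA", "TRITON_MLA")


def _reject(backend: Optional[str]) -> None:
    raise RuntimeError(
        "VLLM batch_invariant mode requires an attention backend in "
        f"{list(SUPPORTED)}, but got '{backend}'."
    )


def early_check_before_fix(backend: Optional[str]) -> None:
    # Old behavior (modelled): an unset backend is treated as unsupported.
    if backend not in SUPPORTED:
        _reject(backend)


def early_check_after_fix(backend: Optional[str]) -> None:
    # New behavior (modelled): an unset backend defers the decision to the
    # selector, which can then pick a compatible backend by priority.
    if backend is None:
        return
    if backend not in SUPPORTED:
        _reject(backend)


early_check_after_fix(None)           # passes: validation is deferred
early_check_after_fix("FLASH_ATTN")   # passes: explicitly supported
try:
    early_check_before_fix(None)
except RuntimeError as exc:
    print(exc)
```

An explicitly requested but unsupported backend is still rejected early in both versions; only the unset case changes.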

Unit test coverage:

tests/v1/attention/test_batch_invariant_backend_validation.py::test_validate_configuration_rejects_batch_invariant_unsupported_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_validate_configuration_accepts_batch_invariant_supported_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_get_attn_backend_threads_batch_invariance PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_override_envs_for_invariance_allows_auto_selected_backend PASSED
tests/v1/attention/test_batch_invariant_backend_validation.py::test_override_envs_for_invariance_rejects_unsupported_backend PASSED


Changed files

  • tests/v1/attention/test_batch_invariant_backend_validation.py (added, +196/-0)
  • vllm/model_executor/layers/batch_invariant.py (modified, +26/-22)
  • vllm/v1/attention/backend.py (modified, +7/-0)
  • vllm/v1/attention/backends/flash_attn.py (modified, +4/-0)
  • vllm/v1/attention/backends/mla/flashattn_mla.py (modified, +4/-0)
  • vllm/v1/attention/backends/mla/triton_mla.py (modified, +4/-0)
  • vllm/v1/attention/backends/triton_attn.py (modified, +4/-0)
  • vllm/v1/attention/selector.py (modified, +5/-1)

Issue Description

By default, attention backends are selected according to the attention backend priority list. This can be overridden using --attention-backend (and equivalent CLI/Python flags). However, if batch invariance is enabled (VLLM_BATCH_INVARIANT=1), the batch-invariant init requires an explicit selection. If no backend is specified, it fails with the following error:

RuntimeError: VLLM batch_invariant mode requires an attention backend in ['FLASH_ATTN', 'TRITON_ATTN', 'FLASH_ATTN_MLA', 'TRITON_MLA'], but got 'None'. Please use --attention-backend or

Instead, the attention backend selector should be aware of batch invariance and select the highest-priority backend that supports it. This will likely require adding a supports_batch_invariance method to the AttentionBackend class so the selector can query it.

Reviewers: @yewentao256 @MatthewBonanni

extent analysis

TL;DR

Modify the attention backend selector to choose a backend that supports batch invariance when VLLM_BATCH_INVARIANT=1.

Guidance

  • Identify the AttentionBackend class and add a supports_batch_invariance attribute to it, allowing the selector to query this property.
  • Update the attention backend selector to prioritize backends that support batch invariance when VLLM_BATCH_INVARIANT=1.
  • Verify that the updated selector correctly chooses a compatible backend by testing with VLLM_BATCH_INVARIANT=1 and checking the selected backend.
  • Consider adding a fallback or default backend that supports batch invariance to handle cases where no priority backend is compatible.
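
One way to picture the verification and fallback points is a selector that filters a priority-ordered list and reports both its pick and the remaining candidates, similar in spirit to the "out of potential backends" log line in the PR. This is a sketch under assumptions: `Candidate`, `PRIORITY`, `pick`, and `HYPOTHETICAL_FAST` are invented for illustration, and the real priority list may differ.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    batch_invariant: bool


# Hypothetical priority order, highest priority first.
PRIORITY = [
    Candidate("HYPOTHETICAL_FAST", False),
    Candidate("FLASH_ATTN", True),
    Candidate("TRITON_ATTN", True),
]


def pick(priority, require_batch_invariance):
    # Keep the priority order and drop candidates that lack the capability.
    potential = [
        c.name
        for c in priority
        if c.batch_invariant or not require_batch_invariance
    ]
    if not potential:
        # Fail with a clear message rather than silently using a backend
        # that would break batch invariance.
        raise ValueError("no attention backend supports batch invariance")
    return potential[0], potential


chosen, potential = pick(PRIORITY, require_batch_invariance=True)
print(f"Using {chosen} out of potential backends: {potential}")
```

Raising when nothing is compatible is a design choice in this sketch: falling back to an incompatible backend would defeat the purpose of enabling batch invariance.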

Example

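# Illustrative stand-in, not vLLM's actual AttentionBackend class.
class AttentionBackend:
    def __init__(self, name, priority, supports_batch_invariance=False):
        self.name = name
        self.priority = priority  # assumed convention: larger value = higher priority
        self.supports_batch_invariance = supports_batch_invariance

# Example backends (the PR marks both FLASH_ATTN and TRITON_ATTN as batch-invariant-capable)
backends = [
    AttentionBackend('FLASH_ATTN', priority=2, supports_batch_invariance=True),
    AttentionBackend('TRITON_ATTN', priority=1, supports_batch_invariance=True),
]

# Selector example
def select_backend(backends, batch_invariant):
    if batch_invariant:
        compatible_backends = [b for b in backends if b.supports_batch_invariance]
        if not compatible_backends:
            # max() raises on an empty sequence, so fail with a clear message instead
            raise RuntimeError('No attention backend supports batch invariance')
        return max(compatible_backends, key=lambda b: b.priority)
    # ... existing selector logic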

Notes

This solution assumes that the AttentionBackend class can be modified and that the selector can be updated to query the supports_batch_invariance attribute.

Recommendation

Apply the fix from PR #40184, which makes the attention backend selector choose a backend that supports batch invariance when VLLM_BATCH_INVARIANT=1. On builds without that change, passing --attention-backend with one of the backends listed in the error message avoids the failure.
