vllm - ✅(Solved) Fix [Bug]: Per-attention-head quantization is currently available only with the Flash Attention backend and requires the calibration pathway provided by llm-compressor. [1 pull requests, 1 participants]

vllm • 2026-04-21 07:47:21

GitHub stats

vllm-project/vllm#40444 • Fetched 2026-04-22 07:45:34

Author

cuihangbin

Participants

cuihangbin

Timeline (top)

cross-referenced ×1, labeled ×1

Fix Action

Fix proposed (the linked PR was still open and unmerged when this page was fetched)

Proposed fix in PR: [Bugfix] Clarify FLASHINFER limitation for per-attention-head KV quantization (https://github.com/vllm-project/vllm/pull/40448)

PR fix notes

PR #40448: [Bugfix] Clarify FLASHINFER limitation for per-attention-head KV quantization

Repository: vllm-project/vllm
Author: MerIinnn
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40448

Description (problem / solution / changelog)

Fixes #40444

Summary

Raise a clear validation error when per-attention-head KV-cache quantization is requested with the FLASHINFER backend.
Add regression coverage for the FLASHINFER failure path while keeping the existing FLASH_ATTN success path.
Document that per-attention-head KV-cache quantization currently requires FLASH_ATTN.
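The fail-fast behavior described in this summary can be pictured with a minimal sketch. Every name below (`PER_HEAD_CAPABLE_BACKENDS`, `validate_kv_quant_backend`) is hypothetical and chosen for illustration; this is not the code in the PR or vLLM's actual API, only the general shape of a backend capability check.

```python
from typing import Optional

# Hypothetical capability table: per the PR summary, only FLASH_ATTN
# currently handles per-attention-head KV-cache quantization.
PER_HEAD_CAPABLE_BACKENDS = frozenset({"FLASH_ATTN"})


def validate_kv_quant_backend(backend: str, kv_cache_strategy: Optional[str]) -> None:
    """Raise early if per-head KV-cache scales are requested on an unsupported backend.

    Both the function name and its arguments are assumptions for this sketch.
    """
    if kv_cache_strategy != "attn_head":
        return  # Other strategies are outside the scope of this check.
    if backend not in PER_HEAD_CAPABLE_BACKENDS:
        supported = ", ".join(sorted(PER_HEAD_CAPABLE_BACKENDS))
        raise ValueError(
            f"Per-attention-head KV-cache quantization is not supported by the "
            f"{backend} backend; supported backends: {supported}."
        )


# Usage: the first backend is accepted, the second is rejected up front.
for backend in ("FLASH_ATTN", "FLASHINFER"):
    try:
        validate_kv_quant_backend(backend, "attn_head")
        print(f"{backend}: accepted")
    except ValueError as err:
        print(f"{backend}: rejected ({err})")
```

Failing at validation time, rather than inside a kernel launch, gives the user an error that names the unsupported combination directly.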

Why this is not duplicate work

I checked the issue/PR state for #40444 and did not find an open PR that already addresses this fix.
This PR does not add FlashInfer kernel support; it makes the current backend limitation explicit and fail-fast.

Test plan

uv run --no-project python -m py_compile vllm/v1/attention/backend.py tests/test_attention_backend_registry.py tests/quantization/test_compressed_tensors.py
./.venv/bin/python -m pytest tests/test_attention_backend_registry.py -k per_head_quant_scale_support -v
./.venv/bin/python -m pytest tests/quantization/test_compressed_tensors.py -k per_attn_head -v

AI assistance

This PR was prepared with AI assistance, and the final changes were reviewed by a human submitter.

Changed files

docs/features/quantization/quantized_kvcache.md (modified, +1/-1)
tests/quantization/test_compressed_tensors.py (modified, +16/-0)
tests/test_attention_backend_registry.py (modified, +50/-0)
vllm/v1/attention/backends/flashinfer.py (modified, +42/-0)

Your current environment

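Per-attention-head quantization is currently available only with the Flash Attention backend and requires the calibration pathway provided by llm-compressor. Does the FlashInfer backend support kernels for per-attention-head quantization? Is there a corresponding kernel implementation? If the quantized model uses per-attention-head FP8 together with the FlashInfer backend, does the attention computation have its own kernel implementation, or does it fall back to the Flash Attention 3 FP8 computation?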
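To make the term concrete, here is a toy sketch of how a per-attention-head scale differs from a single per-tensor scale. It assumes FP8 means the E4M3 format (largest finite value 448) and a symmetric min-max observer; the helper names are hypothetical, and this is not how llm-compressor or vLLM compute scales internally.

```python
# Assumption: FP8 here is E4M3, whose largest finite magnitude is 448.
FP8_E4M3_MAX = 448.0


def absmax_scale(values):
    """Symmetric min-max scale mapping the largest magnitude onto FP8_E4M3_MAX."""
    peak = max(abs(v) for v in values)
    return peak / FP8_E4M3_MAX if peak > 0 else 1.0


def per_tensor_scale(heads):
    """One scale shared by all heads."""
    return absmax_scale([v for head in heads for v in head])


def per_head_scales(heads):
    """One scale per attention head."""
    return [absmax_scale(head) for head in heads]


# Toy values for two heads with very different dynamic ranges.
heads = [[0.5, -1.0, 0.25], [40.0, -448.0, 12.0]]
print("per-tensor:", per_tensor_scale(heads))
print("per-head:  ", per_head_scales(heads))
```

With a shared scale, the small-range head is quantized using a scale sized for the large-range head; per-head scales avoid that, which is why the attention kernel has to accept one scale per head rather than a single scalar.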

🐛 Describe the bug

"quantization_config": {
  "config_groups": {
    "group_0": {
      "format": "dense",
      "input_activations": {
        "actorder": null, "block_structure": null, "dynamic": false, "group_size": null,
        "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {},
        "scale_dtype": null, "strategy": "attn_head", "symmetric": true,
        "type": "float", "zp_dtype": null
      },
      "output_activations": null,
      "targets": [ "Qwen3VLTextAttention" ],
      "weights": null
    },
    "group_1": {
      "format": "float-quantized",
      "input_activations": {
        "actorder": null, "block_structure": null, "dynamic": true, "group_size": 128,
        "num_bits": 8, "observer": null, "observer_kwargs": {},
        "scale_dtype": null, "strategy": "group", "symmetric": true,
        "type": "float", "zp_dtype": null
      },
      "output_activations": null,
      "targets": [ "Linear" ],
      "weights": {
        "actorder": null, "block_structure": [ 128, 128 ], "dynamic": false, "group_size": null,
        "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {},
        "scale_dtype": null, "strategy": "block", "symmetric": true,
        "type": "float", "zp_dtype": null
      }
    }
  },
  "format": "float-quantized",
  "global_compression_ratio": null,
  "ignore": [
    "model.visual.blocks.0.attn.qkv", "model.visual.blocks.0.attn.proj", "model.visual.blocks.0.mlp.linear_fc1", "model.visual.blocks.0.mlp.linear_fc2",
    "model.visual.blocks.1.attn.qkv", "model.visual.blocks.1.attn.proj", "model.visual.blocks.1.mlp.linear_fc1", "model.visual.blocks.1.mlp.linear_fc2",
    "model.visual.blocks.2.attn.qkv", "model.visual.blocks.2.attn.proj", "model.visual.blocks.2.mlp.linear_fc1", "model.visual.blocks.2.mlp.linear_fc2",
    "model.visual.blocks.3.attn.qkv", "model.visual.blocks.3.attn.proj", "model.visual.blocks.3.mlp.linear_fc1", "model.visual.blocks.3.mlp.linear_fc2",
    "model.visual.blocks.4.attn.qkv", "model.visual.blocks.4.attn.proj", "model.visual.blocks.4.mlp.linear_fc1", "model.visual.blocks.4.mlp.linear_fc2",
    "model.visual.blocks.5.attn.qkv", "model.visual.blocks.5.attn.proj", "model.visual.blocks.5.mlp.linear_fc1", "model.visual.blocks.5.mlp.linear_fc2",
    "model.visual.blocks.6.attn.qkv", "model.visual.blocks.6.attn.proj", "model.visual.blocks.6.mlp.linear_fc1", "model.visual.blocks.6.mlp.linear_fc2",
    "model.visual.blocks.7.attn.qkv", "model.visual.blocks.7.attn.proj", "model.visual.blocks.7.mlp.linear_fc1", "model.visual.blocks.7.mlp.linear_fc2",
    "model.visual.blocks.8.attn.qkv", "model.visual.blocks.8.attn.proj", "model.visual.blocks.8.mlp.linear_fc1", "model.visual.blocks.8.mlp.linear_fc2",
    "model.visual.blocks.9.attn.qkv", "model.visual.blocks.9.attn.proj", "model.visual.blocks.9.mlp.linear_fc1", "model.visual.blocks.9.mlp.linear_fc2",
    "model.visual.blocks.10.attn.qkv", "model.visual.blocks.10.attn.proj", "model.visual.blocks.10.mlp.linear_fc1", "model.visual.blocks.10.mlp.linear_fc2",
    "model.visual.blocks.11.attn.qkv", "model.visual.blocks.11.attn.proj", "model.visual.blocks.11.mlp.linear_fc1", "model.visual.blocks.11.mlp.linear_fc2",
    "model.visual.blocks.12.attn.qkv", "model.visual.blocks.12.attn.proj", "model.visual.blocks.12.mlp.linear_fc1", "model.visual.blocks.12.mlp.linear_fc2",
    "model.visual.blocks.13.attn.qkv", "model.visual.blocks.13.attn.proj", "model.visual.blocks.13.mlp.linear_fc1", "model.visual.blocks.13.mlp.linear_fc2",
    "model.visual.blocks.14.attn.qkv", "model.visual.blocks.14.attn.proj", "model.visual.blocks.14.mlp.linear_fc1", "model.visual.blocks.14.mlp.linear_fc2",
    "model.visual.blocks.15.attn.qkv", "model.visual.blocks.15.attn.proj", "model.visual.blocks.15.mlp.linear_fc1", "model.visual.blocks.15.mlp.linear_fc2",
    "model.visual.blocks.16.attn.qkv", "model.visual.blocks.16.attn.proj", "model.visual.blocks.16.mlp.linear_fc1", "model.visual.blocks.16.mlp.linear_fc2",
    "model.visual.blocks.17.attn.qkv", "model.visual.blocks.17.attn.proj", "model.visual.blocks.17.mlp.linear_fc1", "model.visual.blocks.17.mlp.linear_fc2",
    "model.visual.blocks.18.attn.qkv", "model.visual.blocks.18.attn.proj", "model.visual.blocks.18.mlp.linear_fc1", "model.visual.blocks.18.mlp.linear_fc2",
    "model.visual.blocks.19.attn.qkv", "model.visual.blocks.19.attn.proj", "model.visual.blocks.19.mlp.linear_fc1", "model.visual.blocks.19.mlp.linear_fc2",
    "model.visual.blocks.20.attn.qkv", "model.visual.blocks.20.attn.proj", "model.visual.blocks.20.mlp.linear_fc1", "model.visual.blocks.20.mlp.linear_fc2",
    "model.visual.blocks.21.attn.qkv", "model.visual.blocks.21.attn.proj", "model.visual.blocks.21.mlp.linear_fc1", "model.visual.blocks.21.mlp.linear_fc2",
    "model.visual.blocks.22.attn.qkv", "model.visual.blocks.22.attn.proj", "model.visual.blocks.22.mlp.linear_fc1", "model.visual.blocks.22.mlp.linear_fc2",
    "model.visual.blocks.23.attn.qkv", "model.visual.blocks.23.attn.proj", "model.visual.blocks.23.mlp.linear_fc1", "model.visual.blocks.23.mlp.linear_fc2",
    "model.visual.merger.linear_fc1", "model.visual.merger.linear_fc2",
    "model.visual.deepstack_merger_list.0.linear_fc1", "model.visual.deepstack_merger_list.0.linear_fc2",
    "model.visual.deepstack_merger_list.1.linear_fc1", "model.visual.deepstack_merger_list.1.linear_fc2",
    "model.visual.deepstack_merger_list.2.linear_fc1", "model.visual.deepstack_merger_list.2.linear_fc2",
    "lm_head"
  ],
  "kv_cache_scheme": {
    "actorder": null, "block_structure": null, "dynamic": false, "group_size": null,
    "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {},
    "scale_dtype": null, "strategy": "attn_head", "symmetric": true,
    "type": "float", "zp_dtype": null
  },
  "quant_method": "compressed-tensors",
  "quantization_status": "compressed",
  "sparsity_config": {},
  "transform_config": {},
  "version": "0.15.0.1"
},

extent analysis

TL;DR

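The issue asks whether the FlashInfer backend has kernels for per-attention-head FP8 quantization or whether it falls back to Flash Attention. According to PR #40448, per-attention-head KV-cache quantization currently requires the FLASH_ATTN backend; the PR does not add FlashInfer kernel support and instead makes the FLASHINFER combination fail fast with a clear validation error.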

Guidance

Check which attention backend is selected: per the PR's documentation change, per-attention-head KV-cache quantization currently requires FLASH_ATTN.
In quantization_config, note where the attn_head strategy appears: in the input_activations of group_0 (target Qwen3VLTextAttention) and in kv_cache_scheme.
The ignore list covers the model.visual.* modules and lm_head, which are left unquantized; it is unrelated to the backend question raised in the issue.
To isolate the behavior, run the same checkpoint once with FLASH_ATTN and once with FLASHINFER; with the PR applied, the FLASHINFER run is expected to stop with a validation error.
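As a rough aid for the config review described above, the following sketch lists where a compressed-tensors style quantization_config uses the attn_head strategy. The helper name `find_attn_head_usage` is hypothetical, and the sample is a heavily trimmed version of the config from the issue, keeping only the fields the check reads.

```python
import json


def find_attn_head_usage(quant_config: dict) -> list:
    """Return dotted paths of every scheme whose strategy is 'attn_head'."""
    hits = []
    kv_scheme = quant_config.get("kv_cache_scheme") or {}
    if kv_scheme.get("strategy") == "attn_head":
        hits.append("kv_cache_scheme")
    for name, group in (quant_config.get("config_groups") or {}).items():
        for field in ("input_activations", "output_activations", "weights"):
            scheme = group.get(field) or {}  # These fields may be null.
            if scheme.get("strategy") == "attn_head":
                hits.append(f"config_groups.{name}.{field}")
    return hits


# Trimmed sample based on the issue's config (most fields omitted).
sample = json.loads("""
{
  "config_groups": {
    "group_0": {
      "input_activations": {"strategy": "attn_head", "num_bits": 8},
      "output_activations": null,
      "targets": ["Qwen3VLTextAttention"],
      "weights": null
    },
    "group_1": {
      "input_activations": {"strategy": "group", "group_size": 128},
      "output_activations": null,
      "targets": ["Linear"],
      "weights": {"strategy": "block", "block_structure": [128, 128]}
    }
  },
  "kv_cache_scheme": {"strategy": "attn_head", "num_bits": 8}
}
""")

print(find_attn_head_usage(sample))
```

If the returned list includes kv_cache_scheme, the checkpoint requests per-attention-head KV-cache scales, which is the case the PR restricts to FLASH_ATTN.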

Example

The issue includes no reproduction script or error output. The only concrete artifact is the quantization_config shown above, in which both the group_0 input_activations and the kv_cache_scheme use the attn_head strategy.

Notes

The issue reports no error message or observed symptom; it is a question about kernel support. Whether the FLASHINFER path previously fell back to another kernel is not answered in the material shown here, and the PR states only that it does not add FlashInfer kernel support.

Recommendation

Apply workaround: Use the FLASH_ATTN backend when serving a checkpoint with per-attention-head KV-cache quantization, since the FLASHINFER backend does not currently support it.
