vllm - ✅(Solved) Fix [Bug]: Per-attention-head quantization is currently available only with the Flash Attention backend and requires the calibration pathway provided by llm-compressor. [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#40444 · Fetched 2026-04-22 07:45:34
Comments: 0 · Participants: 1 · Timeline: 2 · Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #40448: [Bugfix] Clarify FLASHINFER limitation for per-attention-head KV quantization

Description (problem / solution / changelog)

Fixes #40444

Summary

  • raise a clear validation error when per-attention-head KV-cache quantization is requested with the FLASHINFER backend
  • add regression coverage for the FLASHINFER failure path while keeping the existing FLASH_ATTN success path
  • document that per-attention-head KV-cache quantization currently requires FLASH_ATTN
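The fail-fast behaviour described in the summary can be pictured with a minimal sketch. This is not the code from the PR: the `Backend` enum, the `PER_HEAD_KV_QUANT_BACKENDS` set, and `validate_kv_quant` are hypothetical names chosen for illustration, and the only facts taken from the source are the backend names and the `attn_head` strategy string.

```python
from enum import Enum
from typing import Optional


class Backend(Enum):
    # Hypothetical stand-ins for the backend names mentioned in the PR.
    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


# Assumption for this sketch: only FLASH_ATTN handles per-attention-head KV scales.
PER_HEAD_KV_QUANT_BACKENDS = frozenset({Backend.FLASH_ATTN})


def validate_kv_quant(backend: Backend, kv_strategy: Optional[str]) -> None:
    """Raise early if per-head KV quantization is requested on an unsupported backend."""
    if kv_strategy == "attn_head" and backend not in PER_HEAD_KV_QUANT_BACKENDS:
        supported = ", ".join(sorted(b.value for b in PER_HEAD_KV_QUANT_BACKENDS))
        raise ValueError(
            "Per-attention-head KV-cache quantization is not supported by "
            f"{backend.value}; supported backends: {supported}."
        )


# Success path: FLASH_ATTN with per-head scales passes silently.
validate_kv_quant(Backend.FLASH_ATTN, "attn_head")

# Failure path: FLASHINFER with per-head scales is rejected before any kernel runs.
try:
    validate_kv_quant(Backend.FLASHINFER, "attn_head")
except ValueError as err:
    print(err)
```

Checking at configuration time rather than inside the attention kernel means the user sees the limitation named in the error message instead of a later shape or dtype failure.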

Why this is not duplicate work

  • I checked issue/PR state for #40444 and did not find an open PR that already addresses this fix
  • this PR does not add FlashInfer kernel support; it makes the current backend limitation explicit and fail-fast

Test plan

  • uv run --no-project python -m py_compile vllm/v1/attention/backend.py tests/test_attention_backend_registry.py tests/quantization/test_compressed_tensors.py
  • ./.venv/bin/python -m pytest tests/test_attention_backend_registry.py -k per_head_quant_scale_support -v
  • ./.venv/bin/python -m pytest tests/quantization/test_compressed_tensors.py -k per_attn_head -v

AI assistance

  • This PR was prepared with AI assistance, and the final changes were reviewed by a human submitter.

Changed files

  • docs/features/quantization/quantized_kvcache.md (modified, +1/-1)
  • tests/quantization/test_compressed_tensors.py (modified, +16/-0)
  • tests/test_attention_backend_registry.py (modified, +50/-0)
  • vllm/v1/attention/backends/flashinfer.py (modified, +42/-0)

Your current environment

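Per-attention-head quantization is currently available only with the Flash Attention backend and requires the calibration pathway provided by llm-compressor. Does the FlashInfer backend support kernels for per-attention-head quantization? Is there a corresponding kernel implementation? If the quantized model uses per-attention-head FP8 with the FlashInfer backend, does the attention computation use a dedicated kernel, or does it fall back to the FlashAttention 3 FP8 computation?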

🐛 Describe the bug

"quantization_config": { "config_groups": { "group_0": { "format": "dense", "input_activations": { "actorder": null, "block_structure": null, "dynamic": false, "group_size": null, "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {}, "scale_dtype": null, "strategy": "attn_head", "symmetric": true, "type": "float", "zp_dtype": null }, "output_activations": null, "targets": [ "Qwen3VLTextAttention" ], "weights": null }, "group_1": { "format": "float-quantized", "input_activations": { "actorder": null, "block_structure": null, "dynamic": true, "group_size": 128, "num_bits": 8, "observer": null, "observer_kwargs": {}, "scale_dtype": null, "strategy": "group", "symmetric": true, "type": "float", "zp_dtype": null }, "output_activations": null, "targets": [ "Linear" ], "weights": { "actorder": null, "block_structure": [ 128, 128 ], "dynamic": false, "group_size": null, "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {}, "scale_dtype": null, "strategy": "block", "symmetric": true, "type": "float", "zp_dtype": null } } }, "format": "float-quantized", "global_compression_ratio": null, "ignore": [ "model.visual.blocks.0.attn.qkv", "model.visual.blocks.0.attn.proj", "model.visual.blocks.0.mlp.linear_fc1", "model.visual.blocks.0.mlp.linear_fc2", "model.visual.blocks.1.attn.qkv", "model.visual.blocks.1.attn.proj", "model.visual.blocks.1.mlp.linear_fc1", "model.visual.blocks.1.mlp.linear_fc2", "model.visual.blocks.2.attn.qkv", "model.visual.blocks.2.attn.proj", "model.visual.blocks.2.mlp.linear_fc1", "model.visual.blocks.2.mlp.linear_fc2", "model.visual.blocks.3.attn.qkv", "model.visual.blocks.3.attn.proj", "model.visual.blocks.3.mlp.linear_fc1", "model.visual.blocks.3.mlp.linear_fc2", "model.visual.blocks.4.attn.qkv", "model.visual.blocks.4.attn.proj", "model.visual.blocks.4.mlp.linear_fc1", "model.visual.blocks.4.mlp.linear_fc2", "model.visual.blocks.5.attn.qkv", "model.visual.blocks.5.attn.proj", "model.visual.blocks.5.mlp.linear_fc1", 
"model.visual.blocks.5.mlp.linear_fc2", "model.visual.blocks.6.attn.qkv", "model.visual.blocks.6.attn.proj", "model.visual.blocks.6.mlp.linear_fc1", "model.visual.blocks.6.mlp.linear_fc2", "model.visual.blocks.7.attn.qkv", "model.visual.blocks.7.attn.proj", "model.visual.blocks.7.mlp.linear_fc1", "model.visual.blocks.7.mlp.linear_fc2", "model.visual.blocks.8.attn.qkv", "model.visual.blocks.8.attn.proj", "model.visual.blocks.8.mlp.linear_fc1", "model.visual.blocks.8.mlp.linear_fc2", "model.visual.blocks.9.attn.qkv", "model.visual.blocks.9.attn.proj", "model.visual.blocks.9.mlp.linear_fc1", "model.visual.blocks.9.mlp.linear_fc2", "model.visual.blocks.10.attn.qkv", "model.visual.blocks.10.attn.proj", "model.visual.blocks.10.mlp.linear_fc1", "model.visual.blocks.10.mlp.linear_fc2", "model.visual.blocks.11.attn.qkv", "model.visual.blocks.11.attn.proj", "model.visual.blocks.11.mlp.linear_fc1", "model.visual.blocks.11.mlp.linear_fc2", "model.visual.blocks.12.attn.qkv", "model.visual.blocks.12.attn.proj", "model.visual.blocks.12.mlp.linear_fc1", "model.visual.blocks.12.mlp.linear_fc2", "model.visual.blocks.13.attn.qkv", "model.visual.blocks.13.attn.proj", "model.visual.blocks.13.mlp.linear_fc1", "model.visual.blocks.13.mlp.linear_fc2", "model.visual.blocks.14.attn.qkv", "model.visual.blocks.14.attn.proj", "model.visual.blocks.14.mlp.linear_fc1", "model.visual.blocks.14.mlp.linear_fc2", "model.visual.blocks.15.attn.qkv", "model.visual.blocks.15.attn.proj", "model.visual.blocks.15.mlp.linear_fc1", "model.visual.blocks.15.mlp.linear_fc2", "model.visual.blocks.16.attn.qkv", "model.visual.blocks.16.attn.proj", "model.visual.blocks.16.mlp.linear_fc1", "model.visual.blocks.16.mlp.linear_fc2", "model.visual.blocks.17.attn.qkv", "model.visual.blocks.17.attn.proj", "model.visual.blocks.17.mlp.linear_fc1", "model.visual.blocks.17.mlp.linear_fc2", "model.visual.blocks.18.attn.qkv", "model.visual.blocks.18.attn.proj", "model.visual.blocks.18.mlp.linear_fc1", 
"model.visual.blocks.18.mlp.linear_fc2", "model.visual.blocks.19.attn.qkv", "model.visual.blocks.19.attn.proj", "model.visual.blocks.19.mlp.linear_fc1", "model.visual.blocks.19.mlp.linear_fc2", "model.visual.blocks.20.attn.qkv", "model.visual.blocks.20.attn.proj", "model.visual.blocks.20.mlp.linear_fc1", "model.visual.blocks.20.mlp.linear_fc2", "model.visual.blocks.21.attn.qkv", "model.visual.blocks.21.attn.proj", "model.visual.blocks.21.mlp.linear_fc1", "model.visual.blocks.21.mlp.linear_fc2", "model.visual.blocks.22.attn.qkv", "model.visual.blocks.22.attn.proj", "model.visual.blocks.22.mlp.linear_fc1", "model.visual.blocks.22.mlp.linear_fc2", "model.visual.blocks.23.attn.qkv", "model.visual.blocks.23.attn.proj", "model.visual.blocks.23.mlp.linear_fc1", "model.visual.blocks.23.mlp.linear_fc2", "model.visual.merger.linear_fc1", "model.visual.merger.linear_fc2", "model.visual.deepstack_merger_list.0.linear_fc1", "model.visual.deepstack_merger_list.0.linear_fc2", "model.visual.deepstack_merger_list.1.linear_fc1", "model.visual.deepstack_merger_list.1.linear_fc2", "model.visual.deepstack_merger_list.2.linear_fc1", "model.visual.deepstack_merger_list.2.linear_fc2", "lm_head" ], "kv_cache_scheme": { "actorder": null, "block_structure": null, "dynamic": false, "group_size": null, "num_bits": 8, "observer": "memoryless_minmax", "observer_kwargs": {}, "scale_dtype": null, "strategy": "attn_head", "symmetric": true, "type": "float", "zp_dtype": null }, "quant_method": "compressed-tensors", "quantization_status": "compressed", "sparsity_config": {}, "transform_config": {}, "version": "0.15.0.1" },
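To see which parts of a config like the one above request per-attention-head quantization, a short script can walk the JSON and report every block whose `strategy` is `attn_head`. The sketch below uses a trimmed copy of the reported config (most fields and the whole `ignore` list are omitted), and `find_strategy` is a hypothetical helper, not part of vLLM or compressed-tensors.

```python
import json

# Trimmed copy of the reported config; only fields relevant to the question are kept.
CONFIG_TEXT = """
{
  "config_groups": {
    "group_0": {
      "input_activations": {"num_bits": 8, "strategy": "attn_head", "type": "float"},
      "targets": ["Qwen3VLTextAttention"],
      "weights": null
    },
    "group_1": {
      "input_activations": {"num_bits": 8, "strategy": "group", "type": "float"},
      "targets": ["Linear"],
      "weights": {"num_bits": 8, "strategy": "block", "type": "float"}
    }
  },
  "kv_cache_scheme": {"num_bits": 8, "strategy": "attn_head", "type": "float"},
  "quant_method": "compressed-tensors"
}
"""


def find_strategy(node, wanted, path=""):
    """Yield the dotted path of every dict whose "strategy" equals `wanted`."""
    if isinstance(node, dict):
        if node.get("strategy") == wanted:
            yield path
        for key, value in node.items():
            yield from find_strategy(value, wanted, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_strategy(value, wanted, f"{path}[{index}]")


config = json.loads(CONFIG_TEXT)
per_head_paths = sorted(find_strategy(config, "attn_head"))
print(per_head_paths)
# ['config_groups.group_0.input_activations', 'kv_cache_scheme']
```

Both the attention input activations and the KV cache use the `attn_head` strategy here, which is the combination the question about FlashInfer support refers to.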

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

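The issue asks whether the FlashInfer backend has kernels for per-attention-head (attn_head) FP8 KV-cache quantization. According to PR #40448, this mode currently requires FLASH_ATTN; the fix does not add FlashInfer kernel support but makes FLASHINFER fail fast with a clear validation error.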

Guidance

  • Review the quantization_config section, in particular the format, num_bits, and strategy fields. A kv_cache_scheme with strategy attn_head is what requests per-attention-head KV-cache quantization, which the PR documents as requiring FLASH_ATTN.
  • Verify that the attn_head strategy in group_0 is applied to the intended target, Qwen3VLTextAttention.
  • Check that the ignore list does not inadvertently exclude components that should be quantized. In the reported config it contains only model.visual.* modules and lm_head.
  • To isolate the problem, test the model with a simplified quantization configuration.
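The ignore-list check from the guidance above can be automated with a few lines. This is a sketch under stated assumptions: `IGNORE` is a shortened sample of the reported list (the full list covers visual blocks 0 to 23), `outside_expected` is a hypothetical helper, and the `model.language_model...` name in the second call is an invented example of an entry that would deserve a closer look.

```python
# Shortened sample of the reported "ignore" list.
IGNORE = [
    "model.visual.blocks.0.attn.qkv",
    "model.visual.blocks.0.attn.proj",
    "model.visual.blocks.0.mlp.linear_fc1",
    "model.visual.blocks.0.mlp.linear_fc2",
    "model.visual.merger.linear_fc1",
    "model.visual.deepstack_merger_list.0.linear_fc1",
    "lm_head",
]


def outside_expected(ignore, prefixes=("model.visual.",), exact=("lm_head",)):
    """Return ignored names that match neither an expected prefix nor an exact name."""
    return [
        name
        for name in ignore
        if not name.startswith(prefixes) and name not in exact
    ]


# Every sampled entry is a vision module or lm_head, so nothing is flagged.
unexpected = outside_expected(IGNORE)
print(unexpected)  # []

# Hypothetical entry: a text-attention projection in the list would be flagged.
flagged = outside_expected(IGNORE + ["model.language_model.layers.0.self_attn.q_proj"])
print(flagged)
```

An empty result suggests the ignore list only skips the vision tower and the output head, so the text attention layers remain subject to the `attn_head` scheme.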

Example

The issue does not include a failing command or traceback, so no reproduction can be given. The relevant part of the reported config is the kv_cache_scheme block, which uses 8-bit float quantization with the attn_head strategy.

Notes

The provided information does not include specific error messages or symptoms, making it challenging to provide a more targeted solution. Further investigation and testing may be necessary to fully resolve the issue.

Recommendation

Apply workaround: run models quantized with per-attention-head KV-cache scales on the FLASH_ATTN backend. According to PR #40448, requesting this mode with FLASHINFER raises a validation error, and FlashInfer kernel support is not part of the fix.
