vllm - ✅(Solved) Fix [Tracking] TurboQuant + Gemma 4 multimodal: 5-gate blocker stack [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#41403 · Fetched 2026-05-01 05:33:48
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 1

I am trying to enable TurboQuant KV cache compression (turboquant_4bit_nc) on Gemma 4 31B FP8 (the multimodal Gemma4ForConditionalGeneration variant) on consumer Blackwell hardware (4× RTX 5060 Ti, sm_120, TP=4). Walking through the failure modes systematically, I found 5 distinct gates that block this configuration on main today. I am filing this as a tracking/discussion issue because the gates compose: fixing Gate 1 reveals Gate 2, fixing Gate 2 reveals Gate 3, and so on. Hopefully it is useful as a roadmap item for anyone working on TurboQuant + multimodal hybrid models.

TL;DR motivation: on 4× 16 GB consumer Blackwell at TP=4, native 262,144 context with fp8_per_token_head doesn't fit (~3.04 GiB needed per GPU vs ~2.15 GiB available at gpu-util 0.97; estimated max 170,560). TurboQuant 4-bit KV would give ~4× compression and unlock native context. So the value proposition is real, not hypothetical.

Error Message

ValueError: Selected backend AttentionBackendEnum.TURBOQUANT is not valid for this configuration. Reason: ['kv_cache_dtype not supported', 'partial multimodal token full attention not supported']



PR fix notes

PR #39931: [Feature] TurboQuant: support hybrid models and uniform quantization

Description (problem / solution / changelog)

TurboQuant support for hybrid models

This PR fixes TurboQuant startup for hybrid models such as Qwen3.5, Qwen3-Next, and similar architectures.

Previously, TurboQuant would fail with a NotImplementedError as soon as it encountered Mamba layers. With this change, hybrid models are now handled correctly: TurboQuant is applied only to full_attention layers.

Additional fixes

While enabling proper hybrid support, this PR also fixes three additional issues:

  • Page-size planner mismatch: The hybrid page-size planner was sizing attention pages using the standard formula, which does not match TurboQuant's packed K|V layout. As a result, every TurboQuant attention layer could trigger an assertion in the page merger. The planner now uses the TurboQuant-specific layout.

  • Incorrect backend selection for excluded layers: If a layer was excluded from TurboQuant — for example because it was a skipped layer, a sliding-window layer, or a Mamba layer — the ROCm/CUDA backend selector could still incorrectly force the TURBOQUANT backend. These layers now correctly fall back to the default backend.

  • ROCm flash_attn_varlen_func incompatibility: On ROCm, upstream flash_attn_varlen_func does not accept out=. A lightweight wrapper now detects that case and copies the result only when needed.
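
The wrapper idea in the last bullet can be illustrated in isolation. The following is a minimal sketch, not the PR's actual code: `call_with_optional_out` and the two stand-in functions are hypothetical names, and plain Python lists stand in for tensors. It assumes the wrapper decides by inspecting the callee's signature; the PR may detect the case differently.

```python
import inspect


def call_with_optional_out(fn, *args, out=None, **kwargs):
    """Call fn, passing out= only when fn's signature accepts it.

    Hypothetical helper for illustration; not a vLLM or flash-attn API.
    """
    if out is None:
        return fn(*args, **kwargs)
    if "out" in inspect.signature(fn).parameters:
        # The callee can write into the caller's buffer directly.
        return fn(*args, out=out, **kwargs)
    # Fallback: compute normally, then copy into the caller's buffer.
    # With torch tensors this step would be out.copy_(result).
    result = fn(*args, **kwargs)
    out[:] = result
    return out


def add_without_out(a, b):
    """Stand-in for a kernel entry point that rejects out=."""
    return [x + y for x, y in zip(a, b)]


def add_with_out(a, b, out=None):
    """Stand-in for a kernel entry point that accepts out=."""
    result = [x + y for x, y in zip(a, b)]
    if out is None:
        return result
    out[:] = result
    return out


buffer_a = [0, 0]
buffer_b = [0, 0]
call_with_optional_out(add_without_out, [1, 2], [3, 4], out=buffer_a)
call_with_optional_out(add_with_out, [1, 2], [3, 4], out=buffer_b)
```

Either way the caller ends up with the result in its own buffer, and the extra copy only happens on the path where the callee cannot take `out=`.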

Summary

Overall, this makes TurboQuant work reliably on hybrid architectures while preserving the current behavior and baselines for dense models.

Changed files

  • tests/quantization/test_turboquant.py (modified, +82/-4)
  • vllm/engine/arg_utils.py (modified, +3/-17)
  • vllm/model_executor/layers/quantization/turboquant/config.py (modified, +65/-6)
  • vllm/platforms/interface.py (modified, +36/-0)


[Tracking] TurboQuant KV cache + Gemma 4 31B (multimodal) — full blocker stack


Environment

  • vLLM 0.20.1rc1.dev119+gc74e90b9e.cu132 = origin/main @ 39a7f4f (2026-04-29) + #40391 (lisp19's rework HEAD c74e90b9)
  • 4× RTX 5060 Ti (Blackwell consumer, sm_120), CUDA 13.2, Ubuntu 25.10
  • Gemma 4 31B FP8 (compressed-tensors block FP8 weights)
  • Launch base flags: --tensor-parallel-size 4 --max-model-len 32768 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.92 --enforce-eager
  • Target KV dtype: --kv-cache-dtype turboquant_4bit_nc (chose _4bit_nc over _k8v4 because sm_86 lacks FP8 hardware in Triton — same caveat applies as a guard for users on Ampere)

The 5 gates (encountered in order)

Gate 1 — partial multimodal token full attention not supported

ValueError: Selected backend AttentionBackendEnum.TURBOQUANT is not valid for this configuration.
Reason: ['kv_cache_dtype not supported', 'partial multimodal token full attention not supported']

Root cause: Gemma4ForConditionalGeneration sets use_bidirectional_attention='vision' (added in #40534), which propagates use_mm_prefix=True on full-attention layers. TurboQuantAttentionBackend.supports_mm_prefix() returns False.

Workaround tested: --hf-overrides '{"text_config":{"use_bidirectional_attention":null}}' removes the flag and the rejection. Sacrifices vision quality (causal vs bidi attention on image tokens) — undesirable for a vision model.

Real fix: implement mm_prefix handling in TurboQuantAttentionBackend (kernel + metadata changes for partial multimodal token attention).
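
To make the causal vs bidirectional distinction concrete, here is a small pure-Python sketch of the kind of mask that partial multimodal attention implies: text tokens attend causally, while tokens inside one contiguous image span may also attend to later tokens of that same span. This is an illustration of the concept only. `mm_prefix_mask` is a hypothetical name, and the sketch makes no claim about how vLLM's kernels or metadata represent this.

```python
def mm_prefix_mask(is_image):
    """Return allowed[q][k]: may query position q attend to key position k?

    is_image[i] is True when token i belongs to an image. Hypothetical
    illustration; not a vLLM API.
    """
    n = len(is_image)
    # Give each contiguous run of image tokens its own span id (None for text).
    span = [None] * n
    current = -1
    for i, img in enumerate(is_image):
        if img:
            if i == 0 or not is_image[i - 1]:
                current += 1
            span[i] = current

    allowed = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            same_image = span[q] is not None and span[q] == span[k]
            row.append(causal or same_image)
        allowed.append(row)
    return allowed


# Sequence: text, image, image, text
mask = mm_prefix_mask([False, True, True, False])
```

With the `use_bidirectional_attention` override from the workaround, the `same_image` term is effectively dropped and every row becomes purely causal, which is why the workaround costs vision quality.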


Gate 2 — boundary-skipped layers fall back to auto dtype

After clearing Gate 1:

ValueError: Selected backend AttentionBackendEnum.TURBOQUANT is not valid for this configuration.
Reason: ['kv_cache_dtype not supported']

Root cause: TurboQuantConfig.get_boundary_skip_layers(num_layers, n=2) auto-skips the first 2 and last 2 layers from TurboQuant compression. For those layers, vllm/model_executor/layers/attention/attention.py:248-262 sets kv_cache_dtype="auto". Then get_attn_backend(...) is called for each layer with the user-selected backend (TURBOQUANT) + per-layer dtype. TurboQuantAttentionBackend.supports_kv_cache_dtype("auto") returns False → reject.

So the model is globally configured for one backend, but layers 0/1/N-2/N-1 want a different dtype that backend can't serve. Currently no per-layer routing.

Workaround tested: monkey-patch get_boundary_skip_layers to return [] (no skipping). All 60 layers get TurboQuant. Sacrifices boundary protection — first/last layers tend to be the most quantization-sensitive, so this is a quality cost.

Real fix: per-layer attention backend routing. When a layer is skipped from TurboQuant, that specific layer should be wired to a backend that can handle its actual dtype (e.g., TRITON_ATTN), not have the global backend reject it.
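
A rough sketch of what per-layer routing could look like, assuming the boundary-skip behaviour described above (first n and last n layers). Both functions are hypothetical stand-ins written for this illustration; they are not the in-tree `get_boundary_skip_layers` or `get_attn_backend`, and the backend names are used only as labels.

```python
def boundary_skip_layers(num_layers, n=2):
    """Indices of the first n and last n layers (assumed skip policy)."""
    first = set(range(min(n, num_layers)))
    last = set(range(max(num_layers - n, 0), num_layers))
    return sorted(first | last)


def route_backend(layer_idx, skipped,
                  quant_backend="TURBOQUANT",
                  fallback_backend="TRITON_ATTN"):
    """Pick a backend per layer instead of rejecting the whole config."""
    return fallback_backend if layer_idx in skipped else quant_backend


skipped = boundary_skip_layers(60)
plan = {i: route_backend(i, set(skipped)) for i in range(60)}
```

For a 60-layer model this keeps layers 0, 1, 58 and 59 on the fallback backend with their "auto" dtype, while the other 56 layers use TurboQuant. The tested monkey-patch corresponds to `boundary_skip_layers(60, n=0)`, which returns an empty list.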


Gate 3 — Alberto-Codes/turboquant-vllm plugin conflict

After clearing Gates 1+2 with the workarounds:

RuntimeError: Worker failed with error 'TQ4FullAttentionSpec.__init__() got an unexpected keyword argument 'tq_slot_size''

Root cause: the turboquant-vllm plugin (1.5.0) registers a backend via vllm.general_plugins entry point. Its TQ4FullAttentionSpec class has a constructor signature that doesn't match the current in-tree TurboQuant API (signature drifted post-merge of #38479).

Workaround: pip uninstall -y turboquant-vllm. Easy if you know about it; surprising if you don't.

Real fix: the plugin upstream needs to track in-tree changes, OR vLLM's plugin loading should skip plugins that fail compatibility checks gracefully, OR the in-tree TQ4FullAttentionSpec should accept legacy kwargs with a deprecation path.
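
One way a compatibility check could work is to compare a spec class's constructor signature against the keyword arguments the caller intends to pass, before any worker starts. The sketch below is an assumption about how such a check might look. `accepts_kwargs`, `LegacySpec` and `CurrentSpec` are hypothetical; the real TQ4FullAttentionSpec signatures are not reproduced here, and only the `tq_slot_size` keyword comes from the error above.

```python
import inspect


def accepts_kwargs(cls, names):
    """True if cls.__init__ accepts every keyword in names."""
    params = inspect.signature(cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True  # a **kwargs catch-all accepts anything
    return all(name in params for name in names)


class LegacySpec:
    """Stand-in for a plugin class with an older constructor."""

    def __init__(self, block_size, head_size):
        self.block_size = block_size
        self.head_size = head_size


class CurrentSpec:
    """Stand-in for a class that knows the newer keyword."""

    def __init__(self, block_size, head_size, tq_slot_size):
        self.block_size = block_size
        self.head_size = head_size
        self.tq_slot_size = tq_slot_size


required = ["block_size", "head_size", "tq_slot_size"]
legacy_ok = accepts_kwargs(LegacySpec, required)
current_ok = accepts_kwargs(CurrentSpec, required)
```

A loader using a check like this could log a warning and skip the incompatible plugin, rather than surfacing a TypeError from inside a worker.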


Gate 4 — ninja missing for Triton runtime JIT

After clearing Gates 1+2+3:

RuntimeError: Worker failed with error '[Errno 2] No such file or directory: 'ninja'

Root cause: TurboQuant's Triton kernels need ninja for runtime JIT compilation. Not auto-installed via vllm[turboquant] extras, not flagged in setup.py, not documented in docs.

Workaround: pip install ninja into the venv.

Real fix: add ninja to TurboQuant's runtime requirements (or a [turboquant] extras_require), or document it as a hard dep in the TurboQuant docs page.
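
Until the dependency is declared or documented, a preflight check before launching the server would surface the problem earlier than a worker crash. A minimal sketch using only the standard library; `missing_build_tools` is a hypothetical helper, not part of vLLM.

```python
import shutil


def missing_build_tools(tools=("ninja",), which=shutil.which):
    """Return the tools that cannot be found on PATH.

    The which parameter is injectable so the check can be exercised
    without depending on the local machine.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_build_tools()
if missing:
    print("Missing build tools:", ", ".join(missing),
          "(for ninja, try: pip install ninja)")
```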


Gate 5 — hybrid head_dim page-size unification fails (THE STOPPING POINT)

After clearing Gates 1+2+3+4:

File "vllm/v1/core/kv_cache_utils.py", line 1035, in unify_kv_cache_spec_page_size
NotImplementedError: The page size of the layer is not divisible by the maximum page size.
Cannot unify by adjusting block_size.

Root cause: Gemma 4's heterogeneous head_dim (sliding=256, full=512) produces different per-layer page sizes. With turboquant_4bit_nc, the math comes out non-divisible, so vLLM's KV unification raises rather than padding.

Workaround: none — this needs upstream code (LCM padding) to handle. JartX's #39931 implements exactly this (_align_hybrid_block_size + LCM-based padding in unify_kv_cache_spec_page_size). Still open as of 2026-04-30.

Real fix: merge #39931, OR include its core LCM logic in a different PR that doesn't bundle the hybrid-guard removal.
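
The difference between the current divisibility check and an LCM-based approach can be shown with plain arithmetic. The sketch below is one possible reading of "LCM-based padding" and is not taken from #39931. The page sizes are made-up illustrative numbers, not the real TurboQuant layout sizes for head_dim 256 and 512, and both function names are hypothetical.

```python
import math


def unify_by_divisibility(page_sizes):
    """Mimics the failing check: every size must divide the largest one."""
    largest = max(page_sizes)
    if any(largest % size != 0 for size in page_sizes):
        raise NotImplementedError(
            "The page size of the layer is not divisible by the maximum page size."
        )
    return largest


def unify_by_lcm(page_sizes):
    """Common page size that every per-layer size divides (Python 3.9+)."""
    return math.lcm(*page_sizes)


# Power-of-two sizes unify trivially.
plain = unify_by_divisibility([1024, 2048])

# Hypothetical packed sizes with per-slot overhead: 2080 % 1056 != 0.
packed = [1056, 2080]
try:
    unify_by_divisibility(packed)
    divisible = True
except NotImplementedError:
    divisible = False

common = unify_by_lcm(packed)
```

The trade-off is that the LCM can be much larger than either page size (68,640 here), so a real implementation has to bound the padding cost. That detail is left to the PR.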


Recommended ordering for closing the gaps

These are roughly in order of how often they'll bite users + how much new code each requires:

  1. Gate 4 (ninja dep) — trivial, document or add to extras
  2. Gate 5 (#39931) — already has a PR, needs review
  3. Gate 3 (plugin compat) — coordinate with Alberto-Codes/turboquant-vllm maintainer
  4. Gate 2 (per-layer backend routing) — significant architectural change, but useful beyond TurboQuant (any model with per-layer dtype variance)
  5. Gate 1 (mm_prefix on TurboQuant backend) — most engineering work, but highest unlock value (multimodal models are increasingly the default)

Closing 1+2+5 alone unblocks Gemma 4 text-only TurboQuant on hybrid hardware. Closing all five unlocks the multimodal serving config most production users actually want.

Hardware/perf context for prioritization

We're running Gemma 4 31B FP8 TP=4 on 4× 16 GB consumer Blackwell. With fp8_per_token_head (lisp19's #40391, validated) we top out at ~170k context, short of the model's native 262k. TurboQuant 4-bit KV (~4× compression vs fp8) would let the same 2.15 GiB per-GPU KV budget hold the equivalent of ~8.6 GiB of fp8 KV cache, easily fitting native 262k.
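
The memory argument is simple arithmetic on the figures quoted in this issue (3.04 GiB needed per GPU at 262,144 context with fp8, 2.15 GiB available). The sketch below takes the ~4× compression ratio as an assumption from the text rather than a measured value.

```python
def kv_fit(needed_fp8_gib, available_gib, compression):
    """Check whether the KV cache fits before and after compression."""
    needed_compressed_gib = needed_fp8_gib / compression
    return {
        "fits_fp8": needed_fp8_gib <= available_gib,
        "needed_compressed_gib": needed_compressed_gib,
        "fits_compressed": needed_compressed_gib <= available_gib,
        # How much fp8 KV the same budget would be worth after compression.
        "fp8_equivalent_budget_gib": available_gib * compression,
    }


# Figures from the issue; compression=4.0 is the claimed ratio.
result = kv_fit(needed_fp8_gib=3.04, available_gib=2.15, compression=4.0)
```

Under these numbers the compressed cache needs about 0.76 GiB per GPU, well inside the 2.15 GiB budget. A more conservative ratio of 2× would need 1.52 GiB, which also fits.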

Same dynamic on 2× RTX 3090 (sm_86, 48 GB total) for non-multimodal variants — TurboQuant works on Qwen 3.6 27B AWQ-INT4 there with turboquant_4bit_nc (verified, ~22× concurrency at 32k context). The multimodal Gemma 4 case is what the 5 gates above are really about.

Reproduction recipe

# Clone + checkout PR head + uninstall plugin if present
git clone https://github.com/vllm-project/vllm.git
cd vllm
git fetch origin pull/40391/head:pr-40391
git checkout pr-40391
pip install -e . --no-build-isolation
pip uninstall -y turboquant-vllm  # if installed
pip install ninja                  # gate 4 workaround

# Reproduces Gates 1 → 2 → 5 in order:
vllm serve <gemma4-31b-fp8-multimodal> \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enforce-eager \
  --attention-backend TURBOQUANT \
  --kv-cache-dtype turboquant_4bit_nc

# To skip Gate 1: --hf-overrides '{"text_config":{"use_bidirectional_attention":null}}'
# To skip Gate 2: monkey-patch TurboQuantConfig.get_boundary_skip_layers to return []
# (no workaround for Gate 5 without #39931)

Tagging

cc @lisp19 (#40391 author), @JartX (#39931 author), @lucianommartins (#40534/Gemma 4 vision-bidi), @mgoin, @WoosukKwon — happy to provide any additional diagnostic data.

extent analysis

TL;DR

To resolve the issues with enabling TurboQuant KV cache compression on Gemma 4 31B FP8, address the five gates in the recommended order, starting with the trivial fixes like documenting or adding the ninja dependency.

Guidance

  1. Address Gate 4: Document or add ninja to TurboQuant's runtime requirements to prevent the ninja missing error.
  2. Review and merge #39931: This pull request addresses Gate 5 by implementing LCM padding for hybrid head_dim page-size unification.
  3. Coordinate with the turboquant-vllm plugin maintainer: Resolve the plugin compatibility issue (Gate 3) by updating the plugin to match the current in-tree TurboQuant API.
  4. Implement per-layer backend routing: Change the architecture so that different layers can use different attention backends, addressing Gate 2. This is a significant change.
  5. Implement mm_prefix handling in TurboQuantAttentionBackend: Finally, tackle Gate 1 by enabling support for partial multimodal token attention in the TurboQuant attention backend.

Example

No specific code snippet is provided as the fixes involve a range of changes from documentation updates to significant architectural modifications.

Notes

The order of addressing these gates is crucial, as fixing one gate may reveal the next. The recommended ordering prioritizes the simplest fixes first, which can unblock certain configurations, and then moves on to more complex changes.

Recommendation

Apply the workarounds and fixes in the recommended order, starting with the simplest ones like addressing the ninja dependency issue and reviewing the merge of #39931, as these can quickly unblock certain configurations and provide a foundation for the more complex fixes.

Extra Tips

  • Ensure thorough testing after each fix to verify that the changes do not introduce new issues.
  • Consider the performance and hardware implications of each fix, especially when dealing with memory-intensive models like Gemma
