vllm - 💡(How to fix) Fix Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters


DRAFT — vLLM GitHub Issue (source-verified)

Status: ready for review. Once approved, post via:

gh issue create --repo vllm-project/vllm \
  --title "Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters" \
  --body-file notebooks/DFLASH_VLLM_ISSUE_DRAFT.md

Title

Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters

Body

Summary

For Gemma 4, the spec-decode draft-config builder unconditionally propagates the target's force-locked TRITON_ATTN backend to the drafter. This is correct for MTP drafters (KV-shared with target — they need the same backend) but breaks DFlash drafters (independent KV, non-causal attention) which require FLASHINFER or FLASH_ATTN.

The result: Gemma 4 + DFlash speculative decoding is structurally impossible today, even though SpeculativeConfig.attention_backend exists in the documented API.

Reproducer

vLLM nightly 0.20.2rc1.dev95+g8a4888be2, Modal H100, fresh image with flashinfer-python>=0.2.0 installed.

from vllm import LLM

llm = LLM(
    model="google/gemma-4-31B-it",
    max_model_len=4096,
    dtype="bfloat16",
    speculative_config={
        "method": "dflash",
        "model": "z-lab/gemma-4-31B-it-DFlash",
        "num_speculative_tokens": 16,
    },
)

Failing log:

INFO config.py Gemma4 model has heterogeneous head dimensions
(head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to
prevent mixed-backend numerical divergence.
INFO model.py Resolved architecture: DFlashDraftModel
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.   ← drafter inherits the lock
ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not valid
for this configuration. Reason: ['non-causal attention not supported']

Root cause (source-verified against main)

Force-lock site: vllm/model_executor/models/config.py:58-107, Gemma4Config.verify_and_update_config(). The lock fires when:

  • head_dim != global_head_dim, AND
  • max(head_dim, global_head_dim) > 256, AND
  • vllm_config.attention_config.backend is None (a deliberate escape: an explicit user backend bypasses the lock)

This part is well-designed. The user CAN bypass for the target.
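
The three conditions above can be restated as a small predicate. This is only a sketch of the logic as this issue describes it, not the vLLM source: the function name `lock_fires` and its plain arguments are hypothetical stand-ins for the real config fields.

```python
from typing import Optional


def lock_fires(head_dim: int, global_head_dim: int,
               backend: Optional[str]) -> bool:
    """Return True when the force-lock described above would apply.

    Hypothetical helper: it mirrors the three bullet conditions only.
    """
    heterogeneous = head_dim != global_head_dim
    large_head = max(head_dim, global_head_dim) > 256
    user_did_not_choose = backend is None  # explicit backend = escape hatch
    return heterogeneous and large_head and user_did_not_choose


# Gemma 4 values from the failing log: head_dim=256, global_head_dim=512.
print(lock_fires(256, 512, None))          # True: lock applies
print(lock_fires(256, 512, "FLASH_ATTN"))  # False: explicit backend bypasses it
```

The second call is the escape clause: any explicit backend on the target skips the lock.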

The actual blocker: vllm/v1/spec_decode/gemma4.py:140-160, _create_draft_vllm_config(), actively propagates the target's locked backend to the drafter, negating the escape clause:

def _create_draft_vllm_config(self) -> VllmConfig:
    """Preserve the target's forced TRITON_ATTN backend for draft layers.

    Gemma4 forces TRITON_ATTN due to heterogeneous head dimensions
    (head_dim=256 sliding, global_head_dim=512 full). The base class
    resets attention_config.backend to None for draft models, causing
    sliding layers to fall back to FLASH_ATTN which cannot handle
    KV-shared cache. Override to carry the target's backend through.
    """
    base = super()._create_draft_vllm_config()
    target_backend = self.vllm_config.attention_config.backend
    if target_backend is not None:
        base = replace(base, attention_config=replace(
            base.attention_config, backend=target_backend,
        ))
    return base

The comment is explicit: this propagation is needed because Gemma4 MTP shares its KV cache with the target. The base-class behavior (reset drafter backend to None → drafter picks its own) is correct for independent drafters like DFlash; the override is correct for MTP only.

Why the concern doesn't apply to DFlash

The "mixed-backend numerical divergence" risk is legitimate for INTRA-FORWARD layer mixing (Gemma 4's own sliding-vs-global attention layers within one forward pass).

It is not legitimate for spec-decode where target and drafter are separate nn.Modules with separate KV caches and separate forwards. They're algorithmically independent and rejection sampling tolerates numerical drift by design — that's the entire point of verifier-based speculative decoding.

The MTP case (#41745) is the exception (KV-shared with target — must inherit backend). DFlash and any other independent-KV drafter is the general case where the drafter should be free to pick its own backend.
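
To illustrate why drafter drift cannot change the output, here is a toy greedy (temperature 0) speculative loop over integer tokens. It is a sketch under stated assumptions, not vLLM's implementation: `target`, `exact_drafter`, and `drifting_drafter` are made-up next-token functions, and the target is called once per token here, whereas a real engine verifies a whole proposal in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def plain_greedy(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Reference decoding: the target alone, one token at a time."""
    out = list(prompt)
    for _ in range(n):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, drafter: NextToken,
                       prompt: List[int], n: int, k: int) -> List[int]:
    """Toy greedy speculative decoding with k proposed tokens per step."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # Drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(drafter(out + proposal))
        # Target verifies: accept the matching prefix, stop at first mismatch.
        for tok in proposal:
            if len(out) >= goal or tok != target(out):
                break
            out.append(tok)
        # The target always contributes the next token itself
        # (a correction after a mismatch, a bonus after full acceptance).
        if len(out) < goal:
            out.append(target(out))
    return out


def target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + 7) % 50


def exact_drafter(ctx: List[int]) -> int:
    return target(ctx)


def drifting_drafter(ctx: List[int]) -> int:
    # Simulated numerical drift: wrong whenever the context length divides by 3.
    return target(ctx) if len(ctx) % 3 else (target(ctx) + 1) % 50


reference = plain_greedy(target, [1, 2, 3], 20)
print(speculative_greedy(target, exact_drafter, [1, 2, 3], 20, 4) == reference)
print(speculative_greedy(target, drifting_drafter, [1, 2, 3], 20, 4) == reference)
```

Every emitted token equals what the target would have produced at that position, so a drifting drafter only lowers the acceptance rate and never alters the text.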

Precedent in sibling project

vllm-ascend (the sibling project for Ascend NPUs) already merged the equivalent decoupling.

Porting the design intent to the vllm-project/vllm CUDA path is the obvious next step.

Related

  • #38887 — Gemma 4 E4B slow on TRITON_ATTN (open since v0.19.0, no maintainer triage)
  • #41789 — Gemma 4 31B MTP draft acceptance 0.2% (different bug, same area of code)
  • #41745 — Gemma4 MTP speculative decoding support (cc @lucianommartins — natural reviewer; thanks for #41745, this issue is the inverse case for independent drafters)
  • vLLM stable docs already document SpeculativeConfig.attention_backend: AttentionBackendEnum | None = None ("Attention backend to use for the draft model. When None, the backend is automatically selected.") — the field exists, but the Gemma4 MTP-specific override at vllm/v1/spec_decode/gemma4.py:140 overrides it.

Proposed fix path

Two candidates, escalating:

(b) Method-aware propagation (smallest, lowest reviewer friction)

  • Make the propagation in vllm/v1/spec_decode/gemma4.py:_create_draft_vllm_config conditional on the spec-decode method requiring KV-sharing (i.e. MTP), not unconditional for "any Gemma 4 spec decode."
  • For non-KV-shared methods (DFlash and any future independent drafter), let the base class's reset to backend=None stand, so the drafter goes through Gemma4Config.verify_and_update_config's existing escape clause and picks FLASHINFER/FLASH_ATTN.
  • ~30-50 LOC + one regression test.
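
A minimal sketch of what (b) could look like, using stand-in dataclasses so it runs without vLLM. Everything named here is an assumption for illustration: `AttentionConfig` and `DraftConfig` are simplified stand-ins for the real config objects, and `KV_SHARED_METHODS`, `create_draft_config`, and the method strings are hypothetical, not the patch on the branch.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Assumption: the set of spec-decode methods whose drafter shares KV with the target.
KV_SHARED_METHODS = frozenset({"mtp"})


@dataclass(frozen=True)
class AttentionConfig:
    """Stand-in for the attention config; only the backend field is modeled."""
    backend: Optional[str] = None


@dataclass(frozen=True)
class DraftConfig:
    """Stand-in for the draft config returned by the base class."""
    attention_config: AttentionConfig


def create_draft_config(base: DraftConfig, target_backend: Optional[str],
                        method: str) -> DraftConfig:
    """Propagate the target backend only for KV-shared methods."""
    if method in KV_SHARED_METHODS and target_backend is not None:
        return replace(base, attention_config=replace(
            base.attention_config, backend=target_backend))
    # Independent drafters keep the base-class reset (backend=None).
    return base


base = DraftConfig(AttentionConfig())  # base class already reset backend to None
print(create_draft_config(base, "TRITON_ATTN", "mtp").attention_config.backend)
print(create_draft_config(base, "TRITON_ATTN", "dflash").attention_config.backend)
```

With these stand-ins, the MTP drafter inherits TRITON_ATTN while the DFlash drafter keeps `None` and is left free to auto-select.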

(a) Honor SpeculativeConfig.attention_backend everywhere (proper fix)

  • When the user explicitly sets SpeculativeConfig.attention_backend, that wins over both the Gemma4 force-lock AND the MTP propagation. Default behavior unchanged for users who don't set it.
  • Mirrors vllm-ascend #7342's design.
  • ~150-250 LOC + tests.
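
The precedence that (a) describes can be sketched as a single resolver. This is an illustration of the ordering only: `resolve_draft_backend` and its parameters are hypothetical names, and backends are plain strings here rather than AttentionBackendEnum members.

```python
from typing import Optional


def resolve_draft_backend(user_draft_backend: Optional[str],
                          target_backend: Optional[str],
                          kv_shared: bool) -> Optional[str]:
    """Pick the drafter backend; None means 'auto-select'.

    Order: explicit user choice, then KV-shared inheritance, then auto.
    """
    if user_draft_backend is not None:
        return user_draft_backend   # explicit setting always wins
    if kv_shared:
        return target_backend       # MTP-style drafters follow the target
    return None                     # independent drafters choose for themselves


print(resolve_draft_backend("FLASHINFER", "TRITON_ATTN", kv_shared=False))
print(resolve_draft_backend(None, "TRITON_ATTN", kv_shared=True))
print(resolve_draft_backend(None, "TRITON_ATTN", kv_shared=False))
```

Users who never set the draft backend see the same result as today for KV-shared drafters, which matches the "default behavior unchanged" requirement.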

I'm happy to ship (b) as a PR over the next 2-3 days; if reviewers want stronger automatic behavior, we can escalate to (a) in the same PR or a follow-up.

Environment

  • vLLM: 0.20.2rc1.dev95+g8a4888be2 (nightly wheel from https://wheels.vllm.ai/nightly)
  • GPU: NVIDIA H100 (Modal Cloud)
  • Python: 3.11
  • PyTorch: 2.5.1 / CUDA 12.4
  • flashinfer-python: 0.6.10
  • Target: google/gemma-4-31B-it
  • Drafter: z-lab/gemma-4-31B-it-DFlash

Why this matters

DFlash is a major inference-cost win for any organization fine-tuning Gemma 4 (Spheron benchmarks $0.06/M-tokens with DFlash on H100 vs $0.47 standard, ~7-8× cost reduction). With the MTP-specific propagation in place, every Gemma 4 fine-tune in the ecosystem is blocked from this win for non-MTP drafters.

Measured impact of the proposed fix

We applied the (b) method-aware decoupling on Divinci-AI/vllm@gemma4-dflash-decouple and ran an A/B comparison (with-DFlash vs without-DFlash) on Modal H100, 10 mixed prompts (5 math + 5 conversational), temperature=0.0, max_new_tokens=256:

| Setup | Target | Avg speedup | Math-reasoning peak |
|---|---|---|---|
| Phase 1 | google/gemma-4-31B-it (stock) | 1.28× | 4.4× |
| Phase 2 | google/gemma-4-31B-it + Divinci QLoRA-DFO (merged bf16) | 1.18× | 4.0× |

Phase 2's QLoRA-fine-tuned target (~62 GB merged bf16) holds 92% of the stock-target speedup despite the drafter being conditioned on stock Gemma 4 hidden states — confirming the patch is broadly useful for the fine-tune ecosystem, not just stock targets. Output text was bit-identical between with-DFlash and without-DFlash runs, as expected from the verifier's lossless guarantee.
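
The headline ratios can be rechecked from the figures quoted in this issue. The snippet below is plain arithmetic on those reported numbers, not a new measurement, and the variable names are illustrative.

```python
# Average speedups reported in the table above.
stock_speedup = 1.28
finetuned_speedup = 1.18

# Share of the stock-target speedup retained by the fine-tuned target.
retention = finetuned_speedup / stock_speedup
print(f"speedup retained: {retention:.0%}")  # 92%

# Per-million-token prices quoted under "Why this matters".
standard_cost = 0.47
dflash_cost = 0.06
cost_ratio = standard_cost / dflash_cost
print(f"cost reduction: {cost_ratio:.1f}x")  # 7.8x
```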

— @mikeumus / divinci.ai
