vllm - 💡(How to fix) Fix Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters

vllm · 2026-05-08 13:33:28


For Gemma 4, the spec-decode draft-config builder unconditionally propagates the target's force-locked TRITON_ATTN backend to the drafter. This is correct for MTP drafters (KV-shared with target — they need the same backend) but breaks DFlash drafters (independent KV, non-causal attention) which require FLASHINFER or FLASH_ATTN.

The result: Gemma 4 + DFlash speculative decoding is structurally impossible today, even though SpeculativeConfig.attention_backend exists in the documented API.

Error Message

INFO config.py Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
…
INFO model.py Resolved architecture: DFlashDraftModel
…
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.   ← drafter inherits the lock
…
ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not valid for this configuration. Reason: ['non-causal attention not supported']

Root Cause

DFlash is a major inference-cost win for any organization fine-tuning Gemma 4 (Spheron benchmarks $0.06/M-tokens with DFlash on H100 vs $0.47 standard, ~7-8× cost reduction). With the MTP-specific propagation in place, every Gemma 4 fine-tune in the ecosystem is blocked from this win for non-MTP drafters.

Fix Action

Fix / Workaround

In the A/B runs reported under "Measured impact of the proposed fix", the Phase 2 QLoRA-fine-tuned target (~62 GB merged bf16) retains 92% of the stock-target average speedup (1.18× vs 1.28×), even though the drafter was conditioned on stock Gemma 4 hidden states. This suggests the patch is useful across the fine-tune ecosystem, not just for stock targets. Output text was bit-identical between with-DFlash and without-DFlash runs, as expected from the verifier's lossless guarantee.


DRAFT — vLLM GitHub Issue (source-verified)

Status: ready for review. Once approved, post via:

gh issue create --repo vllm-project/vllm \
  --title "Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters" \
  --body-file notebooks/DFLASH_VLLM_ISSUE_DRAFT.md

Title

Gemma 4 + DFlash incompatible: MTP-specific backend propagation forces TRITON_ATTN on independent (DFlash) drafters

Body

Summary

The result: Gemma 4 + DFlash speculative decoding is structurally impossible today, even though SpeculativeConfig.attention_backend exists in the documented API.

Reproducer

vLLM nightly 0.20.2rc1.dev95+g8a4888be2, Modal H100, fresh image with flashinfer-python>=0.2.0 installed.

from vllm import LLM

llm = LLM(
    model="google/gemma-4-31B-it",
    max_model_len=4096,
    dtype="bfloat16",
    speculative_config={
        "method": "dflash",
        "model": "z-lab/gemma-4-31B-it-DFlash",
        "num_speculative_tokens": 16,
    },
)

Failing log:

INFO config.py Gemma4 model has heterogeneous head dimensions
(head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to
prevent mixed-backend numerical divergence.
…
INFO model.py Resolved architecture: DFlashDraftModel
…
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.
INFO cuda.py Using AttentionBackendEnum.TRITON_ATTN backend.   ← drafter inherits the lock
…
ValueError: Selected backend AttentionBackendEnum.TRITON_ATTN is not valid
for this configuration. Reason: ['non-causal attention not supported']

Root cause (source-verified against `main`)

Force-lock site: vllm/model_executor/models/config.py:58-107 — Gemma4Config.verify_and_update_config(). The lock fires when:

head_dim != global_head_dim, AND
max(head_dim, global_head_dim) > 256, AND
vllm_config.attention_config.backend is None ← deliberate escape: an explicit user backend bypasses the lock

This part is well-designed. The user CAN bypass for the target.
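
The three conditions above can be read as a single predicate. The sketch below is a standalone restatement, not the code in Gemma4Config.verify_and_update_config(); the function name `should_force_triton` and its parameters are hypothetical, and backends are plain strings purely for illustration.

```python
from typing import Optional


def should_force_triton(
    head_dim: int,
    global_head_dim: int,
    user_backend: Optional[str],
) -> bool:
    """Return True when all three lock conditions listed above hold.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    heterogeneous = head_dim != global_head_dim
    exceeds_limit = max(head_dim, global_head_dim) > 256
    no_explicit_backend = user_backend is None  # the escape clause
    return heterogeneous and exceeds_limit and no_explicit_backend


# Gemma 4 values from the failing log: head_dim=256, global_head_dim=512.
locked = should_force_triton(256, 512, None)
bypassed = should_force_triton(256, 512, "FLASHINFER")
```

With no explicit backend the lock fires (`locked` is true); naming any backend explicitly clears it (`bypassed` is false), which is the escape clause the drafter never gets to use.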

The actual blocker: vllm/v1/spec_decode/gemma4.py:140-160 — _create_draft_vllm_config() actively propagates the target's locked backend to the drafter, negating the escape clause:

def _create_draft_vllm_config(self) -> VllmConfig:
    """Preserve the target's forced TRITON_ATTN backend for draft layers.

    Gemma4 forces TRITON_ATTN due to heterogeneous head dimensions
    (head_dim=256 sliding, global_head_dim=512 full). The base class
    resets attention_config.backend to None for draft models, causing
    sliding layers to fall back to FLASH_ATTN which cannot handle
    KV-shared cache. Override to carry the target's backend through.
    """
    base = super()._create_draft_vllm_config()
    target_backend = self.vllm_config.attention_config.backend
    if target_backend is not None:
        base = replace(base, attention_config=replace(
            base.attention_config, backend=target_backend,
        ))
    return base

The comment is explicit: this propagation is needed because Gemma4 MTP shares its KV cache with the target. The base-class behavior (reset drafter backend to None → drafter picks its own) is correct for independent drafters like DFlash; the override is correct for MTP only.

Why the concern doesn't apply to DFlash

The "mixed-backend numerical divergence" risk is legitimate for INTRA-FORWARD layer mixing (Gemma 4's own sliding-vs-global attention layers within one forward pass).

It is not legitimate for spec-decode where target and drafter are separate nn.Modules with separate KV caches and separate forwards. They're algorithmically independent and rejection sampling tolerates numerical drift by design — that's the entire point of verifier-based speculative decoding.

The MTP case (#41745) is the exception (KV-shared with target — must inherit backend). DFlash and any other independent-KV drafter is the general case where the drafter should be free to pick its own backend.
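
The claim that rejection sampling tolerates drafter drift can be checked on a toy vocabulary. The sketch below uses the standard speculative-sampling acceptance rule (accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0)) and computes the emitted distribution exactly. It illustrates the general technique under that assumption and is not vLLM's sampler; `output_distribution` and the example probabilities are made up for the demonstration.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one emitted token under speculative sampling.

    p: target (verifier) probabilities, q: drafter probabilities.
    """
    # Probability that token x is drafted AND accepted:
    # q(x) * min(1, p(x) / q(x)) == min(p(x), q(x)).
    accepted = [min(px, qx) for px, qx in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(px - qx, 0.0) for px, qx in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p == q everywhere, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


target = [0.6, 0.3, 0.1]
good_drafter = [0.55, 0.35, 0.10]  # close to the target
bad_drafter = [0.10, 0.10, 0.80]   # far from the target

for drafter in (good_drafter, bad_drafter):
    out = output_distribution(target, drafter)
    # The emitted distribution matches the target in both cases.
    assert all(abs(o - t) < 1e-12 for o, t in zip(out, target))
```

In this toy model a worse drafter lowers the acceptance rate, and therefore the speedup, but leaves the emitted distribution equal to the target's.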

Precedent in sibling project

vllm-ascend (the sibling project for Ascend NPUs) already merged the equivalent decoupling:

https://github.com/vllm-project/vllm-ascend/pull/7342 by @SidaoY — "Separate attention backend for target and draft model"

Porting the design intent to the CUDA path of vllm-project/vllm is the obvious next step.

Related

#38887 — Gemma 4 E4B slow on TRITON_ATTN (open since v0.19.0, no maintainer triage)
#41789 — Gemma 4 31B MTP draft acceptance 0.2% (different bug, same area of code)
#41745 — Gemma4 MTP speculative decoding support (cc @lucianommartins — natural reviewer; thanks for #41745, this issue is the inverse case for independent drafters)
vLLM stable docs already document SpeculativeConfig.attention_backend: AttentionBackendEnum | None = None ("Attention backend to use for the draft model. When None, the backend is automatically selected.") — the field exists, but the Gemma4 MTP-specific override at vllm/v1/spec_decode/gemma4.py:140 overrides it.

Proposed fix path

Two candidates, escalating:

(b) Method-aware propagation (smallest, lowest reviewer friction)

Make the propagation in vllm/v1/spec_decode/gemma4.py:_create_draft_vllm_config conditional on the spec-decode method requiring KV-sharing (i.e. MTP), not unconditional for "any Gemma 4 spec decode."
For non-KV-shared methods (DFlash and any future independent drafter), let the base class's reset to backend=None stand, so the drafter goes through Gemma4Config.verify_and_update_config's existing escape clause and picks FLASHINFER/FLASH_ATTN.
~30-50 LOC + one regression test.
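
A minimal sketch of option (b), assuming stand-in config types rather than vLLM's real VllmConfig. The names `AttentionConfig`, `DraftConfig`, `KV_SHARED_METHODS`, and `propagate_backend` are hypothetical, as is the set of method strings treated as KV-shared; only the split between MTP and independent drafters comes from the description above.

```python
from dataclasses import dataclass, field, replace
from typing import Optional

# Assumption: spec-decode methods whose drafter shares the target's KV cache.
KV_SHARED_METHODS = frozenset({"mtp"})


@dataclass(frozen=True)
class AttentionConfig:
    backend: Optional[str] = None  # None means "auto-select"


@dataclass(frozen=True)
class DraftConfig:
    method: str
    attention_config: AttentionConfig = field(default_factory=AttentionConfig)


def propagate_backend(draft: DraftConfig, target_backend: Optional[str]) -> DraftConfig:
    """Carry the target's backend to the drafter only for KV-shared methods."""
    if target_backend is None or draft.method not in KV_SHARED_METHODS:
        # Independent drafter: keep backend=None so it selects its own.
        return draft
    return replace(
        draft,
        attention_config=replace(draft.attention_config, backend=target_backend),
    )


mtp = propagate_backend(DraftConfig(method="mtp"), "TRITON_ATTN")
dflash = propagate_backend(DraftConfig(method="dflash"), "TRITON_ATTN")
```

Here the MTP drafter ends up with TRITON_ATTN while the DFlash drafter keeps `backend=None`, which is the behavior a regression test for (b) would presumably pin down.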

(a) Honor SpeculativeConfig.attention_backend everywhere (proper fix)

When the user explicitly sets SpeculativeConfig.attention_backend, that wins over both the Gemma4 force-lock AND the MTP propagation. Default behavior unchanged for users who don't set it.
Mirrors vllm-ascend #7342's design.
~150-250 LOC + tests.
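
Option (a) amounts to a precedence order. The sketch below expresses that order as a pure function; `resolve_draft_backend` and its parameters are hypothetical and backends are plain strings, so it shows the intended priority rather than any existing vLLM code path.

```python
from typing import Optional


def resolve_draft_backend(
    user_draft_backend: Optional[str],
    target_backend: Optional[str],
) -> Optional[str]:
    """Pick the drafter's backend; None means automatic selection."""
    if user_draft_backend is not None:
        # An explicit SpeculativeConfig.attention_backend wins over everything.
        return user_draft_backend
    # Unchanged default: the drafter inherits whatever the target resolved to.
    return target_backend


explicit = resolve_draft_backend("FLASH_ATTN", "TRITON_ATTN")
default = resolve_draft_backend(None, "TRITON_ATTN")
```

Under this ordering a DFlash user would opt out of the lock by naming a backend, while users who set nothing see today's behavior.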

I'm happy to ship (b) as a PR over the next 2-3 days; if reviewers want stronger automatic behavior, escalate to (a) in the same PR or a follow-up.

Environment

vLLM: 0.20.2rc1.dev95+g8a4888be2 (nightly wheel from https://wheels.vllm.ai/nightly)
GPU: NVIDIA H100 (Modal Cloud)
Python: 3.11
PyTorch: 2.5.1 / CUDA 12.4
flashinfer-python: 0.6.10
Target: google/gemma-4-31B-it
Drafter: z-lab/gemma-4-31B-it-DFlash

Why this matters

Measured impact of the proposed fix

We applied the (b) method-aware decoupling on Divinci-AI/vllm@gemma4-dflash-decouple and ran A/B (with-DFlash vs without-DFlash) on Modal H100, 10 mixed prompts (5 math + 5 conversational), temperature=0.0, max_new_tokens=256:

| Setup | Target | Avg speedup | Math-reasoning peak |
| --- | --- | --- | --- |
| Phase 1 | `google/gemma-4-31B-it` (stock) | 1.28× | 4.4× |
| Phase 2 | `google/gemma-4-31B-it` + Divinci QLoRA-DFO (merged bf16) | 1.18× | 4.0× |
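
The share of the stock-target speedup that the fine-tuned target retains follows directly from the two rows. A quick arithmetic check, using only the averages and peaks reported above:

```python
stock_avg, finetuned_avg = 1.28, 1.18
stock_peak, finetuned_peak = 4.4, 4.0

avg_retention = finetuned_avg / stock_avg     # fraction of the average speedup retained
peak_retention = finetuned_peak / stock_peak  # fraction of the peak speedup retained

print(f"average: {avg_retention:.0%}, peak: {peak_retention:.0%}")
# average: 92%, peak: 91%
```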

— @mikeumus / divinci.ai
