vllm - [Feature]: DFlash Partial Multimodal Token Full Attention with Gemma MoE + Drafter

  • The issue:

Gemma4 MoE (26B) is an incredible model, both for how lightweight it is and for its strong thinking and reasoning capabilities. For end users on top-line consumer Blackwell cards (5090, maybe 5080, or above), it remains the top choice for running non-GGUF local LLMs with vLLM.

On Windows, users may gravitate to either WSL or the SystemPanic vLLM fork (SystemPanic/vllm-windows), the repo recognized in the vLLM docs, to get their first CUDA inference running. The fork keeps its builds closely up to date and provides prebuilts with Blackwell support for Windows.

Gemma4's MoE architecture is patched in the latest vLLM, including in heavily optimized forks that only minimally adjust the architecture, such as SystemPanic's (which runs on the latest vLLM version). As a result, anyone running the 26B MoE in NVFP4 (modelopt, as opposed to compressed-tensors) should have an easy time getting the model up and running with the known setup recipe for the heterogeneous head dimensions. The tl;dr for Gemma4 26B MoE NVFP4 checkpoints that are in modelopt format (preferred by Nvidia and Google) rather than compressed-tensors:

$env:VLLM_ALLOW_LONG_MAX_MODEL_LEN = "1"
$env:TORCH_MATMUL_PRECISION = "high"
$env:PYTORCH_CUDA_ALLOC_CONF = "expandable_segments:True"
$env:VLLM_USE_FLASHINFER_MOE_FP4 = "0"
$env:VLLM_TEST_FORCE_FP8_MARLIN = "0"
$env:VLLM_NVFP4_GEMM_BACKEND = "flashinfer-cutlass"
$env:VLLM_USE_FLASHINFER_SAMPLER = "1"
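The recipe above is the same set of variables for both PowerShell and bash, so it can help to keep it in one place. The following is a minimal sketch, using only the Python standard library; the helper names `render_powershell` and `render_bash` are made up for illustration, and the values are copied from the recipe as given in this issue.

```python
# The environment recipe from this issue, kept as data so it can be rendered
# for PowerShell or bash (or merged into os.environ before launching vLLM).
RECIPE = {
    "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1",
    "TORCH_MATMUL_PRECISION": "high",
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",
    "VLLM_TEST_FORCE_FP8_MARLIN": "0",
    "VLLM_NVFP4_GEMM_BACKEND": "flashinfer-cutlass",
    "VLLM_USE_FLASHINFER_SAMPLER": "1",
}


def render_powershell(recipe):
    """One `$env:NAME = "value"` line per variable (hypothetical helper)."""
    return [f'$env:{name} = "{value}"' for name, value in recipe.items()]


def render_bash(recipe):
    """One `export NAME="value"` line per variable (hypothetical helper)."""
    return [f'export {name}="{value}"' for name, value in recipe.items()]


print("\n".join(render_powershell(RECIPE)))
print()
print("\n".join(render_bash(RECIPE)))
```

Keeping the recipe as a dictionary also makes it easy to pass the same values as the `env` of a subprocess instead of exporting them in the shell.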

On sm120 and sm121 (essentially, Blackwell), running the MoE in the preferred modelopt NVFP4 variant (the non-compressed-tensors checkpoints, which are the good ones, loaded without passing --quantization modelopt explicitly; don't pass it) currently means one cannot use DFlash "drafter" models for speculative decoding.
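To check whether a machine falls in the sm120/sm121 range mentioned here, one option is to read the compute capability from nvidia-smi. This is a sketch under two assumptions: that the installed driver's nvidia-smi supports the `compute_cap` query field, and that sm120 and sm121 correspond to compute capability 12.0 and 12.1. The function names are illustrative only.

```python
import subprocess


def parse_compute_caps(nvidia_smi_output):
    """Turn lines such as '12.0' into (major, minor) tuples, one per GPU."""
    caps = []
    for line in nvidia_smi_output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, _, minor = line.partition(".")
        caps.append((int(major), int(minor or 0)))
    return caps


def is_sm120_or_sm121(cap):
    """True for compute capability 12.0 (sm120) or 12.1 (sm121)."""
    return cap in {(12, 0), (12, 1)}


def query_compute_caps():
    """Ask nvidia-smi for each GPU's compute capability (needs a working driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_compute_caps(out)
```

On a machine with a working driver, `[is_sm120_or_sm121(c) for c in query_compute_caps()]` gives one boolean per GPU.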

So, with a working modelopt NVFP4 Gemma4 MoE, try appending a speculative drafter model:

--speculative-config '{"method":"dflash","model":"models/gemma4-dflash-0.2b","num_speculative_tokens":15,"attention_backend":"flash_attn"}'

This will throw: "partial multimodal token full attention not supported"
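Hand-written JSON inside shell quotes is easy to get wrong, so the flag can be generated instead. The sketch below builds the same `--speculative-config` value with `json.dumps` and checks captured server output for the error text quoted above. The helper names are hypothetical, and matching on the message text is an assumption about how the failure shows up in the logs.

```python
import json
import shlex

# Error text as quoted in this issue; matched case-insensitively below.
ERROR_TEXT = "partial multimodal token full attention not supported"

spec_config = {
    "method": "dflash",
    "model": "models/gemma4-dflash-0.2b",
    "num_speculative_tokens": 15,
    "attention_backend": "flash_attn",
}


def speculative_args(config):
    """Return the two argv entries for the drafter (hypothetical helper).

    Compact separators reproduce the JSON exactly as written in the issue.
    """
    return ["--speculative-config", json.dumps(config, separators=(",", ":"))]


def log_has_dflash_error(log_text):
    """True if captured server output contains the error from this issue."""
    return ERROR_TEXT in log_text.lower()


args = speculative_args(spec_config)
# shlex.join applies POSIX shell quoting, suitable for pasting into bash.
print(shlex.join(args))
```

Passing `args` as list elements to `subprocess.run` avoids shell quoting altogether, which sidesteps differences between bash and PowerShell.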

  • Steps to reproduce on Windows (the first lines are basic environment setup and can be skipped if yours already works):
# assumed nvidia-smi working, cu130+
winget install --source winget --id Python.Python.3.12
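# install.sh is uv's POSIX installer; from PowerShell, use the install.ps1 script instead
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# Add uv to PATH (e.g. via sysdm.cpl), then open a new shell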
uv venv --python 3.12.10 --seed --managed-python
.venv\Scripts\activate
uv pip install -U vllm-0.21.0+cu132-cp312-cp312-win_amd64.whl --extra-index-url https://download.pytorch.org/whl/cu130 --index-strategy unsafe-best-match

$env:VLLM_ALLOW_LONG_MAX_MODEL_LEN = "1"
$env:TORCH_MATMUL_PRECISION = "high"
$env:PYTORCH_CUDA_ALLOC_CONF = "expandable_segments:True"
$env:VLLM_USE_FLASHINFER_MOE_FP4 = "0"
$env:VLLM_TEST_FORCE_FP8_MARLIN = "0"
$env:VLLM_NVFP4_GEMM_BACKEND = "flashinfer-cutlass"
$env:VLLM_USE_FLASHINFER_SAMPLER = "1"
vllm serve models/gemma4-modelopt-moe `
  --served-model-name gemma4 gemma4-fast gemma4-deep `
  --host 0.0.0.0 `
  --port 8000 `
  --tensor-parallel-size 1 `
  --dtype auto `
  --max-model-len 131072 `
  --max-num-seqs 22 `
  --max-num-batched-tokens 32768 `
  --gpu-memory-utilization 0.80 `
  --enable-chunked-prefill `
  --enable-prefix-caching `
  --trust-remote-code `
  --enable-auto-tool-choice `
  --tool-call-parser gemma4 `
  --speculative-config '{"method":"dflash","model":"models/gemma4-dflash-0.2b","num_speculative_tokens":15,"attention_backend":"flash_attn"}'

Note: this issue affects all versions of vLLM with Gemma4 MoE + drafter. It can be reproduced on Linux with:

export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export TORCH_MATMUL_PRECISION="high"
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_TEST_FORCE_FP8_MARLIN=0
export VLLM_NVFP4_GEMM_BACKEND="flashinfer-cutlass"
export VLLM_USE_FLASHINFER_SAMPLER=1
vllm serve models/gemma4-modelopt-moe \
--served-model-name gemma4 gemma4-fast gemma4-deep \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype auto \
--max-model-len 131072 \
--max-num-seqs 22 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.80 \
--enable-chunked-prefill \
--enable-prefix-caching \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--speculative-config '{"method":"dflash","model":"models/gemma4-dflash-0.2b","num_speculative_tokens":15,"attention_backend":"flash_attn"}'

The final line, --speculative-config '{"method":"dflash","model":"models/gemma4-dflash-0.2b","num_speculative_tokens":15,"attention_backend":"flash_attn"}', requires a build-specific patch.

  • Proposed Solution:

Docker images that ship with drafter + Gemma modelopt MoE support baked in already apply the necessary patches, so the same can be done natively. However, the path to implementation is not documented. See also tracking PR #42175, which relates to Gemma-4 speculative decoding.

There are two routes. With the prepatched Docker images, all one needs to do is: extract an ARM64 Docker image → copy out the patched Python files → rebuild the CUDA kernels for SM120 → rebuild FlashInfer → rebuild PyTorch → run natively. This essentially applies the same patchset that worked with the old gemma4_patched.py approach from when vLLM baked in MoE support.
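For the "copy out the patched Python files" step, it helps to know exactly which files the image changed before overwriting anything. The following is a minimal sketch using only the standard library. It assumes the image's site-packages has already been extracted to a local directory; the function names and directory layout are hypothetical.

```python
import difflib
import filecmp
from pathlib import Path


def changed_python_files(patched_root, installed_root):
    """Relative paths of .py files under patched_root that differ from,
    or are missing in, installed_root."""
    patched_root, installed_root = Path(patched_root), Path(installed_root)
    changed = []
    for patched in sorted(patched_root.rglob("*.py")):
        rel = patched.relative_to(patched_root)
        installed = installed_root / rel
        # shallow=False compares file contents, not just size and mtime.
        if not installed.is_file() or not filecmp.cmp(patched, installed, shallow=False):
            changed.append(rel)
    return changed


def unified_diff(patched_root, installed_root, rel):
    """Unified diff from the installed file to the patched file."""
    installed = Path(installed_root) / rel
    old = []
    if installed.is_file():
        old = installed.read_text(encoding="utf-8").splitlines(keepends=True)
    new = (Path(patched_root) / rel).read_text(encoding="utf-8").splitlines(keepends=True)
    name = Path(rel).as_posix()
    return "".join(difflib.unified_diff(old, new, f"installed/{name}", f"patched/{name}"))
```

Running `changed_python_files` against the extracted image and the local `.venv` narrows the patchset to the files that actually differ, and `unified_diff` then shows the changes for review.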

Apart from gemma4 now being patched for MoE by default, the likely area of interest on Windows is still:

\.venv\Lib\site-packages\vllm\model_executor\models\gemma4.py
\.venv\Lib\site-packages\vllm\v1\spec_decode\gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\processing_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\video_processing_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\modular_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\modeling_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\image_processing_pil_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\image_processing_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\feature_extraction_gemma4.py
\.venv\Lib\site-packages\transformers\models\gemma4\configuration_gemma4.py

The files that likely need a look are:

spec_decode/gemma4.py
model_executor/models/gemma4.py
sampling/speculative.py

or:

attention/backends/dflash.py
sampling/speculative.py
executor/worker.py
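Which of these files exist depends on the installed build, so it is worth checking before planning a patch. The sketch below locates a package's install directory without importing the package and reports which candidate paths are present. The candidate list is a subset of the paths named in this issue, and the helper names are illustrative.

```python
import importlib.util
from pathlib import Path

# A subset of the paths listed in this issue, relative to site-packages.
CANDIDATES = [
    "vllm/model_executor/models/gemma4.py",
    "vllm/v1/spec_decode/gemma4.py",
    "transformers/models/gemma4/modeling_gemma4.py",
    "transformers/models/gemma4/configuration_gemma4.py",
]


def site_packages_for(package):
    """Directory containing a top-level package, found without importing it.

    Returns None if the package is not installed or is not a package.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]).parent


def existing_candidates(root, candidates=CANDIDATES):
    """Map each candidate relative path to whether it exists under root."""
    root = Path(root)
    return {rel: (root / rel).is_file() for rel in candidates}


for pkg in ("vllm", "transformers"):
    print(pkg, "->", site_packages_for(pkg))
```

With vLLM installed, `existing_candidates(site_packages_for("vllm"))` shows which of the listed files the current build actually ships.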

Similar to how the ARM Docker references provide this feature (they are usually built for more accessible hardware like the Nvidia Spark), it would be great to get this working natively on amd64 + Blackwell.

Any approach or ideas are welcome. MoE modelopt Gemma with a spec-decode drafter model is likely the best possible setup for 5090 owners who choose vLLM, given both the 32 GB to work with and the Blackwell architecture. I, like many others, am deeply grateful to Google for the model release. Gemma MoE in NVFP4 also has much better throughput and reasoning than almost all other local LLM references, despite consumer hardware limitations. A speculative drafter model would only be a slight throughput upgrade at larger context windows, but it would still be nice to have.


Further reading and note: when running within WSL, the MoE and 'detected CUDA architecture' problems can be fixed by following Medium guides such as Allen Kuo's guide. This is an alternative to using SystemPanic's fork or a vLLM nightly (in WSL) that you may have luck with. SystemPanic's fork is quite good, but as a further reminder, running the modelopt MoE Gemma together with the drafter would still require patching partial multimodal token full attention.
