vllm - 💡(How to fix) Fix [RFC]: Add Gumiho speculative decoding to vLLM [1 pull requests]

StepCodex · 2026-05-24T23:46:52Z

Fixed by PR: [Feature] Add Gumiho speculative decoding method (https://github.com/vllm-project/vllm/pull/43544)

Motivation.

Gumiho (ICML 2025) is a hybrid speculative decoding drafter:

1. The first two speculative tokens are produced autoregressively by an EAGLE-style transformer head.
2. Every additional speculative token is produced in parallel by per-step MLP heads conditioned on the embeddings and hidden states of the first two draft tokens.
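
The two phases above can be sketched as plain Python. This is a minimal illustration of the data flow only, not the code from the PR: `hybrid_draft`, `toy_step`, and `toy_heads` are hypothetical names, tokens are integers, and hidden states are lists of floats standing in for tensors.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in types: a token is an int, a hidden state is a list of floats.
Hidden = List[float]
TransformerStep = Callable[[int, Hidden], Tuple[int, Hidden]]
MlpHead = Callable[[Sequence[int], Sequence[Hidden]], int]


def hybrid_draft(
    last_token: int,
    last_hidden: Hidden,
    transformer_step: TransformerStep,
    mlp_heads: Sequence[MlpHead],
    num_speculative_tokens: int,
) -> List[int]:
    """Draft tokens: up to two sequential steps, then one MLP head per extra position."""
    tokens: List[int] = []
    hiddens: List[Hidden] = []
    token, hidden = last_token, last_hidden

    # Phase 1: the transformer head runs autoregressively for the first two positions.
    for _ in range(min(2, num_speculative_tokens)):
        token, hidden = transformer_step(token, hidden)
        tokens.append(token)
        hiddens.append(hidden)

    # Phase 2: each remaining position has its own head. Every head sees only the
    # first two draft tokens and their hidden states, so the heads do not depend on
    # each other and could be evaluated in parallel.
    num_extra = num_speculative_tokens - len(tokens)
    if num_extra > len(mlp_heads):
        raise ValueError("not enough MLP heads for the requested draft length")
    first_tokens, first_hiddens = tuple(tokens), tuple(hiddens)
    for head in mlp_heads[:num_extra]:
        tokens.append(head(first_tokens, first_hiddens))
    return tokens


# Toy stand-ins so the sketch runs without a model.
def toy_step(token: int, hidden: Hidden) -> Tuple[int, Hidden]:
    return token + 1, [x + 1.0 for x in hidden]


toy_heads: List[MlpHead] = [
    (lambda toks, hs, offset=i: toks[-1] + offset) for i in (1, 2, 3)
]

print(hybrid_draft(10, [0.0], toy_step, toy_heads, num_speculative_tokens=4))
```

With two or fewer speculative tokens, the second phase is skipped and the sketch behaves like a purely sequential drafter.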

The motivation is that early speculative tokens matter more than late ones. Each accepted prefix token unlocks all later tokens in the same draft, so allocating more compute to the first two positions (a small but high-quality transformer head) and amortising the rest with cheap parallel MLP heads yields a better acceptance-rate / drafter-latency trade-off than either a pure EAGLE drafter (all transformer, sequential) or a pure MLPSpeculator drafter (all MLP, parallel but lower quality).
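
A small calculation makes the "early tokens matter more" argument concrete. It assumes linear-chain verification, where a draft token counts only if every earlier draft token was accepted, and it uses made-up per-position acceptance probabilities rather than measurements from the paper or the PR. Under that assumption the expected number of accepted draft tokens is the sum of the running products of the per-position probabilities.

```python
from itertools import accumulate
from operator import mul
from typing import Sequence


def expected_accepted_tokens(accept_probs: Sequence[float]) -> float:
    """Expected accepted draft tokens when position k needs positions 1..k-1 accepted.

    accept_probs[i] is the (assumed) probability that position i is accepted
    given that all earlier positions were accepted.
    """
    # accumulate(..., mul) yields p1, p1*p2, p1*p2*p3, ...
    return sum(accumulate(accept_probs, mul))


# Illustrative probabilities only; these are not numbers from the RFC.
baseline = [0.7, 0.7, 0.7, 0.7]
better_first = [0.8, 0.7, 0.7, 0.7]
better_last = [0.7, 0.7, 0.7, 0.8]

early_gain = expected_accepted_tokens(better_first) - expected_accepted_tokens(baseline)
late_gain = expected_accepted_tokens(better_last) - expected_accepted_tokens(baseline)
print(f"gain from improving position 1: {early_gain:.4f}")
print(f"gain from improving position 4: {late_gain:.4f}")
```

In this toy setting, raising the first position from 0.7 to 0.8 adds about 0.25 expected tokens, while the same increase at the last position adds about 0.03, because the first probability multiplies every later term.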

Concretely, this means a lower per-step drafter latency than EAGLE/EAGLE3 when num_speculative_tokens > 2, with similar end-to-end accepted-token gains.

We have a working ROCm prototype on top of vLLM v0.16.0 and have already ported it to current main with all CI hygiene applied; the changes are small and contained.

Proposed Change.

Add "method": "gumiho" as a new V1 speculative decoding method.

https://github.com/vllm-project/vllm/pull/43544

Surface

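from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "gumiho",  # the new method proposed by this RFC
        "model": "amd/Gumiho-llama3-8b",  # pre-trained Gumiho drafter
        "num_speculative_tokens": 3,  # 2 transformer-head tokens + 1 MLP-head token
        "draft_tensor_parallel_size": 1,
    },
)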

Implementation

New drafter model vllm/model_executor/models/gumiho.py (GumihoLlamaForCausalLM + inner GumihoLlamaModel + GumihoResBlock, GumihoNoResBlock, weight loader). The transformer head reuses LlamaDecoderLayer from llama_eagle.py exactly like EAGLE does.

New proposer vllm/v1/spec_decode/gumiho.py (GumihoProposer(EagleProposer)). It overrides only two new hooks defined on SpecDecodeBaseProposer:

| Hook | Default | Gumiho |
| --- | --- | --- |
| `_init_draft_hidden_states_list` | returns `None` | returns `[sample_hidden_states]` so the sequential loop also collects hidden states |
| `_maybe_get_mlp_draft_token_ids` | returns `None` | after the 2nd draft step, calls `GumihoLlamaForCausalLM.generate_mlp_draft_token_ids(...)` and short-circuits the rest of the loop |

Both hooks are no-ops in the base class, so existing proposers (EAGLE / EAGLE3 / MTP / DFlash / draft_model / ...) are not affected.
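
The hook mechanism can be sketched as a template-method pattern. The classes below are simplified stand-ins, not the real vLLM classes: only the two hook names come from this RFC, while the signatures, the `propose` loop, and the toy `_sequential_step` are assumptions made for illustration.

```python
from typing import List, Optional


class BaseProposerSketch:
    """Simplified sequential drafter with two no-op hooks (assumes at least 1 draft token)."""

    def __init__(self, num_speculative_tokens: int) -> None:
        self.num_speculative_tokens = num_speculative_tokens

    def _init_draft_hidden_states_list(self, first_hidden: float) -> Optional[List[float]]:
        return None  # default: hidden states are not collected

    def _maybe_get_mlp_draft_token_ids(
        self, token_ids: List[int], hidden_states: Optional[List[float]]
    ) -> Optional[List[int]]:
        return None  # default: keep drafting sequentially

    def _sequential_step(self, token_id: int, hidden: float):
        # Toy stand-in for one transformer-head forward pass.
        return token_id + 1, hidden + 1.0

    def propose(self, last_token_id: int, last_hidden: float) -> List[int]:
        token_id, hidden = self._sequential_step(last_token_id, last_hidden)
        draft = [token_id]
        collected = self._init_draft_hidden_states_list(hidden)
        for _ in range(1, self.num_speculative_tokens):
            token_id, hidden = self._sequential_step(token_id, hidden)
            draft.append(token_id)
            if collected is not None:
                collected.append(hidden)
            rest = self._maybe_get_mlp_draft_token_ids(draft, collected)
            if rest is not None:
                draft.extend(rest)  # MLP heads supplied the remaining tokens
                break               # short-circuit the sequential loop
        return draft


class GumihoProposerSketch(BaseProposerSketch):
    """Overrides only the two hooks; the drafting loop itself is inherited."""

    def _init_draft_hidden_states_list(self, first_hidden):
        return [first_hidden]

    def _maybe_get_mlp_draft_token_ids(self, token_ids, hidden_states):
        remaining = self.num_speculative_tokens - len(token_ids)
        if len(token_ids) < 2 or remaining <= 0:
            return None
        # Toy stand-in for the parallel MLP heads.
        return [100 + i for i in range(remaining)]


print(BaseProposerSketch(4).propose(10, 0.0))
print(GumihoProposerSketch(4).propose(10, 0.0))
```

Because the base-class hooks return `None`, a proposer that does not override them follows exactly the same sequential path as before the hooks existed.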

New HF config wrapper vllm/transformers_utils/configs/gumiho.py (GumihoConfig), mirroring how EAGLEConfig wraps the underlying Llama backbone.

Wiring:
- SpeculativeConfig: add gumiho to SpeculativeMethod, use_eagle(), the draft-TP=1 forced set, the model-type auto-detect path, and add a config-wrap branch parallel to the EAGLE one (reusing the existing update_arch_() helper).
- gpu_model_runner.GPUModelRunner.__init__: add a method == "gumiho" branch matched before use_eagle() so we don't fall back to a plain EagleProposer.
- v1/worker/gpu/spec_decode/__init__.py: raise an explicit NotImplementedError on the V2 GPU runner.
- Model + HF-config registry entries.

Docs + example under docs/features/speculative_decoding/gumiho.md and a new --method gumiho branch in examples/features/speculative_decoding/spec_decode_offline.py.

Unit tests under tests/v1/spec_decode/test_gumiho.py (CPU-only, exercises GumihoConfig, MLP-head shape contract, OOB-head filtering, and the two hooks).
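
The branch-ordering point in the wiring list is easy to get wrong, so here is a minimal sketch of why it matters. It is not the real `GPUModelRunner.__init__`: the `EAGLE_FAMILY` set and both selector functions are hypothetical, and the only facts taken from this RFC are that `use_eagle()` will include gumiho and that the gumiho branch must be matched first.

```python
# Hypothetical membership of use_eagle() after the proposed change.
EAGLE_FAMILY = {"eagle", "eagle3", "gumiho"}


def use_eagle(method: str) -> bool:
    return method in EAGLE_FAMILY


def select_proposer_name(method: str) -> str:
    """Proposed order: the specific gumiho check comes before the generic one."""
    if method == "gumiho":
        return "GumihoProposer"
    if use_eagle(method):
        return "EagleProposer"
    raise ValueError(f"unsupported method in this sketch: {method!r}")


def select_proposer_name_wrong_order(method: str) -> str:
    """Counter-example: the generic check shadows the gumiho branch."""
    if use_eagle(method):
        return "EagleProposer"
    if method == "gumiho":  # unreachable, because use_eagle("gumiho") is True
        return "GumihoProposer"
    raise ValueError(f"unsupported method in this sketch: {method!r}")


print(select_proposer_name("gumiho"))
print(select_proposer_name_wrong_order("gumiho"))
```

With the wrong order, a gumiho configuration would silently get a plain EAGLE proposer and the MLP heads would never run.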

Out of scope for the first PR

These can come as follow-ups once the basic path is merged and reviewed:

- Tree attention / FTA: the reference Gumiho training code uses an FTA tree for verification. The V1 verifier currently does not support tree drafts, so this PR uses the standard linear-chain V1 verifier.
- Probabilistic draft sampling: the MLP heads currently produce argmax ids only; with draft_sample_method=probabilistic we fall back to the sequential transformer path. Adding logits output to the MLP heads is a small follow-up change.
- V2 GPU model runner path.
- Multimodal drafters.

Diff size

- 5 new files: vllm/v1/spec_decode/gumiho.py (74 LoC), vllm/model_executor/models/gumiho.py (437 LoC), vllm/transformers_utils/configs/gumiho.py (80 LoC), tests/v1/spec_decode/test_gumiho.py (350 LoC), docs/features/speculative_decoding/gumiho.md (72 LoC).
- 9 modified files: net +110 / -4 LoC.

The PR with the full diff is at .

Feedback Period.

Two weeks (we'd like to keep the review loop short since the diff is contained and we have a working prototype, but happy to extend if a maintainer raises larger design questions).

CC List.

@WoosukKwon @LiuXiaoxuanPKU @ekagra-ranjan

Any Other Things.

- Paper: Li et al., Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding, ICML 2025.
- Reference code: https://github.com/AMD-AIG-AIMA/Gumiho.
- Pre-trained drafter: https://huggingface.co/amd/Gumiho-llama3-8b.
- This RFC is submitted by the original authors of the paper.
- A smoke test on ROCm shows that the V1 path produces accepted tokens end-to-end (mean acceptance length ≈ 1.67 on a small 16-prompt eval with num_speculative_tokens=3); we'll post a full benchmark vs. EAGLE3 along with the PR.
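
For readers unfamiliar with the metric, here is one way a mean acceptance length could be computed. The RFC does not define it, so this sketch assumes a common convention: each verification step emits the accepted draft tokens plus one token from the target model. The per-step counts are made up for illustration and are not the smoke-test data.

```python
from typing import Sequence


def mean_acceptance_length(accepted_draft_tokens_per_step: Sequence[int]) -> float:
    """Average tokens emitted per verification step under the assumed convention.

    Each step contributes its accepted draft tokens plus one token that the
    target model produces regardless of how many drafts were accepted.
    """
    if not accepted_draft_tokens_per_step:
        raise ValueError("need at least one verification step")
    total_accepted = sum(accepted_draft_tokens_per_step)
    return 1 + total_accepted / len(accepted_draft_tokens_per_step)


# Made-up counts for three verification steps with num_speculative_tokens=3.
illustrative_counts = [0, 1, 1]
print(round(mean_acceptance_length(illustrative_counts), 2))
```

Under this convention the value ranges from 1 (no draft token ever accepted) to num_speculative_tokens + 1 (every draft token accepted).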
