[RFC]: Add Gumiho speculative decoding to vLLM


Motivation.

Gumiho (ICML 2025) is a hybrid speculative decoding drafter:

  1. The first two speculative tokens are produced autoregressively by an EAGLE-style transformer head.
  2. Every additional speculative token is produced in parallel by per-step MLP heads conditioned on the embeddings and hidden states of the first two draft tokens.

The motivation is that early speculative tokens matter more than late ones. Each accepted prefix token unlocks all later tokens in the same draft, so allocating more compute to the first two positions (a small but high-quality transformer head) and amortising the rest with cheap parallel MLP heads yields a better acceptance-rate / drafter-latency trade-off than either a pure EAGLE drafter (all transformer, sequential) or a pure MLPSpeculator drafter (all MLP, parallel but lower quality).
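The "early tokens matter more" argument can be illustrated with a small calculation. This is a sketch under a simplifying assumption that is not part of the RFC: draft position i is accepted with a fixed probability p_i given that all earlier positions were accepted. The expected number of accepted draft tokens is then the sum of the prefix products. The probabilities below are illustrative, not measurements.

```python
from math import prod


def expected_accepted(probs):
    """Expected number of accepted draft tokens for a linear draft chain.

    probs[i] is the (assumed) probability that draft position i is accepted
    given that positions 0..i-1 were accepted. Position i only counts when
    the whole prefix up to i is accepted, hence the prefix product.
    """
    return sum(prod(probs[: k + 1]) for k in range(len(probs)))


# Illustrative numbers only, not measurements from the RFC.
baseline = [0.7, 0.7, 0.7, 0.7]
better_first = [0.8, 0.7, 0.7, 0.7]  # improve only the first position
better_last = [0.7, 0.7, 0.7, 0.8]   # improve only the last position

gain_first = expected_accepted(better_first) - expected_accepted(baseline)
gain_last = expected_accepted(better_last) - expected_accepted(baseline)

print(f"gain from improving the first position: {gain_first:.4f}")
print(f"gain from improving the last position:  {gain_last:.4f}")
```

Under this model the same improvement is worth more at the first position, because it multiplies into every later term of the sum, while an improvement at the last position affects only the final term.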

Concretely, this means a lower per-step drafter latency than EAGLE/EAGLE3 when num_speculative_tokens > 2, with similar end-to-end accepted-token gains.
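A rough cost model shows where the latency difference comes from. It is a sketch, not a benchmark: it assumes each sequential transformer-head step costs `t_head`, and that all MLP heads run in one parallel pass costing `t_mlp` regardless of how many heads there are. Both function names and both timing constants are hypothetical.

```python
def eagle_style_latency(k, t_head):
    """All k draft tokens come from sequential transformer-head steps."""
    return k * t_head


def gumiho_style_latency(k, t_head, t_mlp):
    """First two tokens are sequential; the rest share one parallel MLP pass."""
    sequential = min(k, 2) * t_head
    parallel = t_mlp if k > 2 else 0.0
    return sequential + parallel


# Hypothetical per-step costs in milliseconds, chosen only for illustration.
T_HEAD_MS = 1.0
T_MLP_MS = 0.3

for k in (2, 3, 5, 7):
    eagle = eagle_style_latency(k, T_HEAD_MS)
    gumiho = gumiho_style_latency(k, T_HEAD_MS, T_MLP_MS)
    print(f"k={k}: all-sequential={eagle:.1f} ms, hybrid={gumiho:.1f} ms")
```

In this model the two schemes cost the same for k <= 2, and the hybrid drafter's cost stays flat as k grows beyond 2 while the all-sequential cost grows linearly.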

We have a working ROCm prototype on top of vLLM v0.16.0 and have already ported it to current main with all CI hygiene applied; the changes are small and contained.

Proposed Change.

Add "method": "gumiho" as a new V1 speculative decoding method.

https://github.com/vllm-project/vllm/pull/43544

Surface

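from vllm import LLM

# Target model plus the Gumiho drafter, selected through speculative_config.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "gumiho",
        "model": "amd/Gumiho-llama3-8b",
        "num_speculative_tokens": 3,
        "draft_tensor_parallel_size": 1,
    },
)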

Implementation

  • New drafter model vllm/model_executor/models/gumiho.py (GumihoLlamaForCausalLM + inner GumihoLlamaModel + GumihoResBlock, GumihoNoResBlock, weight loader). The transformer head reuses LlamaDecoderLayer from llama_eagle.py exactly like EAGLE does.

  • New proposer vllm/v1/spec_decode/gumiho.py (GumihoProposer(EagleProposer)). It overrides only two new hooks defined on SpecDecodeBaseProposer:

    | Hook | Default | Gumiho |
    | --- | --- | --- |
    | _init_draft_hidden_states_list | returns None | returns [sample_hidden_states] so the sequential loop also collects hidden states |
    | _maybe_get_mlp_draft_token_ids | returns None | after the 2nd draft step, calls GumihoLlamaForCausalLM.generate_mlp_draft_token_ids(...) and short-circuits the rest of the loop |

    Both hooks are no-ops in the base class, so existing proposers (EAGLE / EAGLE3 / MTP / DFlash / draft_model / ...) are not affected.
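The hook pattern can be sketched with toy classes. This is not the vLLM code: only the two hook names come from the RFC, while the hook signatures, `ToyBaseProposer`, `ToyGumihoProposer`, `_sequential_step`, and the placeholder token ids are all assumptions made for illustration.

```python
class ToyBaseProposer:
    """Toy stand-in for a base proposer with two no-op hooks."""

    def __init__(self, num_speculative_tokens):
        self.num_speculative_tokens = num_speculative_tokens

    def _init_draft_hidden_states_list(self, sample_hidden_states):
        # Default: no-op, so the loop does not collect hidden states.
        return None

    def _maybe_get_mlp_draft_token_ids(self, draft_ids, hidden_list):
        # Default: no-op, so the sequential loop runs to completion.
        return None

    def _sequential_step(self, step, prev_hidden):
        # Placeholder for one transformer-head step: (token_id, hidden_state).
        return 100 + step, f"h{step}"

    def propose(self, sample_hidden_states):
        hidden_list = self._init_draft_hidden_states_list(sample_hidden_states)
        draft_ids = []
        hidden = sample_hidden_states
        for step in range(self.num_speculative_tokens):
            token_id, hidden = self._sequential_step(step, hidden)
            draft_ids.append(token_id)
            if hidden_list is not None:
                hidden_list.append(hidden)
            mlp_ids = self._maybe_get_mlp_draft_token_ids(draft_ids, hidden_list)
            if mlp_ids is not None:
                # Short-circuit: the remaining tokens came from parallel heads.
                return draft_ids + mlp_ids
        return draft_ids


class ToyGumihoProposer(ToyBaseProposer):
    """Overrides only the two hooks; the loop itself is inherited."""

    def _init_draft_hidden_states_list(self, sample_hidden_states):
        return [sample_hidden_states]

    def _maybe_get_mlp_draft_token_ids(self, draft_ids, hidden_list):
        remaining = self.num_speculative_tokens - len(draft_ids)
        if len(draft_ids) == 2 and remaining > 0:
            # Placeholder for the parallel MLP heads (one id per head).
            return [200 + i for i in range(remaining)]
        return None


print(ToyBaseProposer(4).propose("h_sample"))
print(ToyGumihoProposer(4).propose("h_sample"))
```

Because the base-class hooks return None, a proposer that does not override them follows exactly the same path through `propose` as before the hooks existed.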

  • New HF config wrapper vllm/transformers_utils/configs/gumiho.py (GumihoConfig), mirroring how EAGLEConfig wraps the underlying Llama backbone.

  • Wiring

    • SpeculativeConfig: add gumiho to SpeculativeMethod, use_eagle(), the draft-TP=1 forced set, the model-type auto-detect path, and add a config-wrap branch parallel to the EAGLE one (reusing the existing update_arch_() helper).
    • gpu_model_runner.GPUModelRunner.__init__: add a method == "gumiho" branch matched before use_eagle() so we don't fall back to a plain EagleProposer.
    • v1/worker/gpu/spec_decode/__init__.py: raise an explicit NotImplementedError on the V2 GPU runner.
    • Model + HF-config registry entries.
  • Docs + example under docs/features/speculative_decoding/gumiho.md and a new --method gumiho branch in examples/features/speculative_decoding/spec_decode_offline.py.

  • Unit tests under tests/v1/spec_decode/test_gumiho.py (CPU-only, exercises GumihoConfig, MLP-head shape contract, OOB-head filtering, and the two hooks).
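The ordering requirement in the runner wiring can be shown with a toy dispatcher. This is a minimal sketch, not the GPUModelRunner code: `use_eagle` here is a free function with an assumed membership set, and the proposers are represented by their names as strings.

```python
def use_eagle(method):
    # Toy version: per the RFC, "gumiho" is added to the EAGLE-family check.
    # The other members of this set are assumptions for illustration.
    return method in {"eagle", "eagle3", "gumiho"}


def select_proposer_wrong_order(method):
    if use_eagle(method):
        return "EagleProposer"
    if method == "gumiho":
        # Unreachable: use_eagle() already matched "gumiho" above.
        return "GumihoProposer"
    raise ValueError(f"unknown method: {method}")


def select_proposer(method):
    # Match the more specific method first, then the EAGLE family.
    if method == "gumiho":
        return "GumihoProposer"
    if use_eagle(method):
        return "EagleProposer"
    raise ValueError(f"unknown method: {method}")


print(select_proposer_wrong_order("gumiho"))
print(select_proposer("gumiho"))
print(select_proposer("eagle"))
```

With the checks in the wrong order, a "gumiho" request silently gets the plain EAGLE proposer, which is the fallback the wiring change is meant to avoid.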

Out of scope for the first PR

These can come as follow-ups once the basic path is merged and reviewed:

  • Tree attention / FTA: the reference Gumiho training code uses an FTA tree for verification. The V1 verifier currently does not support tree drafts, so this PR uses the standard linear-chain V1 verifier.
  • Probabilistic draft sampling: the MLP heads currently produce argmax ids only; with draft_sample_method=probabilistic we fall back to the sequential transformer path. Adding logits output to the MLP heads is a small follow-up change.
  • V2 GPU model runner path.
  • Multimodal drafters.
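The fallback described for probabilistic draft sampling amounts to a small decision rule, sketched below. The function `choose_draft_path` and the returned labels are hypothetical, and "greedy" is an assumed name for the non-probabilistic setting; only `draft_sample_method=probabilistic` and `num_speculative_tokens` appear in the RFC.

```python
def choose_draft_path(draft_sample_method, num_speculative_tokens):
    """Pick which drafter path produces the speculative tokens.

    Returns "sequential" when every token comes from the transformer head,
    or "sequential+mlp" when tokens after the second come from MLP heads.
    """
    if num_speculative_tokens <= 2:
        # The MLP heads only cover positions after the first two tokens.
        return "sequential"
    if draft_sample_method == "probabilistic":
        # MLP heads emit argmax ids only, so they cannot supply the
        # draft probabilities that probabilistic sampling needs.
        return "sequential"
    return "sequential+mlp"


for method in ("greedy", "probabilistic"):
    for k in (2, 3):
        print(method, k, choose_draft_path(method, k))
```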

Diff size

  • 5 new files: vllm/v1/spec_decode/gumiho.py (74 LoC), vllm/model_executor/models/gumiho.py (437 LoC), vllm/transformers_utils/configs/gumiho.py (80 LoC), tests/v1/spec_decode/test_gumiho.py (350 LoC), docs/features/speculative_decoding/gumiho.md (72 LoC).
  • 9 modified files: net +110 / -4 LoC.

The PR with the full diff is at https://github.com/vllm-project/vllm/pull/43544.

Feedback Period.

Two weeks (we'd like to keep the review loop short since the diff is contained and we have a working prototype, but happy to extend if a maintainer raises larger design questions).

CC List.

@WoosukKwon @LiuXiaoxuanPKU @ekagra-ranjan

