transformers: [Feature Request] Add lossy speculative decoding via static ensemble verification

GitHub stats
huggingface/transformers#45865 (fetched 2026-05-11 03:13:03)
Comments: 0 · Participants: 1 · Timeline events: 1 (cross-referenced ×1) · Reactions: 0


Feature request

Is your feature request related to a problem? Please describe.

Standard speculative decoding (assisted generation) in Transformers is lossless — it guarantees the output distribution exactly matches the target model. While this is a strong guarantee, it comes at a cost: many plausible draft tokens are rejected because p(x)/q(x) < 1, even when those tokens would lead to correct outputs. This limits the practical speedup achievable with speculative decoding.

In our experiments across multiple model pairs (Llama, Qwen, Gemma families), we observe that wall-clock time decreases monotonically as acceptance rate increases. The rigid verification step is the primary bottleneck.

Describe the solution you would like

Add an optional assistant_ensemble_weight parameter to GenerationConfig that enables static ensemble verification — a training-free, single-parameter extension that trades a controllable amount of distributional bias for higher acceptance rates.

The verification distribution becomes a weighted mixture:

v(x) = w * p_target(x) + (1 - w) * q_draft(x)

A draft token x, sampled from q_draft, is accepted with probability min(1, v(x) / q_draft(x)); on rejection, we resample from the corresponding fallback distribution.
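The acceptance rule above can be sketched for a single draft token in plain Python. This is a minimal illustration, not the proposed Transformers implementation: the function names `mix` and `verify_token` are hypothetical, distributions are plain lists of probabilities, and the fallback is assumed to be the normalized residual max(0, v − q) used by standard speculative sampling, since the issue does not spell it out.

```python
import random

def mix(p_target, q_draft, w):
    """Verification distribution v = w * p_target + (1 - w) * q_draft, element-wise."""
    return [w * p + (1.0 - w) * q for p, q in zip(p_target, q_draft)]

def verify_token(token, p_target, q_draft, w, rng=random):
    """Return (accepted, token): keep the draft token or resample a replacement.

    `token` is assumed to have been sampled from q_draft, so q_draft[token] > 0.
    """
    v = mix(p_target, q_draft, w)
    # Accept the draft token with probability min(1, v(x) / q(x)).
    if rng.random() < min(1.0, v[token] / q_draft[token]):
        return True, token
    # Assumed fallback: sample from the residual max(0, v - q). random.choices
    # normalizes the weights; they have positive mass whenever a rejection occurs.
    residual = [max(0.0, vi - qi) for vi, qi in zip(v, q_draft)]
    new_token = rng.choices(range(len(v)), weights=residual, k=1)[0]
    return False, new_token
```

With w=1.0 the mixture reduces to p_target, so the sketch falls back to ordinary lossless verification.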

Proposed API:

outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    assistant_ensemble_weight=0.7,  # float in (0, 1], default=1.0 (lossless)
)

Key properties:

  • w=1.0 recovers standard lossless speculative decoding (backward compatible)
  • w<1.0 increases the acceptance probability from 1 - TV(q,p) to 1 - w*TV(q,p), where TV(q,p) is the total variation distance between the draft and target distributions (Lemma 1 in our paper)
  • The method is Pareto-optimal: it achieves the best possible tradeoff between acceptance rate and distributional bias (Proposition 1)
  • No training required, no extra model weights, just one scalar parameter
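The acceptance-probability property can be checked numerically on a toy example. The sketch below assumes TV(q,p) is the usual total variation distance, half the L1 distance, and uses made-up three-token distributions; the helper names are illustrative only.

```python
def total_variation(p, q):
    """Total variation distance: half the L1 distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def acceptance_probability(p_target, q_draft, w):
    """Expected acceptance: sum_x q(x) * min(1, v(x) / q(x)) = sum_x min(q(x), v(x))."""
    total = 0.0
    for p, q in zip(p_target, q_draft):
        v = w * p + (1.0 - w) * q  # static ensemble verification distribution
        total += min(q, v)
    return total

p = [0.5, 0.3, 0.2]  # toy target distribution (made up for illustration)
q = [0.2, 0.3, 0.5]  # toy draft distribution (made up for illustration)
for w in (1.0, 0.7, 0.4):
    lhs = acceptance_probability(p, q, w)
    rhs = 1.0 - w * total_variation(q, p)
    print(f"w={w}: acceptance={lhs:.3f}, 1 - w*TV={rhs:.3f}")
```

Here TV(q,p) = 0.3, so both columns read 0.700, 0.790, and 0.880 for w = 1.0, 0.7, and 0.4. The identity holds because v(x) - q(x) = w * (p(x) - q(x)), so only the tokens where the draft overshoots the target lose acceptance mass, and that loss is scaled by w.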

Describe alternatives you have considered

  1. Training a draft model to better match the target (e.g., online distillation) — requires expensive training
  2. Dynamic ensemble with learned context-dependent weights (our full DIVERSED method) — requires training an ensemble head, too invasive for a first contribution
  3. Simply increasing num_assistant_tokens — does not address the fundamental acceptance rate limitation

Empirical results (from our paper):

On CNN/DailyMail with Llama-3.1-8B-Instruct (target) + Llama-3.2-1B-Instruct (draft), temperature=0:

  • Standard SD (w=1.0): ~65% acceptance rate
  • Static ensemble (w=0.7): ~78% acceptance rate, with ROUGE-L within 0.5 points of the target model

Similar improvements observed across GSM8K, XSum, WMT, HumanEval, and MBPP benchmarks.
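To give a rough sense of what these acceptance rates mean for throughput, the sketch below applies the standard speculative-decoding estimate of tokens produced per target forward pass, (1 - α^(γ+1)) / (1 - α). It assumes each draft token is accepted independently with rate α and that γ tokens are drafted per step. γ=5 is an assumed value, not one reported in the issue, and the estimate ignores the cost of running the draft model.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens per target forward pass under i.i.d. acceptance rate alpha.

    gamma is the number of draft tokens proposed per step.
    """
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target pass.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

gamma = 5  # assumed draft length, for illustration only
for label, alpha in (("standard SD, w=1.0", 0.65), ("static ensemble, w=0.7", 0.78)):
    print(f"{label}: ~{expected_tokens_per_step(alpha, gamma):.2f} tokens per step")
```

Under these assumptions, the estimate rises from roughly 2.6 to roughly 3.5 tokens per target forward pass.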

Implementation scope:

The change is minimal (~50-100 lines of logic):

  1. Add assistant_ensemble_weight to GenerationConfig in generation/configuration_utils.py
  2. Modify the verification step in generation/utils.py to blend distributions when w < 1.0
  3. Tests and documentation
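Step 1 could include a range check on the new parameter. The sketch below shows one way to validate it, using a hypothetical stand-in dataclass rather than the real GenerationConfig; the class name, the `is_lossless` property, and the error message are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class AssistedGenerationSettings:
    """Hypothetical stand-in for the relevant slice of GenerationConfig."""
    assistant_ensemble_weight: float = 1.0  # default keeps lossless behavior

    def __post_init__(self):
        w = self.assistant_ensemble_weight
        is_number = isinstance(w, (int, float)) and not isinstance(w, bool)
        # The proposed range is the half-open interval (0, 1].
        if not is_number or not (0.0 < w <= 1.0):
            raise ValueError(
                f"assistant_ensemble_weight must be a float in (0, 1], got {w!r}"
            )

    @property
    def is_lossless(self):
        """True when verification reduces to standard speculative decoding."""
        return self.assistant_ensemble_weight == 1.0
```

Rejecting w=0 matters because the verification distribution would then equal the draft distribution, and every draft token would be accepted regardless of the target model.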

References:

I am happy to implement this and submit a PR if there is interest from the maintainers. I am one of the paper authors.
