transformers - [Feature Request] Add lossy speculative decoding via static ensemble verification [1 participants]

transformers, 2026-05-10 06:09:08

GitHub stats

huggingface/transformers#45865•Fetched 2026-05-11 03:13:03


Author

kasakh

Participants

kasakh

Timeline (top)

cross-referenced ×1

Root Cause

Standard speculative decoding (assisted generation) in Transformers is lossless: it guarantees that the output distribution exactly matches that of the target model. This strong guarantee comes at a cost. A draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution, so many plausible draft tokens are rejected whenever p(x)/q(x) < 1, even when those tokens would lead to correct outputs. This limits the practical speedup achievable with speculative decoding.



Feature request

Is your feature request related to a problem? Please describe.

In our experiments across multiple model pairs (Llama, Qwen, Gemma families), we observe that wall-clock time decreases monotonically as acceptance rate increases. The rigid verification step is the primary bottleneck.

Describe the solution you would like

Add an optional assistant_ensemble_weight parameter to GenerationConfig that enables static ensemble verification — a training-free, single-parameter extension that trades a controllable amount of distributional bias for higher acceptance rates.

The verification distribution becomes a weighted mixture:

v(x) = w * p_target(x) + (1 - w) * q_draft(x)

A draft token x sampled from q is accepted with probability min(1, v(x) / q(x)); on rejection, a replacement token is resampled from the corresponding fallback distribution.
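The acceptance rule above can be sketched for a single draft token in plain Python. This is a minimal illustration, not the paper's or the library's implementation: the helper names `blend` and `verify_token` are hypothetical, distributions are plain lists over a shared vocabulary, and the fallback is assumed to be the usual residual distribution proportional to max(0, v(x) - q(x)), which the issue does not spell out.

```python
import random


def blend(p_target, q_draft, w):
    """Verification distribution v = w * p + (1 - w) * q over a shared vocabulary."""
    return [w * p + (1.0 - w) * q for p, q in zip(p_target, q_draft)]


def verify_token(token, p_target, q_draft, w, rng=random):
    """Accept or replace one draft token; returns (token, accepted)."""
    v = blend(p_target, q_draft, w)
    # The draft token was sampled from q, so q_draft[token] > 0.
    accept_prob = min(1.0, v[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # ASSUMPTION: fallback is the residual distribution, proportional to max(0, v - q).
    # A rejection implies v[token] < q[token], so some other entry has v > q
    # and the residual weights have positive total mass.
    residual = [max(0.0, vi - qi) for vi, qi in zip(v, q_draft)]
    new_token = rng.choices(range(len(v)), weights=residual, k=1)[0]
    return new_token, False
```

With w = 1.0 the blend reduces to p_target, so this sketch falls back to the standard lossless accept/reject rule.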

Proposed API:

outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    assistant_ensemble_weight=0.7,  # float in (0, 1], default=1.0 (lossless)
)

Key properties:

w=1.0 recovers standard lossless speculative decoding (backward compatible)
w<1.0 increases acceptance probability from 1 - TV(q,p) to 1 - w*TV(q,p), where TV is the total variation distance between the draft and target distributions (Lemma 1 in our paper)
The method is Pareto-optimal: it achieves the best possible tradeoff between acceptance rate and distributional bias (Proposition 1)
No training required, no extra model weights, just one scalar parameter
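The acceptance-rate property listed above can be checked numerically on a toy example. The sketch below is only an illustration on a made-up three-token vocabulary (the distributions `p` and `q` are arbitrary values chosen for this example, not data from the paper), and the function names are hypothetical. It uses the fact that the expected acceptance is the sum over x of q(x) * min(1, v(x)/q(x)), which equals the sum of min(q(x), v(x)).

```python
def total_variation(p, q):
    """Total variation distance: half the L1 distance between two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def acceptance_rate(p_target, q_draft, w):
    """Expected acceptance: sum_x min(q(x), v(x)) with v = w * p + (1 - w) * q."""
    total = 0.0
    for p, q in zip(p_target, q_draft):
        v = w * p + (1.0 - w) * q
        total += min(q, v)
    return total


# Toy distributions, chosen only for illustration.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.5, 0.3]
for w in (1.0, 0.7, 0.4):
    lhs = acceptance_rate(p, q, w)
    rhs = 1.0 - w * total_variation(q, p)
    print(f"w={w}: acceptance={lhs:.3f}, 1 - w*TV={rhs:.3f}")
```

The two printed columns agree for every w, since v - q = w * (p - q) and therefore TV(q, v) = w * TV(q, p). By the same algebra, v - p = (1 - w) * (q - p), so the mixture itself sits at distance (1 - w) * TV(q, p) from the target.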

Describe alternatives you have considered

Training a draft model to better match the target (e.g., online distillation) — requires expensive training
Dynamic ensemble with learned context-dependent weights (our full DIVERSED method) — requires training an ensemble head, too invasive for a first contribution
Simply increasing num_assistant_tokens — does not address the fundamental acceptance rate limitation

Empirical results (from our paper):

On CNN/DailyMail with Llama-3.1-8B-Instruct (target) + Llama-3.2-1B-Instruct (draft), temperature=0:

Standard SD (w=1.0): ~65% acceptance rate
Static ensemble (w=0.7): ~78% acceptance rate, with ROUGE-L within 0.5 points of the target model

Similar improvements observed across GSM8K, XSum, WMT, HumanEval, and MBPP benchmarks.

Implementation scope:

The change is minimal (~50-100 lines of logic):

Add assistant_ensemble_weight to GenerationConfig in configuration_utils.py
Modify the verification step in utils.py to blend distributions when w < 1.0
Tests and documentation
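One possible shape for the first two steps is sketched below, assuming the parameter is validated to lie in (0, 1] and that w = 1.0 skips the blend entirely so existing behavior is untouched. `EnsembleVerificationConfig` and `verification_probs` are hypothetical stand-ins written for this illustration; they are not the real GenerationConfig or any existing Transformers function.

```python
from dataclasses import dataclass


@dataclass
class EnsembleVerificationConfig:
    """Hypothetical stand-in for the proposed config field (not GenerationConfig)."""

    assistant_ensemble_weight: float = 1.0  # 1.0 keeps verification lossless

    def __post_init__(self):
        w = self.assistant_ensemble_weight
        is_number = isinstance(w, (int, float)) and not isinstance(w, bool)
        if not is_number or not (0.0 < w <= 1.0):
            raise ValueError(
                f"assistant_ensemble_weight must be a float in (0, 1], got {w!r}"
            )

    @property
    def is_lossless(self):
        return self.assistant_ensemble_weight == 1.0


def verification_probs(p_target, q_draft, config):
    """Return the distribution that draft tokens are verified against."""
    if config.is_lossless:
        # Backward-compatible path: verify against the target distribution only.
        return list(p_target)
    w = config.assistant_ensemble_weight
    return [w * p + (1.0 - w) * q for p, q in zip(p_target, q_draft)]
```

Short-circuiting at w = 1.0 keeps the default path numerically identical to the current behavior instead of relying on the blend reducing to p_target.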

References:

Paper: DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification (AISTATS 2026)
Code: https://github.com/comeusr/diversed
The static ensemble method is described in Section 3.1 of the paper

I am happy to implement this and submit a PR if there is interest from the maintainers. I am one of the paper authors.
