vllm - ✅(Solved) Fix [Feature]: Universal Speculative Decoding for Heterogeneous Vocabularies (TLI / Token-Level Intersection) [2 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#38173 (fetched 2026-04-08 01:31:53)
Comments: 0
Participants: 1
Timeline events: 8
Reactions: 2
Timeline (top): mentioned ×3, subscribed ×3, cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #38174: [Feature] Universal speculative decoding for heterogeneous vocabularies (TLI)

Description (problem / solution / changelog)

Summary

Implements Token-Level Intersection (TLI) speculative decoding, allowing target and draft models to have different (but overlapping) vocabularies.

Closes #38173

Algorithm

Based on the ICML 2025 oral paper:

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Timor et al., https://arxiv.org/abs/2502.05202

How it works:

  1. At startup, build a normalized token intersection between target and draft vocabularies
  2. Draft model generates tokens constrained to the intersection (logits of non-intersection tokens → -inf)
  3. Intersection tokens are mapped to target token IDs before rejection sampling
  4. Rejection sampling runs unchanged — the algorithm is provably lossless
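
The first three steps above can be sketched on toy vocabularies. This is a minimal illustration, not the code from the PR: `normalize`, `build_draft_to_target`, and `mask_logits` are hypothetical helper names, the vocabularies are invented, and the normalization shown (treating the byte-level BPE marker `Ġ` and the SentencePiece marker `▁` as a plain space) is only one assumption about what "normalized" might involve.

```python
import math

def normalize(token: str) -> str:
    # Assumed normalization: map common leading-space markers to a plain space.
    return token.replace("Ġ", " ").replace("▁", " ")

def build_draft_to_target(target_vocab: dict[str, int],
                          draft_vocab: dict[str, int]) -> dict[int, int]:
    """Step 1: map each draft ID to the target ID with the same normalized text."""
    target_by_text = {normalize(tok): tid for tok, tid in target_vocab.items()}
    return {
        did: target_by_text[normalize(tok)]
        for tok, did in draft_vocab.items()
        if normalize(tok) in target_by_text
    }

def mask_logits(logits: list[float], allowed_ids: set[int]) -> list[float]:
    """Step 2: set logits of tokens outside the intersection to -inf."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]

# Toy vocabularies (invented for illustration), token text -> token ID.
target_vocab = {"Ġthe": 0, "Ġcat": 1, "Ġsat": 2, "!": 3}
draft_vocab = {"▁cat": 0, "▁the": 1, "?": 2, "!": 3}

draft_to_target = build_draft_to_target(target_vocab, draft_vocab)

# The draft model's highest logit is on "?" (draft ID 2), which the target lacks.
draft_logits = [0.5, 1.0, 3.0, 0.2]
masked = mask_logits(draft_logits, set(draft_to_target))
draft_id = max(range(len(masked)), key=masked.__getitem__)  # greedy pick
target_id = draft_to_target[draft_id]  # Step 3: translate before verification

print(draft_to_target, draft_id, target_id)
```

Here the masked draft falls back to its best in-intersection token, and the ID handed to the verifier is expressed in the target's numbering. Step 4 is untouched because the verifier only ever sees target IDs.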

Changes

| File | Change |
| --- | --- |
| vllm/v1/spec_decode/universal_draft_model.py | New: UniversalDraftModelProposer |
| vllm/v1/spec_decode/vocab_mapping.py | New: VocabMapping (intersection + ID mapping) |
| vllm/config/speculative.py | Register "universal_draft", skip same-vocab check |
| vllm/v1/worker/gpu_model_runner.py | Instantiate proposer for universal_draft method |

Testing

Functional test (Qwen2.5-1.5B + Qwen2.5-0.5B, A800 80GB):

  • Vocab intersection: 151665 / 151936 = 99.8%
  • Mean acceptance length: 2.83 / 3
  • Per-position acceptance rate: 77.5%, 60.6%, 45.1%
  • Avg draft acceptance rate: 61%
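
The figures above are mutually consistent, which a few lines of arithmetic can confirm. The sketch assumes that the reported mean acceptance length counts the accepted draft tokens plus the one token the target model contributes on every verification step. That convention is an assumption here, not something the PR text states.

```python
# Values copied from the functional test above.
intersection_size, vocab_size = 151665, 151936
per_position_acceptance = [0.775, 0.606, 0.451]

overlap = intersection_size / vocab_size
avg_acceptance = sum(per_position_acceptance) / len(per_position_acceptance)
# Assumed convention: accepted draft tokens per step + 1 token from the target.
mean_acceptance_length = 1 + sum(per_position_acceptance)

print(f"{overlap:.1%}")                 # 99.8%
print(f"{avg_acceptance:.0%}")          # 61%
print(f"{mean_acceptance_length:.2f}")  # 2.83
```

Under that reading, "2.83 / 3" does not mean 2.83 of the 3 drafted tokens are accepted. About 1.83 drafted tokens are accepted per step, which matches the 61% average.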

Regression (existing methods unaffected):

  • ✅ No speculative decoding (baseline)
  • ngram
  • draft_model

Attribution

This implementation is based on the TLI algorithm by Timor et al. The original authors have a reference implementation in HuggingFace Transformers (PR #35029).

Changed files

  • vllm/config/speculative.py (modified, +10/-3)
  • vllm/v1/spec_decode/eagle.py (modified, +4/-0)
  • vllm/v1/spec_decode/universal_draft_model.py (added, +101/-0)
  • vllm/v1/spec_decode/vocab_mapping.py (added, +90/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +18/-6)

🚀 The feature, motivation and pitch

This feature adds support for speculative decoding with heterogeneous (mismatched) vocabularies, enabling any two models from different families — or even different tokenizer versions — to be paired as target + draft without requiring identical vocabularies.

The Problem Today

vLLM currently requires the draft model to share the exact same vocabulary as the target model. This severely limits drafter selection and often necessitates training a dedicated draft model from scratch.

Proposed Solution: Token-Level Intersection (TLI)

We implement the TLI (Token-Level Intersection) algorithm from the ICML 2025 oral paper:

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel
https://arxiv.org/abs/2502.05202

The key idea:

  1. Compute the token-level intersection between the target and draft vocabularies via text normalization
  2. Constrain the draft model to only propose tokens present in the intersection
  3. Map proposed tokens between vocabularies before verification

This is lossless — the target distribution is preserved exactly, with no approximation.
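
Why constraining the draft does not change the output distribution can be checked exactly on a toy example. The sketch uses the standard speculative-sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p(x) - q(x)). It is assumed here that this is the rejection sampler the issue refers to. The distributions are invented and `output_distribution` is a hypothetical helper.

```python
from fractions import Fraction as F

def output_distribution(p: list[F], q: list[F]) -> list[F]:
    """Exact distribution of one token emitted by draft-then-verify.

    p: target distribution; q: draft distribution over the same (target) IDs.
    """
    accept = [min(pi, qi) for pi, qi in zip(p, q)]     # q(x) * min(1, p(x)/q(x))
    residual = [pi - ai for pi, ai in zip(p, accept)]  # max(0, p(x) - q(x))
    reject_mass = 1 - sum(accept)
    if reject_mass == 0:  # q equals p, so nothing is ever rejected
        return accept
    total = sum(residual)
    return [ai + reject_mass * ri / total for ai, ri in zip(accept, residual)]

# Target distribution over 4 target token IDs.
p = [F(1, 2), F(1, 4), F(1, 8), F(1, 8)]
# Draft distribution after masking: IDs 2 and 3 are outside the intersection,
# so the draft can never propose them (probability 0).
q = [F(1, 5), F(4, 5), F(0), F(0)]

assert output_distribution(p, q) == p
print("output distribution equals target distribution")
```

Tokens outside the intersection are still reachable through the residual resampling step, so masking the draft can lower the acceptance rate but does not bias the output.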

The original authors have an open-source implementation in HuggingFace Transformers: https://github.com/huggingface/transformers/pull/35029
(cc: @keyboardAnt @jmamou @gauravjain14; feedback on this vLLM integration is welcome!)

Practical Impact

For models with highly overlapping vocabularies (e.g., Qwen2.5 family: 99.8% intersection):

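  • Draft acceptance rate: ~60% (mean acceptance length: ~2.8 with 3 speculative tokens, which, if acceptance length counts the token the target emits each step, means roughly 1.8 of the 3 draft tokens are accepted)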
  • No model training required — any off-the-shelf smaller model can serve as drafter

This opens up a much larger pool of draft models: any model with a
sufficiently overlapping vocabulary can be used out of the box.

Proposed API

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --speculative-config '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "method": "universal_draft",
    "num_speculative_tokens": 3
  }'
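
Inline JSON is easy to break with shell quoting, so one option is to generate the argument from a Python dict. This sketch uses only the standard library and reuses the flag and field names from the proposed command above. `universal_draft` is the method name proposed in this issue, so the resulting command would only work on a vLLM build that includes the change.

```python
import json
import shlex

speculative_config = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "method": "universal_draft",  # method name proposed in this issue
    "num_speculative_tokens": 3,
}

command = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",
    "--speculative-config", json.dumps(speculative_config),
]

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(command))
```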
                
Implementation Overview

Two new files, minimal changes to existing code:

New files:
- vllm/v1/spec_decode/universal_draft_model.py: UniversalDraftModelProposer (inherits SpecDecodeBaseProposer)
- vllm/v1/spec_decode/vocab_mapping.py: VocabMapping (vocabulary intersection + token ID mapping)

Modified files:
- vllm/config/speculative.py: register "universal_draft" method, skip same-vocab check
- vllm/v1/worker/gpu_model_runner.py: instantiate UniversalDraftModelProposer for the new method
                                                                              
Alternatives Considered

- Retokenization at each step: high latency, not lossless
- Shared embedding training: requires fine-tuning, not plug-and-play
- Restricting to same-vocab models only (current behavior): limits drafter choices

References

- Paper (ICML 2025 oral): https://arxiv.org/abs/2502.05202
- HuggingFace Transformers implementation: https://github.com/huggingface/transformers/pull/35029


extent analysis

Fix Plan

To implement the Token-Level Intersection (TLI) algorithm for speculative decoding with heterogeneous vocabularies, follow these steps:

  • Create a new file vllm/v1/spec_decode/universal_draft_model.py with the UniversalDraftModelProposer class, inheriting from SpecDecodeBaseProposer.
  • Create a new file vllm/v1/spec_decode/vocab_mapping.py with the VocabMapping class, handling vocabulary intersection and token ID mapping.
  • Modify vllm/config/speculative.py to register the "universal_draft" method and skip the same-vocab check.
  • Modify vllm/v1/worker/gpu_model_runner.py to instantiate UniversalDraftModelProposer for the new method.

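Example code for vocab_mapping.py (an illustrative sketch, not the code from the PR; token-text normalization is omitted):

class VocabMapping:
    def __init__(self, target_vocab: dict[str, int], draft_vocab: dict[str, int]):
        # Both vocabularies map token text -> token ID.
        self.target_vocab = target_vocab
        self.draft_vocab = draft_vocab
        self.draft_to_target = self.compute_intersection()

    def compute_intersection(self) -> dict[int, int]:
        # Token texts present in both vocabularies, stored as
        # draft token ID -> target token ID.
        shared = self.target_vocab.keys() & self.draft_vocab.keys()
        return {self.draft_vocab[t]: self.target_vocab[t] for t in shared}

    def map_tokens(self, draft_ids: list[int]) -> list[int]:
        # Translate draft token IDs to target token IDs. Raises KeyError for an
        # ID outside the intersection, which the draft constraint should prevent.
        return [self.draft_to_target[i] for i in draft_ids]

Example code for universal_draft_model.py (schematic, continuing the sketch above):

class UniversalDraftModelProposer(SpecDecodeBaseProposer):
    # Schematic only: the real SpecDecodeBaseProposer constructor and proposal
    # API are not shown in this issue, so the signatures below are assumptions.
    def __init__(self, draft_model, vocab_mapping, num_speculative_tokens):
        # draft_model: assumed callable, draft token IDs -> next-token logits
        self.draft_model = draft_model
        self.vocab_mapping = vocab_mapping
        self.num_speculative_tokens = num_speculative_tokens

    def propose_tokens(self, input_ids):
        # input_ids are draft-vocabulary IDs. Draft greedily, one token per step,
        # considering only draft IDs in the intersection (same effect as setting
        # all other logits to -inf).
        allowed = list(self.vocab_mapping.draft_to_target)
        context, draft_ids = list(input_ids), []
        for _ in range(self.num_speculative_tokens):
            logits = self.draft_model(context)
            next_id = max(allowed, key=lambda i: logits[i])
            draft_ids.append(next_id)
            context.append(next_id)
        # Translate draft IDs to target IDs before rejection sampling.
        return self.vocab_mapping.map_tokens(draft_ids)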

Verification

To verify the implementation, test the UniversalDraftModelProposer with different models and vocabularies, checking that the proposed tokens are correctly mapped between vocabularies.
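
One way to make the mapping check concrete is a small invariant test: every mapped pair must refer to the same token text, and no two draft IDs may collapse onto one target ID. The sketch below is self-contained and uses invented vocabularies. `check_mapping` is a hypothetical helper, and it compares raw token strings, so a real test would need to apply the same normalization as the mapping under test.

```python
def check_mapping(draft_to_target: dict[int, int],
                  draft_vocab: dict[str, int],
                  target_vocab: dict[str, int]) -> None:
    """Raise AssertionError if the draft -> target ID mapping is inconsistent."""
    draft_text = {i: t for t, i in draft_vocab.items()}
    target_text = {i: t for t, i in target_vocab.items()}
    for d_id, t_id in draft_to_target.items():
        # Each pair must name the same token text in both vocabularies.
        assert draft_text[d_id] == target_text[t_id], (d_id, t_id)
    # Injective: distinct draft IDs map to distinct target IDs.
    assert len(set(draft_to_target.values())) == len(draft_to_target)

# Toy vocabularies (invented), token text -> token ID.
draft_vocab = {"a": 0, "b": 1, "zz": 2}
target_vocab = {"b": 0, "a": 1, "c": 2}

good = {0: 1, 1: 0}  # "a" -> "a", "b" -> "b"
check_mapping(good, draft_vocab, target_vocab)

bad = {0: 0, 1: 0}   # "a" -> "b" is wrong, and two draft IDs share a target ID
try:
    check_mapping(bad, draft_vocab, target_vocab)
except AssertionError:
    print("inconsistent mapping detected")
```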

Extra Tips

  • Ensure that the VocabMapping class correctly computes the token-level intersection between the target and draft vocabularies.
  • Test the implementation with models having highly overlapping vocabularies (e.g., Qwen2.5 family) to verify the draft acceptance rate and mean accepted tokens.
