vllm - ✅(Solved) Fix [Feature]: Universal Speculative Decoding for Heterogeneous Vocabularies (TLI / Token-Level Intersection) [2 pull requests, 1 participants]

vllm · 2026-03-26 01:51:27


GitHub stats

vllm-project/vllm#38173 • Fetched 2026-04-08 01:31:53


Author

wan-danfeng

Participants

wan-danfeng

Timeline (top)

mentioned ×3 · subscribed ×3 · cross-referenced ×1 · labeled ×1

Fix Action

Fix proposed (PR open, not merged)

Proposed fix: PR [Feature] Universal speculative decoding for heterogeneous vocabularies (TLI) (https://github.com/vllm-project/vllm/pull/38174)

PR fix notes

PR #38174: [Feature] Universal speculative decoding for heterogeneous vocabularies (TLI)

Repository: vllm-project/vllm
Author: wan-danfeng
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38174

Description (problem / solution / changelog)

Summary

Implements Token-Level Intersection (TLI) speculative decoding, allowing target and draft models to have different (but overlapping) vocabularies.

Closes #38173

Algorithm

Based on the ICML 2025 oral paper:

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Timor et al., https://arxiv.org/abs/2502.05202

How it works:

1. At startup, build a normalized token intersection between the target and draft vocabularies.
2. The draft model generates tokens constrained to the intersection (logits of non-intersection tokens are set to -inf).
3. Intersection tokens are mapped to target token IDs before rejection sampling.
4. Rejection sampling runs unchanged, so the algorithm remains provably lossless.
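
The losslessness claim can be checked by hand on a toy example. The sketch below is not code from the PR: it assumes the draft distribution has already been masked to the intersection and re-indexed into target token IDs (so it is zero outside the intersection), and it applies the standard speculative-sampling accept/resample rule. The function name and the four-token vocabulary are made up for illustration.

```python
from fractions import Fraction as F

def emitted_distribution(p, q):
    """Exact distribution of the token produced by one accept/resample step.

    p: target probabilities, indexed by target token ID.
    q: draft probabilities in the same ID space (zero outside the intersection).
    """
    # The drafter proposes token i with probability q[i], and it is accepted
    # with probability min(1, p[i] / q[i]); the product is min(p[i], q[i]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1 - sum(accepted)
    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0:  # p == q, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / z for a, r in zip(accepted, residual)]

# Target IDs 2 and 3 are outside the intersection, so the drafter never proposes them.
target_probs = [F(1, 2), F(1, 4), F(1, 8), F(1, 8)]
draft_probs = [F(3, 4), F(1, 4), F(0), F(0)]

result = emitted_distribution(target_probs, draft_probs)
print(result == target_probs)  # True
```

Tokens outside the intersection are only reachable through the resample branch, which is why restricting the drafter can lower the acceptance rate but does not change the output distribution.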

Changes

| File | Change |
| --- | --- |
| `vllm/v1/spec_decode/universal_draft_model.py` | New: `UniversalDraftModelProposer` |
| `vllm/v1/spec_decode/vocab_mapping.py` | New: `VocabMapping` (intersection + ID mapping) |
| `vllm/config/speculative.py` | Register `"universal_draft"`, skip same-vocab check |
| `vllm/v1/worker/gpu_model_runner.py` | Instantiate proposer for `universal_draft` method |

Testing

Functional test (Qwen2.5-1.5B + Qwen2.5-0.5B, A800 80GB):

Vocab intersection: 151665 / 151936 = 99.8%
Mean acceptance length: 2.83 / 3
Per-position acceptance rate: 77.5%, 60.6%, 45.1%
Avg draft acceptance rate: 61%
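
These figures are mutually consistent under one assumption that the PR does not state explicitly: that mean acceptance length counts the accepted draft tokens plus the one token the target model contributes at every step. A quick arithmetic check, using only the numbers reported above:

```python
# Numbers copied from the functional test above.
per_position_acceptance = [0.775, 0.606, 0.451]
num_speculative_tokens = 3

# Expected number of draft tokens accepted per step.
accepted_per_step = sum(per_position_acceptance)

# Average over the draft positions.
avg_draft_acceptance = accepted_per_step / num_speculative_tokens

# Assumption: +1 for the token the target model emits on every step.
mean_acceptance_length = 1 + accepted_per_step

vocab_overlap = 151665 / 151936

print(f"{avg_draft_acceptance:.1%}")    # 61.1%
print(f"{mean_acceptance_length:.2f}")  # 2.83
print(f"{vocab_overlap:.1%}")           # 99.8%
```

Under that reading, the ceiling for mean acceptance length with 3 speculative tokens would be 4 rather than 3.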

Regression (existing methods unaffected):

✅ No speculative decoding (baseline)
✅ ngram
✅ draft_model

Attribution

This implementation is based on the TLI algorithm by Timor et al. The original authors have a reference implementation in HuggingFace Transformers (PR #35029).

Changed files

vllm/config/speculative.py (modified, +10/-3)
vllm/v1/spec_decode/eagle.py (modified, +4/-0)
vllm/v1/spec_decode/universal_draft_model.py (added, +101/-0)
vllm/v1/spec_decode/vocab_mapping.py (added, +90/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +18/-6)


🚀 The feature, motivation and pitch

This feature adds support for speculative decoding with heterogeneous (mismatched) vocabularies, enabling any two models from different families — or even different tokenizer versions — to be paired as target + draft without requiring identical vocabularies.

The Problem Today

vLLM currently requires the draft model to share the exact same vocabulary as the target model. This severely limits drafter selection and often necessitates training a dedicated draft model from scratch.

Proposed Solution: Token-Level Intersection (TLI)

We implement the TLI (Token-Level Intersection) algorithm from the ICML 2025 oral paper:

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel
https://arxiv.org/abs/2502.05202

The key idea:

1. Compute the token-level intersection between the target and draft vocabularies via text normalization
2. Constrain the draft model to only propose tokens present in the intersection
3. Map proposed tokens between vocabularies before verification
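
A minimal sketch of the intersection and mapping steps, assuming each vocabulary is a plain token-string-to-ID dictionary (the shape returned by a HuggingFace tokenizer's `get_vocab()`). The normalization rule used here, treating the byte-level BPE marker `Ġ` and the SentencePiece marker `▁` as a leading space, is an illustrative assumption; the issue does not spell out the rules the PR applies. `normalize` and `build_draft_to_target` are hypothetical names.

```python
SPACE_MARKERS = ("Ġ", "▁")  # assumed word-boundary markers, for illustration

def normalize(token: str) -> str:
    """Rewrite tokenizer-specific space markers as a plain space."""
    for marker in SPACE_MARKERS:
        token = token.replace(marker, " ")
    return token

def build_draft_to_target(target_vocab: dict[str, int],
                          draft_vocab: dict[str, int]) -> dict[int, int]:
    """Map each draft token ID to the target ID with the same normalized text."""
    target_by_text: dict[str, int] = {}
    for token, target_id in target_vocab.items():
        # If several target tokens normalize to the same text, keep the first.
        target_by_text.setdefault(normalize(token), target_id)

    mapping: dict[int, int] = {}
    for token, draft_id in draft_vocab.items():
        target_id = target_by_text.get(normalize(token))
        if target_id is not None:
            mapping[draft_id] = target_id
    return mapping

target_vocab = {"Ġhello": 0, "world": 1, "Ġfoo": 2}
draft_vocab = {"▁hello": 10, "world": 11, "bar": 12}

mapping = build_draft_to_target(target_vocab, draft_vocab)
print(mapping)  # {10: 0, 11: 1}
```

Draft token 12 (`bar`) has no counterpart in the target vocabulary, so it is left out of the mapping and would be masked during drafting.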

This is lossless — the target distribution is preserved exactly, with no approximation.

The original authors have an open-source implementation in HuggingFace Transformers: https://github.com/huggingface/transformers/pull/35029
(cc: @keyboardAnt @jmamou @gauravjain14, we welcome your feedback on this vLLM integration!)

Practical Impact

For models with highly overlapping vocabularies (e.g., Qwen2.5 family: 99.8% intersection):

- Draft acceptance rate: ~60% with 3 speculative tokens (mean acceptance length: ~2.8)
- No model training required: any off-the-shelf smaller model can serve as the drafter

This opens up a much larger pool of draft models: any model with a sufficiently overlapping vocabulary can be used out of the box.

Proposed API

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --speculative-config '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "method": "universal_draft",
    "num_speculative_tokens": 3
  }'
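
When the server is launched from a script, hand-quoting the JSON argument is error-prone. The sketch below builds the same command line with the standard library. It only constructs and prints the command; the `"universal_draft"` method name comes from this proposal, so the command is only meaningful on a vLLM build that includes the PR.

```python
import json
import shlex

speculative_config = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "method": "universal_draft",  # method name proposed in this issue
    "num_speculative_tokens": 3,
}

command = [
    "vllm", "serve", "Qwen/Qwen2.5-7B-Instruct",
    "--speculative-config", json.dumps(speculative_config),
]

# shlex.join quotes the JSON so it survives shell parsing as one argument.
print(shlex.join(command))
```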
                
Implementation Overview

Two new files, minimal changes to existing code:

New files:
- vllm/v1/spec_decode/universal_draft_model.py: UniversalDraftModelProposer (inherits SpecDecodeBaseProposer)
- vllm/v1/spec_decode/vocab_mapping.py: VocabMapping (vocabulary intersection + token ID mapping)

Modified files:
- vllm/config/speculative.py: register "universal_draft" method, skip same-vocab check
- vllm/v1/worker/gpu_model_runner.py: instantiate UniversalDraftModelProposer for the new method
                                                                              
Alternatives Considered

- Retokenization at each step: high latency, not lossless
- Shared embedding training: requires fine-tuning, not plug-and-play
- Restricting to same-vocab models only (current behavior): limits drafter choices

References

- Paper (ICML 2025 oral): https://arxiv.org/abs/2502.05202
- HuggingFace Transformers implementation: https://github.com/huggingface/transformers/pull/35029


extent analysis

Fix Plan

To implement the Token-Level Intersection (TLI) algorithm for speculative decoding with heterogeneous vocabularies, follow these steps:

1. Create a new file vllm/v1/spec_decode/universal_draft_model.py with the UniversalDraftModelProposer class, inheriting from SpecDecodeBaseProposer.
2. Create a new file vllm/v1/spec_decode/vocab_mapping.py with the VocabMapping class, handling vocabulary intersection and token ID mapping.
3. Modify vllm/config/speculative.py to register the "universal_draft" method and skip the same-vocab check.
4. Modify vllm/v1/worker/gpu_model_runner.py to instantiate UniversalDraftModelProposer for the new method.

Example code for vocab_mapping.py:

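class VocabMapping:
    def __init__(self, target_vocab, draft_vocab):
        # Both vocabularies map token string -> token ID.
        self.target_vocab = target_vocab
        self.draft_vocab = draft_vocab
        self.intersection = self.compute_intersection()
        # Draft token ID -> target token ID for every shared token.
        self.draft_to_target = {
            draft_vocab[tok]: target_vocab[tok] for tok in self.intersection
        }

    def compute_intersection(self):
        # Token strings present in both vocabularies. Exact string match only;
        # the text normalization described in the issue is not applied here.
        return self.target_vocab.keys() & self.draft_vocab.keys()

    def map_tokens(self, draft_token_ids):
        # Translate draft token IDs into target token IDs. A KeyError means the
        # drafter proposed a token outside the intersection.
        return [self.draft_to_target[tok] for tok in draft_token_ids]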

Example code for universal_draft_model.py:

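class UniversalDraftModelProposer(SpecDecodeBaseProposer):
    # Illustrative sketch only. SpecDecodeBaseProposer's import and constructor
    # arguments are not shown in the issue, so base-class setup is omitted.
    def __init__(self, draft_model, vocab_mapping, num_speculative_tokens):
        self.draft_model = draft_model  # assumed: HuggingFace-style causal LM
        self.vocab_mapping = vocab_mapping  # VocabMapping over both vocabularies
        self.num_speculative_tokens = num_speculative_tokens

    def propose_tokens(self, input_ids):
        # input_ids: draft-vocabulary token IDs, shape (1, seq_len).
        # max_new_tokens bounds the draft length; num_return_sequences would
        # instead request several parallel sequences.
        output = self.draft_model.generate(
            input_ids,
            max_new_tokens=self.num_speculative_tokens,
            do_sample=False,
        )
        draft_ids = output[0, input_ids.shape[1]:].tolist()
        # Restricting the drafter to the intersection (masking other logits to
        # -inf) is not shown; map_tokens assumes it has been applied.
        return self.vocab_mapping.map_tokens(draft_ids)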
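
The proposer example refers to constraining the draft model to the intersection but does not show how. Below is a dependency-free sketch of that masking step on a single logit vector. `mask_logits` and `softmax` are hypothetical helpers, and a real implementation would presumably operate on batched GPU tensors instead of Python lists.

```python
import math

def mask_logits(logits, allowed_ids):
    """Set the logit of every token outside allowed_ids to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

draft_logits = [2.0, 0.5, 3.0, -1.0]  # index = draft token ID
intersection_ids = [0, 1, 3]          # draft ID 2 has no target counterpart

probs = softmax(mask_logits(draft_logits, intersection_ids))
print(probs[2])  # 0.0
```

Draft ID 2 had the largest raw logit, yet after masking it receives zero probability, so the drafter can never propose it, and the remaining probabilities still sum to 1.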

Verification

To verify the implementation, test the UniversalDraftModelProposer with different models and vocabularies, checking that the proposed tokens are correctly mapped between vocabularies.

Extra Tips

- Ensure that the VocabMapping class correctly computes the token-level intersection between the target and draft vocabularies.
- Test the implementation with models having highly overlapping vocabularies (e.g., Qwen2.5 family) to verify the draft acceptance rate and mean accepted tokens.
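
One way to make the first tip concrete is a set of structural checks on the draft-to-target ID mapping. The sketch below assumes the mapping is a plain dict from draft ID to target ID and that both vocabularies are token-to-ID dicts. `check_mapping` and `coverage` are hypothetical helpers, and the exact-string comparison should be swapped for whatever normalization the mapping was built with.

```python
def check_mapping(draft_to_target, target_vocab, draft_vocab):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    target_text = {tid: tok for tok, tid in target_vocab.items()}
    draft_text = {did: tok for tok, did in draft_vocab.items()}
    problems = []
    for did, tid in draft_to_target.items():
        if did not in draft_text:
            problems.append(f"draft ID {did} is not in the draft vocabulary")
        elif tid not in target_text:
            problems.append(f"target ID {tid} is not in the target vocabulary")
        elif draft_text[did] != target_text[tid]:
            problems.append(
                f"draft ID {did} ({draft_text[did]!r}) maps to {target_text[tid]!r}"
            )
    return problems

def coverage(draft_to_target, target_vocab):
    """Fraction of the target vocabulary the drafter can reach."""
    return len(set(draft_to_target.values())) / len(target_vocab)

target_vocab = {"a": 0, "b": 1, "c": 2}
draft_vocab = {"b": 5, "a": 6, "z": 7}

good = {5: 1, 6: 0}
bad = {5: 0}  # "b" wrongly mapped to "a"

print(check_mapping(good, target_vocab, draft_vocab))  # []
print(len(check_mapping(bad, target_vocab, draft_vocab)))  # 1
```

The `coverage` value plays the same role as the vocabulary-intersection percentage reported in the PR's functional test.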
