transformers - 💡(How to fix) Fix Bug, Generation [2 comments, 2 participants]

GitHub stats
huggingface/transformers#45307 (fetched 2026-04-09 07:50:51)
Comments: 2 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×2, closed ×1, reopened ×1

When using assisted generation (model.generate(assistant_model=...)) with models that have different vocabulary sizes but share the same tokenizer family, the AssistantToTargetTranslator crashes because map_input_embeddings is never initialized.

This affects model pairs like Qwen2.5-7B (vocab=152,064) + Qwen2.5-0.5B (vocab=151,936), which share the same Qwen2.5 tokenizer but have different vocab padding.

Error Message

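Without tokenizer and assistant_tokenizer passed to generate():

ValueError: The main and assistant models have different tokenizers.

With tokenizer and assistant_tokenizer passed:

AttributeError: 'AssistantToTargetTranslator' object has no attribute 'map_input_embeddings'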

Root Cause

According to the traceback, AssistantToTargetTranslator.unmap_input_ids (transformers/generation/candidate_generator.py, line 754) executes self.map_input_embeddings.map = False. For this model pair the map_input_embeddings attribute was never set on the translator, so the access raises AttributeError.

Fix Action

Workaround

Catching the error and falling back:

try:
    output = target.generate(input_ids, assistant_model=draft, ...)
except (ValueError, AttributeError):
    # Fall back to standard generation
    output = target.generate(input_ids, max_new_tokens=32)


Title

AssistantToTargetTranslator crashes with AttributeError: 'map_input_embeddings' when using assisted generation with cross-vocab models

Description

When using assisted generation (model.generate(assistant_model=...)) with models that have different vocabulary sizes but share the same tokenizer family, the AssistantToTargetTranslator crashes because map_input_embeddings is never initialized.

This affects model pairs like Qwen2.5-7B (vocab=152,064) + Qwen2.5-0.5B (vocab=151,936), which share the same Qwen2.5 tokenizer but have different vocab padding.
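The size mismatch described above comes down to simple arithmetic. A minimal sketch, using only the two vocabulary sizes quoted in this issue; the helper name vocab_gap is hypothetical, and the function assumes (it does not verify) that both models agree on every token id below the smaller size:

```python
# Vocabulary sizes quoted in the issue for the two Qwen2.5 checkpoints.
TARGET_VOCAB_SIZE = 152_064     # Qwen2.5-7B
ASSISTANT_VOCAB_SIZE = 151_936  # Qwen2.5-0.5B


def vocab_gap(target_size: int, assistant_size: int) -> dict:
    """Summarize how two vocabularies of different sizes overlap.

    Assumption: ids below the smaller size mean the same token in both
    models. That is the premise of the issue, not something checked here.
    """
    shared = min(target_size, assistant_size)
    return {
        "shared_ids": shared,                        # ids valid in both models
        "target_only": target_size - shared,         # extra rows in the target
        "assistant_only": assistant_size - shared,   # extra rows in the assistant
    }


gap = vocab_gap(TARGET_VOCAB_SIZE, ASSISTANT_VOCAB_SIZE)
print(gap)
```

For this pair the target has 128 ids that the assistant lacks, which is consistent with the issue's description of the difference as vocab padding.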

Reproduction

from transformers import AutoModelForCausalLM, AutoTokenizer

target = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

input_ids = tokenizer.encode("Hello world", return_tensors="pt").to("cuda")

# This crashes:
output = target.generate(
    input_ids,
    assistant_model=draft,
    tokenizer=tokenizer,
    assistant_tokenizer=tokenizer,
    max_new_tokens=32,
)

Without tokenizer and assistant_tokenizer, it raises:

ValueError: The main and assistant models have different tokenizers.

With tokenizer and assistant_tokenizer, it raises:

AttributeError: 'AssistantToTargetTranslator' object has no attribute 'map_input_embeddings'

Traceback

File "transformers/generation/utils.py", line 2521, in generate
    result = decoding_method(...)
File "transformers/generation/utils.py", line 3514, in _assisted_decoding
    candidate_input_ids, candidate_logits = candidate_generator.get_candidates(input_ids)
File "transformers/generation/candidate_generator.py", line 933, in get_candidates
    assistant_input_ids, num_added_tokens = self._prepare_assistant_input_ids(target_input_ids)
File "transformers/generation/candidate_generator.py", line 1009, in _prepare_assistant_input_ids
    self._atm_translator.unmap_input_ids()
File "transformers/generation/candidate_generator.py", line 754, in unmap_input_ids
    self.map_input_embeddings.map = False
AttributeError: 'AssistantToTargetTranslator' object has no attribute 'map_input_embeddings'

Expected Behavior

Assisted generation should work with models from the same family that have slightly different vocab sizes, either by:

  1. Properly initializing map_input_embeddings in AssistantToTargetTranslator.__init__
  2. Or handling the case where the tokenizer is the same but vocab sizes differ (padding tokens)
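Option 1 can be illustrated with a small stand-in class. This is a sketch of the defensive-initialization pattern only: TranslatorSketch, MapFlag, and the needs_embedding_map parameter are hypothetical names, and nothing here reflects the actual constructor of AssistantToTargetTranslator in transformers.

```python
class MapFlag:
    """Hypothetical placeholder for whatever object carries the `map` switch."""

    def __init__(self) -> None:
        self.map = True


class TranslatorSketch:
    """Hypothetical stand-in, NOT the real AssistantToTargetTranslator."""

    def __init__(self, needs_embedding_map: bool) -> None:
        # Always create the attribute, so later methods never hit AttributeError.
        self.map_input_embeddings = None
        if needs_embedding_map:
            self.map_input_embeddings = MapFlag()

    def unmap_input_ids(self) -> None:
        # Guard the access instead of assuming the map object exists.
        if self.map_input_embeddings is not None:
            self.map_input_embeddings.map = False


# Both configurations can now call unmap_input_ids() safely.
without_map = TranslatorSketch(needs_embedding_map=False)
without_map.unmap_input_ids()

with_map = TranslatorSketch(needs_embedding_map=True)
with_map.unmap_input_ids()
```

Whether the real fix should initialize the attribute to None, or always build the mapping, is a design decision for the maintainers; the sketch only shows why an unconditional assignment in __init__ removes the crash.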

Environment

  • transformers version: 5.4.0
  • torch version: 2.11.0+cu128
  • Python: 3.13.5
  • GPU: NVIDIA H200
  • OS: Linux (RHEL 9, HPC cluster)

Context

Found while benchmarking a from-scratch speculative decoding implementation against HF's assisted generation across multiple model pairs. The bug only triggers with cross-vocab model pairs (e.g., Qwen2.5 family). Same-vocab pairs (e.g., Llama-3.1-8B + Llama-3.2-1B) work correctly.

Workaround

Catching the error and falling back:

try:
    output = target.generate(input_ids, assistant_model=draft, ...)
except (ValueError, AttributeError):
    # Fall back to standard generation
    output = target.generate(input_ids, max_new_tokens=32)
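The same workaround can be wrapped in a helper that also reports whether the assistant was actually used. A minimal sketch: generate_with_fallback is a hypothetical name, generate_fn is assumed to behave like model.generate, and dropping assistant_tokenizer on the retry is an assumption made here, not documented behavior.

```python
import logging

logger = logging.getLogger(__name__)


def generate_with_fallback(generate_fn, input_ids, assistant_model, **kwargs):
    """Try assisted generation; on the two errors from this issue, retry without it.

    Returns a tuple (output, used_assistant).
    """
    try:
        output = generate_fn(input_ids, assistant_model=assistant_model, **kwargs)
        return output, True
    except (ValueError, AttributeError) as exc:
        logger.warning("Assisted generation failed (%s); using standard generation.", exc)
        # Assumption: assistant_tokenizer is only meaningful with an assistant model.
        plain_kwargs = {k: v for k, v in kwargs.items() if k != "assistant_tokenizer"}
        return generate_fn(input_ids, **plain_kwargs), False
```

With the objects from the reproduction, a call would look like generate_with_fallback(target.generate, input_ids, draft, tokenizer=tokenizer, assistant_tokenizer=tokenizer, max_new_tokens=32). The returned flag makes it visible in benchmarks when a run silently lost the draft model.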

extent analysis

TL;DR

The most likely fix is to properly initialize map_input_embeddings in AssistantToTargetTranslator.__init__ to handle models with different vocabulary sizes but the same tokenizer family.

Guidance

  • Verify that the AssistantToTargetTranslator class is correctly handling the case where the tokenizer is the same but vocab sizes differ by checking the initialization of map_input_embeddings.
  • Consider adding a check in AssistantToTargetTranslator.__init__ to handle the case where the tokenizer is the same but vocab sizes differ.
  • If the above fix is not feasible, use the provided workaround of catching the error and falling back to standard generation.
  • Test the fix with different model pairs to ensure it works correctly for all cases.

Example

# Illustrative sketch of where map_input_embeddings could be initialized.
# The constructor signature and the tokenizer comparison are placeholders,
# not the real transformers implementation.
class AssistantToTargetTranslator:
    def __init__(self, target_tokenizer, assistant_tokenizer):
        if target_tokenizer == assistant_tokenizer:
            # Same tokenizer, but the models' vocab sizes may differ (padding)
            self.map_input_embeddings = ...
        else:
            # Different tokenizers
            self.map_input_embeddings = ...

Notes

The provided workaround may not be ideal as it falls back to standard generation, which may not be the desired behavior. A proper fix would involve initializing map_input_embeddings correctly in AssistantToTargetTranslator.__init__.

Recommendation

Until a proper fix is implemented, apply the workaround: catch the error and fall back to standard generation. Generation then completes for same-family model pairs with slightly different vocab sizes, but it runs without the draft model, so the speedup from assisted generation is lost.
