transformers: Chunked generation produces inconsistent outputs when using compiled forward [2 comments, 3 participants]

GitHub stats
huggingface/transformers#44464 (fetched 2026-04-08 00:28:17)
Comments: 2 · Participants: 3 · Timeline events: 8 · Reactions: 0
Timeline (top): commented ×2, mentioned ×2, subscribed ×2, closed ×1


System Info

  • transformers version: 4.57.6
  • Platform: Linux-6.18.13-arch1-1-x86_64-with-glibc2.43
  • Python version: 3.12.12
  • Huggingface_hub version: 0.36.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA GeForce RTX 2050

Who can help?

@Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The script below passes as written. Change disable_compile to False to make the second assert fail.

import torch
from transformers.models import LlamaForCausalLM, LlamaConfig
from transformers.generation.configuration_utils import GenerationConfig
from transformers.cache_utils import StaticCache

# Small randomly initialized model; the seed is fixed for reproducibility.
torch.manual_seed(2026)
model = LlamaForCausalLM(LlamaConfig(num_hidden_layers=1)).to(device="cuda", dtype=torch.bfloat16)
cache = StaticCache(model.config, max_cache_len=192)  # 64 prompt tokens + 128 generated tokens
tokens = torch.randint(model.vocab_size, size=(4, 64), device=model.device)

gen_kwargs = {
    "eos_token_id": 128001,
    "bos_token_id": 128000,
    "do_sample": False,
    "disable_compile": True,  # set to False to trigger the failure
}

# Reference: generate all 128 new tokens in a single call.
gen_config = GenerationConfig(max_new_tokens=128, min_new_tokens=128, **gen_kwargs)
cache.reset()
out_ref = model.generate(tokens, past_key_values=cache, generation_config=gen_config)

# Chunked: generate 64 tokens, then continue from the last token for 64 more.
gen_config2 = GenerationConfig(max_new_tokens=64, min_new_tokens=64, **gen_kwargs)
cache.reset()
out1 = model.generate(tokens, past_key_values=cache, generation_config=gen_config2)
out2 = model.generate(
    out1[:, -1:],
    past_key_values=cache,
    cache_position=torch.tensor([cache.get_seq_length()], device=model.device),
    attention_mask=torch.ones_like(out1),  # necessary for correct pos_ids
    generation_config=gen_config2,
)

# The first chunk must match the reference prefix, the continuation the remainder.
L = out1.shape[1]
torch.testing.assert_close(out1, out_ref[:, :L])
torch.testing.assert_close(out2[:, 1:], out_ref[:, L:])
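The arguments of the second generate call follow from simple index bookkeeping. Below is a minimal sketch in plain Python (no transformers calls). It assumes, as the reproduction's use of cache.get_seq_length() suggests, that the last sampled token of a call has not yet been written to the cache. The helper chunk_plan and its field names are hypothetical and exist only for illustration.

```python
def chunk_plan(prompt_len, chunk_size, num_chunks):
    """Per-chunk bookkeeping for chunked greedy generation (illustrative only)."""
    plan = []
    total = prompt_len  # length of the running sequence before this chunk
    for i in range(num_chunks):
        if i == 0:
            input_len = prompt_len       # first call receives the full prompt
            cache_position = 0           # and starts writing at cache slot 0
        else:
            input_len = 1                # later calls receive only the last token
            cache_position = total - 1   # assumption: that token is not cached yet
        mask_len = total                 # attention mask spans every token so far
        total += chunk_size
        plan.append({
            "input_len": input_len,
            "cache_position": cache_position,
            "attention_mask_len": mask_len,
            "output_len": input_len + chunk_size,  # generate returns input + new tokens
            "total_len": total,
        })
    return plan


if __name__ == "__main__":
    for step in chunk_plan(prompt_len=64, chunk_size=64, num_chunks=2):
        print(step)
```

For the reproduction (64-token prompt, two chunks of 64 tokens), the second chunk starts at cache position 127 and returns 65 tokens, which is why the comparison drops the first column of out2 before comparing against out_ref[:, L:].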

Expected behavior

I expect a single call to generate to produce the same results as repeated generate calls (with appropriate arguments) when doing greedy decoding, regardless of whether the model calls are compiled.
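The expected property can be illustrated with a toy stand-in for a cached greedy decoder. This is a sketch under stated assumptions, not the transformers implementation: ToyCache, toy_generate, and the hash-based "model" are all hypothetical, chosen only so that the next token depends deterministically on every token seen so far.

```python
VOCAB = 97  # hypothetical vocabulary size for the toy model


class ToyCache:
    """Stands in for a KV cache: a running state over all tokens fed so far."""

    def __init__(self):
        self.seen = 0   # number of tokens folded into the state
        self.state = 0  # deterministic summary of those tokens

    def update(self, tokens):
        for t in tokens:
            self.state = (self.state * 31 + t + 1) % 1_000_003
            self.seen += 1


def toy_generate(input_tokens, cache, max_new_tokens):
    """Greedy decoding: feed pending tokens, pick the next token from the state."""
    out = list(input_tokens)
    pending = list(input_tokens)
    for _ in range(max_new_tokens):
        cache.update(pending)
        nxt = cache.state % VOCAB
        out.append(nxt)
        pending = [nxt]  # the newest token is fed on the next step only
    return out


prompt = [3, 14, 15, 92, 65]

# Single call: 8 new tokens.
ref = toy_generate(prompt, ToyCache(), 8)

# Chunked: 4 new tokens, then resume from the last token with the same cache.
cache = ToyCache()
out1 = toy_generate(prompt, cache, 4)
out2 = toy_generate(out1[-1:], cache, 4)

assert out1 == ref[: len(out1)]
assert out2[1:] == ref[len(out1):]
```

Because the cache carries all state between calls, the chunked run reproduces the single-call output exactly. The issue reports that this equality stops holding for the real model once the forward pass is compiled.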

extent analysis

Fix Plan

The steps below are workarounds that avoid the failing path; they do not change the compiled forward itself.

1. Keep disable_compile set to True when generating in chunks

gen_kwargs = {
    "eos_token_id": 128001,
    "bos_token_id": 128000,
    "do_sample": False,
    "disable_compile": True,  # with False, the second assert in the reproduction fails
}

gen_config2 = GenerationConfig(max_new_tokens=64, min_new_tokens=64, **gen_kwargs)

2. If compilation is required, generate all tokens in a single call instead of two chunks

gen_config = GenerationConfig(max_new_tokens=128, min_new_tokens=128, **gen_kwargs)
cache.reset()
out_ref = model.generate(tokens, past_key_values=cache, generation_config=gen_config)

3. When chunking, compare each chunk against the single-call reference

L = out1.shape[1]
torch.testing.assert_close(out1, out_ref[:, :L])
torch.testing.assert_close(out2[:, 1:], out_ref[:, L:])

Verification

  1. Run the reproduction with disable_compile set to True and verify that both assertions pass.
  2. Set disable_compile to False and rerun. Per the report, the second assertion fails; if both assertions pass, the installed version does not show the problem.

Extra Tips

  • Make sure max_new_tokens in GenerationConfig matches the expected output length of each call (128 for the single call, 64 per chunk in the reproduction).
  • If you need greedy decoding with a compiled forward, consider a single call to generate with the correct max_new_tokens and min_new_tokens settings instead of repeated generate calls.
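When the comparison between chunked and single-call outputs fails, it can help to know where each batch row first diverges rather than only that it does. The helper below is a hypothetical sketch in plain Python operating on nested lists of token IDs; first_mismatch is not part of transformers or PyTorch.

```python
from typing import List, Optional, Sequence


def first_mismatch(
    candidate: Sequence[Sequence[int]],
    reference: Sequence[Sequence[int]],
) -> List[Optional[int]]:
    """For each batch row, return the first index where the rows differ, or None."""
    if len(candidate) != len(reference):
        raise ValueError("batch sizes differ")
    result: List[Optional[int]] = []
    for cand_row, ref_row in zip(candidate, reference):
        if len(cand_row) != len(ref_row):
            raise ValueError("row lengths differ")
        # Index of the first differing token, or None if the rows are identical.
        idx = next(
            (i for i, (a, b) in enumerate(zip(cand_row, ref_row)) if a != b),
            None,
        )
        result.append(idx)
    return result
```

With the tensors from the reproduction, one possible call is first_mismatch(torch.cat([out1, out2[:, 1:]], dim=1).tolist(), out_ref.tolist()). An index at or just after the chunk boundary would point at the resumed call rather than at the first chunk.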
