transformers - 💡(How to fix) Fix Chunked generation produces inconsistent outputs when using compiled forward [2 comments, 3 participants]

Q: Expected behavior

I expect results from a single call to `generate` to produce the same results as repeated `generate` calls (with appropriate arguments) when doing greedy decoding, independent of whether model calls are being compiled or not.

transformers · 2026-03-05 12:50:02

huggingface/transformers#44464 • Fetched 2026-04-08 00:28:17

System Info

transformers version: 4.57.6
Platform: Linux-6.18.13-arch1-1-x86_64-with-glibc2.43
Python version: 3.12.12
Huggingface_hub version: 0.36.0
Safetensors version: 0.7.0
Accelerate version: 1.12.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.9.1 (CUDA)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: no
Using GPU in script?: yes
GPU type: NVIDIA GeForce RTX 2050

Who can help?

@Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Set `disable_compile` to `False` in `gen_kwargs` to make the second assertion fail.

import torch
from transformers.models import LlamaForCausalLM, LlamaConfig
from transformers.generation.configuration_utils import GenerationConfig
from transformers.cache_utils import StaticCache

torch.manual_seed(2026)
# Randomly initialized single-layer model; no pretrained weights are needed.
model = LlamaForCausalLM(LlamaConfig(num_hidden_layers=1)).to(device="cuda", dtype=torch.bfloat16)
# 64 prompt tokens + 128 new tokens = 192 cache slots.
cache = StaticCache(model.config, max_cache_len=192)
tokens = torch.randint(model.vocab_size, size=(4, 64), device=model.device)

gen_kwargs = {
    "eos_token_id": 128001,
    "bos_token_id": 128000,
    "do_sample": False,  # greedy decoding
    "disable_compile": True,  # set to False to trigger the failing assertion
}

# Reference: generate all 128 new tokens in a single call.
gen_config = GenerationConfig(max_new_tokens=128, min_new_tokens=128, **gen_kwargs)
cache.reset()
out_ref = model.generate(tokens, past_key_values=cache, generation_config=gen_config)

# Chunked: generate 64 new tokens, then continue from the same cache for 64 more.
gen_config2 = GenerationConfig(max_new_tokens=64, min_new_tokens=64, **gen_kwargs)
cache.reset()
out1 = model.generate(tokens, past_key_values=cache, generation_config=gen_config2)
out2 = model.generate(
    out1[:, -1:],  # only the last token is not yet in the cache
    past_key_values=cache,
    cache_position=torch.tensor([cache.get_seq_length()], device=model.device),
    attention_mask=torch.ones_like(out1),  # necessary for correct pos_ids
    generation_config=gen_config2,
)

L = out1.shape[1]
torch.testing.assert_close(out1, out_ref[:, :L])
# out2 starts with the token that was fed in, so skip it when comparing.
torch.testing.assert_close(out2[:, 1:], out_ref[:, L:])

Expected behavior

I expect results from a single call to generate to produce the same results as repeated generate calls (with appropriate arguments) when doing greedy decoding, independent of whether model calls are being compiled or not.
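The expected invariant and the slicing used in the assertions can be illustrated without a GPU. The sketch below is a toy model only: `toy_next_token` and `toy_generate` are hypothetical stand-ins, not transformers APIs, and a plain Python list plays the role of the cache. It assumes the next token is a pure function of the tokens processed so far, which is what greedy decoding should give.

```python
def toy_next_token(processed):
    """Hypothetical greedy step: a pure function of the processed tokens."""
    return (sum(processed) * 31 + len(processed)) % 1000


def toy_generate(input_ids, cache, max_new_tokens):
    """Mimic the calling convention of the reproduction.

    `cache` is a list of tokens already processed; it is updated in place.
    The returned list is the input followed by the new tokens.
    """
    output = list(input_ids)
    pending = list(input_ids)  # tokens not yet in the cache
    for _ in range(max_new_tokens):
        cache.extend(pending)  # "forward pass" over the pending tokens
        token = toy_next_token(cache)
        output.append(token)
        pending = [token]  # the newest token is processed on the next step
    return output


prompt = [3, 1, 4, 1, 5]

# Reference: one call for all 8 new tokens.
cache = []
out_ref = toy_generate(prompt, cache, 8)

# Chunked: 4 new tokens, then 4 more, continuing from the same cache.
cache = []  # analogous to cache.reset()
out1 = toy_generate(prompt, cache, 4)
out2 = toy_generate(out1[-1:], cache, 4)

L = len(out1)
assert out1 == out_ref[:L]
assert out2[1:] == out_ref[L:]  # out2[0] is the token that was fed in
```

As in the reproduction, the last token of `out1` has not been processed when the first call returns, which is why it is passed as the input of the second call and dropped again when comparing.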

extent analysis

Fix Plan

1. Set `disable_compile` to `False` to reproduce the mismatch, keeping `max_new_tokens` at 128 for the reference call

gen_kwargs = {
    "eos_token_id": 128001,
    "bos_token_id": 128000,
    "do_sample": False,
    "disable_compile": False,  # allow generate to compile the forward pass; this exposes the mismatch
}

gen_config = GenerationConfig(max_new_tokens=128, min_new_tokens=128, **gen_kwargs)

2. Workaround: give a single configuration the full budget of 128 new tokens in `GenerationConfig`

gen_config2 = GenerationConfig(max_new_tokens=128, min_new_tokens=128, **gen_kwargs)

3. Workaround: remove the second call to `model.generate` and produce all tokens in one call

cache.reset()
out_ref = model.generate(tokens, past_key_values=cache, generation_config=gen_config2)
# One call returns the 64 prompt tokens followed by the 128 new tokens.
assert out_ref.shape == (tokens.shape[0], tokens.shape[1] + 128)

This sidesteps the mismatch rather than fixing it: chunked generation with a compiled forward still diverges from the single-call result.

Verification

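Run the reproduction with `disable_compile` set to `True` and confirm that both assertions pass.
With `disable_compile` set to `False`, the second assertion is the one reported to fail; a single `generate` call avoids it only because no continuation call is made.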
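When the second assertion fails, `torch.testing.assert_close` reports that the tensors differ, but it can be more useful to know where each row first diverges. The helper below is a minimal sketch for that purpose; `first_divergence` is a hypothetical name, not part of torch or transformers, and it works on plain nested lists such as those returned by `Tensor.tolist()`.

```python
def first_divergence(expected_rows, actual_rows):
    """Return, per row, the index of the first differing token (None if equal)."""
    if len(expected_rows) != len(actual_rows):
        raise ValueError("batch sizes differ")
    result = []
    for expected, actual in zip(expected_rows, actual_rows):
        if len(expected) != len(actual):
            raise ValueError("row lengths differ")
        index = next(
            (i for i, (e, a) in enumerate(zip(expected, actual)) if e != a),
            None,
        )
        result.append(index)
    return result


# Small self-contained example: the second row differs at position 1.
print(first_divergence([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 9, 6]]))
# -> [None, 1]
```

With the tensors from the reproduction, the call would be `first_divergence(out_ref[:, L:].tolist(), out2[:, 1:].tolist())`. Because greedy decoding feeds each token back in, everything after the first differing position is expected to differ as well, so the first index is the informative one.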

Extra Tips

`max_new_tokens` counts only newly generated tokens; the tensor returned by `generate` also contains the input tokens, so size the `StaticCache` for the prompt plus all new tokens.
If you need identical greedy outputs while the forward pass is compiled, prefer a single `generate` call with the full `max_new_tokens` and `min_new_tokens` budget over repeated calls.
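The length bookkeeping behind these tips can be checked with plain arithmetic. The sketch below uses the numbers from the reproduction and assumes `min_new_tokens` equals `max_new_tokens`, so no call stops early; `output_length` is a hypothetical helper introduced only for illustration.

```python
def output_length(input_len, max_new_tokens):
    """Columns in the tensor returned by one generate call (input + new tokens)."""
    return input_len + max_new_tokens


MAX_CACHE_LEN = 192  # StaticCache(max_cache_len=192) in the reproduction
PROMPT_LEN = 64

single = output_length(PROMPT_LEN, 128)  # out_ref
chunk1 = output_length(PROMPT_LEN, 64)   # out1
chunk2 = output_length(1, 64)            # out2, fed only out1[:, -1:]

# out2 repeats the last token of out1, so drop one column when stitching.
chunked_total = chunk1 + (chunk2 - 1)

assert single == chunked_total == MAX_CACHE_LEN
```

Both strategies end at the same total sequence length of 192 tokens, which matches the cache size chosen in the reproduction.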
