transformers - ✅ (Solved) Fix: paged generate() emits a stale warning for num_return_sequences [1 pull request, 1 participant]

GitHub stats
huggingface/transformers#45563 (fetched 2026-04-23 07:23:01)
Comments: 0
Participants: 1
Timeline events: 7
Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, cross-referenced ×1, labeled ×1

generate(..., cache_implementation="paged") still warns that num_return_sequences is unsupported for continuous batching.

That warning looks stale: generate_batch() already uses generation_config.num_return_sequences to expand the number of requests.

Root Cause

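The continuous-batching branch of generate() (reached with cache_implementation="paged") issues one combined warning covering both num_return_sequences and num_beams. The num_return_sequences half is out of date: generate_batch() already reads generation_config.num_return_sequences and expands the number of requests accordingly, so the feature works even though the warning says it does not.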
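To make "expand the number of requests" concrete, here is a minimal sketch of what such an expansion could look like. The function name `expand_requests` and the request-id format are hypothetical, chosen for illustration only; the real generate_batch() in transformers may build and name its requests differently.

```python
from typing import Dict, List, Optional


def expand_requests(
    prompts: List[List[int]],
    num_return_sequences: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Hypothetical helper: create one request per (prompt, copy) pair."""
    # Treat an unset value as 1, mirroring the `or 1` used in the reproduction.
    copies = num_return_sequences or 1
    return {
        f"req_{p}_{c}": list(prompt)
        for p, prompt in enumerate(prompts)
        for c in range(copies)
    }


# One prompt with num_return_sequences=2 becomes two independent requests.
expanded = expand_requests([[1, 2, 3]], num_return_sequences=2)
```

Under this reading, num_return_sequences needs no special decoding support on the continuous-batching path: each copy is just another request in the batch.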

Fix Action

Fixed

PR fix notes

PR #45582: generate: drop stale num_return_sequences warning on continuous batching path

Description (problem / solution / changelog)

The continuous-batching branch in generate warned that num_return_sequences was unsupported alongside num_beams, but generate_batch() already honors generation_config.num_return_sequences when expanding requests. The warning fires for any run that explicitly sets num_return_sequences even though the feature works, cluttering logs and misleading users.

Drop the num_return_sequences half of the warning; keep the num_beams guard since beam search is still unsupported on the CB path.
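The change amounts to narrowing a guard condition. The sketch below contrasts the two behaviors, assuming the old check treated either value above 1 as a trigger. The helper names `should_warn_before` and `should_warn_after` are hypothetical and do not exist in transformers; the real check sits inline in src/transformers/generation/utils.py and may be structured differently.

```python
from typing import Optional


def should_warn_before(num_beams: Optional[int], num_return_sequences: Optional[int]) -> bool:
    """Assumed old behavior: either setting above 1 triggers the warning."""
    beams = num_beams or 1
    sequences = num_return_sequences or 1
    return beams > 1 or sequences > 1


def should_warn_after(num_beams: Optional[int], num_return_sequences: Optional[int]) -> bool:
    """Behavior described by the PR: only beam search triggers the warning."""
    beams = num_beams or 1
    # num_return_sequences is deliberately ignored: generate_batch() honors it.
    return beams > 1


# Sampling with two return sequences: warned before, silent after.
before = should_warn_before(num_beams=1, num_return_sequences=2)
after = should_warn_after(num_beams=1, num_return_sequences=2)
```

Beam search stays guarded in both versions, which matches the PR's intent of keeping the num_beams half of the warning.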

Fixes #45563

Changed files

  • setup.py (modified, +1/-1)
  • src/transformers/generation/utils.py (modified, +2/-6)

System Info

  • Transformers version: 5.6.0.dev0
  • Platform: macOS-26.2-arm64-arm-64bit-Mach-O
  • Python version: 3.13.5 (v3.13.5:6cb20a219a8, Jun 11 2025, 12:23:45) [Clang 16.0.0 (clang-1600.0.26.6)]
  • PyTorch version: 2.11.0
  • CUDA available: False
  • MPS available: True

Who can help?

@cyrilvallez @remi-or

Reproduction

Minimal reproduction

import torch
from transformers import GenerationConfig, PretrainedConfig
from transformers.generation.utils import GenerationMixin
from transformers.generation.continuous_batching.requests import GenerationOutput, RequestStatus

class DummyContinuousBatchingGenerateModel(GenerationMixin):
    def __init__(self):
        self.config = PretrainedConfig()
        self.generation_config = GenerationConfig()
        self.device = torch.device("cpu")

    def generate_batch(self, inputs, generation_config=None, **kwargs):
        num_return_sequences = generation_config.num_return_sequences or 1
        return {
            f"req_{i}": GenerationOutput(
                request_id=f"req_{i}",
                prompt_ids=inputs[0],
                generated_tokens=[10 + i],
                status=RequestStatus.FINISHED,
            )
            for i in range(num_return_sequences)
        }

model = DummyContinuousBatchingGenerateModel()
model.generate(
    inputs=torch.tensor([[1, 2, 3]]),
    cache_implementation="paged",
    do_sample=True,
    num_return_sequences=2,
)

On current main, this still emits: num_return_sequences and num_beams are not supported for continuous batching yet.

I have a draft fix here: https://github.com/oleksii-tumanov/transformers/commit/f7a939d95239d26f94195cd6b820e4c720976507

Expected behavior

For valid generate(..., cache_implementation="paged") calls:

  • keep warning for num_beams > 1
  • stop warning for valid num_return_sequences cases

extent analysis

TL;DR

Remove the num_return_sequences half of the warning emitted by generate() when cache_implementation="paged", and keep the num_beams half. The draft fix and PR #45582 both do this.

Guidance

  • Review the draft fix at https://github.com/oleksii-tumanov/transformers/commit/f7a939d95239d26f94195cd6b820e4c720976507 to understand the proposed changes.
  • Verify that the fix only removes the warning for num_return_sequences and still warns for num_beams > 1 when using cache_implementation="paged".
  • Test the updated generate method with different num_return_sequences values to ensure the warning is correctly suppressed.
  • Consider the implications of this change on the behavior of generate_batch and how it handles generation_config.num_return_sequences.
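The verification steps above can be automated by capturing log output around a call. The sketch below uses only the standard logging module and a stand-in logger named "demo.generation". It assumes, without confirmation from the issue, that the library emits this warning through a standard logger; the logger name to use for transformers itself would need to be checked.

```python
import logging
from contextlib import contextmanager


@contextmanager
def capture_log_messages(logger_name: str):
    """Collect WARNING-and-above messages emitted on `logger_name`."""
    messages = []

    class _ListHandler(logging.Handler):
        def emit(self, record):
            messages.append(record.getMessage())

    logger = logging.getLogger(logger_name)
    handler = _ListHandler(level=logging.WARNING)
    logger.addHandler(handler)
    try:
        yield messages
    finally:
        # Always detach so repeated checks do not accumulate handlers.
        logger.removeHandler(handler)


# Stand-in for the library call: a demo logger, not transformers itself.
demo_logger = logging.getLogger("demo.generation")

with capture_log_messages("demo.generation") as seen:
    demo_logger.warning("num_beams is not supported for continuous batching yet.")

# After the fix, no captured message should mention num_return_sequences.
stale = [m for m in seen if "num_return_sequences" in m]
```

To check the real behavior, the model.generate(...) call from the reproduction would go inside the `with` block, once with num_return_sequences=2 and once with num_beams > 1. If the library deduplicates repeated warnings, each case may need a fresh process.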

Example

# After applying the draft fix, this should no longer emit a warning for num_return_sequences
model.generate(
    inputs=torch.tensor([[1, 2, 3]]),
    cache_implementation="paged",
    do_sample=True,
    num_return_sequences=2,
)

Notes

The provided draft fix is specific to the transformers library and its handling of num_return_sequences with cache_implementation="paged". This solution may not apply to other libraries or versions.

Recommendation

Apply the draft fix or pick up PR #45582. This is a fix rather than a workaround: it removes the stale num_return_sequences warning from generate() with cache_implementation="paged" while leaving the num_beams warning in place.
