transformers - ✅(Solved) Fix [Bug] `transformers serve --continuous-batching` crashes with multimodal models (Qwen3.5) — AttributeError: 'str' object has no attribute 'to' [2 pull requests, 3 comments, 4 participants]

GitHub stats
huggingface/transformers#44423 (fetched 2026-04-08 00:28:32)
Comments: 3
Participants: 4
Timeline: 18
Reactions: 2
Timeline (top): mentioned ×6, subscribed ×6, commented ×3, cross-referenced ×2

Error Message

File "transformers/cli/serve.py", line 831, in continuous_batching_chat_completion
    ).to(model.device)["input_ids"][0]
      ^^
AttributeError: 'str' object has no attribute 'to'

Root Cause

In serve.py, lines 829-831:

inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

For multimodal models like Qwen3.5, processor is a multimodal processor (not a plain tokenizer). Its apply_chat_template() returns a plain string instead of a BatchEncoding, so calling .to(model.device) raises AttributeError.

Note: Without --continuous-batching, the server works fine with Qwen3.5.
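The failure mode can be reproduced without loading a model. The sketch below uses a hypothetical StubProcessor (not a transformers class) whose apply_chat_template returns text unless tokenize=True is passed. That default is an assumption, chosen to be consistent with the PR notes on this page.

```python
class StubProcessor:
    """Hypothetical stand-in for a multimodal processor (not the transformers class)."""

    def apply_chat_template(self, messages, tokenize=False, **kwargs):
        # Render the conversation as plain text.
        text = "".join(f"<{m['role']}>{m['content']}" for m in messages)
        if not tokenize:
            # Assumed default: return the rendered prompt, not token IDs.
            return text
        # Fake token IDs: one integer per character.
        return {"input_ids": [[ord(ch) for ch in text]]}


def reproduce_error() -> str:
    """Mimic the failing call chain and return the resulting error message."""
    processor = StubProcessor()
    messages = [{"role": "user", "content": "Hello"}]
    rendered = processor.apply_chat_template(
        messages, return_tensors="pt", add_generation_prompt=True, return_dict=True
    )
    try:
        rendered.to("cpu")  # a str has no .to() method
    except AttributeError as exc:
        return str(exc)
    return "no error"


print(reproduce_error())
```

The stub only illustrates the type mismatch: the call site expects an object with a .to() method and receives a string.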

Fix Action

Fixed

PR fix notes

PR #44424: Fix transformers serve --continuous-batching for multimodal models

Description (problem / solution / changelog)

What does this PR do?

Fixes AttributeError: 'str' object has no attribute 'to' when using transformers serve --continuous-batching with multimodal models like Qwen3.5-9B.

processor.apply_chat_template() returns a plain string (not BatchEncoding) for some multimodal processors. The current code calls .to(model.device) directly on the return value, which fails.

Added a type check: if the output is a string, tokenize it first using tokenizer before moving to device.

Fixes #44423
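The type check described in this PR might look roughly like the sketch below. StubEncoding and stub_tokenizer are hypothetical stand-ins so the example runs without a model, and to_input_ids is an illustrative helper name, not a function from serve.py.

```python
class StubEncoding(dict):
    """Hypothetical stand-in for BatchEncoding: a dict with a .to(device) method."""

    def to(self, device):
        # A real encoding would move its tensors; here we only record the device.
        self["device"] = device
        return self


def stub_tokenizer(text, return_tensors=None):
    """Hypothetical tokenizer: one fake token ID per character."""
    return StubEncoding(input_ids=[[ord(ch) for ch in text]])


def to_input_ids(template_output, tokenizer, device):
    """Return the first sequence of input IDs, tokenizing first if given a string."""
    if isinstance(template_output, str):
        # The processor returned rendered text, so tokenize it before moving it.
        template_output = tokenizer(template_output, return_tensors="pt")
    return template_output.to(device)["input_ids"][0]


print(to_input_ids("Hi", stub_tokenizer, "cpu"))
print(to_input_ids(StubEncoding(input_ids=[[1, 2]]), stub_tokenizer, "cpu"))
```

Branching on the returned value rather than on the processor class keeps the check independent of which processor type is loaded.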

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. (#44423)
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

Changed files

  • src/transformers/cli/serve.py (modified, +6/-2)

PR #44436: Fix continuous batching for multimodal models

Description (problem / solution / changelog)

Fixes #44423

continuous_batching_chat_completion was missing input preprocessing and tokenize=True in apply_chat_template, causing 'str' object has no attribute 'to' for multimodal models.

Added the same get_model_modality + get_processor_inputs_from_inbound_messages preprocessing already used in generate_chat_completion.
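As a rough illustration of the tokenize=True part of this PR, the keyword arguments can be collected in one place. chat_template_kwargs is a hypothetical helper name. The preprocessing helpers named above are not sketched, because their signatures do not appear on this page.

```python
from typing import Any


def chat_template_kwargs() -> dict[str, Any]:
    """Keyword arguments that ask apply_chat_template for token IDs, not text."""
    return {
        "tokenize": True,  # the argument this PR reports as missing
        "add_generation_prompt": True,
        "return_dict": True,
        "return_tensors": "pt",
    }


# Intended use inside the handler (not executed here, since it needs a loaded model):
#   encoded = processor.apply_chat_template(req["messages"], **chat_template_kwargs())
#   input_ids = encoded.to(model.device)["input_ids"][0]
print(chat_template_kwargs())
```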

Changed files

  • src/transformers/cli/serve.py (modified, +8/-0)


System Info

  • transformers main branch (5.3.0.dev0, commit 5c1c72be)
  • Python 3.11.14
  • PyTorch 2.5.1+cu121
  • OS: Ubuntu Linux

Who can help?

@Lysandre @ArthurZucker @joaogante

Reproduction

  1. Install latest transformers from main:
     pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"
  2. Launch the server with continuous batching:
     transformers serve --force-model Qwen/Qwen3.5-9B --port 8000 --continuous-batching
  3. Send any chat completion request:
     curl -X POST http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "Qwen/Qwen3.5-9B", "messages": [{"role": "user", "content": "Hello"}]}'
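For scripted reproduction, the same request can be built with Python's standard library. This is a minimal sketch: build_chat_request is an illustrative helper name, and the commented-out send step assumes the server from the launch step is listening on port 8000.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {"model": model, "messages": [{"role": "user", "content": text}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000", "Qwen/Qwen3.5-9B", "Hello")

# Sending requires the running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```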


Expected behavior

transformers serve --continuous-batching should handle multimodal models whose processor returns a string from apply_chat_template.

extent analysis

Fix Plan

Fix Name

Multimodal Model Processor Fix

Fix Steps

1. Update serve.py to handle string output from the processor

Update the continuous_batching_chat_completion function in serve.py to check the type of the value returned by apply_chat_template before calling .to(model.device). This follows the description of PR #44424. How the tokenizer is reached is an assumption, since the merged diff is not reproduced here.

# Before
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

# After
output = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
)
if isinstance(output, str):
    # Some multimodal processors return the rendered prompt as text; tokenize it first.
    # Assumption: the processor exposes its tokenizer as processor.tokenizer.
    output = processor.tokenizer(output, return_tensors="pt")
inputs = output.to(model.device)["input_ids"][0]

2. Update serve.py to request tokenized output

Update the continuous_batching_chat_completion function to pass tokenize=True, the argument PR #44436 reports as missing, so that apply_chat_template returns token IDs rather than a string.

# Before
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

# After
inputs = processor.apply_chat_template(
    req["messages"],
    tokenize=True,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)["input_ids"][0]

3. Update serve.py to preprocess multimodal messages

Update the continuous_batching_chat_completion function to run inbound messages through the same get_model_modality and get_processor_inputs_from_inbound_messages preprocessing that generate_chat_completion already uses, as PR #44436 does. The signatures of these helpers are not shown in this report, so no call is sketched for this step.
