transformers - ✅(Solved) Fix [Bug] `transformers serve --continuous-batching` crashes with multimodal models (Qwen3.5) — AttributeError: 'str' object has no attribute 'to' [2 pull requests, 3 comments, 4 participants]

Q: Expected behavior

`transformers serve --continuous-batching` should handle multimodal models whose processor returns a string from `apply_chat_template`.

transformers · 2026-03-04 00:51:26

huggingface/transformers#44423 • Fetched 2026-04-08 00:28:32

Error Message

File "transformers/cli/serve.py", line 831, in continuous_batching_chat_completion
    ).to(model.device)["input_ids"][0]
      ^^
AttributeError: 'str' object has no attribute 'to'

Root Cause

In serve.py, lines 829-831:

inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

For multimodal models like Qwen3.5, processor is a multimodal processor rather than a plain tokenizer. Called without tokenize=True, its apply_chat_template() returns the rendered prompt as a plain string instead of a BatchEncoding, so calling .to(model.device) on the result raises AttributeError.

Note: Without --continuous-batching, the server works fine with Qwen3.5.
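
The failure mode can be illustrated without loading a model. The sketch below uses hypothetical stand-ins (StubProcessor and StubEncoding are not transformers classes) and assumes, as PR #44436 indicates, that the processor returns rendered text unless tokenize=True is passed.

```python
class StubEncoding(dict):
    """Hypothetical stand-in for a BatchEncoding: a dict with a .to() method."""

    def to(self, device):
        # A real BatchEncoding would move its tensors; the stub returns itself.
        return self


class StubProcessor:
    """Hypothetical stand-in for a multimodal processor (not a transformers class)."""

    def apply_chat_template(self, messages, tokenize=False, **kwargs):
        prompt = "".join(f"<{m['role']}>{m['content']}" for m in messages)
        if not tokenize:
            # Assumed behaviour behind the bug: rendered text, other kwargs ignored.
            return prompt
        return StubEncoding(input_ids=[[ord(ch) for ch in prompt]])


messages = [{"role": "user", "content": "Hello"}]
processor = StubProcessor()

# Without tokenize=True the result is a str, so .to() fails as in the traceback.
try:
    processor.apply_chat_template(messages, return_dict=True).to("cpu")
    error = None
except AttributeError as exc:
    error = str(exc)

# With tokenize=True the result supports .to() and indexing by "input_ids".
fixed = processor.apply_chat_template(messages, tokenize=True, return_dict=True).to("cpu")
input_ids = fixed["input_ids"][0]
print(error)
```

The stub only mirrors the shape of the problem; the real return types depend on the installed transformers version and the model's processor.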

Fix Action

Fixed

Fixed by PR: Fix continuous batching for multimodal models (https://github.com/huggingface/transformers/pull/44436), merged.
Earlier attempt, closed without merging: Fix transformers serve --continuous-batching for multimodal models (https://github.com/huggingface/transformers/pull/44424)

PR fix notes

PR #44424: Fix `transformers serve --continuous-batching` for multimodal models

Repository: huggingface/transformers
Author: jw9603
State: closed | merged: False
Link: https://github.com/huggingface/transformers/pull/44424

Description (problem / solution / changelog)

What does this PR do?

Fixes AttributeError: 'str' object has no attribute 'to' when using transformers serve --continuous-batching with multimodal models like Qwen3.5-9B.

processor.apply_chat_template() returns a plain string (not BatchEncoding) for some multimodal processors. The current code calls .to(model.device) directly on the return value, which fails.

Added a type check: if the output is a string, tokenize it first using tokenizer before moving to device.

Fixes #44423
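
The type check described above might look roughly like the following sketch. The names ensure_encoding and stub_tokenizer are hypothetical and chosen for illustration; this is not the PR's actual diff.

```python
def ensure_encoding(template_output, tokenizer):
    """Return a tokenized encoding whether the template produced text or tokens.

    `tokenizer` is any callable mapping a prompt string to a dict-like encoding.
    With a real processor this would presumably be its tokenizer (an assumption).
    """
    if isinstance(template_output, str):
        # The template was rendered but not tokenized, so tokenize it now.
        return tokenizer(template_output)
    return template_output


def stub_tokenizer(text):
    # Hypothetical tokenizer: one id per character, batched like `input_ids`.
    return {"input_ids": [[ord(ch) for ch in text]]}


from_text = ensure_encoding("Hi", stub_tokenizer)
from_tokens = ensure_encoding({"input_ids": [[1, 2, 3]]}, stub_tokenizer)
```

Already-tokenized output passes through unchanged, so the check is safe for plain tokenizers as well.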

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case. (#44423)
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@remi-or @ArthurZucker @McPatate

Changed files

src/transformers/cli/serve.py (modified, +6/-2)

PR #44436: Fix continuous batching for multimodal models

Repository: huggingface/transformers
Author: jw9603
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44436

Description (problem / solution / changelog)

Fixes #44423

continuous_batching_chat_completion was missing input preprocessing and tokenize=True in apply_chat_template, causing 'str' object has no attribute 'to' for multimodal models.

Added the same get_model_modality + get_processor_inputs_from_inbound_messages preprocessing already used in generate_chat_completion.
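
To give a rough idea of what message preprocessing for a multimodal processor can involve, the sketch below wraps plain-string content in typed content parts. The helper to_processor_messages is hypothetical and is not the implementation of get_processor_inputs_from_inbound_messages; the list-of-parts shape is an assumption about what multimodal chat templates commonly expect.

```python
def to_processor_messages(messages):
    """Convert plain-string message content into typed content parts.

    Hypothetical helper for illustration only; the real preprocessing in
    serve.py may differ in both shape and behaviour.
    """
    normalized = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Wrap bare text in a single text part.
            content = [{"type": "text", "text": content}]
        normalized.append({"role": message["role"], "content": content})
    return normalized


inbound = [{"role": "user", "content": "Hello"}]
processor_inputs = to_processor_messages(inbound)
```

The helper builds new dictionaries rather than mutating the request, so the inbound messages stay as the client sent them.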

Changed files

src/transformers/cli/serve.py (modified, +8/-0)

System Info

transformers main branch (5.3.0.dev0, commit 5c1c72be)
Python 3.11.14
PyTorch 2.5.1+cu121
OS: Ubuntu Linux

Who can help?

@Lysandre @ArthurZucker @joaogante

Reproduction

Install latest transformers from main:

pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

Launch the server with continuous batching:

transformers serve --force-model Qwen/Qwen3.5-9B --port 8000 --continuous-batching

Send any chat completion request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-9B", "messages": [{"role": "user", "content": "Hello"}]}'
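
For scripted reproduction, the same request can be built from Python with the standard library. This is a minimal sketch: build_chat_request is a hypothetical helper name, and the URL and model simply mirror the curl command above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8000", "Qwen/Qwen3.5-9B", "Hello")
# With the server running, send it with: urllib.request.urlopen(request).read()
```

Building the request separately from sending it makes the payload easy to inspect before hitting the server.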

Error

File "transformers/cli/serve.py", line 831, in continuous_batching_chat_completion
    ).to(model.device)["input_ids"][0]
      ^^
AttributeError: 'str' object has no attribute 'to'

Root Cause

In serve.py, lines 829-831:

inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

Note: Without --continuous-batching, the server works fine with Qwen3.5.

Expected behavior

transformers serve --continuous-batching should handle multimodal models whose processor returns a string from apply_chat_template.

extent analysis

Fix Plan

Fix Name

Multimodal Model Processor Fix

Fix Steps

1. Update `serve.py` to handle multimodal processors

In continuous_batching_chat_completion, check whether apply_chat_template returned a string and, if so, tokenize it before calling .to(model.device). This is the type check described in PR #44424, which was closed without being merged. The snippet is a sketch of that idea rather than the PR's diff; processor.tokenizer is assumed to be the processor's underlying tokenizer.

# Before
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

# After
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
)
if isinstance(inputs, str):
    # A string means the template was rendered but not tokenized.
    inputs = processor.tokenizer(inputs, return_tensors="pt")
inputs = inputs.to(model.device)["input_ids"][0]

2. Update `serve.py` to handle string inputs

Pass tokenize=True to apply_chat_template so that it returns tokenized inputs instead of the rendered prompt string. PR #44436 identifies the missing tokenize=True as one cause of the error.

# Before
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

# After
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True, tokenize=True
).to(model.device)["input_ids"][0]

3. Update `serve.py` to handle multimodal models

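Preprocess the inbound messages the same way generate_chat_completion does, using get_model_modality and get_processor_inputs_from_inbound_messages, before applying the chat template. This is the change merged in PR #44436. The call signatures below are assumptions based on the PR description, not copied from the diff.

# Before
inputs = processor.apply_chat_template(
    req["messages"], return_tensors="pt", add_generation_prompt=True, return_dict=True
).to(model.device)["input_ids"][0]

# After (sketch; signatures assumed)
modality = self.get_model_modality(model, processor=processor)
processor_inputs = self.get_processor_inputs_from_inbound_messages(req["messages"], modality)
inputs = processor.apply_chat_template(
    processor_inputs, return_tensors="pt", add_generation_prompt=True, return_dict=True, tokenize=True
).to(model.device)["input_ids"][0]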

