vllm - ✅(Solved) Fix vLLM 0.19 may lose tool calls for Qwen/Qwen3.5-35B-A3B-FP8 when XML tool_call is emitted inside <think> [1 pull requests, 9 comments, 6 participants]

vllm | 2026-04-06 03:55:45

GitHub stats

vllm-project/vllm#39056 | Fetched 2026-04-08 02:52:46

Timeline: commented ×9, subscribed ×2, cross-referenced ×1, mentioned ×1

Fix Action

Fix / Workaround

That patch promotes embedded XML tool-call blocks out of reasoning into content in qwen3_reasoning_parser, so the existing qwen3_coder tool parser can still parse them.

PR fix notes

PR #39055: Fix Qwen3 reasoning tool calls embedded inside think

Repository: vllm-project/vllm
Author: ZenoAFfectionate
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39055

Description (problem / solution / changelog)

Summary

This PR fixes a Qwen3/Qwen3.5 non-streaming compatibility issue when using:

--reasoning-parser qwen3
--tool-call-parser qwen3_coder

Qwen models can emit XML tool calls inside <think> ... </think>. The current non-streaming pipeline extracts reasoning first and only parses tool calls from content, so valid XML tool calls embedded in reasoning are lost.

This patch updates qwen3_reasoning_parser to promote valid XML tool-call blocks out of reasoning into content, allowing the existing qwen3_coder tool parser to recover them without changing the generic serving stack.

Why this scope

This PR fixes parser recovery, not model generation behavior. It does not try to prevent Qwen3.5 from emitting tool calls inside <think>; it makes vLLM robust when that output pattern appears.
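The promotion step described in the Summary can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `promote_tool_calls`, the regular expression, and the choice to place promoted blocks ahead of any existing content are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for one complete XML tool-call block (an assumption,
# not the pattern used in vllm/reasoning/qwen3_reasoning_parser.py).
TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*<function=[^>]+>.*?</function>\s*</tool_call>",
    re.DOTALL,
)


def promote_tool_calls(reasoning: str, content: str) -> tuple[str, str]:
    """Move complete <tool_call> blocks from reasoning into content."""
    blocks = TOOL_CALL_RE.findall(reasoning)
    if not blocks:
        # No embedded tool call: leave normal reasoning extraction untouched.
        return reasoning, content
    cleaned = TOOL_CALL_RE.sub("", reasoning).strip()
    promoted = "\n".join(blocks)
    # Assumption: promoted blocks go first, existing content is preserved after.
    new_content = promoted if not content else promoted + "\n" + content
    return cleaned, new_content
```

With the blocks moved into content, a tool parser that only inspects content gets a chance to see them, which is the effect the PR describes.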

Tests

Added tests cover:

unchanged behavior for normal reasoning extraction
embedded tool call promotion from reasoning to content
successful parsing by qwen3_coder
truncated reasoning recovery without </think>
preservation of post-</think> content

Limitation

This change fixes the non-streaming path. Streaming recovery would require additional serving-layer changes and is intentionally left out of this minimal patch.

Changed files

docs/design/qwen3_reasoning_tool_call_recovery.md (added, +88/-0)
tests/reasoning/test_qwen3_reasoning_parser.py (modified, +114/-0)
vllm/reasoning/qwen3_reasoning_parser.py (modified, +49/-2)

Code Example

<think>
...
<tool_call>
<function=Finish>
<parameter=answer>
204
</parameter>
</function>
</tool_call>
</think>
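To make the structure of the block above concrete, here is a small sketch that reads the XML tool-call format into a name and an argument dictionary. It is an illustrative stand-in, not the qwen3_coder parser: the function name `parse_xml_tool_calls` and the regular expressions are assumptions, and parameter values are kept as plain strings.

```python
import re

# Illustrative patterns only; the real qwen3_coder parser may differ.
FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)


def parse_xml_tool_calls(text: str) -> list[dict]:
    """Extract {name, arguments} dicts from <tool_call> blocks in text."""
    calls = []
    for block in re.findall(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL):
        match = FUNCTION_RE.search(block)
        if match is None:
            continue  # skip blocks without a <function=...> element
        name, body = match.group(1), match.group(2)
        # Values stay as stripped strings; no type conversion is attempted.
        arguments = {key: value.strip() for key, value in PARAM_RE.findall(body)}
        calls.append({"name": name, "arguments": arguments})
    return calls
```

Applied to the example, this yields one call named `Finish` with the argument `answer` set to the string `"204"`.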

Problem

With vLLM 0.19, when serving the model:

Qwen/Qwen3.5-35B-A3B-FP8

using:

--reasoning-parser qwen3
--tool-call-parser qwen3_coder

non-streaming tool-call parsing can fail if the model emits XML tool-call markup inside the reasoning region, for example:

<think>
...
<tool_call>
<function=Finish>
<parameter=answer>
204
</parameter>
</function>
</tool_call>
</think>

In this case, the model may have produced a valid tool call, but the OpenAI-compatible response ends up with populated reasoning and empty tool_calls.
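A client can check for this symptom on a returned assistant message. The sketch below assumes the message has been converted to a plain dict and that the reasoning text is exposed under a `reasoning` or `reasoning_content` key; both key names and the helper name `looks_like_lost_tool_call` are assumptions and may not match a given vLLM version.

```python
def looks_like_lost_tool_call(message: dict) -> bool:
    """Return True if reasoning contains tool-call markup but tool_calls is empty."""
    # Assumed field names; adjust to whatever your server version returns.
    reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
    tool_calls = message.get("tool_calls") or []
    return "<tool_call>" in reasoning and not tool_calls
```

A benchmark or agent harness could log or retry responses for which this returns True, rather than silently treating them as "no tool call".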

Observed model

The issue was observed and reproduced with:

Qwen/Qwen3.5-35B-A3B-FP8

It may also affect other Qwen3 / Qwen3.5 models that use the same parser combination, but the confirmed reproduction here is specifically on Qwen/Qwen3.5-35B-A3B-FP8.

Why this happens

The issue appears to come from the interaction between the reasoning parser and the tool parser:

qwen3_reasoning_parser extracts everything before </think> into reasoning.
downstream tool parsing only inspects content.
if <tool_call>...</tool_call> remains inside reasoning, it never reaches qwen3_coder.
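The three steps above can be reproduced with a toy two-stage pipeline. This is a simplified model of the interaction, not vLLM's serving code: `split_reasoning`, `find_tool_calls`, and `toy_pipeline` are hypothetical names, and the real parsers do considerably more.

```python
import re


def split_reasoning(output: str) -> tuple[str, str]:
    """Toy reasoning stage: everything before </think> becomes reasoning."""
    body = output.removeprefix("<think>")
    reasoning, _, content = body.partition("</think>")
    return reasoning.strip(), content.strip()


def find_tool_calls(content: str) -> list[str]:
    """Toy tool stage: only ever looks at content, never at reasoning."""
    return re.findall(r"<tool_call>.*?</tool_call>", content, re.DOTALL)


def toy_pipeline(output: str) -> dict:
    reasoning, content = split_reasoning(output)
    return {
        "reasoning": reasoning,
        "content": content,
        "tool_calls": find_tool_calls(content),
    }
```

Feeding this pipeline an output whose `<tool_call>` block sits before `</think>` gives a populated `reasoning` and an empty `tool_calls` list, while the same block placed after `</think>` is found as expected.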

So the bug is not that vLLM makes Qwen3.5 generate tool calls inside <think>. The bug is that vLLM currently does not recover those tool calls when that output pattern occurs.

Scope

This report is specifically about the non-streaming path, which can break benchmark / agent flows that rely on structured tool_calls being returned.

Proposed fix

I have opened a PR with a minimal parser-side fix here:

PR: https://github.com/vllm-project/vllm/pull/39055

That patch promotes embedded XML tool-call blocks out of reasoning into content in qwen3_reasoning_parser, so the existing qwen3_coder tool parser can still parse them.

Request

Please consider reviewing and merging the PR if the approach looks acceptable.

extent analysis

TL;DR

The most likely resolution is to apply the parser-side patch from PR #39055, which promotes embedded XML tool-call blocks out of reasoning into content in qwen3_reasoning_parser so that qwen3_coder can parse them.

Guidance

Review the PR (https://github.com/vllm-project/vllm/pull/39055) to ensure the proposed fix addresses the issue without introducing new problems.
Verify that the fix works by testing the non-streaming tool-call parsing with the Qwen/Qwen3.5-35B-A3B-FP8 model and the --reasoning-parser qwen3 and --tool-call-parser qwen3_coder options.
Consider testing other Qwen3 / Qwen3.5 models that use the same parser combination to ensure the fix does not introduce issues with other models.
If the PR is merged, re-test the benchmark / agent flows that rely on structured tool_calls being returned to ensure they are working as expected.

Example

No code snippet is provided as the issue does not require a code example to understand the fix.

Notes

The proposed fix covers only the non-streaming path, so streaming tool-call parsing is not addressed by it. The reproduction is confirmed only on Qwen/Qwen3.5-35B-A3B-FP8; other Qwen3 / Qwen3.5 models that use the same parser combination should be tested separately.

Recommendation

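Apply the fix from the proposed PR (https://github.com/vllm-project/vllm/pull/39055). It is a targeted parser-side change for the issue reproduced on Qwen/Qwen3.5-35B-A3B-FP8, and it adds tests for embedded tool-call promotion. Note that the PR is listed above as open and not merged, so until it lands the change has to be applied as a patch.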
