vllm - 💡(How to fix) Fix [Bug]: Streaming reasoning tokens truncated when `</think>` and `<tool_call>` appear in the same delta


StepCodex · 2026-05-20T14:47:15Z




Root Cause

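In vllm/parser/abstract_parser.py, the DelegatingParser.parse_delta method runs reasoning extraction and then tool call extraction on each delta. When the reasoning end token and the tool call start token arrive in the same delta, the tool parser's return value overwrites the delta message that the reasoning parser just produced, so the extracted reasoning content is lost.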



Your current environment

OS: any
vllm: main

🐛 Describe the bug

Problem Description: When using Qwen3.5 models with streaming inference, Multi-Token Prediction (MTP), thinking mode enabled, and tool calling, the last few tokens of the thinking section are occasionally truncated. Non-streaming inference works correctly.

Reproduction Steps:

1. Enable streaming inference with a Qwen3.5 model
2. Enable thinking mode (--reasoning-parser qwen3)
3. Enable tool calling (--tool-call-parser qwen3_coder)
4. Use MTP (default behavior in Qwen3.5)
5. Trigger responses where the model output contains reasoning followed by tool calls
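One way to confirm the truncation on the client side is to concatenate the reasoning text from every streamed delta and compare it with the reasoning returned by a non-streaming request for the same prompt. The sketch below works on plain dictionaries so it runs without a server. The key names `reasoning`, `reasoning_content`, and `tool_calls` are assumptions about the delta payload and may need adjusting for your vLLM version, and the sample deltas are hand-written to mimic the symptom, not captured output.

```python
from typing import Any, Iterable


def collect_stream(deltas: Iterable[dict[str, Any]]) -> tuple[str, list[Any]]:
    """Concatenate reasoning text and gather tool-call fragments from deltas."""
    reasoning_parts: list[str] = []
    tool_calls: list[Any] = []
    for delta in deltas:
        # Assumption: reasoning text arrives under one of these two keys.
        text = delta.get("reasoning") or delta.get("reasoning_content")
        if text:
            reasoning_parts.append(text)
        tool_calls.extend(delta.get("tool_calls") or [])
    return "".join(reasoning_parts), tool_calls


# Hand-written deltas that mimic the truncated stream described in this report.
observed = [
    {"reasoning": "I will use the tool "},
    {"tool_calls": [{"index": 0, "function": {"name": "xxxx"}}]},
]
streamed_reasoning, calls = collect_stream(observed)

# Hypothetical reference value, as a non-streaming run would return it.
reference_reasoning = "I will use the tool Write."

print(repr(streamed_reasoning))
print("matches non-streaming:", streamed_reasoning == reference_reasoning)
```

A mismatch where the streamed text is a strict prefix of the reference is the signature of this bug.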

Expected Behavior: The complete reasoning content should be streamed to the client before transitioning to tool call parsing.

Actual Behavior: When MTP generates multiple tokens in a single inference step that include both the reasoning end token (</think>) and the tool call start token (<tool_call>), the reasoning tokens immediately preceding </think> are lost.

Example: with num_speculative_tokens=3, suppose the model output is <think> I will use the tool Write.</think><tool_call> and the final MTP step produces the delta_text "Write.</think><tool_call>".

- MTP output tokens: ["Write", ".", "</think>", "<tool_call>"]
- Expected streaming output: reasoning="Write.", then the tool call
- Actual streaming output: reasoning is empty or partial, and only the tool call is received. The client sees something like:

Thinking:
I will use the tool 
Tool:
xxxx
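To make the expected result concrete, the sketch below splits the example delta at the reasoning end marker. It is an illustration of the intended outcome only, not the parser code in vLLM; `split_delta` is a hypothetical helper, and the marker strings are taken from the example above.

```python
THINK_END = "</think>"
TOOL_START = "<tool_call>"


def split_delta(delta_text: str) -> tuple[str, str]:
    """Split a delta into (reasoning_text, text_after_reasoning)."""
    reasoning, marker, rest = delta_text.partition(THINK_END)
    if not marker:
        # No end marker in this delta: everything is still reasoning.
        return delta_text, ""
    return reasoning, rest


reasoning, rest = split_delta("Write.</think><tool_call>")
print(repr(reasoning))              # 'Write.'
print(rest.startswith(TOOL_START))  # True
```

Both halves need to reach the client: "Write." as reasoning, and the remainder as input to the tool call parser.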

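Root Cause Analysis: In vllm/parser/abstract_parser.py, the DelegatingParser.parse_delta method processes reasoning extraction and tool call extraction sequentially. When both the reasoning end token and the tool call token appear in the same delta:

1. The reasoning parser correctly extracts the reasoning content
2. However, when the tool parser runs in the same iteration, its return value directly overwrites the delta_message variable, losing the previously extracted reasoning content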
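The overwrite pattern can be reproduced in isolation with a toy model. Everything below is hypothetical: `ToyDelta`, `toy_reasoning_phase`, `toy_tool_phase`, and `parse_delta_overwriting` are stand-ins written for this illustration and do not mirror the actual vLLM classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDelta:
    """Hypothetical stand-in for a streaming delta message."""
    reasoning: Optional[str] = None
    tool_calls: list[str] = field(default_factory=list)


def toy_reasoning_phase(text: str) -> tuple[Optional[ToyDelta], str]:
    """Return (reasoning delta, text left over after </think>)."""
    before, marker, after = text.partition("</think>")
    if not marker:
        return ToyDelta(reasoning=text), ""
    return (ToyDelta(reasoning=before) if before else None), after


def toy_tool_phase(text: str) -> Optional[ToyDelta]:
    """Return a tool-call delta if the text opens a tool call."""
    if text.startswith("<tool_call>"):
        return ToyDelta(tool_calls=["<tool call started>"])
    return None


def parse_delta_overwriting(text: str) -> Optional[ToyDelta]:
    delta_message, remainder = toy_reasoning_phase(text)
    if remainder:
        # Plain assignment: the reasoning delta built above is discarded.
        delta_message = toy_tool_phase(remainder)
    return delta_message


result = parse_delta_overwriting("Write.</think><tool_call>")
print(result)
```

In this toy run the returned delta carries the tool call but its `reasoning` field is empty, which matches the truncated output shown in the example.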

Suggested Fix: Preserve the reasoning delta message and merge results from both parsers instead of overwriting. The fix ensures that when both phases run in the same delta, the reasoning content is retained while adding tool call information.
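A minimal sketch of the merge idea, assuming a simplified message type: `ToyDelta` and `merge_deltas` are hypothetical names introduced here, and the real fix in `DelegatingParser.parse_delta()` may be structured differently. The point is only that the second phase combines with the first result instead of replacing it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDelta:
    """Hypothetical stand-in for a streaming delta message."""
    reasoning: Optional[str] = None
    tool_calls: list[str] = field(default_factory=list)


def merge_deltas(
    first: Optional[ToyDelta], second: Optional[ToyDelta]
) -> Optional[ToyDelta]:
    """Combine two phase results, keeping fields from both."""
    if first is None:
        return second
    if second is None:
        return first
    return ToyDelta(
        # Keep reasoning from the earlier phase when it exists.
        reasoning=first.reasoning if first.reasoning is not None else second.reasoning,
        tool_calls=[*first.tool_calls, *second.tool_calls],
    )


reasoning_delta = ToyDelta(reasoning="Write.")
tool_delta = ToyDelta(tool_calls=["<tool call started>"])
merged = merge_deltas(reasoning_delta, tool_delta)
print(merged)
```

When only one phase produces a result, the helper returns that result unchanged, so deltas that contain only reasoning or only tool call data behave as before.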

Files Modified:

vllm/parser/abstract_parser.py - Fixed the delta message merging logic in DelegatingParser.parse_delta()

Additional Context: This issue only affects streaming inference because non-streaming mode processes the complete output in separate phases without the overwrite issue. The fix maintains backward compatibility and only affects the edge case where reasoning ends and tool calls begin in the same inference step.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page (https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
