vllm - ✅(Solved) Fix [Bug]: MiniMax-M2.5 reasoning missing in chat completions stream [1 pull requests, 2 comments, 2 participants]

JakubCerven · 2026-03-10T10:01:05Z


GitHub stats

vllm-project/vllm#36632 • Fetched 2026-04-08 00:35:46

Author

JakubCerven

Participants

JakubCerven

ZhongsJie

Timeline (top)

commented ×2, closed ×1, cross-referenced ×1, labeled ×1

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser (https://github.com/vllm-project/vllm/pull/34779)

PR fix notes

PR #34779: [Bugfix] Fix Qwen3/Qwen3.5 Reasoning Parser

Repository: vllm-project/vllm
Author: ywang96
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/34779

Description (problem / solution / changelog)

Purpose

Fix the qwen3 reasoning parser to work with Qwen3.5 models (e.g., Qwen/Qwen3.5-397B-A17B) where the chat template places <think> in the prompt rather than having the model generate it. Previously the parser required both <think> and </think> in the generated output, causing it to fail for Qwen3.5 models entirely.
Fix the "only reasoning" streaming code path in serving.py to check prompt_is_reasoning_end_arr before calling the parser, matching the behavior already present in the tool_choice=auto and tool_choice=required paths. Without this, enable_thinking=False at inference time would misroute content as reasoning.
Fix the tool_choice_function_name streaming path to check prompt_is_reasoning_end_arr before calling the parser (was checked after), preventing a spurious reasoning delta on the first chunk when thinking is disabled.
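The behavior described above can be sketched as a small splitting function. This is a minimal illustration, not vLLM's parser: `split_reasoning` and the `prompt_ended_reasoning` flag are hypothetical names, and treating output without a closing tag as pure reasoning is an assumption made for the sketch.

```python
THINK_START = "<think>"
THINK_END = "</think>"


def split_reasoning(output, prompt_ended_reasoning=False):
    """Return (reasoning, content) for a complete model output.

    Hypothetical helper for illustration; not part of vLLM.
    """
    # Check the prompt-side flag first: if the prompt already closed the
    # reasoning block (e.g. thinking disabled), everything is content.
    if prompt_ended_reasoning:
        return None, output or None

    # The start tag is optional because the chat template may have
    # placed it in the prompt instead of the generated text.
    if output.startswith(THINK_START):
        output = output[len(THINK_START):]

    reasoning, end_tag, content = output.partition(THINK_END)
    if not end_tag:
        # Assumption: without an end tag, the output is still reasoning.
        return reasoning or None, None
    return reasoning or None, content or None
```

Checking the flag before any tag parsing mirrors the ordering the PR describes for the streaming paths: when the prompt has already ended reasoning, no text is routed to the reasoning field.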

Properly fixes #34684

Test Plan

All tests pass in tests/reasoning/test_qwen3_reasoning_parser.py

Test Result

Changed files

tests/reasoning/test_qwen3_reasoning_parser.py (modified, +121/-12)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +31/-16)
vllm/reasoning/qwen3_reasoning_parser.py (modified, +81/-19)

Your current environment

The output of python collect_env.py was not provided; the template placeholder was left unfilled.

🐛 Describe the bug

When serving with vllm/vllm-openai:v0.17.0 and lukealonso/MiniMax-M2.5-NVFP4 --reasoning-parser minimax_m2,

reasoning is missing from responses to /v1/chat/completions with stream=True. It works correctly for non-streaming requests and for /v1/responses.

Below is a script that reproduces the issue:

from openai import OpenAI

# Placeholder connection settings (assumptions; the issue does not show them).
api_key = "EMPTY"
base_url = "http://localhost:8000/v1"
model = "lukealonso/MiniMax-M2.5-NVFP4"

client = OpenAI(api_key=api_key, base_url=base_url)

# Non-streaming request: reasoning is returned as expected.
print("=== Non-Streaming ===")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
)

message = response.choices[0].message
print(f"Reasoning: {getattr(message, 'reasoning_content', None)}")
print(f"Content: {message.content}")

# Streaming request: on v0.17.0 no reasoning deltas are printed.
print("\n=== Streaming ===")
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)
    content = getattr(delta, "content", None)
    if reasoning:
        print(f"Reasoning: {reasoning}")
    if content:
        print(f"Content: {content}")

When running with the same configuration on vllm/vllm-openai:v0.15.1, reasoning is streamed properly.

extent analysis

Fix Plan

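The issue affects the streaming path of /v1/chat/completions in vllm/vllm-openai:v0.17.0. The same client script streams reasoning on v0.15.1, and non-streaming requests and /v1/responses still return it, which points to a server-side change rather than a client bug. The fix therefore has to come from the server; this page lists PR #34779 as the linked fix.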

Code Changes

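A client-side change cannot restore reasoning that the server does not stream, so the loop below is only a more defensive way to read the deltas, not a fix:

Read reasoning_content and content with getattr so that chunks missing either field are skipped instead of printed as None:

print("\n=== Streaming ===")
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # skip chunks that carry no choices
        continue
    delta = chunk.choices[0].delta
    # getattr with a default avoids printing None for absent or empty fields
    reasoning = getattr(delta, "reasoning_content", None)
    content = getattr(delta, "content", None)
    if reasoning:
        print(f"Reasoning: {reasoning}")
    if content:
        print(f"Content: {content}")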

As a workaround, you can downgrade to vllm/vllm-openai:v0.15.1, which the issue description reports as streaming reasoning correctly.

Verification

To verify, run the reproduction script against a server that includes the fix (or against the v0.15.1 image) and check that the streaming section prints Reasoning: lines as well as Content: lines, as the non-streaming section does.
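One way to make this check less manual is to accumulate the streamed deltas and assert that the reasoning text is non-empty. The sketch below assumes the `reasoning_content` field name used in the issue's script; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a real stream so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join streamed deltas into (reasoning, content) strings."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta
        reasoning_parts.append(getattr(delta, "reasoning_content", None) or "")
        content_parts.append(getattr(delta, "content", None) or "")
    return "".join(reasoning_parts), "".join(content_parts)


def fake_chunk(**fields):
    """Build a stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(**fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(reasoning_content="The user "),
    fake_chunk(reasoning_content="says hello."),
    fake_chunk(content="Hi!"),
    SimpleNamespace(choices=[]),
]
reasoning, content = collect_stream(fake_stream)
```

With a real server, passing the `stream` object from the reproduction script to `collect_stream` and checking that `reasoning` is non-empty gives a simple pass/fail signal.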

Extra Tips

Check the vLLM release notes for changes to streaming and reasoning-parser behavior between v0.15.1 and v0.17.0.
If the issue persists, inspect the raw chunks returned by /v1/chat/completions with stream=True to see whether any delta carries reasoning_content at all.
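To inspect what the server actually sends, the raw event stream can be scanned for the delta fields it contains. The following is a minimal sketch that assumes the OpenAI-style `data: {...}` line format terminated by `data: [DONE]`; `delta_fields` is a hypothetical helper and the sample lines are made up for illustration, not captured from a real server.

```python
import json


def delta_fields(sse_lines):
    """Return the set of delta keys that carry a non-empty value."""
    seen = set()
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        for choice in json.loads(payload).get("choices", []):
            for key, value in choice.get("delta", {}).items():
                if value:
                    seen.add(key)
    return seen


# Illustrative sample lines (not real server output).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": ""}}]}',
    'data: {"choices": [{"delta": {"reasoning_content": "thinking"}}]}',
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "data: [DONE]",
]
fields = delta_fields(sample)
```

If `reasoning_content` never appears in the returned set for a real stream, the reasoning is absent from the server's output rather than lost in the client.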
