vllm - 💡(How to fix) Fix [Bug]: `json_object` structured output is not enforced after Qwen thinking because reasoning end token is missed with async scheduling + spec decode

vllm2026-05-22 03:51:55

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

When using a Qwen-style reasoning model with:

chat_template_kwargs.enable_thinking: true
response_format: {"type": "json_object"}
default structured_outputs_config.enable_in_reasoning=false
async scheduling + speculative decoding the final response content may contain JSON wrapped in Markdown fences, for example:

{ ... }
instead of a raw JSON object.

The issue appears to be that the reasoning end token </think> is generated, but vLLM fails to detect it in StructuredOutputRequest.should_advance(). As a result, reasoning_ended remains False, should_fill_bitmask() keeps returning False, and the JSON grammar is never applied to the post-thinking content.

Root Cause

apply=True: 0 times advance=True: 0 times reasoning_ended=True: 0 times final content may start with Markdown fence such as ```json JSON grammar is not applied during the final answer generation Root Cause Analysis In vllm/v1/structured_output/init.py, should_advance() detects reasoning end by slicing request.all_token_ids:

Code Example

{ ... }
instead of a raw JSON object.

The issue appears to be that the reasoning end token </think> is generated, but vLLM fails to detect it in StructuredOutputRequest.should_advance(). As a result, reasoning_ended remains False, should_fill_bitmask() keeps returning False, and the JSON grammar is never applied to the post-thinking content.

### 🐛 Describe the bug

Expected Behavior
After the model emits the reasoning end token, vLLM should detect that reasoning has ended and enable the configured json_object structured output constraint for the final answer.

The response content should be constrained as a JSON object and should not freely emit Markdown code fences before the JSON object.

Actual Behavior
The reasoning end token is generated, but reasoning_ended never flips to True.

Observed behavior:

apply=True: 0 times
advance=True: 0 times
reasoning_ended=True: 0 times
final content may start with Markdown fence such as

RAW_BUFFERClick to expand / collapse

Your current environment

Summary

When using a Qwen-style reasoning model with:

chat_template_kwargs.enable_thinking: true
response_format: {"type": "json_object"}
default structured_outputs_config.enable_in_reasoning=false
async scheduling + speculative decoding the final response content may contain JSON wrapped in Markdown fences, for example:

{ ... }
instead of a raw JSON object.

The issue appears to be that the reasoning end token </think> is generated, but vLLM fails to detect it in StructuredOutputRequest.should_advance(). As a result, reasoning_ended remains False, should_fill_bitmask() keeps returning False, and the JSON grammar is never applied to the post-thinking content.

### 🐛 Describe the bug

Expected Behavior
After the model emits the reasoning end token, vLLM should detect that reasoning has ended and enable the configured json_object structured output constraint for the final answer.

The response content should be constrained as a JSON object and should not freely emit Markdown code fences before the JSON object.

Actual Behavior
The reasoning end token is generated, but reasoning_ended never flips to True.

Observed behavior:

apply=True: 0 times
advance=True: 0 times
reasoning_ended=True: 0 times
final content may start with Markdown fence such as ```json
JSON grammar is not applied during the final answer generation
Root Cause Analysis
In vllm/v1/structured_output/__init__.py, should_advance() detects reasoning end by slicing request.all_token_ids:

start = num_computed_tokens - num_output_placeholders
delta_ids = islice(all_token_ids, start, None)
However, in async scheduling + speculative decoding, new_token_ids can contain multiple tokens in one step.

The scheduler first appends new_token_ids to request.all_token_ids, then calls should_advance(request, new_token_ids=new_token_ids).

In one captured case:
new_token_ids=[9, 198, 248069, 271]
end_token_id=248069
first_end_idx=5974
start=5975
delta_slice=[271]
failure=end_before_delta_window

So the end token 248069 was generated in the current batch, but the computed delta window started after it. Therefore:
end_in_new_tokens=True
stream_hit=False
reasoning_ended=False

This prevents structured output from being enabled for the final answer.

Evidence
Diagnostics showed:
reasoner_cls=Qwen3ReasoningParser
enable_in_reasoning=False
structured_reasoning_ended=False
end_token='</think>'
end_token_id=248069



### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering