vllm - ✅(Solved) Fix Guidance backend structured output doesn't work with openai_gptoss reasoning parser (offline LLM.generate) [1 pull requests, 1 participants]

vllm, 2026-03-18 00:15:12

GitHub stats

vllm-project/vllm#37359 • Fetched 2026-04-08 00:53:20

Author

ivnle

Participants

ivnle

Timeline (top)

cross-referenced ×3, subscribed ×1

When using LLM.generate() (offline/batch mode) with openai/gpt-oss-20b and the guidance structured output backend, the guidance FSM never activates. The model generates free-form text ignoring the JSON schema.

Root Cause

In vllm/v1/structured_output/__init__.py, should_fill_bitmask() calls self.reasoner.is_reasoning_end(request.prompt_token_ids) on the prompt tokens before any generation occurs. Since the prompt doesn't contain Harmony channel markers (<|channel|>final<|message|>), is_reasoning_end() returns False, and the guidance FSM is never activated.

This works correctly for <think>-based models (Qwen, DeepSeek) because the prompt contains <think> when thinking is enabled, so the parser can find the end marker once </think> is generated. But GPT-OSS uses Harmony channel tokens that only appear in the model output, not the prompt.
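To make the failure mode concrete, here is a toy sketch of a check that only ever sees the prompt. It is not vLLM code: the tokens are plain strings rather than token IDs, the helper `contains_marker` is hypothetical, and the sample prompt and output are simplified assumptions about what a Harmony-style exchange looks like.

```python
# Toy illustration only: tokens are plain strings, not real vLLM token IDs.
HARMONY_END = ["<|channel|>", "final", "<|message|>"]


def contains_marker(tokens, marker):
    """Return True if `marker` occurs as a contiguous run inside `tokens`."""
    n = len(marker)
    return any(tokens[i:i + n] == marker for i in range(len(tokens) - n + 1))


# Hypothetical prompt: it carries no Harmony channel markers.
prompt = ["What", " is", " 2+2", "?"]
# Hypothetical model output: analysis first, then the final channel opens.
output = ["<|channel|>", "analysis", "<|message|>", "Simple sum.",
          "<|channel|>", "final", "<|message|>", '{"answer": 4}']

checked_on_prompt = contains_marker(prompt, HARMONY_END)
checked_on_prompt_and_output = contains_marker(prompt + output, HARMONY_END)
print(checked_on_prompt, checked_on_prompt_and_output)
```

In this sketch the prompt-only check can never succeed, because the marker exists only in generated tokens; the check succeeds only once the output is included.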

Fix Action

Fixed

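Linked PR: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false (https://github.com/vllm-project/vllm/pull/39138). Per the PR notes below, the PR is open and not yet merged, and it lists this issue (#37359) as related while fixing #39130.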

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Repository: vllm-project/vllm
Author: SuperMarioYL
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39138

Description (problem / solution / changelog)

Essential Checks

Ran pre-commit run ruff-check and ruff-format on changed files — both passed
Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
PR is not a duplicate — searched open PRs for #39130 related fixes, found none
This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

should_fill_bitmask() → returns False → xgrammar bitmask never computed
should_advance() → returns False → FSM never advances
The model never generates the end-of-reasoning token → state never transitions

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.
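The effect of the fallback value can be sketched with a simplified stand-in for the backward scan described above. This is not the vLLM source: the function signature, the `fallback` parameter, and the token IDs are assumptions made for illustration.

```python
def is_reasoning_end(input_ids, start_id, end_id, fallback):
    """Scan backward for the most recent reasoning token.

    Simplified stand-in for the behavior described in the PR notes.
    `fallback` is returned when neither token is present.
    """
    for token_id in reversed(input_ids):
        if token_id == end_id:
            return True   # reasoning was closed most recently
        if token_id == start_id:
            return False  # reasoning was opened and not yet closed
    return fallback


START, END = 100, 101      # hypothetical token IDs
plain_prompt = [5, 6, 7]   # enable_thinking=false: no reasoning tokens

before_fix = is_reasoning_end(plain_prompt, START, END, fallback=False)
after_fix = is_reasoning_end(plain_prompt, START, END, fallback=True)
open_reasoning = is_reasoning_end([5, START, 6], START, END, fallback=True)
closed_reasoning = is_reasoning_end([5, START, 6, END, 7], START, END, fallback=True)
print(before_fix, after_fix, open_reasoning, closed_reasoning)
```

Only the no-token case depends on `fallback`; inputs that contain a start or end token return the same result either way, which matches the PR's claim that other paths are unchanged.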

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.

Changes

File	Change
`vllm/reasoning/basic_parsers.py`	`return False` → `return True` (line 73)
`tests/reasoning/test_base_thinking_reasoning_parser.py`	Updated 2 assertions to expect `True` for no-reasoning-token cases
`tests/reasoning/test_gemma4_reasoning_parser.py`	Updated `NO_REASONING` and `EMPTY` test case expectations

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/gemma4_reasoning_parser.py (modified, +3/-1)
vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)

Summary

Environment

vLLM v0.17.2rc1 (commit 3ec8ae438)
LLM.generate() offline path (not OpenAI API server)
structured_outputs_config = {"backend": "guidance", "reasoning_parser": "openai_gptoss"}

Expected Behavior

The guidance FSM should activate once the model generates the content channel markers (<|channel|>final<|message|>), constraining subsequent output to the JSON schema.

Reproduction

from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "openai_gptoss",
    },
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "integer"}},
    "required": ["answer"],
}

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=256,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
# Output is free-form text, not JSON
print(outputs[0].outputs[0].text)

Notes

The OpenAI API server path handles GPT-OSS reasoning correctly via the streaming parser.
The offline LLM.generate() path doesn't use the streaming parser, so reasoning detection relies on is_reasoning_end(), which checks only the prompt tokens.
A possible fix: track reasoning_ended incrementally during generation (checking newly generated tokens) rather than only checking the prompt at startup.

extent analysis

Fix Plan

One way to fix the issue, following the suggestion in the Notes, is to have should_fill_bitmask() track reasoning_ended incrementally during generation instead of checking only the prompt. This means checking the newly generated tokens for the Harmony channel markers (<|channel|>final<|message|>).

Steps:

In vllm/v1/structured_output/__init__.py, keep a reasoning_ended flag that persists across decode steps.
In should_fill_bitmask(), check the generated tokens for the Harmony channel markers and set the flag once they appear.

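Example code (an illustrative sketch, not the actual vLLM class; the tokenizer argument is an assumption):

class StructuredOutput:
    # Harmony marker that opens the final (content) channel.
    FINAL_CHANNEL_MARKER = "<|channel|>final<|message|>"

    def __init__(self, tokenizer):
        # Assumed: any tokenizer exposing decode(token_ids, skip_special_tokens=...).
        self.tokenizer = tokenizer
        self.reasoning_ended = False

    def should_fill_bitmask(self, request, generated_token_ids):
        # Latch the flag so the grammar stays active once the final channel opens.
        if not self.reasoning_ended:
            self.reasoning_ended = self.check_harmony_channel_markers(generated_token_ids)
        return self.reasoning_ended

    def check_harmony_channel_markers(self, token_ids):
        # Decode to text first: a marker string cannot be matched against a list of token IDs.
        text = self.tokenizer.decode(token_ids, skip_special_tokens=False)
        return self.FINAL_CHANNEL_MARKER in text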
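Because generation proceeds a few tokens at a time, the marker may arrive split across decode steps. The sketch below shows one way to handle that by matching on token IDs with a small tail buffer. The class `ReasoningEndTracker`, its `update` method, and the marker IDs are all hypothetical; the real Harmony token IDs are not given in the issue.

```python
class ReasoningEndTracker:
    """Hypothetical helper: latches once the marker IDs appear in the output."""

    def __init__(self, marker_ids):
        self.marker_ids = list(marker_ids)
        self.tail = []                # last len(marker) - 1 IDs seen so far
        self.reasoning_ended = False

    def update(self, new_token_ids):
        """Feed newly generated IDs; return True once the marker has been seen."""
        if self.reasoning_ended:
            return True
        window = self.tail + list(new_token_ids)
        n = len(self.marker_ids)
        for i in range(len(window) - n + 1):
            if window[i:i + n] == self.marker_ids:
                self.reasoning_ended = True
                break
        # Keep only enough history to match a marker split across calls.
        self.tail = window[-(n - 1):] if n > 1 else []
        return self.reasoning_ended


# Placeholder IDs standing in for <|channel|>, final, <|message|>.
tracker = ReasoningEndTracker([11, 22, 33])
steps = [tracker.update(ids) for ids in ([1, 2], [11], [22], [33], [4])]
print(steps)
```

Each call scans only the new tokens plus a short tail, so the cost per step stays small and the flag never resets once set.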

Verification

To verify that the fix worked, you can run the reproduction code again and check if the output is now in JSON format, constrained by the provided schema.

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=256,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
print(outputs[0].outputs[0].text)  # Should print a JSON object with the answer
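Rather than inspecting the printed text by eye, the check can be automated. The following is a minimal hand-rolled validator for the schema used in this issue (an object with a required integer "answer"); the function name `matches_answer_schema` is hypothetical, and a full JSON Schema validator would be the more general choice.

```python
import json


def matches_answer_schema(text):
    """Check `text` by hand against the issue's schema: an object with an integer "answer"."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or "answer" not in data:
        return False
    value = data["answer"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


print(matches_answer_schema('{"answer": 4}'))
print(matches_answer_schema("The answer is 4."))
```

Passing `outputs[0].outputs[0].text` to this function gives a yes/no signal on whether the structured output constraint took effect for this schema.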

Extra Tips

Once a fix is merged, upgrade vLLM to a release that includes it.
If you use other models that go through should_fill_bitmask(), such as the <think>-based models mentioned above (Qwen, DeepSeek), confirm that the change doesn't alter their behavior.
