vllm - ✅(Solved) Fix Guidance backend structured output doesn't work with openai_gptoss reasoning parser (offline LLM.generate) [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#37359 · Fetched 2026-04-08 00:53:20
Comments: 0 · Participants: 1 · Timeline: 4 · Reactions: 0
Timeline (top): cross-referenced ×3, subscribed ×1

When using LLM.generate() (offline/batch mode) with openai/gpt-oss-20b and the guidance structured output backend, the guidance FSM never activates. The model generates free-form text ignoring the JSON schema.

Root Cause

In vllm/v1/structured_output/__init__.py, should_fill_bitmask() calls self.reasoner.is_reasoning_end(request.prompt_token_ids) on the prompt tokens before any generation occurs. Since the prompt doesn't contain Harmony channel markers (<|channel|>final<|message|>), is_reasoning_end() returns False, and the guidance FSM is never activated.

This works correctly for <think>-based models (Qwen, DeepSeek) because the prompt contains <think> when thinking is enabled, so the parser can find the end marker once </think> is generated. But GPT-OSS uses Harmony channel tokens that only appear in the model output, not the prompt.
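The prompt-only check can be illustrated with a toy model. This is a minimal sketch, not vLLM code: the token IDs, `contains_run`, and `harmony_reasoning_ended` are hypothetical names invented for illustration, and the real marker IDs depend on the tokenizer.

```python
from typing import Sequence

# Hypothetical token IDs; the real IDs for <|channel|>, final and <|message|>
# depend on the tokenizer.
CHANNEL, FINAL, MESSAGE = 10, 11, 12
HARMONY_END = [CHANNEL, FINAL, MESSAGE]


def contains_run(tokens: Sequence[int], run: Sequence[int]) -> bool:
    """Return True if `run` occurs as a contiguous slice of `tokens`."""
    n = len(run)
    if n == 0:
        return False
    return any(list(tokens[i:i + n]) == list(run)
               for i in range(len(tokens) - n + 1))


def harmony_reasoning_ended(token_ids: Sequence[int]) -> bool:
    """Toy stand-in for a Harmony-style is_reasoning_end()."""
    return contains_run(token_ids, HARMONY_END)


prompt = [100, 101, 102]                        # prompt: no channel markers
generated = [200, 201, CHANNEL, FINAL, MESSAGE, 300]

# Checking only the prompt can never see the marker.
print(harmony_reasoning_ended(prompt))
# Checking prompt plus generated tokens does.
print(harmony_reasoning_ended(prompt + generated))
```

If the gate is evaluated only against the prompt, the first result is the only one the scheduler ever sees, which matches the behavior described above.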

Fix Action

Fixed

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Description (problem / solution / changelog)

Essential Checks

  • Ran pre-commit run ruff-check and ruff-format on changed files — both passed
  • Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
  • PR is not a duplicate — searched open PRs for #39130 related fixes, found none
  • This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

  1. should_fill_bitmask() → returns False → xgrammar bitmask never computed
  2. should_advance() → returns False → FSM never advances
  3. The model never generates the end-of-reasoning token → state never transitions

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.
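The effect of the fallback value can be sketched in isolation. The function below is a simplified stand-in written for this page, not the code in vllm/reasoning/basic_parsers.py; `START_ID`, `END_ID`, and the `fallback` parameter are assumptions introduced so both behaviors can be compared side by side.

```python
from typing import Sequence

# Hypothetical token IDs for illustration; real values come from the tokenizer.
START_ID = 1  # stands in for the reasoning start token, e.g. <think>
END_ID = 2    # stands in for the reasoning end token, e.g. </think>


def is_reasoning_end(input_ids: Sequence[int], fallback: bool) -> bool:
    """Scan backward; the most recent reasoning token decides the state."""
    for token_id in reversed(input_ids):
        if token_id == START_ID:
            return False  # reasoning was opened and not yet closed
        if token_id == END_ID:
            return True   # reasoning was closed
    # Neither token present: this is the value the PR changes.
    return fallback


no_thinking_prompt = [100, 101, 102]       # enable_thinking=false: no markers
open_reasoning = [100, START_ID, 200]      # still inside the reasoning span
closed_reasoning = [100, START_ID, 200, END_ID, 300]

print(is_reasoning_end(no_thinking_prompt, fallback=False))  # old fallback
print(is_reasoning_end(no_thinking_prompt, fallback=True))   # new fallback
```

Only the no-marker case depends on `fallback`; the open and closed cases return the same result either way, which is the property the PR description relies on.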

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.

Changes

File | Change
vllm/reasoning/basic_parsers.py | return False → return True (line 73)
tests/reasoning/test_base_thinking_reasoning_parser.py | Updated 2 assertions to expect True for no-reasoning-token cases
tests/reasoning/test_gemma4_reasoning_parser.py | Updated NO_REASONING and EMPTY test case expectations

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

  • tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
  • tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
  • tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
  • tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
  • tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
  • tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
  • tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
  • vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/gemma4_reasoning_parser.py (modified, +3/-1)
  • vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
  • vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
  • vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)

Code Example

from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-20b",
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "openai_gptoss",
    },
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "integer"}},
    "required": ["answer"],
}

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=256,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
# Output is free-form text, not JSON
print(outputs[0].outputs[0].text)
Environment

  • vLLM v0.17.2rc1 (commit 3ec8ae438)
  • LLM.generate() offline path (not OpenAI API server)
  • structured_outputs_config = {"backend": "guidance", "reasoning_parser": "openai_gptoss"}

Expected Behavior

The guidance FSM should activate once the model generates the content channel markers (<|channel|>final<|message|>), constraining subsequent output to the JSON schema.

Notes

  • The OpenAI API server path handles GPT-OSS reasoning correctly via the streaming parser
  • The offline LLM.generate() path doesn't use the streaming parser, so reasoning detection relies on is_reasoning_end() which checks prompt tokens
  • A possible fix: track reasoning_ended incrementally during generation (checking newly generated tokens) rather than only checking the prompt at startup
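The incremental tracking suggested in the last note could look roughly like the sketch below. It is an assumption-laden illustration, not a vLLM API: `ReasoningEndTracker` and its `update()` method are hypothetical names, and the marker is represented by made-up token IDs. The tracker keeps a short tail of earlier tokens so a marker split across two decoding steps is still found.

```python
from typing import List, Sequence


class ReasoningEndTracker:
    """Hypothetical helper: latches to True once the end marker is generated."""

    def __init__(self, end_marker_ids: Sequence[int]) -> None:
        self.marker: List[int] = list(end_marker_ids)
        self.tail: List[int] = []   # last len(marker) - 1 tokens seen so far
        self.ended = False

    def update(self, new_token_ids: Sequence[int]) -> bool:
        """Feed only the newly generated tokens; return the latched state."""
        if self.ended or not self.marker:
            return self.ended
        window = self.tail + list(new_token_ids)
        n = len(self.marker)
        for i in range(len(window) - n + 1):
            if window[i:i + n] == self.marker:
                self.ended = True
                break
        # Keep just enough history to match a marker spanning two updates.
        self.tail = window[-(n - 1):] if n > 1 else []
        return self.ended


# Hypothetical IDs standing in for <|channel|>final<|message|>.
tracker = ReasoningEndTracker([10, 11, 12])
steps = [[200, 10], [11], [12, 300], [301]]
states = [tracker.update(step) for step in steps]
print(states)
```

Each call inspects at most the new tokens plus a marker-length tail, so the per-step cost stays constant instead of rescanning the whole sequence.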

extent analysis

Fix Plan

One way to fix the issue is to have should_fill_bitmask() track reasoning_ended incrementally during generation, checking the newly generated tokens for the Harmony channel markers (<|channel|>final<|message|>) instead of inspecting only the prompt. Because the engine works on token IDs, the markers have to be matched as a sequence of IDs rather than as a string.

Here are the steps to fix the issue:

  • Modify the vllm/v1/structured_output/__init__.py file to track reasoning_ended incrementally.
  • Update the should_fill_bitmask() function to check the generated tokens for the Harmony channel markers.

Example code:

class StructuredOutput:
    """Illustrative sketch only; not vLLM's actual class or signatures."""

    def __init__(self, end_marker_ids):
        # Token IDs that encode <|channel|>final<|message|> (tokenizer-specific).
        self.end_marker_ids = list(end_marker_ids)
        self.reasoning_ended = False

    def should_fill_bitmask(self, request, generated_tokens):
        # Latch: once the marker has been seen, keep constraining output.
        if not self.reasoning_ended:
            self.reasoning_ended = self.check_harmony_channel_markers(generated_tokens)
        return self.reasoning_ended

    def check_harmony_channel_markers(self, tokens):
        # Look for the marker as a contiguous run of token IDs.
        n = len(self.end_marker_ids)
        if n == 0:
            return False
        return any(
            list(tokens[i:i + n]) == self.end_marker_ids
            for i in range(len(tokens) - n + 1)
        )

Verification

To verify that the fix worked, you can run the reproduction code again and check if the output is now in JSON format, constrained by the provided schema.

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=256,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
print(outputs[0].outputs[0].text)  # Should print a JSON object with the answer
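The check can be made mechanical with a small helper. This is a simplified sketch using only the standard library: `matches_answer_schema` is a hypothetical name, and it hard-codes the one-field schema from the reproduction rather than performing general JSON Schema validation.

```python
import json


def matches_answer_schema(text: str) -> bool:
    """Return True if `text` is a JSON object with an integer "answer" field."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False  # free-form text, i.e. the bug is still present
    if not isinstance(obj, dict) or "answer" not in obj:
        return False
    value = obj["answer"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


print(matches_answer_schema('{"answer": 4}'))
print(matches_answer_schema("The answer is 4."))
```

In the reproduction, the helper would be applied to outputs[0].outputs[0].text; a False result indicates the schema was not enforced.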

Extra Tips

  • Make sure to update the vllm library to the latest version after applying the fix.
  • If you're using other models that rely on the should_fill_bitmask() function, ensure that the fix doesn't break their functionality.
