vllm - ✅(Solved) Fix Guidance structured output blocked during thinking with nemotron_v3 reasoning parser (offline LLM.generate) [1 pull requests, 1 participants]

vllm · 2026-03-18 01:00:07


vllm-project/vllm#37362•Fetched 2026-04-08 00:53:19


Author

ivnle


When using LLM.generate() (offline/batch mode) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and the guidance structured output backend + nemotron_v3 reasoning parser, the guidance FSM appears to constrain output from the first token instead of waiting for thinking to end (</think>). The model generates 8192 tokens but only produces { as visible content.

Linked PR (open and unmerged at fetch time; it lists this issue as related): [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false (https://github.com/vllm-project/vllm/pull/39138)

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Repository: vllm-project/vllm
Author: SuperMarioYL
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39138

Description (problem / solution / changelog)

Essential Checks

Ran pre-commit run ruff-check and ruff-format on changed files — both passed
Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
PR is not a duplicate — searched open PRs for #39130 related fixes, found none
This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

should_fill_bitmask() → returns False → xgrammar bitmask never computed
should_advance() → returns False → FSM never advances
The model never generates the end-of-reasoning token → state never transitions

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.
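
The root cause and fix described above can be illustrated with a simplified stand-in for the backward scan. This is a sketch, not the vLLM source: the function name, the token ids 100 and 101, and the `fallback` parameter are all hypothetical, chosen only to show how the fallback value decides the case where no reasoning token is present.

```python
# Simplified stand-in for the backward scan described above; NOT vLLM's code.
# THINK_START / THINK_END are hypothetical token ids.
THINK_START = 100
THINK_END = 101


def is_reasoning_end_sketch(input_ids: list[int], fallback: bool) -> bool:
    """Scan from the newest token backward for a reasoning marker."""
    for token_id in reversed(input_ids):
        if token_id == THINK_START:
            # Most recent marker opens a reasoning block: still thinking.
            return False
        if token_id == THINK_END:
            # Most recent marker closes a reasoning block: thinking is over.
            return True
    # No marker at all, e.g. the chat template injected none.
    return fallback


no_markers = [1, 2, 3]
still_thinking = [1, THINK_START, 2]
finished = [1, THINK_START, 2, THINK_END, 3]

cases = [
    ("no markers", no_markers),
    ("still thinking", still_thinking),
    ("finished", finished),
]
for name, ids in cases:
    old = is_reasoning_end_sketch(ids, fallback=False)  # behavior before the PR
    new = is_reasoning_end_sketch(ids, fallback=True)   # behavior proposed by the PR
    print(f"{name}: old={old} new={new}")
```

In this sketch only the "no markers" case differs between the two fallbacks, which matches the PR's claim that paths with a start or end token are unaffected.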

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.

Changes

| File | Change |
| --- | --- |
| `vllm/reasoning/basic_parsers.py` | `return False` → `return True` (line 73) |
| `tests/reasoning/test_base_thinking_reasoning_parser.py` | Updated 2 assertions to expect `True` for no-reasoning-token cases |
| `tests/reasoning/test_gemma4_reasoning_parser.py` | Updated `NO_REASONING` and `EMPTY` test case expectations |

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/gemma4_reasoning_parser.py (modified, +3/-1)
vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)


Summary

Environment

vLLM v0.17.2rc1 (commit 3ec8ae438)
LLM.generate() offline path
structured_outputs_config = {"backend": "guidance", "reasoning_parser": "nemotron_v3"}
enable_thinking=True via chat template

Behavior

Thinking OFF + structured output: WORKS — model produces valid JSON in 34 tokens
Thinking ON + free-form: WORKS — thinking extracted correctly, correct answer
Thinking ON + structured output: FAILS — 8192 tokens generated, visible text is just {

This suggests the guidance FSM constrains during the thinking phase rather than waiting for </think>. The model's thinking tokens are forced into JSON-conforming patterns, producing garbage.
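
To make the expected behavior concrete, the toy below contrasts a grammar applied from the first token with one deferred until the end-of-reasoning marker. Everything in it is hypothetical: `constrained_decode_sketch`, the string tokens, and the `allowed` set are illustrative only. Real backends mask token ids so the model is steered toward an allowed token, whereas this sketch simply drops disallowed ones.

```python
# Toy model of reasoning-aware constrained decoding; not vLLM or guidance code.
THINK_END = "</think>"


def constrained_decode_sketch(
    proposed: list[str], allowed: set[str], wait_for_think_end: bool
) -> list[str]:
    """Return the proposed tokens that survive the (toy) grammar check."""
    emitted = []
    # If we do not wait, the grammar is treated as active from token one.
    reasoning_ended = not wait_for_think_end
    for token in proposed:
        if reasoning_ended and token not in allowed:
            # Grammar is active: this token is not permitted, so drop it.
            continue
        emitted.append(token)
        if token == THINK_END:
            # From here on, the grammar applies.
            reasoning_ended = True
    return emitted


proposed = ["2", "plus", "2", "equals", "four", THINK_END,
            "{", '"answer"', ":", "4", "}"]
json_tokens = {"{", '"answer"', ":", "4", "}"}

# Grammar applied from the first token: the reasoning text cannot get through.
print(constrained_decode_sketch(proposed, json_tokens, wait_for_think_end=False))
# Grammar deferred until </think>: reasoning passes, then JSON is enforced.
print(constrained_decode_sketch(proposed, json_tokens, wait_for_think_end=True))
```

The second call corresponds to the behavior the issue expects: free-form thinking first, schema enforcement only after `</think>`.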

Similar to #37359 (GPT-OSS + openai_gptoss parser), but different reasoning parser and different model architecture (NemotronHForCausalLM with <think>/</think> tags vs Harmony channels).

Reproduction

from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "nemotron_v3",
    },
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "integer"}},
    "required": ["answer"],
}

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=8192,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
# Output is mostly thinking tokens forced into JSON patterns, visible text is just "{"
print(outputs[0].outputs[0].text)

extent analysis

Fix Plan

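To address the issue, the guidance FSM needs to wait for the </think> token before constraining output. The plan below proposes adding a thinking_end_token key to the structured_outputs_config passed when constructing the LLM. Neither the issue nor PR #39138 mentions such a key, so treat it as an unverified suggestion and confirm it against the structured output configuration of your vLLM version before relying on it.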

Update the structured_outputs_config to include a thinking_end_token:

structured_outputs_config = {
    "backend": "guidance",
    "reasoning_parser": "nemotron_v3",
    # Unverified key: not mentioned in the issue or in PR #39138.
    "thinking_end_token": "</think>",
}

Construct the LLM with the updated structured_outputs_config (the config is passed to the LLM constructor, not to LLM.generate()):

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "nemotron_v3",
        # Unverified key: not mentioned in the issue or in PR #39138.
        "thinking_end_token": "</think>",
    },
)

Use the updated llm instance to generate outputs:

outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=8192,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)

Verification

To verify that the fix worked, check the generated output for the correct answer:

print(outputs[0].outputs[0].text)

If the constraint is now deferred until after </think>, the visible output should contain a JSON object with an integer answer field, rather than just {.
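
A printed string is easy to misread, so the check can also be done programmatically. The helper below is a minimal sketch: `check_answer_sketch` is a hypothetical name, and it assumes the generated text may still contain the reasoning followed by `</think>`, which may not hold for every configuration.

```python
import json


def check_answer_sketch(text: str, think_end: str = "</think>") -> int:
    """Parse the visible answer and return the integer stored under "answer"."""
    # Keep only what follows the last end-of-reasoning marker, if there is one.
    content = text.rsplit(think_end, 1)[-1].strip()
    # Raises json.JSONDecodeError for truncated output such as a lone "{".
    parsed = json.loads(content)
    answer = parsed["answer"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(answer, bool) or not isinstance(answer, int):
        raise TypeError(f"'answer' should be an integer, got {answer!r}")
    return answer


# Example with a hand-written string standing in for outputs[0].outputs[0].text
print(check_answer_sketch('2 plus 2 is 4</think>{"answer": 4}'))
```

With the failing output from the issue (a lone `{`), `json.loads` raises `json.JSONDecodeError`, which makes the failure explicit instead of relying on visual inspection.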

Extra Tips

Check the keys and values in structured_outputs_config carefully, since a misconfigured backend or reasoning_parser can lead to unexpected behavior.
If you use a different model or reasoning parser, the end-of-thinking marker may differ from </think>; adjust it accordingly.
Wrap the LLM.generate() call and the parsing of its output in error handling so that truncated or malformed output is caught early.
