vllm - ✅(Solved) Fix Guidance structured output blocked during thinking with nemotron_v3 reasoning parser (offline LLM.generate) [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#37362, fetched 2026-04-08 00:53:19
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): cross-referenced ×2, subscribed ×1

When using LLM.generate() (offline/batch mode) with nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 and the guidance structured output backend + nemotron_v3 reasoning parser, the guidance FSM appears to constrain output from the first token instead of waiting for thinking to end (</think>). The model generates 8192 tokens but only produces { as visible content.

Root Cause

The guidance FSM appears to constrain output from the first generated token instead of waiting for the reasoning phase to end at </think>. As a result, the model's thinking tokens are forced into JSON-conforming patterns: all 8192 tokens are generated, but the only visible content is {.

Fix Action

Fixed

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Description (problem / solution / changelog)

Essential Checks

  • Ran pre-commit run ruff-check and ruff-format on changed files — both passed
  • Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
  • PR is not a duplicate — searched open PRs for #39130 related fixes, found none
  • This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

  1. should_fill_bitmask() → returns False → xgrammar bitmask never computed
  2. should_advance() → returns False → FSM never advances
  3. The model never generates the end-of-reasoning token → state never transitions

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.
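
The effect of the fallback value can be illustrated with a small stand-in. This is a simplified sketch, not vLLM's actual implementation: the token ids and the `fallback` parameter are hypothetical and exist only to compare the old and new behavior side by side.

```python
from typing import Sequence

# Hypothetical token ids, chosen only for this illustration.
THINK_START = 1001
THINK_END = 1002


def is_reasoning_end(input_ids: Sequence[int], fallback: bool) -> bool:
    """Scan backward for the most recent reasoning marker.

    Returns True if the end marker is found first, False if the start
    marker is found first, and `fallback` if neither marker is present.
    """
    for token_id in reversed(input_ids):
        if token_id == THINK_END:
            return True
        if token_id == THINK_START:
            return False
    return fallback


# enable_thinking=false: the prompt contains no reasoning markers at all.
prompt_without_thinking = [5, 6, 7]
# Reasoning has started but not finished.
prompt_mid_thinking = [5, THINK_START, 8]
# Reasoning has started and finished.
prompt_after_thinking = [5, THINK_START, 8, THINK_END, 9]

old_result = is_reasoning_end(prompt_without_thinking, fallback=False)
new_result = is_reasoning_end(prompt_without_thinking, fallback=True)
print(old_result, new_result)
```

Only the marker-free case depends on the fallback. The other two prompts return the same value under either setting, which matches the PR's claim that all other paths remain unchanged.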

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.

Changes

File | Change
vllm/reasoning/basic_parsers.py | return False → return True (line 73)
tests/reasoning/test_base_thinking_reasoning_parser.py | Updated 2 assertions to expect True for no-reasoning-token cases
tests/reasoning/test_gemma4_reasoning_parser.py | Updated NO_REASONING and EMPTY test case expectations

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

  • tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
  • tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
  • tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
  • tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
  • tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
  • tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
  • tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
  • vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/gemma4_reasoning_parser.py (modified, +3/-1)
  • vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
  • vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
  • vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)

Code Example

from vllm import LLM, SamplingParams

# Engine-level config: guidance backend plus the nemotron_v3 reasoning parser.
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "nemotron_v3",
    },
)

# JSON schema the final answer must satisfy.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "integer"}},
    "required": ["answer"],
}

# Per-request structured output constraint on the offline generate() path.
outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=8192,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)
# Observed: output is mostly thinking tokens forced into JSON patterns; visible text is just "{"
print(outputs[0].outputs[0].text)

Environment

  • vLLM v0.17.2rc1 (commit 3ec8ae438)
  • LLM.generate() offline path
  • structured_outputs_config = {"backend": "guidance", "reasoning_parser": "nemotron_v3"}
  • enable_thinking=True via chat template

Behavior

  • Thinking OFF + structured output: WORKS — model produces valid JSON in 34 tokens
  • Thinking ON + free-form: WORKS — thinking extracted correctly, correct answer
  • Thinking ON + structured output: FAILS — 8192 tokens generated, visible text is just {

This suggests the guidance FSM constrains during the thinking phase rather than waiting for </think>. The model's thinking tokens are forced into JSON-conforming patterns, producing garbage.
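
The difference between the expected and the reported behavior can be sketched with a toy token stream. This is an illustration only: `constrained_flags` and the `wait_for_think_end` switch are hypothetical names, and the function merely marks which tokens a grammar constraint would apply to. It does not call vLLM or guidance.

```python
from typing import List

THINK_END = "</think>"


def constrained_flags(tokens: List[str], wait_for_think_end: bool) -> List[bool]:
    """Return, per token, whether a grammar constraint would apply to it."""
    flags = []
    # If we do not wait, constraints are active from the very first token.
    reasoning_ended = not wait_for_think_end
    for tok in tokens:
        flags.append(reasoning_ended)
        if tok == THINK_END:
            # Constraints start with the token after the end marker.
            reasoning_ended = True
    return flags


stream = ["2+2", "is", "4", THINK_END, "{", '"answer"', ":", "4", "}"]

expected = constrained_flags(stream, wait_for_think_end=True)
reported = constrained_flags(stream, wait_for_think_end=False)

for tok, exp, rep in zip(stream, expected, reported):
    print(f"{tok!r:12} expected={exp!s:5} reported={rep}")
```

In the expected mode the thinking tokens and the `</think>` marker stay unconstrained and only the JSON tokens are checked. In the reported mode every token is constrained, which is consistent with thinking text being forced into JSON-shaped output.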

Related

Similar to #37359 (GPT-OSS + openai_gptoss parser), but different reasoning parser and different model architecture (NemotronHForCausalLM with <think>/</think> tags vs Harmony channels).

extent analysis

Fix Plan

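The goal is to keep the guidance FSM from constraining tokens until </think> has been generated. The plan below is an unverified suggestion: it adds a thinking_end_token key to structured_outputs_config, which is passed to the LLM constructor rather than to LLM.generate(). Neither the issue nor PR #39138 mentions a thinking_end_token option, so confirm that your vLLM version accepts this key before relying on it.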

  • Update the structured_outputs_config to include a thinking_end_token:
structured_outputs_config = {
    "backend": "guidance",
    "reasoning_parser": "nemotron_v3",
    "thinking_end_token": "</think>",
}
  • Construct the LLM instance with the updated structured_outputs_config:
llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    tensor_parallel_size=2,
    enforce_eager=True,
    trust_remote_code=True,
    structured_outputs_config={
        "backend": "guidance",
        "reasoning_parser": "nemotron_v3",
        "thinking_end_token": "</think>",
    },
)
  • Use the updated llm instance to generate outputs:
outputs = llm.generate(
    ["What is 2+2?"],
    SamplingParams(
        max_tokens=8192,
        temperature=0.0,
        structured_outputs={"type": "json_schema", "value": schema},
    ),
)

Verification

To verify that the fix worked, check the generated output for the correct answer:

print(outputs[0].outputs[0].text)

The output should now contain the correct answer, rather than just {.
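
A printed string is easy to misread, so the check can also be done programmatically. The helper below is a minimal sketch using only the standard library. `is_valid_answer` is a hypothetical name, and it assumes the text passed in holds only the final JSON, with any reasoning content already stripped.

```python
import json


def is_valid_answer(text: str) -> bool:
    """Return True if `text` is a JSON object with an integer "answer"."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Truncated output such as "{" ends up here.
        return False
    if not isinstance(data, dict):
        return False
    answer = data.get("answer")
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(answer, int) and not isinstance(answer, bool)


print(is_valid_answer('{"answer": 4}'))
print(is_valid_answer("{"))
```

With the reproduction above, the call would be `is_valid_answer(outputs[0].outputs[0].text)`. The failing case reported in the issue, where the visible text is just `{`, is rejected because it is not parseable JSON.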

Extra Tips

  • Set structured_outputs_config on the LLM constructor, as in the example code above; SamplingParams carries only the per-request structured_outputs schema.
  • If you're using a different model or parser, you may need to adjust the thinking_end_token accordingly.
  • Consider adding error handling to your code to catch any potential issues with the LLM.generate() call.
