vllm - ✅(Solved) Fix [Bug]: `--reasoning-parser gemma4` silently disables structured output (xgrammar) when `enable_thinking=false` [1 pull requests, 2 comments, 2 participants]

vllm-project/vllm#39130 · Fetched 2026-04-08 03:01:50
Comments: 2 · Participants: 2 · Timeline events: 8 · Reactions: 1
Timeline (top): commented ×2, subscribed ×2, assigned ×1, cross-referenced ×1

| Config | Bitmask filled? | FSM advances? | Grammar enforced? |
|---|---|---|---|
| --reasoning-parser gemma4 + enable_thinking: false | NO | NO | NO (silently bypassed) |
| No --reasoning-parser | YES | YES | YES (works correctly) |

Root Cause

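The bug surfaces in vllm/v1/structured_output/__init__.py, where should_fill_bitmask() and should_advance() consult the reasoning parser. The faulty value itself comes from the fallback return in BaseThinkingReasoningParser.is_reasoning_end() (vllm/reasoning/basic_parsers.py).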

Fix Action

Fixed

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Description (problem / solution / changelog)

Essential Checks

  • Ran pre-commit run ruff-check and ruff-format on changed files — both passed
  • Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
  • PR is not a duplicate — searched open PRs for #39130 related fixes, found none
  • This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

  1. should_fill_bitmask() → returns False → xgrammar bitmask never computed
  2. should_advance() → returns False → FSM never advances
  3. The model never generates the end-of-reasoning token → state never transitions

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.
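The effect of the one-line change can be seen in a standalone sketch. This is not vLLM's code: it mirrors the backward scan quoted in this report, and the names `START_ID`, `END_ID`, and the `fallback` parameter are hypothetical, introduced only so both behaviors can be compared side by side. The ids 100 and 101 are the Gemma 4 delimiter ids reported in the issue.

```python
from typing import Sequence

START_ID = 100  # <|channel>, id as reported in the issue for Gemma 4
END_ID = 101    # <channel|>


def is_reasoning_end(input_ids: Sequence[int], fallback: bool) -> bool:
    """Scan backward for the most recent reasoning delimiter.

    `fallback` is the answer when neither delimiter is present:
    False reproduces the old behavior, True the patched behavior.
    """
    for token_id in reversed(input_ids):
        if token_id == START_ID:
            return False  # a reasoning block is still open
        if token_id == END_ID:
            return True   # the last reasoning block was closed
    return fallback


prompt_without_thinking = [2, 17, 42, 9]  # no delimiter tokens at all
old = is_reasoning_end(prompt_without_thinking, fallback=False)
new = is_reasoning_end(prompt_without_thinking, fallback=True)
open_block = is_reasoning_end([2, START_ID, 17], fallback=True)
closed_block = is_reasoning_end([2, START_ID, 17, END_ID, 9], fallback=True)
```

Only the no-delimiter prompt changes its answer between the two fallbacks. Prompts with an open or closed reasoning block return before the fallback is reached, which matches the PR's claim that all other paths are unchanged.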

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.

Changes

| File | Change |
|---|---|
| vllm/reasoning/basic_parsers.py | return False → return True (line 73) |
| tests/reasoning/test_base_thinking_reasoning_parser.py | Updated 2 assertions to expect True for no-reasoning-token cases |
| tests/reasoning/test_gemma4_reasoning_parser.py | Updated NO_REASONING and EMPTY test case expectations |

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

  • tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
  • tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
  • tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
  • tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
  • tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
  • tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
  • tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
  • vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/gemma4_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
  • vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
  • vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
  • vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)

Code Example

vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --max-model-len 25600 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --structured-outputs-config '{"backend":"xgrammar"}'

---

# vllm/reasoning/basic_parsers.py
def is_reasoning_end(self, input_ids):
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False
        if input_ids[i] == end_token_id:
            return True
    return False  # ← neither found → "reasoning has NOT ended"

---

# vllm/v1/structured_output/__init__.py
def should_fill_bitmask(self, request):
    if self.reasoner is not None:
        if self.enable_in_reasoning:  # default: False
            return True
        return request.structured_output_request.reasoning_ended  # ← False!
    return True  # ← no parser: always fill

def should_advance(self, request):
    if self.reasoner is None:
        return True  # ← no parser: always advance
    if self.enable_in_reasoning:
        return True
    structured_req = request.structured_output_request
    if structured_req.reasoning_ended:  # ← False!
        return True
    # ... checks is_reasoning_end_streaming() for <channel|> in delta ...
    return False  # ← never advances

---

# Fix in basic_parsers.py
def is_reasoning_end(self, input_ids):
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False
        if input_ids[i] == end_token_id:
            return True
    return True  # ← no thinking tokens found → treat as "not in reasoning"

Your current environment

  • vLLM version: v0.19.0 (vllm/vllm-openai:v0.19.0-x86_64-cu130-ubuntu2404 with transformers>=5.5.0)
  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB GDDR7, SM 12.0)
  • OS: Linux 5.15.0-171-generic
  • CUDA: 13.0
  • Python: 3.12

Model

google/gemma-4-E4B-it (dense 4B model, FP8 quantization)

Also applies to any Gemma 4 variant (google/gemma-4-26B-A4B-it, google/gemma-4-31B-it) — and likely any model whose reasoning parser uses channel-style delimiters not present in the prompt.

Command

vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --max-model-len 25600 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --structured-outputs-config '{"backend":"xgrammar"}'

🐛 Describe the bug

When --reasoning-parser gemma4 is specified together with --default-chat-template-kwargs '{"enable_thinking": false}', the xgrammar structured output engine is completely bypassed for every request. Grammar constraints (JSON schema, BNF, etc.) are never enforced. The model generates unconstrained text that happens to look valid because it's well-trained, but the grammar FSM never runs.

This manifests as a dramatic performance difference that led to discovering the bug:

Benchmark data (single GPU, google/gemma-4-E4B-it FP8)

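| max_parallel | TPS (with parser, grammar bypassed) | TPS (without parser, grammar enforced) |
|---|---|---|
| 1 | 95.8 | 65.8 |
| 2 | 92.1 | 51.5 |
| 4 | 87.8 | 23.1 |
| 8 | 78.7 | 17.1 |
| 16 | 65.4 | 14.6 |

| max_parallel | TTFT (with parser) | TTFT (without parser) |
|---|---|---|
| 1 | 0.232s | 0.284s |
| 4 | 0.237s | 0.388s |
| 16 | 0.302s | 0.532s |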

The "with parser" numbers are faster because xgrammar bitmask computation and FSM advancement are silently skipped on every decode step.
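To put a number on the gap, the throughput ratio can be computed directly from the TPS values in the table above. The snippet below is only arithmetic on those reported figures; the helper name `bypass_speedup` is made up for illustration, and the ratios say nothing about where the time goes beyond what the issue already states.

```python
# TPS figures copied from the benchmark table: (with parser, without parser).
tps = {
    1: (95.8, 65.8),
    2: (92.1, 51.5),
    4: (87.8, 23.1),
    8: (78.7, 17.1),
    16: (65.4, 14.6),
}


def bypass_speedup(with_parser: float, without_parser: float) -> float:
    """How many times faster the run with the grammar bypassed is."""
    return with_parser / without_parser


speedups = {n: round(bypass_speedup(w, wo), 2) for n, (w, wo) in tps.items()}
for n, ratio in speedups.items():
    print(f"max_parallel={n:>2}: {ratio:.2f}x")
```

The ratio grows from roughly 1.5x at a single parallel request to more than 4x at 8 and 16, so the discrepancy becomes harder to miss as concurrency rises.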

Root cause

The bug is in vllm/v1/structured_output/__init__.py, in the interaction between should_fill_bitmask() / should_advance() and the reasoning parser.

Step 1: is_reasoning_end() returns False for prompts without thinking tokens

Gemma4ReasoningParser inherits from BaseThinkingReasoningParser, which scans the prompt backward for <|channel> (start, token 100) or <channel|> (end, token 101):

# vllm/reasoning/basic_parsers.py
def is_reasoning_end(self, input_ids):
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False
        if input_ids[i] == end_token_id:
            return True
    return False  # ← neither found → "reasoning has NOT ended"

With enable_thinking: false, the prompt contains neither <|channel> nor <channel|>. The method returns False — meaning "reasoning has not ended yet."

Step 2: Structured output engine skips grammar enforcement

# vllm/v1/structured_output/__init__.py
def should_fill_bitmask(self, request):
    if self.reasoner is not None:
        if self.enable_in_reasoning:  # default: False
            return True
        return request.structured_output_request.reasoning_ended  # ← False!
    return True  # ← no parser: always fill

def should_advance(self, request):
    if self.reasoner is None:
        return True  # ← no parser: always advance
    if self.enable_in_reasoning:
        return True
    structured_req = request.structured_output_request
    if structured_req.reasoning_ended:  # ← False!
        return True
    # ... checks is_reasoning_end_streaming() for <channel|> in delta ...
    return False  # ← never advances

Step 3: Model never generates <channel|>, so reasoning_ended stays False forever

Since thinking is disabled, the model never outputs the <channel|> end token. The is_reasoning_end_streaming() check on each decode step never finds it. Grammar enforcement is permanently disabled for the entire generation.
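The three steps can be replayed in a toy decode loop. This is a simplified model, not the vLLM scheduler: `ToyRequest` and `count_constrained_steps` are hypothetical names, and the per-token check stands in for the is_reasoning_end_streaming() lookup described above.

```python
from dataclasses import dataclass
from typing import List

END_ID = 101  # <channel|>, id as reported above


@dataclass
class ToyRequest:
    reasoning_ended: bool  # initial value comes from is_reasoning_end(prompt)


def count_constrained_steps(request: ToyRequest, generated: List[int]) -> int:
    """Count decode steps on which the grammar would be applied."""
    constrained = 0
    for token_id in generated:
        if request.reasoning_ended:
            constrained += 1  # bitmask filled and FSM advanced
        elif token_id == END_ID:
            request.reasoning_ended = True  # gate opens for later steps
    return constrained


output_tokens = [11, 12, 13, 14]  # thinking disabled: END_ID never appears
stuck = count_constrained_steps(ToyRequest(reasoning_ended=False), output_tokens)
fixed = count_constrained_steps(ToyRequest(reasoning_ended=True), output_tokens)
```

With the initial flag set to False and no end token in the output, `stuck` is 0: no step is ever constrained. Starting from True, every step is constrained, which is the behavior the user asked for.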

Summary

| Config | Bitmask filled? | FSM advances? | Grammar enforced? |
|---|---|---|---|
| --reasoning-parser gemma4 + enable_thinking: false | NO | NO | NO (silently bypassed) |
| No --reasoning-parser | YES | YES | YES (works correctly) |

Suggested fix

is_reasoning_end() should return True (not False) when no reasoning tokens are found in the input. If thinking was never started, there is no reasoning to wait for — the model is already in "content" mode.

# Fix in basic_parsers.py
def is_reasoning_end(self, input_ids):
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False
        if input_ids[i] == end_token_id:
            return True
    return True  # ← no thinking tokens found → treat as "not in reasoning"

Alternatively, the structured output engine could handle the "no reasoning tokens in prompt" case explicitly, treating it as reasoning_ended = True.
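A rough sketch of that alternative, assuming the engine can see the prompt token ids and the parser's delimiter ids when a request is created. `initial_reasoning_ended` and `legacy_check` are hypothetical helpers written for this illustration; neither exists in vLLM under these names.

```python
from typing import Callable, Sequence


def initial_reasoning_ended(
    prompt_ids: Sequence[int],
    start_id: int,
    end_id: int,
    parser_check: Callable[[Sequence[int]], bool],
) -> bool:
    """Decide the starting value of reasoning_ended for a new request."""
    has_delimiter = any(t in (start_id, end_id) for t in prompt_ids)
    if not has_delimiter:
        return True  # reasoning never started, so the grammar applies at once
    return parser_check(prompt_ids)  # otherwise defer to the parser


def legacy_check(ids: Sequence[int]) -> bool:
    """Old parser behavior: backward scan that falls back to False."""
    for t in reversed(ids):
        if t == 100:
            return False
        if t == 101:
            return True
    return False


no_thinking = initial_reasoning_ended([2, 17, 42], 100, 101, legacy_check)
thinking_open = initial_reasoning_ended([2, 100, 17], 100, 101, legacy_check)
```

This variant leaves the parser untouched and puts the special case in one place, at the cost of the engine needing to know the delimiter ids. The suggested fix above is smaller because the parser already holds them.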

Related issues

  • #37359 — Same root cause for GPT-OSS models with guidance backend (offline path)
  • #37362 — Same root cause for Nemotron V3 reasoning parser
  • #34650 — Speculative decoding causes </think> detection failure in structured output
  • ollama/ollama#15260 — Identical bug for Gemma 4 in Ollama (think=false + format silently ignored)
  • ollama/ollama#14645 — Same bug for Qwen 3.5 in Ollama

Impact

Silent correctness issue: Users who configure --reasoning-parser gemma4 with enable_thinking: false (a common production setup to get reasoning parsing without thinking overhead) get zero grammar enforcement on structured output. The output appears correct because the model is well-trained, but the safety guarantee of grammar-constrained decoding is completely lost. There are no warnings or errors logged.


extent analysis

TL;DR

The most likely fix for the issue is to update the is_reasoning_end() function in basic_parsers.py to return True when no reasoning tokens are found in the input.

Guidance

  • Review the is_reasoning_end() function in basic_parsers.py and update it to return True when no reasoning tokens are found in the input.
  • Verify that the structured output engine is correctly handling the "no reasoning tokens in prompt" case.
  • Test the updated code with the --reasoning-parser gemma4 and enable_thinking: false configuration to ensure that grammar enforcement is working correctly.
  • Check the related issues (#37359, #37362, #34650, ollama/ollama#15260, ollama/ollama#14645) to see if the same fix applies to other models and parsers.
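For the testing step, one low-cost signal is a client-side check on the returned text. The helper below is a hypothetical sketch using only the standard library; `looks_constrained` and its `required_keys` argument are illustrative and stand in for real schema validation. Because the issue notes that unconstrained output usually still looks valid, a passing check proves little; the check is useful mainly with prompts that push the model away from the requested format, where a failure shows the grammar was not enforced.

```python
import json
from typing import Iterable


def looks_constrained(raw: str, required_keys: Iterable[str]) -> bool:
    """Return True if `raw` is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False  # free text or a JSON fragment wrapped in prose
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


ok = looks_constrained('{"name": "a", "age": 3}', ["name", "age"])
chatty = looks_constrained('Sure! Here is the JSON: {"name": "a"}', ["name"])
```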

Example

# Fix in basic_parsers.py
def is_reasoning_end(self, input_ids):
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False
        if input_ids[i] == end_token_id:
            return True
    return True  # ← no thinking tokens found → treat as "not in reasoning"

Notes

The fix assumes that the is_reasoning_end() function is the root cause of the issue. If the problem persists after updating this function, further investigation may be needed to identify the correct root cause.

Recommendation

Apply the fix by updating the is_reasoning_end() function in basic_parsers.py to return True when no reasoning tokens are found in the input. This should resolve the silent correctness issue and restore grammar enforcement for the --reasoning-parser gemma4 and enable_thinking: false configuration.
