vllm - ✅(Solved) Fix [Bug]: `--reasoning-parser gemma4` silently disables structured output (xgrammar) when `enable_thinking=false` [1 pull requests, 2 comments, 2 participants]

vllm · 2026-04-06 23:53:02


GitHub stats

vllm-project/vllm#39130 • Fetched 2026-04-08 03:01:50


| Config | Bitmask filled? | FSM advances? | Grammar enforced? |
| --- | --- | --- | --- |
| `--reasoning-parser gemma4` + `enable_thinking: false` | NO | NO | NO (silently bypassed) |
| No `--reasoning-parser` | YES | YES | YES (works correctly) |

Root Cause

The bug is in vllm/v1/structured_output/__init__.py, in the interaction between should_fill_bitmask() / should_advance() and the reasoning parser.

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false (https://github.com/vllm-project/vllm/pull/39138)

PR fix notes

PR #39138: [Bugfix] Fix reasoning parser disabling structured output when enable_thinking=false

Repository: vllm-project/vllm
Author: SuperMarioYL
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39138

Description (problem / solution / changelog)

Essential Checks

Ran pre-commit run ruff-check and ruff-format on changed files — both passed
Ran pytest tests/reasoning/test_base_thinking_reasoning_parser.py — 24/24 passed
PR is not a duplicate — searched open PRs for #39130 related fixes, found none
This is not a low-value change — it fixes a silent correctness bug affecting all users who configure a reasoning parser with enable_thinking=false

Summary

When --reasoning-parser gemma4 (or any BaseThinkingReasoningParser subclass) is used with enable_thinking=false, the xgrammar structured output engine is silently bypassed for every request. Grammar constraints are never enforced, even though the user explicitly requested structured output.

Root Cause

BaseThinkingReasoningParser.is_reasoning_end() scans backward through input_ids looking for reasoning start/end tokens. When enable_thinking=false, the chat template does not inject any reasoning tokens into the prompt. The method's fallback return False incorrectly signals "reasoning has not ended yet", causing:

should_fill_bitmask() → returns False → xgrammar bitmask never computed
should_advance() → returns False → FSM never advances
The model never generates the end-of-reasoning token → state never transitions
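The chain above can be reproduced outside vLLM with a minimal sketch. The token IDs 100 and 101 are the Gemma 4 values quoted in the issue report; the function name `is_reasoning_end_buggy` and the sample prompts are hypothetical. The function is a standalone copy of the scan for illustration, not the vLLM method itself.

```python
from collections.abc import Sequence

START_TOKEN_ID = 100  # <|channel>, per the issue report
END_TOKEN_ID = 101    # <channel|>, per the issue report


def is_reasoning_end_buggy(input_ids: Sequence[int]) -> bool:
    """Standalone copy of the pre-fix backward scan (illustrative only)."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == START_TOKEN_ID:
            return False  # most recent marker is a start: still reasoning
        if input_ids[i] == END_TOKEN_ID:
            return True   # most recent marker is an end: reasoning finished
    return False          # no marker at all: reported as "still reasoning"


# Hypothetical prompts; any ID other than 100/101 is an ordinary token.
prompt_thinking_open = [5, 6, 100, 7]         # start marker, no end yet
prompt_thinking_closed = [5, 100, 7, 101, 8]  # start, then end
prompt_thinking_disabled = [5, 6, 7, 8]       # enable_thinking=false: no markers

results = {
    "open": is_reasoning_end_buggy(prompt_thinking_open),
    "closed": is_reasoning_end_buggy(prompt_thinking_closed),
    "disabled": is_reasoning_end_buggy(prompt_thinking_disabled),
}
print(results)
```

In this sketch the marker-free prompt is indistinguishable from a prompt whose reasoning block is still open, which is the ambiguity the PR description identifies.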

Fix

Change the fallback return value from False to True: when no reasoning tokens are found, reasoning was never started, so it should be treated as already ended. This is consistent with IdentityReasoningParser.is_reasoning_end() which always returns True.

Impact

Affects all BaseThinkingReasoningParser subclasses (Gemma4, Qwen3, DeepSeek, Mistral, etc.) when used with enable_thinking=false. The fix only changes behavior when neither start nor end token is present — all other paths remain unchanged.
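The claim that only the marker-free path changes can be checked by brute force on a toy version of the scan. This is a sketch under stated assumptions: `scan` is a hypothetical standalone function with a configurable fallback, 100 and 101 are the token IDs quoted in the issue, and 7 is an arbitrary filler token.

```python
from itertools import product

START, END, OTHER = 100, 101, 7  # 100/101 from the issue; 7 is an arbitrary filler


def scan(input_ids, fallback):
    """Backward scan with a configurable result for the no-marker case."""
    for token in reversed(input_ids):
        if token == START:
            return False
        if token == END:
            return True
    return fallback


# Enumerate every sequence of length 0..4 over the three token kinds and
# keep those where the old fallback (False) and new fallback (True) disagree.
changed = [
    seq
    for length in range(5)
    for seq in product((START, END, OTHER), repeat=length)
    if scan(seq, fallback=False) != scan(seq, fallback=True)
]

# The two variants should disagree only on sequences containing no marker.
only_markerless = all(START not in seq and END not in seq for seq in changed)
print(len(changed), only_markerless)
```

Under these assumptions the two variants differ on exactly the sequences that contain neither marker, one per length.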

Changes

| File | Change |
| --- | --- |
| `vllm/reasoning/basic_parsers.py` | `return False` → `return True` (line 73) |
| `tests/reasoning/test_base_thinking_reasoning_parser.py` | Updated 2 assertions to expect `True` for no-reasoning-token cases |
| `tests/reasoning/test_gemma4_reasoning_parser.py` | Updated `NO_REASONING` and `EMPTY` test case expectations |

Testing

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -v
# 24/24 passed

Fixes #39130

Related: #37359, #37362

Changed files

tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +1/-1)
tests/reasoning/test_deepseekr1_reasoning_parser.py (modified, +4/-4)
tests/reasoning/test_gemma4_reasoning_parser.py (modified, +2/-2)
tests/reasoning/test_minimax_m2_reasoning_parser.py (modified, +3/-3)
tests/reasoning/test_mistral_reasoning_parser.py (modified, +11/-11)
tests/reasoning/test_seedoss_reasoning_parser.py (modified, +6/-5)
tests/reasoning/test_step3p5_reasoning_parser.py (modified, +4/-4)
vllm/reasoning/deepseek_r1_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/ernie45_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/gemma4_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/minimax_m2_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/mistral_reasoning_parser.py (modified, +3/-1)
vllm/reasoning/qwen3_reasoning_parser.py (modified, +10/-0)
vllm/reasoning/seedoss_reasoning_parser.py (modified, +12/-0)
vllm/reasoning/step3p5_reasoning_parser.py (modified, +3/-1)


Your current environment

vLLM version: v0.19.0 (vllm/vllm-openai:v0.19.0-x86_64-cu130-ubuntu2404 with transformers>=5.5.0)
GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96GB GDDR7, SM 12.0)
OS: Linux 5.15.0-171-generic
CUDA: 13.0
Python: 3.12

Model

google/gemma-4-E4B-it (dense 4B model, FP8 quantization)

Also applies to any Gemma 4 variant (google/gemma-4-26B-A4B-it, google/gemma-4-31B-it) — and likely any model whose reasoning parser uses channel-style delimiters not present in the prompt.

Command

vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --max-model-len 25600 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --structured-outputs-config '{"backend":"xgrammar"}'

🐛 Describe the bug

When --reasoning-parser gemma4 is specified together with --default-chat-template-kwargs '{"enable_thinking": false}', the xgrammar structured output engine is completely bypassed for every request. Grammar constraints (JSON schema, BNF, etc.) are never enforced. The model generates unconstrained text that happens to look valid because it's well-trained, but the grammar FSM never runs.

This manifests as a dramatic performance difference that led to discovering the bug:

Benchmark data (single GPU, `google/gemma-4-E4B-it` FP8)

| max_parallel | TPS (with parser, grammar bypassed) | TPS (without parser, grammar enforced) |
| --- | --- | --- |
| 1 | 95.8 | 65.8 |
| 2 | 92.1 | 51.5 |
| 4 | 87.8 | 23.1 |
| 8 | 78.7 | 17.1 |
| 16 | 65.4 | 14.6 |

| max_parallel | TTFT (with parser) | TTFT (without parser) |
| --- | --- | --- |
| 1 | 0.232s | 0.284s |
| 4 | 0.237s | 0.388s |
| 16 | 0.302s | 0.532s |

The "with parser" numbers are faster because xgrammar bitmask computation and FSM advancement are silently skipped on every decode step.
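To make the size of the gap concrete, the throughput ratio per concurrency level can be computed directly from the table. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Throughput (tokens/s) copied from the benchmark table in the issue.
tps_bypassed = {1: 95.8, 2: 92.1, 4: 87.8, 8: 78.7, 16: 65.4}
tps_enforced = {1: 65.8, 2: 51.5, 4: 23.1, 8: 17.1, 16: 14.6}

# A ratio above 1 means the grammar-bypassed configuration was faster.
speedup = {n: round(tps_bypassed[n] / tps_enforced[n], 2) for n in tps_bypassed}

for n, ratio in speedup.items():
    print(f"max_parallel={n:>2}: {ratio:.2f}x")
```

On the reported data the ratio grows from roughly 1.5x at a single parallel request to above 4x at 8 and 16, which is consistent with per-step grammar work being skipped in the faster configuration.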

Root cause

The bug is in vllm/v1/structured_output/__init__.py, in the interaction between should_fill_bitmask() / should_advance() and the reasoning parser.

Step 1: `is_reasoning_end()` returns `False` for prompts without thinking tokens

Gemma4ReasoningParser inherits from BaseThinkingReasoningParser, which scans the prompt backward for <|channel> (start, token 100) or <channel|> (end, token 101):

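# vllm/reasoning/basic_parsers.py (excerpt, before the fix)
def is_reasoning_end(self, input_ids):
    start_token_id = self.start_token_id
    end_token_id = self.end_token_id
    # Scan backward so the most recent marker decides the state.
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False  # reasoning opened and not yet closed
        if input_ids[i] == end_token_id:
            return True   # reasoning closed
    return False  # ← neither found → "reasoning has NOT ended"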

With enable_thinking: false, the prompt contains neither <|channel> nor <channel|>. The method returns False — meaning "reasoning has not ended yet."

Step 2: Structured output engine skips grammar enforcement

# vllm/v1/structured_output/__init__.py
def should_fill_bitmask(self, request):
    if self.reasoner is not None:
        if self.enable_in_reasoning:  # default: False
            return True
        return request.structured_output_request.reasoning_ended  # ← False!
    return True  # ← no parser: always fill

def should_advance(self, request):
    if self.reasoner is None:
        return True  # ← no parser: always advance
    if self.enable_in_reasoning:
        return True
    structured_req = request.structured_output_request
    if structured_req.reasoning_ended:  # ← False!
        return True
    # ... checks is_reasoning_end_streaming() for <channel|> in delta ...
    return False  # ← never advances

Step 3: Model never generates `<channel|>`, so `reasoning_ended` stays `False` forever

Since thinking is disabled, the model never outputs the <channel|> end token. The is_reasoning_end_streaming() check on each decode step never finds it. Grammar enforcement is permanently disabled for the entire generation.
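The three steps can be condensed into a toy decode loop. This is a deliberately simplified sketch, not vLLM code: `ToyRequest` and `count_enforced_steps` are hypothetical names, the streaming check is reduced to a single token comparison, and the token ID 101 is the `<channel|>` value quoted above.

```python
from dataclasses import dataclass

END_TOKEN_ID = 101  # <channel|>, per the issue report


@dataclass
class ToyRequest:
    """Hypothetical stand-in for the per-request structured-output state."""
    reasoning_ended: bool


def count_enforced_steps(initial_reasoning_ended: bool, generated: list[int]) -> int:
    """Count decode steps on which a grammar bitmask would be applied."""
    req = ToyRequest(reasoning_ended=initial_reasoning_ended)
    enforced = 0
    for token in generated:
        if req.reasoning_ended:
            enforced += 1               # grammar applies to this step
        elif token == END_TOKEN_ID:
            req.reasoning_ended = True  # streaming check saw the end marker
    return enforced


# Thinking disabled: the model never emits the end marker.
output_tokens = [11, 12, 13, 14]

print(count_enforced_steps(False, output_tokens))  # initial state before the fix
print(count_enforced_steps(True, output_tokens))   # initial state after the fix
```

With the initial state set to `False`, no step in this toy loop is ever grammar-constrained, because the only transition to `True` depends on a token that is never generated.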

Summary

| Config | Bitmask filled? | FSM advances? | Grammar enforced? |
| --- | --- | --- | --- |
| `--reasoning-parser gemma4` + `enable_thinking: false` | NO | NO | NO (silently bypassed) |
| No `--reasoning-parser` | YES | YES | YES (works correctly) |

Suggested fix

is_reasoning_end() should return True (not False) when no reasoning tokens are found in the input. If thinking was never started, there is no reasoning to wait for — the model is already in "content" mode.

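# Fix in basic_parsers.py
def is_reasoning_end(self, input_ids):
    start_token_id = self.start_token_id
    end_token_id = self.end_token_id
    # Scan backward so the most recent marker decides the state.
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False  # reasoning opened and not yet closed
        if input_ids[i] == end_token_id:
            return True   # reasoning closed
    return True  # ← no thinking tokens found → treat as "not in reasoning"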

Alternatively, the structured output engine could handle the "no reasoning tokens in prompt" case explicitly, treating it as reasoning_ended = True.
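A rough sketch of what such an engine-side check might look like follows. Everything here is hypothetical: `initial_reasoning_ended` is not a vLLM function, and the parser is passed in as a plain callable so the example stays self-contained. It also assumes, as the issue does, that a prompt with no markers means thinking was never opened.

```python
from collections.abc import Callable, Sequence


def initial_reasoning_ended(
    prompt_token_ids: Sequence[int],
    start_token_id: int,
    end_token_id: int,
    parser_is_reasoning_end: Callable[[Sequence[int]], bool],
) -> bool:
    """Hypothetical engine-side guard; not part of vLLM."""
    has_marker = any(
        token in (start_token_id, end_token_id) for token in prompt_token_ids
    )
    if not has_marker:
        # Thinking was never opened in the prompt, so nothing gates the grammar.
        return True
    # Otherwise defer to whatever the reasoning parser reports.
    return parser_is_reasoning_end(prompt_token_ids)


def always_false(_ids: Sequence[int]) -> bool:
    """Worst-case parser stub that never reports the end of reasoning."""
    return False


print(initial_reasoning_ended([5, 6, 7], 100, 101, always_false))
print(initial_reasoning_ended([5, 100, 7], 100, 101, always_false))
```

The trade-off in this variant is that the parser's return value stays untouched for other callers, at the cost of the engine needing to know the marker token IDs.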

Related issues

#37359 — Same root cause for GPT-OSS models with guidance backend (offline path)
#37362 — Same root cause for Nemotron V3 reasoning parser
#34650 — Speculative decoding causes </think> detection failure in structured output
ollama/ollama#15260 — Identical bug for Gemma 4 in Ollama (think=false + format silently ignored)
ollama/ollama#14645 — Same bug for Qwen 3.5 in Ollama

Impact

Silent correctness issue: Users who configure --reasoning-parser gemma4 with enable_thinking: false (a common production setup to get reasoning parsing without thinking overhead) get zero grammar enforcement on structured output. The output appears correct because the model is well-trained, but the safety guarantee of grammar-constrained decoding is completely lost. There are no warnings or errors logged.


extent analysis

TL;DR

The most likely fix for the issue is to update the is_reasoning_end() function in basic_parsers.py to return True when no reasoning tokens are found in the input.

Guidance

Review the is_reasoning_end() function in basic_parsers.py and update it to return True when no reasoning tokens are found in the input.
Verify that the structured output engine is correctly handling the "no reasoning tokens in prompt" case.
Test the updated code with the --reasoning-parser gemma4 and enable_thinking: false configuration to ensure that grammar enforcement is working correctly.
Check the related issues (#37359, #37362, #34650, ollama/ollama#15260, ollama/ollama#14645) to see if the same fix applies to other models and parsers.

Example

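# Fix in basic_parsers.py
def is_reasoning_end(self, input_ids):
    start_token_id = self.start_token_id
    end_token_id = self.end_token_id
    # Scan backward so the most recent marker decides the state.
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] == start_token_id:
            return False  # reasoning opened and not yet closed
        if input_ids[i] == end_token_id:
            return True   # reasoning closed
    return True  # ← no thinking tokens found → treat as "not in reasoning"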

Notes

The fix assumes that the is_reasoning_end() function is the root cause of the issue. If the problem persists after updating this function, further investigation may be needed to identify the correct root cause.

Recommendation

Apply the fix by updating the is_reasoning_end() function in basic_parsers.py to return True when no reasoning tokens are found in the input. This should resolve the silent correctness issue and restore grammar enforcement for the --reasoning-parser gemma4 and enable_thinking: false configuration.

