vllm - ✅(Solved) Fix [Bug]: Gemma4 reasoning parser fails to separate reasoning_content — <|channel> tokens stripped before parsing [2 pull requests, 6 comments, 6 participants]

GitHub stats
vllm-project/vllm#38855 · Fetched 2026-04-08 02:34:30
Comments: 6 · Participants: 6 · Timeline: 30 · Reactions: 6
Timeline (top): subscribed ×17, commented ×6, mentioned ×3, assigned ×1

Root Cause

As reported in the issue: the model correctly generates <|channel>thought\n...reasoning...<channel|> tokens, as confirmed via logprobs. However, vLLM's text decoding strips these special tokens (skip_special_tokens=True) before the reasoning parser sees the text.

The Gemma4ReasoningParser defines start_token and end_token as text properties ("<|channel>" / "<channel|>"). According to the report, unlike Qwen3ReasoningParser, it does not implement start_token_id / end_token_id for token-level matching in the streaming path; PR #38858 later found that these are inherited from BaseThinkingReasoningParser and that the actual failure is in the non-streaming serving path. The base class extract_reasoning_streaming receives text without special tokens, so the channel markers are invisible.

The unit tests in tests/reasoning/test_gemma4_reasoning_parser.py pass because they inject <|channel> as literal text in the test strings. This does not match actual serving behavior, where the tokens are decoded with skip_special_tokens=True.
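
The effect described above can be reproduced with a toy model of decoding. This is a minimal sketch, not vLLM code: the vocabulary, `decode`, and `split_on_text_markers` are hypothetical stand-ins, and only the IDs 100 and 101 come from the issue report.

```python
# Toy vocabulary (hypothetical). IDs 100 and 101 are the channel-marker IDs
# reported in the issue; every other entry is made up for illustration.
VOCAB = {
    100: "<|channel>",
    101: "<channel|>",
    1: "thought\n",
    2: "The user is asking...",
    3: "2 + 2 = 4",
}
SPECIAL_IDS = {100, 101}


def decode(token_ids, skip_special_tokens=True):
    """Join token pieces, optionally dropping special tokens."""
    return "".join(
        VOCAB[t]
        for t in token_ids
        if not (skip_special_tokens and t in SPECIAL_IDS)
    )


def split_on_text_markers(text, start="<|channel>", end="<channel|>"):
    """Text-only split into (reasoning, content); reasoning is None if markers are missing."""
    if start not in text or end not in text:
        return None, text
    _, _, rest = text.partition(start)
    reasoning, _, content = rest.partition(end)
    return reasoning, content


generated = [100, 1, 2, 101, 3]
stripped = decode(generated, skip_special_tokens=True)   # markers removed
intact = decode(generated, skip_special_tokens=False)    # markers kept
```

With `stripped`, the text-only split finds no markers and returns everything as content, which mirrors the reported symptom. With `intact`, the same function separates the two parts.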

Fix / Workaround

Also note: Gemma4ToolParser.__init__ has a minor signature mismatch: it takes (self, tokenizer), but the base class passes (self, tokenizer, tools). The reporter patched this locally by adding a tools=None default.
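
The signature mismatch and the local patch can be sketched with stand-in classes. Everything here is hypothetical (`ToolParserBase`, `NarrowToolParser`, `PatchedToolParser`, `build`); it only illustrates why a constructor without a `tools` parameter fails when the caller passes one.

```python
class ToolParserBase:
    """Hypothetical stand-in for the base class mentioned in the issue."""

    def __init__(self, tokenizer, tools=None):
        self.tokenizer = tokenizer
        self.tools = tools


class NarrowToolParser(ToolParserBase):
    # Mirrors the reported mismatch: the constructor has no `tools` parameter.
    def __init__(self, tokenizer):
        super().__init__(tokenizer)


class PatchedToolParser(ToolParserBase):
    # Mirrors the local patch: accept `tools`, defaulting to None.
    def __init__(self, tokenizer, tools=None):
        super().__init__(tokenizer, tools)


def build(parser_cls, tokenizer, tools):
    """Construct a parser with both arguments, as the issue says the caller does."""
    return parser_cls(tokenizer, tools)


def accepts_tools(parser_cls) -> bool:
    """Return True if the class can be constructed with a `tools` argument."""
    try:
        build(parser_cls, tokenizer=object(), tools=[])
    except TypeError:
        return False
    return True
```

The `None` default keeps existing single-argument call sites working while accepting the extra argument.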

PR fix notes

PR #38858: [Bugfix] Fix Gemma4 non-streaming reasoning parsing

Description (problem / solution / changelog)

Purpose

Fixes #38855, where Gemma4 non-streaming chat completions fail to populate reasoning_content and instead return the thought trace in content.

The issue report suggested a parser-side token ID fix, but after inspecting the merged Gemma4 implementation from #38826, the Gemma4 parser already inherits start_token_id / end_token_id support from BaseThinkingReasoningParser. The actual failure is in the non-streaming OpenAI chat serving path: it passes output.text to the parser after special tokens have already been stripped, so Gemma4 never sees <|channel> / <channel|> boundaries.

This PR fixes that handoff in vllm/entrypoints/openai/chat_completion/serving.py by reconstructing parser input from output.token_ids with skip_special_tokens=False when the reasoning parser's boundary token IDs are present. This keeps the fix narrow, preserves the existing parser contract, and avoids adding parser-side logic that would need to infer reasoning boundaries from text after the delimiter tokens have already been removed.
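
A rough sketch of that handoff follows. It is not the code from the PR: `build_parser_input` and `toy_decode` are hypothetical names, and reading "boundary token IDs are present" as "the parser defines both IDs and at least one appears in the output" is an assumption.

```python
from typing import Callable, Optional, Sequence

# Toy pieces (hypothetical), reusing the marker IDs 100 and 101 from the issue.
PIECES = {100: "<|channel>", 101: "<channel|>", 1: "thought\n", 2: "2 + 2 = 4"}


def toy_decode(token_ids: Sequence[int], skip_special_tokens: bool = True) -> str:
    """Stand-in for a tokenizer's decode call."""
    special = {100, 101}
    return "".join(
        PIECES[t] for t in token_ids if not (skip_special_tokens and t in special)
    )


def build_parser_input(
    output_text: str,
    token_ids: Sequence[int],
    decode: Callable[..., str],
    start_token_id: Optional[int],
    end_token_id: Optional[int],
) -> str:
    """Pick the text handed to the reasoning parser (hypothetical helper)."""
    # Parsers without boundary IDs keep receiving the already-decoded text.
    if start_token_id is None or end_token_id is None:
        return output_text
    # Assumption: re-decode only when a boundary token was actually generated.
    if start_token_id not in token_ids and end_token_id not in token_ids:
        return output_text
    # Re-decode from IDs so the delimiter tokens survive as text.
    return decode(token_ids, skip_special_tokens=False)
```

Keeping the fallback to `output_text` means parsers that never relied on special tokens see no change in input.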

Test Plan

.venv/bin/python -m pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py -k gemma4_non_streaming_reasoning_uses_token_ids -v
.venv/bin/pre-commit run --files vllm/entrypoints/openai/chat_completion/serving.py tests/entrypoints/openai/chat_completion/test_serving_chat.py

Test Result

Passed:

  • Targeted regression test for the new non-streaming Gemma4 behavior in tests/entrypoints/openai/chat_completion/test_serving_chat.py
  • pre-commit on all changed files

The focused regression covers the before/after behavior for this bug:

  • Before: reasoning_content was null and content contained the thought trace
  • After: reasoning and final content are separated correctly

Changed files

  • tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +45/-1)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +22/-1)


Bug Description

The Gemma4ReasoningParser (added in PR #38826) fails to populate reasoning_content in the OpenAI chat completions response. All thinking content ends up in content instead.


Expected Behavior

{
  "choices": [{
    "message": {
      "reasoning_content": "The user is asking...",
      "content": "2 + 2 = 4"
    }
  }]
}

Actual Behavior

{
  "choices": [{
    "message": {
      "reasoning_content": null,
      "content": "thought\nThe user is asking...\n2 + 2 = 4"
    }
  }]
}
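
To tell the two response shapes apart programmatically, a small check along these lines may help. The helper name `reasoning_is_separated` is hypothetical; the two dictionaries are the expected and actual payloads shown above.

```python
def reasoning_is_separated(response: dict) -> bool:
    """Return True if the first choice carries a non-empty reasoning_content."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content")
    return isinstance(reasoning, str) and reasoning != ""


expected = {
    "choices": [{
        "message": {
            "reasoning_content": "The user is asking...",
            "content": "2 + 2 = 4",
        }
    }]
}

actual = {
    "choices": [{
        "message": {
            "reasoning_content": None,
            "content": "thought\nThe user is asking...\n2 + 2 = 4",
        }
    }]
}
```

Running the check against a parsed response from the reproduction steps gives a quick pass or fail signal for this bug.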

Reproduction

# vLLM from main (0.18.2rc1.dev69+g08ed2b968)
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

curl http://localhost:8000/v1/chat/completions -d '{
  "model": "google/gemma-4-26B-A4B-it",
  "messages": [{"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 200
}'
# reasoning_content is null, thinking appears in content

Tested on both V1 and V0 (VLLM_USE_V1=0) engines — same behavior.

Suggested Fix

Add start_token_id and end_token_id properties to Gemma4ReasoningParser (token IDs 100 and 101 respectively), matching the pattern used by Qwen3ReasoningParser. The streaming extraction should match on token IDs, not decoded text.
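
The idea of matching on token IDs rather than decoded text can be sketched as below. This only illustrates the suggestion and is not the parser's actual logic: `split_token_ids` is a hypothetical helper, and the defaults 100 and 101 are the IDs quoted in the issue. PR #38858 notes that the Gemma4 parser already inherits ID support from BaseThinkingReasoningParser.

```python
from typing import List, Sequence, Tuple


def split_token_ids(
    token_ids: Sequence[int],
    start_id: int = 100,  # "<|channel>" per the issue report
    end_id: int = 101,    # "<channel|>" per the issue report
) -> Tuple[List[int], List[int]]:
    """Split generated IDs into (reasoning_ids, content_ids) by boundary IDs."""
    ids = list(token_ids)
    if start_id not in ids:
        # No reasoning block was opened: everything is content.
        return [], ids
    start = ids.index(start_id)
    try:
        end = ids.index(end_id, start + 1)
    except ValueError:
        # Block opened but not yet closed (for example, mid-stream).
        return ids[start + 1:], ids[:start]
    return ids[start + 1:end], ids[:start] + ids[end + 1:]
```

Because the split works on IDs, it does not depend on whether special tokens are later stripped from the decoded text.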

Environment

  • vLLM: 0.18.2rc1.dev69+g08ed2b968 (built from main)
  • Model: google/gemma-4-26B-A4B-it
  • GPU: RTX PRO 6000 Blackwell
  • transformers: 5.5.0.dev0 (from git main)


extent analysis

TL;DR

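The issue proposed adding start_token_id and end_token_id properties to Gemma4ReasoningParser to enable token-level matching in the streaming path. The linked fix, PR #38858, instead changes the non-streaming serving path to rebuild the parser input from output.token_ids with skip_special_tokens=False, because the parser already inherits those properties from BaseThinkingReasoningParser.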

Guidance

  • The issue suggested implementing start_token_id and end_token_id in Gemma4ReasoningParser with token IDs 100 and 101 respectively, following the pattern used by Qwen3ReasoningParser. PR #38858 reports that the parser already inherits both from BaseThinkingReasoningParser.
  • Verify the fix by checking whether reasoning_content is correctly populated in the OpenAI chat completions response.
  • Run the provided reproduction steps against the updated build to confirm the issue is resolved.
  • Review the unit tests in tests/reasoning/test_gemma4_reasoning_parser.py to confirm they reflect actual serving behavior, where special tokens are stripped during decoding.

Example

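# Sketch of the issue's suggestion only; PR #38858 says these are already inherited.
class Gemma4ReasoningParser(ReasoningParser):
    # ...
    start_token_id = 100  # "<|channel>", ID as reported in the issue
    end_token_id = 101    # "<channel|>", ID as reported in the issue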
