vllm - ✅(Solved) Fix [Bug]: Gemma4 reasoning parser fails to separate reasoning_content — <|channel> tokens stripped before parsing [2 pull requests, 6 comments, 6 participants]

mabry1985 · 2026-04-02T22:44:49Z

GitHub stats

vllm-project/vllm#38855•Fetched 2026-04-08 02:34:30

Timeline (top)

subscribed ×17, commented ×6, mentioned ×3, assigned ×1

Root Cause

The model correctly generates <|channel>thought\n...reasoning...<channel|> tokens, confirmed via logprobs. However, vLLM's text decoding strips these special tokens (skip_special_tokens=True) before the reasoning parser sees the text.

The Gemma4ReasoningParser defines start_token and end_token as text properties ("<|channel>" / "<channel|>"), but unlike Qwen3ReasoningParser, it does not implement start_token_id / end_token_id for token-level matching in the streaming path. The base class extract_reasoning_streaming receives text without special tokens, so the channel markers are invisible.

The unit tests in tests/reasoning/test_gemma4_reasoning_parser.py pass because they inject <|channel> as literal text in the test strings — this doesn't match actual serving behavior where the tokens are decoded with skip_special_tokens=True.
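The gap between the unit tests and the serving path can be reproduced with a toy model. This is only a sketch: the vocabulary, `decode`, and `split_reasoning` are hypothetical stand-ins, not vLLM or tokenizer APIs. Only the marker strings and the IDs 100 and 101 come from the report.

```python
# Hypothetical toy vocabulary; only the marker strings and the IDs 100/101
# are taken from the issue report.
VOCAB = {
    100: "<|channel>",
    101: "<channel|>",
    1: "thought\n",
    2: "The user is asking...",
    3: "2 + 2 = 4",
}
SPECIAL_IDS = {100, 101}


def decode(token_ids, skip_special_tokens=True):
    """Join token strings, optionally dropping special tokens."""
    return "".join(
        VOCAB[t]
        for t in token_ids
        if not (skip_special_tokens and t in SPECIAL_IDS)
    )


def split_reasoning(text, start="<|channel>", end="<channel|>"):
    """Text-based split that returns (reasoning, content)."""
    if start not in text or end not in text:
        # Markers are invisible, so everything is treated as content.
        return None, text
    head, _, rest = text.partition(start)
    reasoning, _, content = rest.partition(end)
    return reasoning, head + content


generated = [100, 1, 2, 101, 3]

# What a unit test with literal markers exercises:
print(split_reasoning(decode(generated, skip_special_tokens=False)))
# What the parser receives once special tokens are stripped:
print(split_reasoning(decode(generated, skip_special_tokens=True)))
```

With the markers kept, the split succeeds. With them stripped, the reasoning slot is `None` and the whole trace lands in the content slot, which matches the reported behavior.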

Fix Action

Fix / Workaround

Also note: Gemma4ToolParser.__init__ has a minor signature mismatch — takes (self, tokenizer) but base class passes (self, tokenizer, tools). Patched locally with tools=None default.
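The signature mismatch and the local patch can be sketched as follows. The class names and the `build_parser` caller are hypothetical; they only model a call site that passes both `tokenizer` and `tools`, and are not vLLM's actual classes.

```python
class BaseToolParser:
    """Hypothetical base class, not vLLM's actual tool parser."""

    def __init__(self, tokenizer, tools=None):
        self.tokenizer = tokenizer
        self.tools = tools or []


class UnpatchedParser(BaseToolParser):
    # Accepts only the tokenizer, as described in the report.
    def __init__(self, tokenizer):
        super().__init__(tokenizer)


class PatchedParser(BaseToolParser):
    # Local patch: accept `tools` with a None default and forward it.
    def __init__(self, tokenizer, tools=None):
        super().__init__(tokenizer, tools)


def build_parser(parser_cls, tokenizer, tools):
    """Hypothetical call site that always passes both arguments."""
    return parser_cls(tokenizer, tools)


try:
    build_parser(UnpatchedParser, "tok", [{"name": "search"}])
except TypeError as exc:
    print("unpatched:", exc)

parser = build_parser(PatchedParser, "tok", [{"name": "search"}])
print("patched tools:", parser.tools)
```

Keeping `tools=None` as a default means existing single-argument callers continue to work.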

PR fix notes

PR #38858: [Bugfix] Fix Gemma4 non-streaming reasoning parsing

Repository: vllm-project/vllm
Author: jacobzhang22
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/38858

Description (problem / solution / changelog)

Purpose

Fixes #38855, where Gemma4 non-streaming chat completions fail to populate reasoning_content and instead return the thought trace in content.

The issue report suggested a parser-side token ID fix, but after inspecting the merged Gemma4 implementation from #38826, the Gemma4 parser already inherits start_token_id / end_token_id support from BaseThinkingReasoningParser. The actual failure is in the non-streaming OpenAI chat serving path: it passes output.text to the parser after special tokens have already been stripped, so Gemma4 never sees <|channel> / <channel|> boundaries.

This PR fixes that handoff in vllm/entrypoints/openai/chat_completion/serving.py by reconstructing parser input from output.token_ids with skip_special_tokens=False when the reasoning parser's boundary token IDs are present. This keeps the fix narrow, preserves the existing parser contract, and avoids adding parser-side logic that would need to infer reasoning boundaries from text after the delimiter tokens have already been removed.
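The handoff described above might look roughly like the sketch below. It is not the PR's code: `Output`, `decode`, and `parser_input` are hypothetical names, the vocabulary is a toy, and "boundary token IDs are present" is read here as "the generated IDs contain a start or end token", which is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical toy vocabulary; the marker strings and IDs 100/101 follow the issue.
VOCAB = {100: "<|channel>", 101: "<channel|>", 1: "thought\n", 2: "reasoning", 3: "answer"}
SPECIAL_IDS = {100, 101}


def decode(token_ids, skip_special_tokens=True):
    """Toy decoder standing in for a real tokenizer's decode."""
    return "".join(
        VOCAB[t]
        for t in token_ids
        if not (skip_special_tokens and t in SPECIAL_IDS)
    )


@dataclass
class Output:
    """Hypothetical stand-in for a completion output."""
    text: str                 # decoded with special tokens stripped
    token_ids: Sequence[int]  # raw generated token IDs


def parser_input(output, start_token_id=100, end_token_id=101):
    """Choose the text to hand to the reasoning parser."""
    ids = output.token_ids
    if start_token_id in ids or end_token_id in ids:
        # Boundary tokens were generated: re-decode so the markers survive.
        return decode(ids, skip_special_tokens=False)
    # No boundary tokens: the already-stripped text can be used unchanged.
    return output.text


ids = [100, 1, 2, 101, 3]
out = Output(text=decode(ids), token_ids=ids)
print(repr(out.text))           # markers stripped
print(repr(parser_input(out)))  # markers restored for the parser
```

Only the parser's input changes in this sketch. The text returned to the client can still come from the stripped decode, so the parser contract stays as it was.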

Test Plan

.venv/bin/python -m pytest tests/entrypoints/openai/chat_completion/test_serving_chat.py -k gemma4_non_streaming_reasoning_uses_token_ids -v
.venv/bin/pre-commit run --files vllm/entrypoints/openai/chat_completion/serving.py tests/entrypoints/openai/chat_completion/test_serving_chat.py

Test Result

Passed:

Targeted regression test for the new non-streaming Gemma4 behavior in tests/entrypoints/openai/chat_completion/test_serving_chat.py
pre-commit on all changed files

The focused regression covers the before/after behavior for this bug:

Before: reasoning_content was null and content contained the thought trace
After: reasoning and final content are separated correctly

Changed files

tests/entrypoints/openai/chat_completion/test_serving_chat.py (modified, +45/-1)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +22/-1)

Bug Description

The Gemma4ReasoningParser (added in PR #38826) fails to populate reasoning_content in the OpenAI chat completions response. All thinking content ends up in content instead.

Expected Behavior

{
  "choices": [{
    "message": {
      "reasoning_content": "The user is asking...",
      "content": "2 + 2 = 4"
    }
  }]
}

Actual Behavior

{
  "choices": [{
    "message": {
      "reasoning_content": null,
      "content": "thought\nThe user is asking...\n2 + 2 = 4"
    }
  }]
}

Reproduction

# vLLM from main (0.18.2rc1.dev69+g08ed2b968)
vllm serve google/gemma-4-26B-A4B-it \
  --quantization fp8 \
  --reasoning-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'

curl http://localhost:8000/v1/chat/completions -d '{
  "model": "google/gemma-4-26B-A4B-it",
  "messages": [{"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 200
}'
# reasoning_content is null, thinking appears in content

Tested on both V1 and V0 (VLLM_USE_V1=0) engines — same behavior.

Suggested Fix

Add start_token_id and end_token_id properties to Gemma4ReasoningParser (token IDs 100 and 101 respectively), matching the pattern used by Qwen3ReasoningParser. The streaming extraction should match on token IDs, not decoded text.
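Token-level matching, as suggested here, could be sketched like this. `split_by_token_ids` is a hypothetical helper, not part of vLLM. The IDs 100 and 101 are the ones quoted in the suggestion, and treating output without both boundaries as plain content is an assumption of this sketch.

```python
def split_by_token_ids(token_ids, start_token_id=100, end_token_id=101):
    """Return (reasoning_ids, content_ids) split on boundary token IDs."""
    ids = list(token_ids)
    try:
        start = ids.index(start_token_id)
        # The end marker must come after the start marker.
        end = ids.index(end_token_id, start + 1)
    except ValueError:
        # Assumption: without both boundaries, everything is content.
        return [], ids
    reasoning = ids[start + 1:end]
    content = ids[:start] + ids[end + 1:]
    return reasoning, content


print(split_by_token_ids([100, 1, 2, 101, 3]))
print(split_by_token_ids([1, 2, 3]))
```

Because the split works on IDs, it does not depend on whether the decoded text still contains the marker strings.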

Environment

vLLM: 0.18.2rc1.dev69+g08ed2b968 (built from main)
Model: google/gemma-4-26B-A4B-it
GPU: RTX PRO 6000 Blackwell
transformers: 5.5.0.dev0 (from git main)

Also note: Gemma4ToolParser.__init__ has a minor signature mismatch — takes (self, tokenizer) but base class passes (self, tokenizer, tools). Patched locally with tools=None default.

extent analysis

TL;DR

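Add start_token_id and end_token_id properties to Gemma4ReasoningParser to enable token-level matching in the streaming path. PR #38858 disputes this diagnosis: it reports that the parser already inherits both properties from BaseThinkingReasoningParser, and that the non-streaming serving path is what hands the parser text with the markers already stripped.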

Guidance

Implement start_token_id and end_token_id in Gemma4ReasoningParser with token IDs 100 and 101 respectively, following the pattern used by Qwen3ReasoningParser.
Verify the fix by checking if reasoning_content is correctly populated in the OpenAI chat completions response.
Test the updated parser with the provided reproduction steps to ensure the issue is resolved.
Review the unit tests in tests/reasoning/test_gemma4_reasoning_parser.py to ensure they accurately reflect the serving behavior.
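The verification step in the list above can be scripted against a parsed response body. This is a minimal sketch: `reasoning_is_separated` is a hypothetical helper, and the `thought\n` prefix check is a heuristic taken from the Actual Behavior sample rather than a documented contract.

```python
import json


def reasoning_is_separated(response: dict) -> bool:
    """Heuristic check that reasoning and content were split."""
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content")
    content = message.get("content") or ""
    # Heuristic: a leaked trace starts with the "thought" channel name.
    return bool(reasoning) and not content.startswith("thought\n")


expected = json.loads(
    '{"choices": [{"message": {"reasoning_content": "The user is asking...",'
    ' "content": "2 + 2 = 4"}}]}'
)
actual = json.loads(
    '{"choices": [{"message": {"reasoning_content": null,'
    ' "content": "thought\\nThe user is asking...\\n2 + 2 = 4"}}]}'
)

print(reasoning_is_separated(expected))
print(reasoning_is_separated(actual))
```

Running the same check on the JSON returned by the reproduction `curl` command would show whether a given build is affected.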

Example

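class Gemma4ReasoningParser(ReasoningParser):
    # ...
    # Token IDs as given in the issue's Suggested Fix; not independently verified.
    start_token_id = 100
    end_token_id = 101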
