vllm - [Bug]: Responses API does not surface reasoning output with `--reasoning-parser gemma4` (works with deepseek_r1)

vllm · 2026-05-22 07:01:54

Fix Action

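Fixed by PR: [Bugfix] Map reasoning_effort to enable_thinking for Gemma4 Responses API (https://github.com/vllm-project/vllm/pull/43401)

Judging by its title, the PR maps the Responses API reasoning effort setting to the chat template's enable_thinking switch. The Chat Completions test in the reproduction script sets enable_thinking explicitly through chat_template_kwargs, while the Responses API tests only pass reasoning: {"effort": "high"}.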
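The PR title points at a translation step between the Responses API `reasoning` object and the `enable_thinking` chat template argument. A minimal sketch of what such a mapping could look like follows. The function name `effort_to_template_kwargs` is hypothetical, and the rule that a missing effort or `"none"` disables thinking is an assumption made for illustration; neither is taken from the PR or from vLLM's code.

```python
from typing import Any, Optional


# Hypothetical helper, not vLLM code: shows one way a Responses API
# "reasoning" object could be turned into chat template kwargs.
def effort_to_template_kwargs(reasoning: Optional[dict[str, Any]]) -> dict[str, Any]:
    """Derive chat template kwargs from a Responses API reasoning object."""
    effort = (reasoning or {}).get("effort")
    # Assumption: any effort other than None or "none" turns thinking on.
    enable = effort is not None and effort != "none"
    return {"enable_thinking": enable}


if __name__ == "__main__":
    print(effort_to_template_kwargs({"effort": "high"}))
    print(effort_to_template_kwargs(None))
```

Under this assumed rule, the `reasoning: {"effort": "high"}` requests in the reproduction script would enable thinking in the same way the Chat Completions request does explicitly.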

Your current environment

The output of python collect_env.py:

OS                           : Ubuntu 22.04.5 LTS (x86_64)
PyTorch version              : 2.11.0+cu130
CUDA used to build PyTorch   : 13.0
Python version               : 3.12.13
Is CUDA available            : True
CUDA runtime version         : 13.0.88
GPU 0-3                      : NVIDIA RTX PRO 6000 Blackwell Server Edition
vLLM Version                 : 0.21.0
transformers                 : 5.8.1

🐛 Describe the bug

With --reasoning-parser gemma4 enabled on vLLM v0.21.0, the Chat Completions API correctly surfaces model reasoning in a reasoning field alongside tool calls. However, the Responses API (/v1/responses) does not surface reasoning output in any form — reasoning_tokens is always 0 and no ResponseReasoningItem output item appears — even when reasoning: {"effort": "high"} is passed in the request.

This is specific to the gemma4 reasoning parser: on the same image, Qwen3 served with --reasoning-parser deepseek_r1 returns reasoning through the Responses API correctly.

Server start command:

docker run -d \
  --name vllm-gemma4 \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e VLLM_ENABLE_RESPONSES_API_STORE=1 \
  vllm/vllm-openai:v0.21.0 \
  --model google/gemma-4-26B-A4B-it \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 14336 \
  --tool-call-parser functiongemma \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4

Reproduction script:

import requests

BASE = "http://localhost:8000/v1"
MODEL = "google/gemma-4-26B-A4B-it"

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

# Test 1: Chat Completions — reasoning WORKS
print("=== Chat Completions (works) ===")
r1 = requests.post(f"{BASE}/chat/completions", json={
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the weather in NYC?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get weather", "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}}}],
    "tool_choice": "required",
    "max_tokens": 1024,
    "chat_template_kwargs": {"enable_thinking": True},
})
d1 = r1.json()
msg = d1["choices"][0]["message"]
print(f"  reasoning: {(msg.get('reasoning') or '')[:200]}")  # tolerate a null reasoning field
print(f"  tool_calls: {msg.get('tool_calls')}")
print(f"  finish_reason: {d1['choices'][0]['finish_reason']}")

# Test 2: Responses API — reasoning NOT surfaced
print("\n=== Responses API (broken) ===")
r2 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is the weather in NYC?"}],
    "tools": tools,
    "tool_choice": "required",
    "reasoning": {"effort": "high"},
})
d2 = r2.json()
usage = d2.get("usage", {})
output_details = usage.get("output_tokens_details", {})
print(f"  reasoning_tokens: {output_details.get('reasoning_tokens', 0)}")
print(f"  output types: {[item.get('type') for item in d2.get('output', [])]}")
print(f"  top-level reasoning field: {d2.get('reasoning')}")

# Test 3: Responses API text-only — still no reasoning
print("\n=== Responses API text-only (also broken) ===")
r3 = requests.post(f"{BASE}/responses", json={
    "model": MODEL,
    "input": [{"role": "user", "content": "What is 2+2? Think step by step."}],
    "reasoning": {"effort": "high"},
    "max_output_tokens": 500,
})
d3 = r3.json()
usage3 = d3.get("usage", {})
print(f"  reasoning_tokens: {usage3.get('output_tokens_details', {}).get('reasoning_tokens', 0)}")
print(f"  output types: {[item.get('type') for item in d3.get('output', [])]}")

Expected behavior:

The Responses API should surface reasoning output when --reasoning-parser gemma4 is active and reasoning: {"effort": "high"} is passed. Expected:

A "reasoning" output item (ResponseReasoningItem) containing the model's thinking
reasoning_tokens > 0 in usage details

The Chat Completions API demonstrates this works at the model/parser level — the Responses API just doesn't wire it through to output items.
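The two expectations above can be checked mechanically on any Responses API JSON body. The sketch below is one way to do that; the helper name `summarize_reasoning` is hypothetical, and the sample body is hand-written to mirror the broken tool-call result reported in this issue, not captured from a server.

```python
from typing import Any


def summarize_reasoning(response: dict[str, Any]) -> dict[str, Any]:
    """Report whether a Responses API JSON body carries reasoning output."""
    output = response.get("output") or []
    types = [item.get("type") for item in output]
    details = (response.get("usage") or {}).get("output_tokens_details") or {}
    return {
        "output_types": types,
        # A ResponseReasoningItem is serialized with type "reasoning".
        "has_reasoning_item": "reasoning" in types,
        "reasoning_tokens": details.get("reasoning_tokens") or 0,
    }


# Hand-written body shaped like the broken tool-call response in this issue.
broken = {
    "output": [{"type": "function_call"}],
    "usage": {"output_tokens_details": {"reasoning_tokens": 0}},
}
print(summarize_reasoning(broken))
```

Passing `r2.json()` or `r3.json()` from the reproduction script to the same helper gives a compact pass/fail view of both criteria.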

Actual behavior:

=== Chat Completions (works) ===
  reasoning: The user is asking about the weather in NYC. I should look at the available tools... The `get_weather` tool seems appropriate for this task.
  tool_calls: [{'id': 'chatcmpl-tool-...', 'type': 'function', 'function': {'name': 'get_weather', 'arguments': '{"city": "NYC"}'}}]
  finish_reason: tool_calls

=== Responses API (broken) ===
  reasoning_tokens: 0
  output types: ['function_call']
  top-level reasoning field: None

=== Responses API text-only (also broken) ===
  reasoning_tokens: 0
  output types: ['message']

Reproduction Evidence

Confirmed on vllm/vllm-openai:latest (v0.21.0) with google/gemma-4-26B-A4B-it, TP=4, --reasoning-parser gemma4, --tool-call-parser functiongemma:

=== Chat Completions with reasoning ===
  reasoning present: True
  reasoning preview: The user wants to know the product of 15 × 37...
  finish_reason: stop

=== Chat Completions with tools + reasoning ===
  reasoning present: True
  reasoning preview: The user is asking about the weather in NYC. I should look for a tool...
  tool_calls: [{"function": {"name": "get_weather", "arguments": "{\"city\": \"NYC\"}"}}]
  finish_reason: tool_calls

=== Responses API with reasoning={'effort': 'high'} ===
  reasoning_tokens: 0
  output types: ['message']
  has ResponseReasoningItem: False

=== Responses API with tools + reasoning ===
  reasoning_tokens: 0
  output types: ['function_call']
  has ResponseReasoningItem: False

Note: This bug is parser-specific. Tested with Qwen3-1.7B + --reasoning-parser deepseek_r1 on the same image — the Responses API correctly returns ResponseReasoningItem with reasoning_tokens: 1023. The issue is specific to the gemma4 reasoning parser integration with the Responses API.

Analysis

The root cause appears to lie in how the gemma4 reasoning parser interacts with the Responses API serving path:

Chat Completions path works: The gemma4 reasoning parser correctly extracts <think>...</think> blocks and surfaces them in the reasoning field of the chat completion response.
Responses API non-harmony path (_make_response_output_items): This path delegates to parser.extract_response_outputs() at serving.py:1044. The gemma4 parser's implementation of this method appears not to construct ResponseReasoningItem objects from the parsed reasoning content; it may be stripping the reasoning and returning only the final content and tool calls.
Contrast with working parsers: The deepseek_r1 parser produces a ResponseReasoningItem in its extract_response_outputs() implementation, which is why the Qwen3 test returns reasoning.
reasoning_tokens counting also fails: The fallback at serving.py:866, which counts reasoning tokens from the accumulated token IDs, does not trigger when the gemma4 parser is in use.
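To make the difference between the two behaviours concrete, here is a minimal sketch of splitting raw model text into reasoning and content and emitting a separate reasoning item. It is not vLLM's parser: the `<think>...</think>` delimiters are taken from the analysis above, the function names are hypothetical, and the item dictionaries are simplified stand-ins for the real Responses API objects.

```python
import re

# Delimiters as described in the analysis; an assumption about the raw text.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model text into (reasoning, content)."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    content = THINK_RE.sub("", text).strip()
    return reasoning, content


def to_output_items(text: str) -> list[dict[str, str]]:
    """Build simplified output items, keeping reasoning as its own item."""
    reasoning, content = split_reasoning(text)
    items = []
    if reasoning:
        items.append({"type": "reasoning", "text": reasoning})
    if content:
        items.append({"type": "message", "text": content})
    return items


raw = "<think>The user wants weather; call get_weather.</think>Calling the tool."
print([item["type"] for item in to_output_items(raw)])
```

A parser that drops the first item and returns only the second would produce the `['message']` output seen in the broken runs, which is the behaviour the analysis suspects.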

Related Issues / PRs

PR #41393 — "Adding reasoning for responses API V1" (open, adds chat_template_kwargs + thinking_token_budget passthrough — partial fix for a different layer)
Issue #42962 — "Documentation falsely claims chat_template_kwargs support for the Responses API"
Issue #33915 — "Support include_reasoning request parameter for non-harmony models"

This issue is distinct: it's about the gemma4 reasoning parser not producing ResponseReasoningItem in its extract_response_outputs() path, while other parsers (like deepseek_r1) work correctly.
