vllm - 💡(How to fix) Fix [Bug]: Batch chat completions drop Gemma 4 reasoning delimiters [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Root Cause

OpenAIServingChatBatch.render_batch_chat_request() manually calls render.preprocess_chat(...), but did not pass reasoning_parser=self.reasoning_parser_cls.

The batch path also converted the batch request into per-conversation ChatCompletionRequest objects separately from preprocessing, which made it easy to lose mutations applied by parser request hooks.

Regular chat serving already passes the reasoning parser into preprocessing:

conversation, engine_inputs = await self.preprocess_chat(
    ...,
    reasoning_parser=self.reasoning_parser,
)

Batch serving needs the equivalent behavior for each conversation in the batch and should carry the adjusted per-conversation request objects forward into sampling and final parsing.

Fix Action

Fixed

Code Example

vllm serve google/gemma-4-... \
  --reasoning-parser gemma4

---

conversation, engine_inputs = await self.preprocess_chat(
    ...,
    reasoning_parser=self.reasoning_parser,
)
RAW_BUFFERClick to expand / collapse

Your current environment

Observed in the current vllm-project/vllm development tree while inspecting the OpenAI chat completions batch endpoint.

Relevant configuration:

vllm serve google/gemma-4-... \
  --reasoning-parser gemma4

🐛 Describe the bug

/v1/chat/completions/batch does not preserve Gemma 4 reasoning delimiter tokens before reasoning parsing.

Gemma 4 reasoning depends on special tokens such as <|channel> and <channel|> to delimit reasoning content. Gemma4ReasoningParser.adjust_request() sets skip_special_tokens=False so those delimiters remain in the generated text and can be parsed correctly.

The regular /v1/chat/completions path calls the renderer with the reasoning parser, so adjust_request() runs before sampling params are built. The batch chat completions path did not pass the reasoning parser through preprocessing, so Gemma4ReasoningParser.adjust_request() was skipped. As a result, skip_special_tokens remained at the default True, the special delimiter tokens were dropped during detokenization, and the Gemma 4 reasoning parser could not reliably separate reasoning content from final answer content.

This is related to the issue fixed for the regular chat path in #39081, but the batch endpoint has a separate preprocessing path and was not covered by that change.

Expected behavior

Batch chat completions should behave consistently with regular chat completions when --reasoning-parser gemma4 is enabled:

  • Gemma4ReasoningParser.adjust_request() should run for each per-conversation request.
  • skip_special_tokens should be set to False before sampling params are built.
  • Generated Gemma 4 reasoning delimiter tokens should be preserved long enough for reasoning parsing.
  • The response should place reasoning in the reasoning field and final answer text in content.

Root cause

OpenAIServingChatBatch.render_batch_chat_request() manually calls render.preprocess_chat(...), but did not pass reasoning_parser=self.reasoning_parser_cls.

The batch path also converted the batch request into per-conversation ChatCompletionRequest objects separately from preprocessing, which made it easy to lose mutations applied by parser request hooks.

Regular chat serving already passes the reasoning parser into preprocessing:

conversation, engine_inputs = await self.preprocess_chat(
    ...,
    reasoning_parser=self.reasoning_parser,
)

Batch serving needs the equivalent behavior for each conversation in the batch and should carry the adjusted per-conversation request objects forward into sampling and final parsing.

Impact

This affects Gemma 4 batch chat completions with reasoning enabled. It can cause reasoning parsing to fail because the structural tokens needed by Gemma4ReasoningParser are removed before the parser sees the model output.

Suggested fix

Make the batch path mirror the regular chat path:

  1. Create each per-conversation ChatCompletionRequest.
  2. Pass reasoning_parser=self.reasoning_parser_cls into render.preprocess_chat(...).
  3. Keep and reuse the adjusted per-conversation request objects for:
    • to_sampling_params(...)
    • reasoning state passed to engine_client.generate(...)
    • final reasoning_parser.extract_reasoning(...)
  4. Add a regression test proving batch preprocessing invokes the reasoning parser adjustment and preserves the adjusted request objects.

Before submitting a new issue...

  • I searched existing issues and PRs for Gemma 4 batch chat reasoning/parser reports.
  • I found related regular-chat Gemma 4 parser work in #39081, but not a duplicate for /v1/chat/completions/batch.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Batch chat completions should behave consistently with regular chat completions when --reasoning-parser gemma4 is enabled:

  • Gemma4ReasoningParser.adjust_request() should run for each per-conversation request.
  • skip_special_tokens should be set to False before sampling params are built.
  • Generated Gemma 4 reasoning delimiter tokens should be preserved long enough for reasoning parsing.
  • The response should place reasoning in the reasoning field and final answer text in content.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING