vllm - ✅(Solved) Fix [Bug]: --reasoning-parser gemma4: streaming leaks reasoning into content after tool results in multi-turn conversations [2 pull requests, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#39885, fetched 2026-04-17 08:23:58
Comments: 4 · Participants: 4 · Timeline events: 25 · Reactions: 0
Timeline (top): subscribed ×9, mentioned ×8, commented ×4, cross-referenced ×2

When using --reasoning-parser gemma4 with streaming enabled, the reasoning content leaks into the content field instead of the reasoning field. This happens specifically in multi-turn conversations after a tool call result, when the model generates a new tool call in the following turn.

The issue does NOT occur with stream=False (sync mode), where reasoning is correctly separated.

PR fix notes

PR #39898: fix: prevent reasoning content leakage after tool results

Description (problem / solution / changelog)

Fix reasoning parser leaking reasoning content into response after tool results in multi-turn conversations. The parser wasn't properly clearing reasoning state after tool calls. Fixes #39885.

Changed files

  • vllm/reasoning/basic_parsers.py (modified, +7/-1)

PR #40006: [Bugfix][Frontend] Fix streaming reasoning leak with tool_choice="auto" (#39885)

Description (problem / solution / changelog)

Fixes #39885.

What's wrong

With tool_choice="auto" and a reasoning parser, reasoning content leaks into delta.content instead of delta.reasoning when streaming a multi-turn conversation that has a prior tool result.

The streaming handler in chat_completion_stream_generator has three branches for tool parsing. The named and required branches both call reasoning_parser.extract_reasoning_streaming() before handing off to the tool parser. The elif parser is not None branch (auto) skipped that entirely and called parser.parse_delta() directly.

In multi-turn conversations the prompt already contains a prior <|tool_call> token. DelegatingParser.parse_delta() calls is_reasoning_end(prompt_token_ids) on the first chunk; Gemma4ReasoningParser sees that token and returns True before reaching a <|turn> boundary, so reasoning_ended gets permanently set and every reasoning token ends up in delta.content.

stream=False is unaffected because the non-streaming path calls extract_reasoning() on the full output before extract_tool_calls().
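
The latch described above can be illustrated with a toy model. This is a sketch only: the token IDs, the `ToyStreamState` class, and its membership check are hypothetical stand-ins and do not reproduce vLLM's `Gemma4ReasoningParser` or `DelegatingParser`. It shows only how a one-time check against the prompt can misroute every later delta.

```python
# Toy illustration only: hypothetical token IDs and classes, not vLLM code.
TOOL_CALL = 101  # stands in for the <|tool_call> token


class ToyStreamState:
    def __init__(self, prompt_token_ids):
        # Evaluated once, on the first chunk, against the prompt.
        # A prior tool call in the history is enough to set the flag.
        self.reasoning_ended = TOOL_CALL in prompt_token_ids

    def route(self, delta_text):
        # Once the flag is set, nothing ever resets it for this stream.
        field = "content" if self.reasoning_ended else "reasoning"
        return field, delta_text


first_turn = ToyStreamState(prompt_token_ids=[1, 2, 3])
after_tool = ToyStreamState(prompt_token_ids=[1, 2, TOOL_CALL, 3])

print(first_turn.route("thought"))  # ('reasoning', 'thought')
print(after_tool.route("thought"))  # ('content', 'thought')
```

In the toy model the first turn routes correctly, while the turn that follows a tool call sends reasoning text to `content`, which matches the symptom reported in the issue.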

Fix

Make the auto branch follow the same reasoning-first pattern as the named and required branches. Once reasoning ends, hand off to parser.extract_tool_calls_streaming() as before.
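
The reasoning-first pattern can be sketched roughly as follows. This is a simplified, self-contained model: `END_OF_REASONING`, `route_stream`, and the string-based deltas are hypothetical and only stand in for the real `extract_reasoning_streaming()` and `extract_tool_calls_streaming()` calls, whose signatures are not shown on this page.

```python
# Hypothetical marker; the real parser works on model-specific tokens.
END_OF_REASONING = "<end-of-reasoning>"


def route_stream(deltas):
    """Yield (field, text) pairs: reasoning first, then tool/content parsing."""
    reasoning_ended = False
    for text in deltas:
        if reasoning_ended:
            # After reasoning ends, every delta goes to the tool/content path.
            yield ("tool_or_content", text)
            continue
        if END_OF_REASONING in text:
            before, _, after = text.partition(END_OF_REASONING)
            if before:
                yield ("reasoning", before)
            reasoning_ended = True
            if after:
                yield ("tool_or_content", after)
        else:
            yield ("reasoning", text)


routed = list(route_stream([
    "thought\n",
    "call ToolB",
    END_OF_REASONING + "{",
    '"query": 1}',
]))
print(routed)
```

The sketch deliberately ignores a marker that is split across two deltas; it is meant to show the ordering of the two stages, not a complete streaming parser.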

Relation to #39898

PR #39898 fixes a separate token-split edge case in basic_parsers.py. This PR fixes the root cause for the auto tool-choice streaming path in serving.py. They are complementary.

Tests:

  • pytest tests/tool_parsers/test_gemma4_tool_parser.py
  • 47 passed, 1 skipped

Changed files

  • tests/tool_parsers/test_gemma4_tool_parser.py (modified, +246/-0)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +106/-11)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0
Clang version                : Could not collect
CMake version                : version 3.29.2
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.17.0-14-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 5090
GPU 1: NVIDIA GeForce RTX 5090
GPU 2: NVIDIA GeForce RTX 5090
GPU 3: NVIDIA GeForce RTX 5090

Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  64
On-line CPU(s) list:                     0-63
Vendor ID:                               AuthenticAMD
Model name:                              AMD Eng Sample
CPU family:                              25
Model:                                   17
Thread(s) per core:                      2
Core(s) per socket:                      32
Socket(s):                               1
Stepping:                                0
Frequency boost:                         enabled
CPU max MHz:                             3612.2549
CPU min MHz:                             1500.0000
BogoMIPS:                                5100.29
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap amd_lbr_pmc_freeze
Virtualization:                          AMD-V
L1d cache:                               1 MiB (32 instances)
L1i cache:                               1 MiB (32 instances)
L2 cache:                                32 MiB (32 instances)
L3 cache:                                128 MiB (4 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-63
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Vulnerable: No microcode
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.7
[pip3] numpy==2.4.4
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.17.1.4
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.28.9
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu129
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.11.0+cu129
[pip3] torchvision==0.26.0+cu129
[pip3] transformers==5.5.4
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev297+g799973af4 (git sha: 799973af4)
vLLM Build Flags:
  CUDA Archs: 12.0; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	NODE	NODE	0-63	0		N/A
GPU1	PHB	 X 	NODE	NODE	0-63	0		N/A
GPU2	NODE	NODE	 X 	PHB	0-63	0		N/A
GPU3	NODE	NODE	PHB	 X 	0-63	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=void
TORCH_CUDA_ARCH_LIST=12.0
VLLM_DISABLED_KERNELS=MacheteLinearKernel
TORCHINDUCTOR_DISABLE=1
LD_LIBRARY_PATH=/usr/local/cuda/lib64:
NVIDIA_CTK_LIBCUDA_DIR=/usr/lib/x86_64-linux-gnu
CUDA_HOME=/usr/local/cuda-12.9
CUDA_HOME=/usr/local/cuda-12.9
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

Description

When using --reasoning-parser gemma4 with streaming enabled, the reasoning content leaks into the content field instead of the reasoning field. This happens specifically in multi-turn conversations after a tool call result, when the model generates a new tool call in the following turn.

The issue does NOT occur with stream=False (sync mode), where reasoning is correctly separated.

How to reproduce

from openai import OpenAI
import httpx

client = OpenAI(
    base_url='http://localhost:8003/v1',
    api_key='your-key',
    http_client=httpx.Client(verify=False)
)

tools = [
    {"type":"function","function":{"name":"ToolA","description":"First tool","parameters":{"type":"object","properties":{"x":{"type":"string"}},"required":["x"]}}},
    {"type":"function","function":{"name":"ToolB","description":"Second tool","parameters":{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}}},
]

# Simulate a conversation where turn 1 called ToolA, got a result,
# and now the model should call ToolB in turn 2
messages = [
    {"role": "system", "content": "You are a helpful assistant. Think step by step."},
    {"role": "user", "content": "Search for information about data protection laws"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_001", "type": "function", "function": {"name": "ToolA", "arguments": '{"x": "load ToolB"}'}}
    ]},
    {"role": "tool", "tool_call_id": "call_001", "content": "Success: ToolB is now available"},
]

# STREAMING (BROKEN): reasoning leaks into content
stream = client.chat.completions.create(
    model="your-served-model-name",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0.4,
    max_tokens=4000,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "thinking_budget": 4096}
)

content, reasoning = "", ""
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta is None:
        continue
    if delta.content:
        content += delta.content
    # The reasoning text may surface as an attribute or only via model_extra
    r = getattr(delta, "reasoning", None) or (getattr(delta, "model_extra", None) or {}).get("reasoning")
    if r:
        reasoning += r

print(f"STREAM - content: {repr(content[:100])}")
print(f"STREAM - reasoning: {repr(reasoning[:100])}")
# Output: content starts with 'thought\n...' (LEAKED), reasoning is empty

# SYNC (WORKS CORRECTLY)
response = client.chat.completions.create(
    model="your-served-model-name",
    messages=messages,
    tools=tools,
    tool_choice="auto",
    temperature=0.4,
    max_tokens=4000,
    stream=False,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "thinking_budget": 4096}
)
msg = response.choices[0].message
print(f"SYNC - content: {repr((msg.content or '')[:100])}")
print(f"SYNC - reasoning: {repr((getattr(msg, 'reasoning', None) or '')[:100])}")
# Output: content is empty (correct), reasoning has the thinking (correct)

vLLM launch command

python -m vllm.entrypoints.cli.main serve RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic \
  --served_model_name my_model \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --enable_prefix_caching \
  --enable-chunked-prefill \
  --tensor_parallel_size 2 \
  --max_num_seqs 64

Expected behavior

In streaming mode, reasoning tokens should appear in delta.reasoning (same as sync mode where they correctly appear in message.reasoning).

Actual behavior

In streaming mode, reasoning tokens appear in delta.content as plain text starting with the literal word "thought" (no <think> tags). The delta.reasoning field is always empty/None.

This happens 100% of the time when:

  1. The conversation includes a prior assistant message with tool_calls followed by a tool result message
  2. The model generates another tool call in the current turn
  3. stream=True

With stream=False, the exact same messages produce correct output (reasoning in message.reasoning, content empty).
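
The symptom above can be turned into a quick check on the accumulated stream output. This is a heuristic sketch based only on the behavior reported here (empty reasoning, content starting with the literal word "thought"); the helper name `looks_leaked` is hypothetical, and a legitimate answer that happens to begin with "thought" would be flagged as a false positive.

```python
def looks_leaked(content: str, reasoning: str) -> bool:
    """Return True when the output matches the reported leak symptom:
    no reasoning text at all, and content that starts with 'thought'."""
    return not reasoning and content.lstrip().startswith("thought")


# Streaming output as described in the report (leaked).
print(looks_leaked("thought\nI should call ToolB", ""))
# Sync output as described in the report (correct).
print(looks_leaked("", "I should call ToolB"))
```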

Additional observations

  • The leak happens regardless of enable_thinking being True or False
  • The leak happens regardless of thinking_budget value
  • The leak happens regardless of temperature
  • The leak does NOT happen on the first turn (only after tool results are in the conversation)
  • Tested on vLLM v0.19.1rc1 and latest dev build -- same behavior on both

Environment

  • vLLM version: 0.19.1rc1.dev297+g799973af4 (also tested 0.19.1rc1.dev139+g66c079ae8)
  • Model: RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic (Gemma 4 26B MoE)
  • GPU: NVIDIA RTX 5090 (Blackwell, sm_120)
  • CUDA: 12.9
  • OS: Ubuntu (Docker)


extent analysis

TL;DR

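According to PR #40006, the leak comes from the tool_choice="auto" streaming branch in chat_completion_stream_generator, which skipped reasoning_parser.extract_reasoning_streaming() and went straight to tool parsing. The PR makes that branch extract reasoning first, as the named and required branches already do. PR #39898 addresses a separate token-split edge case in basic_parsers.py.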

Guidance

  1. Confirm the trigger: the server runs with --reasoning-parser gemma4, the request uses stream=True with tool_choice="auto", and the message history already contains an assistant tool_calls message followed by a tool result.
  2. Inspect the deltas: log each streamed delta and check whether the reasoning text arrives in delta.content (starting with the literal word "thought") while delta.reasoning stays empty.
  3. Apply the upstream fix rather than patching the parser ad hoc: PR #40006 changes the auto tool-choice branch in vllm/entrypoints/openai/chat_completion/serving.py to extract reasoning before tool parsing, and PR #39898 covers a separate token-split edge case in vllm/reasoning/basic_parsers.py.
  4. Re-run the minimal reproduction from the issue (one prior tool call plus its result) with both stream=True and stream=False, and confirm that the reasoning lands in the reasoning field in both modes.
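
For step 2, a small helper can accumulate the two fields from any iterable of chunk-like objects. This is a sketch under assumptions: `split_stream` and `fake_chunk` are hypothetical names, and the fake chunks built with `SimpleNamespace` only mimic the attribute layout used in the issue's reproduction script (`chunk.choices[0].delta` with `content` and `reasoning`).

```python
from types import SimpleNamespace


def split_stream(chunks):
    """Accumulate (content, reasoning) from streamed chunk-like objects."""
    content, reasoning = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if delta is None:
            continue
        if getattr(delta, "content", None):
            content.append(delta.content)
        # The reasoning text may be an attribute or live in model_extra.
        extra = getattr(delta, "model_extra", None) or {}
        r = getattr(delta, "reasoning", None) or extra.get("reasoning")
        if r:
            reasoning.append(r)
    return "".join(content), "".join(reasoning)


def fake_chunk(content=None, reasoning=None):
    """Build a stand-in chunk for local testing (not an OpenAI SDK type)."""
    delta = SimpleNamespace(content=content, reasoning=reasoning)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [
    fake_chunk(reasoning="plan: "),
    fake_chunk(reasoning="call ToolB"),
    fake_chunk(content="done"),
    SimpleNamespace(choices=[]),  # chunk without choices is skipped
]
content, reasoning = split_stream(chunks)
print(repr(content), repr(reasoning))
```

The same function can be passed the real `stream` object from the reproduction script, which makes it easy to compare the two fields before and after applying a fix.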

Example

The issue itself contains no parser code. The relevant logic lives in the files changed by the two PRs: vllm/entrypoints/openai/chat_completion/serving.py (PR #40006) and vllm/reasoning/basic_parsers.py (PR #39898).

Notes

The report was made with the gemma4 reasoning parser and the RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic model. PR #40006 describes the failing path in terms of Gemma4ReasoningParser.is_reasoning_end() being evaluated on prompt_token_ids, so whether other reasoning parsers are affected is not established here.

Recommendation

Use a vLLM build that includes the changes from PR #40006. Until then, sending requests that follow a tool result with stream=False is a possible workaround, since the non-streaming path calls extract_reasoning() on the full output and is reported as unaffected.
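
A client-side stopgap along these lines could look like the sketch below. It is an assumption-laden illustration, not part of either PR: `should_stream` is a hypothetical helper that falls back to non-streaming whenever the history already contains a tool result, which is the condition the report associates with the leak. The cost is losing incremental output for those turns.

```python
def should_stream(messages) -> bool:
    """Stream only when the history has no tool result yet.

    Per the report, the leak does not occur on the first turn,
    only once a tool result is part of the conversation.
    """
    return not any(m.get("role") == "tool" for m in messages)


first_turn = [
    {"role": "user", "content": "Search for information about data protection laws"},
]
after_tool = first_turn + [
    {"role": "assistant", "content": "", "tool_calls": [
        {"id": "call_001", "type": "function",
         "function": {"name": "ToolA", "arguments": '{"x": "load ToolB"}'}},
    ]},
    {"role": "tool", "tool_call_id": "call_001",
     "content": "Success: ToolB is now available"},
]

print(should_stream(first_turn))  # True
print(should_stream(after_tool))  # False
```

The result would be passed as the `stream` argument of `client.chat.completions.create(...)`, with the caller handling both the streamed and the non-streamed response shapes.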
