vllm - ✅(Solved) Fix [Bug]: The last few reasoning output tokens are missing when using Gemma4 and setting "--streaming-interval" to be larger than 1 [1 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#41691
Fetched 2026-05-06 06:15:26
Comments: 1
Participants: 2
Timeline: 5
Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

Root Cause

  • When stream_interval > 1, token IDs can arrive in delta_token_ids before their text is flushed to delta_text (stop-sequence buffering).
  • In BaseThinkingReasoningParser.extract_reasoning_streaming(), find() then returns -1, and the resulting delta_text[:-1] slice silently drops the last character of the reasoning output.

Fix Action

Fix / Workaround

  • PR #41698 adds text-presence guards before the find() calls in BaseThinkingReasoningParser.extract_reasoning_streaming(). See "PR fix notes" for details.

PR fix notes

PR #41698: [Bugfix] Guard against buffered token IDs in BaseThinkingReasoningParser streaming

Description (problem / solution / changelog)

Purpose

Fix streaming token truncation when --stream-interval > 1 with reasoning parsers (e.g. --reasoning-parser gemma4).

When stream_interval > 1, token IDs can arrive in delta_token_ids before their text is flushed to delta_text (stop-sequence buffering). find() then returns -1, so the slice becomes delta_text[:-1], which silently drops the last character of the reasoning output.

This adds text-presence guards before find() calls in BaseThinkingReasoningParser.extract_reasoning_streaming(), matching the pattern already applied in KimiK2 (#41068), Hermes, GLM, DeepSeekV3.2, and MiniMax M2 parsers.

Fixes #41691
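
The failure mode described above can be illustrated with a minimal sketch. This is not vLLM's parser code: END_TOKEN and both function names are hypothetical, chosen only to show why slicing with an unchecked find() result loses a character and how a text-presence guard avoids it.

```python
# Hypothetical end-of-reasoning marker; NOT the real Gemma4 token.
END_TOKEN = "<end>"


def split_unguarded(delta_text: str) -> str:
    """Return the reasoning part, assuming END_TOKEN is present in delta_text."""
    end_index = delta_text.find(END_TOKEN)  # -1 when the marker text is absent
    return delta_text[:end_index]  # with -1 this is delta_text[:-1]


def split_guarded(delta_text: str) -> str:
    """Check that the marker text is present before slicing."""
    if END_TOKEN not in delta_text:
        return delta_text  # nothing to split yet, keep the text intact
    return delta_text[: delta_text.find(END_TOKEN)]


chunk = "reasoning text"
print(repr(split_unguarded(chunk)))  # last character is lost
print(repr(split_guarded(chunk)))    # text is unchanged
```

In this sketch the guarded variant returns the text untouched whenever the marker has not yet appeared in the text, which mirrors the text-presence check the PR describes.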

Test Plan

pytest tests/reasoning/test_base_thinking_reasoning_parser.py -xvs


Changed files

  • tests/reasoning/test_base_thinking_reasoning_parser.py (modified, +36/-0)
  • vllm/reasoning/basic_parsers.py (modified, +11/-0)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.4.0-173-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 570.172.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          GenuineIntel
BIOS Vendor ID:                     Intel(R) Corporation
Model name:                         Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
BIOS Model name:                    Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz  CPU @ 2.6GHz
BIOS CPU family:                    179
CPU family:                         6
Model:                              106
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           6
Frequency boost:                    enabled
CPU(s) scaling MHz:                 43%
CPU max MHz:                        3400.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          3 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           80 MiB (64 instances)
L3 cache:                           96 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    PXB     SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    PXB     SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     NODE    SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     NODE    SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    NODE    PXB     SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    NODE    PXB     SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PXB     NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PXB     NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB      X      NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS
NIC1    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    NODE     X      SYS     SYS     NODE    NODE    NODE    NODE    SYS
NIC2    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     NODE
NIC3    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS      X      PIX     PHB     PHB     SYS
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PIX      X      PHB     PHB     SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PHB     PHB      X      PIX     SYS
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PHB     PHB     PIX      X      SYS
NIC8    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_cx6_0
  NIC1: mlx5_cx6_1
  NIC2: mlx5_cx6_2
  NIC3: mlx5_cx6_3
  NIC4: mlx5_cx4lx_0
  NIC5: mlx5_cx4lx_1
  NIC6: mlx5_cx4lx_2
  NIC7: mlx5_cx4lx_3
  NIC8: mlx5_cx4lx_4

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-ef7992f0-d2d2-b2ba-d9aa-13d7830bc191,GPU-fc1d6e4e-17e0-aad0-aa32-dbf2b8b52c68,GPU-ad0d22e3-e8fb-3212-1608-aebe404a86d5,GPU-b585aeb9-7a34-ffb5-dc63-f90d1e7884d6,GPU-5563e915-67e8-e4b4-68da-7e3e88ae6a86,GPU-fcc72bf3-adfe-e599-3da0-129f7c7d0894,GPU-9128ea07-a489-d45a-93fd-da1cdf5937c2,GPU-730848c4-28b3-ba31-ba2c-216169b00011
NCCL_IB_TC=168
NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
NCCL_SOCKET_IFNAME=bondYW
NCCL_NET_GDR_LEVEL=3
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5_cx6_0,mlx5_cx6_1,mlx5_cx6_2,mlx5_cx6_3
VLLM_USAGE_SOURCE=production-docker-image
NCCL_IB_GID_INDEX=3
CUDA_VERSION=13.0.2
VLLM_ENABLE_CUDA_COMPATIBILITY=0
NCCL_IB_TIMEOUT=22
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

nohup vllm serve ./Gemma4-31B-it \
    --served-model-name "gemma4-31b-it" \
    --api-key <your_api_key> \
    --host 0.0.0.0 \
    --port <your_service_port> \
    --max-model-len 262144 \
    --max-num-seqs 128 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.90 \
    --stream-interval 10 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --async-scheduling \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    > "$log_file" 2>&1 &

---

import os
from openai import OpenAI

API_BASE = "http://jb-aionlineinferenceservice-155859607233257216-8000-nhss-job.z2120.nhss.zhejianglab.com:31080/v1"
API_KEY  = os.environ['GEMMA4_API_KEY']
MODEL    = "gemma4-31b-it"

client = OpenAI(
    base_url = API_BASE,
    api_key = API_KEY,
)

#%% Streaming mode
stream = client.chat.completions.create(
    model = MODEL,
    messages = [
        {"role": "user", "content": "A snail is at the bottom of a 20-foot well. Each day it climbs 3 feet, but at night it slides back 2 feet. How many days will it take to reach the top?"}
    ],
    temperature = 0,
    extra_body = {
        "chat_template_kwargs": {"enable_thinking": True}
    },
    stream = True,
)

reasoning_parts = []
content_parts = []

for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning") and delta.reasoning:
        reasoning_parts.append(delta.reasoning)
    if delta.content:
        content_parts.append(delta.content)

reasoning = "".join(reasoning_parts)
content = "".join(content_parts)

print('[Streaming Mode]:')
if reasoning:
    print("=== Thinking ===")
    print(reasoning)

print("\n=== Answer ===")
print(content)
print("\n\n---------\n\n")

#%% Non-streaming mode
response = client.chat.completions.create(
    model = MODEL,
    messages = [
        {"role": "user", "content": "A snail is at the bottom of a 20-foot well. Each day it climbs 3 feet, but at night it slides back 2 feet. How many days will it take to reach the top?"}
    ],
    temperature = 0,
    extra_body = {
        "chat_template_kwargs": {"enable_thinking": True}
    }
)

message = response.choices[0].message

print('[Non-streaming Mode]:')
if hasattr(message, "reasoning") and message.reasoning:
    print("=== Thinking ===")
    print(message.reasoning)

print("\n=== Answer ===")
print(message.content)

---

[Streaming Mode]:
=== Thinking ===
*   Object: Snail.
    *   Starting position: Bottom of a 20-foot well (0 feet).
    *   Goal: Reach the top (20 feet).
    *   Daily progress: +3 feet (day), -2 feet (night).

    *   Day 1:
        *   Day: 0 + 3 = 3 feet.
        *   Night: 3 - 2 = 1 foot.
    *   Day 2:
        *   Day: 1 + 3 = 4 feet.
        *   Night: 4 - 2 = 2 feet.
    *   Day 3:
        *   Day: 2 + 3 = 5 feet.
        *   Night: 5 - 2 = 3 feet.
    *   *Observation:* The net gain per 24-hour cycle is 1 foot.

    *   The snail reaches the top *during the day*. Once it hits 20 feet, it doesn't slide back because it's already out of the well.
    *   Let $n$ be the number of days.
    *   On the last day, the snail climbs 3 feet to reach 20. This means it must have started that day at $20 - 3 = 17$ feet.
    *   How many full cycles (day + night) does it take to reach 17 feet?
    *   Since the net gain is 1 foot per cycle, it takes 17 full cycles to reach 17 feet.
    *   Wait, let's check the math:
        *   End of Day 17 (after sliding): 17 feet.
        *   Start of Day 18: 17 feet.
        *   During Day 18: $17 + 3 = 20$ feet.
    *   The snail reaches the top on Day 18.

    *   Day 1: 1ft (net)
    *   Day 2: 2ft (net)
    *   ...
    *   Day 17: 17ft (net)
    *   Day 18: 17 + 3 = 20ft. (Reached!)

    *   State the answer clearly.
    *   Explain the logic (net gain vs

=== Answer ===
It will take **18 days**.

Here is the breakdown:

1.  **The Daily Net Gain:** Each day the snail climbs 3 feet and slides back 2, resulting in a net gain of **1 foot per day**.
2.  **The Critical Point:** Many people assume the answer is 20 days, but they forget that once the snail reaches the top, it doesn't slide back.
3.  **The Calculation:** 
    *   By the end of the **17th day** (after sliding back at night), the snail has reached a height of **17 feet**.
    *   On the **18th day**, the snail climbs **3 feet**.
    *   17 feet + 3 feet = **20 feet**.

Since it has now reached the top of the well, it is out and does not slide back.


---------


[Non-streaming Mode]:
=== Thinking ===
*   Object: Snail.
    *   Starting position: Bottom of a 20-foot well (0 feet).
    *   Goal: Reach the top (20 feet).
    *   Daily progress: +3 feet (day), -2 feet (night).

    *   Day 1:
        *   Day: 0 + 3 = 3 feet.
        *   Night: 3 - 2 = 1 foot.
    *   Day 2:
        *   Day: 1 + 3 = 4 feet.
        *   Night: 4 - 2 = 2 feet.
    *   Day 3:
        *   Day: 2 + 3 = 5 feet.
        *   Night: 5 - 2 = 3 feet.
    *   *Observation:* The net gain per 24-hour cycle is 1 foot.

    *   The snail reaches the top *during the day*. Once it hits 20 feet, it doesn't slide back because it's already out of the well.
    *   Let $n$ be the number of days.
    *   On the last day, the snail climbs 3 feet to reach 20. This means it must have started that day at $20 - 3 = 17$ feet.
    *   How many full cycles (day + night) does it take to reach 17 feet?
    *   Since the net gain is 1 foot per cycle, it takes 17 full cycles to reach 17 feet.
    *   Wait, let's check the math:
        *   End of Day 17 (after sliding): 17 feet.
        *   Start of Day 18: 17 feet.
        *   During Day 18: $17 + 3 = 20$ feet.
    *   The snail reaches the top on Day 18.

    *   Day 1: 1ft (net)
    *   Day 2: 2ft (net)
    *   ...
    *   Day 17: 17ft (net)
    *   Day 18: 17 + 3 = 20ft. (Reached!)

    *   State the answer clearly.
    *   Explain the logic (net gain vs. the final jump).

=== Answer ===
It will take **18 days**.

Here is the breakdown:

1.  **The Daily Net Gain:** Each day the snail climbs 3 feet and slides back 2, resulting in a net gain of **1 foot per day**.
2.  **The Critical Point:** Many people assume the answer is 20 days, but they forget that once the snail reaches the top, it doesn't slide back.
3.  **The Calculation:** 
    *   By the end of the **17th day** (after sliding back at night), the snail has reached a height of **17 feet**.
    *   On the **18th day**, the snail climbs **3 feet**.
    *   17 feet + 3 feet = **20 feet**.

Since it has now reached the top of the well, it is out and does not slide back.
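
The difference between the two transcripts above can be measured with a small helper. This is a sketch for checking a reproduction, not part of the original report: missing_suffix is a hypothetical name, and the two sample strings are the final reasoning lines copied from the streaming and non-streaming outputs.

```python
from typing import Optional


def missing_suffix(streamed: str, full: str) -> Optional[str]:
    """Return the text missing from the end of `streamed`.

    Returns None when `streamed` is not a prefix of `full`, meaning the
    two outputs diverge somewhere other than the tail.
    """
    if not full.startswith(streamed):
        return None
    return full[len(streamed):]


# Final reasoning lines from the two transcripts.
streamed_tail = "Explain the logic (net gain vs"
full_tail = "Explain the logic (net gain vs. the final jump)."

print(repr(missing_suffix(streamed_tail, full_tail)))
```

In the reproduction script, the same function could be called with the joined streaming reasoning and message.reasoning from the non-streaming response. An empty string would mean nothing was truncated.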

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.4.0-173-generic-x86_64-with-glibc2.39
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 570.172.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          GenuineIntel
BIOS Vendor ID:                     Intel(R) Corporation
Model name:                         Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz
BIOS Model name:                    Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60GHz  CPU @ 2.6GHz
BIOS CPU family:                    179
CPU family:                         6
Model:                              106
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           6
Frequency boost:                    enabled
CPU(s) scaling MHz:                 43%
CPU max MHz:                        3400.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5200.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          3 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           80 MiB (64 instances)
L3 cache:                           96 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    PXB     SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     SYS     NODE    PXB     SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     NODE    SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     NODE    SYS     SYS     SYS     SYS     NODE    0-31,64-95      0               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    NODE    PXB     SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    NODE    PXB     SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    PXB     NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      PXB     NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB      X      NODE    SYS     SYS     NODE    NODE    NODE    NODE    SYS
NIC1    SYS     SYS     SYS     SYS     PXB     PXB     NODE    NODE    NODE     X      SYS     SYS     NODE    NODE    NODE    NODE    SYS
NIC2    NODE    NODE    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS     NODE
NIC3    PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS     NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS      X      PIX     PHB     PHB     SYS
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PIX      X      PHB     PHB     SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PHB     PHB      X      PIX     SYS
NIC7    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     PHB     PHB     PIX      X      SYS
NIC8    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_cx6_0
  NIC1: mlx5_cx6_1
  NIC2: mlx5_cx6_2
  NIC3: mlx5_cx6_3
  NIC4: mlx5_cx4lx_0
  NIC5: mlx5_cx4lx_1
  NIC6: mlx5_cx4lx_2
  NIC7: mlx5_cx4lx_3
  NIC8: mlx5_cx4lx_4

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-ef7992f0-d2d2-b2ba-d9aa-13d7830bc191,GPU-fc1d6e4e-17e0-aad0-aa32-dbf2b8b52c68,GPU-ad0d22e3-e8fb-3212-1608-aebe404a86d5,GPU-b585aeb9-7a34-ffb5-dc63-f90d1e7884d6,GPU-5563e915-67e8-e4b4-68da-7e3e88ae6a86,GPU-fcc72bf3-adfe-e599-3da0-129f7c7d0894,GPU-9128ea07-a489-d45a-93fd-da1cdf5937c2,GPU-730848c4-28b3-ba31-ba2c-216169b00011
NCCL_IB_TC=168
NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
NCCL_SOCKET_IFNAME=bondYW
NCCL_NET_GDR_LEVEL=3
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NCCL_IB_HCA=mlx5_cx6_0,mlx5_cx6_1,mlx5_cx6_2,mlx5_cx6_3
VLLM_USAGE_SOURCE=production-docker-image
NCCL_IB_GID_INDEX=3
CUDA_VERSION=13.0.2
VLLM_ENABLE_CUDA_COMPATIBILITY=0
NCCL_IB_TIMEOUT=22
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

When hosting Gemma4-31B-it on vLLM 0.20.0 with --stream-interval set larger than 1 (the launch script below uses 10) and invoking the model in streaming mode, the last few reasoning output tokens are missing.

Script to launch the vLLM server:

nohup vllm serve ./Gemma4-31B-it \
    --served-model-name "gemma4-31b-it" \
    --api-key <your_api_key> \
    --host 0.0.0.0 \
    --port <your_service_port> \
    --max-model-len 262144 \
    --max-num-seqs 128 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.90 \
    --stream-interval 10 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 8192 \
    --async-scheduling \
    --enable-auto-tool-choice \
    --tool-call-parser gemma4 \
    --reasoning-parser gemma4 \
    > "$log_file" 2>&1 &

Script to demonstrate the bug:

import os
from openai import OpenAI

API_BASE = "http://jb-aionlineinferenceservice-155859607233257216-8000-nhss-job.z2120.nhss.zhejianglab.com:31080/v1"
API_KEY  = os.environ['GEMMA4_API_KEY']
MODEL    = "gemma4-31b-it"

client = OpenAI(
    base_url = API_BASE,
    api_key = API_KEY,
)

#%% Streaming mode
stream = client.chat.completions.create(
    model = MODEL,
    messages = [
        {"role": "user", "content": "A snail is at the bottom of a 20-foot well. Each day it climbs 3 feet, but at night it slides back 2 feet. How many days will it take to reach the top?"}
    ],
    temperature = 0,
    extra_body = {
        "chat_template_kwargs": {"enable_thinking": True}
    },
    stream = True,
)

reasoning_parts = []
content_parts = []

for chunk in stream:
    delta = chunk.choices[0].delta
    if hasattr(delta, "reasoning") and delta.reasoning:
        reasoning_parts.append(delta.reasoning)
    if delta.content:
        content_parts.append(delta.content)

reasoning = "".join(reasoning_parts)
content = "".join(content_parts)

print('[Streaming Mode]:')
if reasoning:
    print("=== Thinking ===")
    print(reasoning)

print("\n=== Answer ===")
print(content)
print("\n\n---------\n\n")

#%% Non-streaming mode
response = client.chat.completions.create(
    model = MODEL,
    messages = [
        {"role": "user", "content": "A snail is at the bottom of a 20-foot well. Each day it climbs 3 feet, but at night it slides back 2 feet. How many days will it take to reach the top?"}
    ],
    temperature = 0,
    extra_body = {
        "chat_template_kwargs": {"enable_thinking": True}
    }
)

message = response.choices[0].message

print('[Non-streaming Mode]:')
if hasattr(message, "reasoning") and message.reasoning:
    print("=== Thinking ===")
    print(message.reasoning)

print("\n=== Answer ===")
print(message.content)

Outputs:

[Streaming Mode]:
=== Thinking ===
*   Object: Snail.
    *   Starting position: Bottom of a 20-foot well (0 feet).
    *   Goal: Reach the top (20 feet).
    *   Daily progress: +3 feet (day), -2 feet (night).

    *   Day 1:
        *   Day: 0 + 3 = 3 feet.
        *   Night: 3 - 2 = 1 foot.
    *   Day 2:
        *   Day: 1 + 3 = 4 feet.
        *   Night: 4 - 2 = 2 feet.
    *   Day 3:
        *   Day: 2 + 3 = 5 feet.
        *   Night: 5 - 2 = 3 feet.
    *   *Observation:* The net gain per 24-hour cycle is 1 foot.

    *   The snail reaches the top *during the day*. Once it hits 20 feet, it doesn't slide back because it's already out of the well.
    *   Let $n$ be the number of days.
    *   On the last day, the snail climbs 3 feet to reach 20. This means it must have started that day at $20 - 3 = 17$ feet.
    *   How many full cycles (day + night) does it take to reach 17 feet?
    *   Since the net gain is 1 foot per cycle, it takes 17 full cycles to reach 17 feet.
    *   Wait, let's check the math:
        *   End of Day 17 (after sliding): 17 feet.
        *   Start of Day 18: 17 feet.
        *   During Day 18: $17 + 3 = 20$ feet.
    *   The snail reaches the top on Day 18.

    *   Day 1: 1ft (net)
    *   Day 2: 2ft (net)
    *   ...
    *   Day 17: 17ft (net)
    *   Day 18: 17 + 3 = 20ft. (Reached!)

    *   State the answer clearly.
    *   Explain the logic (net gain vs

=== Answer ===
It will take **18 days**.

Here is the breakdown:

1.  **The Daily Net Gain:** Each day the snail climbs 3 feet and slides back 2, resulting in a net gain of **1 foot per day**.
2.  **The Critical Point:** Many people assume the answer is 20 days, but they forget that once the snail reaches the top, it doesn't slide back.
3.  **The Calculation:** 
    *   By the end of the **17th day** (after sliding back at night), the snail has reached a height of **17 feet**.
    *   On the **18th day**, the snail climbs **3 feet**.
    *   17 feet + 3 feet = **20 feet**.

Since it has now reached the top of the well, it is out and does not slide back.


---------


[Non-streaming Mode]:
=== Thinking ===
*   Object: Snail.
    *   Starting position: Bottom of a 20-foot well (0 feet).
    *   Goal: Reach the top (20 feet).
    *   Daily progress: +3 feet (day), -2 feet (night).

    *   Day 1:
        *   Day: 0 + 3 = 3 feet.
        *   Night: 3 - 2 = 1 foot.
    *   Day 2:
        *   Day: 1 + 3 = 4 feet.
        *   Night: 4 - 2 = 2 feet.
    *   Day 3:
        *   Day: 2 + 3 = 5 feet.
        *   Night: 5 - 2 = 3 feet.
    *   *Observation:* The net gain per 24-hour cycle is 1 foot.

    *   The snail reaches the top *during the day*. Once it hits 20 feet, it doesn't slide back because it's already out of the well.
    *   Let $n$ be the number of days.
    *   On the last day, the snail climbs 3 feet to reach 20. This means it must have started that day at $20 - 3 = 17$ feet.
    *   How many full cycles (day + night) does it take to reach 17 feet?
    *   Since the net gain is 1 foot per cycle, it takes 17 full cycles to reach 17 feet.
    *   Wait, let's check the math:
        *   End of Day 17 (after sliding): 17 feet.
        *   Start of Day 18: 17 feet.
        *   During Day 18: $17 + 3 = 20$ feet.
    *   The snail reaches the top on Day 18.

    *   Day 1: 1ft (net)
    *   Day 2: 2ft (net)
    *   ...
    *   Day 17: 17ft (net)
    *   Day 18: 17 + 3 = 20ft. (Reached!)

    *   State the answer clearly.
    *   Explain the logic (net gain vs. the final jump).

=== Answer ===
It will take **18 days**.

Here is the breakdown:

1.  **The Daily Net Gain:** Each day the snail climbs 3 feet and slides back 2, resulting in a net gain of **1 foot per day**.
2.  **The Critical Point:** Many people assume the answer is 20 days, but they forget that once the snail reaches the top, it doesn't slide back.
3.  **The Calculation:** 
    *   By the end of the **17th day** (after sliding back at night), the snail has reached a height of **17 feet**.
    *   On the **18th day**, the snail climbs **3 feet**.
    *   17 feet + 3 feet = **20 feet**.

Since it has now reached the top of the well, it is out and does not slide back.

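In streaming mode the last few reasoning tokens are missing: the streamed reasoning ends with "Explain the logic (net gain vs", whereas the non-streaming reasoning ends with "Explain the logic (net gain vs. the final jump)." The answer content is identical in both modes.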

The bug appears to be tied to the --stream-interval parameter: the problem does not occur when --stream-interval is set to 1.
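The truncation described above can be checked mechanically rather than by eye. The following is a minimal sketch; the helper name `missing_tail` is hypothetical, and the two strings are the final reasoning lines copied from the outputs above.

```python
from typing import Optional


def missing_tail(streamed: str, full: str) -> Optional[str]:
    """Return the part of `full` that is absent from the end of `streamed`.

    Returns None when `streamed` is not a prefix of `full`, i.e. the two
    texts diverge somewhere other than at the tail.
    """
    if not full.startswith(streamed):
        return None
    return full[len(streamed):]


# Final reasoning lines from the streaming and non-streaming outputs above.
streamed_end = "Explain the logic (net gain vs"
full_end = "Explain the logic (net gain vs. the final jump)."

print(repr(missing_tail(streamed_end, full_end)))
```

An empty string means nothing was lost; a non-empty result is the text that never arrived in the streamed reasoning.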

extent analysis

TL;DR

The truncation is likely tied to the --stream-interval parameter: the reporter sees no missing tokens when it is set to 1, so setting it to 1 is the available workaround until the server-side streaming logic is fixed.

Guidance

  • Verify the effect of different --stream-interval values (for example 1, 2, and 10) on the output to confirm the relationship between this parameter and the missing tokens.
  • Check the vLLM documentation for any known issues or limitations related to streaming mode and the --stream-interval parameter.
  • The reproduction script already concatenates every reasoning delta it receives, so client-side buffering alone is unlikely to recover the missing text. To detect the loss, compare the streamed reasoning against a non-streaming response for the same prompt (with temperature 0, as in the script above).
  • If the issue persists, try updating vLLM to the latest version or checking for any compatibility issues with the current environment setup.

Example

The issue does not include the server-side streaming code, so no fix to vLLM itself can be shown here. On the client side, the practical options are lowering the stream interval or comparing the streamed output against a non-streaming response to detect truncation.

Notes

The provided information suggests a potential bug in the vLLM streaming functionality when --stream-interval is greater than 1. Only the reasoning text is truncated in the report; the answer content is complete. The exact cause and solution depend on how vLLM batches streamed tokens and hands them to the reasoning parser.
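To make the symptom concrete, here is a toy model of one way batching can lose a reasoning tail. It is an illustration under stated assumptions, not vLLM's code and not a confirmed diagnosis: the marker `END`, the functions `batch` and `parse`, and the token list are all hypothetical. The assumed failure is a parser that, on receiving a chunk containing both the end of the reasoning and the start of the answer, keeps only the answer part.

```python
END = "<end_of_reasoning>"  # hypothetical marker, not Gemma's real token


def batch(tokens, interval):
    """Group tokens into chunks of `interval` tokens each."""
    return ["".join(tokens[i:i + interval]) for i in range(0, len(tokens), interval)]


def parse(chunks, split_boundary_chunk):
    """Route chunks to reasoning or content.

    With split_boundary_chunk=False, text that precedes END inside the
    boundary chunk is discarded (the assumed failure mode).
    """
    reasoning, content = [], []
    in_reasoning = True
    for chunk in chunks:
        if in_reasoning and END in chunk:
            before, after = chunk.split(END, 1)
            if split_boundary_chunk:
                reasoning.append(before)
            content.append(after)
            in_reasoning = False
        elif in_reasoning:
            reasoning.append(chunk)
        else:
            content.append(chunk)
    return "".join(reasoning), "".join(content)


tokens = ["Explain", " the", " logic", " (", "net", " gain", " vs", ".",
          " the", " final", " jump", ").", END,
          "It", " will", " take", " **18 days**", "."]

lossy = parse(batch(tokens, 7), split_boundary_chunk=False)
fixed = parse(batch(tokens, 7), split_boundary_chunk=True)
single = parse(batch(tokens, 1), split_boundary_chunk=False)
print(lossy[0])
print(fixed[0])
print(single[0])
```

In this toy, an interval of 1 never puts reasoning text and the marker in the same chunk, which is consistent with the report that the problem disappears at an interval of 1. Whether vLLM's Gemma4 reasoning parser behaves this way is not established by the issue.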

Recommendation

Apply a workaround by setting --stream-interval to 1, which the reporter found removes the truncation. If this is not feasible due to performance or other constraints, further investigation into the vLLM streaming logic, or an upgrade to a release containing a fix, may be necessary.
