Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

```
============================== System Info
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 9.5.0-6ubuntu2) 9.5.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

============================== vLLM Info
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      PHB     0-7             0                N/A
GPU1    PHB      X      0-7             0                N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

</details>

🐛 Describe the bug

Bug Summary

In vLLM 0.19.1, the Gemma4ToolParser produces incorrect floating-point argument values in streaming mode. A number like 108.2 is incorrectly streamed as 108.02, and 22.8 becomes 22.08. String-typed arguments and non-streaming responses are unaffected.
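One way to produce exactly this symptom, offered only as a hypothesis and not taken from the vLLM source, is a streaming parser that re-serializes each partially generated number (so `108.` becomes `108.0`) and then sends only the text after the longest common prefix with its previous snapshot. The sketch below replays the snapshots `108`, `108.`, `108.2` (an assumed token split) through such a naive differ and through one that forwards the model's raw characters. The helper names `reserialize`, `naive_stream`, and `raw_stream` are hypothetical.

```python
import json
import os

# HYPOTHETICAL model of the symptom; this is not vLLM's Gemma4ToolParser code.


def reserialize(partial: str) -> str:
    """Round-trip a partially generated number through int/float, e.g. "108." -> "108.0"."""
    return json.dumps(float(partial)) if "." in partial else json.dumps(int(partial))


def naive_stream(snapshots: list[str]) -> str:
    """Emit only the text after the common prefix of consecutive re-serialized snapshots."""
    sent, prev = "", ""
    for snap in snapshots:
        current = reserialize(snap)
        common = os.path.commonprefix([prev, current])
        # Text already sent past the common prefix (here ".0") is never retracted.
        sent += current[len(common):]
        prev = current
    return sent


def raw_stream(snapshots: list[str]) -> str:
    """Forward the model's raw characters; each snapshot must extend the text already sent."""
    sent = ""
    for snap in snapshots:
        if not snap.startswith(sent):
            raise ValueError(f"snapshot {snap!r} does not extend {sent!r}")
        sent += snap[len(sent):]
    return sent


for value in ("108.2", "22.8"):
    head, _, _ = value.partition(".")
    snapshots = [head, head + ".", value]  # e.g. "108", "108.", "108.2"
    print(value, "->", naive_stream(snapshots), "vs", raw_stream(snapshots))
# 108.2 -> 108.02 vs 108.2
# 22.8 -> 22.08 vs 22.8
```

Under this assumption the stray `0` comes from the `.0` that the float round-trip adds to `108.`, which is sent before the final digit arrives and is never taken back. Streaming the raw digits (or holding a number back until it is complete) avoids the divergence.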

Launch Command

```bash
vllm serve /llm-models/gemma-4-26B-A4B-it \
  --served-model-name gemma4-26b-a4b-it \
  --host=0.0.0.0 \
  --port 9999 \
  --gpu_memory_utilization 0.9 \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4
```

Minimal Reproduction

```python
# pip install openai==1.76.1
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9999/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="gemma4-26b-a4b-it",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "Calculate the sum of 108.2 and 22.8"}]
    }],
    tools=[{
        "type": "function",
        "function": {
            "name": "add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "left": {"type": "number", "description": "The first number"},
                    "right": {"type": "number", "description": "The second number"}
                },
                "required": ["left", "right"]
            }
        }
    }],
    stream=True,
    temperature=1.0,
    top_p=0.95,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)

arguments = ""
for chunk in response:
    tc = chunk.choices[0].delta.tool_calls
    if tc and len(tc) == 1:
        arguments += tc[0].function.arguments

print(arguments)
# Actual (streaming):   {"left": 108.02, "right": 22.08}
# Expected (streaming): {"left": 108.2,  "right": 22.8}
```
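When re-running this reproduction many times, it can help to join argument fragments per tool-call index and parse them with `json.loads`, so a corrupted number surfaces as a wrong value rather than as a raw string. The sketch below assumes deltas follow the OpenAI streaming shape (`choices[0].delta.tool_calls[i].index` and `.function.arguments`, where `tool_calls` and `arguments` may be `None` and `choices` may be empty). It is exercised with `SimpleNamespace` stand-ins instead of a live server; `collect_tool_arguments`, `fake_chunk`, and the fragment split are hypothetical.

```python
import json
from collections import defaultdict
from types import SimpleNamespace


def collect_tool_arguments(chunks) -> dict[int, dict]:
    """Join streamed tool-call argument fragments per tool-call index and parse them."""
    fragments: dict[int, list[str]] = defaultdict(list)
    for chunk in chunks:
        if not chunk.choices:  # e.g. a trailing usage-only chunk
            continue
        for call in chunk.choices[0].delta.tool_calls or []:
            if call.function is not None and call.function.arguments:
                fragments[call.index].append(call.function.arguments)
    return {index: json.loads("".join(parts)) for index, parts in fragments.items()}


def fake_chunk(index, arguments):
    """Hypothetical stand-in for one streamed chunk carrying a single tool-call delta."""
    call = SimpleNamespace(index=index, function=SimpleNamespace(arguments=arguments))
    delta = SimpleNamespace(tool_calls=[call])
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# Invented fragment split; against a live server, pass the `response` iterator instead.
chunks = [fake_chunk(0, None), fake_chunk(0, '{"left": 108'),
          fake_chunk(0, ".2, "), fake_chunk(0, '"right": 22.8}')]
print(collect_tool_arguments(chunks)[0])  # {'left': 108.2, 'right': 22.8}
```

Note that a streaming `response` can only be iterated once, so collect it either with this helper or with the original loop, not both.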

Expected vs Actual Behavior

- Expected (streaming): `{"left": 108.2, "right": 22.8}`
- Actual (streaming): `{"left": 108.02, "right": 22.08}`
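To turn this comparison into a regression check, a small helper can parse both JSON strings and report numeric arguments that differ. This is an illustrative sketch; `mismatched_args` is a hypothetical name, it assumes every expected argument is numeric, and its tolerance only absorbs float rounding noise, not the inserted digit seen here.

```python
import json
import math


def mismatched_args(expected_json: str, actual_json: str, rel_tol: float = 1e-9) -> dict:
    """Return {key: (expected, actual)} for numeric arguments whose values differ."""
    expected, actual = json.loads(expected_json), json.loads(actual_json)
    out = {}
    for key, want in expected.items():
        got = actual.get(key)
        # A missing or non-numeric actual value also counts as a mismatch.
        if not isinstance(got, (int, float)) or not math.isclose(want, got, rel_tol=rel_tol):
            out[key] = (want, got)
    return out


print(mismatched_args('{"left": 108.2, "right": 22.8}', '{"left": 108.02, "right": 22.08}'))
# {'left': (108.2, 108.02), 'right': (22.8, 22.08)}
```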

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.


vllm - 💡(How to fix) Fix [Bug] Gemma4ToolParser streams incorrect float values (e.g., 108.2 → 108.02)
