vllm - ✅(Solved) Fix [Bug]: Gemma4 tool-call-parser produces <pad> tokens under concurrent requests [2 pull requests, 3 comments, 3 participants]

Q: Expected behavior

Concurrent requests with `--tool-call-parser gemma4` should produce the same results as sequential requests.

vllm • 2026-04-09 06:56:12

vllm-project/vllm#39392 • Fetched 2026-04-10 03:40:53

Timeline (top)

subscribed ×4, commented ×3, cross-referenced ×1, labeled ×1

Workaround

Currently using a global lock to serialize all requests to vLLM, which eliminates the <pad> issue but reduces throughput.

PR fix notes

PR #38879: [Gemma4] Enable Fast Prefill Optimization

Repository: vllm-project/vllm
Author: LucasWilkinson
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38879

Description (problem / solution / changelog)

Summary

Add --kv-sharing-fast-prefill support for Gemma 4 models, porting the YOCO (You Only Cache Once) fast prefill optimization from Gemma3n. When enabled, the cross-decoder layers (KV-shared) skip prefill tokens and only process decode tokens, significantly reducing prefill latency and improving throughput under concurrent load.

Shout-out to @sarckk for the original optimization (https://github.com/vllm-project/vllm/pull/22628).

Test Plan

GSM8K accuracy (Gemma4-E4B, 5-shot)

# FP=OFF (baseline)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=False \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

# FP=ON (this PR)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=True \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

Serving benchmark

# Start server (without fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code

# Start server (with fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code \
  --kv-sharing-fast-prefill

# Run benchmark (after server is ready)
# concurrency=8
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 8 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

# concurrency=32
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 32 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

Test Results

GSM8K accuracy (Gemma4-E4B, 5-shot)

No accuracy regression:

Configuration	strict-match	flexible-extract
FP=OFF (baseline)	0.1054	0.1751
FP=ON (this PR)	0.1031	0.1850

Serving performance (Gemma4-E4B, 1xB200, ISL=8192, OSL=150, n=256)

concurrency=8

Metric	NORMAL	FAST_PREFILL	Delta
Throughput	4.22 req/s	5.06 req/s	+19.9%
Mean TTFT	570 ms	363 ms	-36.3%
Mean TPOT	8.90 ms	8.16 ms	-8.3%

concurrency=32

Metric	NORMAL	FAST_PREFILL	Delta
Throughput	6.53 req/s	9.07 req/s	+38.9%
Mean TTFT	942 ms	622 ms	-34.0%
Mean TPOT	26.43 ms	19.37 ms	-26.7%
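
The Delta column is consistent with a relative change against the NORMAL run, that is (fast - normal) / normal. A minimal sketch that recomputes it from the figures reported above; the name `relative_delta` is illustrative and not part of the PR:

```python
def relative_delta(normal: float, fast: float) -> float:
    """Percent change of the fast-prefill value relative to the baseline."""
    return (fast - normal) / normal * 100.0


# (metric, NORMAL, FAST_PREFILL) copied from the tables above, keyed by concurrency.
RESULTS = {
    8: [
        ("Throughput (req/s)", 4.22, 5.06),
        ("Mean TTFT (ms)", 570, 363),
        ("Mean TPOT (ms)", 8.90, 8.16),
    ],
    32: [
        ("Throughput (req/s)", 6.53, 9.07),
        ("Mean TTFT (ms)", 942, 622),
        ("Mean TPOT (ms)", 26.43, 19.37),
    ],
}

for concurrency, rows in RESULTS.items():
    for metric, normal, fast in rows:
        delta = relative_delta(normal, fast)
        print(f"concurrency={concurrency} {metric}: {delta:+.1f}%")
```

For throughput a positive delta is an improvement; for TTFT and TPOT a negative delta is.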

Changed files

vllm/model_executor/models/gemma4.py (modified, +369/-47)

Your current environment

vLLM Docker image: vllm/vllm-openai:gemma4
GPU: 8× NVIDIA GeForce RTX 4090 (24GB each)
OS: Ubuntu 24.04 LTS
Model: google/gemma-4-31B-it
tensor-parallel-size: 8
max-model-len: 196608

🐛 Describe the bug

Bug Description

When multiple requests with tool calling are sent concurrently to vLLM using --tool-call-parser gemma4, some requests produce outputs filled entirely with <pad> tokens (e.g., 4096 <pad> tokens). The same requests succeed 100% of the time when sent sequentially.

Reproduction Steps

1. Define tools and system prompt

from openai import OpenAI
import concurrent.futures
 
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "regex_scan",
            "description": "Scan text for sensitive patterns like IP addresses, passwords, API keys.",
            "parameters": {
                "type": "object",
                "properties": {
                    "scope": {"type": "string", "default": "all"}
                },
            }
        }
    }
]
 
system_prompt = "You are a security reviewer. Analyze the given text and call tools to check for sensitive information, then output a JSON report."
 
# Use any long-ish content (10K+ characters)
content = "..."

2. Sequential test — 5/5 pass

for i in range(5):
    resp = client.chat.completions.create(
        model="gemma4-31b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
        tools=tools,
        tool_choice="auto",
        max_tokens=4096,
        temperature=0.3,
    )
    output = resp.choices[0].message.content or ""
    print(f"  #{i}: pad={'<pad>' in output}, len={len(output)}")

Result: 5/5 succeed, 0 <pad> occurrences.

3. Concurrent test — some fail

def run_one(idx):
    resp = client.chat.completions.create(
        model="gemma4-31b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Request {idx}: {content}"},
        ],
        tools=tools,
        tool_choice="auto",
        max_tokens=4096,
        temperature=0.3,
    )
    msg = resp.choices[0].message
    output = msg.content or ""
    return idx, "<pad>" in output, len(output), resp.choices[0].finish_reason
 
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(run_one, i) for i in range(5)]
    for f in concurrent.futures.as_completed(futures):
        idx, has_pad, length, reason = f.result()
        print(f"  #{idx}: pad={has_pad}, len={length}, finish={reason}")

Result: 2/5 produce <pad> output.

  #4: pad=False, len=217, finish=stop
  #2: pad=False, len=204, finish=stop
  #0: pad=False, len=209, finish=stop
  #1: pad=True,  len=20480, finish=no_tools    ← <pad> filled entire output
  #3: pad=True,  len=20418, finish=length      ← <pad> filled entire output
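
The reported lengths line up with the bug description: `<pad>` is five characters long, so 4096 of them produce a 20480-character string. A small helper along these lines makes that check explicit when inspecting outputs; the name `pad_stats` is hypothetical:

```python
PAD = "<pad>"


def pad_stats(output: str) -> tuple[int, float]:
    """Return (number of <pad> occurrences, fraction of characters they cover)."""
    count = output.count(PAD)
    fraction = (count * len(PAD)) / len(output) if output else 0.0
    return count, fraction


# An output made only of 4096 pad tokens, as in the max_tokens=4096 failure case.
all_pad = PAD * 4096
print(len(all_pad), pad_stats(all_pad))
```

A fraction close to 1.0 marks an output that is almost entirely padding, as opposed to one that contains a stray `<pad>` inside otherwise valid text.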

4. Without tools — concurrent requests work fine

When the same content is sent concurrently without the tools parameter, all requests succeed. This suggests the issue is in the tool-call-parser, not in the model itself.

Key observations

Test	`<pad>` rate
Sequential, with tools	0/5
Concurrent, with tools	2/5
Sequential, without tools	0/5
Concurrent, without tools	0/5
Sequential, different temperature (0.3 vs 1.0)	0/5 for both
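
The matrix above can be reproduced with a single harness that varies only concurrency and the presence of tools. A sketch, assuming a caller-supplied `send(idx, use_tools)` function that wraps `client.chat.completions.create` and returns the message content as a string; `pad_rate` and `run_matrix` are illustrative names:

```python
import concurrent.futures
from typing import Callable, Dict, Tuple

Send = Callable[[int, bool], str]


def pad_rate(send: Send, n: int, concurrent_mode: bool, use_tools: bool) -> str:
    """Run n requests and report how many outputs contain <pad>, as 'k/n'."""
    if concurrent_mode:
        # One worker per request so all n requests are in flight together.
        with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
            outputs = list(pool.map(lambda i: send(i, use_tools), range(n)))
    else:
        outputs = [send(i, use_tools) for i in range(n)]
    failures = sum("<pad>" in out for out in outputs)
    return f"{failures}/{n}"


def run_matrix(send: Send, n: int = 5) -> Dict[Tuple[str, str], str]:
    """Cover the four combinations of sequential/concurrent and with/without tools."""
    return {
        (mode, tools): pad_rate(send, n, mode == "concurrent", tools == "with tools")
        for mode in ("sequential", "concurrent")
        for tools in ("with tools", "without tools")
    }
```

Because the failure is intermittent, a single run of five requests per cell may not reproduce the exact rates in the table.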

Suspected root cause

The Gemma4ToolParser (added in #38826) likely has shared mutable state that is not thread-safe. Under concurrent requests, the parser state from one request interferes with another, causing the tool call parsing to fail silently. When parsing fails, the model output degrades into <pad> tokens.

This is consistent with other recent issues in the gemma4 parser:

#38837 (missing tools parameter)
#38855 (reasoning parser fails to separate reasoning_content)
#39130 (reasoning-parser silently disables structured output)
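
To illustrate the kind of failure this hypothesis describes, here is a toy example. It is not vLLM's code, and `ToyParser` is hypothetical: it only shows how one buffer shared across requests mixes their streamed chunks, while one parser instance per request keeps them apart.

```python
class ToyParser:
    """Accumulates streamed text chunks for a request (toy stand-in only)."""

    def __init__(self) -> None:
        self.buffer = ""  # mutable state that must not be shared across requests

    def feed(self, chunk: str) -> str:
        self.buffer += chunk
        return self.buffer


# Chunks from two requests arriving interleaved, as they might under concurrent load.
events = [
    ("A", '{"name": '),
    ("B", '{"tool": '),
    ("A", '"regex_scan"}'),
    ("B", '"other"}'),
]

# Shared instance: both requests append to the same buffer.
shared = ToyParser()
shared_view = {}
for request_id, chunk in events:
    shared_view[request_id] = shared.feed(chunk)

# Per-request instances: each request only ever sees its own chunks.
parsers = {}
isolated_view = {}
for request_id, chunk in events:
    if request_id not in parsers:
        parsers[request_id] = ToyParser()
    isolated_view[request_id] = parsers[request_id].feed(chunk)

print("shared:  ", shared_view["A"])
print("isolated:", isolated_view["A"])
```

In the shared case request A's text is interleaved with request B's first chunk and no longer parses as JSON; in the isolated case it stays intact.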

Expected behavior

Concurrent requests with --tool-call-parser gemma4 should produce the same results as sequential requests.

Workaround

Currently using a global lock to serialize all requests to vLLM, which eliminates the <pad> issue but reduces throughput.
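
The issue does not show how the lock is implemented. One possible client-side version, assuming all requests originate from a single Python process, wraps the request function in a `threading.Lock`; `serialized` is a hypothetical helper:

```python
import functools
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# One process-wide lock: only one wrapped call can be in flight at a time.
_request_lock = threading.Lock()


def serialized(fn: Callable[..., T]) -> Callable[..., T]:
    """Wrap fn so that at most one call runs at a time in this process."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> T:
        with _request_lock:
            return fn(*args, **kwargs)

    return wrapper
```

Applying it to the `run_one` function from the reproduction (`run_one = serialized(run_one)`) would make the thread pool issue one request at a time, which is where the throughput cost comes from. It does not help when several processes or machines share the same server.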

extent analysis

TL;DR

Concurrent requests produce <pad> tokens most likely because Gemma4ToolParser holds shared mutable state that is not thread-safe; the likely fix is to remove or isolate that state.

Guidance

Review the Gemma4ToolParser code to identify and refactor any shared mutable state to ensure thread safety.
Consider using synchronization mechanisms, such as locks or atomic operations, to protect access to shared resources.
Test the parser with concurrent requests to verify that the issue is resolved.
If the issue persists, investigate other potential causes, such as resource contention or model limitations.

Example

No code snippet is provided as the issue is related to a specific parser implementation, and any example would require more context.

Notes

The provided workaround of using a global lock to serialize requests eliminates the <pad> issue but reduces throughput. A more efficient solution would be to address the root cause of the issue, which is the suspected shared mutable state in the Gemma4ToolParser.

Recommendation

Until a permanent fix is implemented, keep serializing requests with a lock or a similar synchronization mechanism. This prevents the <pad> output but reduces throughput.
