vllm - ✅(Solved) Fix [Bug]: Gemma4 tool-call-parser produces <pad> tokens under concurrent requests [2 pull requests, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#39392 (fetched 2026-04-10 03:40:53)
Comments: 3, Participants: 3, Timeline events: 10, Reactions: 0
Timeline (top): subscribed ×4, commented ×3, cross-referenced ×1, labeled ×1

Root Cause

Suspected root cause

Fix Action

Workaround

Currently using a global lock to serialize all requests to vLLM, which eliminates the <pad> issue but reduces throughput.

PR fix notes

PR #38879: [Gemma4] Enable Fast Prefill Optimization

Description (problem / solution / changelog)

Summary

Add --kv-sharing-fast-prefill support for Gemma 4 models, porting the YOCO (You Only Cache Once) fast prefill optimization from Gemma3n. When enabled, the cross-decoder layers (KV-shared) skip prefill tokens and only process decode tokens, significantly reducing prefill latency and improving throughput under concurrent load.

Shout-out to @sarckk for the original optimization (https://github.com/vllm-project/vllm/pull/22628).

Test Plan

GSM8K accuracy (Gemma4-E4B, 5-shot)

# FP=OFF (baseline)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=False \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

# FP=ON (this PR)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=True \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

Serving benchmark

# Start server (without fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code

# Start server (with fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code \
  --kv-sharing-fast-prefill

# Run benchmark (after server is ready)
# concurrency=8
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 8 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

# concurrency=32
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 32 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

Test Results

GSM8K accuracy (Gemma4-E4B, 5-shot)

No accuracy regression:

| Configuration | strict-match | flexible-extract |
|---|---|---|
| FP=OFF (baseline) | 0.1054 | 0.1751 |
| FP=ON (this PR) | 0.1031 | 0.1850 |

Serving performance (Gemma4-E4B, 1xB200, ISL=8192, OSL=150, n=256)

concurrency=8

| Metric | NORMAL | FAST_PREFILL | Delta |
|---|---|---|---|
| Throughput | 4.22 req/s | 5.06 req/s | +19.9% |
| Mean TTFT | 570 ms | 363 ms | -36.3% |
| Mean TPOT | 8.90 ms | 8.16 ms | -8.3% |

concurrency=32

| Metric | NORMAL | FAST_PREFILL | Delta |
|---|---|---|---|
| Throughput | 6.53 req/s | 9.07 req/s | +38.9% |
| Mean TTFT | 942 ms | 622 ms | -34.0% |
| Mean TPOT | 26.43 ms | 19.37 ms | -26.7% |
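
The Delta column can be re-derived from the reported means. The following is a small sketch of that arithmetic, not part of the PR; the dictionary layout and the `pct_delta` helper are illustrative names, and the numbers are copied from the two serving tables above.

```python
# Reported means from the two serving tables: (NORMAL, FAST_PREFILL).
results = {
    8: {"throughput_req_s": (4.22, 5.06), "ttft_ms": (570, 363), "tpot_ms": (8.90, 8.16)},
    32: {"throughput_req_s": (6.53, 9.07), "ttft_ms": (942, 622), "tpot_ms": (26.43, 19.37)},
}


def pct_delta(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Map (concurrency, metric) to the delta rounded to one decimal place.
deltas = {
    (conc, metric): round(pct_delta(base, fast), 1)
    for conc, metrics in results.items()
    for metric, (base, fast) in metrics.items()
}

for (conc, metric), delta in deltas.items():
    print(f"concurrency={conc:<2} {metric:<16} {delta:+.1f}%")
```

Positive deltas are improvements for throughput, while negative deltas are improvements for TTFT and TPOT, since those are latencies.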

Changed files

  • vllm/model_executor/models/gemma4.py (modified, +369/-47)


Your current environment

  • vLLM Docker image: vllm/vllm-openai:gemma4
  • GPU: 8× NVIDIA GeForce RTX 4090 (24GB each)
  • OS: Ubuntu 24.04 LTS
  • Model: google/gemma-4-31B-it
  • tensor-parallel-size: 8
  • max-model-len: 196608

🐛 Describe the bug

Bug Description

When multiple requests with tool calling are sent concurrently to vLLM using --tool-call-parser gemma4, some requests produce outputs filled entirely with <pad> tokens (e.g., 4096 <pad> tokens). The same requests succeed 100% of the time when sent sequentially.

Reproduction Steps

1. Define tools and system prompt

from openai import OpenAI
import concurrent.futures
 
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-api-key")
 
tools = [
    {
        "type": "function",
        "function": {
            "name": "regex_scan",
            "description": "Scan text for sensitive patterns like IP addresses, passwords, API keys.",
            "parameters": {
                "type": "object",
                "properties": {
                    "scope": {"type": "string", "default": "all"}
                },
            }
        }
    }
]
 
system_prompt = "You are a security reviewer. Analyze the given text and call tools to check for sensitive information, then output a JSON report."
 
# Use any long-ish content (10K+ characters)
content = "..."

2. Sequential test — 5/5 pass

for i in range(5):
    resp = client.chat.completions.create(
        model="gemma4-31b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
        tools=tools,
        tool_choice="auto",
        max_tokens=4096,
        temperature=0.3,
    )
    output = resp.choices[0].message.content or ""
    print(f"  #{i}: pad={'<pad>' in output}, len={len(output)}")

Result: 5/5 succeed, 0 <pad> occurrences.

3. Concurrent test — some fail

def run_one(idx):
    resp = client.chat.completions.create(
        model="gemma4-31b",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Request {idx}: {content}"},
        ],
        tools=tools,
        tool_choice="auto",
        max_tokens=4096,
        temperature=0.3,
    )
    msg = resp.choices[0].message
    output = msg.content or ""
    return idx, "<pad>" in output, len(output), resp.choices[0].finish_reason
 
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(run_one, i) for i in range(5)]
    for f in concurrent.futures.as_completed(futures):
        idx, has_pad, length, reason = f.result()
        print(f"  #{idx}: pad={has_pad}, len={length}, finish={reason}")

Result: 2/5 produce <pad> output.

  #4: pad=False, len=217, finish=stop
  #2: pad=False, len=204, finish=stop
  #0: pad=False, len=209, finish=stop
  #1: pad=True,  len=20480, finish=no_tools    ← <pad> filled entire output
  #3: pad=True,  len=20418, finish=length      ← <pad> filled entire output

4. Without tools — concurrent requests work fine

When the same content is sent concurrently without the tools parameter, all requests succeed. This indicates that the issue lies in the tool-calling path (the tool-call-parser), not in the model itself.

Key observations

| Test | <pad> rate |
|---|---|
| Sequential, with tools | 0/5 |
| Concurrent, with tools | 2/5 |
| Sequential, without tools | 0/5 |
| Concurrent, without tools | 0/5 |
| Sequential, different temperature (0.3 vs 1.0) | 0/5 for both |

Suspected root cause

The Gemma4ToolParser (added in #38826) likely has shared mutable state that is not thread-safe. Under concurrent requests, the parser state from one request interferes with another, causing the tool call parsing to fail silently. When parsing fails, the model output degrades into <pad> tokens.

This is consistent with other recent issues in the gemma4 parser:

  • #38837 (missing tools parameter)
  • #38855 (reasoning parser fails to separate reasoning_content)
  • #39130 (reasoning-parser silently disables structured output)
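
To make the shared-state hypothesis concrete, here is a toy illustration of the failure shape. It is only a sketch: `ToyStreamParser` is a hypothetical class invented for this example and is not vLLM's Gemma4ToolParser, whose internals are not shown in this issue. The interleaving is written out by hand instead of using threads so the result is deterministic.

```python
class ToyStreamParser:
    """Hypothetical streaming parser that accumulates text chunks in a buffer.

    This is NOT vLLM's Gemma4ToolParser; it only illustrates the failure mode.
    """

    def __init__(self):
        self.buffer = ""  # mutable state that must belong to exactly one request

    def feed(self, chunk):
        self.buffer += chunk
        return self.buffer


# Two requests whose chunks arrive interleaved, as can happen when a server
# handles them concurrently.
stream_a = ['{"name": ', '"regex_scan"}']
stream_b = ['{"name": ', '"other_tool"}']
interleaved = [("a", stream_a[0]), ("b", stream_b[0]),
               ("a", stream_a[1]), ("b", stream_b[1])]

# Case 1: one parser instance shared by both requests (the suspected bug shape).
shared = ToyStreamParser()
shared_view = {}
for request_id, chunk in interleaved:
    shared_view[request_id] = shared.feed(chunk)

# Case 2: one parser instance per request.
per_request = {"a": ToyStreamParser(), "b": ToyStreamParser()}
isolated_view = {}
for request_id, chunk in interleaved:
    isolated_view[request_id] = per_request[request_id].feed(chunk)

print("shared  :", shared_view["a"])
print("isolated:", isolated_view["a"])
```

With the shared instance, request "a" sees request "b"'s chunk spliced into its buffer, so the accumulated text is no longer a valid tool call. With one instance per request, each buffer contains only its own chunks. Whether the real parser fails in this way is exactly what the issue leaves as a suspicion.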

Expected behavior

Concurrent requests with --tool-call-parser gemma4 should produce the same results as sequential requests.

Workaround

Currently using a global lock to serialize all requests to vLLM, which eliminates the <pad> issue but reduces throughput.
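
The issue does not show how the lock is applied. A minimal client-side sketch, assuming all requests originate from a single Python process, could look like the following. `send_serialized` and `fake_send` are hypothetical names; `fake_send` stands in for the real request function and only records how many calls overlap.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

_request_lock = threading.Lock()  # one lock shared by every caller in this process


def send_serialized(send, *args, **kwargs):
    """Call `send` while holding the global lock, so requests never overlap."""
    with _request_lock:
        return send(*args, **kwargs)


# Stand-in for the real request function (an assumption for this sketch).
_in_flight = 0
max_in_flight = 0
_counter_lock = threading.Lock()


def fake_send(idx):
    global _in_flight, max_in_flight
    with _counter_lock:
        _in_flight += 1
        max_in_flight = max(max_in_flight, _in_flight)
    time.sleep(0.01)  # simulate request latency
    with _counter_lock:
        _in_flight -= 1
    return idx


with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(lambda i: send_serialized(fake_send, i), range(5)))

print(f"results={results}, max requests in flight={max_in_flight}")
```

In the reproduction script, `run_one` would take the place of `fake_send`. The lock only covers callers inside this process, so other clients of the same server can still overlap, and throughput drops to one request at a time, as the workaround notes.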


extent analysis

TL;DR

The most likely fix is to remove the suspected non-thread-safe shared mutable state in the Gemma4ToolParser, so that concurrent requests no longer produce <pad> tokens.

Guidance

  • Review the Gemma4ToolParser code to identify and refactor any shared mutable state to ensure thread safety.
  • Consider using synchronization mechanisms, such as locks or atomic operations, to protect access to shared resources.
  • Test the parser with concurrent requests to verify that the issue is resolved.
  • If the issue persists, investigate other potential causes, such as resource contention or model limitations.

Example

No code snippet is provided as the issue is related to a specific parser implementation, and any example would require more context.
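
The third guidance item (testing under concurrency) does not depend on the parser's internals, so it can still be sketched. The harness below is a hedged illustration: `pad_rate` and `stub_send` are hypothetical names, and `stub_send` mimics the reported 2/5 pattern without contacting a server.

```python
from concurrent.futures import ThreadPoolExecutor


def pad_rate(send, n_requests=5, max_workers=5, marker="<pad>"):
    """Run `send(idx)` concurrently and return (failures, total).

    `send` must return the response text for request `idx`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(send, range(n_requests)))
    failures = sum(marker in text for text in outputs)
    return failures, n_requests


# Stub that mimics the reported pattern: requests 1 and 3 return only <pad>.
def stub_send(idx):
    return "<pad>" * 8 if idx in (1, 3) else '{"report": "ok"}'


failures, total = pad_rate(stub_send)
print(f"<pad> rate: {failures}/{total}")
```

To check a real deployment, `stub_send` would be replaced by a function that issues the chat completion from the reproduction steps and returns the message content. A fix can be considered plausible when the concurrent rate stays at 0 across repeated runs, since the failure is intermittent.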

Notes

The provided workaround of using a global lock to serialize requests eliminates the <pad> issue but reduces throughput. A more efficient solution would be to address the root cause of the issue, which is the suspected shared mutable state in the Gemma4ToolParser.

Recommendation

Apply a workaround, such as using a lock or other synchronization mechanism, to ensure thread safety in the Gemma4ToolParser until a permanent fix can be implemented. This will prevent the <pad> issue but may impact performance.
