vllm - ✅(Solved) Fix [Bug]: Gemma4 Fast Prefill Optimization degrades p95 inter-token latency significantly [2 pull requests, 1 comments, 2 participants]

GitHub stats
vllm-project/vllm#41369 · Fetched 2026-05-01 05:33:56
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, labeled ×1, mentioned ×1, subscribed ×1

PR fix notes

PR #38879: [Gemma4] Enable Fast Prefill Optimization

Description (problem / solution / changelog)

Summary

Add --kv-sharing-fast-prefill support for Gemma 4 models, porting the YOCO (You Only Cache Once) fast prefill optimization from Gemma3n. When enabled, the cross-decoder layers (KV-shared) skip prefill tokens and only process decode tokens, significantly reducing prefill latency and improving throughput under concurrent load.

Shout-out to @sarckk for the original optimization (https://github.com/vllm-project/vllm/pull/22628).
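
To make the summary's claim concrete, here is a rough counting model, not the PR's implementation. It takes the wording above at face value: KV-shared (cross-decoder) layers process only decode tokens when fast prefill is on, and every token otherwise. The layer counts and batch sizes are hypothetical placeholders, not Gemma 4's real configuration.

```python
def layer_token_work(num_layers, num_kv_shared_layers,
                     prefill_tokens, decode_tokens, fast_prefill):
    """Count (layer, token) pairs processed in one mixed batch.

    Simplified model: non-shared layers always see every token;
    KV-shared layers see only decode tokens when fast_prefill is True.
    """
    if not 0 <= num_kv_shared_layers <= num_layers:
        raise ValueError("num_kv_shared_layers must be within [0, num_layers]")
    total_tokens = prefill_tokens + decode_tokens
    self_layers = num_layers - num_kv_shared_layers
    shared_tokens = decode_tokens if fast_prefill else total_tokens
    return self_layers * total_tokens + num_kv_shared_layers * shared_tokens


# Hypothetical numbers: 40 layers, 20 of them KV-shared,
# one 8192-token prefill batched with 8 decode tokens.
normal = layer_token_work(40, 20, 8192, 8, fast_prefill=False)
fast = layer_token_work(40, 20, 8192, 8, fast_prefill=True)
print(normal, fast, round(fast / normal, 3))
```

Under this model the saving scales with the share of prefill tokens in a batch, so workloads with short prompts or mostly cached prefixes would have less prefill work to skip.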

Test Plan

GSM8K accuracy (Gemma4-E4B, 5-shot)

# FP=OFF (baseline)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=False \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

# FP=ON (this PR)
lm_eval --model vllm --tasks gsm8k --num_fewshot 5 \
  --model_args pretrained=google/gemma-4-E4B-it,gpu_memory_utilization=0.9,max_model_len=4096,tensor_parallel_size=1,trust_remote_code=True,attention_backend=TRITON_ATTN,kv_sharing_fast_prefill=True \
  --batch_size auto --apply_chat_template --fewshot_as_multiturn

Serving benchmark

# Start server (without fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code

# Start server (with fast prefill)
vllm serve google/gemma-4-E4B-it \
  --port 8434 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --max-num-seqs 128 \
  --max-model-len 32768 \
  --max-num-batched-tokens 8192 \
  --attention-backend TRITON_ATTN \
  --trust-remote-code \
  --kv-sharing-fast-prefill

# Run benchmark (after server is ready)
# concurrency=8
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 8 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

# concurrency=32
vllm bench serve \
  --backend vllm \
  --ignore-eos \
  --port 8434 \
  --model google/gemma-4-E4B-it \
  --dataset-name random \
  --max-concurrency 32 \
  --request-rate inf \
  --num-prompts 256 \
  --random-input-len 8192 \
  --random-output-len 150

Test Results

GSM8K accuracy (Gemma4-E4B, 5-shot)

No accuracy regression:

| Configuration | strict-match | flexible-extract |
|---|---|---|
| FP=OFF (baseline) | 0.1054 | 0.1751 |
| FP=ON (this PR) | 0.1031 | 0.1850 |
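
As a rough sanity check on the "no accuracy regression" reading, the sketch below compares each accuracy gap with a binomial standard error. It assumes the scores were computed on the 1,319-question GSM8K test split and treats the two runs as independent, which overstates the noise somewhat since both runs answer the same questions. It is an illustration, not part of the PR's test plan.

```python
import math

GSM8K_TEST_SIZE = 1319  # assumed evaluation set size


def binomial_se(p, n):
    """Standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


def diff_within_one_se(p_off, p_on, n=GSM8K_TEST_SIZE):
    """True if |p_on - p_off| is below the standard error of the difference."""
    se_diff = math.sqrt(binomial_se(p_off, n) ** 2 + binomial_se(p_on, n) ** 2)
    return abs(p_on - p_off) < se_diff


# Scores copied from the GSM8K results above.
print("strict-match within noise:", diff_within_one_se(0.1054, 0.1031))
print("flexible-extract within noise:", diff_within_one_se(0.1751, 0.1850))
```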

Serving performance (Gemma4-E4B, 1xB200, ISL=8192, OSL=150, n=256)

concurrency=8

| Metric | NORMAL | FAST_PREFILL | Delta |
|---|---|---|---|
| Throughput | 4.22 req/s | 5.06 req/s | +19.9% |
| Mean TTFT | 570 ms | 363 ms | -36.3% |
| Mean TPOT | 8.90 ms | 8.16 ms | -8.3% |

concurrency=32

| Metric | NORMAL | FAST_PREFILL | Delta |
|---|---|---|---|
| Throughput | 6.53 req/s | 9.07 req/s | +38.9% |
| Mean TTFT | 942 ms | 622 ms | -34.0% |
| Mean TPOT | 26.43 ms | 19.37 ms | -26.7% |
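
The Delta column is a relative change against the NORMAL run. The short check below recomputes it from the reported values; the helper name `percent_delta` is illustrative only.

```python
def percent_delta(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (NORMAL, FAST_PREFILL) pairs copied from the serving results above.
results = {
    "c=8 throughput (req/s)": (4.22, 5.06),
    "c=8 mean TTFT (ms)": (570, 363),
    "c=8 mean TPOT (ms)": (8.90, 8.16),
    "c=32 throughput (req/s)": (6.53, 9.07),
    "c=32 mean TTFT (ms)": (942, 622),
    "c=32 mean TPOT (ms)": (26.43, 19.37),
}

for name, (normal, fast) in results.items():
    print(f"{name}: {percent_delta(normal, fast):+.1f}%")
```

Note that these are mean values; the tables say nothing about tail percentiles such as p95.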

Changed files

  • vllm/model_executor/models/gemma4.py (modified, +369/-47)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.17.0-1012-gcp-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.0.88
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA L4
Nvidia driver version        : 595.71.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  4
On-line CPU(s) list:                     0-3
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) CPU @ 2.20GHz
CPU family:                              6
Model:                                   85
Thread(s) per core:                      2
Core(s) per socket:                      2
Socket(s):                               1
Stepping:                                7
BogoMIPS:                                4400.44
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               64 KiB (2 instances)
L1i cache:                               64 KiB (2 instances)
L2 cache:                                2 MiB (2 instances)
L3 cache:                                38.5 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-3

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0+cu130
[pip3] torchvision==0.26.0+cu130
[pip3] transformers==5.6.2
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.0
vLLM Build Flags:
  CUDA Archs: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-3     0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=13.0 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566 brand=unknown,driver>=570,driver<571 brand=grid,driver>=570,driver<571 brand=tesla,driver>=570,driver<571 brand=nvidia,driver>=570,driver<571 brand=quadro,driver>=570,driver<571 brand=quadrortx,driver>=570,driver<571 brand=nvidiartx,driver>=570,driver<571 brand=vapps,driver>=570,driver<571 brand=vpc,driver>=570,driver<571 brand=vcs,driver>=570,driver<571 brand=vws,driver>=570,driver<571 brand=cloudgaming,driver>=570,driver<571 brand=unknown,driver>=575,driver<576 brand=grid,driver>=575,driver<576 brand=tesla,driver>=575,driver<576 brand=nvidia,driver>=575,driver<576 brand=quadro,driver>=575,driver<576 brand=quadrortx,driver>=575,driver<576 
brand=nvidiartx,driver>=575,driver<576 brand=vapps,driver>=575,driver<576 brand=vpc,driver>=575,driver<576 brand=vcs,driver>=575,driver<576 brand=vws,driver>=575,driver<576 brand=cloudgaming,driver>=575,driver<576
TORCH_CUDA_ARCH_LIST=7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=13.0.2
VLLM_ENABLE_CUDA_COMPATIBILITY=0
VLLM_HTTP_TIMEOUT_KEEP_ALIVE=620
LD_LIBRARY_PATH=/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
</details>

🐛 Describe the bug

p95 inter-token latency went up significantly after enabling --kv-sharing-fast-prefill on a Gemma4 model, contrary to the results reported in https://github.com/vllm-project/vllm/pull/38879. The following graph shows p50, p95, and p99 of the metric prometheus/vllm:inter_token_latency_seconds/histogram. You can see where I enabled the new flag.

<img width="1206" height="548" alt="Image" src="https://github.com/user-attachments/assets/db32bce0-bb41-4c10-96bf-314206eb64a4" />
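
For readers unfamiliar with how percentiles come out of a histogram metric: Prometheus estimates a quantile by finding the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified version of that idea, not Prometheus's exact code, and the bucket bounds and counts are made up for illustration; they are not vLLM's real buckets or the reporter's data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with a final +inf bucket holding the total count.
    """
    if not buckets or not 0.0 <= q <= 1.0:
        raise ValueError("need buckets and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            # The +inf bucket has no width: fall back to the last finite bound.
            if math.isinf(upper) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for an inter-token latency histogram.
example = [(0.01, 50), (0.025, 90), (0.05, 98), (0.1, 100), (math.inf, 100)]
for q in (0.5, 0.95, 0.99):
    print(f"p{int(q * 100)} ~ {histogram_quantile(q, example):.4f} s")
```

Because the estimate is bounded by the bucket edges, a small shift in the slowest few percent of tokens can move the reported p95 into a wider bucket and show up as a large jump.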

There are no meaningful improvements in E2E request latency or time to first token. The VMs are deployed on Google Compute Engine. We launch vLLM with this docker run command in a Linux startup script:

docker run -d \
  --gpus all \
  --restart=on-failure \
  --log-driver=gcplogs \
  --log-opt gcp-log-cmd=true \
  --log-opt mode=non-blocking \
  --log-opt max-buffer-size=5m \
  -e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=620 \
  -p 8080:8080 \
  -v "$MIDDLEWARE_HOST_PATH":/workspace/request_middleware.py:ro \
  --name $CONTAINER_NAME \
  "$IMAGE" \
  --port 8080 \
  --model google/gemma-4-E4B-it \
  --served-model-name google/gemma-4-E4B-it google/gemma-3n-E4B-it \
  --max-model-len 3072 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.9 \
  --api-key XXX \
  --disable-access-log-for-endpoints "/health,/metrics,/ping,/ready" \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
  --dtype bfloat16 \
  --no-enable-log-requests \
  --middleware XXX \
  --structured-outputs-config '{"backend":"guidance"}' \
  --async-scheduling \
  --enable-prefix-caching \
  --kv-sharing-fast-prefill

Our Docker image is based on vllm/vllm-openai:v0.20.0.


extent analysis

TL;DR

The most likely workaround is to disable the --kv-sharing-fast-prefill flag, since enabling it coincided with a significant increase in p95 inter-token latency on the Gemma4 model.

Guidance

  • Review the documentation and previous issues related to the --kv-sharing-fast-prefill flag to understand its intended behavior and potential side effects.
  • Test the model without the --kv-sharing-fast-prefill flag to verify if it resolves the latency issue.
  • Investigate other potential factors that could be contributing to the increased latency, such as GPU utilization or memory allocation.
  • Consider reaching out to the vLLM community or support team for further assistance, as the issue may be specific to the Gemma4 model or the Docker image being used.
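
One way to run the with/without comparison suggested above is to generate two server argument lists that differ only in the flag under test. The sketch below is a hypothetical helper, not a vLLM API. It reuses a subset of the flags from the reporter's docker run command and leaves out deployment-specific ones such as the API key and middleware.

```python
import shlex

# Subset of the server arguments from the reported deployment.
BASE_ARGS = [
    "--port", "8080",
    "--model", "google/gemma-4-E4B-it",
    "--max-model-len", "3072",
    "--max-num-seqs", "16",
    "--gpu-memory-utilization", "0.9",
    "--dtype", "bfloat16",
    "--async-scheduling",
    "--enable-prefix-caching",
]

FAST_PREFILL_FLAG = "--kv-sharing-fast-prefill"


def server_args(fast_prefill: bool) -> list[str]:
    """Return the argument list, adding the fast-prefill flag on request."""
    args = list(BASE_ARGS)
    if fast_prefill:
        args.append(FAST_PREFILL_FLAG)
    return args


for label, enabled in (("baseline", False), ("fast-prefill", True)):
    print(f"{label}: {shlex.join(server_args(enabled))}")
```

Keeping every other argument identical means any change in the latency histogram between the two runs can be attributed to the flag rather than to configuration drift.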

Notes

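The --kv-sharing-fast-prefill flag was intended to improve performance, but it had the opposite effect on tail latency in this deployment. The PR's benchmark and the reported setup differ in several ways: the PR measured mean TPOT on 1xB200 with prefix caching disabled and 8192-token inputs, while the report concerns p95 inter-token latency on an NVIDIA L4 with --enable-prefix-caching, --async-scheduling, and --max-model-len 3072. Further investigation is needed to determine which of these differences, if any, is the root cause.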

Recommendation

Apply workaround: Disable the --kv-sharing-fast-prefill flag to see if it resolves the latency issue, and then investigate other potential factors that could be contributing to the problem.
