vllm - 💡(How to fix) Fix [Performance]: Encode performance of vLLM [1 participants]

rahulgurnani · 2026-04-29T18:42:32Z

[vllm] Discussion on performance I am running vLLM with open telemetry tracing enabled to measure time taken in encode. Though vLLM doesn't export spans for en… ### Discussion on performance I am running vLLM with open telemetry tracing enabled to measure time taken in encode. Though vLLM doesn't export spans for encode phase, I assumed that the remaining time during a request processing which is not accounted for in any of the steps. I set max_tokens as 1 so the decode time is almost 0 **Setup** vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node **Sample Trace** Here is an example trace while processing a 1080p image: gen_ai.latency.e2e | 0.293 seconds -- | -- gen_ai.latency.time_in_model_decode | 0.000 seconds gen_ai.latency.time_in_model_inference | 0.126 seconds gen_ai.latency.time_in_model_prefill | 0.126 seconds gen_ai.latency.time_in_queue | 0.000 seconds gen_ai.latency.time_to_first_token | 0.293 seconds gen_ai.request.max_tokens | 1 gen_ai.request.n | 1 gen_ai.request.temperature | 0.01 gen_ai.request.top_p | 1 gen_ai.usage.completion_tokens | 1 gen_ai.usage.prompt_tokens | 2718 Time taken by encoder | 0.167 seconds **Observations** The time taken to encode is more than prefill, I expected the time to be lower. In general for different image sizes, I found that the encode time is similar order of magnitude as prefill. Latency Component | 1080p image | 720p image | 360p image | Text request for contrast -- | -- | -- | -- | -- E2e latency | 0.293 | 0.125 | 0.060 | 2.271 Prefill latency | 0.126 | 0.059 | 0.039 | 0.008 Queue is 0 when only 1 request is sent | 0.000 | 0.000 | 0.000 | 0.000 Decode is 0 when max_tokens=1 | 0.000 | 0.000 | 0.000 | 2.261 Remaining time = Approx Time taken by encoder | 0.167 | 0.066 | 0.020 | 0.00 Encoder time as % of Prefill time | 132.54% | 111.86% | 51.28% | 0.00% **Question** Is it expected that encode would take similar order of magnitude of time as prefill? cc @vMaroon ### Your current environment (if you think it is necessary) vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node in a k8s cluster ### Before submitting a new issue... - [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

vllm2026-04-29 18:42:32

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#41267•Fetched 2026-04-30 06:19:15

View on GitHub

Comments

Participants

Timeline

Reactions

Author

rahulgurnani

Participants

rahulgurnani

Timeline (top)

subscribed ×2labeled ×1mentioned ×1

RAW_BUFFERClick to expand / collapse

Discussion on performance

I am running vLLM with open telemetry tracing enabled to measure time taken in encode. Though vLLM doesn't export spans for encode phase, I assumed that the remaining time during a request processing which is not accounted for in any of the steps. I set max_tokens as 1 so the decode time is almost 0

Setup vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node

Sample Trace Here is an example trace while processing a 1080p image: <meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-f31fcf8b-7fff-bbeb-1cf6-ed4accc7e14a"><div dir="ltr" style="margin-left:0pt;" align="left">

gen_ai.latency.e2e	0.293 seconds
gen_ai.latency.time_in_model_decode	0.000 seconds
gen_ai.latency.time_in_model_inference	0.126 seconds
gen_ai.latency.time_in_model_prefill	0.126 seconds
gen_ai.latency.time_in_queue	0.000 seconds
gen_ai.latency.time_to_first_token	0.293 seconds
gen_ai.request.max_tokens	1
gen_ai.request.n	1
gen_ai.request.temperature	0.01
gen_ai.request.top_p	1
gen_ai.usage.completion_tokens	1
gen_ai.usage.prompt_tokens	2718
Time taken by encoder	0.167 seconds

</div></b>

Observations The time taken to encode is more than prefill, I expected the time to be lower. In general for different image sizes, I found that the encode time is similar order of magnitude as prefill.

Latency Component	1080p image	720p image	360p image	Text request for contrast
E2e latency	0.293	0.125	0.060	2.271
Prefill latency	0.126	0.059	0.039	0.008
Queue is 0 when only 1 request is sent	0.000	0.000	0.000	0.000
Decode is 0 when max_tokens=1	0.000	0.000	0.000	2.261
Remaining time = Approx Time taken by encoder	0.167	0.066	0.020	0.00
Encoder time as % of Prefill time	132.54%	111.86%	51.28%	0.00%

</div></b>

Question

Is it expected that encode would take similar order of magnitude of time as prefill?

cc @vMaroon

Your current environment (if you think it is necessary)

vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node in a k8s cluster

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The encode time taking a similar order of magnitude as prefill time may not be expected and could be worth investigating further for optimization opportunities.

Guidance

Review the documentation for vLLM v0.19.1 to understand the expected performance characteristics of the encode phase compared to prefill.
Investigate how the image size affects the encode time, as the provided data suggests a significant difference in encode times for different image sizes.
Consider reaching out to the vLLM community or support for more insight into the expected performance of the encode phase, especially given the specific model and hardware configuration.
Analyze the system resources (e.g., CPU, memory, GPU utilization) during the encode phase to identify potential bottlenecks.

Notes

The provided data suggests an unexpected performance characteristic of the encode phase, but without more information on the expected behavior or additional context about the system's configuration and resource utilization, it's challenging to provide a definitive solution.

Recommendation

Apply workaround: Investigate and potentially optimize the encode phase for better performance, as the current behavior seems unexpected and might be impacting overall system efficiency.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #parallel task #integration issue #index setup

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Performance]: Encode performance of vLLM [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Performance]: Encode performance of vLLM [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING