vllm - 💡(How to fix) Fix [Performance]: Encode performance of vLLM [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41267Fetched 2026-04-30 06:19:15
View on GitHub
Comments
0
Participants
1
Timeline
4
Reactions
0
Participants
Timeline (top)
subscribed ×2labeled ×1mentioned ×1
RAW_BUFFERClick to expand / collapse

Discussion on performance

I am running vLLM with open telemetry tracing enabled to measure time taken in encode. Though vLLM doesn't export spans for encode phase, I assumed that the remaining time during a request processing which is not accounted for in any of the steps. I set max_tokens as 1 so the decode time is almost 0

Setup vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node

Sample Trace Here is an example trace while processing a 1080p image: <meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-f31fcf8b-7fff-bbeb-1cf6-ed4accc7e14a"><div dir="ltr" style="margin-left:0pt;" align="left">

gen_ai.latency.e2e0.293 seconds
gen_ai.latency.time_in_model_decode0.000 seconds
gen_ai.latency.time_in_model_inference0.126 seconds
gen_ai.latency.time_in_model_prefill0.126 seconds
gen_ai.latency.time_in_queue0.000 seconds
gen_ai.latency.time_to_first_token0.293 seconds
gen_ai.request.max_tokens1
gen_ai.request.n1
gen_ai.request.temperature0.01
gen_ai.request.top_p1
gen_ai.usage.completion_tokens1
gen_ai.usage.prompt_tokens2718
Time taken by encoder0.167 seconds
</div></b>

Observations The time taken to encode is more than prefill, I expected the time to be lower. In general for different image sizes, I found that the encode time is similar order of magnitude as prefill.

<meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-97c37505-7fff-fb1f-1813-884ddad0f63f"><div dir="ltr" style="margin-left:0pt;" align="left">

Latency Component1080p image720p image360p imageText request for contrast
E2e latency0.2930.1250.0602.271
Prefill latency0.1260.0590.0390.008
Queue is 0 when only 1 request is sent0.0000.0000.0000.000
Decode is 0 when max_tokens=10.0000.0000.0002.261
Remaining time = Approx Time taken by encoder0.1670.0660.0200.00
Encoder time as % of Prefill time132.54%111.86%51.28%0.00%
</div></b>

Question

Is it expected that encode would take similar order of magnitude of time as prefill?

cc @vMaroon

Your current environment (if you think it is necessary)

vLLM: v0.19.1 Model/hardware: Qwen-vl 2.5 7B deployed on a h100 80 GB node in a k8s cluster

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The encode time taking a similar order of magnitude as prefill time may not be expected and could be worth investigating further for optimization opportunities.

Guidance

  • Review the documentation for vLLM v0.19.1 to understand the expected performance characteristics of the encode phase compared to prefill.
  • Investigate how the image size affects the encode time, as the provided data suggests a significant difference in encode times for different image sizes.
  • Consider reaching out to the vLLM community or support for more insight into the expected performance of the encode phase, especially given the specific model and hardware configuration.
  • Analyze the system resources (e.g., CPU, memory, GPU utilization) during the encode phase to identify potential bottlenecks.

Notes

The provided data suggests an unexpected performance characteristic of the encode phase, but without more information on the expected behavior or additional context about the system's configuration and resource utilization, it's challenging to provide a definitive solution.

Recommendation

Apply workaround: Investigate and potentially optimize the encode phase for better performance, as the current behavior seems unexpected and might be impacting overall system efficiency.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Performance]: Encode performance of vLLM [1 participants]