vllm - ✅(Solved) Fix [Feature]: Add return_progress parameter to stream prompt processing progress during prefill [1 pull requests]

vllm · 2026-04-20 13:16:50


Fix Action

Fixed

Fixed by PR: [Feat] added support for prompt_progress report (https://github.com/vllm-project/vllm/pull/40371)

PR fix notes

PR #40371: [Feat] added support for prompt_progress report

Repository: vllm-project/vllm
Author: ExtReMLapin
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40371

Description (problem / solution / changelog)

fixes #40362

When return_progress is passed in the request kwargs (in streaming mode), the response stream includes a prompt_progress field.

Example: {"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

SSE example:

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":2112,"time_ms":2892}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":3168,"time_ms":3488}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":4224,"time_ms":3836}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":5280,"time_ms":4185}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":6336,"time_ms":4535}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":7392,"time_ms":4888}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":8448,"time_ms":5242}}
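The stream above interleaves progress-only chunks (empty choices, a prompt_progress object) with ordinary completion chunks. A minimal sketch of pulling the progress updates out of raw SSE lines follows, assuming the chunk shape shown in the sample; the helper name iter_prompt_progress and the shortened sample lines are illustrative and not part of the PR.

```python
import json


def iter_prompt_progress(sse_lines):
    """Yield the prompt_progress dict from each SSE data line that carries one."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # OpenAI-style end-of-stream sentinel
        chunk = json.loads(payload)
        progress = chunk.get("prompt_progress")
        if progress is not None:
            yield progress


# Shortened copies of the sample chunks above (hypothetical id "x").
sample = [
    'data: {"id":"x","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":2112,"time_ms":2892}}',
    "",
    'data: {"id":"x","choices":[{"index":0,"delta":{"role":"assistant","content":""}}],"prompt_token_ids":null}',
    'data: {"id":"x","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":3168,"time_ms":3488}}',
    "data: [DONE]",
]

updates = list(iter_prompt_progress(sample))
for p in updates:
    print(f'{p["processed"]}/{p["total"]} tokens after {p["time_ms"]} ms')
```

Chunks without a prompt_progress key are skipped rather than treated as errors, so the same loop works whether or not the server supports the parameter.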

UX-wise, for very long prompts, the user has to wait without knowing when the LLM will finally start answering, which can make the user anxious.

You don't want the user to be anxious.

Changed files

vllm/entrypoints/openai/chat_completion/protocol.py (modified, +2/-0)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +21/-0)
vllm/entrypoints/openai/completion/protocol.py (modified, +2/-0)
vllm/entrypoints/openai/completion/serving.py (modified, +26/-0)
vllm/outputs.py (modified, +2/-0)
vllm/v1/core/sched/scheduler.py (modified, +3/-0)
vllm/v1/engine/__init__.py (modified, +1/-0)
vllm/v1/engine/output_processor.py (modified, +8/-2)
vllm/v1/metrics/stats.py (modified, +4/-3)

Code Example

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}


🚀 The feature, motivation and pitch

When serving very long contexts (100k+ tokens), clients have no visibility into prefill progress: the connection appears frozen until the first generated token arrives, which can take tens of seconds or more.

llama.cpp server implements this via a return_progress parameter: when set alongside stream: true, the server emits SSE chunks during prefill with a prompt_progress field:

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

Fields:

total — total prompt tokens
cache — tokens already served from prefix cache
processed — tokens processed so far
time_ms — elapsed time since prefill started
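As a rough illustration of how a client might use these four fields, the sketch below derives an overall completion fraction and a naive time-remaining estimate. It assumes that processed already includes the cached tokens (the llama.cpp convention); the issue does not confirm whether the vLLM PR counts it the same way, and the function name prefill_estimate is hypothetical.

```python
def prefill_estimate(progress):
    """Return (overall_fraction, eta_ms) for one prompt_progress payload.

    Assumption: `processed` includes tokens served from the prefix cache,
    so the work actually timed by `time_ms` is `processed - cache` tokens.
    """
    total = progress["total"]
    cache = progress["cache"]
    processed = progress["processed"]
    time_ms = progress["time_ms"]

    overall = processed / total if total else 1.0
    computed = processed - cache   # tokens actually computed so far
    remaining = total - processed  # tokens still to prefill
    if computed <= 0 or time_ms <= 0:
        return overall, None       # not enough data for a rate yet
    # Linear extrapolation: remaining tokens at the observed ms-per-token.
    eta_ms = remaining * time_ms / computed
    return overall, eta_ms


overall, eta_ms = prefill_estimate(
    {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}
)
print(f"{overall:.0%} done, about {eta_ms:.0f} ms remaining")
```

The estimate is linear, so it will drift if prefill throughput changes with context length; it is meant for a progress indicator, not for scheduling.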

Alternatives

Polling /metrics is possible but requires a separate connection, adds complexity client-side, and the Prometheus metrics don't map cleanly to per-request prefill progress.

Additional context

Reference implementation: tools/server in llama.cpp. The parameter would naturally be passed via extra_body with the OpenAI Python SDK since it's outside the OpenAI spec.
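A minimal sketch of such a request follows. It only builds the keyword arguments so that it runs without a server; the model name, prompt, and base URL in the comments are placeholders, and whether a given server honors return_progress depends on it implementing the parameter.

```python
def build_stream_request(model, messages):
    """Build kwargs for a streaming chat completion that asks for prefill progress."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,  # progress chunks are only emitted in stream mode
        # Non-standard parameters go through extra_body, which the OpenAI
        # Python SDK merges into the JSON request body.
        "extra_body": {"return_progress": True},
    }


kwargs = build_stream_request(
    "my-model",  # placeholder model name
    [{"role": "user", "content": "Summarize the attached document."}],
)
print(kwargs)

# With the OpenAI Python SDK installed and a compatible server running
# (the base_url below is an assumption, adjust to your deployment):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   stream = client.chat.completions.create(**kwargs)
```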


extent analysis

TL;DR

Clients have no visibility into prefill progress when the server handles very long contexts. The proposed fix is a return_progress parameter that, set alongside stream: true, makes the server emit SSE chunks carrying a prompt_progress field.

Guidance

Set return_progress to true when calling the server with stream: true to enable progress updates.
Verify that the server emits SSE chunks with a prompt_progress field, which contains fields like total, cache, processed, and time_ms.
Use the prompt_progress field to update the client-side progress indicator, providing visibility into prefill progress.
Consider using the reference implementation in tools/server in llama.cpp as a guide for implementing the return_progress parameter.
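The guidance above can be sketched as a small consumer that routes each streamed chunk either to a progress indicator or to the generated text. This version works on plain dicts shaped like the SSE payloads in the PR description; the function names and the text-bar rendering are illustrative choices, not part of vLLM or llama.cpp.

```python
def render_bar(progress, width=20):
    """Render a prompt_progress payload as a fixed-width text progress bar."""
    total = progress["total"]
    frac = progress["processed"] / total if total else 1.0
    frac = max(0.0, min(1.0, frac))  # clamp against odd server values
    filled = int(frac * width)
    return "[" + "#" * filled + "." * (width - filled) + f"] {frac:.0%}"


def consume(chunks):
    """Split a stream of chunk dicts into progress bars and generated text."""
    bars, text = [], []
    for chunk in chunks:
        progress = chunk.get("prompt_progress")
        if progress is not None:
            bars.append(render_bar(progress))  # prefill still running
            continue
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text.append(content)           # normal token delta
    return bars, "".join(text)


bars, answer = consume([
    {"choices": [], "prompt_progress": {"total": 100, "cache": 0, "processed": 50, "time_ms": 10}},
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Hi"}}]},
])
print(bars, answer)
```

Treating a missing prompt_progress key as "no update" keeps the client compatible with servers that ignore the parameter.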

Example

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

This example shows the format of the prompt_progress field, which can be used to update the client-side progress indicator.

Notes

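The return_progress parameter originates in the llama.cpp server and is outside the OpenAI spec. In vLLM it is proposed by PR #40371, which is listed above as open and not merged, so it may not be available in other servers or in vLLM builds that lack this PR.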

Recommendation

Use return_progress together with stream: true once the server supports it (in vLLM, through PR #40371). It gives clients visibility into prefill progress over the existing connection, without a separate /metrics poll.
