vllm - ✅(Solved) Fix [Feature]: Add return_progress parameter to stream prompt processing progress during prefill [1 pull requests]


Fix Action

Fixed

PR fix notes

PR #40371: [Feat] added support for prompt_progress report

Description (problem / solution / changelog)

fixes #40362

When return_progress is passed in the request kwargs in stream mode, the server returns a prompt_progress object in the stream.

Example: {"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

SSE example:

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":2112,"time_ms":2892}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}],"prompt_token_ids":null}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":3168,"time_ms":3488}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":4224,"time_ms":3836}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":5280,"time_ms":4185}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":6336,"time_ms":4535}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":7392,"time_ms":4888}}

data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk","created":1776696288,"model":"cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit","choices":[],"prompt_progress":{"total":102623,"cache":0,"processed":8448,"time_ms":5242}}

UX-wise, with very long prompts the user has to wait without knowing when the LLM answer will finally start, which can make the user anxious.

You don't want the user to be anxious.
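A client reading the stream above has to tell progress chunks (empty choices plus a prompt_progress object) apart from ordinary delta chunks. The following is a minimal sketch using only the Python standard library; the function name parse_sse_line and the tuple shape it returns are illustrative assumptions, not part of vLLM or of the PR.

```python
import json


def parse_sse_line(line):
    """Classify one SSE line from the stream.

    Returns ("progress", dict), ("delta", str), ("done", None),
    or None for lines that carry nothing of interest.
    (Hypothetical helper; the return shape is an assumption.)
    """
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank separator lines, comments, other SSE fields
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return ("done", None)  # end-of-stream sentinel of OpenAI-style streams
    chunk = json.loads(payload)
    progress = chunk.get("prompt_progress")
    if progress is not None:
        return ("progress", progress)
    choices = chunk.get("choices") or []
    if choices:
        delta = choices[0].get("delta", {})
        return ("delta", delta.get("content") or "")
    return None


sample = (
    'data: {"id":"chatcmpl-978697bee5dbce5b","object":"chat.completion.chunk",'
    '"choices":[],"prompt_progress":{"total":102623,"cache":0,'
    '"processed":2112,"time_ms":2892}}'
)
kind, value = parse_sse_line(sample)
print(kind, value["processed"], value["total"])
```

Progress chunks are checked first because, in the stream shown above, they arrive with an empty choices list and would otherwise be dropped.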

Changed files

  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +2/-0)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +21/-0)
  • vllm/entrypoints/openai/completion/protocol.py (modified, +2/-0)
  • vllm/entrypoints/openai/completion/serving.py (modified, +26/-0)
  • vllm/outputs.py (modified, +2/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +3/-0)
  • vllm/v1/engine/__init__.py (modified, +1/-0)
  • vllm/v1/engine/output_processor.py (modified, +8/-2)
  • vllm/v1/metrics/stats.py (modified, +4/-3)

Code Example

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

🚀 The feature, motivation and pitch

When serving very long contexts (100k+ tokens), clients have no visibility into prefill progress: the connection appears frozen until the first generated token arrives, which can take tens of seconds or more.

llama.cpp server implements this via a return_progress parameter: when set alongside stream: true, the server emits SSE chunks during prefill with a prompt_progress field:

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

Fields:

  • total — total prompt tokens
  • cache — tokens already served from prefix cache
  • processed — tokens processed so far
  • time_ms — elapsed time since prefill started
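As a rough sketch of how these four fields could be turned into a completion fraction and a time estimate: the helper below assumes that processed includes the tokens counted in cache (so that processed - cache is the work actually computed during time_ms), and it extrapolates linearly, which is only an approximation. The name prefill_progress is hypothetical.

```python
def prefill_progress(p):
    """Return (overall_fraction, eta_ms) for one prompt_progress dict.

    Assumption: `processed` includes the `cache` tokens, so only
    `processed - cache` tokens were computed within `time_ms`.
    eta_ms is None when no rate can be derived yet.
    """
    total = p["total"]
    cache = p["cache"]
    processed = p["processed"]
    time_ms = p["time_ms"]

    overall = processed / total if total else 1.0
    computed = processed - cache   # tokens actually prefilled so far
    remaining = total - processed  # tokens still to prefill
    eta_ms = None
    if computed > 0 and time_ms > 0:
        # Linear extrapolation from the observed rate.
        eta_ms = remaining * time_ms / computed
    return overall, eta_ms


overall, eta_ms = prefill_progress(
    {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}
)
print(overall, eta_ms)  # 0.25 2040.0
```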

Alternatives

Polling /metrics is possible but requires a separate connection, adds complexity client-side, and the Prometheus metrics don't map cleanly to per-request prefill progress.

Additional context

Reference implementation: tools/server in llama.cpp. The parameter would naturally be passed via extra_body with the OpenAI Python SDK since it's outside the OpenAI spec.
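A minimal sketch of what passing the parameter through extra_body might look like. The helper name build_progress_request, the model name, and the base URL in the comments are placeholders chosen for illustration; the sketch only builds the keyword arguments, so it runs without the SDK or a server.

```python
def build_progress_request(model, messages):
    """Build keyword arguments for a streaming chat completion call
    with prefill progress enabled. (Hypothetical helper.)"""
    return {
        "model": model,
        "messages": messages,
        "stream": True,  # progress is only reported in stream mode
        # The OpenAI Python SDK merges extra_body into the JSON request
        # body, which is how parameters outside the OpenAI spec are sent.
        "extra_body": {"return_progress": True},
    }


kwargs = build_progress_request(
    "my-model",  # placeholder model name
    [{"role": "user", "content": "Summarize this document."}],
)
print(kwargs["extra_body"])

# With the SDK installed and a compatible server running (both assumed):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   stream = client.chat.completions.create(**kwargs)
```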


extent analysis

TL;DR

When serving very long contexts, clients have no visibility into prefill progress. PR #40371 addresses this with a return_progress parameter that, together with stream: true, makes the server emit SSE chunks carrying a prompt_progress field.

Guidance

  • Set return_progress to true when calling the server with stream: true to enable progress updates.
  • Verify that the server emits SSE chunks with a prompt_progress field, which contains fields like total, cache, processed, and time_ms.
  • Use the prompt_progress field to update the client-side progress indicator, providing visibility into prefill progress.
  • Consider using the reference implementation in tools/server in llama.cpp as a guide for implementing the return_progress parameter.

Example

{"prompt_progress": {"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340}}

This example shows the format of the prompt_progress field, which can be used to update the client-side progress indicator.
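One possible way to drive a client-side indicator from that object is a plain-text progress bar. This is an illustrative sketch only: render_bar and its output format are assumptions, not something defined by the issue or the PR.

```python
def render_bar(progress, width=20):
    """Render one prompt_progress dict as a text progress bar.
    (Hypothetical helper; the output format is an arbitrary choice.)"""
    total = progress["total"]
    processed = min(progress["processed"], total)  # clamp defensively
    fraction = processed / total if total else 1.0
    filled = int(width * fraction)
    bar = "#" * filled + "-" * (width - filled)
    return (
        f"[{bar}] {fraction:.0%} "
        f"({processed}/{total} tokens, {progress['time_ms']} ms)"
    )


line = render_bar({"total": 4096, "cache": 512, "processed": 1024, "time_ms": 340})
print(line)  # [#####---------------] 25% (1024/4096 tokens, 340 ms)
```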

Notes

The return_progress parameter originates in the llama.cpp server and is outside the OpenAI spec. In vLLM it applies only to builds that include the change from PR #40371, and other servers or implementations may not support it.

Recommendation

Use the return_progress parameter rather than polling /metrics: it reports per-request prefill progress on the existing streaming connection, which directly addresses the lack of visibility into prefill progress.
