vllm - [Feature]: Per-request timing metrics in response body [2 comments, 2 participants]

vllm-project/vllm#40076 · Fetched 2026-04-17 08:27:22

Add an opt-in capability for vLLM to return per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. The feature would be gated by a server-level flag (e.g. --enable-per-request-metrics) plus a per-request parameter (e.g. include_metrics: true), and would expose a structured metrics object alongside the normal response payload.

This issue is intentionally filed as a companion / counter-proposal to #36189, which proposes exposing the same information via HTTP response headers. A draft implementation of the body-based approach already exists in #36383.


🚀 The feature, motivation and pitch

Motivation

(The motivation here largely overlaps with #36189; the only real difference is the delivery mechanism.)

vLLM already tracks detailed per-request timing internally (queue time, prefill time, decode time, inter-token latency, etc.) via RequestStateStats, and surfaces aggregated versions of this data through Prometheus metrics and OpenTelemetry traces. Those are backend-only, aggregate observability tools — they do not let an API consumer see where time was spent on their specific request.

Exposing this data to API consumers directly unlocks two use cases:

1. Per-user / per-tenant billing and cost attribution

Operators running multi-tenant deployments need to attribute GPU time and token counts back to individual requests for usage-based billing and chargeback. Prometheus gives aggregates per endpoint/model, not per request. Having generation_time_ms, queue_time_ms, prompt_tokens, and completion_tokens in the response body means the billing system that is already parsing the response JSON has everything it needs in one place, with no separate infrastructure.

2. Per-request SLA attribution and latency debugging

Application developers building on top of vLLM currently see only total latency. With a structured metrics field they can distinguish:

  • Time waiting in the scheduler queue (capacity issue)
  • Time in prefill / time-to-first-token (prompt cost)
  • Time in decode / inter-token latency (generation cost)

This makes it trivial to add per-request SLA tracking to an application without running a Prometheus scraper or an OTEL collector.

Why the response body (and not only headers, as in #36189)

The headers-only approach proposed in #36189 is attractive for proxies and load balancers, but it has a hard limitation that a body-based approach does not:

  • Streaming support - Headers are flushed before the first token. Metrics that are only known at end-of-generation (generation_time_ms, mean_itl_ms) cannot be carried in headers without HTTP trailers, which have very limited client/proxy support. A final SSE event (or final chunk) carries the completed metrics naturally and is trivial for clients to consume.

The headers and body approaches are complementary, not mutually exclusive. Routers that want real-time hot-path signals benefit from headers; billing pipelines and SDK users need the body. The position of this RFE is that the body-based API covers cases that headers cannot.

Proposal

Opt-in flags (double gate)

vllm serve <model> --enable-per-request-metrics

Plus a per-request parameter:

{ "model": "...", "messages": [...], "include_metrics": true }

Both must be set for metrics to be computed and returned. Default: off (no behavior change for existing users, no CPU overhead for deployments that do not opt in).
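The double gate amounts to a logical AND of the server flag and the request parameter. A minimal sketch, assuming hypothetical names (ServerConfig, should_emit_metrics) that are for illustration only and are not taken from the draft in #36383:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    # Hypothetical mirror of the proposed --enable-per-request-metrics flag.
    enable_per_request_metrics: bool = False


def should_emit_metrics(config: ServerConfig, request_body: dict) -> bool:
    """Return True only when both the server flag and the request opt-in are set."""
    # Assumption: only a literal JSON `true` counts as an opt-in;
    # a missing key, null, or the string "true" leaves metrics off.
    requested = request_body.get("include_metrics") is True
    return config.enable_per_request_metrics and requested
```

Checking the server flag first means a deployment that never opts in pays nothing beyond one boolean test per request, which matches the "default: off" intent above.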

Response body additions

A new optional metrics field on ChatCompletionResponse, ChatCompletionStreamResponse, CompletionResponse, and CompletionStreamResponse:

| Field | Unit | Description |
|---|---|---|
| time_to_first_token_ms | ms | Time from scheduling to first output token |
| queue_time_ms | ms | Time spent waiting in the scheduler queue |
| generation_time_ms | ms | Total decode time (excludes queue wait) |
| mean_itl_ms | ms | Mean inter-token latency during decode |
| tokens_per_second | count / s | Output throughput for this request |

(Prefill-time, cached-token, and per-phase GPU-time fields can be added incrementally as they become cleanly attributable to a single request.)
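To make the field definitions concrete, here is a rough sketch of how the five values could be derived from four timestamps and a token count. Everything in it is an assumption for illustration: the RequestTimestamps type, the build_metrics helper, and in particular the reading of generation_time_ms as "from scheduling to last token" are not taken from vLLM or from #36383.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestTimestamps:
    # Hypothetical inputs: seconds on a single monotonic clock.
    arrival: float        # request entered the scheduler queue
    scheduled: float      # request left the queue
    first_token: float    # first output token produced
    last_token: float     # last output token produced
    num_output_tokens: int


def _ms(seconds: float) -> float:
    """Convert seconds to milliseconds, rounded to microsecond precision."""
    return round(seconds * 1000.0, 3)


def build_metrics(ts: RequestTimestamps) -> dict:
    """Derive the proposed metrics fields from raw timestamps."""
    # Assumption: generation time covers everything after the queue wait.
    generation_s = ts.last_token - ts.scheduled
    decode_s = ts.last_token - ts.first_token
    intervals = ts.num_output_tokens - 1  # gaps between consecutive tokens

    mean_itl_ms: Optional[float] = _ms(decode_s / intervals) if intervals > 0 else None
    tokens_per_second: Optional[float] = (
        round(ts.num_output_tokens / generation_s, 3) if generation_s > 0 else None
    )
    return {
        "time_to_first_token_ms": _ms(ts.first_token - ts.scheduled),
        "queue_time_ms": _ms(ts.scheduled - ts.arrival),
        "generation_time_ms": _ms(generation_s),
        "mean_itl_ms": mean_itl_ms,
        "tokens_per_second": tokens_per_second,
    }
```

Single-token responses have no inter-token gap, so the sketch reports mean_itl_ms as None rather than dividing by zero; the proposal itself does not say how that case should be handled.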

Streaming behavior

  • For non-streaming responses: metrics is populated on the final response object.
  • For streaming responses: metrics is emitted on the final SSE chunk — consistent with how OpenAI already emits final usage when stream_options.include_usage=true.
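On the client side, picking the metrics out of a stream could look like the sketch below. It assumes the usual OpenAI-style SSE framing ("data: {json}" lines terminated by "data: [DONE]") and a top-level metrics key on the final chunk as proposed here; extract_stream_metrics is a hypothetical helper, not an existing SDK function.

```python
import json
from typing import Iterable, Optional


def extract_stream_metrics(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last `metrics` object seen in an SSE stream, or None."""
    metrics = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        # Assumption: the proposed field is a JSON object at the top level.
        if isinstance(chunk.get("metrics"), dict):
            metrics = chunk["metrics"]
    return metrics
```

Because the helper simply remembers the last chunk that carried a metrics object, it works whether the server attaches the field to the final content chunk or to a separate trailing chunk.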

Alternatives

| Alternative | Why Not |
|---|---|
| Headers only (#36189) | Cannot carry end-of-generation timing for streaming cases. |
| Prometheus only | Aggregate, not per-request; requires scraping infrastructure; invisible to the caller. |
| OpenTelemetry traces | Requires a tracing backend; not accessible over plain HTTP; high operational overhead. |
| Always-on body fields | Adds CPU cost and response size for deployments that don't care. Opt-in keeps zero overhead as the default. |
| Custom middleware | Can only observe wall-clock time; cannot reach engine-internal timings (queue / prefill / decode). |

Additional context

Relationship to existing work

  • Draft implementation: #36383 (already implements the core shape proposed here, including PerRequestTimingMetrics and the double-gate flag).
  • Companion RFE for headers: #36189. That issue remains a valid proposal for router-facing, hot-path metrics; this RFE is scoped to the body-side API that headers cannot replace.

Why opt-in?

  • Zero overhead when disabled - no extra computation, no extra fields serialized.
  • Response size - clients doing strict schema validation shouldn't see new fields unless they ask for them.
  • Information disclosure - timing data can reveal server capacity characteristics; operators should choose to expose it.


extent analysis

TL;DR

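The issue proposes returning per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. Under the proposal, metrics are returned only when the server is started with --enable-per-request-metrics and the request sets include_metrics: true. Both names are the issue's own examples; a draft implementation exists in #36383.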

Guidance

  • Double gate: the server is started with --enable-per-request-metrics and the request payload sets include_metrics: true. Metrics are computed and returned only when both are set.
  • Response shape: a new optional metrics field carries time_to_first_token_ms, queue_time_ms, generation_time_ms, mean_itl_ms, and tokens_per_second.
  • Streaming: the metrics field is emitted on the final SSE chunk.
  • Headers (#36189) and body are complementary: headers suit routers and load balancers, while the body suits billing pipelines and SDK users. Only the body can carry end-of-generation timings for streaming responses.

Example

{
  "model": "...",
  "messages": [...],
  "include_metrics": true
}

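Response (illustrative values; the shape follows the proposal and may differ in the final implementation):

{
  "choices": [...],
  "usage": {...},
  "metrics": {
    "time_to_first_token_ms": 10,
    "queue_time_ms": 5,
    "generation_time_ms": 200,
    "mean_itl_ms": 10,
    "tokens_per_second": 100
  }
}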

Notes

The body-based and header-based approaches are complementary, not mutually exclusive, so an implementation should let both coexist rather than force a choice between them.

Recommendation

Adopt the double-gated metrics field in the response body for use cases that need end-of-generation timings, streaming in particular. This is a feature proposal with a draft implementation in #36383, not a workaround for a bug.
