vllm - 💡(How to fix) Fix [Feature]: Per-request timing metrics in response body [2 comments, 2 participants]

Root Cause

Add an opt-in capability for vLLM to return per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. The feature would be gated by a server-level flag (e.g. --enable-per-request-metrics) plus a per-request parameter (e.g. include_metrics: true), and would expose a structured metrics object alongside the normal response payload.

This issue is intentionally filed as a companion / counter-proposal to #36189, which proposes exposing the same information via HTTP response headers. A draft implementation of the body-based approach already exists in #36383.

🚀 The feature, motivation and pitch

Summary

Motivation

(The motivation here largely overlaps with #36189; the only real difference is the delivery mechanism.)

vLLM already tracks detailed per-request timing internally (queue time, prefill time, decode time, inter-token latency, etc.) via RequestStateStats, and surfaces aggregated versions of this data through Prometheus metrics and OpenTelemetry traces. Those are backend-only, aggregate observability tools — they do not let an API consumer see where time was spent on their specific request.

Exposing this data to API consumers directly unlocks two use cases:

1. Per-user / per-tenant billing and cost attribution

Operators running multi-tenant deployments need to attribute GPU time and token counts back to individual requests for usage-based billing and chargeback. Prometheus gives aggregates per endpoint/model, not per request. Having generation_time_ms, queue_time_ms, prompt_tokens, and completion_tokens in the response body means the billing system that is already parsing the response JSON has everything it needs in one place, with no separate infrastructure.
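As a rough illustration of the billing use case, the sketch below sums the proposed fields per tenant from already-parsed response bodies. It assumes the proposed metrics object sits next to the standard OpenAI-style usage object (which carries prompt_tokens and completion_tokens); aggregate_usage and the tenant identifiers are hypothetical names, not part of vLLM.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_usage(records: Iterable[Tuple[str, dict]]) -> Dict[str, Dict[str, float]]:
    """Sum per-request metrics by tenant.

    Each record is (tenant_id, parsed_response_json). The tenant id is assumed
    to come from the caller's own auth layer, not from vLLM.
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {
            "generation_time_ms": 0.0,
            "queue_time_ms": 0.0,
            "prompt_tokens": 0,
            "completion_tokens": 0,
        }
    )
    for tenant_id, response in records:
        metrics = response.get("metrics") or {}  # absent when not opted in
        usage = response.get("usage") or {}
        bucket = totals[tenant_id]
        bucket["generation_time_ms"] += metrics.get("generation_time_ms", 0.0)
        bucket["queue_time_ms"] += metrics.get("queue_time_ms", 0.0)
        bucket["prompt_tokens"] += usage.get("prompt_tokens", 0)
        bucket["completion_tokens"] += usage.get("completion_tokens", 0)
    return dict(totals)


# Illustrative values only.
sample_records = [
    ("tenant-a", {"usage": {"prompt_tokens": 10, "completion_tokens": 4},
                  "metrics": {"generation_time_ms": 20.0, "queue_time_ms": 5.0}}),
    ("tenant-a", {"usage": {"prompt_tokens": 6, "completion_tokens": 2},
                  "metrics": {"generation_time_ms": 30.0, "queue_time_ms": 1.0}}),
    ("tenant-b", {"usage": {"prompt_tokens": 3, "completion_tokens": 1}}),
]
billing_totals = aggregate_usage(sample_records)
```

Responses without a metrics object (for example, requests that did not opt in) still contribute their token counts, so the aggregation degrades gracefully.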

2. Per-request SLA attribution and latency debugging

Application developers building on top of vLLM currently see only total latency. With a structured metrics field they can distinguish:

Time waiting in the scheduler queue (capacity issue)
Time in prefill / time-to-first-token (prompt cost)
Time in decode / inter-token latency (generation cost)

This makes it trivial to add per-request SLA tracking to an application without running a Prometheus scraper or an OTEL collector.

Why the response body (and not only headers, as in #36189)

The headers-only approach proposed in #36189 is attractive for proxies and load balancers, but it has a hard limitation that a body-based approach does not:

Streaming support - Headers are flushed before the first token. Metrics that are only known at end-of-generation (generation_time_ms, mean_itl_ms) cannot be carried in headers without HTTP trailers, which have very limited client/proxy support. A final SSE event (or final chunk) carries the completed metrics naturally and is trivial for clients to consume.

The headers and body approaches are complementary, not mutually exclusive. Routers that want real-time hot-path signals benefit from headers; billing pipelines and SDK users need the body. The position of this RFE is that the body-based API covers cases that headers cannot.

Proposal

Opt-in flags (double gate)

vllm serve <model> --enable-per-request-metrics

Plus a per-request parameter:

{ "model": "...", "messages": [...], "include_metrics": true }

Both must be set for metrics to be computed and returned. Default: off (no behavior change for existing users, no CPU overhead for deployments that do not opt in).
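The double gate can be sketched as a simple conjunction. This is a minimal illustration, not the code from #36383: metrics_enabled and attach_metrics are hypothetical helpers, and server_flag stands in for whatever the server derives from --enable-per-request-metrics.

```python
from typing import Optional


def metrics_enabled(server_flag: bool, request_body: dict) -> bool:
    """Return True only when both gates are open.

    server_flag: assumed to reflect the server-level opt-in.
    request_body: the parsed JSON request; include_metrics must be exactly true.
    """
    return server_flag and request_body.get("include_metrics") is True


def attach_metrics(response: dict, metrics: Optional[dict],
                   server_flag: bool, request_body: dict) -> dict:
    """Return a copy of the response, adding metrics only if both gates are open."""
    result = dict(response)
    if metrics is not None and metrics_enabled(server_flag, request_body):
        result["metrics"] = metrics
    return result


base_response = {"choices": []}
timing = {"queue_time_ms": 5.0}

both_gates = attach_metrics(base_response, timing, True, {"include_metrics": True})
server_only = attach_metrics(base_response, timing, True, {})
request_only = attach_metrics(base_response, timing, False, {"include_metrics": True})
```

With either gate closed the response is returned unchanged, which is what keeps the default behavior identical for existing users.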

Response body additions

A new optional metrics field on ChatCompletionResponse, ChatCompletionStreamResponse, CompletionResponse, and CompletionStreamResponse:

Field	Unit	Description
`time_to_first_token_ms`	ms	Time from scheduling to first output token
`queue_time_ms`	ms	Time spent waiting in the scheduler queue
`generation_time_ms`	ms	Total decode time (excludes queue wait)
`mean_itl_ms`	ms	Mean inter-token latency during decode
`tokens_per_second`	tokens/s	Output throughput for this request

(Prefill-time, cached-token, and per-phase GPU-time fields can be added incrementally as they become cleanly attributable to a single request.)
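To make the field definitions concrete, here is one way the five values could be derived from per-request timestamps. Everything below is an assumption for illustration: the RequestTimestamps structure is hypothetical, the dataclass only mirrors the table above rather than the PerRequestTimingMetrics class in #36383, and the choice to measure generation time from first to last output token is one possible reading of "total decode time".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on a monotonic clock."""
    arrival: float        # request entered the queue
    scheduled: float      # scheduler picked the request up
    first_token: float    # first output token produced
    last_token: float     # last output token produced
    num_output_tokens: int


@dataclass
class TimingMetrics:
    """Stand-in mirroring the proposed response fields."""
    time_to_first_token_ms: float
    queue_time_ms: float
    generation_time_ms: float
    mean_itl_ms: Optional[float]
    tokens_per_second: Optional[float]


def compute_metrics(ts: RequestTimestamps) -> TimingMetrics:
    queue_ms = (ts.scheduled - ts.arrival) * 1000.0
    ttft_ms = (ts.first_token - ts.scheduled) * 1000.0
    # Assumption: decode time spans first to last output token.
    gen_ms = (ts.last_token - ts.first_token) * 1000.0
    intervals = ts.num_output_tokens - 1
    # Inter-token latency is undefined for a single-token response.
    mean_itl = gen_ms / intervals if intervals > 0 else None
    tps = ts.num_output_tokens / (gen_ms / 1000.0) if gen_ms > 0 else None
    return TimingMetrics(ttft_ms, queue_ms, gen_ms, mean_itl, tps)


example = compute_metrics(
    RequestTimestamps(arrival=0.0, scheduled=0.5, first_token=0.75,
                      last_token=1.75, num_output_tokens=5)
)
```

Returning None for mean_itl_ms and tokens_per_second on degenerate requests avoids division by zero and signals "not applicable" rather than reporting a misleading 0.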

Streaming behavior

For non-streaming responses: metrics is populated on the final response object.
For streaming responses: metrics is emitted on the final SSE chunk — consistent with how OpenAI already emits final usage when stream_options.include_usage=true.
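On the client side, picking the metrics out of a stream could look like the sketch below. It assumes the usual OpenAI-style SSE framing ("data: <json>" lines terminated by "data: [DONE]") and that the proposed metrics field appears on a late chunk; extract_final_metrics and the sample stream are hypothetical.

```python
import json
from typing import Iterable, Optional


def extract_final_metrics(sse_lines: Iterable[str]) -> Optional[dict]:
    """Return the last non-null metrics object seen before [DONE], if any."""
    metrics = None
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        if chunk.get("metrics") is not None:
            metrics = chunk["metrics"]
    return metrics


# Illustrative stream; values are made up.
sample_stream = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    'data: {"choices": [], "metrics": {"generation_time_ms": 20, "mean_itl_ms": 1}}',
    "",
    "data: [DONE]",
]
final_metrics = extract_final_metrics(sample_stream)
```

Clients that ignore unknown fields keep working unchanged, since the extra key only appears on the final chunk and only when requested.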

Alternatives

Alternative	Why Not
Headers only (#36189)	Cannot carry end-of-generation timing for streaming cases.
Prometheus only	Aggregate, not per-request; requires scraping infrastructure; invisible to the caller.
OpenTelemetry traces	Requires a tracing backend; not accessible over plain HTTP; high operational overhead.
Always-on body fields	Adds CPU cost and response size for deployments that don't care. Opt-in keeps zero overhead as the default.
Custom middleware	Can only observe wall-clock time; cannot reach engine-internal timings (queue / prefill / decode).

Additional context

Relationship to existing work

Draft implementation: #36383 (already implements the core shape proposed here, including PerRequestTimingMetrics and the double-gate flag).
Companion RFE for headers: #36189. That issue remains a valid proposal for router-facing, hot-path metrics; this RFE is scoped to the body-side API that headers cannot replace.

Why opt-in?

Zero overhead when disabled - no extra computation, no extra fields serialized.
Response size - clients doing strict schema validation shouldn't see new fields unless they ask for them.
Information disclosure - timing data can reveal server capacity characteristics; operators should choose to expose it.


extent analysis

TL;DR

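The issue proposes returning per-request timing and compute metrics in the response body of OpenAI-compatible completion endpoints. Under the proposal, metrics are returned only when the server is started with --enable-per-request-metrics and the request sets include_metrics: true. Both names are proposals from the issue, with a draft implementation in #36383.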

Guidance

Gate the feature twice: start the server with --enable-per-request-metrics and send include_metrics: true in the request payload.
Add a new optional metrics field to the response body, containing fields such as time_to_first_token_ms, queue_time_ms, generation_time_ms, mean_itl_ms, and tokens_per_second.
For streaming responses, emit the metrics field on the final SSE chunk.
Consider the trade-offs between this approach and the headers-only approach proposed in #36189, and choose the one that best fits your use case.

Example

{
  "model": "...",
  "messages": [...],
  "include_metrics": true
}

Response:

{
  "choices": [...],
  "metrics": {
    "time_to_first_token_ms": 10,
    "queue_time_ms": 5,
    "generation_time_ms": 20,
    "mean_itl_ms": 1,
    "tokens_per_second": 100
  }
}

Notes

The implementation should be done in a way that allows for easy switching between the body-based and headers-only approaches, as they are complementary and not mutually exclusive.

Recommendation

Adopt the body-based approach: implement the double gate and add the metrics field to the response body, since, unlike headers, it can carry end-of-generation timing for streaming responses.
