vllm - [Feature]: Add request-level OTel span attribute for cached prefix-cache input tokens

Jayachander123 · 2026-05-06T06:28:35Z


vllm-project/vllm#41788•Fetched 2026-05-07 03:32:55

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Jayachander123

Participants

Jayachander123

Timeline (top)

labeled ×1

Root Cause

Add user/API-key labels to Prometheus metrics. This is not desirable because it can create high-cardinality metrics and privacy concerns. A safer design is to emit cached-token count on the request span and let upstream gateways correlate request/trace IDs to tenants.

RAW_BUFFERClick to expand / collapse

### 🚀 The feature, motivation and pitch

vLLM already exposes useful global Prometheus metrics for prefix caching, including `vllm:prefix_cache_hits`, `vllm:prefix_cache_queries`, and cached prompt-token metrics such as `vllm:prompt_tokens_cached`. The V1 request path also carries per-request cached-token information through `RequestOutput.num_cached_tokens`.

From the current tracing docs/source, request-level OpenTelemetry spans appear to include prompt/completion token usage and latency attributes, but I could not find an emitted cache-read / cached-token attribute. This makes it difficult for enterprise observability and FinOps systems to correlate prefix-cache savings with individual request traces.

I propose adding a request-level OpenTelemetry span attribute for cached input tokens, using the existing per-request `num_cached_tokens` value already available in the V1 output path.

Suggested attribute:

`gen_ai.usage.cache_read.input_tokens`

This aligns with the current OpenTelemetry GenAI semantic conventions for cached input tokens and keeps the change isolated to observability/tracing. It would allow trace backends such as Grafana Tempo, Datadog, Honeycomb, Jaeger, or OpenTelemetry Collector pipelines to correlate cached-token savings with request IDs and upstream gateway/tenant metadata without adding high-cardinality tenant labels to Prometheus metrics.
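To illustrate the kind of trace-side correlation this would enable, here is a minimal sketch that aggregates exported span attributes by tenant. It assumes spans have already been exported as plain attribute dictionaries. The `tenant.id` attribute is hypothetical (something an upstream gateway might attach), and the prompt-token attribute name is an assumption about what the deployment emits; only the cache-read name comes from this proposal.

```python
from collections import defaultdict

# The cache-read name is the attribute proposed in this issue.
CACHE_READ = "gen_ai.usage.cache_read.input_tokens"
# Assumed name of the prompt-token attribute already present on the span.
PROMPT_TOKENS = "gen_ai.usage.prompt_tokens"
# Hypothetical attribute added by an upstream gateway, not by vLLM.
TENANT = "tenant.id"


def cached_fraction_by_tenant(spans):
    """Return {tenant: cached_tokens / prompt_tokens} over span attribute dicts."""
    totals = defaultdict(lambda: [0, 0])  # tenant -> [cached, prompt]
    for attrs in spans:
        tenant = attrs.get(TENANT, "unknown")
        totals[tenant][0] += attrs.get(CACHE_READ, 0)
        totals[tenant][1] += attrs.get(PROMPT_TOKENS, 0)
    return {t: (c / p if p else 0.0) for t, (c, p) in totals.items()}


spans = [
    {TENANT: "a", PROMPT_TOKENS: 1000, CACHE_READ: 750},
    {TENANT: "a", PROMPT_TOKENS: 1000, CACHE_READ: 250},
    {TENANT: "b", PROMPT_TOKENS: 400},  # span without a cache-read attribute
]
fractions = cached_fraction_by_tenant(spans)
```

In practice this aggregation would more likely be a query in the trace backend; the point is only that the tenant dimension lives on the trace, not on a Prometheus label.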

### Alternatives

1. Rely only on global Prometheus counters such as `vllm:prefix_cache_hits`, `vllm:prefix_cache_queries`, and `vllm:prompt_tokens_cached`. These are useful for fleet-level monitoring, but they do not directly provide request-level trace correlation.
2. Rely on OpenAI-compatible response usage fields such as `prompt_tokens_details.cached_tokens`. This helps API clients, but it does not make the cached-token information available inside distributed traces.
3. Add user/API-key labels to Prometheus metrics. This is not desirable because it can create high-cardinality metrics and privacy concerns. A safer design is to emit cached-token count on the request span and let upstream gateways correlate request/trace IDs to tenants.
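As a rough illustration of the cardinality concern in the third alternative, the arithmetic below shows how a per-tenant label multiplies the number of time series. All numbers are hypothetical and chosen only to make the multiplication visible.

```python
# Hypothetical numbers; none of them are measurements from a real deployment.
base_metrics = 3      # the three prefix-cache metrics named above
models = 4            # assumed number of served model names (one label)
tenants = 5_000       # assumed number of distinct users / API keys

# Each distinct label combination becomes its own time series.
series_without_tenant = base_metrics * models
series_with_tenant = base_metrics * models * tenants
```

Under these assumptions the series count grows from 12 to 60,000 for just three metrics, whereas a span attribute adds one field per request and no new series.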

### Additional context

This would help enterprise MLOps / FinOps teams measure per-request cache savings in distributed tracing systems while preserving the existing global Prometheus metrics for fleet-level monitoring.

A possible implementation path could be:

1. Add a span attribute constant for `gen_ai.usage.cache_read.input_tokens`.
2. In the V1 request tracing path, emit the attribute from the existing per-request cached-token value, likely `num_cached_tokens`.
3. Add or update tracing tests to verify the cached-token attribute is present when cached tokens are available.
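The three steps above could be sketched roughly as follows. This is not vLLM's actual tracing code: the constant name, the `record_cached_tokens` helper, and the `RecordingSpan` test double are all hypothetical, and the only assumption about the real span object is that it exposes the OpenTelemetry `set_attribute(key, value)` method.

```python
# Step 1: hypothetical constant for the proposed attribute name.
GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS = "gen_ai.usage.cache_read.input_tokens"


def record_cached_tokens(span, num_cached_tokens):
    """Step 2: set the attribute when a cached-token count is available."""
    if num_cached_tokens is None:
        return  # no per-request value to report
    span.set_attribute(GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS, int(num_cached_tokens))


class RecordingSpan:
    """Step 3: test double that records attributes like an OTel span would."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


hit = RecordingSpan()
record_cached_tokens(hit, 128)

miss = RecordingSpan()
record_cached_tokens(miss, None)
```

Whether to emit the attribute with a value of 0 or omit it when nothing was cached is a design choice left open here; the sketch only skips the attribute when no count is available at all.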

This should be an observability-only change and should not affect scheduling, prefix-cache behavior, OpenAI API response behavior, or Prometheus metrics.

If maintainers agree with the direction, I am happy to implement this and submit a PR.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
