vllm - [Feature]: Add request-level OTel span attribute for cached prefix-cache input tokens

GitHub stats
vllm-project/vllm#41788 · Fetched 2026-05-07 03:32:55
Comments: 0 · Participants: 1 · Timeline events: 1 (labeled ×1) · Reactions: 0


🚀 The feature, motivation and pitch

vLLM already exposes useful global Prometheus metrics for prefix caching, including vllm:prefix_cache_hits, vllm:prefix_cache_queries, and cached prompt-token metrics such as vllm:prompt_tokens_cached. The V1 request path also carries per-request cached-token information through RequestOutput.num_cached_tokens.

From the current tracing docs/source, request-level OpenTelemetry spans appear to include prompt/completion token usage and latency attributes, but I could not find an emitted cache-read / cached-token attribute. This makes it difficult for enterprise observability and FinOps systems to correlate prefix-cache savings with individual request traces.

I propose adding a request-level OpenTelemetry span attribute for cached input tokens, using the existing per-request num_cached_tokens value already available in the V1 output path.

Suggested attribute:

gen_ai.usage.cache_read.input_tokens

This aligns with the current OpenTelemetry GenAI semantic conventions for cached input tokens and keeps the change isolated to observability/tracing. It would allow trace backends such as Grafana Tempo, Datadog, Honeycomb, Jaeger, or OpenTelemetry Collector pipelines to correlate cached-token savings with request IDs and upstream gateway/tenant metadata without adding high-cardinality tenant labels to Prometheus metrics.
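As a rough illustration of what a trace consumer could do with this attribute, the sketch below computes the cached share of a request's input tokens from a span's attribute dictionary. The helper name `cached_input_share` is hypothetical, and the prompt-token key `gen_ai.usage.prompt_tokens` is an assumption; the actual key depends on what the vLLM version in use emits, so it is passed as a parameter.

```python
from typing import Any, Mapping, Optional

# Attribute proposed in this issue.
CACHE_READ_KEY = "gen_ai.usage.cache_read.input_tokens"
# Assumed name of the existing prompt-token attribute; adjust to your deployment.
PROMPT_TOKENS_KEY = "gen_ai.usage.prompt_tokens"


def cached_input_share(
    attrs: Mapping[str, Any],
    prompt_key: str = PROMPT_TOKENS_KEY,
) -> Optional[float]:
    """Return cached_tokens / prompt_tokens for one span, or None if unavailable."""
    cached = attrs.get(CACHE_READ_KEY)
    prompt = attrs.get(prompt_key)
    # Skip spans that lack either value or report an empty prompt.
    if not isinstance(cached, int) or not isinstance(prompt, int) or prompt <= 0:
        return None
    return cached / prompt
```

Returning `None` rather than `0.0` for spans without the attribute keeps "no cache hit" distinguishable from "attribute not emitted", which matters while older vLLM versions are still in the fleet.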

Alternatives

  1. Rely only on global Prometheus counters such as vllm:prefix_cache_hits, vllm:prefix_cache_queries, and vllm:prompt_tokens_cached. These are useful for fleet-level monitoring, but they do not directly provide request-level trace correlation.

  2. Rely on OpenAI-compatible response usage fields such as prompt_tokens_details.cached_tokens. This helps API clients, but it does not make the cached-token information available inside distributed traces.

  3. Add user/API-key labels to Prometheus metrics. This is not desirable because it can create high-cardinality metrics and privacy concerns. A safer design is to emit cached-token count on the request span and let upstream gateways correlate request/trace IDs to tenants.
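The gateway-side correlation described in the third alternative could look roughly like the following. This is a minimal sketch under stated assumptions: spans are represented as plain dictionaries with hypothetical `trace_id` and `attributes` fields, and `tenant_by_trace_id` stands in for whatever mapping the upstream gateway maintains. Neither is part of vLLM.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping

CACHE_READ_KEY = "gen_ai.usage.cache_read.input_tokens"


def cached_tokens_by_tenant(
    spans: Iterable[Mapping[str, Any]],
    tenant_by_trace_id: Mapping[str, str],
) -> Dict[str, int]:
    """Sum cached input tokens per tenant by joining spans on trace ID."""
    totals: Dict[str, int] = defaultdict(int)
    for span in spans:
        tenant = tenant_by_trace_id.get(str(span.get("trace_id")))
        cached = span.get("attributes", {}).get(CACHE_READ_KEY)
        # Ignore spans from unknown traces or without the cached-token attribute.
        if tenant is None or not isinstance(cached, int):
            continue
        totals[tenant] += cached
    return dict(totals)
```

Because the tenant identity lives only in the gateway's mapping, the vLLM side never needs to attach a tenant label to a metric or a span.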

Additional context

This would help enterprise MLOps / FinOps teams measure per-request cache savings in distributed tracing systems while preserving the existing global Prometheus metrics for fleet-level monitoring.

A possible implementation path could be:

  1. Add a span attribute constant for gen_ai.usage.cache_read.input_tokens.
  2. In the V1 request tracing path, emit the attribute from the existing per-request cached-token value, likely num_cached_tokens.
  3. Add or update tracing tests to verify the cached-token attribute is present when cached tokens are available.

This should be an observability-only change and should not affect scheduling, prefix-cache behavior, OpenAI API response behavior, or Prometheus metrics.
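The three steps above can be sketched as follows. This is not vLLM's actual tracing code: the helper `set_cached_token_attribute` and the `RecordingSpan` stand-in are hypothetical, and the sketch assumes `num_cached_tokens` may be `None` when the value is unavailable. The stand-in exists so the example runs without the OpenTelemetry SDK; a real OpenTelemetry span exposes the same `set_attribute(key, value)` method.

```python
from typing import Dict, Optional, Protocol, Union

AttrValue = Union[str, bool, int, float]

# Step 1: constant for the proposed attribute name.
GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS = "gen_ai.usage.cache_read.input_tokens"


class SpanLike(Protocol):
    """Minimal span interface; matches OpenTelemetry's Span.set_attribute."""

    def set_attribute(self, key: str, value: AttrValue) -> None: ...


class RecordingSpan:
    """Hypothetical in-memory span used only to exercise the helper."""

    def __init__(self) -> None:
        self.attributes: Dict[str, AttrValue] = {}

    def set_attribute(self, key: str, value: AttrValue) -> None:
        self.attributes[key] = value


# Step 2: emit the attribute from the per-request cached-token value.
def set_cached_token_attribute(
    span: SpanLike, num_cached_tokens: Optional[int]
) -> None:
    # Assumption: a missing or negative value means "unknown", so emit nothing.
    if num_cached_tokens is None or num_cached_tokens < 0:
        return
    span.set_attribute(GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS, num_cached_tokens)


# Step 3: the kind of check a tracing test could make.
span = RecordingSpan()
set_cached_token_attribute(span, 48)
assert span.attributes[GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS] == 48
```

One design choice left open here is whether a value of `0` should be emitted. The sketch emits it, so that a cache miss stays distinguishable from a span produced by a build without the attribute.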

If maintainers agree with the direction, I am happy to implement this and submit a PR.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
