vllm - [RFC]: Offloading Metrics Redesign

vllm, 2026-05-29 17:17:03

This RFC proposes a metadata-driven metrics design for the KV offloading connector and offloading managers. The goal is to make offloading metrics general enough for counters, gauges, and histograms while preserving the public Prometheus metrics contract that users may already rely on.

The immediate motivation is the stores_skipped metric for CPU offloading. The next expected metric is a CPU KV offload cache usage gauge. The current connector metrics path is hard-coded around transfer statistics, so adding more manager-specific metrics would keep increasing special-case code unless we redesign the path first.

PR: #35669 #43877

Root Cause

The immediate example is vllm:kv_offload_stores_skipped, which is emitted by the CPU offloading manager when a store is skipped because the reuse threshold is not reached. A near-term follow-up is a CPU KV offload cache usage gauge. These metrics are not transfer operations, so adding them through the existing hard-coded transfer metrics path would keep expanding special cases in the connector.

Fix / Workaround

Long term, get_kv_connector_stats() should return None when there are no stats to emit. During compatibility work, an empty stats object may be used as a temporary bridge only where the current scheduler aggregation path requires a worker-side stats object before scheduler-side stats can be included. The dedicated scheduler fix should remove the need for that workaround by allowing scheduler-side-only stats directly.

Motivation.

The offloading connector currently treats metrics as a transfer-specific concern. That worked while the only exposed metrics were load/store bytes, time, and size, but it does not scale well as offloading managers start exposing their own state.

We need a general offloading metrics path where connector metrics and manager metrics are declared as metadata, registered up front, and observed through the same counter/gauge/histogram machinery. This also makes aggregation semantics explicit: counters are summed, gauges keep the latest value, and histograms accumulate observations.

At the same time, offloading metrics are part of vLLM's public Prometheus surface. Existing metrics such as vllm:kv_offload_total_bytes, vllm:kv_offload_total_time, and vllm:kv_offload_size may already be used by dashboards, alerts, or autoscaling rules. The redesign should therefore improve the internal model and introduce clearer flat metrics without silently removing or renaming existing public metrics.

Proposed Change.

Summary

PR: #35669 #43877

Background

The current offloading connector exposes transfer metrics as labelled Prometheus metrics:

vllm:kv_offload_total_bytes{transfer_type=...}
vllm:kv_offload_total_time{transfer_type=...}
vllm:kv_offload_size{transfer_type=...}

Internally, the recent flat-metrics work moved toward representing transfer stats and manager stats as metric-name keyed payloads. Review discussion then raised three design requirements:

Offloading managers should be able to declare their own metrics of any type, not just counters.
Metrics should be registered up front from metadata, rather than lazily created or discovered during observe().
Existing public metrics should not be removed casually. If they are ever replaced, the vLLM metrics deprecation process should be followed.

Goals

Provide one offloading metrics path for counters, gauges, and histograms.
Let each OffloadingManager define the metrics it emits.
Let the offloading connector define transfer metrics through the same metadata model.
Use a flat stats payload, with metric names as keys.
Make aggregation semantics explicit by metric type.
Register Prometheus metrics and per-engine labelled children during initialization.
Preserve existing labelled transfer metrics while introducing any new metric names.
Support tiered offloading managers that may need the full OffloadingSpec when deciding which metrics to expose.

Non-Goals

This RFC does not redesign all vLLM metrics.
This RFC does not remove existing offloading transfer metrics.
This RFC does not finalize the exact public name for the CPU KV offload cache usage gauge.
This RFC does not change unrelated metrics such as prefix-cache metrics, request latency metrics, speculative decoding metrics, or KV block sampling metrics.

Scope

In scope:

Offloading connector transfer metrics.
Offloading manager metrics emitted through the offloading connector.
Internal OffloadingConnectorStats aggregation and reduction semantics.
Prometheus registration for offloading connector metrics.
Compatibility behavior for legacy offloading transfer metrics.

Out of scope:

General server/request metrics.
HTTP metrics.
NIXL-specific metrics.
Prefix-cache metrics.
The legacy swapped-preemption metrics, except where CPU cache usage naming overlaps with historical metrics.

Existing Public Metrics

The following public metrics already exist and should continue to be emitted during the redesign:

| Metric | Type | Labels | Semantics |
| --- | --- | --- | --- |
| `vllm:kv_offload_total_bytes` | Counter | `transfer_type` | Total bytes transferred by the offloading connector |
| `vllm:kv_offload_total_time` | Counter | `transfer_type` | Total transfer time measured by offloading operations |
| `vllm:kv_offload_size` | Histogram | `transfer_type` | Distribution of KV offload transfer sizes |

The current transfer_type values are:

CPU_to_GPU
GPU_to_CPU

These names are not ideal because each metric mixes load and store semantics behind a label, but they are public metrics. Mark's feedback was that removing or replacing them based on an assumption that nobody depends on them is not acceptable. This RFC therefore treats compatibility as a design requirement.

Proposed Public Metrics

The redesigned connector can expose clearer flat transfer metrics:

| Metric | Type | Semantics |
| --- | --- | --- |
| `vllm:kv_offload_load_bytes` | Counter | Total bytes loaded from offload storage to GPU |
| `vllm:kv_offload_load_time` | Counter | Total load time from offload storage to GPU, in seconds |
| `vllm:kv_offload_load_size` | Histogram | Per-operation load size distribution |
| `vllm:kv_offload_store_bytes` | Counter | Total bytes stored from GPU to offload storage |
| `vllm:kv_offload_store_time` | Counter | Total store time from GPU to offload storage, in seconds |
| `vllm:kv_offload_store_size` | Histogram | Per-operation store size distribution |

Manager-defined metrics are also represented in the same registry. Known examples:

| Metric | Type | Semantics |
| --- | --- | --- |
| `vllm:kv_offload_stores_skipped` | Counter | Number of KV offload stores skipped because the reuse threshold was not reached |
| TBD CPU KV cache usage metric | Gauge | Current CPU KV offload cache usage |

The CPU KV cache usage metric needs a final naming decision. It should either use a new vllm:kv_offload_* name, or explicitly follow the deprecation policy if it is intended to replace or supersede an older CPU cache metric such as vllm:cpu_cache_usage_perc.

Metric Metadata

Offloading metrics should be described with subclasses instead of a generic metric_type discriminator:

from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadingMetricMetadata:
    # Human-readable description of the metric.
    documentation: str


@dataclass(frozen=True)
class OffloadingCounterMetadata(OffloadingMetricMetadata):
    pass


@dataclass(frozen=True)
class OffloadingGaugeMetadata(OffloadingMetricMetadata):
    pass


@dataclass(frozen=True)
class OffloadingHistogramMetadata(OffloadingMetricMetadata):
    # Optional bucket boundaries; only histograms carry this field.
    buckets: tuple[float, ...] | None = None

Metric names are the dictionary keys:

{
    "vllm:kv_offload_store_bytes": OffloadingCounterMetadata(...),
    "vllm:kv_offload_store_size": OffloadingHistogramMetadata(...),
}

This keeps histogram-specific fields on the histogram subclass and avoids duplicating the metric name inside the metadata object.

OffloadingSpec and OffloadingManager API

OffloadingSpec should own the resolved manager metric definitions:

class OffloadingSpec(ABC):
    metric_definitions: dict[str, OffloadingMetricMetadata]

    @classmethod
    @abstractmethod
    def get_manager_cls(cls) -> type[OffloadingManager]:
        ...

During spec initialization:

self.metric_definitions = self.get_manager_cls().get_metric_definitions(self)

Managers should receive their resolved metric definitions at construction time:

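from abc import ABC
from typing import Any


class OffloadingManager(ABC):
    def __init__(
        self,
        metric_definitions: dict[str, OffloadingMetricMetadata] | None = None,
    ):
        # Definitions resolved by the spec; empty when the manager has no metrics.
        self.metric_definitions = metric_definitions or {}

    @classmethod
    def get_metric_definitions(
        cls,
        spec: OffloadingSpec,
    ) -> dict[str, OffloadingMetricMetadata]:
        # Default: the manager declares no metrics.
        return {}

    def get_stats(self) -> dict[str, Any] | None:
        # Default: no stats to emit.
        return None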

The intent is that metric definitions are available through the spec/manager relationship, and that get_metric_definitions() receives the OffloadingSpec rather than only VllmConfig. The spec contains vllm_config but can also carry resolved offloading-specific configuration such as block sizing, tiering configuration, and future manager-specific state.

None is the default no-stats return value for managers. Managers should avoid returning {} in the common case where they have no metrics to emit.

Flat Stats Payload

OffloadingConnectorStats.data should use metric names directly as flat keys:

{
    "vllm:kv_offload_load_bytes": 1234,
    "vllm:kv_offload_load_size": [1024, 2048],
    "vllm:kv_offload_stores_skipped": 3,
}

The stats payload should not use type prefixes such as counter: or histogram:. Type information comes from the metadata registry. Prefixes would duplicate metadata, require string parsing on the observation path, and make unknown-key validation less direct. Flat metric-name keys keep the payload close to the Prometheus surface while still allowing strict type checks through the registry.

Aggregation and Reduction Semantics

Aggregation must be metadata-driven:

| Metric type | `aggregate()` behavior | `reduce()` behavior |
| --- | --- | --- |
| Counter | Sum increments | Return summed value |
| Gauge | Keep latest observed value | Return latest value |
| Histogram | Concatenate samples | Return count and sum for logging; observe each sample in Prometheus |

This avoids branching based on Python value shape. Lists are valid for histograms, scalars are valid for counters/gauges, and unknown metric keys should fail fast.

Stats Emission Semantics

Worker-side and scheduler-side offloading stats are both valid sources:

Worker-side stats report transfer metrics collected around asynchronous load and store operations.
Scheduler-side stats report connector/manager metrics that are only known after scheduler state has been updated.

The scheduler stats path should support three cases:

worker-side stats only
scheduler-side stats only
both worker-side and scheduler-side stats

When both exist, they should be aggregated with the same metadata-driven semantics described above. This refers to transfer stats forwarded from the worker and manager stats collected on the scheduler side within the same connector, not two simultaneously active connector instances. Collection order matters: scheduler-side connector stats should be collected after update_connector_output() has run, so manager state reflects the current worker output before stats are emitted.

Prometheus Registration

OffloadPromMetrics.__init__ should build a complete registry from:

connector metric definitions
the selected spec's manager metric definitions

It should create all Prometheus metric objects and per-engine labelled children up front. observe() should then do only:

1. Look up the metric metadata by key.
2. Validate the observed value type.
3. Dispatch to inc, set, or observe.
4. Update compatibility metrics when applicable.

This removes lazy metric creation, prefix parsing, and string slicing from the hot path.

Compatibility Metrics

The implementation should update the old labelled metrics from the same flat stats payload:

| New flat metric | Existing compatibility metric |
| --- | --- |
| `vllm:kv_offload_load_bytes` | `vllm:kv_offload_total_bytes{transfer_type="CPU_to_GPU"}` |
| `vllm:kv_offload_load_time` | `vllm:kv_offload_total_time{transfer_type="CPU_to_GPU"}` |
| `vllm:kv_offload_load_size` | `vllm:kv_offload_size{transfer_type="CPU_to_GPU"}` |
| `vllm:kv_offload_store_bytes` | `vllm:kv_offload_total_bytes{transfer_type="GPU_to_CPU"}` |
| `vllm:kv_offload_store_time` | `vllm:kv_offload_total_time{transfer_type="GPU_to_CPU"}` |
| `vllm:kv_offload_store_size` | `vllm:kv_offload_size{transfer_type="GPU_to_CPU"}` |

Compatibility metrics should be emitted for CPU offloading while the existing public metrics are supported. If other offloading managers start using the same connector transfer metrics, we should decide whether the legacy CPU/GPU transfer_type labels are semantically correct for them before emitting the old metrics there too.
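One way to express the mapping above is a static table from each flat metric name to its legacy metric name and `transfer_type` label. The names `LEGACY_COMPAT` and `legacy_updates` are hypothetical; the sketch only computes which legacy series would receive each value and does not touch any Prometheus objects.

```python
from typing import Any

# Flat metric name -> (legacy metric name, transfer_type label value).
LEGACY_COMPAT: dict[str, tuple[str, str]] = {
    "vllm:kv_offload_load_bytes": ("vllm:kv_offload_total_bytes", "CPU_to_GPU"),
    "vllm:kv_offload_load_time": ("vllm:kv_offload_total_time", "CPU_to_GPU"),
    "vllm:kv_offload_load_size": ("vllm:kv_offload_size", "CPU_to_GPU"),
    "vllm:kv_offload_store_bytes": ("vllm:kv_offload_total_bytes", "GPU_to_CPU"),
    "vllm:kv_offload_store_time": ("vllm:kv_offload_total_time", "GPU_to_CPU"),
    "vllm:kv_offload_store_size": ("vllm:kv_offload_size", "GPU_to_CPU"),
}


def legacy_updates(payload: dict[str, Any]) -> list[tuple[str, str, Any]]:
    """Return (legacy_name, transfer_type, value) for each mapped flat key.

    Keys without a legacy counterpart, such as manager-defined metrics,
    are skipped.
    """
    updates = []
    for name, value in payload.items():
        target = LEGACY_COMPAT.get(name)
        if target is not None:
            legacy_name, transfer_type = target
            updates.append((legacy_name, transfer_type, value))
    return updates


updates = legacy_updates({
    "vllm:kv_offload_load_bytes": 1234,
    "vllm:kv_offload_load_size": [1024, 2048],
    "vllm:kv_offload_stores_skipped": 3,
})
```

Keeping the mapping in one table would also make it straightforward to restrict compatibility emission to CPU offloading, for example by consulting the table only when the active manager is the CPU one.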

Deprecation Plan

The vLLM metrics documentation says metric deprecation should be handled carefully because users may only notice after a release breaks dashboards or autoscaling. The policy calls for clear /metrics help text, documentation, release notes, and an escape hatch for hidden metrics.

This RFC proposes the following phased approach:

1. Add the metadata registry and flat metrics.
2. Continue emitting the existing labelled metrics.
3. Add manager-defined metrics through the registry, including vllm:kv_offload_stores_skipped.
4. Add the CPU KV offload cache usage gauge as a follow-up using the same registry.
5. If the community agrees to deprecate the existing labelled metrics, update metric help strings, docs, and release notes with the target removal version.
6. Hide deprecated metrics only after the documented grace period, with the project-standard escape hatch.
7. Remove deprecated metrics only after the policy-compliant period has elapsed.

Implementation Plan

Define metric metadata subclasses in vllm/v1/kv_offload/base.py.
Add OffloadingSpec.get_manager_cls().
Add OffloadingSpec.metric_definitions, populated from get_manager_cls().get_metric_definitions(spec).
Add OffloadingManager.__init__(metric_definitions=...).
Pass spec.metric_definitions into concrete manager constructors.
Define connector transfer metric definitions in one place.
Convert OffloadingConnectorStats to flat metric-name keys.
Implement metadata-driven aggregation and reduction.
Build Prometheus metric objects and observer mappings during OffloadPromMetrics.__init__.
Emit both flat transfer metrics and compatibility labelled metrics.
Add stores_skipped as the first manager-defined counter.
Add the CPU KV offload cache usage gauge in a follow-up.
Once scheduler-side-only stats aggregation is supported, update get_kv_connector_stats() and manager stats plumbing so no-stats paths return None instead of an empty stats object.

Test Plan

Add or update tests for:

Counter aggregation.
Gauge aggregation using latest value.
Histogram aggregation using concatenated samples.
Unknown metric key failure.
Prometheus registration for counters, gauges, and histograms.
Existing labelled transfer metrics still being observed.
Manager metric definitions being discovered through OffloadingSpec.
Manager stats emitted through the connector scheduler path.
Scheduler-side and worker-side connector stats being aggregated correctly.
CPU KV cache usage gauge once implemented.

Suggested placement:

Offloading stats and Prometheus registry tests should live under the offloading connector unit tests.
Scheduler aggregation/order tests should live in tests/v1/core/test_scheduler.py and cover scheduler-side only, worker-side only, and both in one focused test.

Alternatives Considered

Keep Hard-Coded Transfer Metrics

This has the lowest immediate churn, but it leaves every new offloading manager metric as a special case.

Remove the Existing Labelled Metrics Immediately

This gives the cleanest final public surface, but it violates the spirit of the metrics deprecation policy and risks breaking existing dashboards.

Prefix Internal Stat Keys by Type

Prefixes such as counter: and histogram: make dispatch easy, but duplicate information already present in metadata and make the payload less natural.

Open Questions

What exact name should the CPU KV offload cache usage gauge use?
Should CPU cache usage be a percentage, bytes, block count, or more than one metric?
Should compatibility labelled metrics be emitted only for CPU offloading, or for all offloading managers that use load/store transfer metrics?
What release should begin deprecation, if we decide to deprecate the labelled transfer metrics?
Should we add a feature flag before hiding/removing any compatibility metric, or rely on the project-wide hidden metrics escape hatch?

CC List.

@orozery @markmc
