vllm - ✅(Solved) Fix [Docs] Document NIXL KV connector metrics aggregation semantics [1 pull requests, 3 comments, 3 participants]

vllm · 2026-04-29 12:38:34

vllm-project/vllm#41230 • Fetched 2026-04-30 06:19:26


The NIXL KV connector logs transfer metrics periodically:

KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0

Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.

Root Cause

The metrics are aggregated across all TP ranks before summary stats are computed. This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.

Fix Action

Proposed fix in PR (open, not yet merged when fetched): docs: clarify NIXL KV transfer metrics aggregation (https://github.com/vllm-project/vllm/pull/41259)

PR fix notes

PR #41259: docs: clarify NIXL KV transfer metrics aggregation

Repository: vllm-project/vllm
Author: zeel2104
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41259

Description (problem / solution / changelog)

Purpose

Fixes #41230.

This PR documents the aggregation semantics for NIXL KV connector transfer metrics. In TP > 1 deployments, NIXL transfer observations are recorded per rank and then aggregated before summary stats are computed. The updated docs and docstrings clarify that:

- `Num successful transfers` is the total count across rank-level transfers.
- Averages and P90 values are computed over the combined rank-level observation pool.
- `Avg MB per transfer` is per rank-level transfer, not per engine-level KV operation.
- `Throughput (MB/s)` is total MB divided by summed rank-level transfer time, not aggregate system throughput over wall-clock time.
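
As a quick sanity check of the throughput definition, the figures in the example log line from the issue can be plugged into "total MB divided by summed transfer time". This is a sketch using only the rounded values printed in that log line; the variable names are illustrative and not taken from vLLM.

```python
# Values copied from the example "KV Transfer metrics" log line in the issue.
num_transfers = 4
avg_xfer_ms = 1.381          # Avg xfer time (ms), rounded in the log
avg_mb = 2.25                # Avg MB per transfer
logged_throughput = 1629.549 # Throughput (MB/s)

# Reconstruct the pooled totals from the averages.
total_mb = num_transfers * avg_mb
summed_xfer_s = num_transfers * avg_xfer_ms / 1e3

# Total MB divided by summed rank-level transfer time.
pooled_throughput = total_mb / summed_xfer_s
rel_diff = abs(pooled_throughput - logged_throughput) / logged_throughput

print(f"reconstructed: {pooled_throughput:.1f} MB/s, logged: {logged_throughput} MB/s")
print(f"relative difference: {rel_diff:.2e}")
```

The reconstructed value lands within a small fraction of a percent of the logged one; the remaining gap is plausibly due to the average transfer time being rounded to three decimals in the log. Note that the transfer count cancels out, so the logged throughput is essentially `Avg MB per transfer` divided by `Avg xfer time`.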

I checked for duplicate open PRs using GitHub search and did not find an existing PR addressing #41230.

AI assistance was used to draft and apply this documentation update. I reviewed the changed files.

Test Plan

pre-commit run ruff-check --files vllm/distributed/kv_transfer/kv_connector/v1/metrics.py vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py
pre-commit run markdownlint-cli2 --files docs/usage/metrics.md

## Changed files

- `docs/usage/metrics.md` (modified, +24/-0)
- `vllm/distributed/kv_transfer/kv_connector/v1/metrics.py` (modified, +25/-0)
- `vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py` (modified, +25/-2)


Summary

The NIXL KV connector logs transfer metrics periodically:

KV Transfer metrics: Num successful transfers=4, Avg xfer time (ms)=1.381, P90 xfer time (ms)=2.601, Avg post time (ms)=0.672, P90 post time (ms)=0.801, Avg MB per transfer=2.25, Throughput (MB/s)=1629.549, Avg number of descriptors=72.0

Currently there is no documentation explaining what these metrics represent, especially in the context of multi-rank (TP > 1) deployments. This has already caused confusion among users.

Current behavior

All metrics are aggregated across all TP ranks before summary stats are computed:

- Each TP rank independently records per-transfer telemetry (`transfer_duration`, `post_duration`, `bytes_transferred`, `num_descriptors`) via `NixlKVConnectorStats.record_transfer()` in `stats.py`.
- Stats from all ranks are concatenated via `aggregate()` (`list.extend()`).
- `reduce()` computes averages, percentiles, and throughput over the combined pool of observations from all ranks.
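
The record, aggregate, reduce flow described above can be sketched with a small stand-in class. This is a minimal illustration, not the vLLM implementation: `ToyTransferStats`, the `p90` helper, the nearest-rank percentile method, the dictionary keys, and the 1 MB = 2**20 bytes conversion are all assumptions made for the example. Only the four telemetry field names and the `list.extend()` style of aggregation come from the issue text.

```python
import math
from dataclasses import dataclass, field
from statistics import mean


def p90(values):
    """Nearest-rank 90th percentile (an assumption; real code may interpolate)."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]


@dataclass
class ToyTransferStats:
    """Hypothetical stand-in for a per-rank stats container; not vLLM code."""

    transfer_duration: list = field(default_factory=list)  # seconds
    post_duration: list = field(default_factory=list)      # seconds
    bytes_transferred: list = field(default_factory=list)
    num_descriptors: list = field(default_factory=list)

    def record_transfer(self, xfer_s, post_s, nbytes, ndesc):
        # One observation per rank-level transfer.
        self.transfer_duration.append(xfer_s)
        self.post_duration.append(post_s)
        self.bytes_transferred.append(nbytes)
        self.num_descriptors.append(ndesc)

    def aggregate(self, other):
        # Concatenate another rank's observations into this pool.
        self.transfer_duration.extend(other.transfer_duration)
        self.post_duration.extend(other.post_duration)
        self.bytes_transferred.extend(other.bytes_transferred)
        self.num_descriptors.extend(other.num_descriptors)
        return self

    def reduce(self):
        # Summary stats over the combined pool, regardless of source rank.
        n = len(self.transfer_duration)
        total_mb = sum(self.bytes_transferred) / 2**20  # assumes 1 MB = 2**20 bytes
        total_s = sum(self.transfer_duration)
        return {
            "num_transfers": n,
            "avg_xfer_ms": mean(self.transfer_duration) * 1e3,
            "p90_xfer_ms": p90(self.transfer_duration) * 1e3,
            "avg_post_ms": mean(self.post_duration) * 1e3,
            "avg_mb_per_transfer": total_mb / n,
            "throughput_mb_s": total_mb / total_s,
            "avg_descriptors": mean(self.num_descriptors),
        }


# Two ranks, one transfer each, pooled before any summary is computed.
rank0, rank1 = ToyTransferStats(), ToyTransferStats()
rank0.record_transfer(0.001, 0.0005, 2 * 2**20, 64)
rank1.record_transfer(0.003, 0.0007, 2 * 2**20, 80)
summary = ToyTransferStats().aggregate(rank0).aggregate(rank1).reduce()
print(summary)
```

In this toy run the pooled transfer count is 2 even though each rank recorded only one transfer, and the throughput comes out at roughly 1000 MB/s (4 MB over 4 ms of summed transfer time), which matches neither rank's individual rate.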

This means:

- "Num successful transfers" is the total count across all ranks, not per-rank.
- "Avg MB per transfer" is the average over all individual rank-level transfers, not the total bytes moved for a single KV cache transfer operation.
- "Throughput (MB/s)" is total_MB_all_ranks / total_time_all_ranks, which is effectively an average per-rank throughput, not the aggregate system throughput.
- Percentiles (P90) are computed over the combined distribution of all ranks' transfer times.
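
To make the gap between the logged throughput and a wall-clock system throughput concrete, here is a small worked example. The scenario is hypothetical: it assumes TP=4, identical transfers on every rank, and perfectly overlapping transfers so that the wall-clock window equals the longest single transfer. None of these numbers come from a real deployment.

```python
# Hypothetical scenario: 4 TP ranks each move 2.25 MB in 1.5 ms, concurrently.
tp_size = 4
sizes_mb = [2.25] * tp_size
durations_s = [1.5e-3] * tp_size

# What the log line reports: total MB over *summed* rank-level transfer time.
logged_style_throughput = sum(sizes_mb) / sum(durations_s)

# What a user might expect: total MB over wall-clock time.
# Assumes the transfers overlap perfectly, so the window is the longest transfer.
wall_clock_s = max(durations_s)
system_throughput = sum(sizes_mb) / wall_clock_s

# Per rank-level transfer versus per engine-level KV operation.
avg_mb_per_transfer = sum(sizes_mb) / len(sizes_mb)
mb_per_kv_operation = sum(sizes_mb)

print(f"logged-style throughput: {logged_style_throughput:.0f} MB/s")
print(f"wall-clock throughput:   {system_throughput:.0f} MB/s")
print(f"avg MB per transfer: {avg_mb_per_transfer}, MB per KV operation: {mb_per_kv_operation}")
```

Under these assumptions the logged-style figure is about 1500 MB/s while the wall-clock figure is about 6000 MB/s, a factor equal to the TP size. In practice transfers rarely overlap perfectly, so the real ratio would likely fall somewhere between 1 and the TP size.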

This is unintuitive because users may expect metrics to reflect per-engine totals or aggregate system throughput.

What needs to be documented

- Docstrings in `stats.py`: Add clear documentation to `NixlKVConnectorStats` explaining that stats are aggregated across all TP ranks and what each metric represents in that context.
- Inline comments in `reduce()`: Clarify the semantics of throughput and averages, namely that they are per-rank averages over the combined observation pool.
- Docstrings in `metrics.py`: Document the `observe()` → `aggregate()` → `reduce()` → `log()` pipeline and the fact that stats arrive pre-aggregated across workers.
- (Optional) Docs page: Add a section to the disaggregated serving documentation explaining how to interpret the KV Transfer metrics log line.

Relevant files

- `vllm/distributed/kv_transfer/kv_connector/v1/nixl/stats.py`: `NixlKVConnectorStats` (recording, aggregation, reduction)
- `vllm/distributed/kv_transfer/kv_connector/v1/metrics.py`: `KVConnectorLogging` (observe/log pipeline), `KVConnectorStats` (base class)

Context

See related discussion: metrics are aggregated across ranks rather than reported per-rank or per-engine. This is a deliberate design choice (fire-and-forget from workers), but it needs to be clearly documented so users can correctly interpret the numbers.

Extent analysis

TL;DR

To address the confusion around KV Transfer metrics, documentation should be added to explain that metrics are aggregated across all TP ranks.

Guidance

- Add docstrings to `NixlKVConnectorStats` in `stats.py` to clarify the aggregation of metrics across TP ranks.
- Include inline comments in `reduce()` to explain the semantics of throughput and averages.
- Document the `observe()` → `aggregate()` → `reduce()` → `log()` pipeline in `metrics.py` to provide context on how stats are collected and reported.
- Consider adding a section to the documentation explaining how to interpret the KV Transfer metrics log line.

Example

No code snippet is provided as the issue focuses on documentation rather than code changes.

Notes

The current implementation is a deliberate design choice, but clear documentation is necessary to avoid user confusion.

Recommendation

Apply workaround: Add documentation to explain the aggregation of metrics across TP ranks, as this will help users correctly interpret the numbers without requiring changes to the existing implementation.
