vllm - [RFC]: Per-iteration forward pass metrics with accurate engine-level timing

vllm, 2026-04-01 22:34:25


vllm-project/vllm#38760•Fetched 2026-04-08 02:22:50


Author

tedzhouhk



Motivation.

Problem: orchestration systems need per-iteration scheduler telemetry, but vLLM only exposes aggregated Prometheus metrics.

Inference orchestrators (autoscalers, routers, disaggregated serving planners) need to understand the per-iteration cost structure of a running vLLM engine:

- How many prefill vs decode requests were in each batch?
- What was the KV cache depth distribution across decode requests?
- How many tokens were computed vs cache-hit?
- How long did the GPU forward pass actually take?
- How many requests are queued and waiting?

Today, vLLM exposes Prometheus gauge/histogram metrics that are scraped asynchronously by an external collector. This has fundamental limitations for per-iteration telemetry:

- Lossy: Prometheus scraping is pull-based at a configurable interval. With iteration times of 10-100 ms, even a 1-second scrape interval spans 10-100 iterations, so the scraper misses 90% or more of them. Gauge values reflect only the most recent state at scrape time, not the full distribution, and aggregated metrics inevitably lose information.
- Unsynchronized: The scraper runs on a separate timer from the engine loop. Metrics from different gauges may reflect different iterations, making it impossible to correlate prefill/decode counts with wall time for the same batch.
- No per-iteration history: There is no way to reconstruct the sequence of batch compositions over time. An autoscaler cannot build a cost model from Prometheus data because it only sees snapshots.
- Latency: Push-based Prometheus (Pushgateway) uses HTTP, adding latency and overhead proportional to push frequency. For per-iteration emission at 100+ iterations/second, this is prohibitive.
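The sampling loss described above can be estimated with simple arithmetic. The sketch below assumes a fixed scrape interval, a constant iteration time, and that each gauge scrape observes exactly one iteration. The function name and the 1 s and 15 s intervals are illustrative choices, not values taken from vLLM or this RFC.

```python
def fraction_missed(scrape_interval_s: float, iteration_time_s: float) -> float:
    """Fraction of iterations a gauge scrape never observes.

    Assumes each scrape sees only the most recent iteration.
    """
    iterations_per_scrape = scrape_interval_s / iteration_time_s
    if iterations_per_scrape <= 1.0:
        return 0.0
    return 1.0 - 1.0 / iterations_per_scrape


if __name__ == "__main__":
    for scrape_s in (1.0, 15.0):
        for iter_s in (0.010, 0.100):
            missed = fraction_missed(scrape_s, iter_s)
            print(f"scrape={scrape_s}s iter={iter_s * 1000:.0f}ms missed={missed:.2%}")
```

Under these assumptions, a longer scrape interval or a shorter iteration time both push the missed fraction closer to 100%.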

Why this matters for the ecosystem:

NVIDIA Dynamo currently implements this as an out-of-tree --scheduler-cls subclass (InstrumentedScheduler), but measuring wall time from the scheduler is inherently imprecise because the scheduler cannot observe the GPU forward pass boundary (see Proposed Change).
Autoscalers (Kubernetes HPA, custom planners) need per-iteration throughput signals to make scaling decisions within seconds, not minutes.

Proposed Change.

1. Add `wall_time` measurement in EngineCore

Measure the GPU forward pass time at the exact boundary -- around future.result() in EngineCore.step() / step_with_batch_queue():

# In EngineCore.step():
scheduler_output = self.scheduler.schedule()
future = self.model_executor.execute_model(scheduler_output, non_block=True)
...
t_start = time.monotonic()
model_output = future.result()   # blocks until GPU finishes
wall_time = time.monotonic() - t_start
...
self.scheduler.update_from_output(scheduler_output, model_output, wall_time=wall_time)

This is the only place in the codebase with direct access to both the GPU wait boundary and the scheduler output. The scheduler cannot measure this accurately because:

- In sync mode: schedule() returns before execute_model runs
- In async mode: schedule(N+1) runs concurrently with GPU batch N, so scheduler-side timestamps include overlap from adjacent batches

Pass wall_time to update_from_output() as a new optional kwarg so the scheduler can include it in metrics.
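The timing pattern above can be tried outside vLLM with a thread pool standing in for the model executor. This is a minimal sketch under stated assumptions: `fake_forward_pass` and `step` are hypothetical names, and a `time.sleep` call replaces the real GPU work.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def fake_forward_pass(duration_s: float) -> str:
    """Stand-in for the GPU forward pass (hypothetical; sleeps instead of computing)."""
    time.sleep(duration_s)
    return "model_output"


def step(executor: ThreadPoolExecutor, duration_s: float) -> tuple[str, float]:
    """One engine step: launch the pass without blocking, then time the wait."""
    future: Future = executor.submit(fake_forward_pass, duration_s)
    # In async mode, scheduling work for the next batch could overlap here.
    t_start = time.monotonic()
    model_output = future.result()  # blocks until the forward pass finishes
    wall_time = time.monotonic() - t_start
    return model_output, wall_time


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        output, wall_time = step(pool, 0.05)
        print(f"{output}: wall_time={wall_time * 1000:.1f} ms")
```

Because the timer starts just before the blocking wait, any host-side work done between launching the pass and waiting on it is excluded from `wall_time` in this sketch.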

2. Define a per-iteration metrics struct

A compact, versioned struct emitted once per forward pass:

class ForwardPassMetrics(msgspec.Struct, frozen=True):
    version: int = 1             # can include more info in later versions

    # Identity
    worker_id: str = ""          # unique engine instance identifier
    dp_rank: int = 0             # data parallel rank
    counter_id: int = 0          # monotonic sequence number

    # Timing (measured in EngineCore)
    wall_time: float = 0.0       # seconds, GPU forward pass time

    # Scheduled batch composition
    num_prefill_requests: int = 0
    sum_prefill_tokens: int = 0       # tokens being computed this iteration
    var_prefill_length: float = 0.0   # variance of total prompt lengths
    sum_prefill_kv_tokens: int = 0    # KV tokens read (cache hits + prior chunks)
    num_decode_requests: int = 0
    sum_decode_kv_tokens: int = 0     # total KV depth across decode requests
    var_decode_kv_tokens: float = 0.0

    # Queue state
    num_queued_prefill: int = 0
    sum_queued_prefill_tokens: int = 0
    num_queued_decode: int = 0        # preempted requests waiting
    sum_queued_decode_kv_tokens: int = 0

Why these specific fields:

- An autoscaler needs wall_time + num_prefill_requests + num_decode_requests + token counts to build a cost model of the form latency = f(prefill_tokens, decode_batch_size, kv_depth).
- Variance fields enable detecting heterogeneous batches (mix of short and long sequences), which affect padding overhead and CUDA graph efficiency.
- Queue metrics enable load-aware routing and backpressure signals.
- msgspec.Struct serializes compactly and cheaply to msgpack and is already used by vLLM for KV cache events.
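To make the batch-composition fields concrete, the following sketch aggregates a scheduled batch into the sum and variance fields listed above. `ScheduledRequest` and `summarize_batch` are hypothetical names, not vLLM types. The use of population variance is also an assumption, since the RFC does not say which variance definition is intended.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class ScheduledRequest:
    """Hypothetical per-request view; not a vLLM type."""
    is_prefill: bool
    new_tokens: int   # tokens computed this iteration
    kv_tokens: int    # KV tokens already cached for this request
    prompt_len: int   # total prompt length


def summarize_batch(requests: list[ScheduledRequest]) -> dict[str, float]:
    """Aggregate one scheduled batch into ForwardPassMetrics-style fields."""
    prefill = [r for r in requests if r.is_prefill]
    decode = [r for r in requests if not r.is_prefill]
    return {
        "num_prefill_requests": len(prefill),
        "sum_prefill_tokens": sum(r.new_tokens for r in prefill),
        # Assumption: population variance; the RFC does not specify which.
        "var_prefill_length": pvariance([r.prompt_len for r in prefill]) if prefill else 0.0,
        "sum_prefill_kv_tokens": sum(r.kv_tokens for r in prefill),
        "num_decode_requests": len(decode),
        "sum_decode_kv_tokens": sum(r.kv_tokens for r in decode),
        "var_decode_kv_tokens": pvariance([r.kv_tokens for r in decode]) if decode else 0.0,
    }
```

A batch with one short and one long prompt yields a large `var_prefill_length`, which is the heterogeneity signal the variance fields are meant to expose.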

3. Emit via ZMQ PUB/SUB (not Prometheus)

Publish the struct over a ZMQ PUB socket bound to a configurable localhost port, using msgpack serialization:

ZMQ message: [topic_bytes, sequence_bytes, msgpack_payload]

Why ZMQ over Prometheus:

| | ZMQ PUB/SUB | Prometheus |
|---|---|---|
| Delivery | Push, every iteration | Pull, scraper interval |
| Completeness | Every iteration captured | 90%+ iterations missed |
| Correlation | All fields from same iteration in one message | Gauges may reflect different iterations |
| Latency | ~10 µs per message (IPC) | HTTP round-trip per scrape |
| CPU overhead | Background thread, non-blocking send | Metric registry lock contention |
| Consumers | Multiple SUB sockets, zero-copy | One scraper endpoint |
| Format | Versioned, typed, extensible (msgspec) | Flat key-value gauges |

The ZMQ publisher runs in a background daemon thread (same pattern as vLLM's existing ZmqEventPublisher for KV cache events). The scheduler hot path only pays for queue.put_nowait() on a bounded queue -- no serialization, no I/O.
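The hot-path/background-thread split described above can be sketched with the standard library alone. `BackgroundPublisher` is a hypothetical class, not vLLM's `ZmqEventPublisher`. The `send` callable stands in for serialization plus the socket write, and dropping items when the queue is full is an assumed policy that the RFC does not spell out.

```python
import queue
import threading
from typing import Callable


class BackgroundPublisher:
    """Hypothetical publisher: bounded queue drained by a daemon thread."""

    _STOP = object()

    def __init__(self, send: Callable[[object], None], maxsize: int = 1024) -> None:
        self._send = send  # stand-in for "serialize and write to the socket"
        self._queue: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def publish(self, item: object) -> bool:
        """Hot path: enqueue without blocking; drop the item if the queue is full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1  # assumed policy: count and drop
            return False

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self._send(item)  # serialization and I/O happen off the hot path

    def close(self) -> None:
        """Flush queued items, then stop the thread."""
        self._queue.put(self._STOP)
        self._thread.join()
```

With this shape, a slow or absent consumer can only cost the caller a failed `put_nowait()`, never a blocked engine loop.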

Backward compatibility: Prometheus "most recent" gauges. For users who only need approximate metrics via existing Prometheus infrastructure, we can optionally expose the most recent ForwardPassMetrics as Prometheus gauges (updated in-place each iteration, scraped at whatever interval the collector uses). This is strictly less capable than the ZMQ stream but maintains compatibility with existing monitoring dashboards.

4. Data parallel support

Each DP rank runs its own EngineCore with its own scheduler. Each rank binds its own ZMQ PUB socket on base_port + dp_rank, emitting independent FPM streams tagged with dp_rank.

Attention DP (non-MoE): Each rank is fully independent (dp_size=1 locally). Each rank emits its own FPM stream. No cross-rank coordination needed -- the consumer (autoscaler, planner) subscribes to each rank's ZMQ port independently and aggregates as needed.

DP+EP (MoE): Each rank has its own scheduler and emits its own FPM. Although the GPU forward pass is synchronized across ranks via collectives (coordinate_batch_across_dp), each rank's wall_time is measured locally at its own future.result() boundary. The measurements are nearly identical across ranks (collectives force sync), so any rank's data is representative. Consumers can average or use rank 0's data.

This is the same approach used by KV cache events today: each DP rank publishes to its own ZMQ port, and the relay/consumer layer handles multi-rank aggregation outside the engine.

5. Activation

Controlled by a new engine argument:

--forward-pass-metrics-port PORT   # 0 = disabled (default), >0 = ZMQ PUB base port

For DP deployments, rank N binds on PORT + N. When enabled, the scheduler base class (or a thin mixin) handles metric extraction and ZMQ publishing. No subclass override needed -- this should work with any scheduler implementation.
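The port rule above (0 disables emission, rank N binds on PORT + N) fits in a few lines. `resolve_fpm_port` is a hypothetical helper name, and the range checks are added assumptions rather than behavior specified in the RFC.

```python
from typing import Optional


def resolve_fpm_port(base_port: int, dp_rank: int) -> Optional[int]:
    """Return the port a rank would bind, or None when metrics are disabled."""
    if base_port < 0 or dp_rank < 0:
        raise ValueError("base_port and dp_rank must be non-negative")
    if base_port == 0:
        return None  # 0 means disabled (the default)
    port = base_port + dp_rank
    if port > 65535:
        raise ValueError(f"port {port} exceeds the valid TCP range")
    return port
```

A consumer that knows the base port and the DP size can derive every rank's port with the same rule.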

6. Wire format and versioning

Serialization: msgpack via msgspec.msgpack.Encoder (same as KV cache events)
ZMQ multipart: [b"", seq.to_bytes(8, "big"), msgpack_payload]
- Empty topic allows future topic-based filtering
- 8-byte big-endian sequence number for ordering / gap detection
- msgpack payload is the serialized ForwardPassMetrics
Versioning: version field in the struct. Consumers must check version before interpreting fields. Bump on incompatible changes.
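The three-part framing and the gap detection it enables can be illustrated without a socket. In this sketch the payload is treated as opaque bytes (in the RFC it is the msgpack-encoded ForwardPassMetrics). `build_frames`, `parse_frames`, and `GapDetector` are hypothetical names, and the detector assumes the publisher increments the sequence number by one per message.

```python
from typing import Optional


def build_frames(seq: int, payload: bytes, topic: bytes = b"") -> list[bytes]:
    """Build the three-part message: topic, 8-byte big-endian sequence, payload."""
    return [topic, seq.to_bytes(8, "big"), payload]


def parse_frames(frames: list[bytes]) -> tuple[bytes, int, bytes]:
    """Split a received multipart message back into (topic, seq, payload)."""
    if len(frames) != 3 or len(frames[1]) != 8:
        raise ValueError("expected [topic, 8-byte seq, payload]")
    topic, seq_bytes, payload = frames
    return topic, int.from_bytes(seq_bytes, "big"), payload


class GapDetector:
    """Counts messages missing between consecutive sequence numbers."""

    def __init__(self) -> None:
        self._last: Optional[int] = None
        self.missed = 0

    def observe(self, seq: int) -> int:
        """Record seq and return how many messages were skipped before it."""
        gap = 0
        if self._last is not None and seq > self._last + 1:
            gap = seq - self._last - 1
        self.missed += gap
        self._last = seq
        return gap
```

A consumer would call `parse_frames()` on each received message and feed the sequence number to `GapDetector.observe()` to learn whether the PUB socket dropped anything in between.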

7. Implementation scope

| Component | Change |
|---|---|
| `EngineCore.step()` / `step_with_batch_queue()` | Measure `wall_time` around `future.result()`, pass to `update_from_output()` |
| `Scheduler.update_from_output()` | Accept optional `wall_time` kwarg |
| `SchedulerInterface` | New optional method `get_forward_pass_metrics()` or mixin |
| New: `ForwardPassMetrics` struct | In `vllm/v1/metrics/` or `vllm/v1/core/sched/` |
| New: `FpmPublisher` (ZMQ background thread) | Modeled after existing `ZmqEventPublisher` |
| `AsyncEngineArgs` | New `--forward-pass-metrics-port` argument |
| Optional: Prometheus stat logger | Expose most-recent FPM fields as gauges |

Feedback Period.

No response

CC List.

@alec-flowers

Any Other Things.

Reference implementation: NVIDIA Dynamo's InstrumentedScheduler implements this as an out-of-tree scheduler subclass with scheduler-side timing. Moving the timing into EngineCore and the ZMQ publisher into vLLM core would:

- Eliminate the need for --scheduler-cls overrides for metrics
- Provide accurate GPU timing (not scheduler-approximate)
- Allow any orchestration system (not just Dynamo) to consume per-iteration metrics
- Reuse existing ZMQ infrastructure from KV cache events

Existing ZMQ precedent in vLLM: The KV cache event system (KVEventsConfig, ZmqEventPublisher) already uses this exact pattern -- ZMQ PUB on localhost, msgpack serialization, background thread. Forward pass metrics would follow the same architecture.

Not in scope: How consumers (Dynamo, custom autoscalers, etc.) subscribe, relay, or aggregate these metrics. That is consumer-side logic. This RFC only covers emission from vLLM.


extent analysis

TL;DR

To address the need for per-iteration scheduler telemetry in vLLM, implement a solution that measures the GPU forward pass time in EngineCore and emits a compact, versioned metrics struct via ZMQ PUB/SUB.

Guidance

- Measure wall_time in EngineCore: Modify EngineCore.step() to measure the time around future.result() and pass this wall_time to update_from_output().
- Define ForwardPassMetrics struct: Create a versioned struct to hold per-iteration metrics, including wall_time, batch composition, and queue state.
- Emit metrics via ZMQ PUB/SUB: Implement a ZMQ publisher in a background thread to emit the ForwardPassMetrics struct, using msgpack serialization.
- Control emission with a new engine argument: Add --forward-pass-metrics-port to enable or disable metric emission, with the port number determining the base port for ZMQ PUB sockets.
- Ensure backward compatibility with Prometheus: Optionally expose the most recent ForwardPassMetrics as Prometheus gauges for users relying on existing monitoring infrastructure.

Example

import time
import msgspec

class ForwardPassMetrics(msgspec.Struct, frozen=True):
    ...  # fields as described in the issue

# In EngineCore.step() (fragment; future and scheduler_output come from the elided scheduling code):
t_start = time.monotonic()
model_output = future.result()   # blocks until the GPU finishes
wall_time = time.monotonic() - t_start
self.scheduler.update_from_output(scheduler_output, model_output, wall_time=wall_time)

Notes

This solution builds upon the existing ZMQ infrastructure used for KV cache events in vLLM.
The implementation scope includes modifications to EngineCore, Scheduler, and the introduction of a new ForwardPassMetrics struct and FpmPublisher.
Consumers of these metrics (e.g., autoscalers, custom planners) will need to subscribe to the ZMQ PUB socket and handle the metrics accordingly, which is outside the scope of this solution.

Recommendation

Adopt the proposed change and emit ForwardPassMetrics via ZMQ PUB/SUB. It provides more accurate and complete per-iteration scheduler telemetry than relying solely on Prometheus metrics.
