vllm - 💡(How to fix) Fix [RFC]: Adaptive throughput/latency profile for RL rollout long-tail [1 participants]

vllm2026-05-06 13:21:35

GitHub stats

vllm-project/vllm#41821•Fetched 2026-05-07 03:32:43

Author

aoshen02

Root Cause

DSV3.2 + MTP gets the full benefit immediately because its enforce_eager=True means no cudagraph dimension to handle. For non-V3.2 MTP models that want cudagraph adaptation, we defer to #32374's Full-CG infrastructure rather than building a parallel one.

[RFC]: Adaptive throughput/latency profile for RL rollout long-tail

Motivation.

RL rollouts have two clearly distinguishable phases on the same engine:

Front phase (throughput-bound): trainer submits 256–1024 requests at once; batch is large, compute is saturated. MTP / spec decode is negative-ROI here — extra propose() + verify() FLOPs cost wall time without paying back.
Tail phase (latency-bound): 1–10 long-tail decodes remain; batch is tiny, memory bandwidth is idle. This is exactly when spec decode pays off — but today's vLLM forces a single static profile for the whole rollout.

The closest existing work:

#32374 [V1][Spec Decode] Add Dynamic SD — varies K (number of draft tokens) per batch based on offline-profiled goodput. Fits Full-CG and async scheduling. Already directionally endorsed by maintainers.
#36657 [RFC]: Dynamic Speculation Length with Confidence-Threshold Early Exit — adapts length per-step via drafter confidence; complementary technique.
#39359 / #40662 Synthetic Acceptance Rate — measurement infrastructure that lets dynamic-SD policies be evaluated without retraining drafters.
#25112 [Bug]: Spec decoding is not disabled at/after configured batch size — same problem statement; reports the deleted disable_by_batch_size (commit aa08a30fc, PR #35060).

This RFC proposes the boundary case of #32374: when a batch-level signal indicates compute saturation, set K = 0 (skip speculator.propose() entirely) instead of just shrinking K. Same mechanism, same constraints, cleaner semantics for the K=0 edge.

Proposed Change.

Scope (intentionally narrow)

Batch-level only, not per-request. (Per-request dynamism is explicitly out of scope; the complexity-vs-value tradeoff doesn't justify it before the proposer interface matures post-MRV2.)
Async-scheduling-preserving, no exceptions. No GPU→CPU sync added in the decision path. The detector reads CPU-side scheduler stats already available on every step.
Reuse #32374's dispatch path when it lands. The toggle proposed here is the K=0 end of that spectrum; both should converge on the same BatchDescriptor-keyed Full-CG dispatch infrastructure that #32374 builds.

Profile semantics

| Profile | `speculator.propose()` | When |
| --- | --- | --- |
| THROUGHPUT | skipped | Large batch, prefill-heavy |
| LATENCY | called | Long-tail decode, small batch |

Three-level control

| Level | Mechanism | Trigger | Use |
| --- | --- | --- | --- |
| L0 | `RolloutPhaseDetector.observe()` in `EngineCore.step()` | vLLM internal | Default path |
| L1 | `collective_rpc("set_rollout_phase_override", profile)` | Trainer (e.g. verl) | Pin during weight sync, debugging |
| L2 | `compilation.adaptive_profile_enabled: bool = False` | Startup config | Off ⇒ today's static behavior |

L1 takes precedence over L0; while overridden, the detector keeps observing but does not flip.
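
The precedence between the three levels can be sketched as a single pure function. This is only an illustration: `effective_profile` and its parameter names are hypothetical, and it assumes that "today's static behavior" with a speculator configured means `propose()` is always called.

```python
from enum import Enum
from typing import Optional


class Profile(Enum):
    THROUGHPUT = "throughput"  # speculator.propose() skipped
    LATENCY = "latency"        # speculator.propose() called


def effective_profile(
    adaptive_enabled: bool,       # L2: startup config switch
    override: Optional[Profile],  # L1: trainer pin, None when unset
    detected: Profile,            # L0: detector's current state
    static_profile: Profile = Profile.LATENCY,  # assumption: static = always propose
) -> Profile:
    """Resolve which profile the engine should act on for this step."""
    if not adaptive_enabled:
        # L2 off: ignore both the detector and any override.
        return static_profile
    if override is not None:
        # L1 wins over L0 while a pin is active.
        return override
    return detected
```

With the feature enabled, `effective_profile(True, Profile.THROUGHPUT, Profile.LATENCY)` resolves to `Profile.THROUGHPUT`, which is the behavior a trainer would rely on when pinning around a weight sync.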

Detector

Hook point: vllm/v1/engine/core.py:391, after scheduler.schedule() and before execute_model. Inputs are CPU-side fields already populated for the step: len(scheduler.running), len(scheduler.waiting), scheduler_output.total_num_scheduled_tokens. No GPU reads.

Default state machine (tunable):

THROUGHPUT (initial)
    │  for 5 consecutive steps:
    │    num_running / max_num_seqs < 0.20  AND
    │    median(scheduled_tokens / num_running over last 5) < 5.0
    ▼
  LATENCY
    │  for 3 consecutive steps:  num_running / max_num_seqs > 0.40
    │  OR num_waiting > 0  (incoming work ⇒ throughput immediately)
    ▼
THROUGHPUT

wake_up() resets the detector to THROUGHPUT (post-weight-update phases start fresh).
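
A minimal, CPU-only sketch of the state machine above, using the RFC's default numbers. It is not the proposed implementation: the constructor signature, the `override` attribute, and the private fields are hypothetical. Two behaviors are assumptions not stated in the RFC: idle steps (`num_running == 0`) and steps with queued work never count toward entering LATENCY, and `reset()` leaves any override in place.

```python
from collections import deque
from enum import Enum
from statistics import median
from typing import Optional


class Profile(Enum):
    THROUGHPUT = "throughput"  # speculator.propose() skipped
    LATENCY = "latency"        # speculator.propose() called


class RolloutPhaseDetector:
    """Toy hysteresis detector driven only by CPU-side scheduler stats."""

    def __init__(self, max_num_seqs: int, enter_ratio: float = 0.20,
                 exit_ratio: float = 0.40, tokens_per_req: float = 5.0,
                 window: int = 5, enter_steps: int = 5, exit_steps: int = 3):
        self.max_num_seqs = max_num_seqs
        self.enter_ratio = enter_ratio
        self.exit_ratio = exit_ratio
        self.tokens_per_req = tokens_per_req
        self.enter_steps = enter_steps
        self.exit_steps = exit_steps
        self.override: Optional[Profile] = None  # L1 pin; None means unpinned
        self._tokens: deque = deque(maxlen=window)  # tokens/request history
        self.reset()

    def reset(self) -> None:
        """Return to the initial state, e.g. when called from wake_up()."""
        self.profile = Profile.THROUGHPUT
        self._tokens.clear()
        self._streak = 0

    def observe(self, num_running: int, num_waiting: int,
                scheduled_tokens: int) -> Profile:
        occupancy = num_running / self.max_num_seqs
        if num_running > 0:
            self._tokens.append(scheduled_tokens / num_running)

        if self.profile is Profile.THROUGHPUT:
            wants_flip = (
                num_running > 0          # assumption: idle steps do not count
                and num_waiting == 0     # assumption: queued work blocks entry
                and occupancy < self.enter_ratio
                and median(self._tokens) < self.tokens_per_req
            )
            needed, target, immediate = self.enter_steps, Profile.LATENCY, False
        else:
            wants_flip = occupancy > self.exit_ratio
            needed, target = self.exit_steps, Profile.THROUGHPUT
            immediate = num_waiting > 0  # incoming work: leave LATENCY at once

        # Consecutive-step debounce: any miss restarts the count.
        self._streak = self._streak + 1 if wants_flip else 0

        # While pinned, keep observing but never flip.
        if self.override is None and (immediate or self._streak >= needed):
            self.profile = target
            self._streak = 0
        return self.override if self.override is not None else self.profile
```

With `max_num_seqs=100`, five consecutive calls to `observe(10, 0, 10)` move the detector to LATENCY on the fifth call, and three consecutive calls to `observe(50, 0, 50)` move it back. Because the tokens-per-request signal is a rolling median, a single prefill-heavy step inside the window does not restart the entry count.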

Why this preserves async scheduling

KV cache and token budgets are demand-driven: request.num_tokens_with_spec collapses to num_tokens when spec_token_ids is empty.
In async mode, draft tokens flow via _copy_draft_token_ids_to_cpu() → update_async_spec_token_ids() directly into sampling_metadata.spec_token_ids for the rejection sampler — not through request.spec_token_ids.
AsyncScheduler._update_after_schedule() uses [-1] placeholders, not budget commitments.
The only coupling is num_output_placeholders += 1 + cur_num_spec_tokens. After a LATENCY → THROUGHPUT flip, the next step still increments by 1 + K (using already-scheduled stale spec tokens). This is one step of placeholder over-count, self-correcting on the following step — behavioral overshoot, not a correctness bug.
Commits fafe76b4a ("Zero-bubble async scheduling + spec decoding") and 711edaf0d already established the contract this RFC operates within.

No scheduler changes. No GPU sync. No per-request branching.
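
The one-step overshoot can be illustrated with a toy model of the placeholder counter. This is an assumption-laden sketch, not scheduler code: `placeholder_increments` is a hypothetical helper, and it assumes that drafts proposed on one step are consumed on the next, so the first step on which `propose()` is skipped still carries K stale draft tokens.

```python
def placeholder_increments(spec_enabled_per_step: list[bool], k: int) -> list[int]:
    """Per-step value of `1 + cur_num_spec_tokens` in a toy one-step pipeline.

    spec_enabled_per_step[t] is True when propose() runs on step t.
    """
    increments = []
    pending_drafts = 0  # drafts produced by the previous step's propose()
    for enabled in spec_enabled_per_step:
        # This step consumes whatever the previous step proposed.
        increments.append(1 + pending_drafts)
        # Skipping propose() leaves nothing for the next step to consume.
        pending_drafts = k if enabled else 0
    return increments


# propose() runs on steps 0-1 and is skipped from step 2 on (flip at step 2).
print(placeholder_increments([True, True, False, False, False], k=3))
# [1, 4, 4, 1, 1]
```

Step 2 still increments by 1 + K = 4 even though the flip has happened, and steps 3 and 4 fall back to 1. That single extra `K` is the over-count described above.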

Implementation footprint (Stage 1)

| File | Change |
| --- | --- |
| `vllm/config/compilation.py` | Add `adaptive_profile_enabled: bool = False` |
| `vllm/config/speculative.py` | In `__post_init__`, set `runtime_skip_supported` per method (mtp/eagle/eagle3/ngram = True; suffix = False conservatively) |
| `vllm/v1/engine/rollout_phase_detector.py` | New file: `RolloutPhaseDetector` |
| `vllm/v1/engine/core.py` | Construct detector; observe in `step()`; reset in `wake_up()`; expose `set_rollout_phase_override` RPC |
| `vllm/v1/worker/gpu_model_runner.py` | Add `_spec_decode_runtime_enabled`; new RPC `set_optimization_profile`; guard at `:4209` |
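
A rough sketch of how the worker-side pieces in the table might fit together. `SpecDecodeGate` and `maybe_propose` are hypothetical names for illustration; only `runtime_skip_supported`, `_spec_decode_runtime_enabled`, and `set_optimization_profile` come from the table. It assumes that a skipped step yields an empty draft list and that unknown methods default to unsupported.

```python
from typing import Callable, List

# Per-method defaults as proposed in the RFC (suffix is conservative).
RUNTIME_SKIP_SUPPORTED = {
    "mtp": True,
    "eagle": True,
    "eagle3": True,
    "ngram": True,
    "suffix": False,
}


class SpecDecodeGate:
    """Stand-in for the flag and guard the RFC adds to the model runner."""

    def __init__(self, method: str):
        # Assumption: methods not listed above are treated as unsupported.
        self.runtime_skip_supported = RUNTIME_SKIP_SUPPORTED.get(method, False)
        self._spec_decode_runtime_enabled = True

    def set_optimization_profile(self, profile: str) -> None:
        """RPC target: 'throughput' disables drafting, 'latency' enables it."""
        if not self.runtime_skip_supported:
            return  # method cannot be skipped at runtime; keep static behavior
        self._spec_decode_runtime_enabled = profile == "latency"

    def maybe_propose(self, propose: Callable[[], List[int]]) -> List[int]:
        """Guard around the propose() call site."""
        if not self._spec_decode_runtime_enabled:
            return []  # K = 0: no draft tokens for the next step
        return propose()
```

Keeping the decision in one boolean read on the CPU is what lets the guard stay free of GPU synchronization: the flag is written by the RPC and only read at the call site.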

Deployment topology (context, not a vLLM change)

For users who do not yet have Stage 1, or who want stronger isolation, two purely deployment-level patterns work today with zero vLLM changes:

Multi-instance + router — separate throughput/latency pools on different GPUs; route by request type or rollout phase. Mature routers (SGLang router / Envoy) handle the routing.
Two co-located processes + sleep/wake — independent vLLM processes on the same GPU set, only one awake at a time, swap at rollout step boundaries via sleep(level=2) / wake_up(). Each process has its own CuMemAllocator. Wake latency is dominated by weight H2D over PCIe; for enforce_eager=True models there is no cudagraph re-capture cost. Constraint: swap requires in-flight = 0, so this pattern handles step-boundary phase shifts but not within-step adaptation — that remains Stage 1's job.

These patterns are recipes, not proposals — they may be worth documenting in a follow-up but they are not part of this RFC's request to vLLM.
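
As a sketch of the second recipe, the swap at a rollout step boundary might look like the following. `SleepableEngine` and `swap_active` are hypothetical names; the only assumption about the engines is that each exposes `sleep(level=...)` and `wake_up()`, however those calls are actually delivered to the two processes.

```python
from typing import Protocol, Tuple


class SleepableEngine(Protocol):
    """Minimal surface the recipe needs from each engine process."""

    def sleep(self, level: int = 1) -> None: ...
    def wake_up(self) -> None: ...


def swap_active(
    current: SleepableEngine,
    standby: SleepableEngine,
    num_in_flight: int,
    level: int = 2,
) -> Tuple[SleepableEngine, SleepableEngine]:
    """Put `current` to sleep and wake `standby`; returns (active, standby)."""
    if num_in_flight != 0:
        # The recipe only works at step boundaries with nothing in flight.
        raise RuntimeError("swap requires in-flight = 0")
    # Sleep first so the two processes never hold GPU memory at the same time.
    current.sleep(level=level)
    standby.wake_up()
    return standby, current
```

The explicit in-flight check encodes the constraint stated above: this pattern covers phase shifts at step boundaries, not adaptation within a step.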

Rollout

Stage 1 (this RFC) — Detector + override RPC + propose() skip. Five-file change. Zero memory delta. No scheduler changes. Useful immediately for DSV3.2 + MTP.
Stage 2 — Converge the detector's K=0 / K=K_max decision onto the same BatchDescriptor-keyed Full-CG dispatch path that #32374 introduces, so dynamic SD with K ∈ {0, …, K_max} and the on/off toggle share infrastructure.
Stage 3 — Metrics (current_profile gauge, flip counter), verl integration recipe (pin THROUGHPUT around weight updates).

Open questions

Is the detector's hook point in EngineCore.step() acceptable, or would the SIG prefer it as a plugin / external scheduler hook?
Detector defaults (0.20 enter / 0.40 exit / window 5 / debounce 5,3 / tokens-per-req threshold 5.0) — reasonable across model sizes, or per-model defaults?
runtime_skip_supported — defaulting to True for mtp/eagle/eagle3/ngram, False for suffix. Looking for sign-off from method maintainers.
Async overshoot at flip: 1 step of placeholder over-count is acceptable to us. Is that acceptable to the SIG, or is strict zero-overshoot required?
Stage-2 convergence with #32374: should this RFC wait for #32374 to land, or proceed with a temporary skip path that #32374 can later subsume?

Verification

Unit: synthetic step traces for the detector — assert hysteresis, num_waiting > 0 short-circuit, override pinning, median window vs single-step outliers. No torch sync in the detector path (verified via torch.cuda.set_sync_debug_mode).
Integration: DSV3.2 + MTP, submit 256 prompts (one with max_tokens=8192, rest ≤32). Assert (a) flip log around long-tail; (b) outputs bit-identical to non-adaptive baseline at the same seed; (c) tail-segment TTLT_p99 reduction. Re-run with async_scheduling=True for parity.
Benchmark: H200, Pareto output-length workload, 4 arms (always-on / always-off / adaptive / static-router). Expect adaptive to match always-off on the head and always-on on the tail.
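
The integration workload can be generated deterministically so that the adaptive and baseline runs see the same requests. A minimal sketch, assuming short budgets are drawn uniformly from 1 to 32 and the long request lands at a random position; `build_long_tail_budgets` is a hypothetical helper and returns plain `max_tokens` values rather than engine-specific sampling objects.

```python
import random
from typing import List


def build_long_tail_budgets(
    num_prompts: int = 256,
    long_max_tokens: int = 8192,
    short_cap: int = 32,
    seed: int = 0,
) -> List[int]:
    """Return one max_tokens budget per prompt: one long tail, rest short."""
    rng = random.Random(seed)  # local RNG so the trace is reproducible
    budgets = [rng.randint(1, short_cap) for _ in range(num_prompts - 1)]
    # Place the single long request at a random position in the batch.
    budgets.insert(rng.randrange(num_prompts), long_max_tokens)
    return budgets
```

Reusing the same seed for both arms keeps the comparison in (b) meaningful: any output difference then comes from the engine, not from the workload.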

Risks

| Risk | Mitigation |
| --- | --- |
| Detector flapping | 5-in / 3-out hysteresis; `num_waiting > 0` short-circuit; flip-count metric |
| Mis-flip during weight update | Trainer pins THROUGHPUT before `start_weight_update`; `wake_up` resets detector |
| 1-step async overshoot | Documented as acceptable; if not, observation point can move ahead of `schedule()` |
| Divergence from #32374 | Stage 2 explicitly converges on shared dispatch infrastructure |

Out of scope

Per-request profile dynamism.
Speculator unload/reload (MTP heads are part of the model weight set).
A cuda_graph tag for sleep/wake (would help the deployment-topology recipes; separate RFC).

Feedback Period.

Two weeks from posting.

CC List.

@benchislett @LucasWilkinson @ekagra-ranjan @jmamou — flagging because this RFC is positioned as a boundary case of the dynamic-SD work in #32374 / #36657 / #39359 / #40662, and async-scheduling preservation is an explicit hard constraint shaped by your prior discussion. Happy to fold this into either of those threads if that's preferred over a separate RFC.

Any Other Things.

AI assistance was used in drafting. The submitter understands every claim and code path referenced and will defend the design end-to-end.
Empirical numbers (DSV3.2 + MTP on H200) will be posted as a follow-up comment.
