vllm - 💡(How to fix) Fix [RFC]: Adaptive throughput/latency profile for RL rollout long-tail [1 participants]

vllm-project/vllm#41821 (fetched 2026-05-07 03:32:43)

[RFC]: Adaptive throughput/latency profile for RL rollout long-tail

Motivation.

RL rollouts have two clearly distinguishable phases on the same engine:

  • Front phase (throughput-bound): trainer submits 256–1024 requests at once; batch is large, compute is saturated. MTP / spec decode is negative-ROI here — extra propose() + verify() FLOPs cost wall time without paying back.
  • Tail phase (latency-bound): 1–10 long-tail decodes remain; batch is tiny, memory bandwidth is idle. This is exactly when spec decode pays off — but today's vLLM forces a single static profile for the whole rollout.

The closest existing work:

  • #32374 [V1][Spec Decode] Add Dynamic SD — varies K (number of draft tokens) per batch based on offline-profiled goodput. Fits Full-CG and async scheduling. Already directionally endorsed by maintainers.
  • #36657 [RFC]: Dynamic Speculation Length with Confidence-Threshold Early Exit — adapts length per-step via drafter confidence; complementary technique.
  • #39359 / #40662 Synthetic Acceptance Rate — measurement infrastructure that lets dynamic-SD policies be evaluated without retraining drafters.
  • #25112 [Bug]: Spec decoding is not disabled at/after configured batch size — same problem statement; reports the deleted disable_by_batch_size (commit aa08a30fc, PR #35060).

This RFC proposes the boundary case of #32374: when a batch-level signal indicates compute saturation, set K = 0 (skip speculator.propose() entirely) instead of just shrinking K. Same mechanism, same constraints, cleaner semantics for the K=0 edge.

Proposed Change.

Scope (intentionally narrow)

  1. Batch-level only, not per-request. (Per-request dynamism is explicitly out of scope; the complexity-vs-value tradeoff doesn't justify it before the proposer interface matures post-MRV2.)
  2. Async-scheduling-preserving, no exceptions. No GPU→CPU sync added in the decision path. The detector reads CPU-side scheduler stats already available on every step.
  3. Reuse #32374's dispatch path when it lands. The toggle proposed here is the K=0 end of that spectrum; both should converge on the same BatchDescriptor-keyed Full-CG dispatch infrastructure that #32374 builds.

Profile semantics

| Profile | speculator.propose() | When |
|---|---|---|
| THROUGHPUT | skipped | Large batch, prefill-heavy |
| LATENCY | called | Long-tail decode, small batch |

Three-level control

| Level | Mechanism | Trigger | Use |
|---|---|---|---|
| L0 | RolloutPhaseDetector.observe() in EngineCore.step() | vLLM internal | Default path |
| L1 | collective_rpc("set_rollout_phase_override", profile) | Trainer (e.g. verl) | Pin during weight sync, debugging |
| L2 | compilation.adaptive_profile_enabled: bool = False | Startup config | Off ⇒ today's static behavior |

L1 takes precedence over L0; while overridden, the detector keeps observing but does not flip.
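The precedence rules above can be sketched as a small resolver. This is a minimal illustration, not the proposed implementation: `Profile`, `resolve_profile`, and `should_propose` are hypothetical names, and the sketch assumes that "today's static behavior" means propose() is always called when speculative decoding is configured.

```python
from enum import Enum
from typing import Optional


class Profile(Enum):
    THROUGHPUT = "throughput"
    LATENCY = "latency"


def resolve_profile(adaptive_enabled: bool,
                    override: Optional[Profile],
                    detected: Profile) -> Optional[Profile]:
    """Pick the effective profile from the three control levels."""
    if not adaptive_enabled:
        # L2 off: no adaptive profile at all (static behavior).
        return None
    if override is not None:
        # L1: a trainer-pinned override wins over the detector.
        return override
    # L0: fall back to whatever the detector currently reports.
    return detected


def should_propose(profile: Optional[Profile]) -> bool:
    """Assumption: only THROUGHPUT skips propose(); static mode always proposes."""
    return profile is not Profile.THROUGHPUT
```

Keeping the override separate from the detector state is one way to let the detector keep observing while pinned: the detector's own state is untouched and simply ignored until the override is cleared.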

Detector

Hook point: vllm/v1/engine/core.py:391, after scheduler.schedule() and before execute_model. Inputs are CPU-side fields already populated for the step: len(scheduler.running), len(scheduler.waiting), scheduler_output.total_num_scheduled_tokens. No GPU reads.

Default state machine (tunable):

THROUGHPUT (initial)
    │  for 5 consecutive steps:
    │    num_running / max_num_seqs < 0.20  AND
    │    median(scheduled_tokens / num_running over last 5) < 5.0
    ▼
LATENCY
    │  for 3 consecutive steps:  num_running / max_num_seqs > 0.40
    │  OR num_waiting > 0  (incoming work ⇒ throughput immediately)
    ▼
THROUGHPUT

wake_up() resets the detector to THROUGHPUT (post-weight-update phases start fresh).
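A minimal sketch of the state machine above, assuming the default thresholds listed in this RFC. The class and method names follow the RFC, but the body is illustrative only: the constructor parameters, the shared `_streak` counter, and the handling of an empty running queue are assumptions. All inputs are plain Python integers, matching the "no GPU reads" constraint.

```python
from collections import deque
from enum import Enum
from statistics import median


class Profile(Enum):
    THROUGHPUT = "throughput"
    LATENCY = "latency"


class RolloutPhaseDetector:
    """Toy batch-level detector with hysteresis; inputs are CPU-side ints."""

    def __init__(self, max_num_seqs: int, enter_ratio: float = 0.20,
                 exit_ratio: float = 0.40, enter_steps: int = 5,
                 exit_steps: int = 3, window: int = 5,
                 tokens_per_req: float = 5.0) -> None:
        self.max_num_seqs = max_num_seqs
        self.enter_ratio = enter_ratio
        self.exit_ratio = exit_ratio
        self.enter_steps = enter_steps
        self.exit_steps = exit_steps
        self.tokens_per_req = tokens_per_req
        self._tokens = deque(maxlen=window)
        self.reset()

    def reset(self) -> None:
        """Return to the initial state, as wake_up() is described to do."""
        self.profile = Profile.THROUGHPUT
        self._streak = 0
        self._tokens.clear()

    def _flip(self, profile: Profile) -> None:
        self.profile = profile
        self._streak = 0

    def observe(self, num_running: int, num_waiting: int,
                scheduled_tokens: int) -> Profile:
        ratio = num_running / self.max_num_seqs
        # Assumption: an empty running queue counts as 0 tokens per request.
        self._tokens.append(scheduled_tokens / max(num_running, 1))

        if self.profile is Profile.THROUGHPUT:
            quiet = (ratio < self.enter_ratio
                     and median(self._tokens) < self.tokens_per_req)
            self._streak = self._streak + 1 if quiet else 0
            if self._streak >= self.enter_steps:
                self._flip(Profile.LATENCY)
        elif num_waiting > 0:
            # Incoming work: return to THROUGHPUT without debouncing.
            self._flip(Profile.THROUGHPUT)
        else:
            busy = ratio > self.exit_ratio
            self._streak = self._streak + 1 if busy else 0
            if self._streak >= self.exit_steps:
                self._flip(Profile.THROUGHPUT)
        return self.profile
```

Because the tokens-per-request signal is a median over the window, a single step with an unusually large scheduled-token count does not by itself break the 5-step entry streak, while a drop in occupancy alone is not enough to enter LATENCY if the batch is still prefill-heavy.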

Why this preserves async scheduling

  1. KV cache and token budgets are demand-driven: request.num_tokens_with_spec collapses to num_tokens when spec_token_ids is empty.
  2. In async mode, draft tokens flow via _copy_draft_token_ids_to_cpu() → update_async_spec_token_ids() directly into sampling_metadata.spec_token_ids for the rejection sampler, not through request.spec_token_ids.
  3. AsyncScheduler._update_after_schedule() uses [-1] placeholders, not budget commitments.
  4. The only coupling is num_output_placeholders += 1 + cur_num_spec_tokens. After a LATENCY → THROUGHPUT flip, the next step still increments by 1 + K (using already-scheduled stale spec tokens). This is one step of placeholder over-count, self-correcting on the following step — behavioral overshoot, not a correctness bug.
  5. PRs fafe76b4a ("Zero-bubble async scheduling + spec decoding") and 711edaf0d already established the contract this RFC operates within.

No scheduler changes. No GPU sync. No per-request branching.

Implementation footprint (Stage 1)

| File | Change |
|---|---|
| vllm/config/compilation.py | Add adaptive_profile_enabled: bool = False |
| vllm/config/speculative.py | In __post_init__, set runtime_skip_supported per method (mtp/eagle/eagle3/ngram = True; suffix = False conservatively) |
| vllm/v1/engine/rollout_phase_detector.py | New file: RolloutPhaseDetector |
| vllm/v1/engine/core.py | Construct detector; observe in step(); reset in wake_up(); expose set_rollout_phase_override RPC |
| vllm/v1/worker/gpu_model_runner.py | Add _spec_decode_runtime_enabled; new RPC set_optimization_profile; guard at :4209 |

DSV3.2 + MTP gets the full benefit immediately because its enforce_eager=True means no cudagraph dimension to handle. For non-V3.2 MTP models that want cudagraph adaptation, we defer to #32374's Full-CG infrastructure rather than building a parallel one.
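The worker-side guard could look roughly like the following. This is a sketch under stated assumptions: `ToyRunner` and `RUNTIME_SKIP_SUPPORTED` are hypothetical stand-ins, the profile is passed as a plain string, and only the names `_spec_decode_runtime_enabled` and `set_optimization_profile` come from the table above. It does not reflect the actual code in gpu_model_runner.py.

```python
from typing import Callable, List

# Per-method defaults as proposed in this RFC (suffix is conservative).
RUNTIME_SKIP_SUPPORTED = {
    "mtp": True,
    "eagle": True,
    "eagle3": True,
    "ngram": True,
    "suffix": False,
}


class ToyRunner:
    """Stand-in for the model runner; holds only the skip flag and a proposer."""

    def __init__(self, method: str,
                 proposer: Callable[[object], List[int]]) -> None:
        self.method = method
        self.proposer = proposer
        self._spec_decode_runtime_enabled = True

    def set_optimization_profile(self, profile: str) -> bool:
        """Apply a profile; return False if the method cannot skip at runtime."""
        if not RUNTIME_SKIP_SUPPORTED.get(self.method, False):
            return False  # leave the static behavior untouched
        self._spec_decode_runtime_enabled = (profile == "latency")
        return True

    def propose_draft_tokens(self, context: object) -> List[int]:
        """Guard: K = 0 means propose() is not called at all."""
        if not self._spec_decode_runtime_enabled:
            return []
        return self.proposer(context)
```

Returning an empty draft list is what lets the demand-driven budgets described earlier collapse naturally: with no spec tokens, nothing extra is scheduled for verification.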

Deployment topology (context, not a vLLM change)

For users who cannot yet adopt Stage 1 or who want stronger isolation, two deployment-only patterns work today with zero vLLM changes:

  • Multi-instance + router — separate throughput/latency pools on different GPUs; route by request type or rollout phase. Mature routers (SGLang router / Envoy) handle the routing.
  • Two co-located processes + sleep/wake — independent vLLM processes on the same GPU set, only one awake at a time, swap at rollout step boundaries via sleep(level=2) / wake_up(). Each process has its own CuMemAllocator. Wake latency is dominated by weight H2D over PCIe; for enforce_eager=True models there is no cudagraph re-capture cost. Constraint: swap requires in-flight = 0, so this pattern handles step-boundary phase shifts but not within-step adaptation — that remains Stage 1's job.

These patterns are recipes, not proposals — they may be worth documenting in a follow-up but they are not part of this RFC's request to vLLM.
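As an illustration of the sleep/wake recipe's constraint, the sketch below models the swap with stub objects. `StubEngine` and `swap` are hypothetical; a real deployment would drive two separate vLLM processes, and the stubs only track which one is awake and how many requests are in flight.

```python
class StubEngine:
    """Stand-in for one vLLM process; tracks awake state and in-flight count."""

    def __init__(self, name: str, awake: bool) -> None:
        self.name = name
        self.awake = awake
        self.in_flight = 0

    def sleep(self, level: int = 2) -> None:
        self.awake = False

    def wake_up(self) -> None:
        self.awake = True


def swap(active: StubEngine, standby: StubEngine):
    """Swap which engine is awake; only legal at a step boundary."""
    if active.in_flight != 0:
        raise RuntimeError("swap requires in-flight = 0")
    active.sleep(level=2)
    standby.wake_up()
    # The roles are exchanged: the former standby is now the active engine.
    return standby, active
```

The in-flight check is why this pattern covers phase shifts at rollout step boundaries only; adapting while requests are still decoding needs the in-engine toggle.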

Rollout

  1. Stage 1 (this RFC) — Detector + override RPC + propose() skip. Five-file change. Zero memory delta. No scheduler changes. Useful immediately for DSV3.2 + MTP.
  2. Stage 2 — Converge the detector's K=0 / K=K_max decision onto the same BatchDescriptor-keyed Full-CG dispatch path that #32374 introduces, so dynamic SD with K ∈ {0, …, K_max} and the on/off toggle share infrastructure.
  3. Stage 3 — Metrics (current_profile gauge, flip counter), verl integration recipe (pin THROUGHPUT around weight updates).

Open questions

  1. Is the detector's hook point in EngineCore.step() acceptable, or would the SIG prefer it as a plugin / external scheduler hook?
  2. Detector defaults (0.20 enter / 0.40 exit / window 5 / debounce 5,3 / tokens-per-req threshold 5.0) — reasonable across model sizes, or per-model defaults?
  3. runtime_skip_supported — defaulting to True for mtp/eagle/eagle3/ngram, False for suffix. Looking for sign-off from method maintainers.
  4. Async overshoot at flip: 1 step of placeholder over-count is acceptable to us. Is that acceptable to the SIG, or is strict zero-overshoot required?
  5. Stage-2 convergence with #32374: should this RFC wait for #32374 to land, or proceed with a temporary skip path that #32374 can later subsume?

Verification

  • Unit: synthetic step traces for the detector — assert hysteresis, num_waiting > 0 short-circuit, override pinning, median window vs single-step outliers. No torch sync in the detector path (verified via torch.cuda.set_sync_debug_mode).
  • Integration: DSV3.2 + MTP, submit 256 prompts (one with max_tokens=8192, rest ≤32). Assert (a) flip log around long-tail; (b) outputs bit-identical to non-adaptive baseline at the same seed; (c) tail-segment TTLT_p99 reduction. Re-run with async_scheduling=True for parity.
  • Benchmark: H200, Pareto output-length workload, 4 arms (always-on / always-off / adaptive / static-router). Expect adaptive to match always-off on the head and always-on on the tail.

Risks

| Risk | Mitigation |
|---|---|
| Detector flapping | 5-in / 3-out hysteresis; num_waiting > 0 short-circuit; flip-count metric |
| Mis-flip during weight update | Trainer pins THROUGHPUT before start_weight_update; wake_up resets detector |
| 1-step async overshoot | Documented as acceptable; if not, observation point can move ahead of schedule() |
| Divergence from #32374 | Stage 2 explicitly converges on shared dispatch infrastructure |

Out of scope

  • Per-request profile dynamism.
  • Speculator unload/reload (MTP heads are part of the model weight set).
  • A cuda_graph tag for sleep/wake (would help the deployment-topology recipes; separate RFC).

Feedback Period.

Two weeks from posting.

CC List.

@benchislett @LucasWilkinson @ekagra-ranjan @jmamou — flagging because this RFC is positioned as a boundary case of the dynamic-SD work in #32374 / #36657 / #39359 / #40662, and async-scheduling preservation is an explicit hard constraint shaped by your prior discussion. Happy to fold this into either of those threads if that's preferred over a separate RFC.

Any Other Things.

  • AI assistance was used in drafting. The submitter understands every claim and code path referenced and will defend the design end-to-end.
  • Empirical numbers (DSV3.2 + MTP on H200) will be posted as a follow-up comment.
