vllm - ✅(Solved) Fix [Bug]: n_completions + logprobs Causes Significant TTFT Spike for Co-Scheduled Requests on Cold Cache [1 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#37343 (fetched 2026-04-08 00:53:25)
Comments: 2, Participants: 2, Timeline events: 5, Reactions: 0
Timeline (top): commented ×2, cross-referenced ×1, labeled ×1, subscribed ×1

Root Cause

This is a different root cause from the chunked-prefill head-of-line blocking bug #37308. Enabling --enable-chunked-prefill does not help here. The regression occurs entirely in the decode phase and is caused by per-step compute amplification, not by prefill scheduling order.

Fix Action

Fixed

PR fix notes

PR #37594: [Bugfix] fixd issue#37343: prevent TTFT regression by adding batched logprobs budget to scheduler

Description (problem / solution / changelog)

Purpose

Fixes the extreme TTFT regression (Noisy Neighbor effect) in the V1 engine when requests with high n (completions) and logprobs are co-scheduled.

Problem: The current V1 scheduler batches requests based on token and sequence counts but ignores the compute overhead of the sampling stage. For models with large vocabularies such as Qwen2.5 (~151k tokens), a request with logprobs=20 and n=8 requires a massive Top-K sort across the full vocabulary for every sequence in every decode iteration.

Since batch execution is synchronous, a single "heavy" sampling request forces all other co-scheduled requests (even simple 16-token plain requests) to wait for its completion at every step. This leads to a 76x–423x TTFT regression for innocent "victim" requests.

Solution: Introduced a max_num_batched_logprobs budget in SchedulerConfig and the V1 Scheduler.

  • It tracks the cumulative compute cost of $n \times \text{logprobs}$ for all requests in a single batch.
  • It isolates heavy sampling tasks by deferring requests that would exceed the budget to the next batch.
  • This ensures that plain requests are not stalled by a single compute-intensive "amplifier" request.
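
The budgeting idea in the bullets above can be sketched in a few lines. This is a minimal illustration, not the code from PR #37594: `Req`, `logprob_cost`, and `split_by_budget` are hypothetical names, and the rule that an over-budget request may still run when the batch is otherwise empty is an assumption added here so the sketch cannot starve a request. Only the cost formula (n × logprobs) and the `max_num_batched_logprobs` name come from the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical stand-in for a scheduler request (not a vLLM class)."""
    label: str
    n: int
    logprobs: int | None


def logprob_cost(req: Req) -> int:
    # Cost model from the PR description: n x logprobs (0 when logprobs is unset).
    return req.n * (req.logprobs or 0)


def split_by_budget(waiting: list[Req], max_num_batched_logprobs: int):
    """Split waiting requests into (scheduled, deferred) for one batch."""
    if max_num_batched_logprobs <= 0:
        # A budget of 0 disables the check, as the PR notes describe.
        return list(waiting), []
    scheduled: list[Req] = []
    deferred: list[Req] = []
    used = 0
    for req in waiting:
        cost = logprob_cost(req)
        # Assumption: an oversized request is admitted only into an empty batch,
        # so it runs in isolation instead of waiting forever.
        if used + cost <= max_num_batched_logprobs or not scheduled:
            scheduled.append(req)
            used += cost
        else:
            deferred.append(req)
    return scheduled, deferred


# The five requests from the trace in this issue.
trace = [
    Req("r9", 1, None),
    Req("r06", 8, 20),
    Req("r01", 1, 20),
    Req("r03", 1, 20),
    Req("r08", 1, 20),
]
scheduled, deferred = split_by_budget(trace, max_num_batched_logprobs=100)
print("scheduled:", [r.label for r in scheduled])
print("deferred :", [r.label for r in deferred])
```

With a budget of 100, the amplifier r06 (cost 8 × 20 = 160) is deferred, while r9 and the three n=1 logprobs requests (combined cost 60) run together.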

Fixes #37343

Test Plan

  1. Environment: NVIDIA RTX A40, CUDA 12.2, Model: Qwen/Qwen2.5-0.5B-Instruct.
  2. Reproduction: Run the fuzzing reproduction script (repro.py) provided in the issue with max_num_batched_logprobs=100.
  3. Comparison: Compare the TTFT of the victim request (r9) and overall batch performance before and after the fix across 10 independent runs.

Test Result

Using max_num_batched_logprobs=100, the extreme latency spikes were successfully mitigated.

Metric                Before Fix (Baseline)   After Fix (PR)   Improvement
Batch p99 TTFT        9838.8 ms               295.2 ms         33.3x
Victim r9 Mean TTFT   1019.0 ms               53.7 ms          18.9x
Victim r9 p99 TTFT    9761.9 ms               65.2 ms          149.7x
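
The improvement factors can be rechecked from the before/after values. The sketch below is only a sanity check on the reported numbers; the dictionary layout and variable names are illustrative. Recomputing from the rounded means gives a factor close to 19.0 for the victim's mean TTFT rather than the reported 18.9, presumably because the reported factor was derived from unrounded means.

```python
# Before/after TTFT values in milliseconds, copied from the Test Result table.
rows = {
    "Batch p99 TTFT": (9838.8, 295.2),
    "Victim r9 mean TTFT": (1019.0, 53.7),
    "Victim r9 p99 TTFT": (9761.9, 65.2),
}

# Improvement factor = before / after.
speedups = {name: before / after for name, (before, after) in rows.items()}

for name, factor in speedups.items():
    print(f"{name:<22}: {factor:.1f}x")
```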

Summary: The fix restores the victim request's p99 TTFT from ~9.8 s to a stable ~65 ms, effectively neutralizing the compute amplification from "noisy neighbors" in the same batch.


Key Design Highlights: Non-Intrusive & Risk-Averse

  • Pure Python Implementation: The fix is entirely contained within the scheduling and configuration layers. No modifications were made to CUDA kernels or the C++ model executor.
  • Minimal Core Logic Change: Instead of redesigning the sampling process, this PR introduces a "Compute Budget" heuristic. It works seamlessly with the existing Continuous Batching mechanism without introducing side effects.
  • Backward Compatibility: By default, this budget can be disabled (set to 0 or a very high value), ensuring zero impact on existing user workflows while providing an "opt-in" safety net for latency-sensitive production environments.
  • Low Review Overhead: Since the changes are localized and high-level, it allows for safe verification without risking hardware-level or model-specific regressions.

Baseline: Synchronous Blocking via Compute Amplification

============================================================
  vLLM n_completions+logprobs compute amplification
  Finding : finding_01758_199991774 (iteration 1758)
  Server  : http://localhost:8000/
  Model   : Qwen/Qwen2.5-0.5B-Instruct
  Runs    : 10

  Threshold: 46 ms TTFT
============================================================

Trace: 5 concurrent requests sharing prefix_len=32
  r9  : 512-token prompt, max_tokens=64,  n=1, logprobs=None  ← plain victim (+313ms)
  r06 : 512-token prompt, max_tokens=16,  n=8, logprobs=20    ← amplifier  (+578ms)
  r01 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+601ms)
  r03 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+601ms)
  r08 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+602ms)

r06 expands into 8 parallel decode sequences, each computing full-vocab
logprobs every step. All co-scheduled requests pay this cost per step.


  Run  1  (11306ms wall)
    r9      :   9761.9 ms *** SLOW
    r06     :   9519.5 ms *** SLOW
    r01     :   9752.0 ms *** SLOW
    r03     :   9662.5 ms *** SLOW
    r08     :   9838.8 ms *** SLOW

  Run  2  (1263ms wall)
    r9      :     51.2 ms
    r06     :     84.7 ms
    r01     :     87.1 ms
    r03     :     84.5 ms
    r08     :     85.2 ms

  Run  3  (1304ms wall)
    r9      :     45.1 ms
    r06     :    421.0 ms *** SLOW
    r01     :    427.7 ms *** SLOW
    r03     :    427.1 ms *** SLOW
    r08     :    426.7 ms *** SLOW

  Run  4  (1243ms wall)
    r9      :     45.2 ms
    r06     :     77.7 ms
    r01     :    112.9 ms
    r03     :    113.6 ms
    r08     :    110.9 ms

  Run  5  (1257ms wall)
    r9      :     45.5 ms
    r06     :     89.5 ms
    r01     :     85.8 ms
    r03     :     84.8 ms
    r08     :     83.9 ms

  Run  6  (1258ms wall)
    r9      :     47.5 ms
    r06     :     82.6 ms
    r01     :     88.1 ms
    r03     :     87.3 ms
    r08     :     87.1 ms

  Run  7  (1261ms wall)
    r9      :     44.6 ms
    r06     :     77.0 ms
    r01     :     99.4 ms
    r03     :     96.8 ms
    r08     :     97.7 ms

  Run  8  (1248ms wall)
    r9      :     44.1 ms
    r06     :     85.1 ms
    r01     :     91.8 ms
    r03     :     91.0 ms
    r08     :     90.1 ms

  Run  9  (1262ms wall)
    r9      :     54.3 ms
    r06     :     84.4 ms
    r01     :     88.8 ms
    r03     :     88.9 ms
    r08     :     88.0 ms

  Run 10  (1270ms wall)
    r9      :     50.5 ms
    r06     :     92.7 ms
    r01     :     83.2 ms
    r03     :     82.7 ms
    r08     :     82.7 ms

============================================================

  RESULTS SUMMARY
============================================================

  All requests — 50 samples:
    mean :   1070.6 ms
    p99  :   9838.8 ms
    max  :   9838.8 ms
    threshold : 46 ms

  r9 (plain victim, arrived 265ms early) — 10 samples:
    mean :   1019.0 ms
    p99  :   9761.9 ms  (212.2× threshold)

------------------------------------------------------------

  RESULT : CONFIRMED
  p99 TTFT 9838.8 ms exceeds threshold 46 ms (213.9×)

============================================================
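
The baseline p99 is dominated by the single cold first run. The sketch below reuses the same nearest-index percentile rule as the reproduction script (`sorted_samples[int(0.99 * len(samples))]`) on the ten r9 baseline values listed above; the helper name `p99_like_repro` is made up for this illustration. With 10 samples the index is `int(9.9) = 9`, so "p99" is simply the maximum, while the median stays close to the 46 ms threshold.

```python
import statistics

# r9 TTFT per baseline run (ms), copied from the ten baseline runs above.
r9_baseline_ms = [9761.9, 51.2, 45.1, 45.2, 45.5, 47.5, 44.6, 44.1, 54.3, 50.5]


def p99_like_repro(samples: list[float]) -> float:
    """Percentile rule used by the repro script: index int(0.99 * N) of the sorted list."""
    ordered = sorted(samples)
    return ordered[int(0.99 * len(ordered))]


p99 = p99_like_repro(r9_baseline_ms)
mean = sum(r9_baseline_ms) / len(r9_baseline_ms)
median = statistics.median(r9_baseline_ms)

print(f"p99    : {p99:.1f} ms")
print(f"mean   : {mean:.1f} ms")
print(f"median : {median:.1f} ms")
```

For small run counts, reporting the median alongside p99 makes it easier to separate a one-off cold-cache spike from a steady-state slowdown.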

Fixed: Compute Isolation via Logprobs Budgeting

r06 expands into 8 parallel decode sequences, each computing full-vocab
logprobs every step. All co-scheduled requests pay this cost per step.

  Run  1  (1534ms wall)
    r9      :     65.2 ms
    r06     :     82.7 ms
    r01     :     89.5 ms
    r03     :    103.8 ms
    r08     :    295.2 ms *** SLOW

  Run  2  (1497ms wall)
    r9      :     48.8 ms
    r06     :     70.0 ms
    r01     :     89.8 ms
    r03     :    107.1 ms
    r08     :    282.9 ms *** SLOW

  Run  3  (1496ms wall)
    r9      :     52.3 ms
    r06     :     83.2 ms
    r01     :     75.3 ms
    r03     :    282.0 ms *** SLOW
    r08     :     94.8 ms

  Run  4  (1503ms wall)
    r9      :     52.7 ms
    r06     :     81.6 ms
    r01     :     86.0 ms
    r03     :    106.6 ms
    r08     :    295.0 ms *** SLOW

  Run  5  (1498ms wall)
    r9      :     55.7 ms
    r06     :     85.6 ms
    r01     :     84.0 ms
    r03     :     98.3 ms
    r08     :    289.8 ms *** SLOW

  Run  6  (1506ms wall)
    r9      :     53.0 ms
    r06     :     79.5 ms
    r01     :    102.9 ms
    r03     :     82.2 ms
    r08     :    291.5 ms *** SLOW

  Run  7  (1505ms wall)
    r9      :     54.0 ms
    r06     :     83.1 ms
    r01     :     82.1 ms
    r03     :    101.7 ms
    r08     :    292.5 ms *** SLOW

  Run  8  (1476ms wall)
    r9      :     49.8 ms
    r06     :     78.0 ms
    r01     :     75.9 ms
    r03     :    280.9 ms *** SLOW
    r08     :     93.5 ms

  Run  9  (1497ms wall)
    r9      :     51.1 ms
    r06     :     71.9 ms
    r01     :     73.7 ms
    r03     :     90.1 ms
    r08     :    279.0 ms *** SLOW

  Run 10  (1504ms wall)
    r9      :     54.8 ms
    r06     :     80.1 ms
    r01     :     84.8 ms
    r03     :    293.8 ms *** SLOW
    r08     :    102.0 ms

============================================================

  RESULTS SUMMARY
============================================================

  All requests — 50 samples:
    mean :    120.8 ms
    p99  :    295.2 ms
    max  :    295.2 ms
    threshold : 46 ms

  r9 (plain victim, arrived 265ms early) — 10 samples:
    mean :     53.7 ms
    p99  :     65.2 ms  (1.4× threshold)

------------------------------------------------------------

  RESULT : CONFIRMED
  p99 TTFT 295.2 ms exceeds threshold 46 ms (6.4×)

============================================================

repro.py
#!/usr/bin/env python3
"""
Minimal reproduction for vLLM n_completions + logprobs per-step compute amplification.
Bug: A single request with n_completions=8 and logprobs=20 expands into 8 parallel
decode sequences each requiring a full-vocabulary logprob computation per step.
Every co-scheduled request in the same decode batch is forced to wait for this
overhead on every step, causing 76-423x TTFT regression for plain requests.
This reproduces finding_01758 (seed 199991774, iteration 1758).
Note: --enable-chunked-prefill does NOT mitigate this — the regression is in the
decode phase, not the prefill scheduling order.
Setup:
    pip install httpx
    python -m vllm.entrypoints.openai.api_server \\
        --model Qwen/Qwen2.5-0.5B-Instruct \\
        --port 8000 \\
        --gpu-memory-utilization 0.95 \\
        --max-model-len 32768 \\
        --enable-prefix-caching \\
        --disable-log-requests
Run:
    python3 repro.py --base-url http://localhost:8000 --runs 10
"""

import argparse
import asyncio
import json
import sys
import time

try:
    import httpx
except ImportError:
    sys.exit("pip install httpx")


def build_prompt(prompt_len: int, prefix_len: int, idx: int) -> str:
    shared = [f"pre{i}" for i in range(min(prefix_len, prompt_len))]
    unique = [f"u{idx}_{i}" for i in range(prompt_len - len(shared))]
    return " ".join(shared + unique)


# ---------------------------------------------------------------------------
# The five requests from finding_01758.
# All share prefix_len=32. Offsets are relative to trace start (seconds).
#
# r06 is the amplifier: n_completions=8 + logprobs=20 expands into 8 parallel
# decode sequences each computing full-vocab logprobs every step.
# r9 is the primary victim: arrives 265ms before the burst, plain request,
# should be nearly done before r06 even arrives.
# ---------------------------------------------------------------------------

REQUESTS = [
    # label     prompt  prefix  max_tok  n_comp  logprobs  temp   stream  offset_s  idx
    ("r9",       512,    32,      64,      1,      None,    None,  True,   0.313,    0),  # plain victim
    ("r06",      512,    32,      16,      8,      20,      1.26,  True,   0.578,    1),  # ← amplifier
    ("r01",      512,    32,      16,      1,      20,      None,  True,   0.601,    2),
    ("r03",      512,    32,      16,      1,      20,      None,  True,   0.601,    3),
    ("r08",      512,    32,      16,      1,      20,      0.5,   True,   0.602,    4),
]

THRESHOLD_MS = 46.0  # expected TTFT for a 512-token prompt on idle server


async def send_request(
    client: httpx.AsyncClient,
    base_url: str,
    model: str,
    label: str,
    prompt: str,
    max_tokens: int,
    n: int,
    logprobs: int | None,
    temperature: float | None,
    stream: bool,
) -> tuple[str, float | None]:
    payload: dict = {
        "model":      model,
        "prompt":     prompt,
        "max_tokens": max_tokens,
        "stream":     stream,
        "n":          n,
    }
    if logprobs is not None:
        payload["logprobs"] = logprobs
    if temperature is not None:
        payload["temperature"] = temperature

    t_send = time.perf_counter()
    ttft   = None

    try:
        if stream:
            async with client.stream(
                "POST", f"{base_url}/v1/completions",
                json=payload, timeout=120.0,
            ) as resp:
                resp.raise_for_status()
                async for line in resp.aiter_lines():
                    if not line.startswith("data: "):
                        continue
                    raw = line[6:]
                    if raw.strip() == "[DONE]":
                        break
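                    try:
                        chunk = json.loads(raw)
                        tok   = chunk["choices"][0].get("text", "")
                        if tok and ttft is None:
                            # First non-empty text chunk marks time-to-first-token.
                            ttft = (time.perf_counter() - t_send) * 1000.0
                    except (json.JSONDecodeError, KeyError, IndexError):
                        # Skip malformed or choice-less chunks instead of aborting the stream.
                        pass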
        else:
            resp = await client.post(
                f"{base_url}/v1/completions",
                json=payload, timeout=120.0,
            )
            resp.raise_for_status()
    except Exception as exc:
        print(f"  [ERROR] {label}: {exc}", file=sys.stderr)

    return label, ttft


async def run_once(base_url: str, model: str) -> dict[str, float | None]:
    results: dict[str, float | None] = {}
    tasks = []

    async with httpx.AsyncClient() as client:
        trace_start = time.perf_counter()

        for label, plen, pfx, max_tok, n, logprobs, temp, stream, offset, idx in REQUESTS:
            delay = (trace_start + offset) - time.perf_counter()
            if delay > 0:
                await asyncio.sleep(delay)

            task = asyncio.create_task(
                send_request(
                    client, base_url, model, label,
                    build_prompt(plen, pfx, idx),
                    max_tok, n, logprobs, temp, stream,
                )
            )
            tasks.append(task)

        for coro in asyncio.as_completed(tasks):
            label, ttft = await coro
            results[label] = ttft

    return results


async def main() -> None:
    parser = argparse.ArgumentParser(
        description="Reproduce vLLM n_completions+logprobs compute amplification (finding_01758).",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--base-url", default="http://localhost:8000")
    parser.add_argument("--model",    default="Qwen/Qwen2.5-0.5B-Instruct")
    parser.add_argument("--runs",     type=int, default=10)
    parser.add_argument("--settle-s", type=float, default=2.0)
    args = parser.parse_args()

    print("=" * 60)
    print("  vLLM n_completions+logprobs compute amplification")
    print(f"  Finding : finding_01758_199991774 (iteration 1758)")
    print(f"  Server  : {args.base_url}")
    print(f"  Model   : {args.model}")
    print(f"  Runs    : {args.runs}")
    print(f"  Threshold: {THRESHOLD_MS:.0f} ms TTFT")
    print("=" * 60)
    print()
    print("Trace: 5 concurrent requests sharing prefix_len=32")
    print("  r9  : 512-token prompt, max_tokens=64,  n=1, logprobs=None  ← plain victim (+313ms)")
    print("  r06 : 512-token prompt, max_tokens=16,  n=8, logprobs=20    ← amplifier  (+578ms)")
    print("  r01 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+601ms)")
    print("  r03 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+601ms)")
    print("  r08 : 512-token prompt, max_tokens=16,  n=1, logprobs=20                 (+602ms)")
    print()
    print("r06 expands into 8 parallel decode sequences, each computing full-vocab")
    print("logprobs every step. All co-scheduled requests pay this cost per step.")
    print()

    all_ttfts: list[float] = []
    r9_ttfts:  list[float] = []
    run_results = []

    for run_i in range(1, args.runs + 1):
        t0      = time.perf_counter()
        results = await run_once(args.base_url, args.model)
        elapsed = (time.perf_counter() - t0) * 1000.0

        ttfts = [v for v in results.values() if v is not None]
        all_ttfts.extend(ttfts)
        if results.get("r9") is not None:
            r9_ttfts.append(results["r9"])
        run_results.append(results)

        print(f"  Run {run_i:>2}  ({elapsed:.0f}ms wall)")
        for label, *_ in REQUESTS:
            ttft = results.get(label)
            if ttft is not None:
                flag = " *** SLOW" if ttft > THRESHOLD_MS * 3 else ""
                print(f"    {label:<8}: {ttft:>8.1f} ms{flag}")
        print()

        if run_i < args.runs:
            await asyncio.sleep(args.settle_s)

    print("=" * 60)
    print("  RESULTS SUMMARY")
    print("=" * 60)

    if not all_ttfts:
        print("  No TTFT samples collected.")
        return

    all_ttfts.sort()
    mean_ms = sum(all_ttfts) / len(all_ttfts)
    p99_ms  = all_ttfts[int(0.99 * len(all_ttfts))]
    max_ms  = all_ttfts[-1]

    print(f"  All requests — {len(all_ttfts)} samples:")
    print(f"    mean : {mean_ms:>8.1f} ms")
    print(f"    p99  : {p99_ms:>8.1f} ms")
    print(f"    max  : {max_ms:>8.1f} ms")
    print(f"    threshold : {THRESHOLD_MS:.0f} ms")
    print()

    if r9_ttfts:
        r9_ttfts.sort()
        r9_p99  = r9_ttfts[int(0.99 * len(r9_ttfts))]
        r9_mean = sum(r9_ttfts) / len(r9_ttfts)
        print(f"  r9 (plain victim, arrived 265ms early) — {len(r9_ttfts)} samples:")
        print(f"    mean : {r9_mean:>8.1f} ms")
        print(f"    p99  : {r9_p99:>8.1f} ms  ({r9_p99 / THRESHOLD_MS:.1f}× threshold)")
        print()

    print("-" * 60)
    ratio = p99_ms / THRESHOLD_MS
    if p99_ms > THRESHOLD_MS:
        print(f"  RESULT : CONFIRMED")
        print(f"  p99 TTFT {p99_ms:.1f} ms exceeds threshold {THRESHOLD_MS:.0f} ms ({ratio:.1f}×)")
        print()
    else:
        print("  RESULT : NOT REPRODUCED")
        print(f"  p99 TTFT {p99_ms:.1f} ms is within threshold {THRESHOLD_MS:.0f} ms")
    print("=" * 60)


if __name__ == "__main__":
    asyncio.run(main())

Changed files

  • vllm/config/model.py (modified, +5/-0)
  • vllm/config/scheduler.py (modified, +6/-1)
  • vllm/engine/arg_utils.py (modified, +5/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +52/-2)

Code Example

/nfshomes/yunze/miniconda3/envs/vllm-fuzz/lib/python3.11/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
Collecting environment information...
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version                  : (GCC) 8.5.0 20210514 (Red Hat 8.5.0-28)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.28

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.14 (main, Oct 21 2025, 18:31:21) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-4.18.0-553.109.1.el8_10.x86_64-x86_64-with-glibc2.28

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.1.115
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA RTX A6000
Nvidia driver version        : 590.48.01
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               49
Model name:          AMD EPYC 7302 16-Core Processor
Stepping:            0
CPU MHz:             3000.000
CPU max MHz:         3000.0000
CPU min MHz:         1500.0000
BogoMIPS:            6000.12
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.5.2
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                    0.5.2            pypi_0           pypi
[conda] numpy                                2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                   12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12               12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12               12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12             12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                    9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                1.18.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                    11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                   1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                   10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                 11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                 12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12               0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                   4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base         4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                         13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                     2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                  3.3.20           pypi_0           pypi
[conda] nvidia-nvtx-cu12                     12.8.90          pypi_0           pypi
[conda] pynvml                               13.0.1           pypi_0           pypi
[conda] pyzmq                                27.1.0           pypi_0           pypi
[conda] torch                                2.9.0            pypi_0           pypi
[conda] torchaudio                           2.9.0            pypi_0           pypi
[conda] torchvision                          0.24.0           pypi_0           pypi
[conda] transformers                         4.57.6           pypi_0           pypi
[conda] triton                               3.5.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	NODE	4,9	0		N/A
NIC0	SYS	 X 	PIX	SYS				
NIC1	SYS	PIX	 X 	SYS				
NIC2	NODE	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64:
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

---

Model: Qwen/Qwen2.5-0.5B-Instruct
vLLM version: (tested on current main)
GPU: NVIDIA RTX A6000
CUDA version: 12.x
Python: 3.11

---

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --disable-log-requests

---

  Run  1  (4089ms wall)
    r9      :    189.9 ms *** SLOW
    r06     :   3044.7 ms *** SLOW
    r01     :   3052.1 ms *** SLOW
    r03     :   3111.2 ms *** SLOW
    r08     :   3131.4 ms *** SLOW

  Run  2  (771ms wall)
    r9      :     17.0 ms
    r06     :     85.6 ms
    r01     :     73.0 ms
    r03     :     72.7 ms
    r08     :     72.1 ms

  Run  3  (716ms wall)
    r9      :     16.1 ms
    r06     :     20.6 ms
    r01     :     27.4 ms
    r03     :     27.2 ms
    r08     :     26.7 ms

  Run  4  (715ms wall)
    r9      :     16.0 ms
    r06     :     20.9 ms
    r01     :     27.2 ms
    r03     :     26.9 ms
    r08     :     25.6 ms

  Run  5  (718ms wall)
    r9      :     16.2 ms
    r06     :     22.5 ms
    r01     :     31.1 ms
    r03     :     30.8 ms
    r08     :     30.1 ms

  Run  6  (715ms wall)
    r9      :     17.4 ms
    r06     :     20.4 ms
    r01     :     26.1 ms
    r03     :     26.0 ms
    r08     :     25.7 ms

  Run  7  (719ms wall)
    r9      :     15.2 ms
    r06     :     20.9 ms
    r01     :     31.6 ms
    r03     :     31.0 ms
    r08     :     30.7 ms

  Run  8  (716ms wall)
    r9      :     15.7 ms
    r06     :     21.4 ms
    r01     :     27.0 ms
    r03     :     26.9 ms
    r08     :     26.2 ms

  Run  9  (720ms wall)
    r9      :     15.7 ms
    r06     :     20.6 ms
    r01     :     35.2 ms
    r03     :     35.6 ms
    r08     :     34.4 ms

  Run 10  (714ms wall)
    r9      :     15.4 ms
    r06     :     20.2 ms
    r01     :     26.4 ms
    r03     :     25.8 ms
    r08     :     26.0 ms

============================================================
  RESULTS SUMMARY
============================================================
  All requests — 50 samples:
    mean :    276.7 ms
    p99  :   3131.4 ms
    max  :   3131.4 ms
    threshold : 46 ms

  r9 (plain victim, arrived 265ms early) — 10 samples:
    mean :     33.5 ms
    p99  :    189.9 ms  (4.1× threshold)

---

8 sequences × 150k logits × (read + sort + top-K)

---

# 1. Start vLLM (Try a fresh server please — cold cache makes the spike most visible)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --disable-log-requests

# 2. Run the reproduction script immediately after server is ready
python3 repro.py \
    --base-url http://localhost:8000 \
    --runs 10
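
Because the spike is most visible on a cold cache, it helps to start the script as soon as the server accepts requests. The helper below is a small sketch for that: it polls a health endpoint until it answers with HTTP 200. It assumes the server exposes `GET /health`, as vLLM's OpenAI-compatible server does; `wait_until_ready` and its parameters are hypothetical names that are not part of repro.py.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 120.0, poll_s: float = 1.0) -> bool:
    """Poll {base_url}/health until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not accepting connections yet (or still loading); retry.
            pass
        time.sleep(poll_s)
    return False
```

A typical use would be to call `wait_until_ready("http://localhost:8000")` and launch repro.py only if it returns True.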
Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.0
[pip3] torchaudio==2.9.0
[pip3] torchvision==0.24.0
[pip3] transformers==4.57.6
[pip3] triton==3.5.0
[conda] flashinfer-python                    0.5.2            pypi_0           pypi
[conda] numpy                                2.2.6            pypi_0           pypi
[conda] nvidia-cublas-cu12                   12.8.4.1         pypi_0           pypi
[conda] nvidia-cuda-cupti-cu12               12.8.90          pypi_0           pypi
[conda] nvidia-cuda-nvrtc-cu12               12.8.93          pypi_0           pypi
[conda] nvidia-cuda-runtime-cu12             12.8.90          pypi_0           pypi
[conda] nvidia-cudnn-cu12                    9.10.2.21        pypi_0           pypi
[conda] nvidia-cudnn-frontend                1.18.0           pypi_0           pypi
[conda] nvidia-cufft-cu12                    11.3.3.83        pypi_0           pypi
[conda] nvidia-cufile-cu12                   1.13.1.3         pypi_0           pypi
[conda] nvidia-curand-cu12                   10.3.9.90        pypi_0           pypi
[conda] nvidia-cusolver-cu12                 11.7.3.90        pypi_0           pypi
[conda] nvidia-cusparse-cu12                 12.5.8.93        pypi_0           pypi
[conda] nvidia-cusparselt-cu12               0.7.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl                   4.4.1            pypi_0           pypi
[conda] nvidia-cutlass-dsl-libs-base         4.4.1            pypi_0           pypi
[conda] nvidia-ml-py                         13.590.48        pypi_0           pypi
[conda] nvidia-nccl-cu12                     2.27.5           pypi_0           pypi
[conda] nvidia-nvjitlink-cu12                12.8.93          pypi_0           pypi
[conda] nvidia-nvshmem-cu12                  3.3.20           pypi_0           pypi
[conda] nvidia-nvtx-cu12                     12.8.90          pypi_0           pypi
[conda] pynvml                               13.0.1           pypi_0           pypi
[conda] pyzmq                                27.1.0           pypi_0           pypi
[conda] torch                                2.9.0            pypi_0           pypi
[conda] torchaudio                           2.9.0            pypi_0           pypi
[conda] torchvision                          0.24.0           pypi_0           pypi
[conda] transformers                         4.57.6           pypi_0           pypi
[conda] triton                               3.5.0            pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	NIC0	NIC1	NIC2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	SYS	NODE	4,9	0		N/A
NIC0	SYS	 X 	PIX	SYS				
NIC1	SYS	PIX	 X 	SYS				
NIC2	NODE	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_bond_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/opt/common/cuda/cuda-13.1.1/lib64:
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_HOME=/opt/common/cuda/cuda-13.1.1
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

My current environment in short

Model: Qwen/Qwen2.5-0.5B-Instruct
vLLM version: (tested on current main)
GPU: NVIDIA RTX A6000
CUDA version: 12.x
Python: 3.10

Launch command:

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --disable-log-requests

This was found in the same 12-hour fuzzing round against vLLM as #37308 and #37076.

We found that a single request carrying n_completions=8 and logprobs=20 causes a 76–423× TTFT regression for every other request co-scheduled in the same batch, confirmed across 10 independent runs.

The expensive request itself is not the victim: it only generates 16 tokens and completes within its own budget. The victims are the plain co-scheduled requests that are forced to share every decode iteration with it. Their TTFT goes from an expected ~46ms to 3524ms p99 (76×), with worst-case single samples reaching 19678ms (423×) in one reproduction.

This was found via automated fuzzing (finding_01758, seed 199991774) and confirmed across 10 independent runs on a dedicated node.

This is a different root cause from the chunked-prefill head-of-line blocking bug #37308. Enabling --enable-chunked-prefill does not help here. The regression occurs entirely in the decode phase and is caused by per-step compute amplification, not prefill scheduling order.


Expected behavior

Once all requests have been prefilled and enter the decode phase, each decode step should complete in roughly constant time regardless of the n_completions or logprobs parameters of other co-scheduled requests. A plain request with max_tokens=16 and no logprobs should receive its first token in ~46ms regardless of what other requests are in-flight.


Observed

With the following 5 concurrent requests sharing a 32-token cached prefix:

request  prompt  max_tokens  n_completions  logprobs  temperature
r9       512     64          1              -         default
r06      512     16          8              20        1.26
r01      512     16          1              20        default
r03      512     16          1              20        default
r08      512     16          1              20        0.5
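
As a rough illustration, the request mix above could be written as OpenAI-style completion payloads. This is not the actual repro.py, which is not shown here. The helper name build_payload, the placeholder prompt, the mapping of n_completions to the `n` field, and the reading of r9 as max_tokens=64 with no logprobs are all assumptions made for this sketch.

```python
MODEL = "Qwen/Qwen2.5-0.5B-Instruct"

# (name, max_tokens, n, logprobs, temperature); None means "leave at server default".
# Values are read from the request table; the r9 row is an interpretation of it.
REQUEST_MIX = [
    ("r9", 64, 1, None, None),
    ("r06", 16, 8, 20, 1.26),
    ("r01", 16, 1, 20, None),
    ("r03", 16, 1, 20, None),
    ("r08", 16, 1, 20, 0.5),
]


def build_payload(max_tokens, n, logprobs, temperature,
                  prompt="<512-token prompt goes here>"):
    """Build one completion request body (hypothetical helper)."""
    payload = {
        "model": MODEL,
        "prompt": prompt,          # placeholder, the real prompt is not in the issue
        "max_tokens": max_tokens,
        "n": n,
        "stream": True,            # streaming is needed to observe the first token
    }
    if logprobs is not None:
        payload["logprobs"] = logprobs
    if temperature is not None:
        payload["temperature"] = temperature
    return payload


payloads = {name: build_payload(*params) for name, *params in REQUEST_MIX}

for name, body in payloads.items():
    print(name, {k: v for k, v in body.items() if k not in ("model", "prompt")})
```

Sending all five bodies concurrently and timing the first streamed chunk of each would give per-request TTFT figures comparable to the ones listed below.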

Observed TTFT across 10 runs (50 samples):

  Run  1  (4089ms wall)
    r9      :    189.9 ms *** SLOW
    r06     :   3044.7 ms *** SLOW
    r01     :   3052.1 ms *** SLOW
    r03     :   3111.2 ms *** SLOW
    r08     :   3131.4 ms *** SLOW

  Run  2  (771ms wall)
    r9      :     17.0 ms
    r06     :     85.6 ms
    r01     :     73.0 ms
    r03     :     72.7 ms
    r08     :     72.1 ms

  Run  3  (716ms wall)
    r9      :     16.1 ms
    r06     :     20.6 ms
    r01     :     27.4 ms
    r03     :     27.2 ms
    r08     :     26.7 ms

  Run  4  (715ms wall)
    r9      :     16.0 ms
    r06     :     20.9 ms
    r01     :     27.2 ms
    r03     :     26.9 ms
    r08     :     25.6 ms

  Run  5  (718ms wall)
    r9      :     16.2 ms
    r06     :     22.5 ms
    r01     :     31.1 ms
    r03     :     30.8 ms
    r08     :     30.1 ms

  Run  6  (715ms wall)
    r9      :     17.4 ms
    r06     :     20.4 ms
    r01     :     26.1 ms
    r03     :     26.0 ms
    r08     :     25.7 ms

  Run  7  (719ms wall)
    r9      :     15.2 ms
    r06     :     20.9 ms
    r01     :     31.6 ms
    r03     :     31.0 ms
    r08     :     30.7 ms

  Run  8  (716ms wall)
    r9      :     15.7 ms
    r06     :     21.4 ms
    r01     :     27.0 ms
    r03     :     26.9 ms
    r08     :     26.2 ms

  Run  9  (720ms wall)
    r9      :     15.7 ms
    r06     :     20.6 ms
    r01     :     35.2 ms
    r03     :     35.6 ms
    r08     :     34.4 ms

  Run 10  (714ms wall)
    r9      :     15.4 ms
    r06     :     20.2 ms
    r01     :     26.4 ms
    r03     :     25.8 ms
    r08     :     26.0 ms

============================================================
  RESULTS SUMMARY
============================================================
  All requests — 50 samples:
    mean :    276.7 ms
    p99  :   3131.4 ms
    max  :   3131.4 ms
    threshold : 46 ms

  r9 (plain victim, arrived 265ms early) — 10 samples:
    mean :     33.5 ms
    p99  :    189.9 ms  (4.1× threshold)
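
The r9 summary can be re-derived from the ten r9 samples listed in the runs above. The sketch below assumes a nearest-rank percentile; the method repro.py actually uses is not stated, so treat that choice as an assumption.

```python
import math

THRESHOLD_MS = 46.0

# r9 TTFT from runs 1 to 10, copied from the per-run output
R9_TTFT_MS = [189.9, 17.0, 16.1, 16.0, 16.2, 17.4, 15.2, 15.7, 15.7, 15.4]


def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


mean_ms = sum(R9_TTFT_MS) / len(R9_TTFT_MS)
p99_ms = nearest_rank_percentile(R9_TTFT_MS, 99)

print(f"mean {mean_ms:.1f} ms, p99 {p99_ms:.1f} ms "
      f"({p99_ms / THRESHOLD_MS:.1f}x threshold)")
# prints: mean 33.5 ms, p99 189.9 ms (4.1x threshold)
```

With only ten samples the p99 is simply the maximum, so the whole r9 figure comes from the single cold-cache sample in run 1.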

What we think happened

n_completions=N expands one logical request into N parallel decode sequences, each with its own KV block chain. With N=8, vLLM schedules 8 independent sequences for r06 simultaneously, all of which participate in every decode iteration.

logprobs=K requires a full-vocabulary softmax read at every decode step for every sequence. With logprobs=20 and a vocabulary of ~150k tokens (Qwen2.5), this means reading and sorting 150k logit values per sequence per step.

The combined cost per decode iteration for r06 alone is huge:

8 sequences × 150k logits × (read + sort + top-K)

The three other logprob requests (r01, r03, r08) each add another ~150k logits per step, which compounds the cost further.

Also, r9 cannot exit the batch early: every decode step waits for the full batch, including all 8 of r06's sequences and their logprob computations, before any request receives its next token. With max_tokens=16 for the small requests, they should have exited after 16 steps; instead each of those 16 steps is paid at full amplified cost.
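
To put numbers on the estimate above, here is a back-of-the-envelope count of logits touched for logprobs in one decode step. The 150k vocabulary is the approximation used in this report, and the 4 bytes per logit (fp32) is an assumption for illustration only.

```python
VOCAB_SIZE = 150_000      # approximate Qwen2.5 vocabulary, as used in this report
BYTES_PER_LOGIT = 4       # assumption: fp32 logits
DECODE_STEPS = 16         # max_tokens of the small requests

# name -> (number of parallel sequences, whether logprobs were requested)
requests = {
    "r9": (1, False),
    "r06": (8, True),
    "r01": (1, True),
    "r03": (1, True),
    "r08": (1, True),
}


def logprob_logits_per_step(n, wants_logprobs):
    """Logits that must be read and ranked for logprobs in one decode step."""
    return n * VOCAB_SIZE if wants_logprobs else 0


per_request = {name: logprob_logits_per_step(*spec) for name, spec in requests.items()}
total_per_step = sum(per_request.values())

print(per_request["r06"])                          # 1200000
print(total_per_step)                              # 1650000
print(total_per_step * BYTES_PER_LOGIT / 1e6)      # 6.6 (MB per step)
print(total_per_step * DECODE_STEPS)               # 26400000
```

r06 alone accounts for 1.2M of the 1.65M logits per step, which is why it dominates the per-step sampling cost in this mix.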

Here is the seed, and repro script

We baked the seed in the repro.py so you can just run the repro.py :)

To reproduce (on a fresh server)

# 1. Start vLLM (Try a fresh server please — cold cache makes the spike most visible)
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --disable-log-requests

# 2. Run the reproduction script immediately after server is ready
python3 repro.py \
    --base-url http://localhost:8000 \
    --runs 10

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the TTFT regression issue, we need to optimize the decode phase to reduce the computational amplification caused by n_completions and logprobs. Here are the steps:

  • Optimize logprob computations:
    • Implement a more efficient logprob calculation method, such as using a sparse softmax or approximating the softmax using a smaller vocabulary.
    • Consider using a hierarchical softmax or a mixture of softmax experts to reduce the computational cost.
  • Batch logprob computations:
    • Instead of computing logprobs for each sequence individually, batch them together to reduce the number of softmax computations.
    • Use a batched softmax implementation to take advantage of GPU parallelization.
  • Parallelize decode sequences:
    • Use parallel processing to decode multiple sequences simultaneously, reducing the overall computational time.
    • Consider using a parallel decoding algorithm, such as parallel beam search or parallel greedy decoding.
  • Implement early exit for plain requests:
    • Allow plain requests to exit the batch early, without waiting for the full batch to complete.
    • Use a dynamic batching mechanism to group requests with similar computational requirements together.
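
The last point, grouping requests by sampling cost, can be sketched as a greedy split under a per-batch budget. Everything here is hypothetical: the Req type, the cost model (one unit per sequence that needs logprobs), and the budget value are illustration only and are not vLLM's scheduler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Req:
    name: str
    n: int          # number of parallel completions
    logprobs: int   # 0 means no logprobs requested


def sampling_cost(req):
    """Hypothetical cost: one unit per sequence that needs a top-k over the vocabulary."""
    return req.n if req.logprobs > 0 else 0


def split_by_budget(reqs, budget):
    """Greedily pack requests in arrival order, starting a new batch when the
    sampling budget would be exceeded. An over-budget request runs alone."""
    batches, current, used = [], [], 0
    for req in reqs:
        cost = sampling_cost(req)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += cost
    if current:
        batches.append(current)
    return batches


reqs = [Req("r9", 1, 0), Req("r06", 8, 20), Req("r01", 1, 20),
        Req("r03", 1, 20), Req("r08", 1, 20)]

for batch in split_by_budget(reqs, budget=4):
    print([r.name for r in batch])
# prints:
# ['r9']
# ['r06']
# ['r01', 'r03', 'r08']
```

In this toy split the plain request r9 no longer shares a step with r06. A real scheduler would also have to weigh the throughput lost by running smaller batches.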

Example code snippet for batched logprob computations:

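import torch
import torch.nn.functional as F

def batched_logprobs(logits, top_k):
    """Return the top-k log-probabilities and token ids for every sequence."""
    # Flatten any leading dimensions so each row holds one sequence's logits
    batched_logits = logits.reshape(-1, logits.shape[-1])
    # One log-softmax over the vocabulary dimension for the whole batch
    logprobs = F.log_softmax(batched_logits, dim=-1)

    # topk returns (values, indices), each of shape (num_sequences, top_k)
    top_k_logprobs, top_k_token_ids = logprobs.topk(top_k, dim=-1)

    return top_k_logprobs, top_k_token_ids

# Example usage
logits = torch.randn(8, 150000)  # 8 sequences, 150k vocabulary
top_k = 20
values, token_ids = batched_logprobs(logits, top_k)
print(values.shape, token_ids.shape)  # torch.Size([8, 20]) torch.Size([8, 20])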

Verification

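To verify the fix, restart the server so the cache is cold, rerun the reproduction script with --runs 10, and compare the per-request TTFT against the original numbers. The first run is the one to watch: in the original report the co-scheduled requests took roughly 3000 ms there, against a 46 ms threshold.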

Extra Tips

  • Consider using profiling tools to identify performance bottlenecks in the decode phase.
  • Experiment with different optimization techniques, such as quantization or knowledge distillation, to further improve the performance of the model.
  • Regularly test and validate the model to ensure the fix does not introduce any new issues or regressions.



