[RFC] Production-boundary measurement on H100 + vLLM 0.19.1: throughput plateau at c=4→16, methodology critique invited

vllm-project/vllm#42484 (fetched 2026-05-14 03:29:50)

Updated 2026-05-13 after Lambda H100 SXM5 cross-provider replication. Original framing preserved in GitHub edit history (click the "edited" link next to the timestamp). Major changes: ctx 8K c=4/c=16 reproduces across providers under fresh-server conditions; warm-vs-fresh server state introduces a separate observation; fairness 4.3× number is environment-specific (Lambda shows 1.08× at 90/10 mix under n=100).

Motivation

Single-stream benchmarks dominate published LLM serving numbers, but production traffic is concurrent. This RFC documents a measurement protocol applied to vLLM 0.19.1 on H100 SXM5, with a cross-provider follow-up that reproduces the structural boundary across two providers under matched fresh-server conditions.

I am posting this as an RFC for methodology critique because:

  1. The boundary shape was sharper than I expected and I want to confirm I am not misinterpreting a vLLM scheduler behavior or a benchmark artifact.
  2. If the shape is real, it has implications for how operators should pick concurrency under similar production load — the decision is currently driven by intuition, not measurement.
  3. After the original Spheron run I extended with a Lambda H100 SXM5 replay; that surfaced a server-state observation that I would value maintainer perspective on.

Repo with raw bench JSONs, run logs, GPU telemetry CSV (1 Hz, 90 min), per-cell scenario logs, vLLM server logs, pip freeze, nvidia-smi dump, environment metadata, and analysis scripts:

https://github.com/jacob-sunho-kim/llm-boundary-research (cross-provider sprint at findings/cross_provider_lambda_spheron_20260513/).

Proposed Change

No vLLM code change is proposed. The "change" requested is informational: I would value pointers on whether (a) the boundary shape generalizes, (b) the methodology has gaps, (c) the warm-server observation is a known artifact.

Setup (both providers, instance-local bench client)

| Item | Spheron 2026-05-12 | Lambda 2026-05-13 |
| --- | --- | --- |
| Provider | Spheron Dedicated EU North 1 | Lambda Cloud on_demand |
| GPU | 1× NVIDIA H100 80 GB HBM3 SXM5 | 1× NVIDIA H100 80 GB HBM3 SXM5 |
| Driver / CUDA | 550.163.01 / 12.4 | 580.105.08 / 13.0 |
| Engine | vLLM 0.19.1 + FlashAttention v3 | vLLM 0.19.1 + FlashAttention v3 |
| Model | RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8 | (same) |
| KV cache dtype | FP8 | FP8 |
| vLLM args | --max-num-seqs 64 --max-model-len 32768 --gpu-memory-utilization 0.95 --kv-cache-dtype fp8 | (same) |
| Workload | vllm bench serve --dataset-name random | (same) |
| Prefix cache | ON (vLLM default) | (same) |
| Bench client | instance-local localhost:8000 | (same) |
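For readers who want to reproduce the server side, the args row above can be assembled into a launch command. This is a minimal sketch: the `vllm serve` entrypoint and the helper name `server_command` are assumptions (the RFC lists the args, not the full command line); the model name and flags are copied from the table.

```python
import shlex

# Model and serving args copied from the setup table.
MODEL = "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8"
SERVER_ARGS = [
    "--max-num-seqs", "64",
    "--max-model-len", "32768",
    "--gpu-memory-utilization", "0.95",
    "--kv-cache-dtype", "fp8",
]


def server_command(model=MODEL, args=SERVER_ARGS):
    """Build the argv list for the server launch.

    Assumption: the server was started through the `vllm serve` entrypoint.
    """
    return ["vllm", "serve", model, *args]


# Print a copy-pasteable shell line; nothing is executed here.
print(shlex.join(server_command()))
```

Keeping the command as a list makes it easy to pass to `subprocess.Popen` for a fresh-restart-per-cell protocol, which matters given the server-state observation in Finding 5.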

Finding 1 — Throughput ceiling at 8K context reproduces across providers (fresh server)

At ctx 8K, scaling concurrency 4 → 16 produced a throughput plateau and a P99 TTFT explosion on both providers under fresh vLLM server conditions:

| Cell | Spheron | Lambda fresh (Round 3.5, n=3 trials mean) |
| --- | --- | --- |
| ctx 8K c=4 | 83.0 tok/s / 4,607 ms P99 | 85.4 tok/s / 4,637 ms P99 (1.007× of Spheron) |
| ctx 8K c=16 | 83.2 tok/s / 40,113 ms P99 | 85.8 tok/s / 40,280 ms P99 (1.004× of Spheron) |
| Throughput delta c=4→c=16 | +0.2 % | 0.0 % |
| P99 TTFT delta c=4→c=16 | +772 % (8.7×) | +769 % (8.7×) |

Quadrupling concurrency (c=4 → c=16) moved server throughput by ≤0.2 % on either provider while P99 TTFT rose ~8.7× on both. The boundary structure is not a single-provider artifact.
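The multipliers quoted above can be recomputed from the table. The sketch below uses only the rounded P99 TTFT values reported in Finding 1; the function names are illustrative, and because the inputs are rounded, derived percentages may differ by a point or two from figures computed on the raw bench JSONs.

```python
# P99 TTFT in milliseconds, copied from the Finding 1 table.
P99_TTFT_MS = {
    "Spheron": {4: 4607, 16: 40113},
    "Lambda": {4: 4637, 16: 40280},
}


def ttft_multiplier(provider):
    """How many times P99 TTFT grows when concurrency goes from 4 to 16."""
    cells = P99_TTFT_MS[provider]
    return cells[16] / cells[4]


def cross_provider_ratio(concurrency):
    """Lambda P99 TTFT divided by Spheron P99 TTFT at one concurrency."""
    return P99_TTFT_MS["Lambda"][concurrency] / P99_TTFT_MS["Spheron"][concurrency]


for provider in P99_TTFT_MS:
    print(f"{provider}: c=4 -> c=16 P99 TTFT x{ttft_multiplier(provider):.1f}")
for c in (4, 16):
    print(f"c={c}: Lambda / Spheron = {cross_provider_ratio(c):.3f}")
```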

Finding 2 — Long-context (ctx 28K) cells matched almost exactly

| Cell | Spheron | Lambda (Round 2, warm server) | Ratio |
| --- | --- | --- | --- |
| ctx 28K c=1 | 5,361 ms | 5,290 ms | 0.99× |
| ctx 28K c=4 | 42,752 ms | 42,761 ms | 0.9998× |
| ctx 28K c=16 (n=10 OOM probe) | 116,573 ms | 116,739 ms | 1.001× |
| ctx 28K c=32 (n=10 OOM probe) | 116,485 ms | 111,643 ms | 0.958× |

Throughput at all ctx 28K cells matched within 1.005× across providers. The ctx 28K series did not show the warm-server drift that Lambda's ctx 8K cells showed in Round 2 — long-context appears robust to the server-state variable described in Finding 5.

Finding 3 — Repeatability of the critical ctx 8K boundary on Lambda

Round 3.5 fresh-server, n=3 trials:

| Config | P99 TTFT (3 trials, ms) | Mean (ms) | COV |
| --- | --- | --- | --- |
| ctx 8K c=4 | 4,816 / 4,542 / 4,551 | 4,637 | 3.4 % |
| ctx 8K c=16 | 40,276 / 40,265 / 40,297 | 40,280 | 0.0 % |

The Spheron Round 3.5 (2026-05-12) also produced sub-1.1 % COV on the same critical pair. The verdict (degraded at c=4, unfit at c=16) is stable under both providers, under fresh-server conditions.
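The COV column can be checked directly from the three trial values. This sketch assumes COV is the sample standard deviation (n-1 denominator) divided by the mean, which is the convention that reproduces the reported 3.4 %; the function name is illustrative.

```python
import statistics

# P99 TTFT per trial in milliseconds, copied from the Finding 3 table.
TRIALS_MS = {
    "ctx 8K c=4": [4816, 4542, 4551],
    "ctx 8K c=16": [40276, 40265, 40297],
}


def cov_percent(samples):
    """Coefficient of variation in percent, using the sample standard deviation."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


for cell, samples in TRIALS_MS.items():
    print(f"{cell}: mean={statistics.mean(samples):.0f} ms, COV={cov_percent(samples):.1f} %")
```

With only three trials the COV estimate is itself noisy, so it is better read as a stability check than as a precise variance estimate.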

Finding 4 — Config sweep did not move the boundary on either provider

Spheron Round 3.1 (2026-05-12) and Lambda Round 3.1 (2026-05-13) tested the same vLLM serving knobs:

  • max_num_seqs ∈ {16, 64, 256}
  • max_num_batched_tokens ∈ {2048, 8192, 16384}
  • chunked_prefill on/off

| Provider | Configs tested | Cells that moved the boundary | Startup OOM |
| --- | --- | --- | --- |
| Spheron | 6 | 0/9 | 2 (max_num_batched_tokens=16384, no chunked_prefill) |
| Lambda | 6 | 0/18 | 0 |

Lambda specifically: across 6 configs × 3 concurrency points (c=4, c=16, c=32), the ctx 8K P99 TTFT cells stayed within 1.011× / 1.001× / 1.001× of each other respectively. Lambda safely launched the two configs that startup-OOM'd on Spheron (mechanism for that startup-OOM delta is not attributed).
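For context on knob coverage, the full Cartesian grid over the three listed knobs can be enumerated as below. This is only a sketch of the search space: the RFC tested 6 configs per provider and does not state here which 6, so no subset is implied. The dictionary keys mirror the knob names above; the helper name is illustrative.

```python
from itertools import product

# Knob values copied from the list above.
KNOBS = {
    "max_num_seqs": [16, 64, 256],
    "max_num_batched_tokens": [2048, 8192, 16384],
    "chunked_prefill": [True, False],
}


def full_grid(knobs):
    """Return every combination of knob values as a list of dicts."""
    names = list(knobs)
    return [dict(zip(names, values)) for values in product(*knobs.values())]


grid = full_grid(KNOBS)
print(f"{len(grid)} possible configs in the full grid")
print(grid[0])
```

Comparing the size of the full grid with the number of configs actually run gives a quick sense of how much of the tested knob space the "not config-tunable" conclusion covers.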

The ctx 8K boundary is not config-tunable within the tested knob set. Any remedy likely sits at the routing, admission, queue-separation, or hardware-escalation layer rather than in the vLLM serving args.

Finding 5 — Server State Drift (new observation, mechanism not attributed)

On the same Lambda instance, the ctx 8K cells produced different P99 TTFT depending on whether the vLLM server was freshly restarted or had ~60 min of accumulated session state.

| Cell | Spheron (fresh) | Lambda Round 2 (warm, ~60 min uptime) | Lambda Round 3.5 (fresh restart, n=3 mean) |
| --- | --- | --- | --- |
| ctx 8K c=4 P99 | 4,607 ms | 11,565 ms (2.51× of Spheron) | 4,637 ms (1.007× of Spheron) |
| ctx 8K c=16 P99 | 40,113 ms | 54,382 ms (1.36× of Spheron) | 40,280 ms (1.004× of Spheron) |
| Throughput (tok/s, c=4 / c=16) | 83.0 / 83.2 | 71.6 / 71.6 | 85.4 / 85.8 |

Same model, same hardware, same engine, same args, same workload — boundary severity moved between fresh-restart and warm states on the same physical instance.

Mechanism is not attributed. Candidate causes: KV / block allocator state, scheduler queue accumulation, prefix cache state, CUDA memory pool fragmentation, runtime envelope drift, or cumulative effect of preceding bench cells on the same long-lived server.

Lambda's ctx 28K cells (Finding 2) did NOT show the same warm-server drift; the drift specifically affected ctx 8K. I do not have a hypothesis for the context-length specificity.
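The table expresses the warm-server numbers relative to Spheron. Because the drift was observed on a single Lambda instance, it can also be useful to compare warm against fresh on Lambda itself. The sketch below derives both ratios from the rounded table values; the variable and function names are illustrative.

```python
# P99 TTFT in milliseconds for ctx 8K, copied from the Finding 5 table.
P99_MS = {
    "spheron_fresh": {4: 4607, 16: 40113},
    "lambda_warm": {4: 11565, 16: 54382},
    "lambda_fresh": {4: 4637, 16: 40280},
}


def ratio(numerator, denominator, concurrency):
    """P99 TTFT of one server state divided by another at a given concurrency."""
    return P99_MS[numerator][concurrency] / P99_MS[denominator][concurrency]


for c in (4, 16):
    vs_spheron = ratio("lambda_warm", "spheron_fresh", c)
    vs_self = ratio("lambda_warm", "lambda_fresh", c)
    print(f"c={c}: warm vs Spheron x{vs_spheron:.2f}, warm vs fresh Lambda x{vs_self:.2f}")
```

The same-instance ratio removes the provider variable, so it isolates the server-state effect more cleanly than the cross-provider figure.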

Finding 6 — Mixed-workload fairness is provider/runtime-sensitive (correction)

The original RFC body reported a 4.3× short-user P99 degradation at 90/10 mix. That number is real for the Spheron environment but does not generalize to a second provider under the same vLLM args.

| Scenario | Spheron n=20 short P99 | Lambda n=20 short P99 | Lambda n=100 short P99 | Short degradation |
| --- | --- | --- | --- | --- |
| short alone | 360 ms | 335 ms | 334 ms | 1.00× |
| 90/10 mix | 1,532 ms | 371 ms | 361 ms | Spheron 4.26× vs Lambda 1.08× |
| 50/50 mix | 594 ms | 905 ms | 915 ms | Spheron 1.65× vs Lambda 2.74× |

Lambda's n=100 rerun confirmed that the 90/10 stability is not a sample-size effect (n=20 and n=100 gave essentially identical degradation ratios). The Spheron baseline is currently only at n=20 (population value remains pending an n=100 Spheron rerun).

The fairness trigger point and severity appear to be scheduler / runtime / arrival-pattern-sensitive. The original 4.3× framing is therefore Spheron-specific. Generalizing it to a universal fairness rule would be wrong.
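The degradation ratios in the table are each mix's short-request P99 divided by the short-alone P99 from the same run series. A minimal sketch using the rounded table values follows; the series labels and function name are illustrative.

```python
# Short-request P99 in milliseconds, copied from the Finding 6 table.
SHORT_P99_MS = {
    "Spheron n=20": {"short alone": 360, "90/10 mix": 1532, "50/50 mix": 594},
    "Lambda n=20": {"short alone": 335, "90/10 mix": 371, "50/50 mix": 905},
    "Lambda n=100": {"short alone": 334, "90/10 mix": 361, "50/50 mix": 915},
}


def degradation(series, scenario):
    """Short P99 under a mix divided by short P99 when short requests run alone."""
    cells = SHORT_P99_MS[series]
    return cells[scenario] / cells["short alone"]


for series in SHORT_P99_MS:
    for scenario in ("90/10 mix", "50/50 mix"):
        print(f"{series}, {scenario}: x{degradation(series, scenario):.2f}")
```

The table's Lambda figures correspond to the n=100 series; the n=20 series yields slightly different ratios from the same calculation.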

Where I am asking for critique

  1. Server State Drift mechanism. Is there a known vLLM 0.19.1 behavior where ctx 8K c=4/c=16 P99 TTFT degrades over ~60 min of mixed-traffic uptime, then recovers on server restart with the same args? If so, what is the dominant mechanism — block allocator, scheduler state, prefix cache, CUDA memory pool, or something else?
  2. Context-length specificity of the drift. Why might warm-server drift affect ctx 8K but not ctx 28K on the same instance under the same workload sequence?
  3. Config knob coverage. Were there obvious knobs I missed (--num-scheduler-steps, --enable-prefix-caching variants, --swap-space, --enforce-eager, --disable-log-stats) that would have moved the c=16 boundary inside the safe range?
  4. Fairness divergence interpretation. What scheduler-level mechanism could cause 90/10 mix to degrade short P99 4.3× on one stack and 1.08× on another (same args, same model, same workload)? Is it driver / CUDA, prefix cache state, or arrival-pattern timing?
  5. Synthetic vs realistic workload. I used --dataset-name random. How much of the ctx 8K c=16 boundary disappears under a realistic distribution (RAG / chat / code / agent prompt length distribution)? Is anyone aware of a known shape where random over-reports the boundary?

What I am explicitly not claiming

  • Not universal across all H100 deployments. Two providers only, same H100 SXM5 chip class.
  • Not provider-agnostic. ctx 8K severity diverged in Round 2; reproduction required Round 3.5 fresh-restart on Lambda.
  • Not vLLM-version-agnostic. Both providers ran 0.19.1.
  • Not workload-agnostic. --dataset-name random only.
  • Not root-cause-attributed. No DCGM / Nsight profiling. All mechanism statements above are candidate-level.

Feedback Period

2 weeks from the original post date (until ~2026-05-27 KST). Happy to extend if discussion is active.

CC List

(blank — open to anyone interested.)

Any Other Things

  • Cross-provider replication note + per-finding notes: https://github.com/jacob-sunho-kim/llm-boundary-research/tree/master/findings/cross_provider_lambda_spheron_20260513.
  • Raw bundle includes all 47 bench JSONs, scenario logs, vLLM server logs, GPU 1 Hz telemetry, environment metadata, pip freeze, nvidia-smi dump. MIT for code, CC-BY-4.0 for data.
  • If a maintainer prefers a different format (Discussions, a smaller focused issue, or a research-paper-style writeup), please tag and I will follow up.
