[RFC] Production-boundary measurement on H100 + vLLM 0.19.1: throughput plateau at c=4→16, methodology critique invited

jacob-sunho-kim · 2026-05-13T05:18:24Z



vllm-project/vllm#42484•Fetched 2026-05-14 03:29:50


Updated 2026-05-13 after Lambda H100 SXM5 cross-provider replication. Original framing preserved in GitHub edit history (click the "edited" link next to the timestamp). Major changes: ctx 8K c=4/c=16 reproduces across providers under fresh-server conditions; warm-vs-fresh server state introduces a separate observation; fairness 4.3× number is environment-specific (Lambda shows 1.08× at 90/10 mix under n=100).

Motivation

Single-stream benchmarks dominate published LLM serving numbers, but production traffic is concurrent. This RFC documents a measurement protocol applied to vLLM 0.19.1 on H100 SXM5, with a cross-provider follow-up that reproduces the structural boundary across two providers under matched fresh-server conditions.

I am posting this as an RFC for methodology critique because:

1. The boundary shape was sharper than I expected, and I want to confirm I am not misinterpreting a vLLM scheduler behavior or a benchmark artifact.
2. If the shape is real, it has implications for how operators should pick concurrency under similar production load. That decision is currently driven by intuition, not measurement.
3. After the original Spheron run I extended with a Lambda H100 SXM5 replay; that surfaced a server-state observation that I would value maintainer perspective on.

Repo with raw bench JSONs, run logs, GPU telemetry CSV (1 Hz, 90 min), per-cell scenario logs, vLLM server logs, pip freeze, nvidia-smi dump, environment metadata, and analysis scripts:

https://github.com/jacob-sunho-kim/llm-boundary-research (cross-provider sprint at findings/cross_provider_lambda_spheron_20260513/).

Proposed Change

No vLLM code change is proposed. The "change" requested is informational: I would value pointers on whether (a) the boundary shape generalizes, (b) the methodology has gaps, (c) the warm-server observation is a known artifact.

Setup (both providers, instance-local bench client)

| Item | Spheron 2026-05-12 | Lambda 2026-05-13 |
|---|---|---|
| Provider | Spheron Dedicated EU North 1 | Lambda Cloud on_demand |
| GPU | 1× NVIDIA H100 80 GB HBM3 SXM5 | 1× NVIDIA H100 80 GB HBM3 SXM5 |
| Driver / CUDA | 550.163.01 / 12.4 | 580.105.08 / 13.0 |
| Engine | vLLM 0.19.1 + FlashAttention v3 | vLLM 0.19.1 + FlashAttention v3 |
| Model | `RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8` | (same) |
| KV cache dtype | FP8 | FP8 |
| vLLM args | `--max-num-seqs 64 --max-model-len 32768 --gpu-memory-utilization 0.95 --kv-cache-dtype fp8` | (same) |
| Workload | `vllm bench serve --dataset-name random` | (same) |
| Prefix cache | ON (vLLM default) | (same) |
| Bench client | instance-local localhost:8000 | (same) |
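For readers who want to replay a cell, the setup table can be turned into command lines. The sketch below only builds the argument lists and does not launch anything. The server flags are the ones listed in the table. The bench flags other than `--dataset-name random` are assumptions: the post gives neither the output length nor the prompt count, so `output_len=128` and `num_prompts=100` are placeholders, and treating "ctx 8K" as 8192 random input tokens is also an assumption.

```python
import shlex

MODEL = "RedHatAI/Meta-Llama-3.1-70B-Instruct-FP8"


def serve_argv():
    """Server command using the vLLM args listed in the setup table."""
    return [
        "vllm", "serve", MODEL,
        "--max-num-seqs", "64",
        "--max-model-len", "32768",
        "--gpu-memory-utilization", "0.95",
        "--kv-cache-dtype", "fp8",
    ]


def bench_argv(input_len, output_len, concurrency, num_prompts):
    """Client command for one (context, concurrency) cell.

    output_len and num_prompts are not stated in the post; the caller
    must choose them, and the values used below are placeholders.
    """
    return [
        "vllm", "bench", "serve",
        "--model", MODEL,
        "--host", "localhost", "--port", "8000",
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(num_prompts),
    ]


# Print shell-ready commands for the ctx 8K c=4 cell (placeholder lengths).
print(shlex.join(serve_argv()))
print(shlex.join(bench_argv(input_len=8192, output_len=128,
                            concurrency=4, num_prompts=100)))
```

The exact invocations used for the published numbers are in the run logs of the linked repository; those take precedence over this reconstruction.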

Finding 1 — Throughput ceiling at 8K context reproduces across providers (fresh server)

At ctx 8K, scaling concurrency 4 → 16 produced a throughput plateau and a P99 TTFT explosion on both providers under fresh vLLM server conditions:

| Cell | Spheron | Lambda fresh (Round 3.5, n=3 trials mean) |
|---|---|---|
| ctx 8K c=4 | 83.0 tok/s / 4,607 ms P99 | 85.4 tok/s / 4,637 ms P99 (1.007× of Spheron) |
| ctx 8K c=16 | 83.2 tok/s / 40,113 ms P99 | 85.8 tok/s / 40,280 ms P99 (1.004× of Spheron) |
| Throughput delta c=4→c=16 | +0.2 % | 0.0 % |
| P99 TTFT delta c=4→c=16 | +772 % (8.7×) | +769 % (8.7×) |

Adding 4× concurrency moved server throughput by ≤0.2 % on either provider while P99 TTFT moved by ~8.7× on both. The boundary structure is not a single-provider artifact.
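The plateau and the TTFT multiplier can be rechecked from the cell values. This is a minimal sketch that uses the rounded figures from the table; the table's own delta rows were presumably computed from unrounded data, so small differences in the last digit are expected. `pct_delta` and `multiplier` are illustrative helper names.

```python
# (throughput tok/s, P99 TTFT ms) per cell, copied from the Finding 1 table.
CELLS = {
    "Spheron": {"c4": (83.0, 4607), "c16": (83.2, 40113)},
    "Lambda fresh": {"c4": (85.4, 4637), "c16": (85.8, 40280)},
}


def pct_delta(before, after):
    """Percent change going from `before` to `after`."""
    return (after - before) / before * 100.0


def multiplier(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


for provider, cells in CELLS.items():
    tput4, ttft4 = cells["c4"]
    tput16, ttft16 = cells["c16"]
    print(
        f"{provider}: throughput {pct_delta(tput4, tput16):+.1f} %, "
        f"P99 TTFT x{multiplier(ttft4, ttft16):.1f}"
    )
```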

Finding 2 — Long-context (ctx 28K) cells matched almost exactly

| Cell | Spheron | Lambda (Round 2, warm server) | Ratio (Lambda / Spheron) |
|---|---|---|---|
| ctx 28K c=1 | 5,361 ms | 5,290 ms | 0.99× |
| ctx 28K c=4 | 42,752 ms | 42,761 ms | 1.0002× |
| ctx 28K c=16 (n=10 OOM probe) | 116,573 ms | 116,739 ms | 1.001× |
| ctx 28K c=32 (n=10 OOM probe) | 116,485 ms | 111,643 ms | 0.958× |

Throughput at all ctx 28K cells matched within 1.005× across providers. The ctx 28K series did not show the warm-server drift that Lambda's ctx 8K cells showed in Round 2 — long-context appears robust to the server-state variable described in Finding 5.

Finding 3 — Repeatability of the critical ctx 8K boundary on Lambda

Round 3.5 fresh-server, n=3 trials:

| Config | P99 TTFT (3 trials, ms) | Mean (ms) | COV |
|---|---|---|---|
| ctx 8K c=4 | 4,816 / 4,542 / 4,551 | 4,637 | 3.4 % |
| ctx 8K c=16 | 40,276 / 40,265 / 40,297 | 40,280 | 0.0 % |

The Spheron Round 3.5 (2026-05-12) also produced sub-1.1 % COV on the same critical pair. The verdict (degraded at c=4, unfit at c=16) is stable under both providers, under fresh-server conditions.
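The COV column can be reproduced from the three Lambda trials. The sketch below assumes COV means sample standard deviation (n−1 denominator) divided by the mean; that choice matches the reported 3.4 %, whereas the population form would give about 2.7 % for the c=4 row. `cov_percent` is an illustrative name.

```python
import statistics


def cov_percent(samples):
    """Coefficient of variation in percent: sample stdev / mean * 100."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


# P99 TTFT per trial (ms), copied from the Finding 3 table.
TRIALS = {
    "ctx 8K c=4": [4816, 4542, 4551],
    "ctx 8K c=16": [40276, 40265, 40297],
}

for cell, p99_ms in TRIALS.items():
    print(f"{cell}: COV {cov_percent(p99_ms):.1f} %")
```

With only three trials per cell the COV estimate is itself noisy, so it supports "stable verdict" better than it supports a precise variance figure.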

Finding 4 — Config sweep did not move the boundary on either provider

Spheron Round 3.1 (2026-05-12) and Lambda Round 3.1 (2026-05-13) tested the same vLLM serving knobs:

- `max_num_seqs ∈ {16, 64, 256}`
- `max_num_batched_tokens ∈ {2048, 8192, 16384}`
- `chunked_prefill` on/off

| Provider | Configs tested | Cells that moved the boundary | Startup OOM |
|---|---|---|---|
| Spheron | 6 | 0/9 | 2 (`max_num_batched_tokens=16384`, no chunked_prefill) |
| Lambda | 6 | 0/18 | 0 |

Lambda specifically: across 6 configs × 3 concurrency points (c=4, c=16, c=32), the ctx 8K P99 TTFT cells stayed within 1.011× / 1.001× / 1.001× of each other respectively. Lambda safely launched the two configs that startup-OOM'd on Spheron (mechanism for that startup-OOM delta is not attributed).

The ctx 8K boundary is not config-tunable within the tested knob set. The remedy likely sits at the routing, admission, queue-separation, or hardware-escalation layer, not at the vLLM serving-args layer.

Finding 5 — Server State Drift (new observation, mechanism not attributed)

On the same Lambda instance, the ctx 8K cells produced different P99 TTFT depending on whether the vLLM server was freshly restarted or had ~60 min of accumulated session state.

| Cell | Spheron (fresh) | Lambda Round 2 (warm, ~60 min uptime) | Lambda Round 3.5 (fresh restart, n=3 mean) |
|---|---|---|---|
| ctx 8K c=4 P99 | 4,607 ms | 11,565 ms (2.51× of Spheron) | 4,637 ms (1.007× of Spheron) |
| ctx 8K c=16 P99 | 40,113 ms | 54,382 ms (1.36× of Spheron) | 40,280 ms (1.004× of Spheron) |
| Throughput (tok/s, c=4 / c=16) | 83.0 / 83.2 | 71.6 / 71.6 | 85.4 / 85.8 |

Same model, same hardware, same engine, same args, same workload — boundary severity moved between fresh-restart and warm states on the same physical instance.

Mechanism is not attributed. Candidate causes: KV / block allocator state, scheduler queue accumulation, prefix cache state, CUDA memory pool fragmentation, runtime envelope drift, or cumulative effect of preceding bench cells on the same long-lived server.

Lambda's ctx 28K cells (Finding 2) did NOT show the same warm-server drift; the drift specifically affected ctx 8K. I do not have a hypothesis for the context-length specificity.
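One way to hold the server-state variable fixed is to restart the server before every cell and start the bench only once the server reports healthy. The sketch below covers just the readiness gate and is not the harness used for these runs. It assumes the server's `GET /health` endpoint on `localhost:8000`; the `probe`, `sleep`, and `clock` parameters are illustrative injection points that let the loop be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx: treat as not ready yet.
        return False


def wait_until_healthy(base_url, timeout_s=600.0, poll_s=2.0,
                       probe=is_healthy, sleep=time.sleep,
                       clock=time.monotonic):
    """Poll `probe` until it succeeds or `timeout_s` elapses.

    Returns True once the server is ready, False on timeout.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if probe(base_url):
            return True
        sleep(poll_s)
    return False
```

A per-cell loop would then be: stop the server, start it with the same args, call `wait_until_healthy("http://localhost:8000")`, run one bench cell, and record the server uptime next to the result so that fresh and warm measurements can be separated afterwards.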

Finding 6 — Mixed-workload fairness is provider/runtime-sensitive (correction)

The original RFC body reported a 4.3× short-user P99 degradation at 90/10 mix. That number is real for the Spheron environment but does not generalize to a second provider under the same vLLM args.

| Scenario | Spheron n=20 short P99 | Lambda n=20 short P99 | Lambda n=100 short P99 | Short degradation (vs short alone) |
|---|---|---|---|---|
| short alone | 360 ms | 335 ms | 334 ms | 1.00× |
| 90/10 mix | 1,532 ms | 371 ms | 361 ms | Spheron 4.26× vs Lambda 1.08× |
| 50/50 mix | 594 ms | 905 ms | 915 ms | Spheron 1.65× vs Lambda 2.74× |

Lambda's n=100 rerun confirmed that the 90/10 stability is not a sample-size effect (n=20 and n=100 gave essentially identical degradation ratios). The Spheron baseline is currently only at n=20 (population value remains pending an n=100 Spheron rerun).

The fairness trigger point and severity appear to be scheduler / runtime / arrival-pattern-sensitive. The original 4.3× framing is therefore Spheron-specific. Generalizing it to a universal fairness rule would be wrong.
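The degradation ratios are each mix's short-request P99 divided by the same environment's "short alone" P99. A minimal sketch using the table values follows; `degradation` is an illustrative name, and the Lambda ratios quoted in the table correspond to the n=100 column.

```python
# Short-request P99 (ms) per environment and scenario, from the Finding 6 table.
SHORT_P99_MS = {
    "Spheron n=20": {"short alone": 360, "90/10 mix": 1532, "50/50 mix": 594},
    "Lambda n=20": {"short alone": 335, "90/10 mix": 371, "50/50 mix": 905},
    "Lambda n=100": {"short alone": 334, "90/10 mix": 361, "50/50 mix": 915},
}


def degradation(env):
    """Ratio of each mixed scenario's short P99 to the short-alone baseline."""
    baseline = SHORT_P99_MS[env]["short alone"]
    return {
        scenario: p99 / baseline
        for scenario, p99 in SHORT_P99_MS[env].items()
        if scenario != "short alone"
    }


for env in SHORT_P99_MS:
    ratios = ", ".join(f"{s} {r:.2f}x" for s, r in degradation(env).items())
    print(f"{env}: {ratios}")
```

Laid out this way, the ordering flips between environments: Spheron degrades most at 90/10, while Lambda degrades most at 50/50.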

Where I am asking for critique

1. Server State Drift mechanism. Is there a known vLLM 0.19.1 behavior where ctx 8K c=4/c=16 P99 TTFT degrades over ~60 min of mixed-traffic uptime, then recovers on server restart with the same args? If so, what is the dominant mechanism: block allocator, scheduler state, prefix cache, CUDA memory pool, or something else?
2. Context-length specificity of the drift. Why might warm-server drift affect ctx 8K but not ctx 28K on the same instance under the same workload sequence?
3. Config knob coverage. Were there obvious knobs I missed (`--num-scheduler-steps`, `--enable-prefix-caching` variants, `--swap-space`, `--enforce-eager`, `--disable-log-stats`) that would have moved the c=16 boundary inside the safe range?
4. Fairness divergence interpretation. What scheduler-level mechanism could cause 90/10 mix to degrade short P99 4.3× on one stack and 1.08× on another (same args, same model, same workload)? Is it driver / CUDA, prefix cache state, or arrival-pattern timing?
5. Synthetic vs realistic workload. I used `--dataset-name random`. How much of the ctx 8K c=16 boundary disappears under a realistic distribution (RAG / chat / code / agent prompt length distribution)? Is anyone aware of a known shape where `random` over-reports the boundary?

What I am explicitly not claiming

- Not universal across all H100 deployments. Two providers only, same H100 SXM5 chip class.
- Not provider-agnostic. ctx 8K severity diverged in Round 2; reproduction required Round 3.5 fresh-restart on Lambda.
- Not vLLM-version-agnostic. Both providers ran 0.19.1.
- Not workload-agnostic. `--dataset-name random` only.
- Not root-cause-attributed. No DCGM / Nsight profiling. All mechanism statements above are candidate-level.

Feedback Period

2 weeks from the original post date (until ~2026-05-27 KST). Happy to extend if discussion is active.

CC List

(blank — open to anyone interested.)

Any Other Things

- Cross-provider replication note + per-finding notes: https://github.com/jacob-sunho-kim/llm-boundary-research/tree/master/findings/cross_provider_lambda_spheron_20260513.
- Raw bundle includes all 47 bench JSONs, scenario logs, vLLM server logs, GPU 1 Hz telemetry, environment metadata, pip freeze, nvidia-smi dump. MIT for code, CC-BY-4.0 for data.
- If a maintainer prefers a different format (Discussions, a smaller focused issue, or a research-paper-style writeup), please tag and I will follow up.
