vllm - [Feature]: Performance Tiers: Apple-style hardware requirements for stable inference


🚀 The feature, motivation and pitch

The Problem (2.5 Years Unsolved)

vLLM tries to support everything — CUDA, ROCm, CPU, Apple Silicon, TPU. The result? Hundreds of configuration flags, cryptic docs, and users spending hours tuning PagedAttention only to get OutOfMemory or suboptimal performance.

"The project that tries to catch two hares catches neither."

vLLM chases both universal compatibility and ease of use — and fails at both for the average user.

The Solution: Performance Tiers

Stop trying to run everywhere. Start running perfectly where it matters.

Just as Apple doesn't support every hardware configuration — but delivers a flawless experience on supported devices — vLLM should define clear, enforceable system requirements.

Three Tiers

Minimum

  • Requirements: 16 GB VRAM, 32 GB RAM
  • Context Length: Up to 8K tokens
  • Use Case: Development, testing

Recommended

  • Requirements: 24 GB VRAM, 64 GB RAM
  • Context Length: Up to 32K tokens
  • Use Case: Production, single-user

Optimal

  • Requirements: 48+ GB VRAM, 128 GB RAM
  • Context Length: Up to 128K tokens
  • Use Case: Enterprise, multi-user
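The three tiers above could be expressed as a small lookup table. This is a minimal sketch, not vLLM code: the `Tier` class, `TIERS`, and `select_tier` are hypothetical names, the sketch assumes a tier applies only when both its VRAM and RAM floors are met, and it reads 8K/32K/128K as multiples of 1024 tokens.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    """Hypothetical record for one performance tier (not part of vLLM)."""
    name: str
    min_vram_gb: int   # minimum GPU memory for this tier
    min_ram_gb: int    # minimum system memory for this tier
    max_context: int   # context length cap, in tokens


# Values taken from the tier list above, ordered from highest to lowest.
# Assumption: "8K" means 8 * 1024 tokens, and so on.
TIERS = (
    Tier("optimal", 48, 128, 128 * 1024),
    Tier("recommended", 24, 64, 32 * 1024),
    Tier("minimum", 16, 32, 8 * 1024),
)


def select_tier(vram_gb: float, ram_gb: float) -> Optional[Tier]:
    """Return the highest tier whose VRAM and RAM floors are both met.

    Returns None when the hardware is below the Minimum tier.
    """
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb and ram_gb >= tier.min_ram_gb:
            return tier
    return None
```

Under these assumptions, a machine with 48 GB VRAM but only 64 GB RAM would land in Recommended rather than Optimal, because the lower of the two resources decides the tier.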

How It Works

  1. Hardware Detection: On startup, vLLM checks available resources
  2. Tier Assignment: Automatically selects the appropriate tier
  3. Fixed Parameters: Each tier has pre-configured PagedAttention settings — no manual tuning
  4. Hard Barrier: If hardware doesn't meet Minimum tier — refuse to start with a clear message:

    "Your hardware doesn't meet the minimum requirements (16 GB VRAM required). Consider using transformers for CPU inference or upgrade your GPU."
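The hard barrier in step 4 might look like the sketch below. `UnsupportedHardwareError` and `enforce_minimum` are hypothetical names, not existing vLLM APIs, and the detected VRAM figure is passed in as an argument so the check stays independent of how detection is done (on CUDA it could come from something like `torch.cuda.mem_get_info()`, which is an assumption about the eventual implementation).

```python
MINIMUM_VRAM_GB = 16  # Minimum-tier floor from the proposal


class UnsupportedHardwareError(RuntimeError):
    """Hypothetical exception for below-minimum hardware (not part of vLLM)."""


def enforce_minimum(detected_vram_gb: float) -> None:
    """Refuse to start when detected VRAM is below the Minimum tier."""
    if detected_vram_gb < MINIMUM_VRAM_GB:
        raise UnsupportedHardwareError(
            "Your hardware doesn't meet the minimum requirements "
            f"({MINIMUM_VRAM_GB} GB VRAM required, "
            f"{detected_vram_gb:g} GB detected). "
            "Consider using transformers for CPU inference or upgrade your GPU."
        )
```

Raising a dedicated exception at startup, rather than logging and continuing, is what turns the requirement into a barrier: the process stops before any model weights are loaded.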

Benefits

Before: "Why does it crash on my 8 GB GPU?"
After: "Your hardware is below minimum. Upgrade or use alternatives."

Before: "How do I tune PagedAttention?"
After: "Settings are automatic. Pick your tier."

Before: "It works but slowly on my laptop"
After: "Laptops below minimum aren't supported. Buy desktop/server hardware."

Before: 80% dev time on edge cases
After: 80% dev time on stability and speed for supported hardware

Why This Is the Only Way Forward

For 2.5 years, the community has asked for simpler configuration. The response has always been more flags, more documentation, more platform support.

The definition of insanity is doing the same thing and expecting different results.

Apple didn't become Apple by supporting every PC clone. They defined their ecosystem, enforced requirements, and delivered reliability. vLLM can do the same for LLM inference.

Implementation Sketch

New projects — the clean path:

llm = LLM(model="llama-3-70b", tier="recommended")

Or even simpler:

llm = LLM(model="llama-3-70b")  # Auto-detects and assigns tier

Existing projects — backward compatible:

llm = LLM(model="llama-3-70b", tier="legacy")  # All 50+ flags still work

Or simply omit tier entirely — old behavior preserved:

llm = LLM(
    model="llama-3-70b",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    block_size=16,
    swap_space=4,
    # ... all existing parameters still work
)

Addressing Edge Cases

Multi-GPU setups (e.g., 4 × 16 GB): Total VRAM is what matters. 4 × 16 GB = 64 GB total, which falls under the Optimal tier. Tier calculation uses aggregate resources, not per-device resources.
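The aggregate rule can be illustrated with the 4 × 16 GB case. This sketch looks at VRAM only and ignores the RAM floors for brevity; `tier_for_total_vram` is a hypothetical helper, and the thresholds are the ones from the tier list.

```python
from typing import Optional, Sequence


def tier_for_total_vram(per_device_vram_gb: Sequence[float]) -> Optional[str]:
    """Pick a tier name from the summed VRAM of all devices.

    Hypothetical helper: considers VRAM only, not system RAM.
    Returns None when the total is below the Minimum tier.
    """
    total = sum(per_device_vram_gb)  # aggregate, not per-device
    if total >= 48:
        return "optimal"
    if total >= 24:
        return "recommended"
    if total >= 16:
        return "minimum"
    return None


# Four 16 GB cards are treated as one 64 GB pool.
print(tier_for_total_vram([16, 16, 16, 16]))
```

Note that a single 16 GB card evaluated on its own would only reach Minimum, so the choice between aggregate and per-device accounting changes the outcome for this setup.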

Container/WSL/cloud environments: Reliable auto-detection is a known challenge. Tier assignment includes a --verify-hardware flag that runs a quick benchmark to confirm the resources actually available before finalizing the tier. If detection is uncertain, vLLM falls back to the Minimum tier with a warning.
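The fallback behavior could be sketched as follows. `resolve_tier` and its `detection_reliable` argument are hypothetical; how reliability would actually be judged (for example, by the proposed --verify-hardware benchmark) is left open here.

```python
import warnings


def resolve_tier(detected_tier: str, detection_reliable: bool) -> str:
    """Return the detected tier, or "minimum" when detection is uncertain.

    Hypothetical helper: the caller decides whether detection was reliable.
    """
    if not detection_reliable:
        warnings.warn(
            "Hardware detection was uncertain; falling back to the Minimum tier.",
            RuntimeWarning,
        )
        return "minimum"
    return detected_tier
```

Falling back to the most conservative tier trades some performance for safety: a misreported 80 GB in a container with a 16 GB limit would otherwise pick settings that run out of memory.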

New GPU releases: Tiers are reviewed quarterly by maintainers. Community can propose tier adjustments via dedicated label tier-review. Process is documented in docs/hardware_tiers.md.

The Choice

vLLM can continue chasing two hares — universal compatibility and ease of use — and catch neither.

Or it can choose one path: flawless performance on supported hardware, with clear requirements and automatic configuration.

The Apple path isn't exclusion — it's focus.

Alternatives

More documentation: Already tried for 2.5 years. Doesn't reduce complexity, just moves it around.

More flags for edge cases: This is what created the current mess.

Hardware-specific forks (e.g., tiny-vllm): Fragments the ecosystem, duplicates maintenance effort.

Status quo: Keeps supporting everything poorly instead of supporting something perfectly.

"Warning only" instead of hard barrier: Users ignore warnings, then complain about performance. We've seen this pattern. Soft enforcement = no enforcement.

8GB GPU "research use case": Running LLM inference on 8GB is not research — it's suffering. Researchers have access to cloud, Colab, servers. vLLM should not optimize for suffering.

The tier approach is the only solution that reduces cognitive load for users and focuses developer effort.

Additional context

This proposal is inspired by Apple's hardware ecosystem strategy: define supported configurations, enforce them, and deliver reliability. The "two hares" analogy reflects a universal principle in software engineering — spreading too thin guarantees mediocrity.

The tier system is not about excluding users. It's about being honest: "We can't make this work well on your hardware, so we won't pretend we can." Users with below-minimum hardware get a clear path forward (use transformers, upgrade GPU, or use tier="legacy" if you insist) instead of silent failures and debugging marathons.

Backward compatibility: Existing code without the tier parameter works exactly as before. No migration needed. No rewrite. Tiers are opt-in for new projects.

Escape hatch for experts: tier="legacy" preserves all existing flags. No hidden parameters. No _override hacks. Explicit and documented.

License for this proposal: MIT

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
