vllm - [Feature]: MoE Active Expert Management --moe-gpu-prefetch <num> [1 comment, 1 participant]

🚀 The feature, motivation and pitch

I would like to propose a new feature for vLLM to improve memory efficiency and scalability for sparse Mixture-of-Experts (MoE) models:

--moe-gpu-prefetch <num>

This feature introduces expert-level GPU memory management, allowing only a limited number of experts to reside on GPU while the rest are offloaded to CPU memory.

Motivation

Modern MoE models (e.g., Gemma, Mixtral-style architectures) contain a large number of experts per layer (e.g., 128), but only a small subset (e.g., top-8) are activated per token.

Current approaches to handle GPU memory constraints include:

Full model residency on GPU (high memory requirement)
Layer-level CPU offload (coarse-grained and inefficient)

These approaches do not align well with the fundamental property of MoE:

Activation is sparse at the expert level, not at the layer level.

As a result:

GPU memory is wasted on inactive experts
Data movement is excessive when offloading entire layers
Large MoE models cannot efficiently run on limited GPU setups

Proposed feature

--moe-gpu-prefetch <num>

Here <num> is the number of experts that can be resident ("hot") on the GPU at any time.

Execution model:

All expert weights are stored in CPU memory
GPU memory contains:
- shared layers
- router
- a fixed number of <num> expert slots
During inference:
- If a routed expert is already on GPU → execute directly
- If not → load it from CPU into GPU (evicting another expert if needed)

This effectively turns the GPU into a cache of hot experts.
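To make the memory side of this concrete, here is a rough sizing sketch. It assumes each expert is a gated MLP with three projection matrices, and the dimensions are hypothetical, not those of any specific model. The proposal does not say whether <num> counts slots per layer or for the whole model, so the sketch sizes a single pool of <num> slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpertShape:
    """Hypothetical dimensions of one gated-MLP expert (illustrative values only)."""

    hidden_size: int
    intermediate_size: int
    bytes_per_param: int = 2  # assumption: 16-bit weights

    def bytes_per_expert(self) -> int:
        # Assumption: gate, up, and down projections, each hidden x intermediate.
        return 3 * self.hidden_size * self.intermediate_size * self.bytes_per_param


def slot_pool_bytes(shape: ExpertShape, num_slots: int) -> int:
    """GPU bytes reserved for expert weights when exactly num_slots slots exist."""
    if num_slots < 0:
        raise ValueError("num_slots must be non-negative")
    return num_slots * shape.bytes_per_expert()


shape = ExpertShape(hidden_size=2048, intermediate_size=1024)
for num_slots in (16, 32, 64):
    mib = slot_pool_bytes(shape, num_slots) / 2**20
    print(f"{num_slots} slots -> {mib:.0f} MiB of expert weights")
```

With these made-up dimensions one expert takes 12 MiB, so the expert pool grows linearly with <num> regardless of how many experts the model defines.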

🔧 Core Design: GPU Expert Slot Mapping

The core of this feature is to introduce a fixed set of physical GPU expert slots, and maintain a dynamic mapping between these slots and logical expert IDs.

Key abstraction

GPU Expert Slots (size = <num>)

Slot 0 → Expert 12
Slot 1 → Expert 45
Slot 2 → Expert 7
...
Slot N-1 → Expert 63

Runtime structures:

slot_to_expert:  slot_id → expert_id
expert_to_slot:  expert_id → slot_id (or -1 if not resident)

Where:

A slot is a fixed GPU memory region capable of storing one expert’s weights
An expert ID is the logical expert index in the MoE model
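The two runtime structures can be sketched as a pair of flat lists that are always updated together. This is an illustration only: the class name `SlotTable`, its methods, and the use of -1 to mark an empty slot are assumptions, not code from the demo repo or from vLLM.

```python
class SlotTable:
    """Bidirectional slot <-> expert mapping with constant-time lookups (sketch)."""

    NOT_RESIDENT = -1

    def __init__(self, num_slots: int, num_experts: int) -> None:
        # slot_id -> expert_id; -1 means the slot is empty (assumption).
        self.slot_to_expert = [self.NOT_RESIDENT] * num_slots
        # expert_id -> slot_id; -1 means the expert is not on the GPU.
        self.expert_to_slot = [self.NOT_RESIDENT] * num_experts

    def lookup(self, expert_id: int) -> int:
        """Return the slot holding expert_id, or -1 if it is not resident."""
        return self.expert_to_slot[expert_id]

    def bind(self, slot_id: int, expert_id: int) -> None:
        """Point slot_id at expert_id, unbinding whatever the slot held before."""
        if self.expert_to_slot[expert_id] != self.NOT_RESIDENT:
            raise ValueError(f"expert {expert_id} is already resident")
        previous = self.slot_to_expert[slot_id]
        if previous != self.NOT_RESIDENT:
            self.expert_to_slot[previous] = self.NOT_RESIDENT
        self.slot_to_expert[slot_id] = expert_id
        self.expert_to_slot[expert_id] = slot_id


table = SlotTable(num_slots=4, num_experts=64)
for slot_id, expert_id in enumerate([12, 45, 7, 63]):
    table.bind(slot_id, expert_id)

print(table.lookup(45))  # slot holding expert 45
table.bind(2, 3)         # reuse slot 2 for expert 3; expert 7 becomes non-resident
print(table.lookup(7), table.lookup(3))
```

Keeping both directions means a residency check is a single index operation, and an eviction only has to touch the two entries involved.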

Execution flow

Router selects top-k experts per token
For each expert:
- Check expert_to_slot
  - Hit → directly execute using the corresponding GPU slot
  - Miss → trigger load
On miss:
- Select a victim slot (e.g., LRU or frequency-based)
- Evict the current expert from that slot
- Load the required expert from CPU to GPU
- Update mappings
Continue execution
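The flow above can be sketched as a small LRU cache. LRU is only one of the victim policies the proposal mentions, and `ExpertSlotCache`, `acquire`, and `load_fn` are hypothetical names; `load_fn` stands in for the real CPU-to-GPU weight copy. The sketch does not pin experts that are needed in the same forward step, so it assumes <num> is at least as large as the number of experts used at once.

```python
from collections import OrderedDict
from typing import Callable


class ExpertSlotCache:
    """LRU mapping of expert IDs onto a fixed number of GPU slots (sketch)."""

    def __init__(self, num_slots: int, load_fn: Callable[[int, int], None]) -> None:
        if num_slots <= 0:
            raise ValueError("num_slots must be positive")
        # Reversed so that pop() hands out slot 0 first.
        self.free_slots = list(range(num_slots - 1, -1, -1))
        # expert_id -> slot_id, ordered from least to most recently used.
        self.resident = OrderedDict()
        self.load_fn = load_fn  # called as load_fn(expert_id, slot_id)
        self.hits = 0
        self.misses = 0

    def acquire(self, expert_id: int) -> int:
        """Return the slot holding expert_id, loading it on a miss."""
        if expert_id in self.resident:
            # Hit: mark as most recently used and run from the existing slot.
            self.hits += 1
            self.resident.move_to_end(expert_id)
            return self.resident[expert_id]

        # Miss: take a free slot, or evict the least recently used expert.
        self.misses += 1
        if self.free_slots:
            slot_id = self.free_slots.pop()
        else:
            _victim, slot_id = self.resident.popitem(last=False)
        self.load_fn(expert_id, slot_id)
        self.resident[expert_id] = slot_id
        return slot_id


loads = []
cache = ExpertSlotCache(num_slots=2, load_fn=lambda e, s: loads.append((e, s)))
for expert_id in [12, 45, 12, 7, 45]:
    cache.acquire(expert_id)
print(cache.hits, cache.misses, loads)
```

In this run the second request for expert 12 is a hit, while expert 45 is evicted to make room for expert 7 and has to be reloaded afterwards, which is the thrashing pattern a too-small <num> would cause.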

Design philosophy

Instead of managing memory at the layer level, this feature manages memory at the expert level.

This shifts model execution from layer-centric memory management to expert-centric memory management.

The GPU is no longer treated as a place to hold the entire model, but as a bounded, dynamically managed cache of active experts.

Benefits

Enables large MoE models to run on limited GPU memory
Reduces memory waste from inactive experts
Avoids coarse layer-level data movement
Aligns with sparse MoE execution semantics
Maintains correctness (no change to outputs)
Provides deterministic GPU memory usage

Relation to existing vLLM design

This design is conceptually similar to vLLM’s KV cache block management:

KV cache → manages token-level memory blocks
Expert slots → manage expert-level memory blocks

This allows:

O(1) expert residency lookup
clean integration into the MoE execution path
minimal impact on the scheduler

Alternatives

Full GPU residency
- Requires large GPU memory
- Not scalable for large MoE models
Layer-level CPU offload
- Moves entire layers between CPU and GPU
- Ignores MoE sparsity
- Causes excessive data movement

Compared to these:

Expert-level offload is more fine-grained and better aligned with MoE routing behavior.
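As a back-of-the-envelope illustration of the difference in data movement, the sketch below bounds how much of one layer's expert weights must be copied when every routed expert misses. It uses the 128-expert, top-8 figures from the Motivation section; the function name and the worst-case assumption that every token picks different experts are mine.

```python
def worst_case_transfer_fraction(num_experts: int, top_k: int, tokens: int = 1) -> float:
    """Upper bound on the share of a layer's expert weights copied to the GPU.

    Assumes every routed expert is a miss and every token selects different
    experts; layer-level offload corresponds to a fraction of 1.0.
    """
    if num_experts <= 0 or top_k <= 0 or tokens <= 0:
        raise ValueError("all arguments must be positive")
    needed = min(num_experts, top_k * tokens)
    return needed / num_experts


for tokens in (1, 4, 16):
    fraction = worst_case_transfer_fraction(num_experts=128, top_k=8, tokens=tokens)
    print(f"{tokens} token(s): at most {fraction:.2%} of the layer's expert weights")
```

For a single token the bound is 8/128 of the layer's expert weights, but it reaches the full layer once 16 tokens with disjoint expert choices are processed together, so the advantage over layer-level offload depends on batch size and on how much routing overlaps across tokens.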

Summary

This proposal introduces a simple but powerful abstraction:

a fixed number of GPU expert slots mapped dynamically to active expert IDs.

This enables vLLM to:

scale MoE inference under constrained GPU memory
better utilize sparse activation patterns
move toward a more flexible, expert-aware execution model

Alternatives

No response

Additional context

Demo repo: leoustc/vllm-moe

Hardware: A10-40GB
Model: /models/gemma-4-26B-A4B-it

GPU limit   Prefetch num   Output tok/s   Status
0.50        16             20.43          OK
0.50        32             N/A            KV cache startup failure
0.50        64             N/A            KV cache startup failure
0.75        16             20.64          OK
0.75        32             34.78          OK
0.75        64             N/A            KV cache startup failure
0.95        16             21.71          OK
0.95        32             33.86          OK
0.95        64             56.51          OK
0.95        72             60.31          OK
0.95        96             N/A            active expert cache startup failure

With this feature, GPU memory use for MoE-style models can be controlled at a fine granularity while still getting good performance from a large model on a small GPU.
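The reported numbers can be summarized with a short script that picks the fastest working prefetch value for each GPU limit. The data is copied from the table above; `RESULTS` and `best_per_limit` are illustrative names, and failed runs are represented as None.

```python
# (gpu_limit, prefetch_num, output_tok_s); None marks a startup failure.
RESULTS = [
    (0.50, 16, 20.43),
    (0.50, 32, None),
    (0.50, 64, None),
    (0.75, 16, 20.64),
    (0.75, 32, 34.78),
    (0.75, 64, None),
    (0.95, 16, 21.71),
    (0.95, 32, 33.86),
    (0.95, 64, 56.51),
    (0.95, 72, 60.31),
    (0.95, 96, None),
]


def best_per_limit(rows):
    """Map each GPU limit to (prefetch_num, tok_s) of its fastest successful run."""
    best = {}
    for limit, prefetch_num, tok_s in rows:
        if tok_s is None:
            continue  # skip configurations that failed at startup
        if limit not in best or tok_s > best[limit][1]:
            best[limit] = (prefetch_num, tok_s)
    return best


for limit, (prefetch_num, tok_s) in sorted(best_per_limit(RESULTS).items()):
    print(f"GPU limit {limit:.2f}: best prefetch {prefetch_num} at {tok_s} tok/s")
```

In every row group the fastest run is the largest prefetch value that still started successfully.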

extent analysis

TL;DR

Implementing the proposed --moe-gpu-prefetch feature with a suitable number of expert slots can help mitigate GPU memory constraints for sparse Mixture-of-Experts models.

Guidance

To find a workable number of expert slots, sweep <num> at a fixed GPU limit and record the output tok/s and startup status.
Throughput rises with the slot count, but so does memory pressure: in the reported runs at GPU limit 0.95, output went from 21.71 tok/s with 16 slots to 60.31 tok/s with 72 slots, while 96 slots failed at startup.
Use the provided demo repo (leoustc/vllm-moe) and test cases as a starting point for evaluation.
Expect startup failures (KV cache startup failure, active expert cache startup failure) when the prefetch number is too large for the GPU limit: in the reported runs, 32 slots failed at GPU limit 0.50 and 64 slots failed at 0.75.

Example

The original issue provides no code snippet; it proposes a new feature rather than debugging existing code.

Notes

The proposed feature is designed to work with sparse Mixture-of-Experts models and may not be applicable to other model architectures. The optimal prefetch number may vary depending on the specific model, hardware, and performance requirements.

Recommendation

Apply the proposed --moe-gpu-prefetch feature with careful experimentation to find the suitable number of expert slots, as it can help improve memory efficiency and scalability for sparse MoE models.
