vllm - 💡(How to fix) Fix [Feature]: MoE Active Expert Management --moe-gpu-prefetch <num> [1 comments, 1 participants]

GitHub stats
vllm-project/vllm#41447 (fetched 2026-05-02 05:28:05)
Comments: 1 · Participants: 1 · Timeline events: 2 · Reactions: 1
Timeline (top): commented ×1, labeled ×1

This proposal introduces a simple but powerful abstraction:

a fixed number of GPU expert slots mapped dynamically to active expert IDs.

This enables vLLM to:

  • scale MoE inference under constrained GPU memory
  • better utilize sparse activation patterns
  • move toward a more flexible, expert-aware execution model

Code Example

--moe-gpu-prefetch <num>

---

GPU Expert Slots (size = <num>)

Slot 0 → Expert 12
Slot 1 → Expert 45
Slot 2 → Expert 7
...
Slot N-1 → Expert 63

---

slot_to_expert:  slot_id → expert_id
expert_to_slot:  expert_id → slot_id (or -1 if not resident)

🚀 The feature, motivation and pitch

I would like to propose a new feature for vLLM to improve memory efficiency and scalability for sparse Mixture-of-Experts (MoE) models:

--moe-gpu-prefetch <num>

This feature introduces expert-level GPU memory management, allowing only a limited number of experts to reside on GPU while the rest are offloaded to CPU memory.


Motivation

Modern MoE models (e.g., Gemma, Mixtral-style architectures) contain a large number of experts per layer (e.g., 128), but only a small subset (e.g., top-8) are activated per token.

Current approaches to handle GPU memory constraints include:

  • Full model residency on GPU (high memory requirement)
  • Layer-level CPU offload (coarse-grained and inefficient)

These approaches do not align well with the fundamental property of MoE:

Activation is sparse at the expert level, not at the layer level.

As a result:

  • GPU memory is wasted on inactive experts
  • Data movement is excessive when offloading entire layers
  • Large MoE models cannot efficiently run on limited GPU setups

Proposed feature

--moe-gpu-prefetch <num>

Here <num> is the number of experts that can be resident ("hot") on the GPU at any time.

Execution model:

  • All expert weights are stored in CPU memory

  • GPU memory contains:

    • shared layers
    • router
    • a fixed number of <num> expert slots
  • During inference:

    • If a routed expert is already on GPU → execute directly
    • If not → load it from CPU into GPU (evicting another expert if needed)

This effectively turns the GPU into a cache of hot experts.


🔧 Core Design: GPU Expert Slot Mapping

The core of this feature is to introduce a fixed set of physical GPU expert slots, and maintain a dynamic mapping between these slots and logical expert IDs.

Key abstraction

GPU Expert Slots (size = <num>)

Slot 0 → Expert 12
Slot 1 → Expert 45
Slot 2 → Expert 7
...
Slot N-1 → Expert 63

Runtime structures:

slot_to_expert:  slot_id → expert_id
expert_to_slot:  expert_id → slot_id (or -1 if not resident)

Where:

  • A slot is a fixed GPU memory region capable of storing one expert’s weights
  • An expert ID is the logical expert index in the MoE model
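The two runtime structures can be sketched as a pair of arrays kept in sync. This is only an illustration of the bookkeeping described above, not code from vLLM or the demo repo; the class name `ExpertSlotMap`, its methods, and the `NOT_RESIDENT` constant are hypothetical. Only the `-1` sentinel and the two mapping names come from the proposal.

```python
from dataclasses import dataclass, field

NOT_RESIDENT = -1  # sentinel from the proposal: expert is not on the GPU


@dataclass
class ExpertSlotMap:
    """Hypothetical slot bookkeeping; not a vLLM class."""
    num_slots: int
    num_experts: int
    slot_to_expert: list[int] = field(init=False)
    expert_to_slot: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # Every slot starts empty and no expert is resident.
        self.slot_to_expert = [NOT_RESIDENT] * self.num_slots
        self.expert_to_slot = [NOT_RESIDENT] * self.num_experts

    def lookup(self, expert_id: int) -> int:
        """O(1) residency check: returns the slot id, or -1 on a miss."""
        return self.expert_to_slot[expert_id]

    def bind(self, slot_id: int, expert_id: int) -> None:
        """Record that expert_id now occupies slot_id."""
        if self.expert_to_slot[expert_id] != NOT_RESIDENT:
            raise ValueError(f"expert {expert_id} is already resident")
        previous = self.slot_to_expert[slot_id]
        if previous != NOT_RESIDENT:
            # The old occupant is no longer resident once it is overwritten.
            self.expert_to_slot[previous] = NOT_RESIDENT
        self.slot_to_expert[slot_id] = expert_id
        self.expert_to_slot[expert_id] = slot_id


slots = ExpertSlotMap(num_slots=4, num_experts=64)
slots.bind(0, 12)
slots.bind(1, 45)
print(slots.lookup(12), slots.lookup(7))  # 0 -1
```

Keeping both directions means a hit check and a victim lookup are each a single index operation, at the cost of having to update the two arrays together.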

Execution flow

  1. Router selects top-k experts per token

  2. For each expert:

    • Check expert_to_slot

      • Hit → directly execute using the corresponding GPU slot
      • Miss → trigger load
  3. On miss:

    • Select a victim slot (e.g., LRU or frequency-based)
    • Evict the current expert from that slot
    • Load the required expert from CPU to GPU
    • Update mappings
  4. Continue execution
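The hit/miss flow above can be sketched with an LRU policy, one of the two victim-selection options the proposal mentions. This is a minimal sketch under stated assumptions: `LRUExpertCache` and `acquire` are hypothetical names, the CPU-to-GPU weight copy is reduced to a comment, and eviction simply drops the mapping because the proposal keeps all expert weights in CPU memory.

```python
from collections import OrderedDict


class LRUExpertCache:
    """Hypothetical hot-expert cache; the real weight copy is omitted."""

    def __init__(self, num_slots: int) -> None:
        if num_slots <= 0:
            raise ValueError("num_slots must be positive")
        # Maps expert_id -> slot_id; insertion order tracks recency of use.
        self.expert_to_slot: OrderedDict[int, int] = OrderedDict()
        self.free_slots = list(range(num_slots))
        self.hits = 0
        self.misses = 0

    def acquire(self, expert_id: int) -> int:
        """Return the slot holding expert_id, loading it on a miss."""
        if expert_id in self.expert_to_slot:
            self.hits += 1
            self.expert_to_slot.move_to_end(expert_id)  # most recently used
            return self.expert_to_slot[expert_id]
        self.misses += 1
        if self.free_slots:
            slot = self.free_slots.pop(0)
        else:
            # Evict the least recently used expert and reuse its slot.
            _victim, slot = self.expert_to_slot.popitem(last=False)
        # A real implementation would copy the expert's weights CPU -> GPU here.
        self.expert_to_slot[expert_id] = slot
        return slot


cache = LRUExpertCache(num_slots=2)
used = [cache.acquire(e) for e in [12, 45, 12, 7, 45]]
print(used, cache.hits, cache.misses)  # [0, 1, 0, 1, 0] 1 4
```

A real integration would also have to make sure a victim is not an expert still needed by the current forward step, which this sketch does not model.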


Design philosophy

Instead of managing memory at the layer level:

This feature manages memory at the expert level.

This shifts the model execution from:

  • Layer-centric memory management

to:

  • Expert-centric memory management

The GPU is no longer treated as a place to hold the entire model, but as:

a bounded, dynamically managed cache of active experts.


Benefits

  • Enables large MoE models to run on limited GPU memory
  • Reduces memory waste from inactive experts
  • Avoids coarse layer-level data movement
  • Aligns with sparse MoE execution semantics
  • Maintains correctness (no change to outputs)
  • Provides deterministic GPU memory usage
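The "deterministic GPU memory usage" point can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes each expert is a gated MLP with three weight matrices (gate, up, down) and that `<num>` slots are allocated per MoE layer; neither assumption is stated in the proposal, and the function names and dimensions are purely illustrative, not taken from any real model config.

```python
def expert_slot_bytes(hidden_size: int, intermediate_size: int,
                      bytes_per_param: int = 2) -> int:
    """Bytes for one expert, assuming gate, up, and down projections."""
    return 3 * hidden_size * intermediate_size * bytes_per_param


def slot_pool_bytes(num_slots: int, num_moe_layers: int, hidden_size: int,
                    intermediate_size: int, bytes_per_param: int = 2) -> int:
    """Fixed GPU footprint of the slot pool, assuming per-layer slots."""
    per_slot = expert_slot_bytes(hidden_size, intermediate_size, bytes_per_param)
    return num_slots * num_moe_layers * per_slot


# Illustrative dimensions only (hypothetical, 16-bit weights).
pool = slot_pool_bytes(num_slots=16, num_moe_layers=30,
                       hidden_size=4096, intermediate_size=1024)
print(f"{pool / 2**30:.2f} GiB")  # 11.25 GiB
```

Because the pool size depends only on `<num>` and the model shape, it is fixed at startup. Under these assumptions, a larger `<num>` leaves less room for everything else on the GPU, including the KV cache.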

Relation to existing vLLM design

This design is conceptually similar to vLLM’s KV cache block management:

  • KV cache → manages token-level memory blocks
  • Expert slots → manage expert-level memory blocks

This allows:

  • O(1) expert residency lookup
  • clean integration into the MoE execution path
  • minimal impact on the scheduler

Alternatives

  1. Full GPU residency

    • Requires large GPU memory
    • Not scalable for large MoE models
  2. Layer-level CPU offload

    • Moves entire layers between CPU and GPU
    • Ignores MoE sparsity
    • Causes excessive data movement

Compared to these:

Expert-level offload is more fine-grained and better aligned with MoE routing behavior.


Summary

This proposal introduces a simple but powerful abstraction:

a fixed number of GPU expert slots mapped dynamically to active expert IDs.

This enables vLLM to:

  • scale MoE inference under constrained GPU memory
  • better utilize sparse activation patterns
  • move toward a more flexible, expert-aware execution model

Additional context

Demo repo: leoustc/vllm-moe

  • Hardware: A10-40GB
  • Model: /models/gemma-4-26B-A4B-it

GPU limit   Prefetch num   Output tok/s   Status
0.50        16             20.43          OK
0.50        32             NA             KV cache startup failure
0.50        64             NA             KV cache startup failure
0.75        16             20.64          OK
0.75        32             34.78          OK
0.75        64             NA             KV cache startup failure
0.95        16             21.71          OK
0.95        32             33.86          OK
0.95        64             56.51          OK
0.95        72             60.31          OK
0.95        96             NA             active expert cache startup failure

With this feature, we can finely control GPU resource usage for MoE-style models and still get good performance from a large model on a small GPU.
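As a reading aid for the measurements above, the short script below picks the fastest working prefetch number for each GPU limit. The rows are copied from the table; `NA` entries are stored as `None`. The helper name `best_per_limit` is hypothetical and the script is not part of the demo repo.

```python
# Rows from the table above: (gpu_limit, prefetch_num, output_tok_per_s).
# None marks a configuration that failed at startup.
RESULTS = [
    (0.50, 16, 20.43), (0.50, 32, None), (0.50, 64, None),
    (0.75, 16, 20.64), (0.75, 32, 34.78), (0.75, 64, None),
    (0.95, 16, 21.71), (0.95, 32, 33.86), (0.95, 64, 56.51),
    (0.95, 72, 60.31), (0.95, 96, None),
]


def best_per_limit(results):
    """Map each GPU limit to its (prefetch_num, tok/s) with highest throughput."""
    best = {}
    for limit, prefetch, tok_per_s in results:
        if tok_per_s is None:
            continue  # skip configurations that did not start
        if limit not in best or tok_per_s > best[limit][1]:
            best[limit] = (prefetch, tok_per_s)
    return best


for limit, (prefetch, tok_per_s) in sorted(best_per_limit(RESULTS).items()):
    print(f"GPU limit {limit:.2f}: prefetch {prefetch} -> {tok_per_s:.2f} tok/s")
```

In these runs throughput rises with the prefetch number until a startup failure occurs, so the best working value at each GPU limit is the largest one that still starts.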

extent analysis

TL;DR

Implementing the proposed --moe-gpu-prefetch feature with a suitable number of expert slots can help mitigate GPU memory constraints for sparse Mixture-of-Experts models.

Guidance

  • To determine the optimal number of expert slots, experiment with different values of <num> and monitor the output tok/s and status.
  • Consider the trade-off between GPU memory usage and performance when selecting the prefetch number.
  • Use the provided demo repo (leoustc/vllm-moe) and test cases as a starting point for evaluation.
  • Be cautious of potential startup failures (e.g., KV cache startup failure, active expert cache startup failure) when increasing the prefetch number.

Example

No code snippet is provided as the issue focuses on proposing a new feature rather than debugging existing code.

Notes

The proposed feature is designed to work with sparse Mixture-of-Experts models and may not be applicable to other model architectures. The optimal prefetch number may vary depending on the specific model, hardware, and performance requirements.

Recommendation

Evaluate the proposed --moe-gpu-prefetch feature with careful experimentation to find a suitable number of expert slots. The results reported in the issue suggest it can improve memory efficiency and scalability for sparse MoE models.
