vllm - ✅(Solved) Fix [RFC]: Incremental MoE Expert Offloading — GPU Cache + Async Pipeline [1 pull requests, 3 comments, 2 participants]

GitHub stats
vllm-project/vllm#38256 (fetched 2026-04-08 01:36:58)
Comments: 3
Participants: 2
Timeline events: 24
Reactions: 0
Timeline (top): subscribed ×8, referenced ×6, mentioned ×5, commented ×3

Dynamic MoE expert weight offloading for vLLM. Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest experts; LRU eviction and cross-layer prediction minimize cache misses. Models that exceed GPU VRAM can run on smaller hardware.

PR 1 is open: #37190 (~600 LOC Python, passing CI).

This RFC covers the full 3-PR architecture and provides production data from tinyserve, an independent implementation of the same techniques (30 tok/s decode on 8 GB GPU, 325 tests).

Error Message

  • No CPU fallback computation. If more experts are needed than cache capacity, prepare() raises a clear error with guidance to increase --moe-expert-cache-size. Hand-rolled CPU MoE with Python loops is a correctness trap.
  • No silent config downgrade. If the user requests offloading and it's incompatible, that's an error, not a silent no-op.

Fix Action

Fix / Workaround

At FusedMoEModularMethod.apply() — replace direct layer.w13_weight access with provider.prepare(topk_ids). No bypass of the runner. All paths go through runner.forward() -> quant_method.apply(), preserving EP dispatch, DP chunking, and shared-expert overlap.

PR fix notes

PR #37190: [MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)

Description (problem / solution / changelog)

Purpose

CachedWeightProvider — MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.

Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.

No runner bypass — all paths go through quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.

References: RFC #38256 | tinyserve (production validation, 481 tests)

Test results

Community validation (independent):

| Model | VRAM | tok/s | Tester |
| --- | --- | --- | --- |
| Nemotron-Cascade-2-30B-A3B (cache=8) | 7.6 GB | 15.6 | @caiovicentino |
| Gemma-4-26B-A4B-it (cache=8) | 8.6 GB | 14.8 | @caiovicentino |

LFRU vs LRU (Nemotron, cache=8): LFRU cache=8 exceeds LRU cache=16 in hit rate. +5.2% speed improvement.

Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).

Changes

15 files, ~810 additions

| File | What |
| --- | --- |
| expert_weight_provider.py (new) | CachedWeightProvider with LFRU eviction, ExpertWeightResult dataclass |
| fused_moe_method_base.py | supports_expert_lru_cache property (default False) |
| fused_moe_modular_method.py | Provider check in apply() |
| layer.py | _maybe_init_expert_lru_cache(), expert_weight_provider attribute |
| unquantized_fused_moe_method.py | CPU weight allocation, cache init, kernel init for cache path, XPU transpose |
| quantization/fp8.py | supports_expert_lru_cache, provider check, cache init |
| offload.py | moe_expert_cache_size config field |
| vllm.py | Cross-validator: enforce_eager required |
| arg_utils.py | CLI argument --moe-expert-cache-size |
| llm.py | moe_expert_cache_size parameter in LLM.__init__ |
| basic_correctness.yaml | CI test area registration |
| docs/features/moe_cache_policies.md (new) | Feature documentation |
| test_expert_lru_cache.py (new) | 26 unit tests with parametrization |
| test_moe_expert_cache.py (new) | Integration test via compare_two_settings |
| benchmarks/qwen_122b_test_20260331.txt (new) | Benchmark raw data |

How it works

moe_expert_cache_size == 0 (default):
  No provider created. Zero overhead (one getattr per layer).

moe_expert_cache_size > 0:
  CachedWeightProvider.prepare(topk_ids):
    for each unique expert:
      hit  → update LFRU frequency + recency (O(1))
      miss → evict lowest freq/age score, H2D copy, update mapping
    remap topk_ids → slot indices via persistent GPU mapping tensor
  → kernel receives GPU buffer + remapped IDs
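
The hit and miss paths above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's code: the class name `LfruSlots`, the logical clock, and the concrete score `hits / (age + 1)` are all hypothetical. The PR description only says that the victim is the entry with the lowest freq/age score.

```python
class LfruSlots:
    """Toy frequency-weighted LRU over a fixed number of slots (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slot_of = {}    # expert_id -> slot index
        self.hits = {}       # expert_id -> access count
        self.last_used = {}  # expert_id -> logical time of last access
        self.clock = 0
        self.loads = 0       # number of simulated H2D copies

    def _score(self, expert_id: int) -> float:
        # Assumed scoring rule: frequent and recently used experts score high.
        age = self.clock - self.last_used[expert_id]
        return self.hits[expert_id] / (age + 1)

    def touch(self, expert_id: int) -> int:
        """Return the slot for expert_id, loading it on a miss."""
        self.clock += 1
        if expert_id in self.slot_of:
            self.hits[expert_id] += 1          # hit: bump frequency
        else:
            if len(self.slot_of) < self.capacity:
                slot = len(self.slot_of)       # free slot still available
            else:
                victim = min(self.slot_of, key=self._score)
                slot = self.slot_of.pop(victim)
                del self.hits[victim], self.last_used[victim]
            self.slot_of[expert_id] = slot
            self.hits[expert_id] = 1
            self.loads += 1                    # stands in for the H2D copy
        self.last_used[expert_id] = self.clock
        return self.slot_of[expert_id]


cache = LfruSlots(capacity=2)
for expert in [0, 0, 0, 1, 2]:
    cache.touch(expert)
```

In this trace, pure LRU would evict expert 0 when expert 2 arrives, because expert 1 was touched more recently. With the frequency-weighted score, the three earlier hits keep expert 0 resident and expert 1 is evicted instead.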

Limitations

  • --enforce-eager required (CUDA graph compat deferred to PR 2)
  • Synchronous H2D copies (async pipeline in PR 2)
  • Single eviction policy (LFRU hardcoded, no pluggable framework)
  • EP > 1 not supported
  • BF16 + FP8 per-tensor only

Test plan

pytest tests/kernels/moe/test_expert_lru_cache.py -v
pytest tests/basic_correctness/test_moe_expert_cache.py -v -s

AI-assisted development (Claude Code). Architecture validated in tinyserve.

Changed files

  • .buildkite/test_areas/basic_correctness.yaml (modified, +2/-0)
  • benchmarks/qwen_122b_test_20260331.txt (added, +28/-0)
  • docs/features/moe_cache_policies.md (added, +102/-0)
  • tests/basic_correctness/test_moe_expert_cache.py (added, +39/-0)
  • tests/kernels/moe/test_expert_lru_cache.py (added, +280/-0)
  • vllm/config/offload.py (modified, +3/-0)
  • vllm/config/vllm.py (modified, +15/-0)
  • vllm/engine/arg_utils.py (modified, +5/-0)
  • vllm/entrypoints/llm.py (modified, +7/-0)
  • vllm/model_executor/layers/fused_moe/expert_weight_provider.py (added, +216/-0)
  • vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +10/-0)
  • vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py (modified, +17/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +111/-3)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +94/-31)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +40/-0)


Motivation

Large MoE models (DeepSeek-V3 671B, Qwen3.5-122B) don't fit in a single GPU. Only a small subset of experts activates per token (e.g., 8 of 256), so most expert weights sit idle. Moving cold experts to CPU and caching hot ones on GPU lets these models run on hardware that would otherwise OOM.

Prior art in vLLM

| PR | What it does | Limitation this RFC addresses |
| --- | --- | --- |
| #34535 (merged) | Static CPU weight offload | No runtime migration; offloaded weights stay on CPU permanently |
| #29941 (merged) | Async H2D prefetch for non-MoE weights | Pattern reused for expert prefetch in PR 2 |
| RFC #33869 / #31938 | Monolithic MoE offload (cache + CPU kernels + DBO + prefetch) | Closed: too large to review/pass CI. This RFC takes the opposite approach |

Key design principle

The cache is a weight provider, not a special forward path. The kernel does not know or care where weights came from. No bypass of the runner pipeline.

Production data (tinyserve)

RTX PRO 2000 8 GB, GPT-OSS-20B MXFP4, 238 cache slots, single-stream decode:

| Metric | Result |
| --- | --- |
| Decode throughput | 30 tok/s (stable across context lengths) |
| vs HF device_map="auto" | 160x faster |
| Cache hit rate (temporal prediction) | 97-100% |
| Expert loads per layer (batched prefill) | O(num_experts) vs O(seq_len x top_k) |

Caveat: These numbers are single-stream on a laptop GPU. Multi-user batched inference on H100 will have different bottlenecks (higher H2D bandwidth, but batch diversity may reduce hit rates). In-tree benchmarks will accompany each PR.

Architecture

ExpertWeightProvider (ABC)
├── FullGPUProvider        -- zero-cost passthrough (default, no overhead)
└── CachedWeightProvider   -- GPU LRU cache + CPU backing store
      ├── GPUSlotManager   -- fixed-address GPU buffers
      ├── LRUEviction      -- pure-Python eviction (collections.OrderedDict)
      └── CPUBackingStore   -- pinned DRAM, all local experts

Integration point

At FusedMoEModularMethod.apply() — replace direct layer.w13_weight access with provider.prepare(topk_ids). No bypass of the runner. All paths go through runner.forward() -> quant_method.apply(), preserving EP dispatch, DP chunking, and shared-expert overlap.

prepare() contract

class ExpertWeightProvider(ABC):
    @abstractmethod
    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        """Ensure requested experts are GPU-resident. Returns GPU tensors."""
        ...

@dataclass
class ExpertWeightResult:
    w1: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    w2: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    topk_ids: torch.Tensor # always remapped to slot indices
    w1_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None

Design choices driven by torch.compile compatibility (per @zou3519's review of #29941):

  • topk_ids is always remapped — no boolean flag that changes tensor interpretation
  • Scales are fixed attributes, not a dynamic dict — avoids graph breaks
  • GPU buffers allocated once at init (fixed addresses for CUDA graph capture)
  • prepare() is inherently dynamic Python (eviction, H2D copies) and must run outside any torch.compile boundary

Quant-agnostic tensor registration

Each quant method declares what tensors to cache. The cache stores them as opaque blobs by name — adding a new quant format requires zero cache code changes.
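
One way to picture "opaque blobs by name" is a store keyed by tensor name, where each quant method supplies its own list of names. The sketch below is an assumption-laden illustration: `NamedSlotStore` and its methods are hypothetical, and Python lists and strings stand in for GPU buffers and tensors.

```python
class NamedSlotStore:
    """Per-name slot buffers; the store never inspects what a blob contains."""

    def __init__(self, capacity, tensor_names):
        self.capacity = capacity
        # One column of `capacity` slots per declared tensor name.
        self.buffers = {name: [None] * capacity for name in tensor_names}

    def load(self, slot, expert_tensors):
        """Copy every declared tensor of one expert into the given slot."""
        missing = sorted(set(self.buffers) - set(expert_tensors))
        if missing:
            raise KeyError(f"expert is missing tensors: {missing}")
        for name, column in self.buffers.items():
            column[slot] = expert_tensors[name]


# A BF16 method declares two tensors; an FP8 method adds its scales.
bf16_store = NamedSlotStore(2, ["w1", "w2"])
fp8_store = NamedSlotStore(2, ["w1", "w2", "w1_scale", "w2_scale"])

bf16_store.load(0, {"w1": "a", "w2": "b"})
fp8_store.load(1, {"w1": "a8", "w2": "b8", "w1_scale": 0.5, "w2_scale": 0.25})
```

Supporting another format in this sketch only means passing a different name list; the store code itself does not change.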

EP compatibility

The provider composes expert_map (global->local) with cache mapping (local->slot) into a single GPU mapping tensor. The kernel's existing -1 skip logic works unchanged.
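
The composition can be illustrated with plain lists. This is a sketch under assumptions: the helper names are hypothetical, lists stand in for GPU tensors, and an uncached local expert is shown as -1 purely for illustration (after a real prepare() call the requested local experts would be resident).

```python
def compose_mapping(expert_map, local_to_slot):
    """Fold global->local and local->slot into one global->slot table.

    expert_map[g] is the local id of global expert g, or -1 if another
    rank owns it. local_to_slot[l] is the cache slot of local expert l,
    or -1 if it is not resident.
    """
    return [
        -1 if local == -1 else local_to_slot[local]
        for local in expert_map
    ]


def remap(topk_ids, mapping):
    """Replace each global expert id with its slot index (or -1)."""
    return [[mapping[g] for g in row] for row in topk_ids]


expert_map = [-1, -1, 0, 1, 2, 3]  # this rank owns global experts 2..5
local_to_slot = [1, -1, 0, -1]     # local experts 0 and 2 are cached
mapping = compose_mapping(expert_map, local_to_slot)
slot_ids = remap([[2, 4], [0, 2]], mapping)
```

Global expert 0 belongs to another rank, so it stays -1 after remapping and the kernel's existing skip logic handles it without a separate code path.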

Batched prefill

prepare() accepts batch topk_ids, deduplicates to unique expert IDs, loads each once. At 3K context with top_k=4 and 32 experts: 32 loads vs 12K sequential — 375x reduction. Without this, prefill is the #1 bottleneck.
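
The arithmetic behind the 375x figure can be checked with a short script. The routing pattern below is synthetic and chosen only so that every expert is used; real router outputs will differ.

```python
def count_loads(topk_ids):
    """Return (loads without dedup, loads with dedup) for one batch."""
    flat = [expert for row in topk_ids for expert in row]
    return len(flat), len(set(flat))


seq_len, top_k, num_experts = 3000, 4, 32

# Synthetic routing: token t picks experts t, t+8, t+16, t+24 (mod 32).
topk_ids = [
    [(t + 8 * k) % num_experts for k in range(top_k)]
    for t in range(seq_len)
]

naive, deduped = count_loads(topk_ids)
reduction = naive // deduped
```

Here `naive` is seq_len x top_k and `deduped` is bounded by num_experts, which is where the O(num_experts) versus O(seq_len x top_k) comparison comes from.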

CUDA graph compatibility (PR 2)

GPU buffers are fixed-address. A persistent int32 mapping tensor [num_experts] -> slot_index is updated by prepare() on CPU before graph replay. Inside the graph, slot lookup is pure indexing (slot_ids = mapping[topk_ids]) — no Python, no control flow. Same pattern as KV cache block tables.
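
The key property is that the mapping buffer is rewritten in place, so anything that captured it earlier sees the new contents. The sketch below models that with a Python list and a closure; it is an analogy only, with hypothetical names, and does not use CUDA graphs or torch.

```python
NUM_EXPERTS = 8

# Stands in for the persistent int32 tensor [num_experts] -> slot_index.
mapping = [-1] * NUM_EXPERTS


def make_replay(mapping_buffer):
    """Capture the buffer once; the closure models a recorded graph."""
    def replay(topk_ids):
        # Pure indexing, no branching on cache state.
        return [mapping_buffer[e] for e in topk_ids]
    return replay


def update_mapping(resident):
    """Host-side step before each replay: rewrite the buffer in place."""
    for expert in range(NUM_EXPERTS):
        mapping[expert] = resident.get(expert, -1)


replay = make_replay(mapping)

update_mapping({3: 0, 5: 1})
first = replay([3, 5])

update_mapping({5: 0, 6: 1})   # expert 3 evicted, expert 6 loaded
second = replay([5, 6])
stale = replay([3])
```

Rebinding `mapping` to a new list would break the captured reference, which mirrors why the RFC insists on buffers allocated once at a fixed address.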

What prepare() does NOT do

  • No CPU fallback computation. If more experts are needed than cache capacity, prepare() raises a clear error with guidance to increase --moe-expert-cache-size. Hand-rolled CPU MoE with Python loops is a correctness trap.
  • No silent config downgrade. If the user requests offloading and it's incompatible, that's an error, not a silent no-op.
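
The fail-loudly behavior can be sketched as a capacity check that runs before any eviction. The exception class, function name, and message wording below are hypothetical; only the --moe-expert-cache-size flag comes from the RFC.

```python
class ExpertCacheCapacityError(RuntimeError):
    """Hypothetical error type for a batch that cannot fit in the cache."""


def check_capacity(topk_ids, capacity):
    """Return the unique expert ids, or raise if they exceed the cache."""
    unique = sorted({expert for row in topk_ids for expert in row})
    if len(unique) > capacity:
        raise ExpertCacheCapacityError(
            f"batch needs {len(unique)} experts but the cache holds "
            f"{capacity}; increase --moe-expert-cache-size"
        )
    return unique


fits = check_capacity([[0, 1], [1, 2]], capacity=4)

try:
    check_capacity([[0, 1], [2, 3]], capacity=2)
    message = ""
except ExpertCacheCapacityError as exc:
    message = str(exc)
```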

Phased PR plan

| PR | Scope | LOC | Status |
| --- | --- | --- | --- |
| PR 1 [#37190] | ExpertWeightProvider ABC + CachedWeightProvider with LRU eviction, integrated into apply(). Sync H2D. BF16 + FP8. --enforce-eager required. | ~600 Python | Open, CI passing |
| PR 2 | Async H2D via CUDA stream + cross-layer temporal prediction + GPU mapping tensor for torch.compile compat | ~400 Python | After PR 1 merge |
| PR 3 | Disk tier (mmap), additional quant formats, EPLB integration, telemetry | ~400 Python | After PR 1 merge |

PR 1 scope (what ships)

  • ExpertWeightProvider ABC + FullGPUProvider (zero-cost passthrough)
  • CachedWeightProvider with LRU eviction via collections.OrderedDict (no external deps)
  • Integration into FusedMoEModularMethod.apply() — the bypass path (_forward_with_expert_cache) is deleted
  • Synchronous H2D copies (no streams, no prefetch)
  • --enforce-eager required (documented)
  • BF16 + FP8 per-tensor scale support
  • Tests: unit tests for cache logic + integration test through the full runner path
  • Zero overhead on default path (no cache): one if provider is not None check per layer

PR 1 limitations (stated explicitly)

  • Not for latency-sensitive production serving. Synchronous H2D + no CUDA graphs = batch inference and evaluation only.
  • Single-GPU only. TP stall behavior from cache misses is not characterized. Use on multi-GPU TP setups is unsupported until timing analysis is done.
  • LRU only. Other eviction policies (LFU, SLRU, ARC) ship in follow-ups if workload data justifies them.

Default-path overhead

When moe_expert_cache_size == 0 (99%+ of users): one if check per layer per forward pass. No allocations, no imports, no code paths touched. FullGPUProvider is a passthrough that returns the existing weight tensors unchanged.
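
A rough picture of the single check on the default path, using stand-in objects. The `apply` function, `fake_kernel`, `FakeProvider`, and the `w1`/`w2` attribute names are placeholders for illustration; only the `expert_weight_provider` attribute name and the `prepare(topk_ids)` call come from the RFC and PR description.

```python
from types import SimpleNamespace


def apply(layer, topk_ids, kernel):
    """Dispatch to the kernel, consulting the provider only if one exists."""
    provider = getattr(layer, "expert_weight_provider", None)
    if provider is None:
        # Default path: weights are already on the GPU, ids are untouched.
        return kernel(layer.w1, layer.w2, topk_ids)
    result = provider.prepare(topk_ids)
    return kernel(result.w1, result.w2, result.topk_ids)


def fake_kernel(w1, w2, ids):
    """Placeholder kernel that just echoes what it was given."""
    return (w1, w2, list(ids))


class FakeProvider:
    """Placeholder provider that remaps ids into a two-slot cache."""

    def prepare(self, topk_ids):
        return SimpleNamespace(
            w1="slots_w1", w2="slots_w2",
            topk_ids=[i % 2 for i in topk_ids],
        )


plain = SimpleNamespace(w1="full_w1", w2="full_w2")
cached = SimpleNamespace(w1=None, w2=None,
                         expert_weight_provider=FakeProvider())

default_out = apply(plain, [4, 7], fake_kernel)
cached_out = apply(cached, [4, 7], fake_kernel)
```

The kernel call is identical on both branches, which is the sense in which the cache acts as a weight provider rather than a separate forward path.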

Open questions

  1. Integration depth in PR 1: Should ExpertWeightProvider.prepare() be called inside FusedMoEModularMethod.apply(), or should it be lifted to the model runner level (outside compile boundary) from the start? The former is simpler for PR 1; the latter is required for PR 2. Preference?

  2. Memory accounting: GPU slot buffers are capacity * expert_size (e.g., 32 slots x 67MB = ~2.1 GB for DeepSeek-V3 BF16). Should these be allocated during vLLM's memory profiling phase so they're visible to the memory profiler, or is a separate accounting path acceptable?

  3. Observability timeline: Production deployers want per-layer hit/miss metrics exposed to Prometheus from day one. Should basic counters ship in PR 1 (adds ~30 LOC) or is PR 3 acceptable?

  4. Policy ABC surface: For future pluggable eviction (ARC, LIRS), the policy interface needs to receive hit/miss/recency signals, not just eviction requests. Should PR 1's LRU implementation use a policy ABC that's ARC-ready, or is a simple OrderedDict wrapper sufficient?
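
For question 4, one possible shape of a policy interface that receives hit and insert signals is sketched below. Everything here is hypothetical (class names, method names, and the `simulate` driver); it is meant to make the trade-off concrete, not to propose the final API.

```python
from abc import ABC, abstractmethod
from collections import Counter, OrderedDict


class EvictionPolicy(ABC):
    """Hypothetical policy surface: sees hits and inserts, picks victims."""

    @abstractmethod
    def on_hit(self, expert_id): ...

    @abstractmethod
    def on_insert(self, expert_id): ...

    @abstractmethod
    def evict(self): ...


class LruPolicy(EvictionPolicy):
    def __init__(self):
        self._order = OrderedDict()          # oldest entry first

    def on_hit(self, expert_id):
        self._order.move_to_end(expert_id)

    def on_insert(self, expert_id):
        self._order[expert_id] = None

    def evict(self):
        expert_id, _ = self._order.popitem(last=False)
        return expert_id


class LfuPolicy(EvictionPolicy):
    def __init__(self):
        self._count = Counter()

    def on_hit(self, expert_id):
        self._count[expert_id] += 1

    def on_insert(self, expert_id):
        self._count[expert_id] = 1

    def evict(self):
        expert_id = min(self._count, key=self._count.get)
        del self._count[expert_id]
        return expert_id


def simulate(policy, capacity, accesses):
    """Drive a policy with an access trace and return the evicted ids."""
    resident, evicted = set(), []
    for expert in accesses:
        if expert in resident:
            policy.on_hit(expert)
            continue
        if len(resident) == capacity:
            victim = policy.evict()
            resident.remove(victim)
            evicted.append(victim)
        policy.on_insert(expert)
        resident.add(expert)
    return evicted


trace = [0, 0, 0, 1, 2]
lru_evicted = simulate(LruPolicy(), 2, trace)
lfu_evicted = simulate(LfuPolicy(), 2, trace)
```

A bare OrderedDict wrapper would cover `LruPolicy` alone; the hit signal only matters once a frequency-aware policy such as the `LfuPolicy` stand-in is plugged in.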

CC

@mgoin @zou3519 @pavanimajety

References

  • RFC #33869 — prior MoE offload RFC
  • PR #37190 — PR 1 implementation (open)
  • PR #34535 — selective CPU offload (merged, static)
  • PR #29941 — async prefetch handler (merged, pattern for PR 2)
  • tinyserve — independent validation, 325 tests
  • FATE — cross-layer temporal prediction (83% cosine similarity, 97-99% prediction accuracy)

Architecture validated in tinyserve. AI-assisted drafting (Claude Code).

extent analysis

Fix Plan

To implement the dynamic MoE expert weight offloading, follow these steps:

  • Implement the ExpertWeightProvider ABC with FullGPUProvider and CachedWeightProvider classes.
  • Integrate CachedWeightProvider into FusedMoEModularMethod.apply() using the prepare() method.
  • Implement LRU eviction using collections.OrderedDict.
  • Add support for BF16 and FP8 per-tensor scale.
  • Implement synchronous H2D copies.

Example code for ExpertWeightProvider ABC:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

import torch

# Defined first so the return annotation on prepare() resolves at import time.
@dataclass
class ExpertWeightResult:
    w1: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    w2: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    topk_ids: torch.Tensor # always remapped to slot indices
    w1_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None

class ExpertWeightProvider(ABC):
    @abstractmethod
    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        """Ensure requested experts are GPU-resident. Returns GPU tensors."""
        ...

Example code for CachedWeightProvider:

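import collections

class CachedWeightProvider(ExpertWeightProvider):
    def __init__(self, capacity: int):
        self.capacity = capacity
        # Assumed layout: expert_id -> slot index, ordered from least to most recently used.
        self.cache = collections.OrderedDict()

    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        # Outline only: LRU eviction, H2D copies, and topk_ids remapping go here,
        # ending with an ExpertWeightResult built from the fixed GPU slot buffers.
        raise NotImplementedError("cache management is elided in this outline")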
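
To make the elided cache management concrete, here is a torch-free sketch of the LRU bookkeeping using collections.OrderedDict. It is an illustration under assumptions, not the PR's implementation: `LruExpertCache` is a hypothetical name, a dict stands in for the pinned CPU backing store, and a list stands in for the fixed GPU slot buffers.

```python
from collections import OrderedDict


class LruExpertCache:
    """Toy LRU expert cache; lists and dicts stand in for tensors."""

    def __init__(self, capacity, cpu_store):
        self.capacity = capacity
        self.cpu_store = cpu_store      # expert_id -> weights (backing store)
        self.slots = [None] * capacity  # stands in for fixed GPU buffers
        self.slot_of = OrderedDict()    # expert_id -> slot, oldest first
        self.copies = 0                 # simulated H2D copies

    def prepare(self, topk_ids):
        """Make every requested expert resident; return remapped ids."""
        unique = list(dict.fromkeys(e for row in topk_ids for e in row))
        if len(unique) > self.capacity:
            raise RuntimeError(
                "batch needs more experts than the cache holds; "
                "increase --moe-expert-cache-size"
            )
        # Refresh hits first so this batch's experts are never evicted.
        for expert in unique:
            if expert in self.slot_of:
                self.slot_of.move_to_end(expert)
        for expert in unique:
            if expert in self.slot_of:
                continue
            if len(self.slot_of) < self.capacity:
                slot = len(self.slot_of)                  # unused slot
            else:
                _, slot = self.slot_of.popitem(last=False)  # evict oldest
            self.slots[slot] = self.cpu_store[expert]     # "H2D copy"
            self.slot_of[expert] = slot
            self.copies += 1
        return [[self.slot_of[e] for e in row] for row in topk_ids]


cache = LruExpertCache(2, {0: "w0", 1: "w1", 2: "w2"})
first = cache.prepare([[0, 1]])
second = cache.prepare([[1, 2]])
```

On the second call expert 1 is a hit and expert 2 replaces expert 0 in slot 0, so only one extra copy is made.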

Verification

To verify the fix, run the following tests:

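  • Unit tests for cache logic: pytest tests/kernels/moe/test_expert_lru_cache.py -v
  • Integration tests through the full runner path: pytest tests/basic_correctness/test_moe_expert_cache.py -v -s
  • Call prepare() with varied topk_ids (repeated experts, more unique experts than cache slots) and check that the returned IDs are slot indices and that over-capacity requests raise an error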

Extra Tips

  • Keep prepare() outside any torch.compile boundary and allocate GPU buffers once at init, so the kernel path stays compatible with torch.compile and CUDA graph capture.
  • Implement async H2D copies and cross-layer temporal prediction in follow-up PRs.
  • Add support for additional quant formats and EPLB integration in follow-up PRs.
  • Use a policy ABC surface for future pluggable eviction policies.
