vllm - ✅(Solved) Fix [RFC]: Incremental MoE Expert Offloading — GPU Cache + Async Pipeline [1 pull requests, 3 comments, 2 participants]

e1n00r · 2026-03-26T16:17:31Z



GitHub stats

vllm-project/vllm#38256 • Fetched 2026-04-08 01:36:58

Author: e1n00r

Participants: e1n00r, RemizovDenis

Timeline (top): subscribed ×8, referenced ×6, mentioned ×5, commented ×3

Dynamic MoE expert weight offloading for vLLM. Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest experts; LRU eviction and cross-layer prediction minimize cache misses. Models that exceed GPU VRAM can run on smaller hardware.

PR 1 is open: #37190 (~600 LOC Python, passing CI).

This RFC covers the full 3-PR architecture and provides production data from tinyserve, an independent implementation of the same techniques (30 tok/s decode on 8 GB GPU, 325 tests).

Error Message

No CPU fallback computation. If more experts are needed than cache capacity, prepare() raises a clear error with guidance to increase --moe-expert-cache-size. Hand-rolled CPU MoE with Python loops is a correctness trap.
No silent config downgrade. If the user requests offloading and it's incompatible, that's an error, not a silent no-op.

Fix Action

Fix / Workaround

At FusedMoEModularMethod.apply() — replace direct layer.w13_weight access with provider.prepare(topk_ids). No bypass of the runner. All paths go through runner.forward() -> quant_method.apply(), preserving EP dispatch, DP chunking, and shared-expert overlap.

PR fix notes

PR #37190: [MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)

Repository: vllm-project/vllm
Author: e1n00r
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37190

Description (problem / solution / changelog)

Purpose

CachedWeightProvider — MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.

Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.

No runner bypass — all paths go through quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.

References: RFC #38256 | tinyserve (production validation, 481 tests)

Test results

Community validation (independent):

| Model | VRAM | tok/s | Tester |
|---|---|---|---|
| Nemotron-Cascade-2-30B-A3B (cache=8) | 7.6 GB | 15.6 | @caiovicentino |
| Gemma-4-26B-A4B-it (cache=8) | 8.6 GB | 14.8 | @caiovicentino |

LFRU vs LRU (Nemotron, cache=8): LFRU cache=8 exceeds LRU cache=16 in hit rate. +5.2% speed improvement.

Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).

Changes

15 files, ~810 additions

| File | What |
|---|---|
| `expert_weight_provider.py` (new) | `CachedWeightProvider` with LFRU eviction, `ExpertWeightResult` dataclass |
| `fused_moe_method_base.py` | `supports_expert_lru_cache` property (default False) |
| `fused_moe_modular_method.py` | Provider check in `apply()` |
| `layer.py` | `_maybe_init_expert_lru_cache()`, `expert_weight_provider` attribute |
| `unquantized_fused_moe_method.py` | CPU weight allocation, cache init, kernel init for cache path, XPU transpose |
| `quantization/fp8.py` | `supports_expert_lru_cache`, provider check, cache init |
| `offload.py` | `moe_expert_cache_size` config field |
| `vllm.py` | Cross-validator: enforce_eager required |
| `arg_utils.py` | CLI argument `--moe-expert-cache-size` |
| `llm.py` | `moe_expert_cache_size` parameter in `LLM.__init__` |
| `basic_correctness.yaml` | CI test area registration |
| `docs/features/moe_cache_policies.md` (new) | Feature documentation |
| `test_expert_lru_cache.py` (new) | 26 unit tests with parametrization |
| `test_moe_expert_cache.py` (new) | Integration test via `compare_two_settings` |
| `benchmarks/qwen_122b_test_20260331.txt` (new) | Benchmark raw data |

How it works

moe_expert_cache_size == 0 (default):
  No provider created. Zero overhead (one getattr per layer).

moe_expert_cache_size > 0:
  CachedWeightProvider.prepare(topk_ids):
    for each unique expert:
      hit  → update LFRU frequency + recency (O(1))
      miss → evict lowest freq/age score, H2D copy, update mapping
    remap topk_ids → slot indices via persistent GPU mapping tensor
  → kernel receives GPU buffer + remapped IDs
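The hit/miss flow above can be sketched in plain Python. This is a minimal illustration, not the PR's `CachedWeightProvider`: the class name `LFRUCacheSketch`, the `load_fn` callback standing in for the H2D copy, and the concrete score `freq / age` are assumptions. The source says only that a miss evicts the entry with the lowest freq/age score.

```python
from typing import Callable, Dict, List


class LFRUCacheSketch:
    """Toy frequency-weighted LRU cache: expert ID -> GPU slot index."""

    def __init__(self, capacity: int, load_fn: Callable[[int, int], None]):
        self.capacity = capacity
        self.load_fn = load_fn               # stand-in for the H2D copy: (expert_id, slot)
        self.slot_of: Dict[int, int] = {}    # resident experts and their slots
        self.freq: Dict[int, int] = {}       # hit counts
        self.last_used: Dict[int, int] = {}  # tick of the most recent use
        self.clock = 0

    def _score(self, expert_id: int) -> float:
        # Assumed scoring rule: frequency divided by age in ticks.
        age = self.clock - self.last_used[expert_id] + 1
        return self.freq[expert_id] / age

    def prepare(self, topk_ids: List[int]) -> List[int]:
        needed = set(topk_ids)
        if len(needed) > self.capacity:
            raise RuntimeError("more distinct experts requested than cache slots")
        self.clock += 1
        for expert_id in sorted(needed):
            if expert_id in self.slot_of:
                self.freq[expert_id] += 1              # hit
            else:
                if len(self.slot_of) < self.capacity:
                    slot = len(self.slot_of)           # a free slot is still available
                else:
                    # Miss with a full cache: evict the lowest-scoring expert
                    # that the current call does not need.
                    victim = min(
                        (e for e in self.slot_of if e not in needed),
                        key=self._score,
                    )
                    slot = self.slot_of.pop(victim)
                    del self.freq[victim], self.last_used[victim]
                self.load_fn(expert_id, slot)
                self.slot_of[expert_id] = slot
                self.freq[expert_id] = 1
            self.last_used[expert_id] = self.clock
        # Remap expert IDs to slot indices.
        return [self.slot_of[e] for e in topk_ids]
```

With two slots, an expert that has been hit four times survives a miss even when the other resident expert was touched more recently, which is the behavioral difference from pure LRU that the PR description refers to.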

Limitations

--enforce-eager required (CUDA graph compat deferred to PR 2)
Synchronous H2D copies (async pipeline in PR 2)
Single eviction policy (LFRU hardcoded, no pluggable framework)
EP > 1 not supported
BF16 + FP8 per-tensor only
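The first and fourth limitations are the kind of constraint a config cross-validator can enforce up front. The sketch below is hypothetical: `OffloadConfigSketch`, `validate`, the `expert_parallel_size` field, and the error wording are illustrative and not taken from `vllm/config/vllm.py`, whose actual code is not shown here. Only the `moe_expert_cache_size` and `enforce_eager` names come from the PR description.

```python
from dataclasses import dataclass


@dataclass
class OffloadConfigSketch:
    moe_expert_cache_size: int = 0   # 0 disables the expert cache (default)
    enforce_eager: bool = False
    expert_parallel_size: int = 1    # hypothetical field name for the EP degree


def validate(cfg: OffloadConfigSketch) -> None:
    """Reject incompatible settings loudly instead of silently disabling the cache."""
    if cfg.moe_expert_cache_size < 0:
        raise ValueError("moe_expert_cache_size must be >= 0")
    if cfg.moe_expert_cache_size == 0:
        return  # default path: nothing to check
    if not cfg.enforce_eager:
        raise ValueError("--moe-expert-cache-size requires --enforce-eager")
    if cfg.expert_parallel_size > 1:
        raise ValueError("--moe-expert-cache-size does not support EP > 1")
```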

Test plan

pytest tests/kernels/moe/test_expert_lru_cache.py -v
pytest tests/basic_correctness/test_moe_expert_cache.py -v -s

AI-assisted development (Claude Code). Architecture validated in tinyserve.

Changed files

.buildkite/test_areas/basic_correctness.yaml (modified, +2/-0)
benchmarks/qwen_122b_test_20260331.txt (added, +28/-0)
docs/features/moe_cache_policies.md (added, +102/-0)
tests/basic_correctness/test_moe_expert_cache.py (added, +39/-0)
tests/kernels/moe/test_expert_lru_cache.py (added, +280/-0)
vllm/config/offload.py (modified, +3/-0)
vllm/config/vllm.py (modified, +15/-0)
vllm/engine/arg_utils.py (modified, +5/-0)
vllm/entrypoints/llm.py (modified, +7/-0)
vllm/model_executor/layers/fused_moe/expert_weight_provider.py (added, +216/-0)
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +10/-0)
vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py (modified, +17/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +111/-3)
vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +94/-31)
vllm/model_executor/layers/quantization/fp8.py (modified, +40/-0)

Motivation

Large MoE models (DeepSeek-V3 671B, Qwen3.5-122B) don't fit in a single GPU. Only a small subset of experts activates per token (e.g., 8 of 256), so most expert weights sit idle. Moving cold experts to CPU and caching hot ones on GPU lets these models run on hardware that would otherwise OOM.

Prior art in vLLM

| PR | What it does | Limitation this RFC addresses |
|---|---|---|
| #34535 (merged) | Static CPU weight offload | No runtime migration: offloaded weights stay on CPU permanently |
| #29941 (merged) | Async H2D prefetch for non-MoE weights | Pattern reused for expert prefetch in PR 2 |
| RFC #33869 / #31938 | Monolithic MoE offload (cache + CPU kernels + DBO + prefetch) | Closed: too large to review/pass CI. This RFC takes the opposite approach |

Key design principle

The cache is a weight provider, not a special forward path. The kernel does not know or care where weights came from. No bypass of the runner pipeline.

Production data (tinyserve)

RTX PRO 2000 8 GB, GPT-OSS-20B MXFP4, 238 cache slots, single-stream decode:

| Metric | Result |
|---|---|
| Decode throughput | 30 tok/s (stable across context lengths) |
| vs HF `device_map="auto"` | 160x faster |
| Cache hit rate (temporal prediction) | 97-100% |
| Expert loads per layer (batched prefill) | O(num_experts) vs O(seq_len x top_k) |

Caveat: These numbers are single-stream on a laptop GPU. Multi-user batched inference on H100 will have different bottlenecks (higher H2D bandwidth, but batch diversity may reduce hit rates). In-tree benchmarks will accompany each PR.

Architecture

ExpertWeightProvider (ABC)
├── FullGPUProvider        -- zero-cost passthrough (default, no overhead)
└── CachedWeightProvider   -- GPU LRU cache + CPU backing store
      ├── GPUSlotManager   -- fixed-address GPU buffers
      ├── LRUEviction      -- pure-Python eviction (collections.OrderedDict)
      └── CPUBackingStore   -- pinned DRAM, all local experts

Integration point

`prepare()` contract

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

import torch

# Defined before the ABC so the return annotation below resolves at class-definition time.
@dataclass
class ExpertWeightResult:
    w1: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    w2: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    topk_ids: torch.Tensor # always remapped to slot indices
    w1_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None

class ExpertWeightProvider(ABC):
    @abstractmethod
    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        """Ensure requested experts are GPU-resident. Returns GPU tensors."""
        ...

Design choices driven by torch.compile compatibility (per @zou3519's review of #29941):

topk_ids is always remapped — no boolean flag that changes tensor interpretation
Scales are fixed attributes, not a dynamic dict — avoids graph breaks
GPU buffers allocated once at init (fixed addresses for CUDA graph capture)
prepare() is inherently dynamic Python (eviction, H2D copies) and must run outside any torch.compile boundary

Quant-agnostic tensor registration

Each quant method declares what tensors to cache. The cache stores them as opaque blobs by name — adding a new quant format requires zero cache code changes.
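A minimal sketch of that idea, assuming a plain dictionary as the registration mechanism. The names `NamedTensorCacheSketch`, `bf16_tensors`, and `fp8_tensors` are hypothetical, the tensor names other than `w13_weight` are illustrative, and `bytes` objects stand in for real tensors so the example runs without a GPU.

```python
from typing import Dict, List, Optional


class NamedTensorCacheSketch:
    """Copies whatever named per-expert blobs it is given; knows nothing about quant formats."""

    def __init__(self, capacity: int, cpu_tensors: Dict[str, List[bytes]]):
        # cpu_tensors[name][expert_id] is that expert's blob for tensor `name`.
        self.cpu = cpu_tensors
        self.gpu: Dict[str, List[Optional[bytes]]] = {
            name: [None] * capacity for name in cpu_tensors
        }

    def load(self, expert_id: int, slot: int) -> None:
        # One copy per registered name; the loop is the same for every format.
        for name, per_expert in self.cpu.items():
            self.gpu[name][slot] = per_expert[expert_id]


def bf16_tensors(num_experts: int) -> Dict[str, List[bytes]]:
    """What an unquantized method might declare (illustrative names)."""
    return {
        "w13_weight": [b"w13-%d" % e for e in range(num_experts)],
        "w2_weight": [b"w2-%d" % e for e in range(num_experts)],
    }


def fp8_tensors(num_experts: int) -> Dict[str, List[bytes]]:
    """An FP8 method adds scales; the cache code above does not change."""
    tensors = bf16_tensors(num_experts)
    tensors["w13_weight_scale"] = [b"s13-%d" % e for e in range(num_experts)]
    tensors["w2_weight_scale"] = [b"s2-%d" % e for e in range(num_experts)]
    return tensors
```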

EP compatibility

The provider composes expert_map (global->local) with cache mapping (local->slot) into a single GPU mapping tensor. The kernel's existing -1 skip logic works unchanged.
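The composition can be illustrated with Python lists standing in for the GPU tensors. The function name `compose_mapping` and the argument layout are assumptions; the source specifies only that the two mappings are merged into one and that -1 entries are skipped by the kernel.

```python
from typing import Dict, List


def compose_mapping(expert_map: List[int], slot_of_local: Dict[int, int]) -> List[int]:
    """Merge global->local and local->slot into a single global->slot table.

    expert_map[g] is the local expert ID on this rank, or -1 if the rank does
    not own global expert g. slot_of_local maps resident local IDs to slots.
    Entries that are not owned or not resident become -1.
    """
    return [
        slot_of_local.get(local_id, -1) if local_id >= 0 else -1
        for local_id in expert_map
    ]


def remap(topk_ids: List[int], mapping: List[int]) -> List[int]:
    # Pure indexing; -1 results rely on the kernel's existing skip logic.
    return [mapping[g] for g in topk_ids]
```

For example, a rank that owns global experts 2 to 5 and currently caches local experts 0 and 3 in slots 1 and 0 would produce the table `[-1, -1, 1, -1, -1, 0]`.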

Batched prefill

prepare() accepts batch topk_ids, deduplicates to unique expert IDs, loads each once. At 3K context with top_k=4 and 32 experts: 32 loads vs 12K sequential — 375x reduction. Without this, prefill is the #1 bottleneck.
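The arithmetic behind the 375x figure can be checked with a short sketch. It assumes "3K context" means 3000 tokens and uses a synthetic routing pattern; `unique_experts` is an illustrative helper, not a function from the PR.

```python
from typing import List


def unique_experts(topk_ids: List[List[int]]) -> List[int]:
    """Deduplicate a [seq_len, top_k] batch of expert IDs."""
    return sorted({expert for row in topk_ids for expert in row})


seq_len, top_k, num_experts = 3000, 4, 32

# Synthetic routing: every token picks top_k consecutive experts.
topk_ids = [[(t + k) % num_experts for k in range(top_k)] for t in range(seq_len)]

per_token_loads = seq_len * top_k              # one load per (token, expert) pair
batched_loads = len(unique_experts(topk_ids))  # at most num_experts
reduction = per_token_loads / batched_loads
```

The batched count is bounded by `num_experts` regardless of sequence length, which is why the benefit grows with context size.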

CUDA graph compatibility (PR 2)

GPU buffers are fixed-address. A persistent int32 mapping tensor [num_experts] -> slot_index is updated by prepare() on CPU before graph replay. Inside the graph, slot lookup is pure indexing (slot_ids = mapping[topk_ids]) — no Python, no control flow. Same pattern as KV cache block tables.
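A CPU-only sketch of the persistent mapping table, using `array` from the standard library in place of the int32 GPU tensor. The helper names are hypothetical. The point it illustrates is that the table is updated in place (same object, analogous to a fixed address) and that the lookup is plain indexing.

```python
from array import array
from typing import Dict, List


def new_mapping(num_experts: int) -> array:
    # Allocated once; -1 means "not resident".
    return array("i", [-1] * num_experts)


def update_mapping(mapping: array, slot_of: Dict[int, int]) -> None:
    """Rewrite the table in place before replay; the object identity never changes."""
    for expert_id in range(len(mapping)):
        mapping[expert_id] = -1
    for expert_id, slot in slot_of.items():
        mapping[expert_id] = slot


def lookup(mapping: array, topk_ids: List[int]) -> List[int]:
    # Equivalent of slot_ids = mapping[topk_ids]: no branching on cache state.
    return [mapping[e] for e in topk_ids]
```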

What `prepare()` does NOT do

No CPU fallback computation. If more experts are needed than cache capacity, prepare() raises a clear error with guidance to increase --moe-expert-cache-size. Hand-rolled CPU MoE with Python loops is a correctness trap.
No silent config downgrade. If the user requests offloading and it's incompatible, that's an error, not a silent no-op.
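The first rule can be sketched as a guard at the top of `prepare()`. The exception type and message wording below are assumptions; the source states only that the error is clear and points at `--moe-expert-cache-size`.

```python
from typing import Iterable


def check_capacity(topk_ids: Iterable[int], capacity: int) -> int:
    """Return the number of distinct experts, or raise if they cannot all be resident."""
    needed = len(set(topk_ids))
    if needed > capacity:
        raise RuntimeError(
            f"This batch routes to {needed} distinct experts but the cache holds "
            f"only {capacity}. Increase --moe-expert-cache-size to at least {needed}."
        )
    return needed
```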

Phased PR plan

| PR | Scope | LOC | Status |
|---|---|---|---|
| PR 1 [#37190] | `ExpertWeightProvider` ABC + `CachedWeightProvider` with LRU eviction, integrated into `apply()`. Sync H2D. BF16 + FP8. `--enforce-eager` required. | ~600 Python | Open, CI passing |
| PR 2 | Async H2D via CUDA stream + cross-layer temporal prediction + GPU mapping tensor for torch.compile compat | ~400 Python | After PR 1 merge |
| PR 3 | Disk tier (mmap), additional quant formats, EPLB integration, telemetry | ~400 Python | After PR 1 merge |

PR 1 scope (what ships)

ExpertWeightProvider ABC + FullGPUProvider (zero-cost passthrough)
CachedWeightProvider with LRU eviction via collections.OrderedDict (no external deps)
Integration into FusedMoEModularMethod.apply() — the bypass path (_forward_with_expert_cache) is deleted
Synchronous H2D copies (no streams, no prefetch)
--enforce-eager required (documented)
BF16 + FP8 per-tensor scale support
Tests: unit tests for cache logic + integration test through the full runner path
Zero overhead on default path (no cache): one if provider is not None check per layer
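The LRU bookkeeping named in the scope list needs very little code on top of `collections.OrderedDict`. The class below is a sketch of the slot accounting only, with a hypothetical name and interface; weight copies and tensor remapping are left out.

```python
from collections import OrderedDict
from typing import Tuple


class LRUSlotsSketch:
    """Tracks which expert occupies which slot, oldest entry first."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots: "OrderedDict[int, int]" = OrderedDict()  # expert_id -> slot

    def touch(self, expert_id: int) -> Tuple[int, bool]:
        """Return (slot, was_hit); on a miss the caller would copy weights into slot."""
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)         # mark as most recently used
            return self.slots[expert_id], True
        if len(self.slots) < self.capacity:
            slot = len(self.slots)                    # next unused slot
        else:
            _, slot = self.slots.popitem(last=False)  # evict least recently used
        self.slots[expert_id] = slot
        return slot, False
```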

PR 1 limitations (stated explicitly)

Not for latency-sensitive production serving. Synchronous H2D + no CUDA graphs = batch inference and evaluation only.
Single-GPU only. TP stall behavior from cache misses is not characterized. Use on multi-GPU TP setups is unsupported until timing analysis is done.
LRU only. Other eviction policies (LFU, SLRU, ARC) ship in follow-ups if workload data justifies them.

Default-path overhead

When moe_expert_cache_size == 0 (99%+ of users): one if check per layer per forward pass. No allocations, no imports, no code paths touched. FullGPUProvider is a passthrough that returns the existing weight tensors unchanged.
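A sketch of what that single check might look like at the call site. The function `select_weights` is hypothetical, and `w2_weight` is an assumed attribute name; `expert_weight_provider` and `w13_weight` are the names used in the PR description and the RFC.

```python
from typing import Any, Tuple


def select_weights(layer: Any, topk_ids: Any) -> Tuple[Any, Any, Any]:
    """Return (w1, w2, topk_ids) from the cache if one is configured, else from the layer."""
    provider = getattr(layer, "expert_weight_provider", None)
    if provider is None:
        # Default path: existing tensors and IDs, untouched.
        return layer.w13_weight, layer.w2_weight, topk_ids
    result = provider.prepare(topk_ids)
    return result.w1, result.w2, result.topk_ids
```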

Open questions

Integration depth in PR 1: Should ExpertWeightProvider.prepare() be called inside FusedMoEModularMethod.apply(), or should it be lifted to the model runner level (outside compile boundary) from the start? The former is simpler for PR 1; the latter is required for PR 2. Preference?
Memory accounting: GPU slot buffers are capacity * expert_size (e.g., 32 slots x 67MB = ~2.1 GB for DeepSeek-V3 BF16). Should these be allocated during vLLM's memory profiling phase so they're visible to the memory profiler, or is a separate accounting path acceptable?
Observability timeline: Production deployers want per-layer hit/miss metrics exposed to Prometheus from day one. Should basic counters ship in PR 1 (adds ~30 LOC) or is PR 3 acceptable?
Policy ABC surface: For future pluggable eviction (ARC, LIRS), the policy interface needs to receive hit/miss/recency signals, not just eviction requests. Should PR 1's LRU implementation use a policy ABC that's ARC-ready, or is a simple OrderedDict wrapper sufficient?

CC

@mgoin @zou3519 @pavanimajety

References

RFC #33869 — prior MoE offload RFC
PR #37190 — PR 1 implementation (open)
PR #34535 — selective CPU offload (merged, static)
PR #29941 — async prefetch handler (merged, pattern for PR 2)
tinyserve — independent validation, 325 tests
FATE — cross-layer temporal prediction (83% cosine similarity, 97-99% prediction accuracy)

Architecture validated in tinyserve. AI-assisted drafting (Claude Code).

extent analysis

Fix Plan

To implement the dynamic MoE expert weight offloading, follow these steps:

Implement the ExpertWeightProvider ABC with FullGPUProvider and CachedWeightProvider classes.
Integrate CachedWeightProvider into FusedMoEModularMethod.apply() using the prepare() method.
Implement LRU eviction using collections.OrderedDict.
Add support for BF16 and FP8 per-tensor scale.
Implement synchronous H2D copies.

Example code for ExpertWeightProvider ABC:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

import torch

# Defined first so the return annotation on prepare() resolves.
@dataclass
class ExpertWeightResult:
    w1: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    w2: torch.Tensor       # [capacity, ...] GPU-resident, fixed address
    topk_ids: torch.Tensor # always remapped to slot indices
    w1_scale: Optional[torch.Tensor] = None
    w2_scale: Optional[torch.Tensor] = None

class ExpertWeightProvider(ABC):
    @abstractmethod
    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        """Ensure requested experts are GPU-resident. Returns GPU tensors."""
        ...

Example code for CachedWeightProvider:

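import collections

class CachedWeightProvider(ExpertWeightProvider):
    def __init__(self, capacity: int):
        self.capacity = capacity  # number of GPU cache slots
        # expert_id -> slot index, ordered from least to most recently used
        self.cache = collections.OrderedDict()

    def prepare(self, topk_ids: torch.Tensor) -> ExpertWeightResult:
        # Implement LRU eviction and cache management here. The elided logic
        # must produce w1, w2, the remapped topk_ids, and the optional scales.
        ...
        return ExpertWeightResult(w1, w2, topk_ids, w1_scale, w2_scale)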

Verification

To verify the fix, run the following tests:

Unit tests for cache logic
Integration tests through the full runner path
Test the prepare() method with different inputs and verify the output

Extra Tips

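Keep prepare() outside any torch.compile boundary: it is inherently dynamic Python (eviction, H2D copies), and the fixed-address buffers and always-remapped topk_ids exist to keep the rest of the path compile-friendly.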
Implement async H2D copies and cross-layer temporal prediction in follow-up PRs.
Add support for additional quant formats and EPLB integration in follow-up PRs.
Use a policy ABC surface for future pluggable eviction policies.
