vllm - ✅(Solved) Fix [Bug]: NVFP4 MoE produces garbage output on SM120 (RTX 5080) with CPU Weight Offloading — Nemotron-Cascade-2-30B-A3B [2 pull requests, 2 comments, 2 participants]

lucaspirola · 2026-04-01T12:13:41Z



GitHub stats

vllm-project/vllm#38718 • Fetched 2026-04-08 02:23:14


Author

lucaspirola

Participants

caiovicentino

lucaspirola

Timeline (top)

commented ×2, cross-referenced ×2, renamed ×1, subscribed ×1

Root Cause

Suspected root cause: CPU offload + MoE interaction

Fix Action

Fix / Workaround

FlashInfer auto-detection bug: flashinfer/compilation_context.py auto-appends "a" suffix for SM major >= 9, producing compute_120a instead of compute_120f. The compute_120a arch does not support TMA WS grouped GEMM tactics needed for NVFP4 MoE, causing "illegal instruction" crashes. Workaround: FLASHINFER_CUDA_ARCH_LIST=12.0f.

PR fix notes

PR #37190: [MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)

Repository: vllm-project/vllm
Author: e1n00r
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37190

Description (problem / solution / changelog)

Purpose

CachedWeightProvider — MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.

Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.

No runner bypass — all paths go through quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.

References: RFC #38256 | tinyserve (production validation, 481 tests)

Test results

Community validation (independent):

| Model | VRAM | tok/s | Tester |
|---|---|---|---|
| Nemotron-Cascade-2-30B-A3B (cache=8) | 7.6 GB | 15.6 | @caiovicentino |
| Gemma-4-26B-A4B-it (cache=8) | 8.6 GB | 14.8 | @caiovicentino |

LFRU vs LRU (Nemotron): LFRU with cache=8 achieves a higher hit rate than LRU with cache=16, with a reported +5.2% speed improvement.

Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).

Changes

15 files, ~810 additions

| File | What |
|---|---|
| `expert_weight_provider.py` (new) | `CachedWeightProvider` with LFRU eviction, `ExpertWeightResult` dataclass |
| `fused_moe_method_base.py` | `supports_expert_lru_cache` property (default False) |
| `fused_moe_modular_method.py` | Provider check in `apply()` |
| `layer.py` | `_maybe_init_expert_lru_cache()`, `expert_weight_provider` attribute |
| `unquantized_fused_moe_method.py` | CPU weight allocation, cache init, kernel init for cache path, XPU transpose |
| `quantization/fp8.py` | `supports_expert_lru_cache`, provider check, cache init |
| `offload.py` | `moe_expert_cache_size` config field |
| `vllm.py` | Cross-validator: enforce_eager required |
| `arg_utils.py` | CLI argument `--moe-expert-cache-size` |
| `llm.py` | `moe_expert_cache_size` parameter in `LLM.__init__` |
| `basic_correctness.yaml` | CI test area registration |
| `docs/features/moe_cache_policies.md` (new) | Feature documentation |
| `test_expert_lru_cache.py` (new) | 26 unit tests with parametrization |
| `test_moe_expert_cache.py` (new) | Integration test via `compare_two_settings` |
| `benchmarks/qwen_122b_test_20260331.txt` (new) | Benchmark raw data |

How it works

moe_expert_cache_size == 0 (default):
  No provider created. Zero overhead (one getattr per layer).

moe_expert_cache_size > 0:
  CachedWeightProvider.prepare(topk_ids):
    for each unique expert:
      hit  → update LFRU frequency + recency (O(1))
      miss → evict lowest freq/age score, H2D copy, update mapping
    remap topk_ids → slot indices via persistent GPU mapping tensor
  → kernel receives GPU buffer + remapped IDs
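The hit/miss flow above can be sketched in plain Python. This is a toy model, not the PR's `CachedWeightProvider`: the class name `ToyLFRUCache`, the scoring formula `freq / (1 + age)`, and the `copies` counter standing in for H2D copies are all assumptions made for illustration, since the PR description only says that the lowest freq/age score is evicted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLFRUCache:
    """Toy frequency-weighted LRU cache mapping expert IDs to slot indices."""

    capacity: int
    clock: int = 0
    slot_of: dict = field(default_factory=dict)    # expert_id -> slot index
    freq: dict = field(default_factory=dict)       # expert_id -> hit count
    last_used: dict = field(default_factory=dict)  # expert_id -> clock at last use
    copies: int = 0                                # stand-in for H2D copies

    def _score(self, expert_id):
        # Assumed scoring: frequently and recently used experts score high.
        age = self.clock - self.last_used[expert_id]
        return self.freq[expert_id] / (1 + age)

    def prepare(self, topk_ids):
        """Make every requested expert resident and return remapped slot indices."""
        self.clock += 1
        needed = set(topk_ids)
        if len(needed) > self.capacity:
            raise ValueError("more unique experts requested than cache slots")
        for expert_id in sorted(needed):
            if expert_id in self.slot_of:
                # Hit: bump the frequency counter.
                self.freq[expert_id] += 1
            else:
                # Miss: take a free slot, or evict the lowest-scoring resident
                # that is not needed by this same call.
                if len(self.slot_of) < self.capacity:
                    slot = len(self.slot_of)
                else:
                    candidates = [e for e in self.slot_of if e not in needed]
                    victim = min(candidates, key=self._score)
                    slot = self.slot_of.pop(victim)
                    del self.freq[victim], self.last_used[victim]
                self.slot_of[expert_id] = slot
                self.freq[expert_id] = 1
                self.copies += 1
            self.last_used[expert_id] = self.clock
        return [self.slot_of[e] for e in topk_ids]
```

With a capacity of 2, calling `prepare([0, 1])`, then `prepare([0, 0])`, then `prepare([2])` evicts expert 1 rather than expert 0, because expert 0 has been used more often and more recently. The real provider remaps IDs through a persistent GPU tensor; the list returned here only mimics that step.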

Limitations

--enforce-eager required (CUDA graph compat deferred to PR 2)
Synchronous H2D copies (async pipeline in PR 2)
Single eviction policy (LFRU hardcoded, no pluggable framework)
EP > 1 not supported
BF16 + FP8 per-tensor only

Test plan

pytest tests/kernels/moe/test_expert_lru_cache.py -v
pytest tests/basic_correctness/test_moe_expert_cache.py -v -s

AI-assisted development (Claude Code). Architecture validated in tinyserve.

Changed files

.buildkite/test_areas/basic_correctness.yaml (modified, +2/-0)
benchmarks/qwen_122b_test_20260331.txt (added, +28/-0)
docs/features/moe_cache_policies.md (added, +102/-0)
tests/basic_correctness/test_moe_expert_cache.py (added, +39/-0)
tests/kernels/moe/test_expert_lru_cache.py (added, +280/-0)
vllm/config/offload.py (modified, +3/-0)
vllm/config/vllm.py (modified, +15/-0)
vllm/engine/arg_utils.py (modified, +5/-0)
vllm/entrypoints/llm.py (modified, +7/-0)
vllm/model_executor/layers/fused_moe/expert_weight_provider.py (added, +216/-0)
vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +10/-0)
vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py (modified, +17/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +111/-3)
vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +94/-31)
vllm/model_executor/layers/quantization/fp8.py (modified, +40/-0)

PR #36461: [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes

Repository: vllm-project/vllm
Author: AjAnubolu
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/36461

Description (problem / solution / changelog)

Remove assertion in may_reinitialize_input_batch that blocked CPU offloading when block sizes differ from defaults. The original issue it guarded against no longer applies.

Closes #36279

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +0/-5)


Your current environment

vLLM: 0.18.2rc1.dev25+gef53395e2 (built from main, 2026-04-01, includes PRs #33417 and #29242)
FlashInfer: 0.6.6
PyTorch: 2.12.0a0+git6fcbf6d
CUDA: 13.2
GPU: NVIDIA GeForce RTX 5080 (SM 12.0, 16 GB VRAM)
Driver: 595.45.04
OS: Ubuntu 24.04, Linux 6.17.0-19-generic

Model

nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4 — NemotronH hybrid architecture (Mamba2 + MoE + Attention), NVFP4 quantized via modelopt.

52 layers: 6 attention + 23 Mamba2 SSM + 23 MoE (128 routed experts, 6 active, 1 shared)
~18.4 GB on disk → requires --cpu-offload-gb 5 to fit on 16 GB VRAM
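The memory arithmetic behind the offload requirement can be checked with a rough sketch. It assumes, for illustration only, that the on-disk size approximates the weight footprint, that `cpu_offload_gb` removes exactly that many gigabytes of weights from the GPU, and that GB and GiB can be treated alike; the function name `gpu_weight_budget` is hypothetical.

```python
def gpu_weight_budget(model_gb, vram_gb, gpu_memory_utilization, cpu_offload_gb):
    """Estimate how much of the vLLM memory budget is left after weights."""
    usable_gb = vram_gb * gpu_memory_utilization   # budget vLLM may use
    weights_on_gpu_gb = model_gb - cpu_offload_gb  # weights that stay on the GPU
    headroom_gb = usable_gb - weights_on_gpu_gb    # left for KV cache, activations
    return {
        "usable_gb": round(usable_gb, 2),
        "weights_on_gpu_gb": round(weights_on_gpu_gb, 2),
        "headroom_gb": round(headroom_gb, 2),
    }


# Figures from this report: 18.4 GB model, 16 GB VRAM, utilization 0.95.
with_offload = gpu_weight_budget(18.4, 16, 0.95, cpu_offload_gb=5)
without_offload = gpu_weight_budget(18.4, 16, 0.95, cpu_offload_gb=0)
```

Under these assumptions, offloading 5 GB leaves about 1.8 GB of headroom, while no offloading leaves the weights roughly 3.2 GB over budget, which is consistent with the reporter's statement that offloading is required on this GPU.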

Describe the bug

All NVFP4 MoE backends produce numerically incorrect (garbage) output on SM120 when CPU offloading is enabled. The model loads successfully, the engine starts, and inference runs without errors — but the output is completely wrong (immediate EOS or random tokens).

Every MoE backend tested either crashes or produces garbage:

| MoE Backend | CUDA Arch | Crash? | Output |
|---|---|---|---|
| FLASHINFER_CUTLASS | compute_120a (default) | Yes (illegal instruction) | N/A |
| FLASHINFER_CUTLASS | compute_120f (via `FLASHINFER_CUDA_ARCH_LIST=12.0f`) | No | Empty + EOS (1 token) |
| VLLM_CUTLASS | precompiled | No | ":" + EOS (2 tokens) |
| MARLIN | N/A | No | Garbage |

Example outputs:

| Prompt | Output | Expected |
|---|---|---|
| "What is the capital of France?" (chat) | "" + EOS | "Paris" |
| "The capital of France is" (completion) | "." | "Paris..." |
| "def fibonacci(n):\n" (completion) | " word\n" | Python code |

Suspected root cause: CPU offload + MoE interaction

The community has confirmed working NVFP4 MoE on SM120 without CPU offloading:

RTX PRO 6000 (96 GB): 39 tok/s native FP4 MoE after CUTLASS fix (NVIDIA/cutlass#3096)
RTX 5090 (32 GB): ~202 tok/s via llama.cpp GGUF variant

Our RTX 5080 (16 GB) requires 5 GB CPU offloading because the model is 18.4 GB. With --cpu-offload-gb 5, expert weights are split across GPU and CPU. The fused MoE kernel may not correctly handle experts that have been offloaded to CPU memory.

Additionally, both pure-attention NVFP4 models (Qwen3-4B-NVFP4) and hybrid non-MoE models (Qwen3.5-9B-NVFP4 with GDN layers) work correctly on the same GPU with the same vLLM build — even with CPU offloading. Only models with MoE layers produce garbage.

How to reproduce

#!/usr/bin/env python3
"""Minimal reproduction: NVFP4 MoE garbage on SM120 + CPU offload."""
import asyncio
import os

os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
os.environ["FLASHINFER_CUDA_ARCH_LIST"] = "12.0f"  # Required for SM120 MoE JIT

async def main():
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4",
        trust_remote_code=True,
        max_model_len=4096,
        gpu_memory_utilization=0.95,
        cpu_offload_gb=5,  # Required: 18.4 GB model > 16 GB VRAM
        enforce_eager=True,
        max_num_batched_tokens=4096,
    )

    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(max_tokens=32, temperature=0),
    )
    print(f"Output: {outputs[0].outputs[0].text!r}")
    # Expected: "Paris..." or similar
    # Actual: "." or garbage

if __name__ == "__main__":
    asyncio.run(main())

To reproduce, you need:

An SM120 GPU with < 18.4 GB VRAM (RTX 5080 16 GB), forcing CPU offload
CUDA 13.2+ (for compute_120f support)
FlashInfer 0.6.6+

Additional notes

FlashInfer auto-detection bug: flashinfer/compilation_context.py auto-appends "a" suffix for SM major >= 9, producing compute_120a instead of compute_120f. The compute_120a arch does not support TMA WS grouped GEMM tactics needed for NVFP4 MoE, causing "illegal instruction" crashes. Workaround: FLASHINFER_CUDA_ARCH_LIST=12.0f.
hf_quant_config.json FP8 KV: The model ships with "kv_cache_quant_algo": "FP8" but provides no FP8 KV scaling factors (vLLM warns "Using KV cache scaling factor 1.0"). Setting this to "none" to use BF16 KV does not affect the MoE garbage — the issue is purely in the MoE path.
CPU offload assertion: vLLM's gpu_model_runner.py had an assertion blocking CPU offload with hybrid models (InputBatch reinit). Current main has this as a warning, which is correct — the model loads fine.
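The arch-suffix behaviour described in the first note can be illustrated with a small sketch. This is not FlashInfer code: `jit_arch_flag` and `apply_workaround` are hypothetical helpers that only mirror the behaviour as the report describes it (an "a" suffix for SM major >= 9 unless an arch list is given). The only real name used is the `FLASHINFER_CUDA_ARCH_LIST` environment variable from the report.

```python
import os


def jit_arch_flag(major, minor, arch_list_override=None):
    """Hypothetical model of the described arch selection, not FlashInfer's code."""
    if arch_list_override:
        # An explicit entry such as "12.0f" becomes "compute_120f".
        version = arch_list_override.strip().split()[0]
        return "compute_" + version.replace(".", "")
    # Described default: append "a" for SM major >= 9.
    suffix = "a" if major >= 9 else ""
    return f"compute_{major}{minor}{suffix}"


def apply_workaround(env=os.environ):
    """Set the override from the report unless the user already set one."""
    env.setdefault("FLASHINFER_CUDA_ARCH_LIST", "12.0f")
```

For SM 12.0 the default path yields `compute_120a`, and the override yields `compute_120f`. The reproduction script sets the variable before importing vllm, so calling something like `apply_workaround()` at the very top of a script follows the same ordering.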

Before submitting a new issue...

I have searched for similar issues and couldn't find any duplicates.
I have verified that the issue persists on the latest vLLM main branch.
I have provided all necessary details including reproduction steps.

extent analysis

TL;DR

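Disable CPU offloading by setting cpu_offload_gb to 0 or removing the flag, as the interaction between CPU offload and MoE layers is suspected to be the root cause of the issue. On the reporter's 16 GB RTX 5080 the 18.4 GB model does not fit without offloading, so this check requires a GPU with more VRAM.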

Guidance

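Verify that the model produces correct output without CPU offloading by setting cpu_offload_gb to 0 or removing the flag.
Because the 18.4 GB model does not fit in 16 GB of VRAM, run that check on a GPU with enough VRAM to hold the entire model; correct output there would confirm that the issue is tied to CPU offloading.
If offloading cannot be avoided on the target GPU, investigate other ways to reduce memory usage. The model is already NVFP4-quantized, and PR #37190 lists its expert cache as BF16 + FP8 per-tensor only, so options are limited.
Check the vLLM and FlashInfer issue trackers for known problems with CPU offloading and MoE layers. The garbage output also occurs with the VLLM_CUTLASS and MARLIN backends, so the problem does not appear to be specific to FlashInfer.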
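The first two checks amount to an A/B comparison of greedy outputs with and without offloading. A minimal sketch of such a comparison follows; `first_divergence` and `generate_ids` are hypothetical helper names, the `LLM` arguments are copied from the reproduction script, and each configuration is assumed to run in its own process so that GPU memory is released between runs. Small numerical differences between configurations may also shift greedy tokens, so a late divergence is weaker evidence than an immediate one.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    if len(reference_ids) != len(candidate_ids):
        # One sequence is a strict prefix of the other.
        return min(len(reference_ids), len(candidate_ids))
    return None


def generate_ids(model, prompt, cpu_offload_gb):
    """Greedy-decode one prompt and return its token IDs (requires vLLM and a GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily; not needed for the comparison

    llm = LLM(
        model=model,
        trust_remote_code=True,
        max_model_len=4096,
        cpu_offload_gb=cpu_offload_gb,
        enforce_eager=True,
    )
    outputs = llm.generate([prompt], SamplingParams(max_tokens=32, temperature=0))
    return list(outputs[0].outputs[0].token_ids)
```

A divergence at index 0 or 1, as in the one- and two-token outputs in the backend table, would point at the offloaded path rather than at ordinary numerical drift.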

Example

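from vllm import LLM

# Same settings as the reproduction, with CPU offloading disabled.
# Needs a GPU with enough VRAM for the full 18.4 GB model.
llm = LLM(
    model="nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
    cpu_offload_gb=0,  # Offloading disabled; the argument can also be omitted
    enforce_eager=True,
    max_num_batched_tokens=4096,
)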

Notes

The provided reproduction steps and environment details suggest that the issue is specific to the interaction between CPU offloading and MoE layers on SM120 GPUs. Disabling CPU offloading may resolve the issue, but on a 16 GB GPU the 18.4 GB model will not fit without it, so doing so requires different hardware or a smaller model.

Recommendation

Apply workaround: where the GPU has enough VRAM for the whole model, disable CPU offloading by setting cpu_offload_gb to 0 or removing the flag. CPU offload is the most likely cause of the issue, so disabling it may resolve the problem.
