vllm - ✅(Solved) Fix [Bug]: NVFP4 MoE produces garbage output on SM120 (RTX 5080) with CPU Weight Offloading — Nemotron-Cascade-2-30B-A3B [2 pull requests, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38718 (fetched 2026-04-08 02:23:14)
Comments: 2
Participants: 2
Timeline events: 6
Reactions: 0
Timeline (top): commented ×2, cross-referenced ×2, renamed ×1, subscribed ×1

Root Cause

Suspected root cause: CPU offload + MoE interaction

Fix Action

Fix / Workaround

  1. FlashInfer auto-detection bug: flashinfer/compilation_context.py auto-appends "a" suffix for SM major >= 9, producing compute_120a instead of compute_120f. The compute_120a arch does not support TMA WS grouped GEMM tactics needed for NVFP4 MoE, causing "illegal instruction" crashes. Workaround: FLASHINFER_CUDA_ARCH_LIST=12.0f.
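The suffix behaviour described in this workaround can be modelled with a short sketch. This is not FlashInfer's actual code: the function names below are hypothetical, and the parsing of the override value is an assumption based only on the "12.0f" to compute_120f mapping reported in the issue.

```python
import re


def default_arch_name(major: int, minor: int) -> str:
    """Model the reported default: append "a" when the SM major is >= 9."""
    suffix = "a" if major >= 9 else ""
    return f"compute_{major}{minor}{suffix}"


def arch_name_from_entry(entry: str) -> str:
    """Turn an arch-list entry such as "12.0f" into "compute_120f"."""
    match = re.fullmatch(r"(\d+)\.(\d)([a-z]?)", entry.strip())
    if match is None:
        raise ValueError(f"unrecognised arch entry: {entry!r}")
    major, minor, suffix = match.groups()
    return f"compute_{major}{minor}{suffix}"


def resolve_arch(major: int, minor: int, env: dict) -> str:
    """Prefer an explicit override (assumed semantics), else the default."""
    override = env.get("FLASHINFER_CUDA_ARCH_LIST")
    if override:
        # Assumption: only the first whitespace-separated entry is considered.
        return arch_name_from_entry(override.split()[0])
    return default_arch_name(major, minor)
```

Under these assumptions, resolve_arch(12, 0, {}) gives compute_120a, the arch that crashes, while passing {"FLASHINFER_CUDA_ARCH_LIST": "12.0f"} gives compute_120f.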

PR fix notes

PR #37190: [MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)

Description (problem / solution / changelog)

Purpose

CachedWeightProvider — MoE expert CPU offloading with GPU LFRU cache, addressing RFC #38256.

Expert weights live in CPU pinned memory; a fixed-size GPU cache holds the hottest N experts per layer using LFRU (frequency-weighted LRU) eviction. LFRU prevents early layers from monopolizing the cache — a known problem with pure LRU in sequential MoE execution. Models that exceed GPU VRAM can now run on smaller hardware.

No runner bypass — all paths go through quant_method.apply(). EP dispatch, DP chunking, and shared expert overlap work unchanged.

References: RFC #38256 | tinyserve (production validation, 481 tests)

Test results

Community validation (independent):

| Model | VRAM | tok/s | Tester |
|---|---|---|---|
| Nemotron-Cascade-2-30B-A3B (cache=8) | 7.6 GB | 15.6 | @caiovicentino |
| Gemma-4-26B-A4B-it (cache=8) | 8.6 GB | 14.8 | @caiovicentino |

LFRU vs LRU (Nemotron, cache=8): LFRU cache=8 exceeds LRU cache=16 in hit rate. +5.2% speed improvement.

Unit tests: 26 test cases (parametrized across dtypes, capacities, num_experts). Tests LFRU-specific eviction behavior (frequency-weighted, not just recency).

Changes

15 files, ~810 additions

| File | What |
|---|---|
| expert_weight_provider.py (new) | CachedWeightProvider with LFRU eviction, ExpertWeightResult dataclass |
| fused_moe_method_base.py | supports_expert_lru_cache property (default False) |
| fused_moe_modular_method.py | Provider check in apply() |
| layer.py | _maybe_init_expert_lru_cache(), expert_weight_provider attribute |
| unquantized_fused_moe_method.py | CPU weight allocation, cache init, kernel init for cache path, XPU transpose |
| quantization/fp8.py | supports_expert_lru_cache, provider check, cache init |
| offload.py | moe_expert_cache_size config field |
| vllm.py | Cross-validator: enforce_eager required |
| arg_utils.py | CLI argument --moe-expert-cache-size |
| llm.py | moe_expert_cache_size parameter in LLM.__init__ |
| basic_correctness.yaml | CI test area registration |
| docs/features/moe_cache_policies.md (new) | Feature documentation |
| test_expert_lru_cache.py (new) | 26 unit tests with parametrization |
| test_moe_expert_cache.py (new) | Integration test via compare_two_settings |
| benchmarks/qwen_122b_test_20260331.txt (new) | Benchmark raw data |

How it works

moe_expert_cache_size == 0 (default):
  No provider created. Zero overhead (one getattr per layer).

moe_expert_cache_size > 0:
  CachedWeightProvider.prepare(topk_ids):
    for each unique expert:
      hit  → update LFRU frequency + recency (O(1))
      miss → evict lowest freq/age score, H2D copy, update mapping
    remap topk_ids → slot indices via persistent GPU mapping tensor
  → kernel receives GPU buffer + remapped IDs
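The flow above can be sketched in plain Python. This is a minimal illustration, not the PR's CachedWeightProvider: the class name, the score formula freq / (age + 1), and the rule that experts needed by the current batch are never evicted are all assumptions. Tensors and H2D copies are replaced by a counter, and eviction scans all slots instead of using a constant-time structure.

```python
from dataclasses import dataclass


@dataclass
class _Slot:
    expert_id: int
    freq: int       # number of times this expert was requested
    last_used: int  # logical clock value of the latest request


class LFRUExpertCache:
    """Toy frequency-weighted LRU cache mapping expert ids to slot indices."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.clock = 0
        self.slots: list[_Slot] = []
        self.slot_of: dict[int, int] = {}  # expert id -> slot index
        self.copies = 0                    # simulated host-to-device copies

    def _score(self, slot: _Slot) -> float:
        # Assumed score: frequent and recently used experts score higher.
        return slot.freq / (self.clock - slot.last_used + 1)

    def _touch(self, expert_id: int, protected: set) -> None:
        self.clock += 1
        idx = self.slot_of.get(expert_id)
        if idx is not None:  # hit: update frequency and recency
            self.slots[idx].freq += 1
            self.slots[idx].last_used = self.clock
            return
        self.copies += 1     # miss: a real implementation copies weights here
        if len(self.slots) < self.capacity:
            idx = len(self.slots)
            self.slots.append(_Slot(expert_id, 1, self.clock))
        else:
            # Evict the lowest-scoring slot not needed by the current batch.
            candidates = [i for i, s in enumerate(self.slots)
                          if s.expert_id not in protected]
            idx = min(candidates, key=lambda i: self._score(self.slots[i]))
            del self.slot_of[self.slots[idx].expert_id]
            self.slots[idx] = _Slot(expert_id, 1, self.clock)
        self.slot_of[expert_id] = idx

    def prepare(self, topk_ids: list) -> list:
        """Load every expert in topk_ids and return ids remapped to slots."""
        unique = sorted({e for row in topk_ids for e in row})
        if len(unique) > self.capacity:
            raise ValueError("batch needs more experts than the cache holds")
        protected = set(unique)
        for expert_id in unique:
            self._touch(expert_id, protected)
        return [[self.slot_of[e] for e in row] for row in topk_ids]
```

With capacity 2, loading experts 0 and 1, hitting expert 0 three times, hitting expert 1 once, and then requesting expert 2 evicts expert 1 even though it was used most recently, which is where this policy departs from pure LRU.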

Limitations

  • --enforce-eager required (CUDA graph compat deferred to PR 2)
  • Synchronous H2D copies (async pipeline in PR 2)
  • Single eviction policy (LFRU hardcoded, no pluggable framework)
  • EP > 1 not supported
  • BF16 + FP8 per-tensor only
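Two of these limitations (eager mode required, EP > 1 unsupported) lend themselves to an up-front configuration check. The sketch below is hypothetical and not the cross-validator added in vllm/config/vllm.py; the function name and the expert_parallel_size parameter are illustrative assumptions.

```python
def check_expert_cache_config(moe_expert_cache_size: int,
                              enforce_eager: bool,
                              expert_parallel_size: int = 1) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    if moe_expert_cache_size < 0:
        return ["moe_expert_cache_size must be >= 0"]
    if moe_expert_cache_size == 0:
        return []  # cache disabled, so none of the limitations apply
    problems = []
    if not enforce_eager:
        problems.append("expert cache requires enforce_eager=True")
    if expert_parallel_size > 1:
        problems.append("expert cache does not support EP > 1")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every conflicting setting at once.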

Test plan

pytest tests/kernels/moe/test_expert_lru_cache.py -v
pytest tests/basic_correctness/test_moe_expert_cache.py -v -s

AI-assisted development (Claude Code). Architecture validated in tinyserve.

Changed files

  • .buildkite/test_areas/basic_correctness.yaml (modified, +2/-0)
  • benchmarks/qwen_122b_test_20260331.txt (added, +28/-0)
  • docs/features/moe_cache_policies.md (added, +102/-0)
  • tests/basic_correctness/test_moe_expert_cache.py (added, +39/-0)
  • tests/kernels/moe/test_expert_lru_cache.py (added, +280/-0)
  • vllm/config/offload.py (modified, +3/-0)
  • vllm/config/vllm.py (modified, +15/-0)
  • vllm/engine/arg_utils.py (modified, +5/-0)
  • vllm/entrypoints/llm.py (modified, +7/-0)
  • vllm/model_executor/layers/fused_moe/expert_weight_provider.py (added, +216/-0)
  • vllm/model_executor/layers/fused_moe/fused_moe_method_base.py (modified, +10/-0)
  • vllm/model_executor/layers/fused_moe/fused_moe_modular_method.py (modified, +17/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +111/-3)
  • vllm/model_executor/layers/fused_moe/unquantized_fused_moe_method.py (modified, +94/-31)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +40/-0)

PR #36461: [Bugfix] Fix cpu-offload-gb assertion with non-default block sizes

Description (problem / solution / changelog)

Remove assertion in may_reinitialize_input_batch that blocked CPU offloading when block sizes differ from defaults. The original issue it guarded against no longer applies.

Closes #36279

Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +0/-5)


Your current environment

  • vLLM: 0.18.2rc1.dev25+gef53395e2 (built from main, 2026-04-01, includes PRs #33417 and #29242)
  • FlashInfer: 0.6.6
  • PyTorch: 2.12.0a0+git6fcbf6d
  • CUDA: 13.2
  • GPU: NVIDIA GeForce RTX 5080 (SM 12.0, 16 GB VRAM)
  • Driver: 595.45.04
  • OS: Ubuntu 24.04, Linux 6.17.0-19-generic

Model

nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4 — NemotronH hybrid architecture (Mamba2 + MoE + Attention), NVFP4 quantized via modelopt.

  • 52 layers: 6 attention + 23 Mamba2 SSM + 23 MoE (128 routed experts, 6 active, 1 shared)
  • ~18.4 GB on disk → requires --cpu-offload-gb 5 to fit on 16 GB VRAM
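The memory arithmetic behind the offload requirement can be sketched as follows. This is a rough estimate under stated assumptions: it treats the on-disk size as the loaded weight size, treats cpu_offload_gb as a plain subtraction, uses the gpu_memory_utilization value from the reproduction script, and ignores activations and the CUDA context. The function name is hypothetical.

```python
def weight_headroom_gb(vram_gb: float,
                       gpu_memory_utilization: float,
                       model_gb: float,
                       cpu_offload_gb: float) -> float:
    """Estimate GPU memory left for KV cache and activations, in GB."""
    usable = vram_gb * gpu_memory_utilization   # budget the engine may use
    weights_on_gpu = model_gb - cpu_offload_gb  # weights left resident on GPU
    return usable - weights_on_gpu


# Values from the issue: RTX 5080 (16 GB), 18.4 GB model, utilization 0.95.
without_offload = weight_headroom_gb(16, 0.95, 18.4, 0)
with_offload = weight_headroom_gb(16, 0.95, 18.4, 5)
print(f"no offload: {without_offload:+.1f} GB, 5 GB offload: {with_offload:+.1f} GB")
```

A negative result means the weights alone exceed the budget. With these numbers the estimate is about -3.2 GB without offloading and about +1.8 GB with 5 GB offloaded.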

Describe the bug

All NVFP4 MoE backends produce numerically incorrect (garbage) output on SM120 when CPU offloading is enabled. The model loads successfully, the engine starts, and inference runs without errors — but the output is completely wrong (immediate EOS or random tokens).

Every MoE backend tested produces garbage:

| MoE Backend | CUDA Arch | Crash? | Output |
|---|---|---|---|
| FLASHINFER_CUTLASS | compute_120a (default) | Yes (illegal instruction) | N/A |
| FLASHINFER_CUTLASS | compute_120f (via FLASHINFER_CUDA_ARCH_LIST=12.0f) | No | Empty + EOS (1 token) |
| VLLM_CUTLASS | precompiled | No | ":" + EOS (2 tokens) |
| MARLIN | N/A | No | Garbage |

Example outputs:

| Prompt | Output | Expected |
|---|---|---|
| "What is the capital of France?" (chat) | "" + EOS | "Paris" |
| "The capital of France is" (completion) | "." | "Paris..." |
| "def fibonacci(n):\n" (completion) | " word\n" | Python code |

Suspected root cause: CPU offload + MoE interaction

The community has confirmed working NVFP4 MoE on SM120 without CPU offloading:

  • RTX PRO 6000 (96 GB): 39 tok/s native FP4 MoE after CUTLASS fix (NVIDIA/cutlass#3096)
  • RTX 5090 (32 GB): ~202 tok/s via llama.cpp GGUF variant

Our RTX 5080 (16 GB) requires 5 GB CPU offloading because the model is 18.4 GB. With --cpu-offload-gb 5, expert weights are split across GPU and CPU. The fused MoE kernel may not correctly handle experts that have been offloaded to CPU memory.

Additionally, both pure-attention NVFP4 models (Qwen3-4B-NVFP4) and hybrid non-MoE models (Qwen3.5-9B-NVFP4 with GDN layers) work correctly on the same GPU with the same vLLM build — even with CPU offloading. Only models with MoE layers produce garbage.

How to reproduce

#!/usr/bin/env python3
"""Minimal reproduction: NVFP4 MoE garbage on SM120 + CPU offload."""
import asyncio
import os

os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
os.environ["FLASHINFER_CUDA_ARCH_LIST"] = "12.0f"  # Required for SM120 MoE JIT

async def main():
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4",
        trust_remote_code=True,
        max_model_len=4096,
        gpu_memory_utilization=0.95,
        cpu_offload_gb=5,  # Required: 18.4 GB model > 16 GB VRAM
        enforce_eager=True,
        max_num_batched_tokens=4096,
    )

    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(max_tokens=32, temperature=0),
    )
    print(f"Output: {outputs[0].outputs[0].text!r}")
    # Expected: "Paris..." or similar
    # Actual: "." or garbage

if __name__ == "__main__":
    asyncio.run(main())

To reproduce, you need:

  1. An SM120 GPU with < 18.4 GB VRAM (RTX 5080 16 GB), forcing CPU offload
  2. CUDA 13.2+ (for compute_120f support)
  3. FlashInfer 0.6.6+

Additional notes

  1. FlashInfer auto-detection bug: flashinfer/compilation_context.py auto-appends "a" suffix for SM major >= 9, producing compute_120a instead of compute_120f. The compute_120a arch does not support TMA WS grouped GEMM tactics needed for NVFP4 MoE, causing "illegal instruction" crashes. Workaround: FLASHINFER_CUDA_ARCH_LIST=12.0f.

  2. hf_quant_config.json FP8 KV: The model ships with "kv_cache_quant_algo": "FP8" but provides no FP8 KV scaling factors (vLLM warns "Using KV cache scaling factor 1.0"). Setting this to "none" to use BF16 KV does not affect the MoE garbage — the issue is purely in the MoE path.

  3. CPU offload assertion: vLLM's gpu_model_runner.py had an assertion blocking CPU offload with hybrid models (InputBatch reinit). Current main has this as a warning, which is correct — the model loads fine.
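The KV-cache experiment from note 2 (switching kv_cache_quant_algo to "none") can be scripted instead of edited by hand. The sketch below is illustrative only: the helper name is hypothetical, and the nested layout of the sample config is an assumption, which is why the helper searches recursively instead of relying on a fixed path.

```python
import json


def set_kv_cache_algo(config, value: str = "none") -> int:
    """Set every "kv_cache_quant_algo" key to value; return how many changed."""
    changed = 0
    if isinstance(config, dict):
        for key, item in config.items():
            if key == "kv_cache_quant_algo":
                if item != value:
                    config[key] = value  # overwriting an existing key is safe here
                    changed += 1
            else:
                changed += set_kv_cache_algo(item, value)
    elif isinstance(config, list):
        for item in config:
            changed += set_kv_cache_algo(item, value)
    return changed


# Assumed sample layout; a real hf_quant_config.json may be structured differently.
sample = json.loads(
    '{"quantization": {"quant_algo": "NVFP4", "kv_cache_quant_algo": "FP8"}}'
)
print(set_kv_cache_algo(sample), json.dumps(sample))
```

To apply this to a real checkpoint, load the file with json.load, call the helper, and write the result back with json.dump, keeping a backup of the original. Per the issue, this change does not affect the MoE output, so it serves only to rule out the KV cache as a factor.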

Before submitting a new issue...

  • I have searched for similar issues and couldn't find any duplicates.
  • I have verified that the issue persists on the latest vLLM main branch.
  • I have provided all necessary details including reproduction steps.

extent analysis

TL;DR

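The suspected root cause is the interaction between CPU offloading and MoE layers. Where the model fits in VRAM, disable CPU offloading by setting cpu_offload_gb to 0 or removing the flag. On the reporter's 16 GB RTX 5080 the 18.4 GB model does not fit without offloading, so there this is a diagnostic step, not a fix.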

Guidance

  • Verify that the model works correctly without CPU offloading by setting cpu_offload_gb to 0 or removing the flag.
  • Test the model on a GPU with sufficient VRAM to hold the entire model, to confirm that the issue is related to CPU offloading.
  • Investigate other ways to reduce GPU memory use so that CPU offloading is not needed; note that the model is already NVFP4-quantized.
  • Check the FlashInfer documentation for any known issues or limitations related to CPU offloading and MoE layers.

Example

from vllm import LLM

# Requires a GPU with enough VRAM to hold the full 18.4 GB model.
llm = LLM(
    model="nvidia/Nemotron-Cascade-2-30B-A3B-NVFP4",
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
    # cpu_offload_gb omitted (defaults to 0): no CPU offloading
    enforce_eager=True,
    max_num_batched_tokens=4096,
)

Notes

The provided reproduction steps and environment details suggest that the issue is specific to the interaction between CPU offloading and MoE layers on SM120 GPUs. Disabling CPU offloading may resolve the issue, but may also require significant changes to the model or environment.

Recommendation

Apply the workaround where the hardware allows it: disable CPU offloading (set cpu_offload_gb to 0 or remove the flag), since the CPU offload and MoE interaction is the most likely cause.
