Current environment

vLLM version: 0.18.0-cu130 (also reproduced on 0.17.1-cu130)
GPU: NVIDIA GeForce RTX 5090D 32GB (SM120, consumer Blackwell)
Host OS: Windows 11 + Docker Desktop (WSL2 backend)
Container runtime: Docker with NVIDIA GPU passthrough
Model: Sehyo/Qwen3.5-35B-A3B-NVFP4 (compressed-tensors / NVFP4 quantization)
NVFP4 GEMM backend: MARLIN (on 0.17.1), FLASHINFER_CUTLASS (on 0.18.0)
Note from vLLM startup log: "Using 'pin_memory=False' as WSL is detected."

🐛 Describe the bug

Attempting to use --cpu-offload-gb with --cpu-offload-params=experts on an NVFP4-quantized MoE model under WSL results in one of three distinct fatal crashes, depending on the parameter combination. No combination of workarounds has produced a working configuration.

All three crashes occur during profile_run() at startup, before the engine is ready to serve requests.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Crash 1 — Dynamo cannot trace setattr (v0.18.0, compile mode) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts (no --enforce-eager)

Error: torch._dynamo.exc.Unsupported: Failed to trace builtin operator Explanation: Dynamo does not know how to trace builtin operator setattr with argument types ['OrderedDict', 'str', 'OrderedDict']

Root cause: UVAOffloader.forward() in uva.py calls module.state_dict() inside the forward pass. Internally, state_dict() executes destination._metadata = OrderedDict(). torch._dynamo cannot trace this setattr on an OrderedDict, which causes a hard crash during AOT compilation.
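To make the failing operation concrete, the sketch below mimics the attribute assignment described in the root cause using only the standard library. `state_dict_like` and the parameter names are hypothetical stand-ins for illustration; this is not PyTorch's actual implementation.

```python
from collections import OrderedDict


def state_dict_like(params):
    """Hypothetical stand-in for the opening steps of Module.state_dict()."""
    destination = OrderedDict()
    # Attribute assignment on an OrderedDict instance: this is the
    # setattr(OrderedDict, str, OrderedDict) call named in the Dynamo error.
    destination._metadata = OrderedDict()
    for name, value in params.items():
        destination[name] = value
    return destination


# Placeholder names and values; real entries would be tensors.
state = state_dict_like({"expert_weight": 1.0, "b_scales": 0.5})
print(type(state).__name__, list(state), hasattr(state, "_metadata"))
```

In plain eager Python this assignment is harmless. It only becomes a problem when the call sits inside code that torch.compile has to trace.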

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Crash 2 — b_scales not on GPU (v0.18.0, Marlin path, with --enforce-eager) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts --enforce-eager VLLM_USE_FLASHINFER_MOE_FP4=0 (forces Marlin backend)

Error: RuntimeError: b_scales is not on GPU at fused_marlin_moe.py -> moe_wna16_marlin_gemm()

Root cause: The UVA offloader moves all named parameters of the matched module to CPU, including b_scales (the NVFP4 quantization scale tensor). The Marlin GEMM kernel moe_wna16_marlin_gemm() requires all input tensors to reside on GPU and has no logic to handle CPU/UVA tensors. The offloader has no exclusion for quantization metadata tensors.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Crash 3 — CUDA illegal memory access (v0.18.0, CUTLASS path, with --enforce-eager) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts --enforce-eager VLLM_USE_FLASHINFER_MOE_FP4=1 (uses FLASHINFER_CUTLASS backend)

Error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered at uva.py:122 -> k: v.to(device, non_blocking=True)

Root cause: vLLM detects WSL and sets pin_memory=False at startup (logged explicitly), but UVAOffloader.forward() still uses non_blocking=True for all CPU-to-GPU tensor transfers. The offloader has no fallback to non_blocking=False when pinned memory is unavailable. The non-blocking copy from unpinned host memory appears to be what triggers the illegal memory access: the transfer is not guaranteed to have completed before the consuming kernel launches.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Summary of all tested combinations ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

| enforce-eager | NVFP4 backend | Result |
| --- | --- | --- |
| no | MARLIN | Crash 1 (Dynamo setattr) |
| no | FLASHINFER_CUTLASS | Crash 1 (Dynamo setattr) |
| yes | MARLIN | Crash 2 (b_scales not on GPU) |
| yes | FLASHINFER_CUTLASS | Crash 3 (illegal memory access) |

All four combinations crash. CPU offload is completely non-functional on WSL with NVFP4 MoE models in vLLM 0.18.0.
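For anyone trying to reproduce the matrix, the four combinations can be enumerated with a short script. This is a sketch under assumptions: it assumes the server is launched through the `vllm serve` entry point (the report only says it ran in Docker), and that the backend is selected purely through VLLM_USE_FLASHINFER_MOE_FP4 in every row. The flags and the model name are taken from the report.

```python
from itertools import product

MODEL = "Sehyo/Qwen3.5-35B-A3B-NVFP4"
BASE_ARGS = ["--cpu-offload-gb=4", "--cpu-offload-params=experts"]


def build_matrix():
    """Return (env, argv) pairs for the four tested combinations."""
    runs = []
    for enforce_eager, use_flashinfer in product([False, True], repeat=2):
        # 0 selects the Marlin path, 1 the FLASHINFER_CUTLASS path.
        env = {"VLLM_USE_FLASHINFER_MOE_FP4": "1" if use_flashinfer else "0"}
        argv = ["vllm", "serve", MODEL, *BASE_ARGS]
        if enforce_eager:
            argv.append("--enforce-eager")
        runs.append((env, argv))
    return runs


# Print each combination as a shell-style command line.
for env, argv in build_matrix():
    env_prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(env_prefix, " ".join(argv))
```

Each printed line can be pasted into the container shell. Nothing is launched by the script itself.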

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Expected behavior ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. UVAOffloader should not call state_dict() inside forward() when torch.compile / Dynamo is active. The logic should be restructured to be Dynamo-traceable.
2. The offloader should exclude quantization metadata tensors (scales, zeros, etc.) from CPU offloading, or detect them and keep them on GPU.
3. When pin_memory=False is detected (e.g., WSL), UVAOffloader should fall back to non_blocking=False so that transfers from unpinned memory complete before the tensors are used.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Additional context ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Without --cpu-offload-gb, the model runs correctly on both 0.17.1 (Marlin) and 0.18.0 (FLASHINFER_CUTLASS).
The FLASHINFER_CUTLASS backend works correctly without offloading on 0.18.0 after enabling VLLM_USE_FLASHINFER_MOE_FP4=1.
The motivation for offloading: FLASHINFER_CUTLASS requires a higher gpu_memory_utilization (~0.92) to pass profile_run, leaving very few KV cache blocks. With 4GB of expert weights offloaded to CPU, the freed VRAM could accommodate a full 131K context window.
Happy to test any proposed fix on RTX 5090D + WSL.

Fix Plan

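To address the crashes, UVAOffloader needs three changes: make it compatible with torch.compile/Dynamo, keep quantization metadata tensors on the GPU, and stop using non-blocking copies when pinned memory is unavailable. The steps are:

1. **Restructure `UVAOffloader.forward()`**:
   - Move the `state_dict()` call outside the forward pass to avoid Dynamo tracing issues.
   - Example:
     ```python
     # Before: state_dict() runs on every forward pass, inside the traced region.
     def forward(self, input):
         state = self.module.state_dict()
         # ...

     # After: take the snapshot once, at construction time.
     # (Assumes the weights are already loaded when the wrapper is built.)
     def __init__(self, module):
         self.module = module
         self.state = module.state_dict()

     def forward(self, input):
         # Use self.state instead of calling state_dict() here
         # ...
         ...
     ```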


2. **Exclude quantization metadata tensors from CPU offloading**:
   - Identify tensors like `b_scales` and keep them on the GPU instead of offloading them.
   - Example:
     ```python
     # Assumes the scale tensors are registered under names ending in 'b_scales'.
     def offload_params(self):
         params_to_offload = []
         for name, param in self.module.named_parameters():
             # named_parameters() yields dotted paths, so compare the last component.
             if name.rsplit(".", 1)[-1] != "b_scales":  # Exclude b_scales
                 params_to_offload.append(param)
         # Offload params_to_offload to CPU
     ```

3. **Fall back to `non_blocking=False` when `pin_memory=False`**:
   - Modify the tensor transfer logic so that it only requests a non-blocking copy when pinned memory is in use.
   - Example:
     ```python
     # pin_memory is the flag vLLM reports at startup (False under WSL).
     # .to() returns a new tensor, so the result must be kept.
     tensor = tensor.to(device, non_blocking=pin_memory)
     ```


### Verification
After applying these fixes, verify that:
- The model runs without crashes during `profile_run()` with `--cpu-offload-gb` and `--cpu-offload-params=experts`.
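- Quantization metadata tensors such as `b_scales` stay on the GPU while the expert weights are offloaded to CPU.
- Output quality matches the non-offloaded baseline, and throughput is in line with what the offloading overhead would suggest.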
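The second check can be automated with a small helper that reports which protected tensors are not on a CUDA device. This is a minimal sketch: `find_cpu_resident`, the `must_stay_on_gpu` tuple, and the parameter names are hypothetical. It only reads `tensor.device.type`, so it should also accept the pairs yielded by `module.named_parameters()`; stub objects are used here so that it runs without PyTorch or a GPU.

```python
from types import SimpleNamespace


def find_cpu_resident(named_tensors, must_stay_on_gpu=("b_scales",)):
    """Return names of protected tensors whose device is not CUDA.

    `named_tensors` is any iterable of (name, tensor) pairs; only
    `tensor.device.type` is inspected.
    """
    offenders = []
    for name, tensor in named_tensors:
        leaf = name.rsplit(".", 1)[-1]  # last component of the dotted path
        if leaf in must_stay_on_gpu and tensor.device.type != "cuda":
            offenders.append(name)
    return offenders


def fake_tensor(device_type):
    """Stub exposing only the `.device.type` attribute of a real tensor."""
    return SimpleNamespace(device=SimpleNamespace(type=device_type))


# Hypothetical parameter names for illustration.
params = [
    ("experts.w_packed", fake_tensor("cpu")),  # offloaded weight: allowed on CPU
    ("experts.b_scales", fake_tensor("cpu")),  # scale tensor on CPU: flagged
]
print(find_cpu_resident(params))
```

An empty list means every protected tensor reports a CUDA device, which is the condition the Marlin kernel appears to require in Crash 2.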

### Extra Tips
- Ensure that the `UVAOffloader` is correctly handling the quantization metadata tensors and excluding them from offloading.
- Re-run all four combinations from the summary table (compiled and `--enforce-eager`, Marlin and FLASHINFER_CUTLASS) to confirm that each fix holds.
- Consider logging which parameters are offloaded and whether pinned memory is in use, to make future issues easier to diagnose.


vllm - 💡(How to fix) Fix [Bug] UVA CPU offload completely broken on WSL with NVFP4 MoE (Qwen3.5-35B-A3B): three distinct crashes across all parameter combinations [1 comments, 2 participants]
