vllm - 💡(How to fix) Fix [Bug] UVA CPU offload completely broken on WSL with NVFP4 MoE (Qwen3.5-35B-A3B): three distinct crashes across all parameter combinations [1 comments, 2 participants]

vllm-project/vllm#37883 · Fetched 2026-04-08 01:17:23

Error Message

Error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered (this is Crash 3; the issue body describes two further crashes)

Root Cause

Root cause (Crash 1): UVAOffloader.forward() in uva.py calls module.state_dict() inside the forward pass. Internally, state_dict() attaches a metadata attribute to the dict it returns (destination._metadata = OrderedDict()). torch._dynamo cannot trace this setattr on an OrderedDict, so compilation fails with a hard error.

Current environment

  • vLLM version: 0.18.0-cu130 (also reproduced on 0.17.1-cu130)
  • GPU: NVIDIA GeForce RTX 5090D 32GB (SM120, consumer Blackwell)
  • Host OS: Windows 11 + Docker Desktop (WSL2 backend)
  • Container runtime: Docker with NVIDIA GPU passthrough
  • Model: Sehyo/Qwen3.5-35B-A3B-NVFP4 (compressed-tensors / NVFP4 quantization)
  • NVFP4 GEMM backend: MARLIN (on 0.17.1), FLASHINFER_CUTLASS (on 0.18.0)
  • Note from vLLM startup log: "Using 'pin_memory=False' as WSL is detected."
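
The startup note in the last bullet implies a WSL check somewhere before memory pinning is decided. A minimal sketch of one common heuristic follows; the function names are hypothetical, and whether vLLM uses exactly this test is an assumption.

```python
import platform


def looks_like_wsl(uname_text=None):
    """Heuristic: WSL kernels usually report 'microsoft' in their uname fields."""
    if uname_text is None:
        uname_text = " ".join(platform.uname())
    return "microsoft" in uname_text.lower()


def pin_memory_available(uname_text=None):
    # Assumption drawn from the startup log: pinning is turned off under WSL.
    return not looks_like_wsl(uname_text)
```

Passing the uname text explicitly makes the check easy to exercise on a non-WSL machine.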

🐛 Describe the bug

Attempting to use --cpu-offload-gb with --cpu-offload-params=experts on a NVFP4-quantized MoE model under WSL results in three distinct fatal crashes depending on the parameter combination. No combination of workarounds has produced a working configuration.

All three crashes occur during profile_run() at startup, before the engine is ready to serve requests.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Crash 1: Dynamo cannot trace setattr (v0.18.0, compile mode)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts (no --enforce-eager)

Error: torch._dynamo.exc.Unsupported: Failed to trace builtin operator
  Explanation: Dynamo does not know how to trace builtin operator setattr with argument types ['OrderedDict', 'str', 'OrderedDict']

Root cause: UVAOffloader.forward() in uva.py calls module.state_dict() inside the forward pass. Internally, state_dict() attaches a metadata attribute to the dict it returns (destination._metadata = OrderedDict()). torch._dynamo cannot trace this setattr on an OrderedDict, so the run fails with a hard error during AOT compilation.
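
To make the pattern concrete without requiring torch, here is a small stand-in. TinyModule only mimics the one state_dict() step quoted above, and HoistedOffloader is a hypothetical wrapper, not vLLM's class. The sketch shows the attribute write happening once at construction instead of on every forward call.

```python
from collections import OrderedDict


class TinyModule:
    """Stand-in for torch.nn.Module; mimics only the relevant state_dict() step."""

    def __init__(self):
        self.weights = {"w": [1.0, 2.0]}
        self.state_dict_calls = 0

    def state_dict(self):
        self.state_dict_calls += 1
        destination = OrderedDict(self.weights)
        # The attribute write that Dynamo reports it cannot trace.
        destination._metadata = OrderedDict()
        return destination


class HoistedOffloader:
    """Hypothetical wrapper: snapshot once, reuse the snapshot in forward()."""

    def __init__(self, module):
        self.module = module
        # Runs once, outside any region a compiler would trace.
        self.state = module.state_dict()

    def forward(self, x):
        # Only plain dict reads happen here: no state_dict() call, no setattr.
        return [x * w for w in self.state["w"]]
```

A snapshot like this goes stale if parameters are replaced after construction, so a real fix would have to refresh it whenever weights are reloaded or post-processed.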

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Crash 2: b_scales not on GPU (v0.18.0, Marlin path, with --enforce-eager)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts --enforce-eager, with environment variable VLLM_USE_FLASHINFER_MOE_FP4=0 (forces the Marlin backend)

Error: RuntimeError: b_scales is not on GPU at fused_marlin_moe.py -> moe_wna16_marlin_gemm()

Root cause: The UVA offloader moves all named parameters of the matched module to CPU, including b_scales (the NVFP4 quantization scale tensor). The Marlin GEMM kernel moe_wna16_marlin_gemm() requires all input tensors to reside on GPU and has no logic to handle CPU/UVA tensors. The offloader has no exclusion for quantization metadata tensors.
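
One way the missing exclusion could look is a name-based filter applied before anything is moved to CPU. This is a sketch only: the suffix list and the parameter names in the usage line are assumptions for illustration, not names confirmed from vLLM.

```python
# Assumed suffixes for quantization metadata; adjust to the real parameter names.
QUANT_METADATA_SUFFIXES = ("scales", "scale", "zeros", "zero_point")


def split_for_offload(param_names, keep_suffixes=QUANT_METADATA_SUFFIXES):
    """Partition dotted parameter names into (offload, keep_on_gpu)."""
    offload, keep_on_gpu = [], []
    for name in param_names:
        # Compare only the last path component, e.g. "b_scales".
        leaf = name.rsplit(".", 1)[-1]
        if leaf.endswith(keep_suffixes):
            keep_on_gpu.append(name)
        else:
            offload.append(name)
    return offload, keep_on_gpu
```

For example, `split_for_offload(["mlp.experts.w13_weight", "mlp.experts.b_scales"])` keeps the scale tensor on GPU and marks only the weight for offload.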

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Crash 3: CUDA illegal memory access (v0.18.0, CUTLASS path, with --enforce-eager)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Config: --cpu-offload-gb=4 --cpu-offload-params=experts --enforce-eager, with environment variable VLLM_USE_FLASHINFER_MOE_FP4=1 (uses the FLASHINFER_CUTLASS backend)

Error: torch.AcceleratorError: CUDA error: an illegal memory access was encountered at uva.py:122 -> k: v.to(device, non_blocking=True)

Root cause (reporter's diagnosis): vLLM detects WSL and sets pin_memory=False at startup (logged explicitly), yet UVAOffloader.forward() still passes non_blocking=True for every CPU-to-GPU tensor transfer. The reporter's reading is that an asynchronous copy from unpinned memory is not guaranteed to finish before the consuming kernel launches, which would explain the illegal memory access. The offloader has no fallback to non_blocking=False when pinned memory is unavailable.
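
The fallback the reporter asks for amounts to tying non_blocking to whether pinning is active. A minimal sketch follows, where move_state is a hypothetical helper and RecordingTensor is a stand-in that only records how .to() was called. Whether this change alone resolves Crash 3 is not established in the issue.

```python
def move_state(state, device, pin_memory):
    """Copy every tensor-like value to `device`, async only if memory is pinned."""
    non_blocking = bool(pin_memory)
    return {k: v.to(device, non_blocking=non_blocking) for k, v in state.items()}


class RecordingTensor:
    """Stand-in for a tensor: records the arguments passed to .to()."""

    def __init__(self):
        self.calls = []

    def to(self, device, non_blocking=False):
        self.calls.append((device, non_blocking))
        return self
```

With real tensors the same helper would replace the `k: v.to(device, non_blocking=True)` comprehension named in the traceback.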

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Summary of all tested combinations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

| enforce-eager | NVFP4 backend | Result |
|---|---|---|
| no | MARLIN | Crash 1 (Dynamo setattr) |
| no | FLASHINFER_CUTLASS | Crash 1 (Dynamo setattr) |
| yes | MARLIN | Crash 2 (b_scales not on GPU) |
| yes | FLASHINFER_CUTLASS | Crash 3 (illegal memory access) |

All four combinations crash. CPU offload is completely non-functional on WSL with NVFP4 MoE models in vLLM 0.18.0.
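
For anyone re-running the matrix, the four combinations can be generated instead of typed by hand. The flags and the environment variable come from the configs listed in this issue. The dictionary layout and the build_matrix name are illustrative only, and using the environment variable to pick the backend in the non-eager rows is an assumption.

```python
from itertools import product

BASE_FLAGS = ["--cpu-offload-gb=4", "--cpu-offload-params=experts"]


def build_matrix():
    """Return the four enforce-eager x backend combinations from the table."""
    runs = []
    for eager, flashinfer in product([False, True], [False, True]):
        flags = BASE_FLAGS + (["--enforce-eager"] if eager else [])
        env = {"VLLM_USE_FLASHINFER_MOE_FP4": "1" if flashinfer else "0"}
        backend = "FLASHINFER_CUTLASS" if flashinfer else "MARLIN"
        # Outcome reported in the issue for this combination.
        if not eager:
            reported = "Crash 1"
        elif flashinfer:
            reported = "Crash 3"
        else:
            reported = "Crash 2"
        runs.append({"flags": flags, "env": env, "backend": backend,
                     "reported": reported})
    return runs
```

Each entry can then be passed to whatever launcher is in use, with `env` merged into the process environment.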

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Expected behavior ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  1. UVAOffloader should not call state_dict() inside forward() where torch.compile / Dynamo is active. This should be restructured to be Dynamo-traceable.

  2. The offloader should exclude quantization metadata tensors (scales, zeros, etc.) from CPU offloading, or detect and keep them on GPU.

  3. When pin_memory=False is detected (e.g., WSL), UVAOffloader should fall back to non_blocking=False to avoid undefined CUDA behavior.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Additional context ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  • Without --cpu-offload-gb, the model runs correctly on both 0.17.1 (Marlin) and 0.18.0 (FLASHINFER_CUTLASS).
  • The FLASHINFER_CUTLASS backend works correctly without offloading on 0.18.0 after enabling VLLM_USE_FLASHINFER_MOE_FP4=1.
  • The motivation for offloading: FLASHINFER_CUTLASS requires a higher gpu_memory_utilization (~0.92) to pass profile_run, leaving very few KV cache blocks. With 4GB of expert weights offloaded to CPU, the freed VRAM could accommodate a full 131K context window.
  • Happy to test any proposed fix on RTX 5090D + WSL.

extent analysis

Fix Plan

To address the crashes, UVAOffloader needs three changes: make it compatible with torch.compile/Dynamo, keep quantization metadata tensors on GPU, and fall back to blocking transfers when pinned memory is unavailable. Here are the steps:

  1. Restructure UVAOffloader.forward():
    • Move the state_dict() call outside the forward pass so that Dynamo never traces it.
    • Example (a sketch; the real class holds more state than shown):

        # Before: state_dict() runs on every forward call.
        def forward(self, input):
            state = self.module.state_dict()
            ...

        # After: snapshot the parameters once at construction time.
        def __init__(self, module):
            self.module = module
            self.state = module.state_dict()

        def forward(self, input):
            # Use self.state instead of calling state_dict() here.
            ...


2. **Exclude quantization metadata tensors from CPU offloading**:
   - Identify tensors such as `b_scales` and keep them on GPU instead of offloading them.
   - Example:
     ```python
     # Assumes the scale tensors are registered parameters whose names end in "b_scales".
     def offload_params(self):
         params_to_offload = []
         for name, param in self.module.named_parameters():
             # named_parameters() yields dotted names, so match on the suffix.
             if not name.endswith("b_scales"):
                 params_to_offload.append(param)
         # Offload params_to_offload to CPU
         return params_to_offload
     ```
  3. Fall back to non_blocking=False when pin_memory=False:
    • Modify the tensor transfer logic so that an asynchronous copy is requested only when the source memory is pinned.
    • Example:

        # Request an asynchronous copy only when pinned memory is in use,
        # and keep the returned tensor (Tensor.to() is not in-place).
        tensor = tensor.to(device, non_blocking=pin_memory)


### Verification
After applying these fixes, verify that:
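- The engine gets through `profile_run()` without crashing when started with `--cpu-offload-gb` and `--cpu-offload-params=experts`, both with and without `--enforce-eager`.
- Quantization metadata tensors such as `b_scales` stay on GPU while the expert weights are offloaded.
- Model outputs match a run without offloading, and throughput is acceptable given the extra transfer overhead.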
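
The first check can be partly automated by scanning the startup log for the three error strings quoted in this issue. A minimal sketch follows; the labels and the classify_log name are illustrative.

```python
# Error substrings taken from the three crash reports in this issue.
CRASH_SIGNATURES = {
    "Crash 1 (Dynamo setattr)": "Failed to trace builtin operator",
    "Crash 2 (b_scales not on GPU)": "b_scales is not on GPU",
    "Crash 3 (illegal memory access)": "an illegal memory access was encountered",
}


def classify_log(log_text):
    """Return the labels of every known crash signature found in the log."""
    return [label for label, needle in CRASH_SIGNATURES.items()
            if needle in log_text]
```

An empty result only means none of the three known signatures appeared; it does not prove the engine started cleanly.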

### Extra Tips
- Ensure that the `UVAOffloader` is correctly handling the quantization metadata tensors and excluding them from offloading.
- Test the model with different configurations to ensure that the fixes are working as expected.
- Consider adding additional logging or debugging statements to help identify any future issues.
