openclaw - Fix WSL2 GPU-PV driver lockup: nvidia-smi hangs after llama-server D-state crash

On WSL2 with GPU-PV (GPU Paravirtualization), when llama-server encounters a GPU operation stall, the entire NVIDIA driver stack locks up:

  1. The llama-server process enters D state (uninterruptible sleep)
  2. nvidia-smi hangs indefinitely on both the WSL and Windows sides
  3. SIGKILL cannot terminate the D-state process
  4. Only wsl --shutdown recovers the system
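
To confirm the first symptom without calling into the NVIDIA driver, the process state can be read straight from /proc. The following is a minimal sketch, not part of the original report: the function names `parse_proc_state` and `find_d_state` are hypothetical, and it assumes the standard Linux /proc/<pid>/stat layout. A process can legitimately sit in D state for a moment during normal I/O, so only a state that persists across repeated checks points to the lockup described here.

```python
from pathlib import Path


def parse_proc_state(stat_line: str) -> tuple[str, str]:
    """Return (comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')' characters, so split on the LAST ')' rather than on
    whitespace. The state letter is the first field after it.
    """
    open_idx = stat_line.index("(")
    close_idx = stat_line.rindex(")")
    comm = stat_line[open_idx + 1:close_idx]
    state = stat_line[close_idx + 1:].split()[0]
    return comm, state


def find_d_state(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            comm, state = parse_proc_state((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat line was unreadable
        if state == "D":
            stuck.append((int(entry.name), comm))
    return stuck
```

Calling `find_d_state()` inside the WSL distribution returns the PIDs and command names of processes in uninterruptible sleep. Because it only reads files under /proc, it should not block on the GPU driver the way nvidia-smi does.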

Environment

  • WSL2 (Ubuntu 26.04) on Windows
  • GPU: NVIDIA RTX 5090 (24GB)
  • Driver: Latest NVIDIA WSL2 driver
  • llama.cpp build b9246
  • Model: Qwen3.6-35B-A3B Q4_K_M with --cpu-moe -ngl 85
  • Also using llama-swap (:1235) as proxy

Trigger scenario

llama-server had been serving Qwen3.6-35B on the GPU for ~1.5 hours. When the process received a kill signal while a CUDA operation was in flight, the GPU-PV driver layer failed to complete the CUDA context teardown, resulting in:

  • Process stuck in D state (wakekill flag set: Ds → D)
  • NVIDIA kernel module unresponsive
  • All CUDA and nvidia-smi commands hang at driver level
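
Since the trigger was a kill signal arriving mid-operation, one way to reduce the exposure is to stop the server in two stages: SIGTERM first, a bounded wait, and SIGKILL only as a last resort. The sketch below illustrates that pattern with Python's subprocess module. The helper name `stop_gracefully` and its default timeouts are hypothetical, and it assumes llama-server finishes in-flight work and exits on SIGTERM, which the report does not confirm. It cannot rescue a process that is already in D state.

```python
import subprocess


def stop_gracefully(
    proc: subprocess.Popen,
    grace_seconds: float = 30.0,
    reap_seconds: float = 5.0,
) -> bool:
    """Stop proc with SIGTERM, escalating to SIGKILL after grace_seconds.

    Returns True if the process exited on its own after SIGTERM (or had
    already exited), False if SIGKILL had to be sent.
    """
    if proc.poll() is not None:
        return True  # already exited, nothing to do
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL; ignored by a process stuck in D state
        try:
            # Bounded wait so this call itself can never hang forever.
            proc.wait(timeout=reap_seconds)
        except subprocess.TimeoutExpired:
            pass  # still not reaped, probably in uninterruptible sleep
        return False
```

A False return is worth logging, because it means the hard kill path that preceded the lockup in this report was taken again.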

Impact on OpenClaw

When this happens:

  • Agent responses stall because inference backend is down
  • Gateway cannot serve requests
  • System requires full WSL restart (downtime ~2-3 min)
  • All in-memory session state is lost

Possible mitigations (not implemented upstream)

  1. GPU health watchdog: Periodically check nvidia-smi responsiveness and trigger recovery when it stops responding
  2. Lower GPU memory pressure: Use fewer GPU layers (-ngl 65) to leave VRAM headroom
  3. Graceful process management: Set systemd TimeoutStopSec so llama-server has time to finish CUDA teardown after the stop signal before systemd escalates to SIGKILL
  4. WSL2 GPU driver limitations: This is a known WSL2 GPU-PV limitation. CUDA operations that stall cannot be recovered without host-level intervention
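
The watchdog in item 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not something implemented upstream: `probe_responds` and `watch` are hypothetical names, the intervals and thresholds are arbitrary, and `nvidia-smi -L` (list GPUs) is used simply as a cheap probe. The recovery action is left as a caller-supplied callback because, per the report, only wsl --shutdown restores the system.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def probe_responds(
    cmd: Sequence[str] = ("nvidia-smi", "-L"),
    timeout: float = 10.0,
) -> bool:
    """Return True if cmd exits with status 0 within timeout seconds."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False  # binary missing or not executable
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill. Deliberately no wait() afterwards: a probe that
        # is itself stuck in D state would never be reaped, and waiting on
        # it would hang the watchdog too.
        proc.kill()
        return False


def watch(
    probe: Callable[[], bool],
    on_failure: Callable[[], None],
    interval: float = 30.0,
    max_failures: int = 3,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call probe every interval seconds; fire on_failure once after
    max_failures consecutive failed probes. Returns the number of checks."""
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0  # a healthy probe resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                on_failure()
                return checks
        sleep(interval)
    return checks
```

The probe uses `Popen.wait(timeout=...)` rather than `subprocess.run(..., timeout=...)` on purpose: on POSIX, `run` kills the child on timeout and then waits for it without a limit, which would block indefinitely if the probe had entered D state. A typical call would be `watch(probe_responds, on_failure=notify)`, where `notify` is whatever alerting or restart hook the deployment provides.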
