openclaw - 💡(How to fix) Fix WSL2 GPU-PV driver lockup: nvidia-smi hangs after llama-server D-state crash

openclaw · 2026-05-24 13:03:37


Description

On WSL2 with GPU-PV (GPU paravirtualization), when a GPU operation issued by llama-server stalls, the entire NVIDIA driver stack locks up:

The llama-server process enters D state (uninterruptible sleep)
nvidia-smi hangs indefinitely on both the WSL and Windows sides
SIGKILL cannot terminate the D-state process
Only wsl --shutdown recovers the system
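
The D-state symptom above can be confirmed without touching the GPU driver. The following is a minimal sketch, assuming a Linux /proc filesystem as found inside WSL2; the function names are illustrative and not part of OpenClaw or llama.cpp. It reads the state field from /proc/<pid>/stat, where "D" means uninterruptible sleep.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> str:
    """Read the current scheduler state of a process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


def is_uninterruptible(pid: int) -> bool:
    """True if the process is in D state (uninterruptible sleep)."""
    return process_state(pid) == "D"
```

How the llama-server PID is found is left to the caller. Reading /proc does not call into the NVIDIA driver, so this check should still return when nvidia-smi itself hangs.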

Environment

WSL2 (Ubuntu 26.04) on Windows
GPU: NVIDIA RTX 5090 (24GB)
Driver: Latest NVIDIA WSL2 driver
llama.cpp build b9246
Model: Qwen3.6-35B-A3B Q4_K_M with --cpu-moe -ngl 85
Also using llama-swap (:1235) as proxy

Trigger scenario

llama-server had been running on the GPU for ~1.5 hours, serving Qwen3.6-35B. When the process received a kill signal while a CUDA operation was in flight, the GPU-PV driver layer failed to complete the CUDA context teardown, resulting in:

Process stuck in D state (wakekill flag set: Ds → D)
NVIDIA kernel module unresponsive
All CUDA and nvidia-smi commands hang at the driver level
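
Because the trigger was a kill signal arriving mid-operation, a supervisor could send SIGTERM and give the process a grace period before escalating. The sketch below is one way to express that in Python, assuming the supervisor holds a subprocess.Popen handle for llama-server; `stop_gracefully` and its default grace period are hypothetical. Whether a longer grace period actually avoids the lockup is not established by this report.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_s: float = 30.0) -> bool:
    """Send SIGTERM and wait up to grace_s seconds.

    Returns True if the process has exited, False if it is still running.
    Escalation is deliberately left to the caller.
    """
    if proc.poll() is not None:
        return True  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A False return lets the caller log the stall and check for D state before deciding whether SIGKILL is worth sending.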

Impact on OpenClaw

When this happens:

Agent responses stall because the inference backend is down
The gateway cannot serve requests
The system requires a full WSL restart (~2-3 min of downtime)
All in-memory session state is lost

Possible mitigations (not implemented upstream)

GPU health watchdog: Periodically check that nvidia-smi responds within a timeout; trigger recovery automatically when it does not
Lower GPU memory pressure: Offload fewer layers to the GPU (e.g. -ngl 65 instead of -ngl 85) to leave VRAM headroom
Graceful process management: Set systemd TimeoutStopSec so the service has time to finish CUDA teardown before it is force-killed
WSL2 GPU driver limitations: This is a known WSL2 GPU-PV limitation; CUDA operations that stall cannot be recovered without host-level intervention
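
The watchdog idea from the list above could look roughly like the following. This is a sketch under stated assumptions, not an OpenClaw feature: `probe`, `watch`, the intervals, and the `on_unhealthy` callback are all hypothetical names and values. Python's subprocess.run(timeout=...) kills the child and then waits for it, which would block forever on a D-state nvidia-smi, so the probe uses Popen directly and never waits after a timeout.

```python
import subprocess
import time


def probe(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        return False  # binary missing or not executable
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best-effort kill only. Do NOT wait here: a child stuck in
        # D state ignores SIGKILL and a blocking wait would hang too.
        proc.kill()
        return False


def watch(check, on_unhealthy, interval_s=30.0, max_failures=3, sleep=time.sleep):
    """Call check() repeatedly; after max_failures consecutive failures,
    call on_unhealthy() once and return."""
    failures = 0
    while True:
        failures = 0 if check() else failures + 1
        if failures >= max_failures:
            on_unhealthy()
            return
        sleep(interval_s)
```

A caller might run `watch(lambda: probe(["nvidia-smi"]), alert_operator)`, where `alert_operator` is a placeholder for whatever recovery hook is available. Since the report says only wsl --shutdown recovers the system, that hook likely has to act on the Windows host rather than inside the WSL instance.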

WSL2 GPU-PV documentation: https://learn.microsoft.com/en-us/windows/wsl/tutorials/gpu-compute
Known limitation: WSL2 does not support GPU TDR recovery
