[Bug]: Qwen3.6 hybrid Mamba models fail KV cache allocation on RTX PRO 6000 Blackwell + WSL2 (16 GiB invisible CUDA overhead)

GitHub stats: vllm-project/vllm#41619 (fetched 2026-05-05 05:44:38)
Comments: 1 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): cross-referenced ×3, commented ×1, labeled ×1, subscribed ×1

vLLM (and SGLang) cannot load any Qwen3.6 / Qwen3.5 family models (Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration architecture) on RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM, sm_120) under WSL2 Ubuntu 22.04.

The model loads to GPU successfully (~30 GB), but allocation of the Mamba state cache fails with torch.OutOfMemoryError: CUDA out of memory. Tried to allocate X GiB. GPU has 95.59 GiB total of which 50+ GiB is free.

The "non-PyTorch memory in use" reported by torch is 16 GiB on this WSL2 setup (vs ~1-2 GiB normal on native Linux). This abnormal CUDA driver overhead consumes memory invisibly and causes the contiguous Mamba state allocation to fail.



Workarounds (none ideal)

I am happy to provide additional debug data, run patches, or test proposed fixes. This is currently blocking my AI workflow, which previously ran fine on native Ubuntu with the same hardware.


### Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35
    
==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Nvidia driver version        : 596.36
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        39 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               14
On-line CPU(s) list:                  0-13
Vendor ID:                            GenuineIntel
Model name:                           13th Gen Intel(R) Core(TM) i5-13400F
CPU family:                           6
Model:                                191
Thread(s) per core:                   2
Core(s) per socket:                   7
Socket(s):                            1
Stepping:                             2
BogoMIPS:                             4991.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            336 KiB (7 instances)
L1i cache:                            224 KiB (7 instances)
L2 cache:                             8.8 MiB (7 instances)
L3 cache:                             20 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-13
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Vulnerable: No microcode
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cccl==13.2.75
[pip3] nvidia-cuda-crt==13.2.78
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvcc==13.2.78
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] nvidia-nvvm==13.2.78
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu128
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0+cu128
[pip3] torchvision==0.25.0+cu128
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
  	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 				N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_nomugop

</details>


### 🐛 Describe the bug

## Summary

vLLM (and SGLang) cannot load any Qwen3.6 / Qwen3.5 family models (Qwen3_5ForConditionalGeneration / Qwen3_5MoeForConditionalGeneration architecture) on RTX PRO 6000 Blackwell Workstation Edition (96 GB VRAM, sm_120) under WSL2 Ubuntu 22.04.

The model loads to GPU successfully (~30 GB), but allocation of the Mamba state cache fails with `torch.OutOfMemoryError: CUDA out of memory. Tried to allocate X GiB. GPU has 95.59 GiB total of which 50+ GiB is free`.

The "non-PyTorch memory in use" reported by torch is **16 GiB** on this WSL2 setup (vs ~1-2 GiB normal on native Linux). This abnormal CUDA driver overhead consumes memory invisibly and causes the contiguous Mamba state allocation to fail.

## Reproduction

### Hardware

- GPU: NVIDIA RTX PRO 6000 Blackwell Workstation Edition (compute 12.0, sm_120, 97887 MiB VRAM)
- CPU: Intel Core i5-13400F
- RAM: 32 GB

### Software

- Host: Windows 11 IoT Enterprise LTSC, build 26100
- NVIDIA driver: tested **596.36** AND **581.80 (RTX Enterprise)** — same result
- WSL: WSL 2.6.3.0, kernel 6.6.87.2-microsoft-standard-WSL2
- Distro: Ubuntu 22.04.5 LTS
- Python 3.10.12
- CUDA: tested cu128 AND cu130 — same result

### Engine versions tested (all fail with same OOM pattern)

| Engine | Version | torch | CUDA | Result |
|--------|---------|-------|------|--------|
| vLLM | **0.20.0** | 2.11.0 | cu130 | OOM `Tried to allocate 3.77 GiB` |
| vLLM | **0.17.1** | 2.10.0 | cu128 | OOM `Tried to allocate 3.48 GiB` |
| SGLang | **0.5.10.post1** | 2.9.1 | cu128 | OOM `Tried to allocate 18.56 GiB` (worse) |

### Models tested (all fail, all on disk locally)

| Model | Architecture | Mamba allocation requested | Result |
|-------|--------------|---------------------------|--------|
| Qwen/Qwen3.6-27B-FP8 | Qwen3_5ForConditionalGeneration | 3.77 GB | OOM |
| Qwen/Qwen3.6-35B-A3B-FP8 | Qwen3_5MoeForConditionalGeneration (MoE) | 4.99 GB | OOM |

All Qwen3.6 models have `linear_attention` layers (hybrid GDN+Mamba architecture).

### Reproducible command

```bash
# In WSL2 Ubuntu 22.04 with vllm-env activated
HF_HUB_OFFLINE=1 vllm serve Qwen/Qwen3.6-27B-FP8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 1 \
  --enforce-eager \
  --dtype bfloat16 \
  --trust-remote-code \
  --reasoning-parser qwen3
```

## Full error message

    (EngineCore_DP0 pid=1430) ERROR 05-04 11:02:37 [core.py:1100] EngineCore failed to start.
    (EngineCore_DP0 pid=1430) ERROR 05-04 11:02:37 [core.py:1100] torch.OutOfMemoryError: CUDA out of memory.
    Tried to allocate 3.48 GiB.
    GPU 0 has a total capacity of 95.59 GiB of which 50.40 GiB is free.
    Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use.
    Process 2638 has 17179869184.00 GiB memory in use.
    Of the allocated memory 42.49 GiB is allocated by PyTorch,
    and 47.72 MiB is reserved by PyTorch but unallocated.
    If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True
    to avoid fragmentation.

**Note:** The `17179869184.00 GiB` figure is a display bug: the actual value is `17179869184 bytes = 16 GiB` (mislabeled units). So this process has **16 GiB of "non-PyTorch memory"**, which is abnormally high (typical native Linux: 1-2 GiB).
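The unit arithmetic in the note can be checked directly. The sketch below is an illustration, not part of the original report. It evaluates the printed figure under two readings: as a byte count (the interpretation above) and, as an assumption worth ruling out, as a genuine GiB value. The second reading comes out to exactly 2**64 bytes, which could point to a saturated or "not available" counter rather than a measurement; the log alone does not settle which reading is correct.

```python
GIB = 1024 ** 3

# Number printed in the OOM message, where it is labeled "GiB".
reported = 17179869184

# Reading 1 (the issue's interpretation): the number is really a byte count.
as_bytes_in_gib = reported / GIB

# Reading 2 (assumption to rule out): the label is right and the number is GiB.
as_gib_in_bytes = reported * GIB

print(f"read as bytes: {as_bytes_in_gib:.2f} GiB")
print(f"read as GiB:   {as_gib_in_bytes} bytes (equals 2**64: {as_gib_in_bytes == 2**64})")
```

If the second reading applies, the 16 GiB figure would not describe real memory use, so it is worth confirming the overhead with an independent measurement.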

## Stack trace (where the OOM happens)

    File "/home/nomugop/vllm-env/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py",
      line 6537, in _allocate_kv_cache_tensors
        tensor = torch.zeros(...)
    File "/home/nomugop/vllm-env/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py",
      line 538, in initialize_from_config
        self.model_runner.initialize_kv_cache(kv_cache_config)

Profiler reports plenty of memory before OOM:

    INFO 05-04 01:10:54 [gpu_worker.py:440] Available KV cache memory: 60.27 GiB
    INFO 05-04 01:10:54 [kv_cache_utils.py:1716] Maximum concurrency for 16,384 tokens per request: 39.22x

vLLM's profiler reports 60.27 GiB available for KV cache, but the first 3.48 GiB allocation fails, apparently because PyTorch cannot find a contiguous block of that size.
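As a rough cross-check, the figures quoted in the OOM message can be combined into a device-level budget. This is a sketch under the assumption that the total, free, and PyTorch figures in the message are accurate and describe the same instant; the variable names are illustrative and are not vLLM or PyTorch API.

```python
MIB_PER_GIB = 1024

# Figures copied from the OOM message (GiB unless noted).
total_gib = 95.59
free_gib = 50.40
torch_allocated_gib = 42.49
torch_reserved_unallocated_mib = 47.72
requested_gib = 3.48

# Memory held by PyTorch's caching allocator: allocated plus reserved-but-unallocated.
torch_reserved_gib = torch_allocated_gib + torch_reserved_unallocated_mib / MIB_PER_GIB

# Everything in use on the device, by any process.
device_used_gib = total_gib - free_gib

# Remainder not held by PyTorch's allocator (CUDA context, other processes, ...).
non_torch_gib = device_used_gib - torch_reserved_gib

print(f"device used:        {device_used_gib:.2f} GiB")
print(f"PyTorch reserved:   {torch_reserved_gib:.2f} GiB")
print(f"non-PyTorch (est.): {non_torch_gib:.2f} GiB")
print(f"request fits in reported free memory: {requested_gib <= free_gib}")
```

Under these assumptions the remainder is only a few GiB. Comparing that with the 16 GiB reading of the mislabeled figure is a quick way to see whether one of the reported numbers is unreliable on this setup.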

What I have tried (all do not help)

vLLM flags

  • --gpu-memory-utilization: 0.70, 0.85, 0.90, 0.95 — all OOM
  • --max-num-seqs: 1, 4, 8, 16, 32 — all OOM
  • --max-model-len: 4096, 8192, 16384 — all OOM
  • --max-num-batched-tokens 2096 (per #37714 fix) — does not help
  • --num-gpu-blocks-override 8000 — does not help (different allocation size, same OOM)
  • --kv-cache-dtype fp8 — does not help (concurrency profile changes but same alloc OOM)
  • --enable-prefix-caching on/off — does not help
  • --enforce-eager on/off — does not help
  • --reasoning-parser qwen3 — required for Qwen3.6 but does not affect OOM

Environment variables

  • HF_HUB_OFFLINE=1 — required for local model
  • PYTORCH_ALLOC_CONF=expandable_segments:True breaks the CUDA driver entirely on Blackwell, causing RuntimeError: CUDA driver error: unknown error
  • VLLM_USE_FASTSAFETENSORS=1 — does not help

Driver versions

  • 596.36 (CUDA 13) — OOM with 16 GB non-PyTorch overhead
  • 581.80 RTX Enterprise (CUDA 12.8) — OOM with same 16 GB non-PyTorch overhead (driver downgrade does not change anything!)

Other engines

  • SGLang 0.5.10 with --mamba-full-memory-ratio 0.05 + --max-mamba-cache-size 256 + --mem-fraction-static 0.95 + --max-running-requests 1 + --context-length 2048 — still OOM (or RuntimeError: Not enough memory from sglang's own check)

Hypothesis: WSL2 GPU passthrough overhead on Blackwell

The 16 GiB "non-PyTorch memory" is consumed by:

  • DXGI shim layer (/usr/lib/wsl/drivers/...)
  • Hyper-V virtualization
  • CUDA driver context on sm_120 architecture

On native Linux this is typically 1-2 GiB. On WSL2 + Blackwell it's ~16 GiB.

This invisible memory consumption is NOT accounted for by vLLM's profiler when it computes available KV cache. So profiler says "60 GB free for KV cache" but actual usable memory is ~24 GB after subtracting hidden CUDA context overhead. PyTorch tries to allocate the contiguous Mamba state block and fails because the actual free memory is fragmented.
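
To put a number on the overhead described above, one option is to compare what the driver reports as used with what PyTorch's caching allocator has reserved. The sketch below is an assumption about how to measure this, not vLLM's profiler logic. The helper names are hypothetical, and the values in the example are made up for illustration rather than taken from the logs in this issue. Note that torch.cuda.mem_get_info() is device-wide, so memory held by other processes also lands in the "non-PyTorch" bucket.

```python
GIB = 2**30  # bytes per GiB


def non_pytorch_bytes(free: int, total: int, reserved: int) -> int:
    """Device memory in use that this process's PyTorch allocator does not hold.

    free and total are the two values returned by torch.cuda.mem_get_info();
    reserved is the value returned by torch.cuda.memory_reserved().
    """
    return total - free - reserved


def measure() -> int:
    """Query the current CUDA device. Requires PyTorch and a visible GPU."""
    import torch  # imported lazily so the arithmetic above works without torch

    free, total = torch.cuda.mem_get_info()
    reserved = torch.cuda.memory_reserved()
    return non_pytorch_bytes(free, total, reserved)


# Hypothetical values for illustration only (not measurements from this issue).
overhead = non_pytorch_bytes(free=50 * GIB, total=96 * GIB, reserved=30 * GIB)
print(overhead / GIB)  # 16.0
```

Calling measure() right after model load and again just before KV cache allocation would show whether the gap grows during startup or is present from the moment the CUDA context is created.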

Why this is specifically a Mamba/Qwen3.5/3.6 problem

Standard transformer models (Qwen2.5, Llama, Mistral) work fine on this hardware because their KV cache is allocated per-layer in many small blocks, which fits in fragmented free memory.

The Qwen3.5/3.6 hybrid GDN+Mamba architecture requires one large contiguous tensor for the Mamba state cache, sized hidden_size × num_layers × max_running_requests. This single big allocation cannot be satisfied because of fragmentation plus the 16 GiB of hidden overhead.

Why I think this is a vLLM issue (vs PyTorch / NVIDIA / Microsoft)

While the underlying cause may be in WSL2 / NVIDIA driver behavior, vLLM could mitigate it by:

  1. Allocating Mamba state in multiple smaller chunks instead of one contiguous block (similar to how KV cache PagedAttention works)
  2. Accounting for non-PyTorch memory in the KV cache profiler (reading torch.cuda.mem_get_info() and subtracting actual used memory)
  3. Adding a --mamba-cache-chunk-size N flag to let users control allocation granularity
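
Suggestions 1 and 3 amount to splitting one large request into bounded pieces. The sketch below shows only the size planning, under the assumption that the state could be used from non-contiguous chunks, which depends on the kernels and is not confirmed here. plan_chunks is a hypothetical helper, and --mamba-cache-chunk-size is the flag proposed above, not an existing vLLM option.

```python
GIB = 2**30  # bytes per GiB
MIB = 2**20  # bytes per MiB


def plan_chunks(total_bytes: int, chunk_bytes: int) -> list[int]:
    """Split total_bytes into allocation sizes no larger than chunk_bytes."""
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes must be > 0")
    full, rest = divmod(total_bytes, chunk_bytes)
    # All chunks are full-sized except possibly a smaller final remainder.
    return [chunk_bytes] * full + ([rest] if rest else [])


# The failing request from the error message, split into 512 MiB pieces.
request = int(3.48 * GIB)
chunks = plan_chunks(request, 512 * MIB)
print(len(chunks))  # 7
```

Each entry would then back a separate allocation, so no single request needs more than chunk_bytes of contiguous free memory.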

Severity / Impact

  • Critical for anyone wanting to use Qwen3.5/3.6 family on Blackwell workstation cards in WSL2
  • Workstation cards (RTX PRO 6000 Blackwell, RTX 5000 Ada) are common for AI dev environments
  • WSL2 is the standard for AI dev on Windows hosts
  • This blocks an entire model family on a growing hardware platform

Workarounds (none ideal)

  1. Use non-Mamba models (Qwen3-32B-AWQ instead of Qwen3.6) — works but loses access to newest Qwen
  2. Dual-boot native Linux — works but loses Windows
  3. Use Ollama / llama.cpp — works (no PyTorch) but no Qwen3.6 support yet
  4. Cloud API — works but defeats purpose of local GPU

Files attached

  • system-info.txt — full system info dump
  • vllm-error-full.log — full vllm journal from failed startup
  • reproduce.sh — minimal reproduction script

Related

  • Possibly related to vllm#37714 (Qwen3.5 27B+ on Blackwell SM120 + CUDA 13)
  • Possibly related to vllm#37242 (RTX 5090 sm_120 + WSL2 setup)
  • Possibly related to PyTorch CUDA allocator behavior on sm_120

I am happy to provide additional debug data, run patches, or test proposed fixes. This is currently blocking my AI workflow, which previously ran fine on native Ubuntu with the same hardware.


extent analysis

TL;DR

The most likely fix is to modify vLLM to allocate Mamba state in smaller chunks or account for non-PyTorch memory in the KV cache profiler.

Guidance

  • Investigate allocating Mamba state in multiple smaller chunks instead of one contiguous block to mitigate fragmentation issues.
  • Consider adding a --mamba-cache-chunk-size N flag to let users control allocation granularity.
  • Look into accounting for non-PyTorch memory in the KV cache profiler by reading torch.cuda.mem_get_info() and subtracting actual used memory.
  • Switching CUDA or driver versions alone is unlikely to help: drivers 596.36 (CUDA 13) and 581.80 (CUDA 12.8) were both tried above and produced the same OOM.

Example

No patch is provided here, as a fix requires changes inside vLLM itself. Per the stack trace, the failing allocation is in _allocate_kv_cache_tensors (vllm/v1/worker/gpu_model_runner.py).

Notes

The issue seems to be related to the WSL2 GPU passthrough overhead on Blackwell, which consumes a significant amount of memory. The vLLM profiler does not account for this memory, leading to incorrect available memory calculations.

Recommendation

Mitigate the problem in vLLM by allocating Mamba state in smaller chunks or by accounting for non-PyTorch memory in the KV cache profiler. The underlying cause appears to lie in the interaction between vLLM, WSL2, and the NVIDIA driver.
