vllm - 💡(How to fix) Fix [Bug]: expandable_segments:True rejected with SimpleCPUOffloadConnector — no opt-in mechanism for DMA-only connectors


Error Message

ValueError: KV connector SimpleCPUOffloadConnector is incompatible with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True unless enable_sleep_mode is also enabled. PyTorch's CUDA VMM allocator can remap KV cache virtual addresses, which may invalidate connectors that hold direct references to GPU memory.

Root Cause

Expected behavior: SimpleCPUOffloadConnector should be allowed with expandable_segments:True because it only uses DMA transfers (memcpy), not RDMA or pinned memory registrations that are affected by VMM remapping.

Fix Action

Fix / Workaround

Users on memory-constrained GPUs (12–24 GiB) who need both KV cache offloading to CPU and expandable_segments:True are blocked with no workaround. Sleep mode (--enable-sleep-mode) satisfies the check but adds ~868 MiB/GPU overhead for CuMemAllocator pools, which is too much for already-tight memory budgets.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Arch Linux (x86_64)
GCC version                  : (GCC) 16.1.1 20260430
Clang version                : 22.1.5
CMake version                : version 4.3.2
Libc version                 : glibc-2.43

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+cu130
Is debug build               : False
CUDA used to build PyTorch   : 13.0
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-7.0.3-arch1-2-x86_64-with-glibc2.43

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 13.2.78
CUDA_MODULE_LOADING set to   :
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 3060
GPU 1: NVIDIA GeForce RTX 3060
GPU 2: NVIDIA GeForce RTX 3060

Nvidia driver version        : 595.71.05
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           43 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  64
On-line CPU(s) list:                     0-63
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 7551P 32-Core Processor
CPU family:                              23
Model:                                   1
Thread(s) per core:                      2
Core(s) per socket:                      32
Socket(s):                               1
Stepping:                                2
Microcode version:                       0x8001227
Frequency boost:                         enabled
CPU max MHz:                             2000.0000
CPU min MHz:                             1200.0000
BogoMIPS:                                3992.47
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization:                          AMD-V
L1d cache:                               1 MiB (32 instances)
L1i cache:                               2 MiB (32 instances)
L2 cache:                                16 MiB (32 instances)
L3 cache:                                64 MiB (8 instances)
NUMA node(s):                            4
NUMA node0 CPU(s):                       0-7,32-39
NUMA node1 CPU(s):                       8-15,40-47
NUMA node2 CPU(s):                       16-23,48-55
NUMA node3 CPU(s):                       24-31,56-63
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.8.post1
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-cufile==1.15.1.6
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-cutlass-dsl==4.5.0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvshmem-cu13==3.4.5
[pip3] nvidia-nvtx==13.0.85
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] transformers==5.8.0
[pip3] triton==3.6.0
[conda] No relevant packages

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     8-15,40-47      1               N/A
GPU1    PHB      X      SYS     8-15,40-47      1               N/A
GPU2    SYS     SYS      X      16-23,48-55     2               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
CUDA_PATH=/opt/cuda
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_tib
</details>

🐛 Describe the bug

PR #41237 added a blanket rejection of all KV connectors when PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True is set (unless --enable-sleep-mode is also enabled). The reasoning is correct for RDMA-based connectors like NixlConnector and MooncakeConnectorV1 — VMM can remap KV cache pages, invalidating registered memory regions.

However, SimpleCPUOffloadConnector uses DMA (memcpy via torch.cat/tensor slicing) — not RDMA or pinned memory registrations. These operations go through the CUDA allocator and are transparent to VMM remapping. The blanket rejection is a false positive for this connector.
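
To make the blanket check concrete, here is a minimal sketch of the kind of validation described above. It is not vLLM's actual code: the function names `parse_alloc_conf`, `uses_expandable_segments`, and `check_kv_connector` are hypothetical, and the only facts taken from the issue are the environment variable, the sleep-mode exemption, and the error text. The parser assumes the usual comma-separated `key:value` layout of PYTORCH_CUDA_ALLOC_CONF.

```python
from __future__ import annotations

import os


def parse_alloc_conf(raw: str) -> dict[str, str]:
    """Parse a PYTORCH_CUDA_ALLOC_CONF-style string of comma-separated key:value pairs."""
    options: dict[str, str] = {}
    for item in raw.split(","):
        key, sep, value = item.strip().partition(":")
        if sep:  # skip empty or malformed entries
            options[key.strip()] = value.strip()
    return options


def uses_expandable_segments(raw: str | None = None) -> bool:
    """Return True if expandable_segments:True appears in the allocator config."""
    if raw is None:
        raw = os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
    return parse_alloc_conf(raw).get("expandable_segments", "").lower() == "true"


def check_kv_connector(
    connector_name: str | None,
    enable_sleep_mode: bool,
    alloc_conf: str | None = None,
) -> None:
    """Hypothetical blanket check: every connector is rejected, whatever its transfer type."""
    if connector_name is None or enable_sleep_mode:
        return
    if uses_expandable_segments(alloc_conf):
        raise ValueError(
            f"KV connector {connector_name} is incompatible with "
            "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True unless "
            "enable_sleep_mode is also enabled."
        )
```

In this sketch the connector name is only used for the message, so the check cannot tell a DMA-only connector from an RDMA-based one. That is the false positive this issue reports.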

Impact: Users on memory-constrained GPUs (12–24 GiB) who need both:

  • KV cache offloading to CPU (to serve models whose weights barely fit in VRAM)
  • expandable_segments:True (to reduce memory fragmentation and avoid OOM during prefill)

…are blocked with no workaround. Sleep mode (--enable-sleep-mode) adds ~868 MiB/GPU overhead for CuMemAllocator pools, which is too much for already-tight memory budgets.
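
For a rough sense of scale, the short calculation below expresses the reported ~868 MiB sleep-mode overhead as a fraction of total VRAM for the 12 GiB and 24 GiB ends of the range mentioned above. The helper name `sleep_mode_overhead_fraction` is made up for this illustration, and the 868 MiB figure is taken from this report, not measured here.

```python
MIB_PER_GIB = 1024


def sleep_mode_overhead_fraction(vram_gib: float, overhead_mib: float = 868.0) -> float:
    """Fraction of total VRAM consumed by a fixed per-GPU overhead."""
    return overhead_mib / (vram_gib * MIB_PER_GIB)


if __name__ == "__main__":
    for vram_gib in (12, 24):
        fraction = sleep_mode_overhead_fraction(vram_gib)
        print(f"{vram_gib} GiB GPU: {fraction:.1%} of VRAM")
```

On a 12 GiB card this works out to roughly 7% of VRAM, and on a 24 GiB card roughly 3.5%.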

Reproduction:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python3 -m vllm.entrypoints.openai.api_server \
  Qwen/Qwen3-30B-A3B \
  --pipeline-parallel-size 3 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.94 \
  --enforce-eager \
  --kv-offloading-backend native \
  --kv-offloading-size 16

Error:

ValueError: KV connector SimpleCPUOffloadConnector is incompatible with
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True unless enable_sleep_mode
is also enabled. PyTorch's CUDA VMM allocator can remap KV cache virtual
addresses, which may invalidate connectors that hold direct references to
GPU memory.

Expected behavior: SimpleCPUOffloadConnector should be allowed with expandable_segments:True because it only uses DMA transfers (memcpy), not RDMA or pinned memory registrations that are affected by VMM remapping.

Proposed fix: Add a SupportsVmmSafeTransfers marker class that connectors can opt into by inheriting. The config validation checks for this marker instead of rejecting all connectors. This is consistent with how supports_hma() already works in the connector framework (plain issubclass check). DMA-only connectors opt in; RDMA-based connectors remain rejected.
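
A minimal sketch of the opt-in idea follows. Only the name `SupportsVmmSafeTransfers` and the plain `issubclass` check come from the proposal above. `KVConnectorBase`, the stub connector classes, and `validate_connector` are hypothetical stand-ins so the example runs without vLLM installed. They are not the project's real classes or signatures.

```python
class KVConnectorBase:
    """Stand-in for the connector base class (hypothetical, not vLLM's)."""


class SupportsVmmSafeTransfers:
    """Marker class: a connector inherits from it to opt in under expandable segments."""


class SimpleCPUOffloadConnector(KVConnectorBase, SupportsVmmSafeTransfers):
    """Stub for a DMA-only connector that opts in."""


class NixlConnector(KVConnectorBase):
    """Stub for an RDMA-based connector that does not opt in."""


def supports_vmm_safe_transfers(connector_cls: type) -> bool:
    """Plain issubclass check, mirroring the style the proposal attributes to supports_hma()."""
    return issubclass(connector_cls, SupportsVmmSafeTransfers)


def validate_connector(
    connector_cls: type,
    expandable_segments: bool,
    enable_sleep_mode: bool,
) -> None:
    """Reject only connectors that have not opted in (hypothetical validation)."""
    if not expandable_segments or enable_sleep_mode:
        return
    if supports_vmm_safe_transfers(connector_cls):
        return
    raise ValueError(
        f"KV connector {connector_cls.__name__} is incompatible with "
        "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True unless "
        "enable_sleep_mode is also enabled."
    )
```

With this shape, `validate_connector(SimpleCPUOffloadConnector, True, False)` returns without error, while `validate_connector(NixlConnector, True, False)` still raises `ValueError`. A marker class keeps the default conservative: a new connector is rejected until its author explicitly opts in.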

Cross-references: #41612 (this issue is a follow-up to it), #23087 (demand for the "weights on GPU, KV on CPU" pattern), #19854 (canonical KV offloading RFC), PR #41237 (source of the blanket rejection)
