vllm - [Bug]: MoE + --enable-sleep-mode OOM during weight load, bisected to #41268, root cause in cumem MemPool reclaim (pytorch#159674)

Fix Action

Fix / Workaround


A minimal synthetic reproducer (varied-size large-buffer churn inside use_memory_pool(tag="weights") with max_split_size_mb=20) is available, along with the full Qwen3-235B config, Docker image, cherry-pick patch, and DEBUG logs.


Your current environment

vLLM: 0.21.0 (also reproduced on 0.21.x). Clean on 0.18.0 / 0.19.1 / 0.20.0 / 0.20.1 / 0.20.2.
GPU: 4x NVIDIA GB200 (184 GiB), TP=4. Also matches the single-H100 report in #34877.
Model: Qwen3-235B-A22B-Instruct-2507 (128-expert MoE), --enable-sleep-mode, --gpu-memory-utilization 0.93 --max-num-seqs 1024.

🐛 Describe the bug

Loading a large MoE model with --enable-sleep-mode OOMs during weight loading, in the per-expert FlashInfer/TRTLLM block-layout swizzle. The error is raised from csrc/cumem_allocator.cpp, and startup never reaches the CUDA graph memory profiling stage.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.
GPU 3 has a total capacity of 184.31 GiB of which 21.50 MiB is free.
... 181.80 GiB is allocated by PyTorch, with 71.20 GiB allocated in private pools
(e.g., CUDA Graphs), and 13.63 MiB is reserved by PyTorch but unallocated.

The traceback points at convert_moe_weights_to_flashinfer_trtllm_block_layout. This is the same root cause as #34877 (gpt-oss-120b mxfp4 swizzle, single H100, sleep mode) — same cumem_allocator.cpp OOM signature, same ~70 GiB stranded private pool.

Bisection (5 clean / 3 OOM, 4 distinct hosts, identical config)

| vLLM | Result | Peak after load | Private pool at OOM |
| --- | --- | --- | --- |
| 0.18.0 / 0.19.1 / 0.20.0 / 0.20.1 / 0.20.2 | ✅ Clean boot | 109.71 GiB | n/a |
| 0.21.0 (×3 independent runs) | ❌ OOM | 109.71 GiB | 69.99–71.20 GiB |

cumem.py is byte-identical 0.18→0.21. Among the commits between v0.20.2 and v0.21.0, git log v0.20.2..v0.21.0 -S 'max_split_size_mb' returns exactly one PR: #41268 ("Fix OOM by setting PyTorch max_split_size_mb during model loading").

Cherry-pick proof

Applying only #41268's diff (one file, vllm/v1/worker/gpu_worker.py, 3 hunks, applies to v0.20.2 with offset +7) onto a clean v0.20.2 container reproduces the OOM byte-for-byte:

| Stack | `_scoped_allocator_max_split(20)` | Outcome | Private pool |
| --- | --- | --- | --- |
| v0.20.2 vanilla | ❌ no | ✅ clean, KV 59.7 GiB | n/a |
| v0.21.0 vanilla | ✅ yes | ❌ OOM | 71.20 GiB |
| v0.20.2 + #41268 patch | ✅ yes | ❌ OOM | 71.20 GiB (byte-equal) |

Root cause (two layers)

Layer 1 — #41268's max_split_size_mb=20 disables large-block reuse. During the weights load the CCA marks any free block > 20 MiB as oversize and refuses to split it for a smaller request, and refuses best-fit reuse when the free block is > 20 MiB larger than the request. MoE weight conversion allocates multi-GiB, per-expert, varied-size transient buffers (different layers / quant block layouts). With the cap, expert k's freed 2.0 GiB block cannot serve expert k+1's 1.9 GiB request → a fresh segment every time → high-water mark becomes Σ(all transient buffers) instead of max(one buffer).
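The reuse rule in Layer 1 can be illustrated with a toy model. This is a sketch, not PyTorch's allocator code: `can_reuse`, `reserved_after_churn`, and the four buffer sizes are hypothetical, and it assumes each transient buffer is freed before the next one is requested, which simplifies the real load.

```python
K_LARGE_BUFFER = 20  # MiB of slack tolerated when an oversize block is reused


def can_reuse(block, request, max_split):
    """Toy version of the reuse rule described above (all sizes in MiB)."""
    if block < request:
        return False
    if max_split is None:
        return True  # no cap: any large-enough free block may be split and reused
    if request < max_split and block >= max_split:
        return False  # an oversize block is never split for a small request
    if request >= max_split and block >= request + K_LARGE_BUFFER:
        return False  # oversize request: the block may exceed it by < 20 MiB only
    return True


def reserved_after_churn(sizes, max_split):
    """Allocate and immediately free each buffer; return total reserved MiB."""
    free_blocks = []
    reserved = 0
    for request in sizes:
        fits = [b for b in free_blocks if can_reuse(b, request, max_split)]
        if fits:
            block = min(fits)  # best fit among reusable blocks
            free_blocks.remove(block)
        else:
            block = request  # nothing reusable: map a fresh segment
            reserved += block
        free_blocks.append(block)  # buffer freed; the whole block stays cached
    return reserved


sizes = [2048, 1945, 1843, 1740]  # roughly 2.0, 1.9, 1.8, 1.7 GiB (illustrative)
print(reserved_after_churn(sizes, max_split=None))  # 2048: max(one buffer)
print(reserved_after_churn(sizes, max_split=20))    # 7576: sum(all buffers)
```

In this model the uncapped run keeps reusing the first 2048 MiB segment, while the capped run maps a new segment for every request because each cached block is more than 20 MiB larger than the next request.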

Layer 2 — inside a cumem MemPool, PyTorch will not reclaim freed segments under memory pressure. This is the deeper, pre-existing reason the cap matters at all:

- Sleep mode requires the cumem MemPool, which forces expandable_segments off (the two are incompatible, pytorch/pytorch#147851), so the classic split/coalesce allocator is the only memory-bounding mechanism available.
- torch.cuda.empty_cache() is a no-op for a live MemPool. pytorch/pytorch#145168 is closed, but only the crash was fixed; the maintainer states that emptying a live pool is impossible by design, and torch 2.12.0's empty_cache() still takes no pool argument (the request to add one, pytorch/pytorch#160069, was closed without implementation).
- The allocator's emergency reclaim-on-OOM-retry (release_cached_blocks) is intentionally disabled inside a MemPool: the pool reuses the cuda-graph-capture code path (captures_underway is overloaded), where freeing pooled memory could break graph capture / NCCL pool registration. This is pytorch/pytorch#159674, still open ("alloc 60 GiB → free → alloc 70 GiB OOMs on an 80 GiB GPU under use_mem_pool, while the default allocator does not").
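The scenario quoted from pytorch/pytorch#159674 reduces to simple arithmetic. The sketch below is a toy model with a hypothetical helper, `second_alloc_succeeds`; it does not call PyTorch and only encodes the two behaviours described above (reclaim on OOM retry versus no reclaim inside a MemPool).

```python
def second_alloc_succeeds(capacity_gib, first_gib, second_gib, reclaim_on_oom):
    """Allocate `first_gib`, free it (the segment stays cached), then try `second_gib`."""
    cached = first_gib
    if second_gib <= cached:
        return True  # the cached segment is large enough to serve the request
    if cached + second_gib <= capacity_gib:
        return True  # a new segment fits next to the cached one
    if reclaim_on_oom:
        cached = 0  # default allocator: release cached blocks, then retry
        return cached + second_gib <= capacity_gib
    return False  # inside a MemPool: the cached 60 GiB is never released


# alloc 60 GiB -> free -> alloc 70 GiB on an 80 GiB GPU
print(second_alloc_succeeds(80, 60, 70, reclaim_on_oom=True))   # True
print(second_alloc_succeeds(80, 60, 70, reclaim_on_oom=False))  # False
```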

So: #41268's cap removed the only mid-load memory-bounding lever (split-reuse) for the MoE allocation pattern, and PyTorch's MemPool won't release the accumulated free segments under pressure. The OOM lands mid-load, before vLLM's existing context-exit snapshot-sweep in cumem.py (which only runs at use_memory_pool exit) can reclaim anything.

Why this is dense-vs-MoE-opposite

For dense models #41268 is correct: without the cap, splitting big segments leaves half-used remainders that the context-exit sweep (which only releases fully-free segments) cannot reclaim → reserved bloat. The cap keeps each segment single-purpose → fully-free → reclaimable. The exact same policy is catastrophic for MoE because the varied-size churn needs split-reuse to stay bounded mid-load.
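The dense-model side of this trade-off can be sketched the same way. The segment layouts below are made-up examples, and `sweep_fully_free` is a hypothetical stand-in for the context-exit sweep; the only behaviour it assumes is the one stated above, that a segment is released only when nothing in it is live.

```python
def sweep_fully_free(segments):
    """Release segments with no live bytes; return (remaining, released_mib)."""
    remaining = [s for s in segments if s["live_mib"] > 0]
    released = sum(s["size_mib"] for s in segments if s["live_mib"] == 0)
    return remaining, released


# Uncapped: a 64 MiB persistent tensor was carved out of a freed 2048 MiB segment.
uncapped = [{"size_mib": 2048, "live_mib": 64}]
# Capped: the transient buffer and the persistent tensor sit in separate segments.
capped = [{"size_mib": 2048, "live_mib": 0}, {"size_mib": 64, "live_mib": 64}]

for name, segments in (("uncapped", uncapped), ("capped", capped)):
    remaining, released = sweep_fully_free(segments)
    reserved = sum(s["size_mib"] for s in remaining)
    print(name, "released:", released, "still reserved:", reserved)
# uncapped released: 0 still reserved: 2048
# capped released: 2048 still reserved: 64
```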

Proposed direction (not an `is_moe` bypass)

The real invariant being violated is "fully-free cumem segments must be released promptly; empty_cache being a no-op for live pools means vLLM must do it itself." vLLM already has the working primitive — the snapshot-sweep currently inlined at use_memory_pool exit in cumem.py. Lifting it to run mid-load (e.g. triggered from the cumem malloc callback when mapped bytes approach a high-water mark) bounds reserved memory for both dense and MoE without branching on model type, and does not depend on any (currently-open) PyTorch fix. Happy to send a PR along these lines.
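A rough sketch of the proposed direction, under stated assumptions: `ToyPool`, its methods, and the 4096 MiB threshold are hypothetical and do not correspond to anything in cumem.py. The model assumes the capped case from Layer 1, where every request maps a fresh segment, and sweeps fully-free segments whenever a request would push mapped bytes past the high-water mark.

```python
class ToyPool:
    """Hypothetical stand-in for the pool's bookkeeping (sizes in MiB)."""

    def __init__(self, high_water_mib):
        self.high_water_mib = high_water_mib
        self.segments = {}  # handle -> (size_mib, in_use)
        self.peak_mib = 0
        self._next_handle = 0

    def mapped_mib(self):
        return sum(size for size, _ in self.segments.values())

    def sweep(self):
        """Unmap every fully-free segment; return the MiB released."""
        free = [h for h, (_, in_use) in self.segments.items() if not in_use]
        return sum(self.segments.pop(h)[0] for h in free)

    def malloc(self, size_mib):
        # Sweep before mapping if this request would cross the high-water mark.
        if self.mapped_mib() + size_mib > self.high_water_mib:
            self.sweep()
        handle = self._next_handle
        self._next_handle += 1
        self.segments[handle] = (size_mib, True)
        self.peak_mib = max(self.peak_mib, self.mapped_mib())
        return handle

    def free(self, handle):
        size_mib, _ = self.segments[handle]
        self.segments[handle] = (size_mib, False)  # cached: still mapped


def peak_for(sizes, high_water_mib):
    pool = ToyPool(high_water_mib)
    for size in sizes:
        pool.free(pool.malloc(size))  # transient buffer: allocate, then free
    return pool.peak_mib


sizes = [2048, 1945, 1843, 1740]  # illustrative transient buffers
print(peak_for(sizes, high_water_mib=10**9))  # 7576: threshold never reached
print(peak_for(sizes, high_water_mib=4096))   # 3993: swept before the third buffer
```

The sweep here is the same "release fully-free segments" operation in both runs; only the trigger differs, which is why the approach would not need to branch on model type.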

Related

- #34877: same root cause (gpt-oss-120b mxfp4, single H100)
- #41268: the PR that introduced the regression for MoE
- pytorch/pytorch#159674 (open): MemPool does not reclaim under memory pressure
- pytorch/pytorch#145168 (closed): empty_cache is a no-op for a live pool by design
- pytorch/pytorch#147851: expandable_segments incompatible with MemPool
