vllm - [Bug]: MoE + --enable-sleep-mode OOM during weight load, bisected to #41268, root cause in cumem MemPool reclaim (pytorch#159674)

Your current environment

  • vLLM: 0.21.0 (also reproduced on 0.21.x). Clean on 0.18.0 / 0.19.1 / 0.20.0 / 0.20.1 / 0.20.2.
  • GPU: 4x NVIDIA GB200 (184 GiB), TP=4. Also matches the single-H100 report in #34877.
  • Model: Qwen3-235B-A22B-Instruct-2507 (128-expert MoE), --enable-sleep-mode, --gpu-memory-utilization 0.93 --max-num-seqs 1024.

🐛 Describe the bug

Loading a large MoE model with --enable-sleep-mode runs out of memory during weight loading, in the per-expert FlashInfer/TRTLLM block-layout swizzle. The failure is raised from csrc/cumem_allocator.cpp, and the CUDA graph memory profiling stage is never reached.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB.
GPU 3 has a total capacity of 184.31 GiB of which 21.50 MiB is free.
... 181.80 GiB is allocated by PyTorch, with 71.20 GiB allocated in private pools
(e.g., CUDA Graphs), and 13.63 MiB is reserved by PyTorch but unallocated.

The traceback points at convert_moe_weights_to_flashinfer_trtllm_block_layout. This is the same root cause as #34877 (gpt-oss-120b mxfp4 swizzle, single H100, sleep mode) — same cumem_allocator.cpp OOM signature, same ~70 GiB stranded private pool.

Bisection (5 clean / 3 OOM, 4 distinct hosts, identical config)

| vLLM | Result | Peak after load | Private pool at OOM |
| --- | --- | --- | --- |
| 0.18.0 / 0.19.1 / 0.20.0 / 0.20.1 / 0.20.2 | ✅ Clean boot | 109.71 GiB | |
| 0.21.0 (×3 independent runs) | ❌ OOM | 109.71 GiB | 69.99–71.20 GiB |

cumem.py is byte-identical 0.18→0.21. Among the commits between v0.20.2 and v0.21.0, git log v0.20.2..v0.21.0 -S 'max_split_size_mb' returns exactly one PR: #41268 ("Fix OOM by setting PyTorch max_split_size_mb during model loading").

Cherry-pick proof

Applying only #41268's diff (one file, vllm/v1/worker/gpu_worker.py, 3 hunks, applies to v0.20.2 with offset +7) onto a clean v0.20.2 container reproduces the OOM byte-for-byte:

| Stack | _scoped_allocator_max_split(20) | Outcome | Private pool |
| --- | --- | --- | --- |
| v0.20.2 vanilla | ❌ no | ✅ clean, KV 59.7 GiB | |
| v0.21.0 vanilla | ✅ yes | ❌ OOM | 71.20 GiB |
| v0.20.2 + #41268 patch | ✅ yes | ❌ OOM | 71.20 GiB (byte-equal) |

Root cause (two layers)

Layer 1 — #41268's max_split_size_mb=20 disables large-block reuse. During the weights load the CCA marks any free block > 20 MiB as oversize and refuses to split it for a smaller request, and refuses best-fit reuse when the free block is > 20 MiB larger than the request. MoE weight conversion allocates multi-GiB, per-expert, varied-size transient buffers (different layers / quant block layouts). With the cap, expert k's freed 2.0 GiB block cannot serve expert k+1's 1.9 GiB request → a fresh segment every time → high-water mark becomes Σ(all transient buffers) instead of max(one buffer).
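
The Layer 1 effect can be illustrated with a toy model. This is only a sketch under stated assumptions, not PyTorch's allocator: sizes are in MiB, exactly one transient buffer is live at a time, a freed buffer coalesces back into its whole segment, and nothing is ever returned to the driver (as inside a MemPool). The function name `reserved_after_churn` and the constant `SLACK_MIB` are hypothetical, and the 20 MiB reuse slack simply mirrors the rule described above.

```python
from typing import List, Optional

# Assumed reuse rule for oversize blocks: a free block is reused only if it is
# less than 20 MiB larger than the request (mirrors the description above).
SLACK_MIB = 20


def reserved_after_churn(requests_mib: List[int],
                         max_split_mib: Optional[int]) -> int:
    """Return total reserved MiB after an alloc/free churn of transient buffers.

    max_split_mib=None models the uncapped allocator (any large enough free
    segment can be split and reused); an integer models the capped allocator.
    """
    segments: List[int] = []  # every segment is fully free between requests
    for req in requests_mib:
        # Best fit: the smallest free segment that can hold the request.
        candidates = sorted(s for s in segments if s >= req)
        fit = candidates[0] if candidates else None
        oversize = (fit is not None and max_split_mib is not None
                    and fit > max_split_mib)
        if oversize and fit - req >= SLACK_MIB:
            fit = None  # capped: refuse to split or loosely reuse the block
        if fit is None:
            segments.append(req)  # fresh segment, never handed back
    return sum(segments)


# Varied multi-GiB transient buffers, largest first (illustrative sizes only).
churn = [2048, 1900, 1990, 1850, 1960]
print("uncapped reserved MiB:", reserved_after_churn(churn, None))
print("capped reserved MiB:  ", reserved_after_churn(churn, 20))
```

In this toy run the uncapped policy keeps reusing the first segment, so reserved memory stays at the largest single buffer, while the capped policy adds a new segment for every request and ends at the sum of all of them.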

Layer 2 — inside a cumem MemPool, PyTorch will not reclaim freed segments under memory pressure. This is the deeper, pre-existing reason the cap matters at all:

  • Sleep mode requires the cumem MemPool, which forces expandable_segments off (incompatible, pytorch/pytorch#147851) — so the classic split/coalesce allocator is the only memory-bounding mechanism available.
  • torch.cuda.empty_cache() is a no-op for a live MemPool — pytorch/pytorch#145168 is closed, but only the crash was fixed; the maintainer states emptying a live pool is by-design impossible, and torch 2.12.0's empty_cache() still takes no pool argument (the request to add one, pytorch/pytorch#160069, was closed without implementation).
  • The allocator's emergency reclaim-on-OOM-retry (release_cached_blocks) is intentionally disabled inside a MemPool: the pool reuses the cuda-graph-capture code path (captures_underway is overloaded), where freeing pooled memory could break graph capture / NCCL pool registration. This is pytorch/pytorch#159674 — still open ("alloc 60 GiB → free → alloc 70 GiB OOMs on an 80 GiB GPU under use_mem_pool, while the default allocator does not").

So: #41268's cap removed the only mid-load memory-bounding lever (split-reuse) for the MoE allocation pattern, and PyTorch's MemPool won't release the accumulated free segments under pressure. The OOM lands mid-load, before vLLM's existing context-exit snapshot-sweep in cumem.py (which only runs at use_memory_pool exit) can reclaim anything.

Why this is dense-vs-MoE-opposite

For dense models #41268 is correct: without the cap, splitting big segments leaves half-used remainders that the context-exit sweep (which only releases fully-free segments) cannot reclaim → reserved bloat. The cap keeps each segment single-purpose → fully-free → reclaimable. The exact same policy is catastrophic for MoE because the varied-size churn needs split-reuse to stay bounded mid-load.
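
The dense-model side of this trade-off can be sketched the same way. The toy below is an illustration under assumptions, not the real allocator: sizes are in MiB, fragmentation inside a segment is ignored, transient buffers are freed immediately, persistent weights stay live, and the context-exit sweep releases only segments with no live bytes. The names `Segment`, `reserved_after_exit_sweep`, and `SLACK_MIB` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

SLACK_MIB = 20  # assumed reuse slack for oversize blocks, as described above


@dataclass
class Segment:
    size: int      # MiB mapped for this segment
    live: int = 0  # MiB held by tensors that are still alive


def reserved_after_exit_sweep(events: List[Tuple[str, int]],
                              max_split_mib: Optional[int]) -> int:
    """Return MiB still reserved after the exit sweep drops fully free segments."""
    segments: List[Segment] = []
    for kind, req in events:
        target = None
        # Try the segment with the least free space first (best fit).
        for seg in sorted(segments, key=lambda s: s.size - s.live):
            if seg.size - seg.live < req:
                continue
            if max_split_mib is not None and seg.size > max_split_mib:
                # Capped: only a fully free, near-equal-size segment is reusable.
                if seg.live != 0 or seg.size - req >= SLACK_MIB:
                    continue
            target = seg
            break
        if target is None:
            target = Segment(size=req)
            segments.append(target)
        if kind == "persistent":
            target.live += req  # weights stay resident
        # "transient" buffers are freed right away, so live is unchanged.
    # Exit sweep: only segments with live bytes survive.
    return sum(s.size for s in segments if s.live > 0)


# Dense-style load: a large staging buffer, then a smaller resident weight.
load = [("transient", 2048), ("persistent", 512),
        ("transient", 2048), ("persistent", 512)]
print("uncapped reserved MiB:", reserved_after_exit_sweep(load, None))
print("capped reserved MiB:  ", reserved_after_exit_sweep(load, 20))
```

With these illustrative sizes the resident weights total 1024 MiB. The uncapped run carves them out of a 2048 MiB staging segment that the sweep can no longer release, while the capped run keeps the staging segment fully free and ends with exactly the 1024 MiB that is actually in use.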

Proposed direction (not an is_moe bypass)

The real invariant being violated is "fully-free cumem segments must be released promptly; empty_cache being a no-op for live pools means vLLM must do it itself." vLLM already has the working primitive — the snapshot-sweep currently inlined at use_memory_pool exit in cumem.py. Lifting it to run mid-load (e.g. triggered from the cumem malloc callback when mapped bytes approach a high-water mark) bounds reserved memory for both dense and MoE without branching on model type, and does not depend on any (currently-open) PyTorch fix. Happy to send a PR along these lines.
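
A rough sketch of what a mid-load sweep could look like, in toy form. Everything here is hypothetical bookkeeping (`ToyPool`, `peak_mapped`, the `high_water_mib` threshold) and not vLLM's cumem classes or callbacks. The `in_use` set stands in for whatever the real snapshot would report as non-free, and each request gets a fresh segment, as in the capped MoE case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ToyPool:
    """Hypothetical stand-in for cumem bookkeeping; not vLLM's real classes."""
    high_water_mib: int
    mapped: Dict[int, int] = field(default_factory=dict)  # handle -> size (MiB)
    in_use: Set[int] = field(default_factory=set)         # handles with live data
    next_handle: int = 0

    def mapped_mib(self) -> int:
        return sum(self.mapped.values())

    def sweep(self) -> int:
        """Unmap every fully free segment and return the MiB released."""
        free_handles = [h for h in self.mapped if h not in self.in_use]
        return sum(self.mapped.pop(h) for h in free_handles)

    def malloc(self, size_mib: int) -> int:
        """Toy malloc callback: sweep first if the new segment would cross the mark."""
        if self.mapped_mib() + size_mib > self.high_water_mib:
            self.sweep()
        handle = self.next_handle
        self.next_handle += 1
        self.mapped[handle] = size_mib
        self.in_use.add(handle)
        return handle

    def free(self, handle: int) -> None:
        """The buffer dies, but its segment stays mapped until a sweep runs."""
        self.in_use.discard(handle)


def peak_mapped(requests_mib: List[int], high_water_mib: int) -> int:
    """Run an alloc/free churn and return the peak mapped MiB."""
    pool = ToyPool(high_water_mib=high_water_mib)
    peak = 0
    for req in requests_mib:
        handle = pool.malloc(req)
        peak = max(peak, pool.mapped_mib())
        pool.free(handle)
    return peak


churn = [2048, 1900, 1990, 1850, 1960]
print("peak with sweep at 4096 MiB:", peak_mapped(churn, 4096))
print("peak with no effective mark:", peak_mapped(churn, 10**9))
```

The point of the design is that the sweep trigger depends only on mapped bytes, not on model type, so the same mechanism would bound both the dense and the MoE allocation patterns in this toy setting.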

A minimal synthetic reproducer (varied-size large-buffer churn inside use_memory_pool(tag="weights") with max_split_size_mb=20) is available, plus the full Qwen3-235B config + Docker image + cherry-pick patch + DEBUG logs.

Related

  • #34877 — same root cause (gpt-oss-120b mxfp4, single H100)
  • #41268 — the PR that introduced the regression for MoE
  • pytorch/pytorch#159674 (open) — MemPool no reclaim-on-pressure
  • pytorch/pytorch#145168 (closed; empty_cache no-op for live pool by design)
  • pytorch/pytorch#147851 — expandable_segments incompatible with MemPool
