vllm - ✅ (Solved) Fix cumem allocator: double-free and stale error codes during sleep/wake cycles [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#36651 (fetched 2026-04-08 00:35:42)
Comments: 0 · Participants: 1 · Timeline: 1 (cross-referenced ×1) · Reactions: 0

Fix Action

PR fix notes

PR #36535: Patch for vLLM + FlashAttention4 + torch for GRPO colocated training

Description (problem / solution / changelog)

Fixes https://github.com/vllm-project/vllm/issues/36651

Summary

  • cumem.py: Track per-allocation mapped state to prevent double cuMemRelease when sleep/wake_up are called on already-unmapped or already-mapped allocations. Add a sleeping flag so _python_free_callback returns a dummy handle during sleep, avoiding CUDA ops on freed memory. Remove gc.collect + empty_cache in sleep() which can trigger frees during the sleeping window.
  • cumem_allocator.cpp: Clear stale error_code at the top of create_and_map, unmap_and_release, and my_free to prevent leftover errors from previous operations propagating. Skip CUDA ops in my_free when recv_d_mem == 0 (sleeping block already freed). Fix size mismatch by using recv_size instead of size in unmap_and_release and cuMemAddressFree calls. Log a warning when unmap_and_release fails and cuMemAddressFree is skipped to make virtual address leaks visible.
  • rotary_embedding/common.py: Replace find_spec check with try/except for flash_attn.ops import to support Flash Attention 4 which restructured its module layout.

Test plan

  • Verify sleep() / wake_up() cycles work without CUDA errors in colocated training
  • Verify no regressions in standard vLLM inference with cumem allocator enabled
  • Verify rotary embedding works with both Flash Attention 2/3 and Flash Attention 4

🤖 Generated with Claude Code

Changed files

  • csrc/cumem_allocator.cpp (modified, +17/-2)
  • vllm/device_allocator/cumem.py (modified, +14/-5)
  • vllm/model_executor/layers/rotary_embedding/common.py (modified, +3/-2)
Your current environment

vLLM v0.17.0 (also reproducible on main) with cumem allocator enabled, using sleep/wake cycles for colocated GRPO training.

How would you like to use vllm

Colocated training where vLLM's GPU memory is released via sleep() during training steps and reclaimed via wake_up() for inference.

Description

Several bugs in the cumem allocator cause CUDA driver errors during sleep/wake cycles:

1. Double cuMemRelease on already-unmapped allocations

sleep() iterates all allocations and calls unmap_and_release unconditionally. If some allocations are already unmapped (e.g. from a previous partial sleep or external release), this causes a double cuMemRelease. Similarly, wake_up() calls create_and_map on already-mapped allocations. There is no per-allocation tracking of mapped state.
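A minimal sketch of per-allocation mapped-state tracking, assuming a simplified pure-Python stand-in for the allocator. `Allocation`, `ToyAllocator`, and the counters are hypothetical names used only for illustration; the real `sleep()` and `wake_up()` in `cumem.py` call into the C++ extension instead of incrementing counters.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """Hypothetical bookkeeping record for one allocation."""
    ptr: int
    size: int
    mapped: bool = True  # new allocations start out mapped


class ToyAllocator:
    """Stand-in allocator that only counts map/unmap operations."""

    def __init__(self) -> None:
        self.allocations: dict[int, Allocation] = {}
        self.release_calls = 0  # stands in for unmap_and_release
        self.map_calls = 0      # stands in for create_and_map

    def add(self, ptr: int, size: int) -> None:
        self.allocations[ptr] = Allocation(ptr, size)

    def sleep(self) -> None:
        for alloc in self.allocations.values():
            if not alloc.mapped:
                continue  # already unmapped: a second release would be a double free
            self.release_calls += 1
            alloc.mapped = False

    def wake_up(self) -> None:
        for alloc in self.allocations.values():
            if alloc.mapped:
                continue  # already mapped: nothing to recreate
            self.map_calls += 1
            alloc.mapped = True
```

With the flag in place, calling `sleep()` twice in a row releases each allocation only once, and the same holds for `wake_up()`.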

2. CUDA ops on freed memory during sleep

When the allocator is sleeping, PyTorch's garbage collector can trigger my_free on tensors whose backing memory was already released by sleep(). The C++ my_free then calls unmap_and_release on invalid handles, causing CUDA driver errors. Additionally, gc.collect() and torch.cuda.empty_cache() at the end of sleep() can trigger these frees while the allocator is in the sleeping state.
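The guard described in the PR notes (a sleeping flag on the Python side, plus a skip in `my_free` when the returned `recv_d_mem` is 0) can be sketched roughly as follows. This is a pure-Python model under stated assumptions: `SleepAwareRegistry`, `native_free`, and the tuple layout are hypothetical, and the list of recorded calls stands in for real CUDA driver calls.

```python
class SleepAwareRegistry:
    """Hypothetical model of the Python-side free callback."""

    # (device, size, d_mem, handle); d_mem == 0 tells the native side to skip
    DUMMY = (0, 0, 0, 0)

    def __init__(self) -> None:
        self.sleeping = False
        self.records: dict[int, tuple[int, int, int, int]] = {}

    def register(self, ptr: int, device: int, size: int, d_mem: int, handle: int) -> None:
        self.records[ptr] = (device, size, d_mem, handle)

    def free_callback(self, ptr: int) -> tuple[int, int, int, int]:
        record = self.records.pop(ptr)  # bookkeeping is dropped either way
        if self.sleeping:
            # Backing memory was already released by sleep(); hand back a dummy.
            return self.DUMMY
        return record


def native_free(registry: SleepAwareRegistry, ptr: int, cuda_calls: list) -> None:
    """Stand-in for my_free: skips CUDA work when recv_d_mem == 0."""
    _device, recv_size, recv_d_mem, _handle = registry.free_callback(ptr)
    if recv_d_mem == 0:
        return
    cuda_calls.append(("unmap_and_release", recv_d_mem, recv_size))
```

A free that arrives while `sleeping` is set records no CUDA call, which is the behavior the issue asks for when garbage collection runs during the sleeping window.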

3. Stale global error_code propagation

error_code is a file-scope global in cumem_allocator.cpp that is not cleared between calls. A non-fatal error in one operation (e.g. create_and_map) can persist and cause a subsequent unrelated call (e.g. unmap_and_release or my_free) to take the wrong code path.
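The failure mode can be reproduced in miniature with a module-level variable. The sketch below is a Python analogue, not the C++ code: the function names mirror those in the issue, but the bodies, the `fail` and `clear_first` parameters, and the error value are illustrative assumptions.

```python
NO_ERROR = 0
error_code = NO_ERROR  # module-level, like the file-scope global in the C++ file


def create_and_map(fail: bool, clear_first: bool) -> None:
    global error_code
    if clear_first:
        error_code = NO_ERROR  # the fix: reset at the top of every entry point
    if fail:
        error_code = 2  # arbitrary non-zero value standing in for a driver error


def unmap_and_release(clear_first: bool) -> str:
    global error_code
    if clear_first:
        error_code = NO_ERROR
    # The unmap itself succeeds here, so only a leftover value can be non-zero.
    return "error path" if error_code != NO_ERROR else "normal path"
```

Without the reset, a failed `create_and_map` sends the next `unmap_and_release` down the error path even though the unmap succeeded; with the reset, each call only sees its own errors.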

4. Size mismatch in my_free

my_free passes its size argument to unmap_and_release and cuMemAddressFree, but the correct size is recv_size returned from the Python callback. This can cause incorrect unmapping.
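The issue does not say why `size` and `recv_size` differ. One plausible cause, assumed here purely for illustration, is that the allocation size is rounded up to the driver's allocation granularity when the memory is created, so the size recorded by the Python side is larger than the size the caller requested. The 2 MiB granularity below is also an assumption.

```python
def align_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity."""
    return ((size + granularity - 1) // granularity) * granularity


GRANULARITY = 2 * 1024 * 1024  # assumed granularity, not taken from the issue

requested = 3 * 1024 * 1024                   # what the caller asked for
recv_size = align_up(requested, GRANULARITY)  # what was actually mapped
```

Under these assumptions a 3 MiB request maps 4 MiB, so unmapping or freeing the address range with the requested size would cover only part of the range.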

5. Flash Attention 4 rotary embedding import failure

find_spec("flash_attn") succeeds with FA4 installed, but from flash_attn.ops.triton.rotary import apply_rotary fails because FA4 restructured its module layout. This causes an unhandled ImportError.
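The gap between "the package exists" and "the submodule imports" can be shown with the standard library alone. In this sketch `has_package` and `try_import_attr` are hypothetical helper names, and `json` stands in for `flash_attn` so the example runs without Flash Attention installed.

```python
from importlib import import_module
from importlib.util import find_spec


def has_package(name: str) -> bool:
    """True if the top-level package can be located (the old style of check)."""
    return find_spec(name) is not None


def try_import_attr(module_path: str, attr: str):
    """Import module_path and return attr, or None if the import fails."""
    try:
        module = import_module(module_path)
    except ImportError:  # also covers ModuleNotFoundError
        return None
    return getattr(module, attr, None)


# "json" is installed, but "json.no_such_submodule" does not exist:
package_found = has_package("json")
missing = try_import_attr("json.no_such_submodule", "anything")
```

Here `package_found` is true while `missing` is `None`, which mirrors the FA4 situation. Applied to the real case, a call such as `try_import_attr("flash_attn.ops.triton.rotary", "apply_rotary")` would return `None` when the module path is unavailable instead of raising.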

Fix

https://github.com/vllm-project/vllm/pull/36535

extent analysis

Fix Plan

To address the issues in the cumem allocator, follow these steps:

  • 1. Track allocation state: Introduce a per-allocation tracking mechanism to prevent double cuMemRelease and create_and_map calls.
  • 2. Guard frees during sleep: Add a sleeping flag so that frees triggered by PyTorch's garbage collector while the allocator is asleep skip CUDA ops, and remove the gc.collect() and torch.cuda.empty_cache() calls from sleep() that can trigger such frees.
  • 3. Clear global error_code: Clear the error_code global variable between calls to prevent stale error propagation.
  • 4. Fix size mismatch in my_free: Pass the correct recv_size to unmap_and_release and cuMemAddressFree in my_free.
  • 5. Update Flash Attention import: Replace the find_spec check with a try/except around the flash_attn.ops import so that Flash Attention 4's restructured module layout is handled.

Example Code Changes

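// Track per-allocation state (C++-style pseudocode; in the PR this tracking
// lives in the Python sleep()/wake_up() methods in cumem.py)
// A missing key defaults to false, meaning the allocation is still mapped.
std::unordered_map<void*, bool> is_unmapped;

void sleep() {
    // ...
    for (auto& allocation : allocations) {
        if (!is_unmapped[allocation]) {  // skip allocations already released
            unmap_and_release(allocation);
            is_unmapped[allocation] = true;
        }
    }
    // ...
}

void wake_up() {
    // ...
    for (auto& allocation : allocations) {
        if (is_unmapped[allocation]) {  // skip allocations already mapped
            create_and_map(allocation);
            is_unmapped[allocation] = false;
        }
    }
    // ...
}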

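// Avoid CUDA ops on freed memory during sleep
// Per the PR, sleep() no longer calls gc.collect() or torch.cuda.empty_cache(),
// and the Python free callback returns a dummy handle while sleeping.
// should_skip_free is a hypothetical helper name for the check in my_free.
bool should_skip_free(CUdeviceptr recv_d_mem) {
    return recv_d_mem == 0;  // block was already released by sleep()
}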

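// Clear global error_code (sketch; CUDA_SUCCESS is the driver API's zero value)
void clear_error_code() {
    error_code = CUDA_SUCCESS;
}

void create_and_map(void* allocation) {
    clear_error_code();  // drop errors left over from earlier calls
    // ...
}

void unmap_and_release(void* allocation) {
    clear_error_code();  // my_free resets error_code the same way
    // ...
}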

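// Fix size mismatch in my_free (signatures simplified for illustration)
void my_free(void* allocation, size_t size) {
    // get_recv_size is a placeholder for the size returned by the Python
    // free callback; it is not a real function in cumem_allocator.cpp.
    size_t recv_size = get_recv_size(allocation);
    unmap_and_release(allocation, recv_size);  // recv_size, not size
    cuMemAddressFree((CUdeviceptr)allocation, recv_size);
}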

# Update Flash Attention import (Python, rotary_embedding/common.py)
try:
    from flash_attn.ops.triton.rotary import apply_rotary
except ImportError:
    # Flash Attention 4 restructured its modules, so this path may be missing
    apply_rotary = None

Verification

To verify the fixes, run the following tests:

  • Test the allocator with sleep/wake cycles to ensure that double cuMemRelease and create_and_map calls are prevented.
  • Test that frees triggered by PyTorch's garbage collector while the allocator is sleeping do not issue CUDA ops on freed memory.
  • Test the clearing of the global error_code to ensure that stale error propagation is prevented.
  • Test the my_free path to ensure that recv_size, not size, is passed to unmap_and_release and cuMemAddressFree.
