vllm - ✅(Solved) Fix cumem allocator: double-free and stale error codes during sleep/wake cycles [1 pull requests, 1 participants]

vllm, 2026-03-10 12:47:13

GitHub stats

vllm-project/vllm#36651•Fetched 2026-04-08 00:35:42

Author

markrogersjr

Participants

markrogersjr

Timeline (top)

cross-referenced ×1

Several bugs in the cumem allocator cause CUDA driver errors during sleep/wake cycles:

Error Message

error_code is a file-scope global in cumem_allocator.cpp that is not cleared between calls. A non-fatal error in one operation (e.g. create_and_map) can persist and cause a subsequent unrelated call (e.g. unmap_and_release or my_free) to take the wrong code path.
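
The failure mode above can be reproduced in miniature. The sketch below is a Python stand-in, not the C++ from cumem_allocator.cpp: the module-level error_code, the simplified create_and_map and unmap_and_release, and the fail and clear_first parameters are all hypothetical, used only to show why resetting the global at the top of each call matters.

```python
NO_ERROR = 0

# Module-level status, standing in for the file-scope global in the C++ source.
error_code = NO_ERROR


def create_and_map(fail: bool = False) -> None:
    """Pretend to map memory; record a non-fatal error when `fail` is set."""
    global error_code
    if fail:
        error_code = 2  # arbitrary non-zero value, for illustration only


def unmap_and_release(clear_first: bool) -> str:
    """Report which path is taken, based on the global status."""
    global error_code
    if clear_first:
        error_code = NO_ERROR  # the fix: reset before doing any work
    # ... real work would happen here and could set error_code itself ...
    return "error path" if error_code != NO_ERROR else "normal path"


create_and_map(fail=True)
stale = unmap_and_release(clear_first=False)  # sees the leftover error

create_and_map(fail=True)
fixed = unmap_and_release(clear_first=True)   # starts from a clean status

print(stale, "|", fixed)
```

Without the reset, the second call takes the error path even though it did nothing wrong itself.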

Root Cause

find_spec("flash_attn") succeeds with FA4 installed, but from flash_attn.ops.triton.rotary import apply_rotary fails because FA4 restructured its module layout. This causes an unhandled ImportError.
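
The gap between a presence check and a real import can be shown without installing Flash Attention. The sketch below builds a throwaway package named fake_attn (a hypothetical name) that has a top level but no ops.triton.rotary submodule, then compares find_spec with a try/except import. Setting apply_rotary to None on failure is an assumption for illustration, not necessarily what the PR does.

```python
import importlib
import importlib.util
import sys
import tempfile
from pathlib import Path

# Create a package whose top level exists but whose nested module does not.
pkg_root = Path(tempfile.mkdtemp())
(pkg_root / "fake_attn").mkdir()
(pkg_root / "fake_attn" / "__init__.py").write_text("")
sys.path.insert(0, str(pkg_root))
importlib.invalidate_caches()

# The presence check passes: the top-level package can be located.
spec_found = importlib.util.find_spec("fake_attn") is not None

# The robust pattern: attempt the exact import and fall back on failure.
try:
    from fake_attn.ops.triton.rotary import apply_rotary
except ImportError:
    apply_rotary = None  # caller would use a non-flash code path

print(spec_found, apply_rotary)
```

ModuleNotFoundError is a subclass of ImportError, so the except clause covers a missing submodule as well as a missing name.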

Fix Action

Fix

https://github.com/vllm-project/vllm/pull/36535

PR fix notes

PR #36535: Patch for vLLM + FlashAttention4 + torch for GRPO colocated training

Repository: vllm-project/vllm
Author: markrogersjr
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36535

Description (problem / solution / changelog)

Fixes https://github.com/vllm-project/vllm/issues/36651

Summary

cumem.py: Track per-allocation mapped state to prevent double cuMemRelease when sleep/wake_up are called on already-unmapped or already-mapped allocations. Add a sleeping flag so _python_free_callback returns a dummy handle during sleep, avoiding CUDA ops on freed memory. Remove gc.collect + empty_cache in sleep() which can trigger frees during the sleeping window.
cumem_allocator.cpp: Clear stale error_code at the top of create_and_map, unmap_and_release, and my_free to prevent leftover errors from previous operations propagating. Skip CUDA ops in my_free when recv_d_mem == 0 (sleeping block already freed). Fix size mismatch by using recv_size instead of size in unmap_and_release and cuMemAddressFree calls. Log a warning when unmap_and_release fails and cuMemAddressFree is skipped to make virtual address leaks visible.
rotary_embedding/common.py: Replace find_spec check with try/except for flash_attn.ops import to support Flash Attention 4 which restructured its module layout.

Test plan

Verify sleep() / wake_up() cycles work without CUDA errors in colocated training
Verify no regressions in standard vLLM inference with cumem allocator enabled
Verify rotary embedding works with both Flash Attention 2/3 and Flash Attention 4

🤖 Generated with Claude Code

Changed files

csrc/cumem_allocator.cpp (modified, +17/-2)
vllm/device_allocator/cumem.py (modified, +14/-5)
vllm/model_executor/layers/rotary_embedding/common.py (modified, +3/-2)


Your current environment

vLLM v0.17.0 (also reproducible on main) with cumem allocator enabled, using sleep/wake cycles for colocated GRPO training.

How would you like to use vllm

Colocated training where vLLM's GPU memory is released via sleep() during training steps and reclaimed via wake_up() for inference.

Before submitting a new issue...

I have searched for similar issues.

Description

Several bugs in the cumem allocator cause CUDA driver errors during sleep/wake cycles:

1. Double cuMemRelease on already-unmapped allocations

sleep() iterates all allocations and calls unmap_and_release unconditionally. If some allocations are already unmapped (e.g. from a previous partial sleep or external release), this causes a double cuMemRelease. Similarly, wake_up() calls create_and_map on already-mapped allocations. There is no per-allocation tracking of mapped state.
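
Per-allocation state tracking can be sketched without a GPU. In the example below, Allocation, FakeDriver, and the free functions sleep and wake_up are hypothetical stand-ins that only count calls; they are not the classes in cumem.py. The point is that a mapped flag makes both operations idempotent.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """Hypothetical record for one allocation; `mapped` is the tracked state."""
    handle: int
    mapped: bool = True


class FakeDriver:
    """Counts calls instead of touching a GPU."""

    def __init__(self) -> None:
        self.releases = 0
        self.maps = 0

    def unmap_and_release(self, alloc: Allocation) -> None:
        self.releases += 1

    def create_and_map(self, alloc: Allocation) -> None:
        self.maps += 1


def sleep(allocs, driver) -> None:
    for a in allocs:
        if a.mapped:  # skip allocations that are already unmapped
            driver.unmap_and_release(a)
            a.mapped = False


def wake_up(allocs, driver) -> None:
    for a in allocs:
        if not a.mapped:  # skip allocations that are already mapped
            driver.create_and_map(a)
            a.mapped = True


driver = FakeDriver()
# The second allocation starts out unmapped, as after an external release.
allocs = [Allocation(1), Allocation(2, mapped=False)]
sleep(allocs, driver)
sleep(allocs, driver)    # repeated sleep releases nothing further
wake_up(allocs, driver)
wake_up(allocs, driver)  # repeated wake_up maps nothing further
print(driver.releases, driver.maps)
```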

2. CUDA ops on freed memory during sleep

When the allocator is sleeping, PyTorch's garbage collector can trigger my_free on tensors whose backing memory was already released by sleep(). The C++ my_free then calls unmap_and_release on invalid handles, causing CUDA driver errors. Additionally, gc.collect() and torch.cuda.empty_cache() at the end of sleep() can trigger these frees while the allocator is in the sleeping state.
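
One way to picture the guard against frees during sleep is a flag that makes the free callback hand back a dummy record. The sketch below is a GPU-free model under assumptions: the class name, the (device, size, d_mem, handle) tuple layout, and the use of d_mem == 0 as the "already released" marker are illustrative, loosely following the PR summary rather than reproducing its code.

```python
class SleepAwareAllocator:
    """Hypothetical model of an allocator with a `sleeping` flag."""

    def __init__(self) -> None:
        self.sleeping = False
        # ptr -> (device, size, d_mem, handle); all values are made up.
        self.records = {0x1000: (0, 4096, 0x1000, "handle-a")}
        self.driver_calls = []

    def python_free_callback(self, ptr):
        """Return the record for `ptr`, or a dummy one while sleeping."""
        device, size, d_mem, handle = self.records.pop(ptr)
        if self.sleeping:
            # The backing memory is already gone; signal that with d_mem == 0.
            return (device, size, 0, None)
        return (device, size, d_mem, handle)

    def my_free(self, ptr) -> None:
        _device, size, d_mem, _handle = self.python_free_callback(ptr)
        if d_mem == 0:
            return  # nothing to unmap: sleep() already released this block
        self.driver_calls.append(("unmap_and_release", d_mem, size))


asleep = SleepAwareAllocator()
asleep.sleeping = True
asleep.my_free(0x1000)   # no driver call is recorded

awake = SleepAwareAllocator()
awake.my_free(0x1000)    # the normal path records one driver call
print(asleep.driver_calls, awake.driver_calls)
```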

3. Stale global error_code propagation

error_code is a file-scope global in cumem_allocator.cpp that is not cleared between calls. A non-fatal error in one operation (e.g. create_and_map) can persist and cause a subsequent unrelated call (e.g. unmap_and_release or my_free) to take the wrong code path.

4. Size mismatch in my_free

my_free passes its size argument to unmap_and_release and cuMemAddressFree, but the correct size is recv_size returned from the Python callback. This can cause incorrect unmapping.
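
The size mismatch is easiest to see when the size recorded at allocation time differs from the size the caller passes to free. The sketch below is a hypothetical Python model: free_callback, driver_log, and the tuple layout are assumptions, and the idea that the recorded size is larger because of rounding is one plausible cause, not something the issue states.

```python
def my_free(ptr, size, free_callback, driver_log) -> None:
    """Free `ptr` using the size recorded by the callback, not `size`."""
    _device, recv_size, recv_d_mem, _handle = free_callback(ptr)
    # `size` is whatever the caller passed; the mapping itself was created
    # with recv_size, so that is the value the driver calls must receive.
    driver_log.append(("unmap_and_release", recv_d_mem, recv_size))
    driver_log.append(("cuMemAddressFree", recv_d_mem, recv_size))


two_mib = 2 * 1024 * 1024
records = {0x2000: (0, two_mib, 0x2000, "handle-b")}  # made-up values
log = []

# The caller asks to free 1000 bytes, but the mapping covers 2 MiB.
my_free(0x2000, 1000, records.pop, log)
print(log)
```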

5. Flash Attention 4 rotary embedding import failure

find_spec("flash_attn") succeeds with FA4 installed, but from flash_attn.ops.triton.rotary import apply_rotary fails because FA4 restructured its module layout. This causes an unhandled ImportError.

Fix

https://github.com/vllm-project/vllm/pull/36535

extent analysis

Fix Plan

To address the issues in the cumem allocator, follow these steps:

1. Track allocation state: Introduce a per-allocation tracking mechanism to prevent double cuMemRelease and create_and_map calls.
2. Guard frees during sleep: Add a sleeping flag so that frees arriving while the allocator is asleep skip CUDA ops, and remove the gc.collect() and torch.cuda.empty_cache() calls from sleep(), which can trigger such frees during the sleeping window.
3. Clear global error_code: Clear the error_code global variable between calls to prevent stale error propagation.
4. Fix size mismatch in my_free: Pass the correct recv_size to unmap_and_release and cuMemAddressFree in my_free.
5. Update Flash Attention import: Replace the find_spec check with a try/except around the flash_attn.ops import, so that the restructured Flash Attention 4 module layout no longer raises an unhandled ImportError.

Example Code Changes

// Track allocation state
// (illustrative C++ sketch; per the PR summary the real tracking lives in cumem.py)
#include <unordered_map>

// true = currently mapped; an entry is added when an allocation is created.
std::unordered_map<void*, bool> is_mapped;

void sleep() {
    // ...
    for (auto& [allocation, mapped] : is_mapped) {
        if (mapped) {  // skip allocations that are already unmapped
            unmap_and_release(allocation);
            mapped = false;
        }
    }
    // ...
}

void wake_up() {
    // ...
    for (auto& [allocation, mapped] : is_mapped) {
        if (!mapped) {  // skip allocations that are already mapped
            create_and_map(allocation);
            mapped = true;
        }
    }
    // ...
}

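// Avoid forcing frees while sleeping
// Per the PR summary, sleep() no longer calls gc.collect() or
// torch.cuda.empty_cache(), because either can trigger frees during the
// sleeping window. Frees that still arrive while sleeping are handled by the
// sleeping flag in cumem.py and by the recv_d_mem == 0 check in my_free.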

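// Clear global error_code
// (assumes error_code is a CUresult, the CUDA driver API status type)
void clear_error_code() {
    error_code = CUDA_SUCCESS;
}

void create_and_map(void* allocation) {
    clear_error_code();  // drop any error left by a previous operation
    // ...
}

void unmap_and_release(void* allocation) {
    clear_error_code();  // drop any error left by a previous operation
    // ...
}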

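// Fix size mismatch in my_free
// get_recv_size and get_recv_d_mem are hypothetical helpers standing in for
// the values returned by the Python free callback.
void my_free(void* allocation, size_t size) {
    clear_error_code();
    size_t recv_size = get_recv_size(allocation);
    CUdeviceptr recv_d_mem = get_recv_d_mem(allocation);
    if (recv_d_mem == 0) return;  // block was already released during sleep
    unmap_and_release(allocation, recv_size);  // recv_size, not size
    cuMemAddressFree(recv_d_mem, recv_size);
}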

# Update Flash Attention import (Python; the None fallback is an assumption)
try:
    from flash_attn.ops.triton.rotary import apply_rotary
except ImportError:
    apply_rotary = None

Verification

To verify the fixes, run the following tests:

Test the allocator with sleep/wake cycles to ensure that double cuMemRelease and create_and_map calls are prevented.
Test that frees triggered while the allocator is sleeping do not issue CUDA ops on memory that sleep() has already released.
Test the clearing of the global error_code to ensure that stale error propagation is prevented.
Test that my_free passes recv_size, not size, to unmap_and_release and cuMemAddressFree.
