pytorch - Tensor CUDA memory is not released by gc when one tensor holds a reference to another [8 comments, 4 participants]

GitHub stats
pytorch/pytorch#177028, fetched 2026-04-08 00:22:39
Comments: 8
Participants: 4
Timeline: 45
Reactions: 0
Timeline (top): mentioned ×16, subscribed ×16, commented ×8, labeled ×5

🐛 Describe the bug

The CUDA memory of a tensor is not released by gc when one tensor holds a reference to another, e.g. b_torch.ref = a_torch.

Minimal example:


import torch
import gc


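def run():
    # Each tensor holds 8 * 1024 * 1024 float32 values, i.e. 32 MiB.
    a_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    b_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    # Store a_torch as an ad-hoc attribute on b_torch.
    b_torch.ref = a_torch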

def test():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
    print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")
    print(f"Max Reserved:  {torch.cuda.max_memory_reserved() / 1024**2:.2f} MB")
    print("Running test...")

    run()

    torch.cuda.synchronize()
    torch.cuda.empty_cache()
    gc.collect()
    print("Memory usage after test...")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**2:.2f} MB")
    print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**2:.2f} MB")
    print(f"Max Reserved:  {torch.cuda.max_memory_reserved() / 1024**2:.2f} MB")

if __name__ == "__main__":
    test()

Output:

Allocated: 0.00 MB
Reserved:  0.00 MB
Max Allocated: 0.00 MB
Max Reserved:  0.00 MB
Running test...
Memory usage after test...
Allocated: 32.00 MB
Reserved:  32.00 MB
Max Allocated: 64.00 MB
Max Reserved:  64.00 MB

The allocated memory at the end is expected to be 0.
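The figures in the output follow directly from the tensor shape. As a quick sanity check, this sketch (plain Python, no GPU needed; the variable names are illustrative) computes the size of one tensor from the example:

```python
# Size of one tensor from the report: shape [8, 1024, 1024], dtype float32.
num_elements = 8 * 1024 * 1024
bytes_per_element = 4  # float32 occupies 4 bytes
one_tensor_mib = num_elements * bytes_per_element / 1024**2

print(f"One tensor:  {one_tensor_mib:.2f} MB")      # 32.00 MB
print(f"Two tensors: {2 * one_tensor_mib:.2f} MB")  # 64.00 MB
```

The peak of 64.00 MB therefore corresponds to both tensors being live at once, and the 32.00 MB remaining after run() corresponds to exactly one tensor still being allocated.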

Versions

Collecting environment information...
PyTorch version: 2.9.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (aarch64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.14.0 (main, Feb 26 2026, 03:53:09) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-6.14.0-1013-nvidia-64k-aarch64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 13.1.66
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA GB200
GPU 1: NVIDIA GB200
GPU 2: NVIDIA GB200
GPU 3: NVIDIA GB200

Nvidia driver version: 580.82.07
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 144
On-line CPU(s) list: 0-143
Thread(s) per core: 1
Core(s) per socket: 72
Socket(s): 2
NUMA node(s): 34
Vendor ID: ARM
Model: 0
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3411.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
L1d cache: 9 MiB
L1i cache: 9 MiB
L2 cache: 144 MiB
L3 cache: 228 MiB
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s): 72-143
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):
NUMA node5 CPU(s):
NUMA node6 CPU(s):
NUMA node7 CPU(s):
NUMA node8 CPU(s):
NUMA node9 CPU(s):
NUMA node10 CPU(s):
NUMA node11 CPU(s):
NUMA node12 CPU(s):
NUMA node13 CPU(s):
NUMA node14 CPU(s):
NUMA node15 CPU(s):
NUMA node16 CPU(s):
NUMA node17 CPU(s):
NUMA node18 CPU(s):
NUMA node19 CPU(s):
NUMA node20 CPU(s):
NUMA node21 CPU(s):
NUMA node22 CPU(s):
NUMA node23 CPU(s):
NUMA node24 CPU(s):
NUMA node25 CPU(s):
NUMA node26 CPU(s):
NUMA node27 CPU(s):
NUMA node28 CPU(s):
NUMA node29 CPU(s):
NUMA node30 CPU(s):
NUMA node31 CPU(s):
NUMA node32 CPU(s):
NUMA node33 CPU(s):
Vulnerability Gather data sampling: Not affected
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Mitigation; CSV2, but not BHB
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti

Versions of relevant libraries:
[pip3] mypy==1.13.0
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.4.2
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] torch==2.9.0+cu128
[pip3] triton==3.5.0
[conda] Could not collect

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

Fix Plan

Problem Summary

When one tensor holds a reference to another as an attribute, the CUDA memory of one tensor (32 MB in the example) stays allocated after run() returns, even after gc.collect() is called.

Root Cause Analysis

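The trigger is the line b_torch.ref = a_torch in the run() function, which stores a_torch as an attribute of b_torch. This is a one-way reference from b_torch to a_torch, not a reference cycle, so reference counting alone would normally free both tensors when run() returns. The report shows that in this environment (PyTorch 2.9.0+cu128, Python 3.14.0) one tensor nevertheless stays allocated; the underlying cause is not identified here.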
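A quick way to inspect the reference structure created by an assignment like b_torch.ref = a_torch is to reproduce it with plain Python objects. The sketch below is an illustration only: Holder is a hypothetical stand-in class, not a PyTorch type, and the behavior shown relies on CPython's reference counting. It disables the cycle collector and checks whether both objects are freed as soon as the function returns.

```python
import gc
import weakref


class Holder:
    """Hypothetical stand-in for a tensor; plain Python, no CUDA involved."""


def run():
    a = Holder()
    b = Holder()
    b.ref = a  # b -> a only; a holds nothing that points back to b
    # Return weak references so lifetimes can be observed without extending them.
    return weakref.ref(a), weakref.ref(b)


gc.disable()  # rule out the cycle collector; only reference counting remains
try:
    a_ref, b_ref = run()
    a_freed = a_ref() is None
    b_freed = b_ref() is None
finally:
    gc.enable()

print(a_freed, b_freed)
```

On CPython both flags are expected to be True, which indicates that an attribute pointing from one object to another is released by reference counting alone and does not need the cycle collector. Why the CUDA tensor in the report behaves differently is not established by this check.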

Fix Plan

To work around this issue, we can drop the reference from b_torch to a_torch before run() returns, for example by setting the ref attribute of b_torch to None once it is no longer needed.

import torch

def run():
    a_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    b_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    b_torch.ref = a_torch

    # Use b_torch here...

    # Drop the reference to a_torch so b_torch no longer keeps it alive
    b_torch.ref = None

Alternatively, we can use the del statement to delete the reference to a_torch from b_torch:

import torch

def run():
    a_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    b_torch = torch.empty([8, 1024, 1024], dtype=torch.float32, device="cuda")
    b_torch.ref = a_torch

    # Use b_torch here...

    # Delete the ref attribute, removing b_torch's reference to a_torch
    del b_torch.ref
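Both variants rely on the cleanup line actually being reached. As a sketch of one way to guarantee that, and assuming (as this plan does) that clearing the attribute is enough to release the memory, a small context manager can detach the attribute even if the code in between raises. The helper name temporary_ref and the FakeTensor stand-in are hypothetical and not PyTorch APIs; the stand-in only keeps the example runnable without a GPU.

```python
from contextlib import contextmanager


@contextmanager
def temporary_ref(holder, target, name="ref"):
    """Attach `target` to `holder` as attribute `name`; always detach on exit."""
    setattr(holder, name, target)
    try:
        yield holder
    finally:
        # Runs on normal exit and when the body raises.
        if hasattr(holder, name):
            delattr(holder, name)


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor, used so no GPU is required."""


a = FakeTensor()
b = FakeTensor()

with temporary_ref(b, a):
    attached_inside = b.ref is a  # the attribute exists only inside the block

still_attached = hasattr(b, "ref")
print(attached_inside, still_attached)  # True False
```

With real tensors the same helper would be called as temporary_ref(b_torch, a_torch); whether that releases the CUDA memory in the reported environment still needs to be confirmed with the test() function from the minimal example.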

Verification

To verify that the fix worked, we can run the test() function again and check that the allocated memory is 0.

test()
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**2:.2f} MB")
