pytorch - Memory management: memory not returned after CPU -> GPU transfer

pytorch/pytorch#177520 (fetched 2026-04-08 00:47:35)

I have a simple C++ libtorch (2.10.0) program:

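#include <windows.h>
#include <psapi.h>   // GetProcessMemoryInfo (may require linking Psapi.lib)
#include <cstdio>

// Prints the process commit charge (PrivateUsage) in GB under the given label.
void PrintMemory(const char* label)
{
    PROCESS_MEMORY_COUNTERS_EX info{};
    GetProcessMemoryInfo(GetCurrentProcess(), reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&info), sizeof(info));
    printf("[%s]\n  PrivateUsage       (committed pages): %.3f GB\n", label, (float)info.PrivateUsage / 1024 / 1024 / 1024);
}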
PrintMemory("Before");
auto tensor = torch::zeros({ 16384, 16384, 4 });  // 16384 * 16384 * 4 float32 elements, about 4 GB
PrintMemory("After Init");        
tensor = tensor.to(torch::kCUDA);
PrintMemory("After Cuda");
c10::cuda::CUDACachingAllocator::emptyCache();
PrintMemory("After emptyCache");

[Before]
  PrivateUsage       (committed pages): 0.700 GB
[After Init]
  PrivateUsage       (committed pages): 4.716 GB
[After Cuda]
  PrivateUsage       (committed pages): 4.952 GB
[After emptyCache]
  PrivateUsage       (committed pages): 4.952 GB

But the CPU memory is not returned.

However, when I do this:

PrintMemory("Before");
auto tensor = torch::zeros({ 16384, 16384, 4 });  // 16384 * 16384 * 4 float32 elements, about 4 GB
PrintMemory("After Init");        
tensor.reset();
PrintMemory("After reset");

[Before]
  PrivateUsage       (committed pages): 0.701 GB
[After Init]
  PrivateUsage       (committed pages): 4.716 GB
[After Reset]
  PrivateUsage       (committed pages): 0.701 GB

memory is returned.

Why does the memory stay allocated (or appear to stay allocated) after the CUDA transfer? When I use the same transfer for large models (not a single tensor as in this example), RAM stays full (or appears to) and I cannot allocate more. The GPU memory is allocated correctly.
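To make the question concrete, here is a minimal sketch of the handle semantics involved. It does not use libtorch: `Buffer`, `Handle`, `make_cpu` and `to_gpu` are hypothetical stand-ins, and `std::shared_ptr` is only an analogy for the reference-counted handle that `torch::Tensor` provides. The sketch shows that reassigning a handle to the result of a conversion drops the last reference to the source buffer, so a later `reset()` can only release the converted copy.

```cpp
#include <cassert>
#include <memory>

// Counters of live buffers per "device" (illustration only).
int g_live_cpu = 0;
int g_live_gpu = 0;

// Hypothetical stand-in for tensor storage; not a libtorch type.
struct Buffer {
    bool on_gpu;
    explicit Buffer(bool gpu) : on_gpu(gpu) {
        if (on_gpu) { ++g_live_gpu; } else { ++g_live_cpu; }
    }
    ~Buffer() {
        if (on_gpu) { --g_live_gpu; } else { --g_live_cpu; }
    }
    Buffer(const Buffer&) = delete;
    Buffer& operator=(const Buffer&) = delete;
};

// Reference-counted handle, analogous to (but not the same as) torch::Tensor.
using Handle = std::shared_ptr<Buffer>;

// Allocates a new "CPU" buffer.
Handle make_cpu() { return std::make_shared<Buffer>(false); }

// Returns a new "GPU" buffer; the source buffer is left untouched.
Handle to_gpu(const Handle& src) {
    assert(src && "source handle must not be empty");
    return std::make_shared<Buffer>(true);
}
```

With these definitions, `Handle t = make_cpu(); t = to_gpu(t);` leaves no live CPU buffer and one live GPU buffer, and a following `t.reset()` destroys the GPU buffer. Under this analogy, the host memory that remains committed in the report would not be the original tensor's storage, although the issue itself does not confirm where it comes from.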

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jbschlosser @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia

extent analysis

Fix Plan

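c10::cuda::CUDACachingAllocator::emptyCache() only returns unused cached GPU blocks to the CUDA driver. It never touches host memory, so it cannot be expected to lower PrivateUsage on its own. Note also that tensor = tensor.to(torch::kCUDA) already drops the last reference to the CPU tensor (assuming nothing else holds one), so its storage goes back to the CPU allocator at that point. The issue has no confirmed diagnosis. One possible explanation, offered here as an assumption, is that the memory still committed after the transfer belongs to the CUDA context and driver rather than to the original CPU tensor. The steps below help tell the two apart.

Steps to Fix

  1. Move the tensor to CUDA. The assignment releases the CPU tensor.
  2. Reset the tensor to release the GPU copy, empty the cache, and compare the readings.

PrintMemory("Before");
auto tensor = torch::zeros({ 16384, 16384, 4 });
PrintMemory("After Init");
tensor = tensor.to(torch::kCUDA);  // the old CPU storage loses its last reference here
PrintMemory("After Cuda");
tensor.reset();  // releases the CUDA tensor, not CPU memory
c10::cuda::CUDACachingAllocator::emptyCache();  // hand cached GPU blocks back to the driver
PrintMemory("After Reset");

Alternatively, skip the host allocation entirely by creating the tensor on the CUDA device, for example with torch::zeros({ 16384, 16384, 4 }, torch::kCUDA). This avoids the CPU copy, but any data then has to be produced on, or copied to, the GPU separately.

Verification

Run the modified code and compare the PrintMemory readings. If PrivateUsage falls back toward the "Before" value after reset() and emptyCache(), the committed memory was tied to the GPU allocation rather than to the original CPU tensor. If it stays high, the remaining commit is probably held by the CUDA context or driver for the life of the process. Both outcomes are expectations to check, not results reported in the issue.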
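When comparing readings, it helps to know how large a change to expect. The following is a small sketch of that arithmetic, independent of libtorch; `tensor_bytes`, `to_gib` and `released` are hypothetical helper names, and the 0.25 GB tolerance is an arbitrary assumption chosen to absorb allocator and runtime overhead.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Bytes needed for a dense tensor with the given shape and element size.
std::uint64_t tensor_bytes(const std::vector<std::int64_t>& shape,
                           std::uint64_t elem_size) {
    std::uint64_t count = 1;
    for (std::int64_t dim : shape) {
        assert(dim >= 0 && "dimensions must be non-negative");
        count *= static_cast<std::uint64_t>(dim);
    }
    return count * elem_size;
}

// Converts bytes to GiB, matching the /1024/1024/1024 used by PrintMemory.
double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// True when usage dropped by roughly the expected amount.
// tol_gib is an assumed tolerance, not a value taken from the issue.
bool released(double before_gib, double after_gib, double expected_gib,
              double tol_gib = 0.25) {
    return std::fabs((before_gib - after_gib) - expected_gib) <= tol_gib;
}
```

For the tensor in the report, `tensor_bytes({16384, 16384, 4}, 4)` is 4294967296 bytes, exactly 4 GiB. Applied to the posted numbers, `released(4.716, 0.701, 4.0)` holds for the plain reset() run, while `released(4.952, 4.952, 4.0)` does not hold for the emptyCache() run.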

Extra Tips

  • Release large tensors as soon as they are no longer needed. A tensor's storage is freed only when the last handle referring to it goes away.
  • Use torch::Tensor's reset() method to drop a handle when you're done using a tensor.
  • Keep in mind that emptyCache() affects only the CUDA caching allocator. It does not release host memory.
