pytorch: [Bug] F.grid_sample backward pass on CUDA returns accumulated gradients due to uninitialized grad_input memory

🐛 Describe the bug

When running torch.nn.functional.grid_sample on CUDA, independent backward passes on newly created input tensors appear to produce gradients that accumulate across runs.

In the reproducer below, each iteration creates a fresh leaf tensor x with requires_grad=True, runs F.grid_sample, and calls backward() once. However, x.grad[0, 0, 0, 0] increases as 1.0, 2.0, 3.0, ... across independent iterations.

Since the input tensor is newly created in every iteration, its .grad field should not carry over from previous iterations. The observed monotonic accumulation suggests that stale data may be present in the internal grad_input buffer used by the CUDA backward implementation.

Minimal Reproducible Example

import torch
import torch.nn.functional as F
import gc

def reproduce_uninitialized_memory():
    device = "cuda"
    dtype = torch.float32

    print(f"PyTorch Version: {torch.__version__}")

    # Sampling grid of shape (1, 2, 2, 2): one sample at each corner of the input.
    grid = torch.tensor([[[[-1.0, -1.0], [ 1.0, -1.0]],
                          [[-1.0,  1.0], [ 1.0,  1.0]]]], device=device, dtype=dtype)

    corner_gradients = []

    print("\nStarting independent forward and backward passes...")

    for i in range(1, 6):
        # Fresh leaf tensor in every iteration, so x.grad starts out as None.
        x = torch.zeros(1, 1, 4, 4, device=device, dtype=dtype, requires_grad=True)

        y = F.grid_sample(x, grid, align_corners=True)

        # Exactly one backward pass per fresh input.
        y.sum().backward()

        # Gradient at the top-left input pixel; expected to be 1.0 in every run.
        val = x.grad[0, 0, 0, 0].item()
        corner_gradients.append(val)
        print(f"Run {i}: Top-left gradient = {val}")

        # Drop all references so the tensors' memory is released before the next run.
        del x
        del y
        gc.collect()

    # The bug is flagged when the recorded values differ and grow over the runs.
    if len(set(corner_gradients)) > 1 and corner_gradients[-1] > corner_gradients[0]:
        print("\nBug Reproduced: Gradients are accumulating across independent runs.")
    else:
        print("\nBug Not Reproduced: Gradients are consistent.")

if __name__ == "__main__":
    reproduce_uninitialized_memory()

Expected behavior

Each iteration creates a fresh input tensor and performs one backward pass. The top-left input gradient should be consistently 1.0 in every run.
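To see where the expected value of 1.0 comes from, the sketch below recomputes the input gradient by hand in plain Python, with no PyTorch involved. It assumes the defaults used in the reproducer (bilinear interpolation, zero padding) together with align_corners=True, and an upstream gradient of 1.0 per output element, which is what y.sum().backward() supplies. The helper name expected_input_grad is made up for this illustration.

```python
import math

def expected_input_grad(points, height, width):
    """Hand-computed gradient of sum(grid_sample(x, grid)) w.r.t. x.

    Assumes bilinear interpolation, zero padding and align_corners=True.
    `points` holds (x, y) pairs in normalized [-1, 1] coordinates.
    """
    grad = [[0.0] * width for _ in range(height)]
    for gx, gy in points:
        # align_corners=True maps -1 and +1 onto the centers of the edge pixels.
        ix = (gx + 1.0) / 2.0 * (width - 1)
        iy = (gy + 1.0) / 2.0 * (height - 1)
        x0, y0 = math.floor(ix), math.floor(iy)
        # Each of the four neighbouring pixels receives its bilinear weight
        # times the upstream gradient (1.0 here).
        for yy, wy in ((y0, y0 + 1 - iy), (y0 + 1, iy - y0)):
            for xx, wx in ((x0, x0 + 1 - ix), (x0 + 1, ix - x0)):
                if 0 <= yy < height and 0 <= xx < width:  # zero padding
                    grad[yy][xx] += wx * wy
    return grad

# The four grid points from the reproducer, one at each corner.
points = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
grad = expected_input_grad(points, height=4, width=4)
for row in grad:
    print(row)
```

Each grid point lands exactly on a corner pixel, so that pixel gets weight 1.0 and every other pixel gets 0.0. A single backward pass should therefore leave 1.0 at the four corners and nothing else.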

Actual behavior

The top-left input gradient increases across independent runs:

1.0, 2.0, 3.0, 4.0, 5.0

It looks as if the backward pass accumulates into a gradient buffer that still holds stale data from earlier runs, rather than starting from a zero-initialized buffer.
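The suspected mechanism can be illustrated with a toy model in plain Python. This is only a sketch of the hypothesis, not PyTorch's allocator or its CUDA kernel: ToyCachingAllocator and toy_backward are made-up names, a Python list stands in for a block of device memory, and the "backward" step simply adds into the buffer the way a scatter-style kernel would.

```python
# Toy model only: a plain list stands in for a block of GPU memory.
# ToyCachingAllocator and toy_backward are hypothetical names, not PyTorch APIs.

class ToyCachingAllocator:
    """Hands back the same block on every request, as a caching allocator might."""

    def __init__(self, size):
        self._block = [0.0] * size

    def empty(self):
        # "empty"-style allocation: contents are whatever the last user left behind.
        return self._block

    def zeros(self):
        # "zeros"-style allocation: the block is cleared before it is handed out.
        for i in range(len(self._block)):
            self._block[i] = 0.0
        return self._block


def toy_backward(grad_input, index, upstream=1.0):
    # Scatter-style backward: adds into the buffer instead of overwriting it.
    grad_input[index] += upstream
    return grad_input[index]


def run(zero_initialized, runs=5):
    allocator = ToyCachingAllocator(size=16)
    results = []
    for _ in range(runs):
        buf = allocator.zeros() if zero_initialized else allocator.empty()
        results.append(toy_backward(buf, index=0))
    return results


stale = run(zero_initialized=False)
clean = run(zero_initialized=True)
print("stale buffer: ", stale)   # [1.0, 2.0, 3.0, 4.0, 5.0]
print("zeroed buffer:", clean)   # [1.0, 1.0, 1.0, 1.0, 1.0]
```

Under these assumptions the uncleared buffer reproduces the 1.0, 2.0, 3.0, ... pattern from the report, while the cleared buffer gives 1.0 every time. Whether the real backward implementation behaves this way is exactly what the report asks the maintainers to confirm.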

Output from the nightly build:

(torch-nightly) xyt19@Oasis:/tmp$ python bug.py
PyTorch Version: 2.13.0.dev20260521+cu130

Starting independent forward and backward passes...
Run 1: Top-left gradient = 1.0
Run 2: Top-left gradient = 2.0
Run 3: Top-left gradient = 3.0
Run 4: Top-left gradient = 4.0
Run 5: Top-left gradient = 5.0

Bug Reproduced: Gradients are accumulating across independent runs.

Versions

PyTorch version: 2.13.0.dev20260521+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.4 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.10.20 (main, Mar 11 2026, 17:46:40) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.0.140
Nvidia driver version: 596.49
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_tensor_ir.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.21.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.21.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.13.0.dev20260521+cu130
[pip3] torchaudio==2.11.0.dev20260525+cu130
[pip3] torchvision==0.28.0.dev20260525+cu130
[pip3] triton==3.7.0+git88b227e2
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas 13.1.1.3 pypi_0 pypi
[conda] nvidia-cuda-cupti 13.0.85 pypi_0 pypi
[conda] nvidia-cuda-nvrtc 13.0.88 pypi_0 pypi
[conda] nvidia-cuda-runtime 13.0.96 pypi_0 pypi
[conda] nvidia-cudnn-cu13 9.20.0.48 pypi_0 pypi
[conda] nvidia-cufft 12.0.0.61 pypi_0 pypi
[conda] nvidia-curand 10.4.0.35 pypi_0 pypi
[conda] nvidia-cusolver 12.0.4.66 pypi_0 pypi
[conda] nvidia-cusparse 12.6.3.3 pypi_0 pypi
[conda] nvidia-cusparselt-cu13 0.8.1 pypi_0 pypi
[conda] nvidia-nccl-cu13 2.29.7 pypi_0 pypi
[conda] nvidia-nvjitlink 13.0.88 pypi_0 pypi
[conda] nvidia-nvtx 13.0.85 pypi_0 pypi
[conda] torch 2.13.0.dev20260521+cu130 pypi_0 pypi
[conda] torchaudio 2.11.0.dev20260525+cu130 pypi_0 pypi
[conda] torchvision 0.28.0.dev20260525+cu130 pypi_0 pypi
[conda] triton 3.7.0+git88b227e2 pypi_0 pypi

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @bobrenjc93 @ptrblck @msaroufim @eqy @tinglvv @nWEIdia
