pytorch: Inductor max-autotune fails with cudagraph pool tracking error when Conv2d weight is mutated via .data in forward



🐛 Describe the bug

torch.compile with the Inductor backend fails at runtime with a CUDA graph trees memory-pool tracking error when a Conv2d subclass mutates its parameter storage inside forward() using self.weight.data *= self.mask.

I understand that mutating a parameter via .data inside forward() is not a recommended pattern. However, the failure currently surfaces as a low-level Inductor/CUDA graph trees runtime error:

RuntimeError: Detected 6 tensor(s) in the cudagraph pool not tracked as outputs.
All live allocations must be tracked for correctness.

The same model runs successfully when CUDA graphs are disabled via mode="max-autotune-no-cudagraphs", and also runs successfully in the default Inductor mode. A clean implementation that avoids .data mutation and instead computes weight = self.weight * self.mask also works with mode="max-autotune".

This suggests the failure is specifically triggered by the combination of:

  • parameter storage mutation inside forward();
  • Inductor mode="max-autotune";
  • CUDA graph trees.

If this pattern is unsupported, it may be better for torch.compile to graph break, fall back, disable cudagraphs for this graph, or emit a clearer diagnostic instead of failing with a cudagraph pool bookkeeping error.
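To make the first ingredient concrete, here is a minimal eager-mode sketch (CPU only, no torch.compile involved) of what `self.weight.data *= self.mask` does to the parameter. The layer sizes and variable names are illustrative assumptions, not taken from the issue. It only shows that the write lands in the parameter's existing storage and that repeating it with a 0/1 mask changes nothing further; it does not reproduce the CUDA graph failure.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small stand-in layer; sizes are illustrative only.
k = 7
conv = nn.Conv2d(1, 4, kernel_size=k, padding=k // 2)

# Same mask layout as in the repro: rows above the centre, plus the
# left half of the centre row.
mask = torch.zeros_like(conv.weight)
mask[:, :, : k // 2, :] = 1
mask[:, :, k // 2, : k // 2] = 1

ptr_before = conv.weight.data_ptr()
original = conv.weight.detach().clone()

# The pattern from the issue: in-place write through .data.
with torch.no_grad():
    conv.weight.data *= mask
after_first = conv.weight.detach().clone()

# Applying it again, as every forward() call would.
with torch.no_grad():
    conv.weight.data *= mask
after_second = conv.weight.detach().clone()

same_storage = conv.weight.data_ptr() == ptr_before        # no new allocation
values_changed = not torch.equal(original, after_first)    # the parameter itself was modified
masked_zeroed = bool((after_first[mask == 0] == 0).all())  # masked taps are now zero
idempotent = torch.equal(after_first, after_second)        # 0/1 mask: second pass is a no-op

print(same_storage, values_changed, masked_zeroed, idempotent)
```

In other words, the forward pass has a persistent side effect on module state, which is the part that the mask-multiply variant described later in this report avoids.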

Repro

import torch
import torch.nn as nn


class DirtyMaskedConv2d(nn.Conv2d):
    def __init__(self, in_channels, out_channels, kernel_size=7, padding=3):
        super().__init__(
            in_channels,
            out_channels,
            kernel_size=kernel_size,
            padding=padding,
        )

        mask = torch.zeros_like(self.weight)
        mask[:, :, : kernel_size // 2, :] = 1
        mask[:, :, kernel_size // 2, : kernel_size // 2] = 1
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Essential trigger:
        # mutating parameter storage inside forward.
        with torch.no_grad():
            self.weight.data *= self.mask

        return super().forward(x)


class M(nn.Module):
    def __init__(self, n_layers=5):
        super().__init__()

        layers = [
            DirtyMaskedConv2d(1, 64),
            nn.ReLU(),
        ]

        for _ in range(n_layers):
            layers += [
                DirtyMaskedConv2d(64, 64),
                nn.ReLU(),
            ]

        layers += [
            nn.Conv2d(64, 64, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(64, 2, kernel_size=1),
        ]

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)


def main():
    assert torch.cuda.is_available()

    torch.manual_seed(0)
    torch.cuda.manual_seed_all(0)

    device = "cuda"
    model = M(n_layers=5).to(device).eval()
    x = torch.rand(2, 1, 32, 32, device=device)

    with torch.no_grad():
        eager = model(x)

    compiled_model = torch.compile(
        model,
        backend="inductor",
        mode="max-autotune",
        dynamic=True,
    )

    with torch.no_grad():
        out = compiled_model(x)

    print(torch.allclose(eager, out, atol=1e-4, rtol=1e-4))


if __name__ == "__main__":
    main()

Actual behavior

The compiled call fails with:

RuntimeError: Detected 6 tensor(s) in the cudagraph pool not tracked as outputs.
All live allocations must be tracked for correctness.

The stack trace ends in:

torch/_inductor/cudagraph_trees.py", line ..., in check_memory_pool
    raise RuntimeError(msg)

Full traceback excerpt:

File ".../torch/_inductor/compile_fx.py", line 1922, in run
    return compiled_fn(new_inputs)
File ".../torch/_inductor/cudagraph_trees.py", line 450, in deferred_cudagraphify
    fn, out = cudagraphify(model, inputs, new_static_input_idxs, *args, **kwargs)
File ".../torch/_inductor/cudagraph_trees.py", line 510, in cudagraphify
    return manager.add_function(...)
File ".../torch/_inductor/cudagraph_trees.py", line 788, in run
    check_memory_pool(self.device_index, self.cuda_graphs_pool, out_refs)
File ".../torch/_inductor/cudagraph_trees.py", line 1999, in check_memory_pool
    raise RuntimeError(msg)
RuntimeError: Detected 6 tensor(s) in the cudagraph pool not tracked as outputs.
All live allocations must be tracked for correctness.

Expected behavior

Either:

  1. the compiled model should run successfully and match eager, or
  2. if mutating parameter storage through .data inside forward() is unsupported under CUDA graph trees, torch.compile should fail earlier with a clearer diagnostic, graph break, fall back, or avoid enabling CUDA graphs for this graph.

Additional observations

The issue appears specific to mode="max-autotune" / CUDA graph trees.

The following variants run successfully in my tests:

torch.compile(model, backend="inductor", dynamic=True, mode="max-autotune-no-cudagraphs")
torch.compile(model, backend="inductor", dynamic=True)
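One way to keep the choice between the failing and the working configuration in a single place is a small wrapper around `torch.compile`. The sketch below is only an illustration: `compile_kwargs` and `compile_model` are hypothetical helper names, and the only settings used are the `backend`, `mode`, and `dynamic` arguments already shown in this report.

```python
import torch


def compile_kwargs(use_cudagraphs: bool) -> dict:
    """Build torch.compile keyword arguments (hypothetical helper).

    use_cudagraphs=True  -> mode="max-autotune" (the configuration that fails in this report)
    use_cudagraphs=False -> mode="max-autotune-no-cudagraphs" (reported to work)
    """
    mode = "max-autotune" if use_cudagraphs else "max-autotune-no-cudagraphs"
    return {"backend": "inductor", "mode": mode, "dynamic": True}


def compile_model(model: torch.nn.Module, use_cudagraphs: bool = False):
    """Wrap torch.compile; defaults to the variant without CUDA graphs."""
    return torch.compile(model, **compile_kwargs(use_cudagraphs))
```

With this, `compile_model(model)` selects the `max-autotune-no-cudagraphs` variant, and `compile_model(model, use_cudagraphs=True)` reproduces the original configuration.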

A clean implementation that avoids .data mutation also runs successfully with mode="max-autotune":

def forward(self, x):
    weight = self.weight * self.mask
    return torch.nn.functional.conv2d(
        x,
        weight,
        self.bias,
        self.stride,
        self.padding,
        self.dilation,
        self.groups,
    )
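For completeness, the forward above can be dropped into a full layer as follows. This is a sketch under assumptions: the class name `MaskedConv2d` is made up here, the mask construction is copied from the repro, and the check runs in eager mode on CPU only, so it says nothing about compiled behaviour. It assumes the default `padding_mode="zeros"`, since it calls `F.conv2d` directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedConv2d(nn.Conv2d):
    """Masked conv that never writes to self.weight (hypothetical name)."""

    def __init__(self, in_channels, out_channels, kernel_size=7, padding=3):
        super().__init__(
            in_channels,
            out_channels,
            kernel_size=kernel_size,
            padding=padding,
        )
        # Same mask as in the repro.
        mask = torch.zeros_like(self.weight)
        mask[:, :, : kernel_size // 2, :] = 1
        mask[:, :, kernel_size // 2, : kernel_size // 2] = 1
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Out-of-place: builds a temporary masked weight on each call.
        weight = self.weight * self.mask
        return F.conv2d(
            x,
            weight,
            self.bias,
            self.stride,
            self.padding,
            self.dilation,
            self.groups,
        )


torch.manual_seed(0)
layer = MaskedConv2d(1, 4).eval()
x = torch.rand(2, 1, 32, 32)

stored_before = layer.weight.detach().clone()
with torch.no_grad():
    out = layer(x)
    # Reference: convolve with an explicitly masked copy of the weight.
    reference = F.conv2d(x, stored_before * layer.mask, layer.bias, padding=3)

# The stored parameter is left exactly as it was.
weight_untouched = torch.equal(layer.weight.detach(), stored_before)
print(out.shape, torch.allclose(out, reference), weight_untouched)
```

The trade-off is one extra elementwise multiply per call. In return, the forward pass has no side effects on module state.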

Versions

PyTorch version:  2.13.0a0+git059c270
Is debug build: True
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-59-generic-x86_64-with-glibc2.35
Is CUDA available: True

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo
