pytorch - ✅(Solved) Fix SAC not saving SDPA activations when using DDP and torch.compile [1 pull requests, 3 comments, 4 participants]

GitHub stats
pytorch/pytorch#178765 · Fetched 2026-04-08 01:52:19
Comments: 3
Participants: 4
Timeline: 69
Reactions: 0
Timeline (top): mentioned ×26, subscribed ×26, labeled ×10, commented ×3

Fix Action


PR fix notes

PR #179496: [dynamo] Fix SAC context_fn clobbered by DDPOptimizer's propagate_metadata

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #179496

When using selective activation checkpointing (SAC) with torch.compile and DDP, the SAC policy function was never called. DDPOptimizer's propagate_metadata overwrote .meta on all top-level children of the split graph, including HOP body subgraphs (wrap_body_*) that carry _checkpoint_context_fn. This caused tag_activation_checkpoint_impl to fall back to vanilla activation checkpointing (all PREFER_RECOMPUTE).

split_module hoists both partition submodules (submod_*) and HOP body subgraphs as top-level children. propagate_metadata's filter ("." not in name) matched both. The fix has split_module record which children are partitions so propagate_metadata can target only those.
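
The filter problem described above can be illustrated with a small simulation in plain Python. This is only a sketch and not the actual PyTorch code: the helper names `propagate_old` and `propagate_fixed`, the dictionaries standing in for `.meta`, and the `example_key` entry are all hypothetical. Only the child naming (`submod_*`, `wrap_body_*`), the `"." not in name` filter, and the `_checkpoint_context_fn` key come from the PR description.

```python
# Hypothetical top-level children of a split graph, named as in the PR text.
children = ["submod_0", "submod_1", "wrap_body_0"]

# Stand-in for each child's .meta; only the HOP body carries the SAC context fn.
meta = {
    "submod_0": {},
    "submod_1": {},
    "wrap_body_0": {"_checkpoint_context_fn": "user_context_fn"},
}


def propagate_old(children, meta, new_meta):
    """Simulates the old behaviour: overwrite meta on every top-level child."""
    out = {name: dict(m) for name, m in meta.items()}
    for name in children:
        if "." not in name:  # matches partitions AND hoisted HOP bodies
            out[name] = dict(new_meta)
    return out


def propagate_fixed(partition_names, meta, new_meta):
    """Simulates the fix: overwrite meta only on recorded partitions."""
    out = {name: dict(m) for name, m in meta.items()}
    for name in partition_names:
        out[name] = dict(new_meta)
    return out


new_meta = {"example_key": "value"}  # hypothetical metadata being propagated
old = propagate_old(children, meta, new_meta)
fixed = propagate_fixed(["submod_0", "submod_1"], meta, new_meta)

print("_checkpoint_context_fn" in old["wrap_body_0"])    # False: context fn lost
print("_checkpoint_context_fn" in fixed["wrap_body_0"])  # True: context fn kept
```

In the simulation, the name-based filter cannot tell a partition from a hoisted HOP body, which is why the fix relies on an explicit record of partition names instead.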

Fixes https://github.com/pytorch/pytorch/issues/178765

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @azahed98

Changed files

  • test/distributed/test_dynamo_distributed.py (modified, +40/-0)
  • torch/_dynamo/backends/distributed.py (modified, +5/-1)
  • torch/fx/passes/split_module.py (modified, +3/-0)


🐛 Describe the bug

Hi,

When using selective activation checkpointing (SAC) with torch.compile and DDP, PyTorch ignores CheckpointPolicy.MUST_SAVE directives and recomputes all operations during the backward pass.

I'm using NVIDIA's official PyTorch images. PyTorch 2.8 worked for me under the NVIDIA PyTorch image 25.06. PyTorch 2.9+ no longer works (tested with the NVIDIA PyTorch image 25.08 and later versions up to 26.02, which contains PyTorch 2.10).

I have a reproduction script that uses SDPA calls.

Running with debug logs, I can see that aten._scaled_dot_product_cudnn_attention is banned from re-materialization:

torch/_functorch/partitioners.py:1792] [4/0] Ops banned from re-materialization: OrderedSet(['aten.mm', 'aten.unbind', 'aten._scaled_dot_product_cudnn_attention', 'aten._scaled_dot_product_cudnn_attention_backward', 'aten.cat'])
[rank0]:I0329 14:04:35.765000 40085 torch/_functorch/partitioners.py:3080] [4/0] Theoretical Activations Stored: 0.00 GB
[rank0]:I0329 14:04:35.766000 40085 torch/_functorch/partitioners.py:3083] [4/0] Theoretical Per Activation Storage Sizes: [(524288, 'primals_3'), (1572864, 'primals_2'), (2097152, 'primals_1')]
[rank0]:I0329 14:04:35.766000 40085 torch/_functorch/partitioners.py:3096] [4/0] # remat/fw/bw: 18/20/45
[rank0]:I0329 14:04:35.766000 40085 torch/_functorch/partitioners.py:3105] [4/0] Count of Ops Rematerialized: [('aten.permute', 6), ('aten.view', 5), ('aten.mm', 1), ('aten.unbind', 1)]

However, when profiling with NVIDIA Nsight Systems, I see twice as many calls to the forward kernel cudnn_generated_fort_native_sdpa_sm80_flash_fprop_ as to the backward kernel cudnn_generated_fort_native_sdpa_sm80_flash_bprop_, whereas in PyTorch 2.8 the forward and backward kernels had the same number of invocations.
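
The forward/backward comparison can be automated once the kernel names have been exported from the profiler. The sketch below assumes you already have the launched kernel names as a list of strings; how that list is exported from Nsight Systems is not covered here. The helper names are hypothetical, and the two kernel name prefixes are copied from the report as written.

```python
from collections import Counter

# Kernel name prefixes as quoted in the report.
FPROP = "cudnn_generated_fort_native_sdpa_sm80_flash_fprop_"
BPROP = "cudnn_generated_fort_native_sdpa_sm80_flash_bprop_"


def count_sdpa_kernels(kernel_names):
    """Count SDPA forward (fprop) and backward (bprop) kernel launches."""
    counts = Counter()
    for name in kernel_names:
        if "flash_fprop" in name:
            counts["fprop"] += 1
        elif "flash_bprop" in name:
            counts["bprop"] += 1
    return counts


def forward_is_recomputed(kernel_names):
    """True when forward launches outnumber backward launches."""
    counts = count_sdpa_kernels(kernel_names)
    return counts["fprop"] > counts["bprop"]


# Two attention layers: the buggy trace runs each forward kernel twice.
buggy_trace = [FPROP] * 4 + [BPROP] * 2
healthy_trace = [FPROP] * 2 + [BPROP] * 2

print(forward_is_recomputed(buggy_trace))    # True
print(forward_is_recomputed(healthy_trace))  # False
```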

"""
Reproduction script for selective activation checkpoint (SAC) bug with SDPA cuDNN backend.

Run with:
    AOT_PARTITIONER_DEBUG=1 TORCH_LOGS="+aot,aot_graphs,+torch._functorch.partitioners" PYTHONUNBUFFERED=1 torchrun --nproc_per_node=1 repro_sac_bug_sdpa.py 2>&1 | tee output.txt

Check logs for 'Ops Rematerialized' - cudnn SDPA forward should NOT be in the list.
"""

import logging
logging.basicConfig(level=logging.DEBUG, format='%(name)s - %(levelname)s - %(message)s')

import os
import functools
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import CheckpointPolicy, create_selective_checkpoint_contexts, checkpoint


class SimpleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv_proj = nn.Linear(dim, 3 * dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        qkv = self.qkv_proj(x)
        qkv = qkv.view(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)  # [B, T, num_heads, head_dim]

        # Transpose to [B, num_heads, T, head_dim] for SDPA
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Force only cuDNN backend
        with torch.nn.attention.sdpa_kernel(torch.nn.attention.SDPBackend.CUDNN_ATTENTION):
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)


class Model(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_layers: int, use_sac: bool = True):
        super().__init__()
        self.layers = nn.ModuleList([
            SimpleAttention(dim, num_heads) for _ in range(num_layers)
        ])
        self.use_sac = use_sac

        # Selective activation checkpoint policy - save SDPA forward ops
        ops_to_save = {
            "aten::_scaled_dot_product_cudnn_attention",
            "aten::_scaled_dot_product_flash_attention",
            "aten::_scaled_dot_product_cudnn_attention_backward"
        }

        def policy_fn(ctx, op, *args, **kwargs):
            print(f"OP_NAME IS {op.name()}")
            if str(op) in ops_to_save or op.name() in ops_to_save:
                return CheckpointPolicy.MUST_SAVE
            return CheckpointPolicy.PREFER_RECOMPUTE

        self.context_fn = functools.partial(create_selective_checkpoint_contexts, policy_fn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = checkpoint(
                layer,
                x,
                use_reentrant=False,
                context_fn=self.context_fn,
            )
        return x


def main():
    # Initialize distributed
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    if local_rank == 0:
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        print(f"World size: {dist.get_world_size()}")

    # Model config - minimal for reproduction
    batch_size = 2
    seq_len = 1024
    dim = 512
    num_heads = 8
    num_layers = 2

    device = f"cuda:{local_rank}"
    dtype = torch.bfloat16

    # Create model
    model = Model(dim, num_heads, num_layers, use_sac=True).to(device, dtype)

    # Wrap with DDP
    model = DDP(model, device_ids=[local_rank])

    # Compile
    model = torch.compile(model)

    # Input
    x = torch.randn(batch_size, seq_len, dim, device=device, dtype=dtype, requires_grad=True)

    # Forward + backward
    if local_rank == 0:
        print("\nRunning forward pass...")
    out = model(x)

    if local_rank == 0:
        print("Running backward pass...")
    loss = out.sum()
    loss.backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
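
Because the PR notes state that the SAC policy function was never called, counting policy invocations is a quick way to tell whether the policy is reached at all. The wrapper below is a minimal sketch: `with_call_counter` and `demo_policy` are hypothetical names, and the demo uses plain strings in place of real operator objects so that it runs without PyTorch.

```python
from collections import Counter


def with_call_counter(policy_fn):
    """Wrap a SAC policy so that every invocation is counted per op."""
    calls = Counter()

    def counted_policy(ctx, op, *args, **kwargs):
        calls[str(op)] += 1
        # Forward the decision of the wrapped policy unchanged.
        return policy_fn(ctx, op, *args, **kwargs)

    return counted_policy, calls


def demo_policy(ctx, op, *args, **kwargs):
    """Stand-in policy that returns strings instead of CheckpointPolicy values."""
    if "scaled_dot_product" in str(op):
        return "MUST_SAVE"
    return "PREFER_RECOMPUTE"


counted, calls = with_call_counter(demo_policy)
decision = counted(None, "aten._scaled_dot_product_cudnn_attention.default")
other = counted(None, "aten.mm.default")

print(decision)             # MUST_SAVE
print(sum(calls.values()))  # 2
```

In the reproduction script, the wrapped policy would be passed to `functools.partial(create_selective_checkpoint_contexts, ...)` in place of `policy_fn`. If `calls` is still empty after the first compiled forward and backward pass, that suggests the policy was never consulted.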

Versions

Collecting environment information...
PyTorch version: 2.11.0a0+eb65b36914.nv26.02
Is debug build: False
CUDA used to build PyTorch: 13.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jan 22 2026, 20:57:42) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-71-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.1.115
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version: 590.48.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.19.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.19.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel
Model name: INTEL(R) XEON(R) PLATINUM 8592+
BIOS Model name: INTEL(R) XEON(R) PLATINUM 8592+ CPU @ 1.9GHz
BIOS CPU family: 179
CPU family: 6
Model: 207
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 2
BogoMIPS: 3800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 6 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 256 MiB (128 instances)
L3 cache: 640 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Vulnerable; BHI: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel-openmp==2021.4.0
[pip3] mkl==2021.1.1
[pip3] mkl-devel==2021.1.1
[pip3] mkl-include==2021.1.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.1.0
[pip3] nvidia-cuda-runtime-cu13==0.0.0a0
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvtx==0.2.14
[pip3] onnx==1.18.0
[pip3] onnx-ir==0.1.16
[pip3] onnxscript==0.6.2
[pip3] optree==0.18.0
[pip3] pytorch-triton==3.6.0+git9844da95.nv26.2
[pip3] tbb==2021.13.1
[pip3] torch==2.11.0a0+eb65b36914.nv26.2
[pip3] torch-lr-finder==0.2.2
[pip3] torch_tensorrt==2.11.0a0
[pip3] torchao==0.16.0+gita89eaab2
[pip3] torchdata==0.11.0
[pip3] torchtitan==0.2.1+git9f211ec1
[pip3] torchvision==0.25.0a0+1e53952f.nv26.2.44259020
[pip3] triton_kernels==1.0.0+git9844da95.nv26.2
[conda] Could not collect

cc @soulitzer @chauhang @penguinwu @bdhirsh @bobrenjc93 @aorenste @drisspg @liangel-02 @howardzhang-cv

extent analysis

Fix Plan

To address PyTorch ignoring `CheckpointPolicy.MUST_SAVE` directives and recomputing all operations during the backward pass when using selective activation checkpointing with `torch.compile` and DDP, follow these steps:

1. **Update PyTorch**: Use a PyTorch build that includes PR #179496, which fixes this issue. Updating cuDNN alone is unlikely to help, because the PR notes attribute the problem to DDPOptimizer's `propagate_metadata` overwriting the SAC context function, not to cuDNN.
2. **Check the checkpoint policy**: The PR notes state that the policy function was never called, so changing `policy_fn` in the `Model` class does not fix the problem by itself. Confirm only that it returns `CheckpointPolicy.MUST_SAVE` for the ops you want to keep. For example:

        def policy_fn(ctx, op, *args, **kwargs):
            ops_to_save = {
                "aten::_scaled_dot_product_cudnn_attention",
                "aten::_scaled_dot_product_flash_attention",
                "aten::_scaled_dot_product_cudnn_attention_backward",
            }
            if str(op) in ops_to_save or op.name() in ops_to_save:
                return CheckpointPolicy.MUST_SAVE
            return CheckpointPolicy.PREFER_RECOMPUTE

3. **Disable `torch.compile`**: Temporarily disable `torch.compile` to verify whether the issue is related to the compilation process.
4. **Verify the DDP configuration**: Ensure that the model is properly wrapped with `DistributedDataParallel`. According to the PR notes, the bug appears when SAC, `torch.compile`, and DDP are combined.

### Verification
To verify that the fix worked:

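1. Run the reproduction script and verify that the `CheckpointPolicy.MUST_SAVE` directives are respected. The `OP_NAME IS ...` lines printed by `policy_fn` indicate that the policy is being called.
2. Check the logs for 'Ops Rematerialized' and ensure that the cuDNN SDPA forward operation is not in the list. In the original report this list did not contain the SDPA op even though the forward kernel ran twice, so do not rely on this check alone.
3. Use NVIDIA Nsight Systems to analyze the kernel invocations and verify that the forward and backward kernels have the same number of invocations.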
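
The log check in step 2 can be scripted. The sketch below parses the `Count of Ops Rematerialized` line in the format shown in the issue logs; the helper names are hypothetical, and the parser assumes the list is printed on a single line as a Python literal, as it is in the quoted output.

```python
import ast
import re

# Matches the bracketed list that follows the log label on the same line.
REMAT_RE = re.compile(r"Count of Ops Rematerialized: (\[.*\])")


def rematerialized_ops(log_text):
    """Return {op_name: count} summed over all matching log lines."""
    ops = {}
    for match in REMAT_RE.finditer(log_text):
        for name, count in ast.literal_eval(match.group(1)):
            ops[name] = ops.get(name, 0) + count
    return ops


def sdpa_rematerialized(log_text):
    """True if any SDPA op appears in the rematerialized list."""
    return any("scaled_dot_product" in name for name in rematerialized_ops(log_text))


# Log line copied from the issue report.
sample = (
    "[rank0]:I0329 14:04:35.766000 40085 torch/_functorch/partitioners.py:3105] "
    "[4/0] Count of Ops Rematerialized: [('aten.permute', 6), ('aten.view', 5), "
    "('aten.mm', 1), ('aten.unbind', 1)]"
)

print(rematerialized_ops(sample)["aten.permute"])  # 6
print(sdpa_rematerialized(sample))                 # False
```

The sample line comes from the failing run and still reports no SDPA rematerialization, so a clean result here should be combined with the kernel-count check in step 3.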

### Extra Tips
* Ensure that the `TORCH_LOGS` environment variable is set to include `+aot,aot_graphs,+torch._functorch.partitioners` to enable detailed logging. The reproduction script also sets `AOT_PARTITIONER_DEBUG=1`.
* If the issue persists, try updating the NVIDIA driver and cuDNN versions to the latest available.
* If the issue is not resolved with the above steps, consider commenting on pytorch/pytorch#178765 or filing a new bug report with PyTorch.
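
For builds that do not yet contain the fix, one possible workaround is to turn off DDPOptimizer, since the PR notes place the metadata overwrite in DDPOptimizer's `propagate_metadata`. This suggestion is an assumption and does not come from the issue or the PR: it relies on the Dynamo setting `torch._dynamo.config.optimize_ddp`, it has not been verified against this bug, and disabling the optimizer may reduce the overlap between gradient communication and backward computation. The helper name `disable_ddp_optimizer` is hypothetical.

```python
def disable_ddp_optimizer() -> bool:
    """Turn off Dynamo's DDPOptimizer graph splitting, if PyTorch is installed.

    Returns True when the flag was set and False when torch is unavailable.
    """
    try:
        import torch._dynamo.config as dynamo_config
    except ImportError:
        return False
    # Assumed workaround: without DDPOptimizer the graph is not split per bucket.
    dynamo_config.optimize_ddp = False
    return True


flag_was_set = disable_ddp_optimizer()
print(f"optimize_ddp disabled: {flag_was_set}")
```

The call would need to run before `torch.compile(model)` in the reproduction script so that the setting is in effect when the model is compiled.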
