pytorch - 💡(How to fix) Fix [silent correctness] torch.compile catastrophically changes grad output of in-place autograd matmul

pytorch · 2026-05-17 19:09:11


🐛 Describe the bug

This silent correctness bug caused huge divergence when we attempted to enable torch.compile in our training. The compiled model produces completely different grads from eager. It might be due to using a custom autograd function, or it might be due to the in-place operations; either way, the compiled result doesn't match eager.

We first noticed this on torch 2.11.0 B200 CUDA, but I reproduced it on macOS with CPU and MPS backends (there was no WiFi on the plane okay).

# (control): two instances of same model produce same fwd and bwd output. no repro.
python -m repro
fwd outputs matched
model grads matched

# (repro): compile one of the model instances. fwd still matches, grads are completely different.
python -m repro --compile
fwd outputs matched

The grads are very different, and the gap doesn't look like typical numerical noise.

model0.lin.weight.grad
tensor([[ 1.7422,  0.0000],
        [-0.1167,  0.0000]])

model1.lin.weight.grad
tensor([[-0.6289,  0.0000],
        [-0.8594,  0.0000]])
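
As a sanity check on which side is wrong, the expected gradient can be worked out by hand in float64. This is a minimal sketch that does not use PyTorch; the helper name `weight_grad_first_column` is hypothetical. It assumes the setup in repro.py: weight = 2 * I, input rows [1, 0], target rows [1, 1], and (because dim=2 makes omega equal 1) a rotation angle equal to the position index.

```python
import math

SEQ_LEN = 8  # matches seq_len in repro.py


def weight_grad_first_column(apply_rope_backward: bool) -> tuple[float, float]:
    """Float64 gradient of the MSE loss w.r.t. lin.weight[:, 0].

    Assumes weight = 2 * I, input rows [1, 0], target rows [1, 1], angle = pos.
    """
    g_row0 = g_row1 = 0.0
    n_elems = 2 * SEQ_LEN  # mse_loss averages over all 16 output elements
    for pos in range(SEQ_LEN):
        cos, sin = math.cos(pos), math.sin(pos)
        # forward: lin gives (ty, tx) = (2, 0); rope with coeff=-1 gives (2cos, 2sin)
        out_y, out_x = 2 * cos, 2 * sin
        # d(mse)/d(out) for a target of 1
        gy = 2 * (out_y - 1) / n_elems
        gx = 2 * (out_x - 1) / n_elems
        if apply_rope_backward:
            # backward() re-applies the rope with coeff=+1 (the inverse rotation)
            gy, gx = gy * cos + sin * gx, gx * cos - sin * gy
        # each input row is [1, 0], so only column 0 of weight.grad is non-zero
        g_row0 += gy
        g_row1 += gx
    return g_row0, g_row1


with_bwd = weight_grad_first_column(apply_rope_backward=True)
without_bwd = weight_grad_first_column(apply_rope_backward=False)
print("rope backward applied:", with_bwd)
print("rope backward skipped:", without_bwd)
```

With the rope backward applied, this lands within bf16 rounding of the eager grads (1.7422, -0.1167). With it skipped, the result is close to the compiled grads (-0.6289, -0.8594). That is only arithmetic evidence that the compiled graph may be treating the custom function's backward as identity; it is not a confirmed root cause.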

repro.py

from __future__ import annotations
from argparse import ArgumentParser, Namespace
from dataclasses import dataclass
import torch
from torch import FloatTensor, IntTensor, Tensor, nn, no_grad
from torch.testing import assert_close
from torch.nn.functional import mse_loss
from os import environ, getenv
import math
from typing import Literal

# based on k-diffusion's _apply_rotary_emb_inplace
# MIT-licensed, by Katherine Crowson
# https://github.com/crowsonkb/k-diffusion/blob/21d12c91ad4550e8fcf3308ff9fe7116b3f19a08/k_diffusion/models/image_transformer_v2.py#L188C5-L188C30
def _halfrope_inplace(t: Tensor, freqs_cis: Tensor, coeff: Literal[1, -1]) -> None:
    cos, sin = freqs_cis.unbind(-1)
    ty, tx = t.unbind(-1)
    # TODO: if this is currently in half-precision, should we do the addition in float32, to reduce accumulation error?
    ty_roped = ty.mul(cos).addcmul_(sin, tx, value=coeff)
    tx_roped = tx.mul(cos).addcmul_(sin, ty, value=-coeff)
    ty.copy_(ty_roped)
    tx.copy_(tx_roped)


# based on k-diffusion's ApplyRotaryEmbeddingInplace,
# MIT-licensed, by Katherine Crowson
# https://github.com/crowsonkb/k-diffusion/blob/21d12c91ad4550e8fcf3308ff9fe7116b3f19a08/k_diffusion/models/image_transformer_v2.py#L202
class HalfRopeInPlace(torch.autograd.Function):
    # we don't currently use vmap, but maybe it will enable compile to fuse QK rope apply into flex attn operations?
    # benchmark suggests that enabling this didn't incur any additional cold-start time, so I guess it's lazy.
    generate_vmap_rule = True

    @staticmethod
    def forward(t: Tensor, freqs_cis: Tensor, coeff: Literal[1, -1] = -1):
        "NOTE: passing the coeff arg -1 explicitly gives better compiled performance than relying on arg defaulting"
        _halfrope_inplace(t, freqs_cis, coeff=coeff)
        return t

    @staticmethod
    def setup_context(ctx, inputs, output):
        t, freqs_cis, coeff = inputs
        # mark t as dirty because we will modify it in-place. this ensures backward() will be called.
        ctx.mark_dirty(t)
        ctx.save_for_backward(freqs_cis)
        # required for jvp, probably
        ctx.save_for_forward(freqs_cis)
        ctx.coeff = coeff

    @staticmethod
    def backward(ctx, grad_output):
        (freqs_cis,) = ctx.saved_tensors
        # clone made because we must "NEVER" modify grad-w.r.t-input in-place.
        # https://pytorch.org/docs/main/notes/extending.html#how-to-use
        # https://discuss.pytorch.org/t/is-it-safe-to-modify-outputs-grad-and-return-as-inputs-grad/201630
        # it seemed to work for simple cases (including ours as far as we can tell), just being cautious really.
        # this copy seems to slow down the train step by 0.4% (non-compiled) / 0.7% (compiled)
        grad_output = HalfRopeInPlace.apply(grad_output.clone(), freqs_cis, -ctx.coeff)
        return grad_output, None, None

    # we are forced to comment this out in order for compilation to succeed, due to
    # https://github.com/pytorch/pytorch/issues/180284
    # @staticmethod
    # def jvp(ctx, grad_input, *_):
    #     """
    #     none of our code currently uses this, but it gives us forward-mode autodiff support,
    #     which could be useful for implementing consistency training.
    #     """
    #     (freqs_cis,) = ctx.saved_tensors
    #     grad_input = HalfRopeInPlace.apply(grad_input, freqs_cis, ctx.coeff)
    #     return grad_input
    

def make_rot_mat(
    pos: IntTensor,
    dim: int,
    log_theta: float | FloatTensor,
    halfrot: bool,
    angular_velocity: float | FloatTensor = 1.0,
    out_dtype=torch.float32,
) -> FloatTensor:
    """
    log_theta can be a [b, 1, 1] tensor to compute multiple theta simultaneously
    angular_velocity can be a [b, 1, 1] tensor to compute multiple angular velocities simultaneously

    angular_velocity can be provided for changing the speed of the fastest rotation.
    a smaller value of angular_velocity slows down the fastest rotation.
    """
    assert dim % 2 == 0
    device = pos.device
    precise_dtype = torch.float32 if pos.device.type == 'mps' else torch.float64
    neg_scale = torch.linspace(0, 2 / dim - 1, dim // 2, dtype=precise_dtype, device=device)
    if torch.is_tensor(log_theta):
        assert log_theta.ndim >= 3
        assert log_theta.size(-1) == 1
        assert log_theta.size(-2) == 1
    if torch.is_tensor(angular_velocity):
        assert angular_velocity.ndim >= 3
        assert angular_velocity.size(-1) == 1
        assert angular_velocity.size(-2) == 1
    log_omega: FloatTensor = neg_scale * log_theta
    omega = log_omega.exp()
    angle = pos.unsqueeze(-1) * omega * angular_velocity

    cos = angle.cos().type(out_dtype)
    sin = angle.sin().type(out_dtype)
    if halfrot:
        rotmat = torch.stack(
            [
                cos,
                sin,
            ],
            dim=-1,
        )
    else:
        rotmat = torch.stack(
            [
                cos,
                -sin,
                sin,
                cos,
            ],
            dim=-1,
        ).unflatten(-1, (2, 2))
    return rotmat

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(2, 2, bias=False)

    def forward(self, x: FloatTensor, rotmat: FloatTensor) -> FloatTensor:
        x = self.lin(x)
        HalfRopeInPlace.apply(
            x.unflatten(-1, ((-1, 2))),
            rotmat,
            -1,
        )
        return x


@dataclass
class Args:
    global_rank: int
    world_size: int

    compile: bool
    "compile the peer model to reproduce the very different grad output."

    distributed: bool
    """
    use FSDP2 to shard or replicate model.
    NOTE: reproduces fine on single-GPU without distributing the model.
          I just included this mode because the program in which the bug appeared originally
          used FSDP2, so I wanted a way to reproduce FSDP2's mixed-precision hook in case that
          turned out to be relevant to the compiler's decisions.
    """

    shard: bool
    "[only relevant when using FSDP2] True = shard, False = replicate"

    @staticmethod
    def get_parser() -> ArgumentParser:
        parser = ArgumentParser()
        parser.add_argument("--compile", action="store_true", help="compile the peer model to reproduce the very different grad output.")
        parser.add_argument("--distributed", action="store_true", help="use FSDP2 to shard or replicate model.")
        parser.add_argument("--shard", action="store_true", help="[only relevant when using FSDP2] True = shard, False = replicate")
        return parser
    
    @staticmethod
    def from_namespace(namespace: Namespace) -> Args:
        global_rank=int(getenv("RANK"))
        world_size=int(getenv("WORLD_SIZE"))
        args = Args(
            global_rank=global_rank,
            world_size=world_size,
            **vars(namespace),
        )
        return args


def main(args: Args):
    device=torch.device('cpu')
    hp_dtype=torch.bfloat16
    seq_len = 8
    y = torch.ones((seq_len,), device=device, dtype=torch.float32)
    x = torch.zeros_like(y)
    vec = torch.stack([y, x], dim=-1)
    target_vec = torch.ones_like(vec, dtype=hp_dtype)

    with torch.device('meta'):
        model0 = Model()
        model1 = Model()
    for mod in (model0, model1):
        mod.to_empty(device=device)
        with no_grad():
            torch.eye(2, out=mod.lin.weight.data)
            mod.lin.weight.mul_(2)

    if args.distributed:
        import torch.distributed as dist
        from contextlib import nullcontext
        from datetime import timedelta
        from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
        from torch.distributed._composable.replicate_with_fsdp import replicate
        from torch.distributed.device_mesh import DeviceMesh
        
        dist.init_process_group(
            # backend=backends,
            # backend="nccl",
            # backend="gloo",
            init_method=None,
            world_size=args.world_size,
            rank=args.global_rank,
            # device_id=device,  # NOTE: we were once affected by https://github.com/pytorch/pytorch/issues/147568
            timeout=timedelta(seconds=2),  # 70B takes 66 sec to save.
        )
        assert dist.is_initialized(), "Failed to initialize default process group"
        mesh: list[int] = [0]
        device_mesh = DeviceMesh(
            device_type=device.type,
            mesh=mesh,
            mesh_dim_names=["fsdp"],
        )
        mp_policy = MixedPrecisionPolicy(
            param_dtype=hp_dtype,
            reduce_dtype=torch.float32,
            cast_forward_inputs=True,
        )
        for mod in (model0, model1):
            if args.shard:
                fully_shard(mod, mesh=device_mesh, mp_policy=mp_policy)
            else:
                replicate(mod, mesh=device_mesh, mp_policy=mp_policy)
        mp_ctx = nullcontext()
    else:
        from torch.amp.autocast_mode import autocast
        mp_ctx = autocast(device_type=device.type, dtype=hp_dtype)

    if args.compile:
        model1 = torch.compile(model1, dynamic=False, fullgraph=True)

    with no_grad():
        pos = torch.arange(seq_len, device=device, dtype=torch.int32)
        rotmat: FloatTensor = make_rot_mat(
            pos=pos,
            dim=2,
            log_theta = math.log(4000),
            halfrot=True,
        )

    with mp_ctx:
        out0 = model0(vec, rotmat=rotmat)
        out1 = model1(vec, rotmat=rotmat)

    # assert_close(d, d.new_tensor([[2.0, 0.0], [1.078125, 1.6796875], [-0.83203125, 1.8203125], [-1.9765625, 0.283203125], [-1.3046875, -1.515625], [0.56640625, -1.9140625], [1.921875, -0.55859375], [1.5078125, 1.3125]]))
    assert_close(out0, out1)
    print("fwd outputs matched")

    loss0 = mse_loss(out0, target_vec)
    loss1 = mse_loss(out1, target_vec)
    loss0.backward()
    loss1.backward()
    assert_close(model0.lin.weight.grad, model1.lin.weight.grad)
    # assert_close(model.lin.weight.grad, model.lin.weight.grad.new_tensor([[1.7421875, 0.0], [-0.11669921875, 0.0]]))

    print("model grads matched")

if __name__ == "__main__":
    environ.setdefault("RANK", "0")
    environ.setdefault("LOCAL_RANK", "0")
    environ.setdefault("WORLD_SIZE", "1")
    environ.setdefault("MASTER_ADDR", "127.0.0.1")
    environ.setdefault("MASTER_PORT", "2348")

    parser = Args.get_parser()
    args_untyped: Namespace = parser.parse_args()
    args: Args = Args.from_namespace(args_untyped)
    main(args)
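
For reference, backward() in repro.py re-applies the same rotation with the coefficient negated. The sketch below checks, in plain Python and without PyTorch, that this matches the vector-Jacobian product of the forward rotation. `halfrope` is an out-of-place scalar stand-in for `_halfrope_inplace`; `numeric_vjp` and the sample values are hypothetical and only there for illustration.

```python
import math


def halfrope(ty, tx, cos, sin, coeff):
    """Out-of-place scalar version of _halfrope_inplace."""
    return ty * cos + coeff * sin * tx, tx * cos - coeff * sin * ty


def numeric_vjp(ty, tx, cos, sin, coeff, gy, gx, eps=1e-6):
    """Central-difference estimate of J^T @ (gy, gx) for halfrope."""

    def scalar(a, b):
        # contract the output with the upstream gradient
        oy, ox = halfrope(a, b, cos, sin, coeff)
        return gy * oy + gx * ox

    d_ty = (scalar(ty + eps, tx) - scalar(ty - eps, tx)) / (2 * eps)
    d_tx = (scalar(ty, tx + eps) - scalar(ty, tx - eps)) / (2 * eps)
    return d_ty, d_tx


angle = 0.7  # arbitrary test angle
cos, sin = math.cos(angle), math.sin(angle)
coeff = -1
gy, gx = 0.3, -1.2  # arbitrary upstream gradient

expected = numeric_vjp(2.0, 0.5, cos, sin, coeff, gy, gx)
via_rope = halfrope(gy, gx, cos, sin, -coeff)  # what backward() computes
print(expected, via_rope)
```

The two results agree up to finite-difference error, so the eager backward is mathematically the right one. Any compiled graph that returns something else for the same inputs is not computing this VJP.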

Versions

PyTorch version: 2.11.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 26.3.1 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.2
Libc version: N/A

Python version: 3.13.2 (main, Feb  4 2025, 14:51:09) [Clang 16.0.0 (clang-1600.0.26.6)] (64-bit runtime)
Python platform: macOS-26.3.1-arm64-arm-64bit-Mach-O
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.4.4
[pip3] torch==2.11.0
[pip3] torchvision==0.26.0
[conda] Could not collect

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @jataylo @azahed98
