pytorch - 💡(How to fix) Fix `_make_wrapper_subclass` + `__torch_dispatch__` fails under `torch.compile` + `torch.vmap` with cross-device storage error [1 participants]

Q: Expected behavior

`_make_wrapper_subclass` tensor subclasses with `__torch_dispatch__` should work correctly when passed as non-batched arguments to `torch.vmap` inside `torch.compile(fullgraph=True)` on CUDA devices. The fake tensor infrastructure should handle the empty-storage nature of wrapper subclasses without attempting cross-device storage assignment.

pytorch 2026-03-10 00:38:05

GitHub stats

pytorch/pytorch#176957•Fetched 2026-04-08 00:23:47

Author

vmoens

Error Message

RuntimeError: Attempted to set the storage of a tensor on device "cuda:0" to a storage on different device "meta".

Root Cause

During Dynamo tracing, the fake tensor infrastructure creates meta-device versions of input tensors. For `_make_wrapper_subclass` tensors:

1. `__tensor_flatten__` extracts the inner `_data` tensor (on cuda:0).
2. A fake (meta-device) version of `_data` is created.
3. `__tensor_unflatten__` reconstructs the wrapper with device=meta (from the fake inner tensor).
4. vmap then clones or manipulates non-batched inputs. At some point during this process, the infrastructure attempts to set storage from the real cuda:0 tensor onto the meta-device wrapper (or vice versa), causing the cross-device error.

The wrapper subclass has no backing storage of its own (_make_wrapper_subclass creates a tensor with empty storage). The device mismatch arises because vmap's internal batching logic manipulates tensor storage directly, which is incompatible with the empty-storage nature of wrapper subclasses during fake tensor tracing.
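
The device hand-off described above can be reproduced by hand on CPU. The following is a minimal sketch, not the code path Dynamo actually runs: it uses a trimmed copy of the reproducer's `TaggedTensor`, and a plain `meta` tensor stands in for the fake inner tensor (an assumption; a real `FakeTensor` may report its device differently). It only shows that the wrapper's device follows whatever inner tensor `__tensor_unflatten__` receives.

```python
import torch


class TaggedTensor(torch.Tensor):
    """Trimmed copy of the reproducer's wrapper, kept only for this demo."""

    __torch_function__ = torch._C._disabled_torch_function_impl

    @staticmethod
    def __new__(cls, data):
        # The wrapper copies shape, dtype, layout, and device from `data`.
        r = torch.Tensor._make_wrapper_subclass(
            cls,
            data.shape,
            dtype=data.dtype,
            layout=data.layout,
            device=data.device,
            requires_grad=data.requires_grad,
            strides=data.stride(),
            storage_offset=data.storage_offset(),
        )
        r._data = data
        return r

    def __tensor_flatten__(self):
        return ["_data"], {}

    @classmethod
    def __tensor_unflatten__(cls, inner_tensors, metadata, outer_size, outer_stride):
        return cls(inner_tensors["_data"])

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        # Not exercised below; no tensor op is run on the wrapper in this demo.
        raise NotImplementedError(f"{func} is not needed for this demo")


# CPU stands in for cuda:0 so the sketch runs without a GPU.
real = TaggedTensor(torch.tensor([1.0, 2.0, 3.0, 4.0]))
attrs, ctx = real.__tensor_flatten__()

# Stand-in for the fake inner tensor: same shape and dtype, but on "meta".
meta_inner = torch.empty_like(real._data, device="meta")
rebuilt = TaggedTensor.__tensor_unflatten__(
    {attrs[0]: meta_inner}, ctx, meta_inner.shape, meta_inner.stride()
)

print(real.device, rebuilt.device)
```

The original wrapper and the rebuilt one end up on different devices even though they describe the same logical tensor, which is the precondition for the cross-device storage assignment reported in the error.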

Fix Action

Fix / Workaround

A transparent tensor subclass implemented with `_make_wrapper_subclass` + `__torch_dispatch__` + `__tensor_flatten__`/`__tensor_unflatten__` triggers a device mismatch error when used as a non-batched input to `torch.vmap` inside a `torch.compile(fullgraph=True)` region on CUDA.

The issue text does not include a confirmed fix or workaround. It states only the expected behavior: such subclasses should work in this setting, with the fake tensor infrastructure handling their empty storage without attempting cross-device storage assignment.

Code Example

RuntimeError: Attempted to set the storage of a tensor on device "cuda:0"
to a storage on different device "meta".

---

import torch

class TaggedTensor(torch.Tensor):
    """Minimal transparent wrapper — adds a type tag, nothing else."""

    __torch_function__ = torch._C._disabled_torch_function_impl

    @staticmethod
    def __new__(cls, data):
        if isinstance(data, cls):
            return data
        if not isinstance(data, torch.Tensor):
            data = torch.as_tensor(data)
        kwargs = {}
        if data.layout == torch.strided:
            kwargs["strides"] = data.stride()
            kwargs["storage_offset"] = data.storage_offset()
        r = torch.Tensor._make_wrapper_subclass(
            cls,
            data.shape,
            dtype=data.dtype,
            layout=data.layout,
            device=data.device,
            requires_grad=data.requires_grad,
            **kwargs,
        )
        r._data = data
        return r

    def __tensor_flatten__(self):
        return ["_data"], {}

    @classmethod
    def __tensor_unflatten__(cls, inner_tensors, metadata, outer_size, outer_stride):
        return cls(inner_tensors["_data"])

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}

        def unwrap(x):
            if isinstance(x, TaggedTensor):
                return x._data
            if isinstance(x, (list, tuple)):
                return type(x)(unwrap(a) for a in x)
            return x

        result = func(*unwrap(args), **unwrap(kwargs))

        if isinstance(result, torch.Tensor):
            return cls(result)
        if isinstance(result, (tuple, list)):
            return type(result)(
                cls(r) if isinstance(r, torch.Tensor) else r for r in result
            )
        return result


def fn(x, tag):
    # tag is non-batched (in_dims=None), x is batched
    return x + tag

vmapped = torch.vmap(fn, in_dims=(0, None))

x = torch.randn(8, 4, device="cuda")
tag = TaggedTensor(torch.tensor([1.0, 2.0, 3.0, 4.0], device="cuda"))

# Works in eager:
result_eager = vmapped(x, tag)

# Fails under compile:
compiled = torch.compile(vmapped, fullgraph=True)
result_compiled = compiled(x, tag)  # RuntimeError: cross-device storage

🐛 Describe the bug

RuntimeError: Attempted to set the storage of a tensor on device "cuda:0"
to a storage on different device "meta".

This makes it impossible to build a zero-overhead transparent tensor subclass that works correctly under Dynamo + vmap on CUDA, which is the recommended pattern per the PyTorch tensor subclass documentation.

Related

PyTorch tensor subclass documentation
torch._subclasses.meta_utils.empty_create_subclass (where the unflatten happens during fake tensor creation)
torch._subclasses.fake_tensor.FakeTensorConverter.from_real_tensor

Versions

PyTorch: 2.12.0.dev20260223 (nightly)
Python: 3.12
CUDA: 13.0
OS: Linux

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @Chillee @ezyang @albanD @samdow @chauhang @penguinwu @kshitij12345 @bdhirsh @bobrenjc93 @aorenste

extent analysis

Fix Plan

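1. Modify `__tensor_unflatten__` to preserve the original device

As defined in the reproducer, `TaggedTensor.__new__` accepts only `data` and already copies the wrapper's device from `data.device`, so calling `cls(data, device=...)` would raise a `TypeError`. The method below omits the unsupported keyword; the wrapper is rebuilt on whatever device the inner tensor reports. This matches the reproducer's behavior, so the step has not been shown to resolve the error on its own.

@classmethod
def __tensor_unflatten__(cls, inner_tensors, metadata, outer_size, outer_stride):
    # __new__ reads shape, dtype, layout, and device from the inner tensor.
    return cls(inner_tensors["_data"])

2. Update `__torch_dispatch__` to handle the device attribute

Results are re-wrapped with `cls(...)`, which again takes the device from each result tensor.

@classmethod
def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
    if kwargs is None:
        kwargs = {}

    def unwrap(x):
        # Replace wrappers with their inner tensors, recursing into sequences.
        if isinstance(x, TaggedTensor):
            return x._data
        if isinstance(x, (list, tuple)):
            return type(x)(unwrap(a) for a in x)
        return x

    result = func(*unwrap(args), **unwrap(kwargs))

    # Re-wrap tensor outputs; the wrapper's device comes from the result itself.
    if isinstance(result, torch.Tensor):
        return cls(result)
    if isinstance(result, (tuple, list)):
        return type(result)(
            cls(r) if isinstance(r, torch.Tensor) else r for r in result
        )
    return result

3. Update the `fn` function to handle the device attribute

def fn(x, tag):
    # tag is non-batched (in_dims=None), x is batched
    return x + tag

No changes are needed here: `fn` never touches the device, which is derived inside `TaggedTensor.__new__`.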

Verification

Run the reproducer code with the updated TaggedTensor class.
Verify that the torch.compile(fullgraph=True) region no longer raises a RuntimeError.
Check that the output of the vmapped function is correct.

Extra Tips

Make sure to update the TaggedTensor class in all relevant places where it is used.
