pytorch - 💡(How to fix) Fix [TensorIterator] reorder_dimensions is sensitive to operand order for commutative ops [1 participants]

Code Example

import torch

a = torch.randn(4, 3, 2).permute(0, 2, 1)    # shape [4,2,3], strides (6,1,2)
b = torch.randn(4, 2, 3)                     # contiguous,    strides (6,3,1)

c1 = a + b
c2 = b + a

print(c1.stride(), c1.is_contiguous())   # (6, 1, 2) False
print(c2.stride(), c2.is_contiguous())   # (6, 3, 1) True
assert torch.equal(c1, c2)               # same values, different layout

---

import torch
import time

D = 256
a = torch.randn(D, D, D)
b = torch.randn(D, D, D).permute(2, 0, 1)

def bench(fn, warmup=20, iters=200):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000

print("A strides:", a.stride())
print("B strides:", b.stride())
print(f"a + b: {bench(lambda: a + b):.3f} ms")
print(f"b + a: {bench(lambda: b + a):.3f} ms")

---

A strides: (65536, 256, 1)
B strides: (1, 65536, 256)
a + b: 43.308 ms
b + a: 71.202 ms

🚀 The feature, motivation and pitch

While studying the TensorIterator internals, I noticed that should_swap in reorder_dimensions short-circuits on the first operand that has a decisive stride comparison. This means that for commutative operations like addition, a + b and b + a can produce output tensors with different memory layouts (different strides), even though the values are identical.

Reproducer (layout asymmetry):

import torch

a = torch.randn(4, 3, 2).permute(0, 2, 1)    # shape [4,2,3], strides (6,1,2)
b = torch.randn(4, 2, 3)                     # contiguous,    strides (6,3,1)

c1 = a + b
c2 = b + a

print(c1.stride(), c1.is_contiguous())   # (6, 1, 2) False
print(c2.stride(), c2.is_contiguous())   # (6, 3, 1) True
assert torch.equal(c1, c2)               # same values, different layout

The cause is that should_swap iterates over operands and returns as soon as one has non-zero, non-equal strides for both dimensions being compared. When inputs disagree about which dimension should be innermost, whichever input was added to the iterator first (the left operand of +) dictates the iteration order and therefore the allocated output's memory layout.

This has two consequences:

Symmetry: commutative operations produce results with different strides() depending on operand order, which can be surprising.
Downstream performance: a non-contiguous intermediate tensor propagates layout costs to subsequent operations that consume it.

The asymmetry becomes more pronounced with higher-rank tensors. For example, with 256x256x256 tensors where one is permuted via .permute(2, 0, 1), the operand order can determine whether the "losing" operand's inner-loop stride is 256 elements (1KB jumps for float32) or 65536 elements (256KB jumps).

Reproducer (performance asymmetry):

import torch
import time

D = 256
a = torch.randn(D, D, D)
b = torch.randn(D, D, D).permute(2, 0, 1)

def bench(fn, warmup=20, iters=200):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000

print("A strides:", a.stride())
print("B strides:", b.stride())
print(f"a + b: {bench(lambda: a + b):.3f} ms")
print(f"b + a: {bench(lambda: b + a):.3f} ms")

Output in my computer (Linux x86_64, CPU: Intel i5-12450H (8C/12T, 4.4GHz boost), L2: 7MB, L3: 12MB)

A strides: (65536, 256, 1)
B strides: (1, 65536, 256)
a + b: 43.308 ms
b + a: 71.202 ms

Suggestion:

Instead of short-circuiting on the first operand with a definitive answer, accumulate stride preferences across all input operands (e.g. compare total weighted stride cost per candidate ordering). When the output is undefined and costs are tied or close, prefer the ordering that produces a C-contiguous output.

This would not affect the reduction special case, where reduced dims must be innermost as a correctness constraint.

I understand this is an extremely load-bearing code path and any change would need thorough benchmarking to ensure no regressions. Happy to provide benchmarks or a draft implementation if there's interest.

Alternatives

The current behavior could be documented as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs. Users who care about output contiguity can call .contiguous() on the result.

Another alternative would be to only apply the symmetric voting logic when the output is undefined (will be allocated by the iterator), keeping the current first-operand-wins behavior when the output is pre-allocated.

Additional context

Relevant code: should_swap lambda inside TensorIterator::reorder_dimensions (aten/src/ATen/TensorIterator.cpp)

Tested on PyTorch 2.11.0, CPU.

cc @jerryzh168

extent analysis

TL;DR

Modify the should_swap function in reorder_dimensions to accumulate stride preferences across all input operands instead of short-circuiting on the first decisive operand.

Guidance

Review the should_swap lambda function inside TensorIterator::reorder_dimensions to understand the current implementation and identify the short-circuiting behavior.
Consider implementing a voting system to accumulate stride preferences from all input operands, and prefer the ordering that produces a C-contiguous output when costs are tied or close.
Thoroughly benchmark any proposed changes to ensure no regressions in performance.
Document the current behavior as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs, as an alternative solution.

Example

# Example of how the voting system could be implemented
def accumulate_stride_preferences(operands):
    # Initialize a dictionary to store the stride preferences
    preferences = {}
    for operand in operands:
        # Calculate the stride cost for each operand
        stride_cost = calculate_stride_cost(operand)
        # Update the preferences dictionary
        for dim, cost in stride_cost.items():
            if dim not in preferences:
                preferences[dim] = {}
            if cost not in preferences[dim]:
                preferences[dim][cost] = 0
            preferences[dim][cost] += 1
    # Determine the preferred ordering based on the accumulated preferences
    preferred_ordering = determine_preferred_ordering(preferences)
    return preferred_ordering

Notes

The proposed change may have performance implications, and thorough benchmarking is necessary to ensure no regressions. The current behavior could be documented as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs, as an alternative solution.

Recommendation

Apply the workaround by modifying the should_swap function to accumulate stride preferences across all input operands, as this approach addresses the symmetry and performance issues while minimizing the risk of introducing

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

pytorch - 💡(How to fix) Fix [TensorIterator] reorder_dimensions is sensitive to operand order for commutative ops [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Code Example

🚀 The feature, motivation and pitch

Alternatives

Additional context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

pytorch - 💡(How to fix) Fix [TensorIterator] reorder_dimensions is sensitive to operand order for commutative ops [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Code Example

🚀 The feature, motivation and pitch

Alternatives

Additional context

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING