pytorch - 💡(How to fix) Fix [TensorIterator] reorder_dimensions is sensitive to operand order for commutative ops [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
pytorch/pytorch#181488Fetched 2026-04-26 05:05:40
View on GitHub
Comments
0
Participants
1
Timeline
11
Reactions
0
Author
Participants
Timeline (top)
labeled ×5mentioned ×3subscribed ×3

Code Example

import torch

a = torch.randn(4, 3, 2).permute(0, 2, 1)    # shape [4,2,3], strides (6,1,2)
b = torch.randn(4, 2, 3)                     # contiguous,    strides (6,3,1)

c1 = a + b
c2 = b + a

print(c1.stride(), c1.is_contiguous())   # (6, 1, 2) False
print(c2.stride(), c2.is_contiguous())   # (6, 3, 1) True
assert torch.equal(c1, c2)               # same values, different layout

---

import torch
import time

D = 256
a = torch.randn(D, D, D)
b = torch.randn(D, D, D).permute(2, 0, 1)

def bench(fn, warmup=20, iters=200):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000

print("A strides:", a.stride())
print("B strides:", b.stride())
print(f"a + b: {bench(lambda: a + b):.3f} ms")
print(f"b + a: {bench(lambda: b + a):.3f} ms")

---

A strides: (65536, 256, 1)
B strides: (1, 65536, 256)
a + b: 43.308 ms
b + a: 71.202 ms
RAW_BUFFERClick to expand / collapse

🚀 The feature, motivation and pitch

While studying the TensorIterator internals, I noticed that should_swap in reorder_dimensions short-circuits on the first operand that has a decisive stride comparison. This means that for commutative operations like addition, a + b and b + a can produce output tensors with different memory layouts (different strides), even though the values are identical.

Reproducer (layout asymmetry):

import torch

a = torch.randn(4, 3, 2).permute(0, 2, 1)    # shape [4,2,3], strides (6,1,2)
b = torch.randn(4, 2, 3)                     # contiguous,    strides (6,3,1)

c1 = a + b
c2 = b + a

print(c1.stride(), c1.is_contiguous())   # (6, 1, 2) False
print(c2.stride(), c2.is_contiguous())   # (6, 3, 1) True
assert torch.equal(c1, c2)               # same values, different layout

The cause is that should_swap iterates over operands and returns as soon as one has non-zero, non-equal strides for both dimensions being compared. When inputs disagree about which dimension should be innermost, whichever input was added to the iterator first (the left operand of +) dictates the iteration order and therefore the allocated output's memory layout.

This has two consequences:

  1. Symmetry: commutative operations produce results with different strides() depending on operand order, which can be surprising.

  2. Downstream performance: a non-contiguous intermediate tensor propagates layout costs to subsequent operations that consume it.

The asymmetry becomes more pronounced with higher-rank tensors. For example, with 256x256x256 tensors where one is permuted via .permute(2, 0, 1), the operand order can determine whether the "losing" operand's inner-loop stride is 256 elements (1KB jumps for float32) or 65536 elements (256KB jumps).

Reproducer (performance asymmetry):

import torch
import time

D = 256
a = torch.randn(D, D, D)
b = torch.randn(D, D, D).permute(2, 0, 1)

def bench(fn, warmup=20, iters=200):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000

print("A strides:", a.stride())
print("B strides:", b.stride())
print(f"a + b: {bench(lambda: a + b):.3f} ms")
print(f"b + a: {bench(lambda: b + a):.3f} ms")

Output in my computer (Linux x86_64, CPU: Intel i5-12450H (8C/12T, 4.4GHz boost), L2: 7MB, L3: 12MB)

A strides: (65536, 256, 1)
B strides: (1, 65536, 256)
a + b: 43.308 ms
b + a: 71.202 ms

Suggestion:

Instead of short-circuiting on the first operand with a definitive answer, accumulate stride preferences across all input operands (e.g. compare total weighted stride cost per candidate ordering). When the output is undefined and costs are tied or close, prefer the ordering that produces a C-contiguous output.

This would not affect the reduction special case, where reduced dims must be innermost as a correctness constraint.

I understand this is an extremely load-bearing code path and any change would need thorough benchmarking to ensure no regressions. Happy to provide benchmarks or a draft implementation if there's interest.

Alternatives

The current behavior could be documented as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs. Users who care about output contiguity can call .contiguous() on the result.

Another alternative would be to only apply the symmetric voting logic when the output is undefined (will be allocated by the iterator), keeping the current first-operand-wins behavior when the output is pre-allocated.

Additional context

Relevant code: should_swap lambda inside TensorIterator::reorder_dimensions (aten/src/ATen/TensorIterator.cpp)

Tested on PyTorch 2.11.0, CPU.

cc @jerryzh168

extent analysis

TL;DR

Modify the should_swap function in reorder_dimensions to accumulate stride preferences across all input operands instead of short-circuiting on the first decisive operand.

Guidance

  • Review the should_swap lambda function inside TensorIterator::reorder_dimensions to understand the current implementation and identify the short-circuiting behavior.
  • Consider implementing a voting system to accumulate stride preferences from all input operands, and prefer the ordering that produces a C-contiguous output when costs are tied or close.
  • Thoroughly benchmark any proposed changes to ensure no regressions in performance.
  • Document the current behavior as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs, as an alternative solution.

Example

# Example of how the voting system could be implemented
def accumulate_stride_preferences(operands):
    # Initialize a dictionary to store the stride preferences
    preferences = {}
    for operand in operands:
        # Calculate the stride cost for each operand
        stride_cost = calculate_stride_cost(operand)
        # Update the preferences dictionary
        for dim, cost in stride_cost.items():
            if dim not in preferences:
                preferences[dim] = {}
            if cost not in preferences[dim]:
                preferences[dim][cost] = 0
            preferences[dim][cost] += 1
    # Determine the preferred ordering based on the accumulated preferences
    preferred_ordering = determine_preferred_ordering(preferences)
    return preferred_ordering

Notes

The proposed change may have performance implications, and thorough benchmarking is necessary to ensure no regressions. The current behavior could be documented as intentional, with a note that operand order may affect output memory layout for non-contiguous inputs, as an alternative solution.

Recommendation

Apply the workaround by modifying the should_swap function to accumulate stride preferences across all input operands, as this approach addresses the symmetry and performance issues while minimizing the risk of introducing

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

pytorch - 💡(How to fix) Fix [TensorIterator] reorder_dimensions is sensitive to operand order for commutative ops [1 participants]