pytorch: ✅ (Solved) `torch.compile` does not preserve `F.pad` output layout on channels-last input [1 pull request, 1 comment, 2 participants]

GitHub stats
pytorch/pytorch#179442 (fetched 2026-04-08 02:51:41)
Comments: 1
Participants: 2
Timeline events: 31
Reactions: 0
Timeline (top): mentioned ×12, subscribed ×12, labeled ×6, commented ×1

PR fix notes

PR #179837: Fix reflection/replication pad stride mismatch under torch.compile

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/179442. The _reflection_or_replication_pad decomposition uses _unsafe_index, which can produce non-standard strides from channels_last inputs. The existing memory format correction called suggest_memory_format(result), but because the strides of the _unsafe_index output don't reliably reflect the desired format, the correction could select the wrong output layout.

Fix: use the original input's memory format to decide the output format. On CUDA, the eager C++ kernel always returns contiguous regardless of input format, so force contiguous_format there. On CPU, preserve the input's memory format (e.g. channels_last) to match eager.
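
The rule described above can be sketched roughly as follows. This is not the code from the PR: `pick_output_memory_format` is a hypothetical helper name, the channels_last check is a simplification, and the `torch.empty` tensor only stands in for the decomposition's result.

```python
import torch

def pick_output_memory_format(inp: torch.Tensor) -> torch.memory_format:
    """Hypothetical helper: choose the output layout from the *input* tensor."""
    # Per the PR description, eager CUDA returns contiguous output regardless
    # of the input format, so the compiled path should do the same there.
    if inp.device.type == "cuda":
        return torch.contiguous_format
    # On CPU, follow the input: keep channels_last if the input uses it.
    if inp.dim() == 4 and inp.is_contiguous(memory_format=torch.channels_last):
        return torch.channels_last
    return torch.contiguous_format

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
fmt = pick_output_memory_format(x)

# Stand-in for the padded result with default (contiguous) strides.
padded = torch.empty(2, 3, 7, 8)
out = padded.contiguous(memory_format=fmt)
print(fmt, out.stride())
```

The point of the sketch is only that the decision depends on the input tensor and device, not on the strides of the intermediate result.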

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_torchinductor.py (modified, +11/-0)
  • torch/_decomp/decompositions.py (modified, +14/-2)


🐛 Describe the bug

torch.compile changes the output layout of F.pad on a dense channels_last input. Eager and compiled produce the same values, but the compiled result has different stride() and different is_contiguous(memory_format=torch.channels_last) behavior. This also reproduces with backend="aot_eager_decomp_partition". Repro:

import torch
import torch.nn.functional as F

def fn(x):
    return F.pad(x, (1, 2, 2, 1), mode="reflect")

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

eager = fn(x.clone())
compiled = torch.compile(fn, backend="aot_eager_decomp_partition")(x.clone())

print("eager stride   =", eager.stride())
print("compiled stride=", compiled.stride())
print("eager channels_last   =", eager.is_contiguous(memory_format=torch.channels_last))
print("compiled channels_last=", compiled.is_contiguous(memory_format=torch.channels_last))

output:

eager stride   = (168, 1, 24, 3)
compiled stride= (168, 56, 8, 1)
eager channels_last   = True
compiled channels_last= False
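
To see where the two stride tuples come from, the following sketch computes the padded shape and builds empty tensors in both layouts. `padded_shape` is a hypothetical helper written for this illustration; it assumes the pad tuple lists (left, right) pairs starting from the last dimension, as `F.pad` does.

```python
import torch

def padded_shape(shape, pad):
    """Hypothetical helper: output shape of F.pad for a given pad tuple."""
    out = list(shape)
    # pad = (last_dim_left, last_dim_right, second_last_left, second_last_right, ...)
    for i in range(len(pad) // 2):
        out[-(i + 1)] += pad[2 * i] + pad[2 * i + 1]
    return tuple(out)

shape = padded_shape((2, 3, 4, 5), (1, 2, 2, 1))

# Same shape, two layouts: default contiguous (NCHW) and channels_last (NHWC).
dense = torch.empty(shape)
nhwc = torch.empty(shape, memory_format=torch.channels_last)

print("shape                 =", shape)
print("contiguous stride     =", dense.stride())
print("channels_last stride  =", nhwc.stride())
```

The contiguous strides match the compiled output in the report, and the channels_last strides match the eager output.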

Versions

PyTorch version: 2.10.0+cpu

cc @jamesr66a @chauhang @penguinwu @bdhirsh @bobrenjc93 @aorenste

extent analysis

TL;DR

Until your PyTorch build includes a fix such as PR #179837, convert the compiled F.pad output back to channels_last explicitly when downstream code depends on that layout.

Guidance

  • Compare stride() and is_contiguous(memory_format=torch.channels_last) on the eager and compiled outputs; the input x is channels_last in both cases, so the difference lies in the output.
  • Check whether your PyTorch build includes PR #179837, which changes the reflection/replication pad decomposition to follow the input's memory format.
  • If downstream code requires channels_last, convert the compiled output explicitly with .to(memory_format=torch.channels_last) or .contiguous(memory_format=torch.channels_last).
  • Test other backends, such as "eager", to see whether the mismatch is specific to backends that apply decompositions, such as "aot_eager_decomp_partition".
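
The layout comparison in the first bullet can be wrapped in a small helper. This is a minimal sketch; `layout_mismatches` is a hypothetical name, and the two tensors at the end are hand-built stand-ins for the eager and compiled outputs.

```python
import torch

def layout_mismatches(expected: torch.Tensor, actual: torch.Tensor) -> list:
    """Hypothetical helper: describe layout differences between two tensors."""
    problems = []
    if expected.stride() != actual.stride():
        problems.append(f"stride: {expected.stride()} vs {actual.stride()}")
    formats = [("contiguous_format", torch.contiguous_format)]
    # channels_last is only meaningful for 4-D tensors.
    if expected.dim() == 4 and actual.dim() == 4:
        formats.append(("channels_last", torch.channels_last))
    for name, fmt in formats:
        e = expected.is_contiguous(memory_format=fmt)
        a = actual.is_contiguous(memory_format=fmt)
        if e != a:
            problems.append(f"{name}: {e} vs {a}")
    return problems

# Stand-ins with the shapes and layouts reported in the issue.
eager_like = torch.empty(2, 3, 7, 8, memory_format=torch.channels_last)
compiled_like = torch.empty(2, 3, 7, 8)

for line in layout_mismatches(eager_like, compiled_like):
    print(line)
```

In practice the two arguments would be the eager result and the result of the compiled function.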

Example

import torch
import torch.nn.functional as F

def fn(x):
    return F.pad(x, (1, 2, 2, 1), mode="reflect")

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)

eager = fn(x.clone())
compiled = torch.compile(fn, backend="aot_eager_decomp_partition")(x.clone())

# Explicitly set the memory format of the compiled tensor
compiled = compiled.to(memory_format=torch.channels_last)

print("eager stride   =", eager.stride())
print("compiled stride=", compiled.stride())
print("eager channels_last   =", eager.is_contiguous(memory_format=torch.channels_last))
print("compiled channels_last=", compiled.is_contiguous(memory_format=torch.channels_last))
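
If the conversion is needed in several places, it can be factored into a wrapper. The sketch below is an assumption-laden illustration, not part of the issue or the PR: `preserve_channels_last` is a hypothetical decorator, and `dense_pad` is a stand-in that mimics the reported behavior by forcing a contiguous result.

```python
import functools

import torch
import torch.nn.functional as F

def preserve_channels_last(fn):
    """Hypothetical wrapper: give the output the input's channels_last layout."""
    @functools.wraps(fn)
    def wrapper(x, *args, **kwargs):
        out = fn(x, *args, **kwargs)
        # Only convert when the input is a 4-D channels_last tensor.
        if x.dim() == 4 and x.is_contiguous(memory_format=torch.channels_last):
            out = out.contiguous(memory_format=torch.channels_last)
        return out
    return wrapper

def dense_pad(x):
    # Stand-in for the affected compiled function: same values,
    # but the result is forced into the default contiguous layout.
    return F.pad(x, (1, 2, 2, 1), mode="reflect").contiguous()

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
fixed = preserve_channels_last(dense_pad)(x)
print("wrapped stride =", fixed.stride())
```

The same wrapper could be applied to the result of torch.compile(fn, ...) in place of dense_pad. The conversion is a no-op when the output is already channels_last.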

Notes

Per PR #179837, the mismatch comes from the _reflection_or_replication_pad decomposition: it builds the result with _unsafe_index, whose output strides do not reliably reflect the input's memory format. The example above is a workaround that restores the layout after the fact; it does not change the decomposition itself.

Recommendation

Apply the workaround: convert the compiled output to channels_last explicitly, as shown in the example. Eager and compiled already produce the same values, so the conversion only changes the layout, at the cost of one extra copy when the layouts differ.
