pytorch - ✅(Solved) Fix Flaky error ordering in reentrant backward test: test_reentrant_parent_error_on_cpu [1 pull requests, 1 participants]

pytorch/pytorch#179703 (fetched 2026-04-09 07:50:26)

test_reentrant_parent_error_on_cpu is intended to validate that an intentional backward error ("Simulate error on backward pass") is surfaced when parent and reentrant child backward paths run concurrently.

However, there is another independent failure source in the same test:

  • reentrant backward calls .backward() on a non-scalar tensor without an explicit gradient.

This can produce:

  • RuntimeError: grad can be implicitly created only for scalar outputs

Because backward task scheduling/order is nondeterministic, either error may appear first. This makes the test flaky.

Error Message

RuntimeError: grad can be implicitly created only for scalar outputs

Root Cause

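The reentrant backward in the test calls .backward() on a non-scalar tensor without an explicit gradient, which adds a second, unintended error source. Because backward task scheduling is nondeterministic, that error can surface before the intended simulated error, which makes the test flaky.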
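
The ordering problem can be pictured with a toy model. The sketch below is not the autograd engine: it only assumes two independent workers that each report an error after a small random delay, and records whichever reports first. The function name `first_error` and the delay range are hypothetical.

```python
import queue
import random
import threading
import time

INTENDED = "Simulate error on backward pass"
ACCIDENTAL = "grad can be implicitly created only for scalar outputs"


def first_error(sources):
    """Start one worker per error source and return whichever reports first."""
    reported = queue.Queue()

    def worker(message):
        # Random delay as a stand-in for scheduling jitter (assumption).
        time.sleep(random.uniform(0.0, 0.01))
        reported.put(message)

    threads = [threading.Thread(target=worker, args=(m,)) for m in sources]
    for thread in threads:
        thread.start()
    winner = reported.get()  # blocks until the first worker reports
    for thread in threads:
        thread.join()
    return winner


# Two error sources: either message may come first.
before = {first_error([INTENDED, ACCIDENTAL]) for _ in range(20)}
# One error source: the outcome no longer depends on timing.
after = {first_error([INTENDED]) for _ in range(20)}
```

With both sources present, `before` can contain either message; with the accidental source removed, `after` contains only the intended one.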

Fix Action

Fix / Workaround

Pass the incoming gradient explicitly to the reentrant backward call: in ReentrantFunc.backward, change reentrant_root.backward() to reentrant_root.backward(grad).

PR fix notes

PR #179704: Fix flaky reentrant backward test by passing explicit grad for non-scalar output

Description (problem / solution / changelog)

Fixes #179703

More detailed description of the problem is in the issue #179703

What this PR changes

In test/test_autograd.py, _test_reentrant_parent_error_on_cpu defines a reentrant autograd function whose backward currently does:

reentrant_root.backward()

This PR changes it to:

reentrant_root.backward(grad)

Why this solution

reentrant_root is non-scalar ([3, 3]). For non-scalar outputs, .backward() requires an explicit gradient seed. Calling .backward() without that seed can raise:

RuntimeError: grad can be implicitly created only for scalar outputs
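
A minimal sketch of the rule, independent of the test itself. The tensors `x` and `y` are illustrative names, not taken from test/test_autograd.py; the only assumption is a working PyTorch install.

```python
import torch

x = torch.rand(3, 3, requires_grad=True)
y = x * 2  # non-scalar output, shape [3, 3]

# Without a gradient seed, backward() refuses to run on a non-scalar tensor.
try:
    y.backward()
    implicit_error = None
except RuntimeError as exc:
    implicit_error = str(exc)

# With an explicit seed of the same shape, the same call succeeds.
y.backward(torch.ones_like(y))
```

Here `implicit_error` holds the message quoted above, and `x.grad` ends up filled with 2.0 because each element of `y` is twice the matching element of `x`.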

Effect on test behavior

Before this change, the test had two competing error sources:

  1. Intended error: "Simulate error" from SimulateBackwardError
  2. Accidental error: non-scalar .backward() without explicit grad

Because execution order is nondeterministic, CI could intermittently fail when (2) happened first.

After this change, the accidental error source is removed. The test now consistently validates only the intended behavior.
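
The intermittent CI failure has the shape of a regex mismatch inside an exception assertion. The sketch below assumes the test checks the error with unittest's `assertRaisesRegex` (consistent with the AssertionError text reported in the issue) and uses a hypothetical helper, `raise_accidental`, to stand in for the case where the accidental error wins the race.

```python
import unittest

case = unittest.TestCase()


def raise_accidental():
    # Stand-in for the backward pass when the accidental error surfaces first.
    raise RuntimeError("grad can be implicitly created only for scalar outputs")


try:
    # The test expects the simulated error, but a different RuntimeError arrives.
    with case.assertRaisesRegex(RuntimeError, "Simulate error"):
        raise_accidental()
    mismatch = None
except AssertionError as exc:
    mismatch = str(exc)
```

An exception of the right type but with the wrong message does not satisfy the assertion, so the test fails even though an error was raised.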

Changed files

  • test/test_autograd.py (modified, +1/-1)

Code Example

class ReentrantFunc(Function):
    @staticmethod
    def backward(ctx, grad):
        reentrant_root.backward()   # non-scalar output, no grad argument
        return grad

---

# original
# t5 = TestAutograd.SimulateBackwardError.apply(t4)

# temporary debug change
t5 = t4

---

RuntimeError: grad can be implicitly created only for scalar outputs

---

python -m pytest -vs test/test_autograd.py -k test_reentrant_parent_error_on_cpu

---

AssertionError: "Simulate error" does not match "grad can be implicitly created only for scalar outputs"

---

# Before (buggy: raises on the non-scalar output when ReentrantFunc runs first):
reentrant_root.backward()

# After (correct: explicit gradient seed for the non-scalar output):
reentrant_root.backward(grad)

🐛 Describe the bug

Why this is device-agnostic

The root issue is not backend-specific. Calling .backward() on a non-scalar tensor without providing a gradient is invalid on all devices.

Device/runtime differences affect only which failure appears first:

  • intended simulated error,
  • or non-scalar backward error.

Affected test logic

In test/test_autograd.py, the reentrant function does:

class ReentrantFunc(Function):
    @staticmethod
    def backward(ctx, grad):
        reentrant_root.backward()   # non-scalar output, no grad argument
        return grad

reentrant_root is shape [3, 3], so implicit grad creation is invalid.
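
The reentrant call can be exercised in isolation. The sketch below is a reduced stand-in for the test, not a copy of it: the wrapper `run_reentrant`, its `pass_grad` switch, the `forward` body, and the small child graph are assumptions made for illustration, and the simulated-error path is left out entirely.

```python
import torch
from torch.autograd import Function


def run_reentrant(pass_grad):
    """Build a small child graph and trigger a reentrant backward on it."""
    t2 = torch.rand(3, 3, requires_grad=True)
    t3 = torch.rand(3, 3, requires_grad=True)
    reentrant_root = t2 * t2  # non-scalar root of the child graph, shape [3, 3]

    class ReentrantFunc(Function):
        @staticmethod
        def forward(ctx, inp):
            return inp.clone()

        @staticmethod
        def backward(ctx, grad):
            if pass_grad:
                # Explicit seed with the same shape as reentrant_root.
                reentrant_root.backward(grad)
            else:
                # Implicit seed: invalid for a [3, 3] tensor.
                reentrant_root.backward()
            return grad

    t6 = ReentrantFunc.apply(t3)
    (t6 * t6).sum().backward()
    return t2, t3
```

Calling `run_reentrant(pass_grad=False)` raises the RuntimeError from the issue, while `run_reentrant(pass_grad=True)` completes and accumulates gradients on both `t2` and `t3`.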

Deterministic reproduction trick

To force the hidden bug to surface every time, temporarily bypass the simulated error path by changing this line:

# original
# t5 = TestAutograd.SimulateBackwardError.apply(t4)

# temporary debug change
t5 = t4

With this temporary change, the test consistently fails with:

RuntimeError: grad can be implicitly created only for scalar outputs

Reproduction steps

  1. In test/test_autograd.py, edit this line in _test_reentrant_parent_error_on_cpu:
    • change t5 = TestAutograd.SimulateBackwardError.apply(t4) to t5 = t4
  2. Run:
python -m pytest -vs test/test_autograd.py -k test_reentrant_parent_error_on_cpu

Expected result after temporary t5 = t4 edit:

  • Failure with grad can be implicitly created only for scalar outputs.

Relation to existing issues

This is the root cause behind at least two other issues that were both closed without identifying the real problem.

pytorch/pytorch#86735

DISABLED test_reentrant_parent_error_on_cpu_cuda (TestAutogradDeviceTypeCUDA)

This issue has been open since 2022 and has been repeatedly auto-closed and reopened by pytorch-bot on an endless cycle:

  • Bot closes it after hundreds of reruns pass without failure.
  • Test flakes again weeks or months later on ROCm or Windows.
  • Bot reopens it automatically.
  • No human has ever commented with a root cause analysis.

The cycle repeats because the race condition is hardware/scheduler-dependent — some CI configurations almost always schedule the simulated error first, making the test appear stable for hundreds of runs, until platform or load conditions change.

ROCm/TheRock#2273

Linux Pytorch nightly failing at test "TestAutogradDeviceTypeCUDA.test_reentrant_parent_error_on_cpu_cuda"

Reported November 2025 on ROCm hardware (gfx94X-dcgpu, Python 3.13, PyTorch 2.9). The failure log shows the exact same assertion mismatch:

AssertionError: "Simulate error" does not match "grad can be implicitly created only for scalar outputs"

The assignee was unable to reproduce it on demand, referenced pytorch/pytorch#86735 as a known flaky test, and the issue was closed as "tentatively fixed" — again, without identifying the root cause.

Why both issues were closed without a fix

Both repositories fell into the same trap:

  • The test passes most of the time (simulated error wins the race).
  • The failure cannot be reproduced deterministically without knowing the root cause.
  • Without a reliable reproduction, both teams concluded the issue had gone away on its own.
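
When a failure only shows up occasionally, repeated runs give a rough failure rate. The helper below is a hypothetical sketch (the name `count_failures` and the run count are assumptions); it simply re-runs a command and counts non-zero exits.

```python
import subprocess
import sys


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical usage from a PyTorch checkout (not executed here):
# count_failures(
#     [sys.executable, "-m", "pytest", "test/test_autograd.py",
#      "-k", "test_reentrant_parent_error_on_cpu"],
#     runs=200,
# )
```

A count of zero over a few hundred runs does not show the race is gone, which is why the deterministic t5 = t4 edit described earlier is the more reliable check.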

The fix

This one-line change in _test_reentrant_parent_error_on_cpu removes the accidental error source entirely, making the test deterministic on all platforms:

# Before (buggy: raises on the non-scalar output when ReentrantFunc runs first):
reentrant_root.backward()

# After (correct: explicit gradient seed for the non-scalar output):
reentrant_root.backward(grad)

This resolves pytorch/pytorch#86735 and ROCm/TheRock#2273 permanently across all platforms (CUDA, ROCm, Windows) without suppressing or disabling the test.

Versions

<details> <summary>Versions</summary>

Collecting environment information...
PyTorch version: 2.12.0a0+git02521a0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 14.2.0-4ubuntu2~24.04.1) 14.2.0
Clang version: Could not collect
CMake version: version 4.2.3
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar 3 2026, 12:15:18) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-1012-intel-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20250302
Intel GPU driver version:

  • intel-opencl-icd: 25.18.33578.51-1146~24.04
  • libze1: 1.24.0.0-1146~24.04

Intel GPU models onboard:

  • Intel(R) Data Center GPU Max 1550

Intel GPU models detected:

  • [0] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD5, uuid=8680d50b-2f00-0000-8c00-000000000001, driver_version='1.6.33578+51', total_memory=65520MB, local_mem_size=128KB, max_compute_units=512, memory_clock_rate=3200MHz, memory_bus_width=64-bit, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)
  • [1] _XpuDeviceProperties(name='Intel(R) Data Center GPU Max 1550', platform_name='Intel(R) oneAPI Unified Runtime over Level-Zero', type='gpu', device_id=0xBD5, uuid=8680d50b-2f00-0000-8c00-000000000002, driver_version='1.6.33578+51', total_memory=65520MB, local_mem_size=128KB, max_compute_units=512, memory_clock_rate=3200MHz, memory_bus_width=64-bit, gpu_eu_count=512, gpu_subslice_count=64, max_work_group_size=1024, max_num_sub_groups=64, sub_group_sizes=[16 32], has_fp16=1, has_fp64=1, has_atomic64=1)

HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
Stepping: 6
CPU(s) scaling MHz: 24%
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4400.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization: VT-x
L1d cache: 1.5 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 40 MiB (32 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==2.4.3
[pip3] optree==0.19.0
[pip3] torch==2.12.0a0+git02521a0
[conda] Could not collect

</details>

extent analysis

TL;DR

The fix, as applied in PR #179704, is to pass an explicit gradient argument when calling backward() on the non-scalar tensor in ReentrantFunc.backward.

Guidance

  • The root cause of the issue is the implicit creation of gradients for non-scalar outputs in the ReentrantFunc class, which is not allowed.
  • To fix this, an explicit gradient argument should be provided when calling backward() on reentrant_root.
  • The failure can be reproduced deterministically by temporarily bypassing the simulated error path (t5 = t4), so that the hidden bug surfaces on every run.
  • The fix involves changing the line reentrant_root.backward() to reentrant_root.backward(grad) in the ReentrantFunc class.

Example

class ReentrantFunc(Function):
    @staticmethod
    def backward(ctx, grad):
        # Before (buggy)
        # reentrant_root.backward()
        
        # After (correct)
        reentrant_root.backward(grad)
        return grad

Notes

  • The issue is device-agnostic and affects all platforms, including CUDA, ROCm, and Windows.
  • The fix resolves the root cause of the issue, making the test deterministic and reliable.

Recommendation

Change reentrant_root.backward() to reentrant_root.backward(grad) in the ReentrantFunc class. This is a fix for the root cause rather than a workaround, and it makes the test deterministic.
