pytorch - ROCm MI300 (gfx942): native segfault in mixed-precision BatchNorm2d backward



Summary

ROCm-nightly gfx942 distributed tests hit native crashes (SIGSEGV, heap corruption) in mixed-precision norm backward. The crashes are nondeterministic and not caused by the triggering commit.

CI job: rocm-nightly / linux-noble-rocm-nightly-py3.12-gfx942 / test (distributed, 2, 3, linux.rocm.gpu.gfx942.4, module:rocm, oncall:distributed, unstable)
HUD job ID: 74545181272
Triggering commit: d783393fbc6d ([Inductor] Do not use layout constraints for NonOwningLayouts, https://github.com/pytorch/pytorch/pull/182333)

Failing Tests

The crashes occur during autograd backward in the BatchNorm2d section of the mixed-precision norm tests. That section intentionally expects a Python RuntimeError about a running_mean dtype mismatch. On ROCm, the process instead crashes in native code.

  1. TestReplicateMixedPrecisionCasts::test_norm_modules_fp16 — Segfaulted once in _engine_run_backward, passed on retry.
  2. TestFullyShardMixedPrecisionCasts::test_norm_modules_bf16 — Segfaulted once in _engine_run_backward, passed on retry.
  3. TestFullyShardMixedPrecisionCasts::test_norm_modules_fp16 — Failed consistently across 3 attempts:
    • Run 1: SIGSEGV in _engine_run_backward
    • Run 2: Fatal Python error: corrupted double-linked list, then SIGSEGV
    • Run 3: Multiple simultaneous segfaults from worker threads

Crash Site

Both FSDP and replicate tests call _test_norm_modules, which runs model(x).sum().backward(). The crash is in the BatchNorm2d backward section:

  • FSDP: test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py, line 733
  • Replicate: test/distributed/_composable/test_replicate_mixed_precision.py, line 480

Representative stack:

Thread 0x00007fb1b75fe6c0 (most recent call first):
  File "torch/autograd/graph.py", line 913 in _engine_run_backward
  File "torch/autograd/__init__.py", line 395 in backward
  File "torch/_tensor.py", line 633 in backward
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 707 in inner
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 733 in _test_norm_modules
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 700 in test_norm_modules_fp16
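The "Thread ... (most recent call first)" layout of the stack above matches what Python's standard faulthandler module prints when a fatal signal arrives. As a minimal sketch for a local reproduction outside CI, the same kind of trace can be routed to a file of your choosing; the helper name and the log path are hypothetical, not part of the PyTorch test harness.

```python
import faulthandler


def enable_crash_traces(log_path):
    """Dump Python stacks of all threads to log_path on a fatal signal.

    Hypothetical helper for local debugging. The returned file object must
    stay open for as long as faulthandler is enabled.
    """
    log_file = open(log_path, "w")
    # all_threads=True prints every thread, which helps when worker
    # threads crash at the same time as the main thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_traces("crash_trace.log")` at the top of a reproduction script would capture the Python frames leading into `_engine_run_backward`; it does not show the native frames, which still need a coredump or a debugger.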

Root Cause Assessment

This is very unlikely to be a regression from the triggering commit:

  1. The commit is unrelated. It only changes Inductor layout constraint handling (torch/_inductor/scheduler.py, torch/_inductor/select_algorithm.py, test/inductor/test_max_autotune.py).
  2. The crash is native. SIGSEGV and corrupted double-linked list indicate native memory corruption or allocator misuse. The Python dtype-mismatch path should raise RuntimeError, not corrupt the heap.
  3. The behavior is nondeterministic. Replicate fp16 and FSDP bf16 pass on fresh-process retry, while FSDP fp16 fails with different crash signatures across attempts.
  4. The job shows broader ROCm instability. The same log includes repeated FSDP failures where DEVICE_COUNT is zero (ZeroDivisionError in common_fsdp.py) and 300-second timeouts in test_fsdp_core.py.
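Points 2 and 3 rest on telling a native crash apart from an ordinary Python failure, and on retrying in a fresh process. A rough sketch of that classification, assuming a POSIX host where `subprocess` reports death by signal as a negative return code; the function names are illustrative only and the command to run is left to the caller.

```python
import signal
import subprocess


def classify_exit(returncode):
    """Map a child-process return code to a coarse outcome label."""
    if returncode == 0:
        return "passed"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"native crash ({name})"
    # A positive code means the interpreter exited on its own,
    # for example after an uncaught exception or a failed assertion.
    return "python-level failure"


def run_isolated(cmd, attempts=3):
    """Run cmd in a fresh process several times and label each outcome."""
    return [classify_exit(subprocess.run(cmd).returncode) for _ in range(attempts)]
```

Labels that differ between attempts, or the same test alternating between "passed" and "native crash (SIGSEGV)", would be consistent with the nondeterminism described above rather than with a deterministic Python-level regression.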

Suggested Mitigation

Skip the affected norm-module mixed-precision tests on ROCm MI300/gfx942 until the native crash is resolved. Treat the underlying issue as a ROCm nightly native/backend or CI environment problem. If it persists across unrelated commits, escalate with raw job logs and coredumps.
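One way the skip could look, as a sketch under stated assumptions rather than the change that was actually made: it relies on `torch.version.hip` being set on ROCm builds and on the device properties exposing `gcnArchName` there. The helper and decorator names are hypothetical; PyTorch's own test utilities may already provide an equivalent decorator, which would be preferable if so.

```python
import unittest


def is_gfx942(hip_version, gcn_arch_name):
    """Return True for a ROCm build running on a gfx942 device.

    The arch string may carry feature suffixes such as
    "gfx942:sramecc+:xnack-", so only the part before ":" is compared.
    """
    if not hip_version or not gcn_arch_name:
        return False
    return gcn_arch_name.split(":")[0] == "gfx942"


def detect_gfx942():
    """Best-effort detection; returns False when torch or a GPU is absent."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    props = torch.cuda.get_device_properties(0)
    # gcnArchName is assumed to exist only on ROCm builds, hence getattr.
    return is_gfx942(
        getattr(torch.version, "hip", None),
        getattr(props, "gcnArchName", ""),
    )


# Hypothetical decorator for test_norm_modules_fp16 / test_norm_modules_bf16.
skip_on_gfx942 = unittest.skipIf(
    detect_gfx942(),
    "native crash in mixed-precision BatchNorm2d backward on ROCm gfx942",
)
```

Keeping the arch check in a pure function makes the condition easy to verify without a GPU, and the skip reason string records why the tests are disabled so they can be re-enabled once the native crash is resolved.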

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360 @ppwwyyxx @pytorch/rocm
