pytorch - 💡(How to fix) Fix ROCm MI300 (gfx942): native segfault in mixed-precision BatchNorm2d backward

pytorch · 2026-05-08 20:19:19


ROCm-nightly gfx942 distributed tests hit native crashes (SIGSEGV, heap corruption) in mixed-precision norm backward. The crashes are nondeterministic and not caused by the triggering commit.

CI job: rocm-nightly / linux-noble-rocm-nightly-py3.12-gfx942 / test (distributed, 2, 3, linux.rocm.gpu.gfx942.4, module:rocm, oncall:distributed, unstable)
HUD job ID: 74545181272
Triggering commit: d783393fbc6d, [Inductor] Do not use layout constraints for NonOwningLayouts (https://github.com/pytorch/pytorch/pull/182333)

Error Message

Run 2: Fatal Python error: corrupted double-linked list, then SIGSEGV



Failing Tests

The crashes occur during autograd backward in the BatchNorm2d section of the mixed-precision norm tests. That section intentionally expects a Python RuntimeError about a running_mean dtype mismatch. On ROCm, the process crashes in native code instead of raising that error.

- TestReplicateMixedPrecisionCasts::test_norm_modules_fp16: segfaulted once in _engine_run_backward, passed on retry.
- TestFullyShardMixedPrecisionCasts::test_norm_modules_bf16: segfaulted once in _engine_run_backward, passed on retry.
- TestFullyShardMixedPrecisionCasts::test_norm_modules_fp16: failed consistently across 3 attempts:
  - Run 1: SIGSEGV in _engine_run_backward
  - Run 2: Fatal Python error: corrupted double-linked list, then SIGSEGV
  - Run 3: Multiple simultaneous segfaults from worker threads

Crash Site

Both FSDP and replicate tests call _test_norm_modules, which runs model(x).sum().backward(). The crash is in the BatchNorm2d backward section:

FSDP: test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py, line 733
Replicate: test/distributed/_composable/test_replicate_mixed_precision.py, line 480

Representative stack:

Thread 0x00007fb1b75fe6c0 (most recent call first):
  File "torch/autograd/graph.py", line 913 in _engine_run_backward
  File "torch/autograd/__init__.py", line 395 in backward
  File "torch/_tensor.py", line 633 in backward
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 707 in inner
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 733 in _test_norm_modules
  File "test/distributed/_composable/fsdp/test_fully_shard_mixed_precision.py", line 700 in test_norm_modules_fp16
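
The trace above resembles the per-thread dump that Python's standard faulthandler module writes when a process receives a fatal signal. For a local reproduction outside the CI harness, a minimal sketch like the following could capture the same kind of trace in a file. The helper name and the log path are hypothetical and do not come from the issue.

```python
import faulthandler
import sys


def enable_native_crash_traces(log_path=None):
    """Dump Python tracebacks for all threads on SIGSEGV, SIGABRT and similar signals.

    Returns the open log file (or None when writing to stderr). The caller must
    keep the returned file open for as long as faulthandler stays enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this at the top of a reproduction script, before the model is built, gives one trace per crashing process. It only shows the Python frames; the native frames still need a coredump or a debugger.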

Root Cause Assessment

This is very unlikely to be a regression from the triggering commit:

- The commit is unrelated. It only changes Inductor layout constraint handling (torch/_inductor/scheduler.py, torch/_inductor/select_algorithm.py, test/inductor/test_max_autotune.py).
- The crash is native. SIGSEGV and corrupted double-linked list indicate native memory corruption or allocator misuse. The Python dtype-mismatch path should raise RuntimeError, not corrupt the heap.
- The behavior is nondeterministic. Replicate fp16 and FSDP bf16 pass on fresh-process retry, while FSDP fp16 fails with different crash signatures across attempts.
- The job shows broader ROCm instability. The same log includes repeated FSDP failures where DEVICE_COUNT is zero (ZeroDivisionError in common_fsdp.py) and 300-second timeouts in test_fsdp_core.py.
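
To check the nondeterminism locally, one option is to rerun the same test command in a fresh process several times and record how each attempt ended. The sketch below is an illustration only: the function names are hypothetical, and it relies on the POSIX convention that subprocess reports a child killed by signal N with return code -N.

```python
import signal
import subprocess
import sys


def classify_exit(returncode):
    """Turn a subprocess return code into a short, comparable label."""
    if returncode == 0:
        return "passed"
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"failed with exit code {returncode}"


def run_in_fresh_processes(cmd, attempts=3):
    """Run cmd `attempts` times, each in a new process, and label every outcome."""
    return [classify_exit(subprocess.run(cmd).returncode) for _ in range(attempts)]


if __name__ == "__main__":
    # Placeholder command; substitute the real test invocation.
    print(run_in_fresh_processes([sys.executable, "-c", "pass"]))
```

A mix of labels across attempts, such as one "passed" and one "killed by SIGSEGV", would match the pattern reported for the replicate fp16 and FSDP bf16 tests.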

Suggested Mitigation

Skip the affected norm-module mixed-precision tests on ROCm MI300/gfx942 until the native crash is resolved. Treat the underlying issue as a ROCm nightly native/backend or CI environment problem. If it persists across unrelated commits, escalate with raw job logs and coredumps.
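
One way the skip could look is sketched below, assuming the architecture is read from the gcnArchName field that ROCm builds of PyTorch expose on the device properties. The decorator and helper names are hypothetical. This is not the change that was or will be made in the PyTorch tree.

```python
import functools
import unittest
from typing import Optional


def current_gcn_arch() -> Optional[str]:
    """Return the ROCm arch string of device 0, or None when not on a ROCm GPU."""
    try:
        import torch  # imported lazily so the helper also loads without PyTorch
    except ImportError:
        return None
    if getattr(torch.version, "hip", None) is None or not torch.cuda.is_available():
        return None
    # Assumption: ROCm builds expose gcnArchName on the device properties.
    return getattr(torch.cuda.get_device_properties(0), "gcnArchName", None)


def is_gfx942(arch: Optional[str]) -> bool:
    """True for strings such as 'gfx942' or 'gfx942:sramecc+:xnack-'."""
    return arch is not None and arch.split(":")[0] == "gfx942"


def skip_on_gfx942(reason: str):
    """Hypothetical decorator: skip the wrapped test when device 0 is gfx942."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if is_gfx942(current_gcn_arch()):
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Applied as `@skip_on_gfx942("native crash in BatchNorm2d backward on gfx942")` above test_norm_modules_fp16 and test_norm_modules_bf16, this keeps the tests running on other hardware. In-tree, PyTorch's existing ROCm skip helpers in torch.testing._internal.common_utils (for example skipIfRocm) would likely be the better fit.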

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @jerrymannil @xinyazhang @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360 @ppwwyyxx @pytorch/rocm
