pytorch: [ROCm] DistTensorRandomOpCompileTest family is flaky on ROCm (1 pull request)


The DistTensorRandomOpCompileTest family in test/distributed/tensor/test_random_ops.py is flaky on ROCm. Eight tests in this class are already marked @skipIfRocm, each with its own auto-disabler tracking issue. test_compile_multiple_random_ops is the only member that was never disabled, and it is now flaking on linux-jammy-rocm-py3.10-mi355 (gfx950 / MI355).

Filing this as a single tracking issue for the whole family so the underlying ROCm + inductor + DTensor RNG interaction can be root-caused once, rather than chasing per-test disables.

Error Message

AssertionError: True is not false : RNG state did not change between call 0 and 1

Root Cause

Eager DTensor random ops go through OffsetBasedRNGTracker._distribute_region (torch/distributed/tensor/_random.py:242), which ends with self._set_device_state(state.state) after bumping the philox offset by end_offset_incr. This is a deterministic host-side write to the device generator, so eager reliably advances the offset.
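The eager bookkeeping described above can be sketched with a toy model. This is not the code in torch/distributed/tensor/_random.py: ToyGenerator, ToyOffsetTracker, distribute_region and the shard_offset parameter are hypothetical names introduced for illustration, and only the final "write back the start offset plus end_offset_incr" step is taken from the description.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class ToyGenerator:
    """Stand-in for a device generator: a seed plus a philox-style offset."""
    seed: int = 0
    offset: int = 0


class ToyOffsetTracker:
    """Hypothetical, simplified analogue of an offset-based RNG tracker."""

    def __init__(self, generator: ToyGenerator) -> None:
        self.generator = generator

    @contextmanager
    def distribute_region(self, shard_offset: int, end_offset_incr: int):
        start = self.generator.offset
        # Assumption for illustration: each rank starts at its own slice.
        self.generator.offset = start + shard_offset
        try:
            yield self.generator
        finally:
            # Host-side write-back: the offset always ends at start + incr,
            # regardless of what ran inside the region.
            self.generator.offset = start + end_offset_incr


gen = ToyGenerator(seed=0)
tracker = ToyOffsetTracker(gen)
offsets = [gen.offset]
for _ in range(2):
    with tracker.distribute_region(shard_offset=4, end_offset_incr=16):
        pass  # a random kernel would run here
    offsets.append(gen.offset)

print(offsets)
```

Because the write-back is an ordinary host-side assignment in this model, the offset advances on every call, which mirrors why the eager path is described as reliable.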

The compiled path instead routes through the run_dtensor_rng_op higher-order operator (torch/_prims/rng_prims.py:462), registered to inductor as a plain make_fallback (torch/_inductor/lowering.py:3542). Its offset advancement is a host-side side effect (torch.cuda.set_rng_state inside impl_cuda) that inductor's scheduler does not model. The HOP is declared cacheable=True with no mutation annotation. This is RFC-stage work that landed in #174446 (reverted three times before landing) and has explicit rank-dependent-graph caveats noted by the author.

The whole DistTensorRandomOpCompileTest family being unreliable on ROCm is consistent with this side-effect-invisible-to-inductor pattern. The exact gfx950 byte-level trigger that makes the offset stay at 0 on the first compiled call has not yet been pinned down; a gfx950 / MI350 host is needed to reproduce.

Fix Action

Fixed



Currently disabled siblings

Test | Pre-existing tracking issue
test_compile_native_dropout | #179985
test_compile_normal_ | #179973
test_compile_rand_like | #179977
test_compile_randn_like | #179963
test_compile_randint_like | #179984
test_compile_uniform_ | #179964
test_compile_bernoulli | #179981
test_compile_bernoulli_float | #179987
test_compile_multiple_random_ops | this issue (not yet skipped)

The pre-existing tracking issues were opened by the test auto-disabler and contain no root-cause analysis.

Observed failure for test_compile_multiple_random_ops

Failures reproduced across four CI runs on different mainline commits (e1e28ae, c3534e4, 455c813, 5e3cb3e) and different shards, all on linux-jammy-rocm-py3.10-mi355 (gfx950 / MI355). Example log: https://ossci-raw-job-status.s3.amazonaws.com/log/78098557692

Assertion at test/distributed/tensor/test_random_ops.py:747:

AssertionError: True is not false : RNG state did not change between call 0 and 1

In other words, after the first call to the inductor-compiled fn, torch.cuda.get_rng_state() is byte-identical to the state captured right after manual_seed(0), so the global CUDA RNG offset did not advance.

The traceback reaches the assertion from _test_compile_random_op at line 831, which is specifically the backend="inductor" path. The preceding eager and aot_eager runs of the same fn pass.
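The shape of this check can be sketched without a GPU. The helper below is a hypothetical stand-in, not the code in test_random_ops.py: check_state_advances is an invented name, and Python's random module replaces the CUDA generator, with random.getstate playing the role of torch.cuda.get_rng_state.

```python
import random
from typing import Any, Callable


def check_state_advances(
    fn: Callable[[], Any],
    get_state: Callable[[], Any],
    num_calls: int = 3,
) -> None:
    """Raise AssertionError if any call to fn leaves the RNG state unchanged."""
    prev = get_state()
    for i in range(num_calls):
        fn()
        cur = get_state()
        if cur == prev:
            raise AssertionError(f"RNG state did not change across call {i}")
        prev = cur


# A well-behaved random function consumes state on every call.
random.seed(0)
check_state_advances(random.random, random.getstate)

# A function that leaves the generator where it started is caught.
random.seed(0)
try:
    check_state_advances(lambda: random.seed(0), random.getstate)
    stuck_detected = False
except AssertionError:
    stuck_detected = True

print(stuck_detected)
```

Comparing full state snapshots rather than sampled values keeps the check independent of what the random op returns.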


Likely product fix

Make the global generator mutation in run_dtensor_rng_op visible to inductor (model it as a mutating / side-effecting op rather than a cacheable pure fallback), so the compiled path advances the offset as reliably as eager. Needs gfx950 validation.
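The difference between treating an op as pure and modelling it as mutating can be illustrated with a toy runner. This is a sketch of the general hazard only: it is not how inductor schedules or caches fallbacks, it does not reproduce the gfx950 failure (whose trigger is still unknown), and ToyRunner, toy_rng_op and the mutates_state flag are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

rng_state = {"offset": 0}  # stand-in for the global generator


def toy_rng_op(numel: int) -> List[int]:
    """Return numel placeholder values and advance the global offset."""
    start = rng_state["offset"]
    rng_state["offset"] = start + numel  # host-side side effect
    return [start + i for i in range(numel)]


class ToyRunner:
    """Hypothetical runner that reuses results of ops it believes are pure."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[str, int], List[int]] = {}

    def run(
        self, op: Callable[[int], List[int]], numel: int, *, mutates_state: bool
    ) -> List[int]:
        key = (op.__name__, numel)
        if not mutates_state and key in self._cache:
            # Believed pure: the op is skipped, and so is its side effect.
            return self._cache[key]
        result = op(numel)
        if not mutates_state:
            self._cache[key] = result
        return result


def offsets_after_two_calls(mutates_state: bool) -> List[int]:
    rng_state["offset"] = 0
    runner = ToyRunner()
    seen = []
    for _ in range(2):
        runner.run(toy_rng_op, 4, mutates_state=mutates_state)
        seen.append(rng_state["offset"])
    return seen


treated_as_pure = offsets_after_two_calls(mutates_state=False)
treated_as_mutating = offsets_after_two_calls(mutates_state=True)
print(treated_as_pure, treated_as_mutating)
```

In this model the offset only advances reliably once the runner is told the op mutates state, which is the property the proposed fix asks inductor to respect for run_dtensor_rng_op.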

Why not relax the assertion

The assertion guards a real correctness invariant: each call to a random op must consume fresh randomness. Silently passing would hide a bug where compiled DTensor random ops repeat values across calls.
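A small counter-based generator shows why a stalled offset means repeated values. Philox is counter-based, so its output is a function of the seed and the counter; the sketch below uses SHA-256 as a stand-in for philox, and counter_based_values is a hypothetical helper, not a PyTorch API.

```python
import hashlib
import struct
from typing import List


def counter_based_values(seed: int, offset: int, numel: int) -> List[float]:
    """Toy counter-based generator: output depends only on (seed, counter)."""
    values = []
    for i in range(numel):
        digest = hashlib.sha256(struct.pack("<QQ", seed, offset + i)).digest()
        # Keep the top 53 bits so the result is a float in [0, 1).
        values.append((int.from_bytes(digest[:8], "little") >> 11) / 2**53)
    return values


first = counter_based_values(seed=0, offset=0, numel=4)
stuck = counter_based_values(seed=0, offset=0, numel=4)     # offset never advanced
advanced = counter_based_values(seed=0, offset=4, numel=4)  # offset advanced by numel

print(first == stuck, first == advanced)
```

With the same seed and an unchanged offset, the second call reproduces the first exactly, which is the failure mode the assertion exists to catch.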

Action

Adding @skipIfRocm to test_compile_multiple_random_ops to match the siblings, referencing this issue, to unblock ROCm CI while the underlying fix is developed.

cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang
