pytorch - 💡(How to fix) `torch.nn.LSTM.forward` produces incorrect outputs (or crashes) on ROCm [2 comments, 2 participants]

GitHub stats
pytorch/pytorch#177834 · Fetched 2026-04-08 01:01:29
Comments: 2 · Participants: 2 · Timeline: 40 · Reactions: 0
Timeline (top): mentioned ×14, subscribed ×14, labeled ×7, commented ×2

Error Message

username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0

Traceback (most recent call last):
  File "issue.py", line 73, in <module>
    main()
  File ".venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "issue.py", line 64, in main
    raise e
  File "issue.py", line 61, in main
    ywrong = run(x, dtype, "cuda:0")
             ^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "issue.py", line 40, in run
    y, _ = lstm(x)
           ^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/rnn.py", line 1141, in forward
    result = _VF.lstm(
             ^^^^^^^^^
RuntimeError: miopenStatusBadParm



🐛 Describe the bug

We observed that the outputs of torch.nn.LSTM.forward, when running on a ROCm GPU, differ significantly from their CPU counterparts. This only happens for larger batch sizes (>100), and the differences grow as the batch size increases further. These differences are extremely large (sometimes 1-3 orders of magnitude) and can't be attributed to the usual numerical inaccuracies. Furthermore, for a large enough batch size (500-1000), the call raises an error:

username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0

Traceback (most recent call last):
  File "issue.py", line 73, in <module>
    main()
  File ".venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "issue.py", line 64, in main
    raise e
  File "issue.py", line 61, in main
    ywrong = run(x, dtype, "cuda:0")
             ^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "issue.py", line 40, in run
    y, _ = lstm(x)
           ^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/torch/nn/modules/rnn.py", line 1141, in forward
    result = _VF.lstm(
             ^^^^^^^^^
RuntimeError: miopenStatusBadParm

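We also noticed that the numerical differences decrease when the LSTM's hidden size is larger: in float16 with hidden_size=256, the outputs match the CPU up to batch size 250, although the crash then already appears at batch size 300. None of these effects are present on CUDA.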
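
Since ROCm builds of PyTorch expose AMD GPUs through the "cuda" device type (the script below uses "cuda:0", and the environment dump reports "Is CUDA available: True" for a ROCm build), it can help to label each run with the backend actually in use. A minimal sketch, assuming `torch.version.hip` is set on ROCm builds and `None` otherwise; the helper name `backend_name` is hypothetical:

```python
import torch


def backend_name():
    """Return a label for the GPU backend this PyTorch build was compiled for."""
    # On ROCm builds torch.version.hip holds a version string; elsewhere it is None.
    if getattr(torch.version, "hip", None):
        return "rocm"
    # On CUDA builds torch.version.cuda holds a version string; on CPU-only builds it is None.
    if torch.version.cuda:
        return "cuda"
    return "cpu-only"


print("PyTorch build backend:", backend_name())
```

Printing this label next to each result block makes it unambiguous which of the two logs below came from which stack.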

Here are some results when running on ROCm:

Bidirectional LSTM definition: input_size=60, hidden_size=128, num_layers=4
dtype: torch.float16
    batch_size= 10, relative difference vs CPU: mean=0.0007, std=0.0006
    batch_size= 50, relative difference vs CPU: mean=0.0007, std=0.0006
    batch_size=100, relative difference vs CPU: mean=0.0007, std=0.0006
    batch_size=110, relative difference vs CPU: mean=0.0140, std=0.1141
    batch_size=120, relative difference vs CPU: mean=0.1747, std=0.4716
    batch_size=130, relative difference vs CPU: mean=0.3114, std=0.6005
    batch_size=200, relative difference vs CPU: mean=0.8847, std=0.7305
    batch_size=250, relative difference vs CPU: mean=0.9642, std=0.7083
    batch_size=300, relative difference vs CPU: mean=0.9611, std=0.7061
    batch_size=400, relative difference vs CPU: mean=0.9616, std=0.7078
    batch_size=500, relative difference vs CPU: mean=0.9576, std=0.7063
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=1000 FORWARD ON GPU FAILED: miopenStatusBadParm
dtype: torch.float32
    batch_size= 10, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size= 50, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size=100, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size=110, relative difference vs CPU: mean=0.0134, std=0.1142
    batch_size=120, relative difference vs CPU: mean=0.1742, std=0.4718
    batch_size=130, relative difference vs CPU: mean=0.3110, std=0.6008
    batch_size=200, relative difference vs CPU: mean=0.8847, std=0.7305
    batch_size=250, relative difference vs CPU: mean=1.0828, std=0.8637
    batch_size=300, relative difference vs CPU: mean=1.2134, std=0.9995
    batch_size=400, relative difference vs CPU: mean=1.3588, std=1.2611
    batch_size=500, relative difference vs CPU: mean=1.5400, std=1.4649
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=1000 FORWARD ON GPU FAILED: miopenStatusBadParm


# Larger `hidden_size` reduces numerical errors.
Bidirectional LSTM definition: input_size=60, hidden_size=256, num_layers=4
dtype: torch.float16
    batch_size= 10, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size= 50, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=100, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=110, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=120, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=130, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=200, relative difference vs CPU: mean=0.0007, std=0.0007
    batch_size=250, relative difference vs CPU: mean=0.0007, std=0.0007
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=300 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=400 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=500 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=1000 FORWARD ON GPU FAILED: miopenStatusBadParm
dtype: torch.float32
    batch_size= 10, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size= 50, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size=100, relative difference vs CPU: mean=0.0000, std=0.0000
    batch_size=110, relative difference vs CPU: mean=0.0136, std=0.1175
    batch_size=120, relative difference vs CPU: mean=0.1762, std=0.4779
    batch_size=130, relative difference vs CPU: mean=0.3145, std=0.6083
    batch_size=200, relative difference vs CPU: mean=0.8945, std=0.7398
    batch_size=250, relative difference vs CPU: mean=0.9774, std=0.7180
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=300 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=400 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=500 FORWARD ON GPU FAILED: miopenStatusBadParm
MIOpen Error: username:/longer_pathname_so_that_rpms_can_support_packaging_the_debug_info_for_all_os_profiles/src/rocm-libraries/projects/miopen/src/tensor.cpp:120: Lengths must be > 0
    batch_size=1000 FORWARD ON GPU FAILED: miopenStatusBadParm

And here are the results of the same script on a CUDA GPU:

Bidirectional LSTM definition: input_size=60, hidden_size=128, num_layers=4
dtype: torch.float16
    batch_size= 10, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size= 50, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=100, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=110, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=120, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=130, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=200, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=250, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=300, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=400, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=500, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=1000, relative difference vs CPU: mean=0.0003, std=0.0003
dtype: torch.float32
    batch_size= 10, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size= 50, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=100, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=110, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=120, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=130, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=200, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=250, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=300, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=400, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=500, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=1000, relative difference vs CPU: mean=0.0001, std=0.0001


Bidirectional LSTM definition: input_size=60, hidden_size=256, num_layers=4
dtype: torch.float16
    batch_size= 10, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size= 50, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=100, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=110, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=120, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=130, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=200, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=250, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=300, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=400, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=500, relative difference vs CPU: mean=0.0003, std=0.0003
    batch_size=1000, relative difference vs CPU: mean=0.0003, std=0.0003
dtype: torch.float32
    batch_size= 10, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size= 50, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=100, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=110, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=120, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=130, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=200, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=250, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=300, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=400, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=500, relative difference vs CPU: mean=0.0001, std=0.0001
    batch_size=1000, relative difference vs CPU: mean=0.0001, std=0.0001

Here is the script to reproduce the results:

import gc

import torch

torch.manual_seed(42)
torch.use_deterministic_algorithms(True)


INPUT_SIZE = 60
HIDDEN_SIZE = 128
NUM_LAYERS = 4
MAX_BATCH_SIZE = 2000
SEQLEN = 600


def _make_lstm():
    return torch.nn.LSTM(
        input_size=INPUT_SIZE,
        hidden_size=HIDDEN_SIZE,
        num_layers=NUM_LAYERS,
        bidirectional=True,
        batch_first=True,
    )


STATE_DICT = _make_lstm().state_dict()
RANDOM_INPUT = torch.randn((MAX_BATCH_SIZE, SEQLEN, INPUT_SIZE))


@torch.inference_mode()
def run(x, dtype, device):
    lstm = _make_lstm()
    lstm.load_state_dict(STATE_DICT)
    lstm.to(dtype)
    lstm.to(device)
    lstm.eval()

    x = x.to(device).to(dtype)
    y, _ = lstm(x)
    return y.to(torch.float32).cpu()


@torch.inference_mode()
def main():
    print(f"Bidirectional LSTM definition: input_size={INPUT_SIZE}, "
          f"hidden_size={HIDDEN_SIZE}, num_layers={NUM_LAYERS}")

    batch_sizes = [10, 50, 100, 110, 120, 130, 200, 250, 300, 400, 500, 1000]
    for dtype in [torch.float16, torch.float32]:
        print("dtype:", dtype)
        for batch_size in batch_sizes:
            gc.collect()
            torch.cuda.empty_cache()

            assert batch_size <= MAX_BATCH_SIZE
            x = RANDOM_INPUT[:batch_size].contiguous()

            yref = run(x, torch.float32, "cpu")
            try:
                ywrong = run(x, dtype, "cuda:0")
            except RuntimeError as e:
                print(f"    {batch_size=} FORWARD ON GPU FAILED: {e}")
                continue

            ref_center = yref.abs().mean()
            diff = torch.abs(yref - ywrong) / ref_center
            print(f"    {batch_size=:3d}, relative difference vs CPU: mean="
                  f"{diff.mean():.4f}, std={diff.std():.4f}")


main()
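
The comparison performed by the script can also be condensed into a reusable check. The following is a sketch only: the helper name `relative_error_vs_cpu` is hypothetical, it uses the same metric as the script (mean of |y_cpu - y_dev| divided by the mean absolute CPU output), and it falls back to "cpu" as the device when no GPU is visible, in which case the error is trivially zero.

```python
import copy

import torch


def relative_error_vs_cpu(module, x, device):
    """Mean relative difference between a module's output on `device` and on the CPU."""
    # Copy the module so the caller's instance is left untouched.
    ref = copy.deepcopy(module).to("cpu").eval()
    dev = copy.deepcopy(module).to(device).eval()
    with torch.inference_mode():
        y_ref = ref(x.cpu())[0].float()
        y_dev = dev(x.to(device))[0].float().cpu()
        diff = torch.abs(y_ref - y_dev) / y_ref.abs().mean()
        return diff.mean().item()


torch.manual_seed(0)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
lstm = torch.nn.LSTM(input_size=60, hidden_size=128, num_layers=4,
                     bidirectional=True, batch_first=True)
x = torch.randn(8, 20, 60)  # small batch, far below the sizes that misbehave in the report
err = relative_error_vs_cpu(lstm, x, device)
print(f"relative difference vs CPU on {device}: {err:.4f}")
```

Running it over the batch sizes from the report should reproduce the jump from about 1e-4 to order 1 on an affected setup.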

Versions

Collecting environment information...
PyTorch version: 2.10.0
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.2.26015

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version: 14.0.6
CMake version: version 3.31.6
Libc version: glibc-2.36

Python version: 3.11.2 (main, Apr 28 2025, 14:11:48) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-41-amd64-x86_64-with-glibc2.36
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon AI PRO R9700 (gfx1201)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.26015
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 9950X 16-Core Processor CPU family: 26 Model: 68 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU(s) scaling MHz: 44% CPU max MHz: 8839.3555 CPU min MHz: 3000.0000 BogoMIPS: 8600.02 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze Virtualization: AMD-V L1d cache: 768 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 16 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Indirect target selection: Not affected 
Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsa: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Not affected

Versions of relevant libraries:
[pip3] numpy==2.4.2
[pip3] torch==2.10.0
[conda] Could not collect

cc @mikaylagawarecki @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

extent analysis

Fix Plan

The failure appears to come from the MIOpen-backed path that torch.nn.LSTM uses on ROCm GPUs: the forward call either returns outputs that differ from the CPU result or aborts with RuntimeError: miopenStatusBadParm. The steps below are workarounds to try, not a confirmed fix:

  • Update ROCm and MIOpen: The report was filed against HIP runtime 7.2.26015 and MIOpen 3.5.1. A newer release may already contain a fix.
  • Use CUDA instead of ROCm: If an NVIDIA GPU is available, run the same script there to confirm that the problem is specific to the ROCm backend.
  • Reduce batch size: Try a batch size below 100, as the issue seems to occur only with larger batch sizes.
  • Increase hidden size: Larger hidden sizes seem to reduce the numerical errors. This changes the model itself, so treat it as a diagnostic rather than a drop-in fix.

The following snippet applies the last two workarounds, a reduced batch size and an increased hidden size:

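import torch

# Placeholder dimensions; substitute the values from your own model.
INPUT_SIZE = 64
SEQLEN = 32
NUM_LAYERS = 2

# Reduce batch size
batch_size = 50

# Increase hidden size
HIDDEN_SIZE = 256

# ROCm builds of PyTorch still address the GPU as "cuda".
device = "cuda" if torch.cuda.is_available() else "cpu"

# Create LSTM module with increased hidden size
lstm = torch.nn.LSTM(
    input_size=INPUT_SIZE,
    hidden_size=HIDDEN_SIZE,
    num_layers=NUM_LAYERS,
    bidirectional=True,
    batch_first=True,
).to(device)

# Run the LSTM module with reduced batch size (inference only)
x = torch.randn((batch_size, SEQLEN, INPUT_SIZE), device=device)
with torch.no_grad():
    y, _ = lstm(x)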

Verification

To verify that a workaround helped, run the same script with the updated batch size and hidden size. Check that the forward pass no longer raises miopenStatusBadParm and that the difference between the CPU and GPU outputs is within normal floating-point tolerance for the dtype in use.
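
The CPU-versus-GPU comparison described above can be sketched as follows. This is a minimal illustration, not the script from the original report: the helper name lstm_cpu_vs_device_diff and all default dimensions are assumptions chosen for the example. It builds one LSTM on the CPU, copies the same weights to the target device, and returns the largest absolute difference between the two outputs.

```python
import copy

import torch


def lstm_cpu_vs_device_diff(device, batch_size=50, seq_len=16, input_size=32,
                            hidden_size=64, num_layers=2, dtype=torch.float32):
    """Return max |y_device - y_cpu| for one bidirectional LSTM forward pass.

    The function name and default sizes are hypothetical, for illustration only.
    """
    torch.manual_seed(0)  # same weights and input on every call
    lstm_cpu = torch.nn.LSTM(
        input_size=input_size,
        hidden_size=hidden_size,
        num_layers=num_layers,
        bidirectional=True,
        batch_first=True,
    ).eval()
    x = torch.randn(batch_size, seq_len, input_size)

    with torch.no_grad():
        y_cpu, _ = lstm_cpu(x)  # float32 CPU reference
        # Copy the module so both runs use identical weights.
        lstm_dev = copy.deepcopy(lstm_cpu).to(device=device, dtype=dtype)
        y_dev, _ = lstm_dev(x.to(device=device, dtype=dtype))

    return (y_dev.float().cpu() - y_cpu).abs().max().item()


if __name__ == "__main__":
    # ROCm builds of PyTorch expose the GPU under the "cuda" device name.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    for bs in (50, 200):
        diff = lstm_cpu_vs_device_diff(device, batch_size=bs)
        print(f"batch_size={bs}: max abs diff = {diff:.3e}")
```

On an affected setup, the larger batch size may print a noticeably larger difference or raise the MIOpen error instead of returning. Either outcome indicates the workaround is still needed.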

Extra Tips

  • Ensure that the installed ROCm and MIOpen versions are ones the installed PyTorch build supports.
  • If the issue persists, try a different GPU or a different version of PyTorch to narrow down which component is responsible.
  • If none of the steps above resolves the issue, add your reproduction details to the existing report (pytorch/pytorch#177834) or report it to the ROCm team.
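
When comparing versions or adding details to a report, it helps to record which backend the installed PyTorch build targets. The sketch below collects that information using standard torch attributes. The helper name describe_torch_backend is an assumption made for this example, and it does not query the MIOpen version.

```python
import torch


def describe_torch_backend():
    """Collect basic build and device information (hypothetical helper)."""
    info = {
        "torch": torch.__version__,
        "hip": torch.version.hip,    # None unless this is a ROCm build
        "cuda": torch.version.cuda,  # None unless this is a CUDA build
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        # On ROCm the AMD GPU is reported through the torch.cuda API.
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_torch_backend().items():
        print(f"{key}: {value}")
```

For a complete environment dump like the one in the report, `python -m torch.utils.collect_env` remains the more thorough option.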
