pytorch - ✅ (Solved) Fix [rocm only] Likely reuse after free when stacking many kernel calls, mixed with Triton kernel calls [2 pull requests, 1 comment, 1 participant]

GitHub stats
pytorch/pytorch#180341, fetched 2026-04-16 06:35:17
Comments: 1
Participants: 1
Timeline events: 32
Reactions: 0
Timeline (top): mentioned ×11, subscribed ×11, labeled ×6, added_to_project_v2 ×1

Error Message

=== DECODE (50 steps, M=1, no sync between steps) ===

Thread 66 "python" received signal SIGSEGV, Segmentation fault.
[Switching to thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660))]
0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
(gdb) thread apply all bt

Thread 76 (Thread 0x7fbd68ffc640 (LWP 3700672) "python"):
#0  0x00007ffff7ce7117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7cf1f7e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000529af0 in PyThread_acquire_lock_timed ()
#3  0x000000000063b117 in ?? ()
#4  0x000000000066a252 in ?? ()
#5  0x0000000000570bab in ?? ()
#6  0x0000000000562386 in PyObject_Vectorcall ()
#7  0x0000000000549fe0 in _PyEval_EvalFrameDefault ()
#8  0x000000000059961d in ?? ()
#9  0x00000000005991e6 in ?? ()
#10 0x00000000006a7499 in ?? ()
#11 0x00000000006a7448 in ?? ()
#12 0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 67 (Thread 0x7ffda7fff640 (LWP 3700661) "python"):
#0  0x00007ffff7d709cf in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffec3b222b0 in hsakmt_ioctl () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#2  0x00007ffec3b17f53 in hsaKmtWaitOnMultipleEvents_ExtCtx () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#3  0x00007ffec3a9b999 in rocr::core::Signal::WaitAnyExceptions(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, long*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#4  0x00007ffec3a77248 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#5  0x00007ffec3ad34dd in rocr::os::ThreadTrampoline(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#6  0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660) "python"):
#0  0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so

Root Cause

The issue title points to a likely reuse after free on ROCm when many kernel calls are stacked and mixed with Triton kernel calls. Symptoms listed in the issue:

  • vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (No stack trace is actually displayed.)
  • Process hanging while GPU memory is freed
  • RuntimeError: invalid StreamId4294967297
  • torch.AcceleratorError: HIP error: invalid argument
  • Double free or corruption (out)
  • (EngineCore pid=1711448) RuntimeError: Unknown device: -5. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
  • terminate called after throwing an instance of 'c10::Error' what(): found != kernels_.end() INTERNAL ASSERT FAILED at "/app/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp":245, please report a bug to PyTorch. Tried to deregister a kernel for dispatch key Meta but there are no kernels registered for this dispatch key. The operator is
  • Segmentation fault (core dumped) (the error produced by the minimal repro below)

Fix Action

Fix / Workaround


---- Workaround env var ----

_SYNCHRONIZE_LAYER = bool(int(os.environ.get("SYNCHRONIZE_LAYER", "0")))
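
The flag above is parsed once from the environment; the listing on this page does not show where it is consumed. A minimal sketch of how such a flag could gate a per-layer synchronization is given below. `layer_sync_enabled`, `run_layers`, and the `synchronize` parameter are hypothetical names used for illustration only; in the repro the callable would presumably be `torch.cuda.synchronize`.

```python
import os


def layer_sync_enabled(env=None) -> bool:
    """Parse SYNCHRONIZE_LAYER the same way as the flag above ("0" or "1")."""
    env = os.environ if env is None else env
    return bool(int(env.get("SYNCHRONIZE_LAYER", "0")))


def run_layers(layers, x, synchronize, enabled):
    """Apply each layer in turn and optionally block after every layer.

    `synchronize` is a hypothetical injection point (assumption): on a GPU
    it would be something like torch.cuda.synchronize, which waits until
    all queued kernels have finished.
    """
    for layer in layers:
        x = layer(x)
        if enabled:
            synchronize()
    return x


if __name__ == "__main__":
    sync_calls = []
    toy_layers = [lambda v: v + 1, lambda v: v * 2]
    result = run_layers(
        toy_layers,
        1,
        lambda: sync_calls.append("sync"),
        layer_sync_enabled({"SYNCHRONIZE_LAYER": "1"}),
    )
    print(result, len(sync_calls))
```

Blocking after every layer keeps the queue of in-flight GPU work short, at the cost of throughput. It is a workaround that may hide the crash, not a fix for the underlying cause.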


PR fix notes

PR #40033: [NVFP4][Hopper/AMD Instinct] Add Triton kernels for NVFP4 dequantization and QDQ emulation

Description (problem / solution / changelog)

Purpose

Add Triton kernels for NVFP4 dequantization and quantize-dequantize emulation to accelerate NVFP4 quantization emulation on devices that do not support this dtype natively.

The Triton kernels automatically replace the torch eager reference implementation when running on CUDA devices. They bring significant performance improvements and avoid launching many independent kernels, each with back-and-forth HBM reads and writes, when using --enforce-eager.

This PR is submitted independently from https://github.com/vllm-project/vllm/pull/35737, but closely related to it.
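
To make the dequantization step concrete, here is a minimal pure-Python sketch of NVFP4 dequantization for a single row. It is not the PR's Triton kernel. It assumes the E2M1 value set, two FP4 codes per byte with the low nibble first, one scale per block of 16 values, and the multiply-by-global-scale convention used by `dequantize_to_dtype` in the repro script on this page. Block scales are passed as already-decoded floats; decoding the FP8 scale bytes is left out.

```python
# E2M1 magnitudes indexed by the low 3 bits of a code; bit 3 is the sign.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # values sharing one block scale


def decode_e2m1(nibble: int) -> float:
    """Map a 4-bit E2M1 code to its float value."""
    magnitude = E2M1_VALUES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def unpack_fp4_row(packed: bytes) -> list:
    """Unpack two FP4 codes per byte, low nibble first (assumed layout)."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_row(packed: bytes, block_scales, global_scale: float) -> list:
    """Scale each unpacked value by its block scale and the global scale."""
    values = unpack_fp4_row(packed)
    if len(values) != len(block_scales) * BLOCK_SIZE:
        raise ValueError("expected one block scale per 16 unpacked values")
    return [
        value * block_scales[index // BLOCK_SIZE] * global_scale
        for index, value in enumerate(values)
    ]


if __name__ == "__main__":
    row = bytes([0x21] * 8)  # 16 values: codes 1 and 2 alternating
    print(dequantize_row(row, [2.0], 0.5))
```

The eager version of this computation runs as several separate tensor operations (unpack, reshape, scale, cast); a fused kernel would do the same arithmetic in one pass over the packed bytes.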

Test Plan

pytest tests/models/quantization/test_nvfp4.py::test_triton_dequantize_nvfp4 -vvvvv -s 
pytest tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant -vvvvv -s

Test Result

The tests verify:

  1. Correctness: Triton kernels produce bitwise-identical results to CPU reference (atol=0, rtol=0)
  2. Coverage: Tests multiple tensor shapes (2D/3D), sizes, and global scale values
  3. Real weights: Dequantization test uses actual NVFP4 weights from nvidia/Qwen3-30B-A3B-NVFP4 checkpoint
  4. Performance: Benchmarks show speedup vs. CPU reference implementation
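
The benchmark lines below report a median, minimum, and maximum per implementation plus a speedup. The PR's actual timing harness is not shown on this page; the following is a sketch of one way such numbers could be produced, assuming the speedup is the ratio of medians. `time_ms`, `summarize`, and `speedup` are hypothetical helper names.

```python
import statistics
import time


def time_ms(fn, repeats=20, warmup=3):
    """Return per-call wall-clock timings in milliseconds.

    For GPU work, `fn` would need to synchronize the device before
    returning (assumption); otherwise only the launch cost is measured.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples):
    """Collapse timings into the three statistics shown in the logs."""
    return {
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


def speedup(reference_samples, candidate_samples):
    """Ratio of median timings: reference over candidate."""
    return statistics.median(reference_samples) / statistics.median(candidate_samples)


if __name__ == "__main__":
    reference = time_ms(lambda: sum(range(20000)))
    candidate = time_ms(lambda: sum(range(2000)))
    print(summarize(reference), summarize(candidate))
    print(f"speedup: {speedup(reference, candidate):.2f}x")
```

Medians are less sensitive than means to the occasional slow run, which matters here because several reference timings show a maximum far above the median.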

CUDA_VISIBLE_DEVICES=7 pytest tests/models/quantization/test_nvfp4.py -s -vvvvv -k "test_triton_dequantize_nvfp4"

  dequantize 2D base [512, 1024]:
    triton:    median=0.007ms, min=0.007ms, max=0.008ms
    reference: median=0.088ms, min=0.073ms, max=0.256ms
    speedup:   12.63x
  dequantize 2D 2x rows [1024, 1024]:
    triton:    median=0.008ms, min=0.008ms, max=0.014ms
    reference: median=0.087ms, min=0.082ms, max=0.233ms
    speedup:   10.74x
  dequantize 2D 4x rows [2048, 1024]:
    triton:    median=0.011ms, min=0.011ms, max=0.018ms
    reference: median=0.114ms, min=0.112ms, max=0.124ms
    speedup:   10.10x
  dequantize 2D 2x cols [512, 2048]:
    triton:    median=0.008ms, min=0.008ms, max=0.014ms
    reference: median=0.090ms, min=0.081ms, max=0.269ms
    speedup:   10.93x
  dequantize 3D base [128, 768, 1024]:
    triton:    median=0.338ms, min=0.335ms, max=0.344ms
    reference: median=3.812ms, min=3.792ms, max=3.830ms
    speedup:   11.27x
  dequantize 3D 2x experts [256, 768, 1024]:
    triton:    median=0.673ms, min=0.670ms, max=0.678ms
    reference: median=7.641ms, min=7.596ms, max=7.660ms
    speedup:   11.35x
  dequantize 3D 2x rows [128, 1536, 1024]:
    triton:    median=0.678ms, min=0.674ms, max=0.683ms
    reference: median=7.669ms, min=7.652ms, max=7.715ms
    speedup:   11.31x
  dequantize 3D 2x cols [128, 768, 2048]:
    triton:    median=0.660ms, min=0.655ms, max=0.669ms
    reference: median=7.666ms, min=7.639ms, max=7.680ms
    speedup:   11.61x

CUDA_VISIBLE_DEVICES=5 pytest tests/models/quantization/test_nvfp4.py -s -vvvvv -k "test_triton_nvfp4_quant_dequant"

tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-1-16]   quant_dequant [1x16] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.013ms
    reference: median=0.380ms, min=0.352ms, max=0.719ms
    speedup:   57.23x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-1-32]   quant_dequant [1x32] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.013ms
    reference: median=0.386ms, min=0.366ms, max=0.748ms
    speedup:   58.44x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-2-48]   quant_dequant [2x48] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.013ms
    reference: median=0.380ms, min=0.355ms, max=0.710ms
    speedup:   56.89x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-7-64]   quant_dequant [7x64] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.012ms
    reference: median=0.381ms, min=0.357ms, max=0.722ms
    speedup:   57.43x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-16-128]   quant_dequant [16x128] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.013ms
    reference: median=0.383ms, min=0.357ms, max=0.747ms
    speedup:   57.64x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-33-160]   quant_dequant [33x160] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.012ms
    reference: median=0.381ms, min=0.360ms, max=0.754ms
    speedup:   57.38x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-128-256]   quant_dequant [128x256] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.013ms
    reference: median=0.383ms, min=0.354ms, max=0.743ms
    speedup:   58.02x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-256-512]   quant_dequant [256x512] gs=0.001:
    triton:    median=0.007ms, min=0.006ms, max=0.012ms
    reference: median=0.378ms, min=0.355ms, max=0.725ms
    speedup:   56.66x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-1024-1024]   quant_dequant [1024x1024] gs=0.001:
    triton:    median=0.007ms, min=0.007ms, max=0.025ms
    reference: median=0.375ms, min=0.350ms, max=0.708ms
    speedup:   55.11x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-5120-2048]   quant_dequant [5120x2048] gs=0.001:
    triton:    median=0.019ms, min=0.019ms, max=0.032ms
    reference: median=0.608ms, min=0.604ms, max=0.615ms
    speedup:   31.68x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-2048-4096]   quant_dequant [2048x4096] gs=0.001:
    triton:    median=0.017ms, min=0.016ms, max=0.023ms
    reference: median=0.504ms, min=0.500ms, max=0.510ms
    speedup:   30.28x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-4096-7168]   quant_dequant [4096x7168] gs=0.001:
    triton:    median=0.052ms, min=0.050ms, max=0.058ms
    reference: median=1.460ms, min=1.453ms, max=1.470ms
    speedup:   28.30x
PASSED
tests/models/quantization/test_nvfp4.py::test_triton_nvfp4_quant_dequant[0.001-8192-8192]   quant_dequant [8192x8192] gs=0.001:
    triton:    median=0.092ms, min=0.089ms, max=0.098ms
    reference: median=3.960ms, min=3.935ms, max=3.989ms
    speedup:   43.23x
PASSED

Changed files

  • tests/models/quantization/test_nvfp4.py (modified, +286/-0)
  • vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py (modified, +319/-15)

Code Example

docker run -td --name bug_repro --rm --device /dev/kfd \
    -e PYTORCH_ROCM_ARCH=gfx942 \
    -e PYTORCH_TUNABLEOP_ENABLED=0 \
    --device /dev/dri --security-opt seccomp=unconfined \
    --shm-size=128g --net host --pid host \
    -v $(pwd)/hf_cache:/hf_cache \
    rocm/vllm-dev:nightly_main_20260413 /bin/bash

docker exec -it bug_repro /bin/bash

---

import glob
import os

import huggingface_hub
import torch
import triton
from safetensors import safe_open

from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
    break_fp4_bytes,
    kE2M1ToFloat_handle,
    ref_nvfp4_quant,
)

# ---- Workaround env var ----
_SYNCHRONIZE_LAYER = bool(int(os.environ.get("SYNCHRONIZE_LAYER", "0")))

# ---- Qwen3-30B-A3B-FP4 dimensions ----
E = 128         # num experts
K = 2048        # hidden_dim
N = 768         # intermediate_size_per_partition (1536/2 for gated)
TOP_K = 8
NUM_LAYERS = 10  # higher number of layers = higher likelihood of issue
ATTN_DIM = 4096  # attention intermediate dim
BLOCK_SIZE = 16


# ---- 3D-capable dequantize (not on main branch) ----
def dequantize_to_dtype(
    tensor_fp4: torch.Tensor,
    tensor_sf: torch.Tensor,
    global_scale: torch.Tensor,
    dtype: torch.dtype,
):
    """Dequantize the fp4 tensor back to high precision.

    Supports both 2D and 3D inputs:
    - 2D: [m, packed_k] -> [m, k]
    - 3D: [dim0, m, packed_k] -> [dim0, m, k]
    """
    assert tensor_fp4.dtype == torch.uint8

    is_3d = tensor_fp4.ndim == 3

    if is_3d:
        dim0, m, packed_k = tensor_fp4.shape
        tensor_fp4 = tensor_fp4.reshape(-1, packed_k)
        tensor_sf = tensor_sf.reshape(-1, tensor_sf.shape[-1])
        global_scale = global_scale[:, None, None]
    else:
        m, packed_k = tensor_fp4.shape

    k = packed_k * 2
    tensor_f32 = break_fp4_bytes(tensor_fp4, torch.float32)
    tensor_f32 = tensor_f32.reshape(-1, k // BLOCK_SIZE, BLOCK_SIZE)
    tensor_sf = tensor_sf.view(torch.float8_e4m3fn)

    if is_3d:
        tensor_sf = tensor_sf.reshape(dim0, m, k // BLOCK_SIZE)
    tensor_sf_dtype = tensor_sf.to(torch.float32) * global_scale

    if is_3d:
        tensor_f32 = tensor_f32.reshape(dim0, m, -1, BLOCK_SIZE)

    out = tensor_f32 * tensor_sf_dtype.unsqueeze(-1)
    out = out.reshape(*out.shape[:-2], -1)

    return out.to(dtype)


# ---- Activation quantize-dequantize (inline, no dependency on branch code) ----
def ref_nvfp4_quant_dequant(x, global_scale):
    """Quantize to FP4 and immediately dequantize back (activation QDQ)."""
    fp4, scale = ref_nvfp4_quant(x, global_scale, BLOCK_SIZE)
    m, n = x.shape
    fp4 = fp4.reshape(m, n // BLOCK_SIZE, BLOCK_SIZE)
    scale = scale.unsqueeze(-1) / global_scale
    x_dq = (fp4 * scale).reshape(m, n).to(x.dtype)
    return x_dq, None


MODEL_ID = "nvidia/Qwen3-30B-A3B-NVFP4"


def load_checkpoint(device):
    """Download model and load all safetensors into a flat dict on device."""
    model_path = huggingface_hub.snapshot_download(MODEL_ID)
    shard_files = sorted(glob.glob(os.path.join(model_path, "*.safetensors")))
    print(f"Found {len(shard_files)} safetensors shard(s) in {model_path}")

    tensors = {}
    for sf_path in shard_files:
        with safe_open(sf_path, framework="pt", device=str(device)) as f:
            for k in f.keys():
                tensors[k] = f.get_tensor(k)
    return tensors


def load_layer_weights(tensors, layer_idx, device):
    """Extract MoE and attention weights for a single layer from the flat dict.

    Returns:
        (w1_packed, w1_blockscale, w1_global_scale,
         w2_packed, w2_blockscale, w2_global_scale,
         act_global_scale, lin_global_scale)
    """
    prefix = f"model.layers.{layer_idx}"

    # ---- MoE expert weights ----
    all_w1_packed, all_w1_blockscale = [], []
    all_w2_packed, all_w2_blockscale = [], []
    g1_alphas, g2_alphas = [], []
    a1_gscales = []

    for e in range(E):
        ep = f"{prefix}.mlp.experts.{e}"

        # gate_proj + up_proj -> w1 (stacked along out_dim)
        gate_w = tensors[f"{ep}.gate_proj.weight"]       # [N, K//2] uint8
        up_w = tensors[f"{ep}.up_proj.weight"]            # [N, K//2] uint8
        w1_packed = torch.cat([gate_w, up_w], dim=0)      # [2*N, K//2]
        all_w1_packed.append(w1_packed)

        gate_bs = tensors[f"{ep}.gate_proj.weight_scale"]  # [N, K//BLOCK_SIZE] fp8
        up_bs = tensors[f"{ep}.up_proj.weight_scale"]      # [N, K//BLOCK_SIZE] fp8
        w1_bs = torch.cat([gate_bs, up_bs], dim=0)
        all_w1_blockscale.append(w1_bs.view(torch.uint8))

        # Use gate_proj's global scale for w1 (gate and up have same value)
        g1_alphas.append(tensors[f"{ep}.gate_proj.weight_scale_2"])

        # down_proj -> w2
        all_w2_packed.append(tensors[f"{ep}.down_proj.weight"])
        down_bs = tensors[f"{ep}.down_proj.weight_scale"]
        all_w2_blockscale.append(down_bs.view(torch.uint8))
        g2_alphas.append(tensors[f"{ep}.down_proj.weight_scale_2"])

        # Activation input scale (same for gate/up)
        a1_gscales.append(tensors[f"{ep}.gate_proj.input_scale"])

    w1_packed = torch.stack(all_w1_packed)          # [E, 2*N, K//2]
    w1_blockscale = torch.stack(all_w1_blockscale)  # [E, 2*N, K//BLOCK_SIZE]
    w1_global_scale = torch.stack(g1_alphas)        # [E]
    w2_packed = torch.stack(all_w2_packed)           # [E, K, N//2]
    w2_blockscale = torch.stack(all_w2_blockscale)   # [E, K, N//BLOCK_SIZE]
    w2_global_scale = torch.stack(g2_alphas)        # [E]
    # Use the first expert's input_scale as representative
    act_global_scale = a1_gscales[0].unsqueeze(0)   # [1]

    # ---- Attention linear global scale (taken from q_proj's weight_scale_2) ----
    attn = f"{prefix}.self_attn"
    lin_global_scale = tensors[f"{attn}.q_proj.weight_scale_2"].unsqueeze(0)

    return (w1_packed, w1_blockscale, w1_global_scale,
            w2_packed, w2_blockscale, w2_global_scale,
            act_global_scale, lin_global_scale)


def main():
    device = torch.device("cuda:0")
    dtype = torch.bfloat16
    print(f"torch={torch.__version__}")
    print(f"triton={triton.__version__}")

    print(f"SYNCHRONIZE_LAYER={_SYNCHRONIZE_LAYER}")
    print(f"Layers={NUM_LAYERS}, E={E}, K={K}, N={N}, top_k={TOP_K}")

    kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(device)

    # ---- Load checkpoint ----
    print(f"Downloading / loading {MODEL_ID}...", flush=True)
    tensors = load_checkpoint(device)
    print(f"Loaded {len(tensors)} tensors from checkpoint.", flush=True)

    # ---- Build all layers ----
    print("Building layers from checkpoint weights...", flush=True)
    layers = []
    for i in range(NUM_LAYERS):
        if (i + 1) % 10 == 0 or i == 0:
            print(f"  layer {i+1}/{NUM_LAYERS}", flush=True)

        (w1_packed, w1_blockscale, w1_global_scale,
         w2_packed, w2_blockscale, w2_global_scale,
         act_global_scale, lin_global_scale) = load_layer_weights(tensors, i, device)

        layers.append((w1_packed, w1_blockscale, w1_global_scale,
                        w2_packed, w2_blockscale, w2_global_scale,
                        act_global_scale, lin_global_scale))

    # Free the full checkpoint dict to save memory
    del tensors
    torch.cuda.synchronize()
    print("All layers built.\n", flush=True)

    # ---- Forward pass function (NO sync inside) ----
    def forward(hidden_states):
        M = hidden_states.shape[0]
        for (w1_packed, w1_scale_val, _g1_alphas,
             w2_packed, w2_scale_val, _g2_alphas,
             _a1_gscale, lin_global_scale) in layers:
            for _ in range(1):
                hidden_states, _ = ref_nvfp4_quant_dequant(hidden_states, lin_global_scale)

            # NOTE: no segfault/issue if commenting out the `dequantize_to_dtype` calls.

            # Dequantize w1 — result is a LOCAL variable
            w1_dequant = dequantize_to_dtype(
                tensor_fp4=w1_packed, tensor_sf=w1_scale_val,
                global_scale=_g1_alphas, dtype=hidden_states.dtype,
            )

            # Dequantize w2 — result is a LOCAL variable
            w2_dequant = dequantize_to_dtype(
                tensor_fp4=w2_packed, tensor_sf=w2_scale_val,
                global_scale=_g2_alphas, dtype=hidden_states.dtype,
            )

            # Activation QDQ
            hidden_states, _ = ref_nvfp4_quant_dequant(
                hidden_states, _a1_gscale
            )

        return hidden_states

    # ---- Prefill ----
    print("=== PREFILL (M=16) ===", flush=True)
    h = torch.randn(16, K, dtype=dtype, device=device)
    h = forward(h)
    torch.cuda.synchronize()
    print(f"  OK (norm={h.float().norm().item():.2f})\n", flush=True)

    # ---- Decode steps (M=1, no sync between steps) ----
    num_decode = 50
    print(f"=== DECODE ({num_decode} steps, M=1, no sync between steps) ===",
          flush=True)
    for step in range(num_decode):
        h = torch.randn(1, K, dtype=dtype, device=device)
        h = forward(h)
        # NO sync here — let GPU work queue build up across steps

    # Final sync to catch deferred errors
    torch.cuda.synchronize()
    print(f"  All {num_decode} decode steps completed.", flush=True)
    print("SUCCESS")


if __name__ == "__main__":
    main()
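
The script above dequantizes both expert weight tensors into local variables on every layer of every step, so large temporaries are allocated and released continuously while kernels are still queued. The sketch below works out the size of those temporaries from the script's own constants, assuming a bfloat16 output (2 bytes per element). Whether this allocation churn is what triggers the crash is not established on this page.

```python
# Constants copied from the repro script.
E, K, N = 128, 2048, 768
NUM_LAYERS = 10
BF16_BYTES = 2  # assumption: outputs are bfloat16, as in the script's main()


def tensor_bytes(shape, bytes_per_element=BF16_BYTES):
    """Size in bytes of a dense tensor with the given shape."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


# Packed [E, 2*N, K//2] unpacks to [E, 2*N, K]; packed [E, K, N//2] to [E, K, N].
w1_dequant_shape = (E, 2 * N, K)
w2_dequant_shape = (E, K, N)

w1_mib = tensor_bytes(w1_dequant_shape) / 2**20
w2_mib = tensor_bytes(w2_dequant_shape) / 2**20
per_forward_mib = NUM_LAYERS * (w1_mib + w2_mib)

if __name__ == "__main__":
    print(f"w1_dequant: {w1_mib:.0f} MiB, w2_dequant: {w2_mib:.0f} MiB")
    print(f"dequantized outputs per forward pass: {per_forward_mib:.0f} MiB")
```

Under these assumptions `w1_dequant` is 768 MiB and `w2_dequant` is 384 MiB, so each forward pass over 10 layers produces and drops about 11 GiB of dequantized outputs, not counting the float32 intermediates created inside `dequantize_to_dtype`.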

---

hf download nvidia/Qwen3-30B-A3B-NVFP4
CUDA_VISIBLE_DEVICES=0 python repro_nvfp4_moe_main.py

---

torch=2.12.0.dev20260413+rocm7.2
triton=3.7.0
SYNCHRONIZE_LAYER=False
Layers=10, E=128, K=2048, N=768, top_k=8
Downloading / loading nvidia/Qwen3-30B-A3B-NVFP4...
Fetching 18 files: 100%|████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 156633.76it/s]
Found 4 safetensors shard(s) in /scratch/huggingface_hub/models--nvidia--Qwen3-30B-A3B-NVFP4/snapshots/2538ded2a4edb247b4d2b4a8ba24e44bd4c017c3
Loaded 74835 tensors from checkpoint.
Building layers from checkpoint weights...
  layer 1/10
  layer 10/10
All layers built.

=== PREFILL (M=16) ===
WARNING 04-14 13:00:35 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=AMD_Instinct_MI325X.json
  OK (norm=0.00)

=== DECODE (50 steps, M=1, no sync between steps) ===
Segmentation fault (core dumped)

---

torch=2.12.0.dev20260413+rocm7.2
triton=3.7.0
SYNCHRONIZE_LAYER=False
Layers=10, E=128, K=2048, N=768, top_k=8
Downloading / loading nvidia/Qwen3-30B-A3B-NVFP4...
Fetching 18 files: 100%|████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 246723.76it/s]
Found 4 safetensors shard(s) in /scratch/huggingface_hub/models--nvidia--Qwen3-30B-A3B-NVFP4/snapshots/2538ded2a4edb247b4d2b4a8ba24e44bd4c017c3
Loaded 74835 tensors from checkpoint.
Building layers from checkpoint weights...
  layer 1/10
  layer 10/10
All layers built.

=== PREFILL (M=16) ===
WARNING 04-14 13:01:19 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=AMD_Instinct_MI325X.json
  OK (norm=0.00)

=== DECODE (50 steps, M=1, no sync between steps) ===
  All 50 decode steps completed.
SUCCESS

---

pip uninstall torch
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.2
pip uninstall vllm mori
git clone https://github.com/vllm-project/vllm.git && cd vllm
export PYTORCH_ROCM_ARCH=gfx942
python setup.py develop

---

FROM nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04

RUN apt update && apt install -y git wget nano curl

ENV PATH="/root/miniforge3/bin:${PATH}"
ARG PATH="/root/miniforge3/bin:${PATH}"

RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \
    bash Miniforge3-Linux-x86_64.sh -b && \
    conda init

RUN pip install torch==2.11.0

ENV TORCH_CUDA_ARCH_LIST=9.0

RUN pip install accelerate datasets pytest nltk huggingface_hub

WORKDIR /workspace

---

VLLM_USE_PRECOMPILED=1 pip install -vvv --editable . --no-build-isolation

---

=== DECODE (50 steps, M=1, no sync between steps) ===

Thread 66 "python" received signal SIGSEGV, Segmentation fault.
[Switching to thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660))]
0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
(gdb) thread apply all bt

Thread 76 (Thread 0x7fbd68ffc640 (LWP 3700672) "python"):
#0  0x00007ffff7ce7117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7cf1f7e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000529af0 in PyThread_acquire_lock_timed ()
#3  0x000000000063b117 in ?? ()
#4  0x000000000066a252 in ?? ()
#5  0x0000000000570bab in ?? ()
#6  0x0000000000562386 in PyObject_Vectorcall ()
#7  0x0000000000549fe0 in _PyEval_EvalFrameDefault ()
#8  0x000000000059961d in ?? ()
#9  0x00000000005991e6 in ?? ()
#10 0x00000000006a7499 in ?? ()
#11 0x00000000006a7448 in ?? ()
#12 0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 67 (Thread 0x7ffda7fff640 (LWP 3700661) "python"):
#0  0x00007ffff7d709cf in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffec3b222b0 in hsakmt_ioctl () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#2  0x00007ffec3b17f53 in hsaKmtWaitOnMultipleEvents_ExtCtx () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#3  0x00007ffec3a9b999 in rocr::core::Signal::WaitAnyExceptions(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, long*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#4  0x00007ffec3a77248 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#5  0x00007ffec3ad34dd in rocr::os::ThreadTrampoline(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#6  0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660) "python"):
#0  0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so

---

CMake version: version 3.31.10
Libc version: glibc-2.35

Python version: 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Instinct MI325X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.53211
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9555 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             4409.3750
CPU min MHz:                             1500.0000
BogoMIPS:                                6399.98
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.91
[pip3] torch==2.12.0.dev20260413+rocm7.2
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] triton==3.6.0
[pip3] triton_kernels==1.0.0
[pip3] triton-rocm==3.7.0+git282c8251
[conda] Could not collect

---

PyTorch version: 2.11.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.13.12 | packaged by conda-forge | (main, Feb  5 2026, 05:53:46) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-107-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.0.88
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 595.58.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.14.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  384
On-line CPU(s) list:                     0-383
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9654 96-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      2
Core(s) per socket:                      96
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      50%
CPU max MHz:                             3709.3569
CPU min MHz:                             1500.0000
BogoMIPS:                                4799.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap ibpb_exit_to_user
Virtualization:                          AMD-V
L1d cache:                               6 MiB (192 instances)
L1i cache:                               6 MiB (192 instances)
L2 cache:                                192 MiB (192 instances)
L3 cache:                                768 MiB (24 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-95,192-287
NUMA node1 CPU(s):                       96-191,288-383
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] triton==3.6.0
[conda] numpy                                       2.2.6                                    pypi_0                pypi
[conda] nvidia-cublas                               13.1.0.3                                 pypi_0                pypi
[conda] nvidia-cuda-cupti                           13.0.85                                  pypi_0                pypi
[conda] nvidia-cuda-nvrtc                           13.0.88                                  pypi_0                pypi
[conda] nvidia-cuda-runtime                         13.0.96                                  pypi_0                pypi
[conda] nvidia-cudnn-cu13                           9.19.0.56                                pypi_0                pypi
[conda] nvidia-cudnn-frontend                       1.18.0                                   pypi_0                pypi
[conda] nvidia-cufft                                12.0.0.61                                pypi_0                pypi
[conda] nvidia-curand                               10.4.0.35                                pypi_0                pypi
[conda] nvidia-cusolver                             12.0.4.66                                pypi_0                pypi
[conda] nvidia-cusparse                             12.6.3.3                                 pypi_0                pypi
[conda] nvidia-cusparselt-cu13                      0.8.0                                    pypi_0                pypi
[conda] nvidia-nccl-cu13                            2.28.9                                   pypi_0                pypi
[conda] nvidia-nvjitlink                            13.0.88                                  pypi_0                pypi
[conda] nvidia-nvtx                                 13.0.85                                  pypi_0                pypi
[conda] torch                                       2.11.0                                   pypi_0                pypi
[conda] torch-c-dlpack-ext                          0.1.5                                    pypi_0                pypi
[conda] torchaudio                                  2.11.0                                   pypi_0                pypi
[conda] torchvision                                 0.26.0                                   pypi_0                pypi
[conda] triton                                      3.6.0                                    pypi_0                pypi

---

PyTorch version: 2.10.0+git8514f05
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.2.53211

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version: 22.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.2.1 26084 f58b06dce1f9c15707c5f808fd002e18c2accf7e)
CMake version: version 3.31.10
Libc version: glibc-2.35

Python version: 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:  (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.53211
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9555 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             4409.3750
CPU min MHz:                             1500.0000
BogoMIPS:                                6399.98
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.91
[pip3] torch==2.10.0+git8514f05
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] triton==3.6.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

🐛 Describe the bug

Hello,

While developing https://github.com/vllm-project/vllm/pull/35737, I started hitting what looks like a use-after-free and/or stack-smashing issue after updating from a ROCm image with torch==2.9/triton==3.4 to an image based on torch==2.10 and triton==3.6. I am not completely sure that the cause is a regression in torch/triton/ROCm, though, since the issue only started after migrating to a more recent Docker image that ships newer torch/triton.

The issue is not reproducible on Nvidia H100 + Nvidia distribution of pytorch.

I can reproduce the issue on two distinct machines (MI325 and MI355). The issue disappears as soon as AMD_SERIALIZE_KERNEL=3 or HIP_LAUNCH_BLOCKING=1 is set on the ROCm side.

If it matters: the issue also occurs with HSA_NO_SCRATCH_RECLAIM and HIP_FORCE_DEV_KERNARG unset.

The behavior is not consistent between runs. When running the full workload through vLLM, the crashes surface as various seemingly random errors, such as:

  • vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (with no stack trace displayed)
  • Process hanging while GPU memory is freed
  • RuntimeError: invalid StreamId4294967297
  • torch.AcceleratorError: HIP error: invalid argument
  • Double free or corruption (out)
  • (EngineCore pid=1711448) RuntimeError: Unknown device: -5. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
  • terminate called after throwing an instance of 'c10::Error' what(): found != kernels_.end() INTERNAL ASSERT FAILED at "/app/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp":245, please report a bug to PyTorch. Tried to deregister a kernel for dispatch key Meta but there are no kernels registered for this dispatch key. The operator is
  • Segmentation fault (core dumped) ==> error from the minimal repro below

For your convenience, here is a minimal repro that runs out of the box in the docker container rocm/vllm-dev:nightly_main_20260413 and does not depend on loading a vLLM model:

docker run -td --name bug_repro --rm --device /dev/kfd \
    -e PYTORCH_ROCM_ARCH=gfx942 \
    -e PYTORCH_TUNABLEOP_ENABLED=0 \
    --device /dev/dri --security-opt seccomp=unconfined \
    --shm-size=128g --net host --pid host \
    -v $(pwd)/hf_cache:/hf_cache \
    rocm/vllm-dev:nightly_main_20260413 /bin/bash

docker exec -it bug_repro /bin/bash

Then, inside the container, save the following script as repro_nvfp4_moe_main.py:

import glob
import os

import huggingface_hub
import torch
import triton
from safetensors import safe_open

from vllm.model_executor.layers.quantization.utils.nvfp4_emulation_utils import (
    break_fp4_bytes,
    kE2M1ToFloat_handle,
    ref_nvfp4_quant,
)

# ---- Workaround env var ----
_SYNCHRONIZE_LAYER = bool(int(os.environ.get("SYNCHRONIZE_LAYER", "0")))

# ---- Qwen3-30B-A3B-FP4 dimensions ----
E = 128         # num experts
K = 2048        # hidden_dim
N = 768         # intermediate_size_per_partition (1536/2 for gated)
TOP_K = 8
NUM_LAYERS = 10  # higher number of layers = higher likelihood of issue
ATTN_DIM = 4096  # attention intermediate dim
BLOCK_SIZE = 16


# ---- 3D-capable dequantize (not on main branch) ----
def dequantize_to_dtype(
    tensor_fp4: torch.Tensor,
    tensor_sf: torch.Tensor,
    global_scale: torch.Tensor,
    dtype: torch.dtype,
):
    """Dequantize the fp4 tensor back to high precision.

    Supports both 2D and 3D inputs:
    - 2D: [m, packed_k] -> [m, k]
    - 3D: [dim0, m, packed_k] -> [dim0, m, k]
    """
    assert tensor_fp4.dtype == torch.uint8

    is_3d = tensor_fp4.ndim == 3

    if is_3d:
        dim0, m, packed_k = tensor_fp4.shape
        tensor_fp4 = tensor_fp4.reshape(-1, packed_k)
        tensor_sf = tensor_sf.reshape(-1, tensor_sf.shape[-1])
        global_scale = global_scale[:, None, None]
    else:
        m, packed_k = tensor_fp4.shape

    k = packed_k * 2
    tensor_f32 = break_fp4_bytes(tensor_fp4, torch.float32)
    tensor_f32 = tensor_f32.reshape(-1, k // BLOCK_SIZE, BLOCK_SIZE)
    tensor_sf = tensor_sf.view(torch.float8_e4m3fn)

    if is_3d:
        tensor_sf = tensor_sf.reshape(dim0, m, k // BLOCK_SIZE)
    tensor_sf_dtype = tensor_sf.to(torch.float32) * global_scale

    if is_3d:
        tensor_f32 = tensor_f32.reshape(dim0, m, -1, BLOCK_SIZE)

    out = tensor_f32 * tensor_sf_dtype.unsqueeze(-1)
    out = out.reshape(*out.shape[:-2], -1)

    return out.to(dtype)


# ---- Activation quantize-dequantize (inline, no dependency on branch code) ----
def ref_nvfp4_quant_dequant(x, global_scale):
    """Quantize to FP4 and immediately dequantize back (activation QDQ)."""
    fp4, scale = ref_nvfp4_quant(x, global_scale, BLOCK_SIZE)
    m, n = x.shape
    fp4 = fp4.reshape(m, n // BLOCK_SIZE, BLOCK_SIZE)
    scale = scale.unsqueeze(-1) / global_scale
    x_dq = (fp4 * scale).reshape(m, n).to(x.dtype)
    return x_dq, None


MODEL_ID = "nvidia/Qwen3-30B-A3B-NVFP4"


def load_checkpoint(device):
    """Download model and load all safetensors into a flat dict on device."""
    model_path = huggingface_hub.snapshot_download(MODEL_ID)
    shard_files = sorted(glob.glob(os.path.join(model_path, "*.safetensors")))
    print(f"Found {len(shard_files)} safetensors shard(s) in {model_path}")

    tensors = {}
    for sf_path in shard_files:
        with safe_open(sf_path, framework="pt", device=str(device)) as f:
            for k in f.keys():
                tensors[k] = f.get_tensor(k)
    return tensors


def load_layer_weights(tensors, layer_idx, device):
    """Extract MoE and attention weights for a single layer from the flat dict.

    Returns:
        (w1_packed, w1_blockscale, w1_global_scale,
         w2_packed, w2_blockscale, w2_global_scale,
         act_global_scale, lin_global_scale)
    """
    prefix = f"model.layers.{layer_idx}"

    # ---- MoE expert weights ----
    all_w1_packed, all_w1_blockscale = [], []
    all_w2_packed, all_w2_blockscale = [], []
    g1_alphas, g2_alphas = [], []
    a1_gscales = []

    for e in range(E):
        ep = f"{prefix}.mlp.experts.{e}"

        # gate_proj + up_proj -> w1 (stacked along out_dim)
        gate_w = tensors[f"{ep}.gate_proj.weight"]       # [N, K//2] uint8
        up_w = tensors[f"{ep}.up_proj.weight"]            # [N, K//2] uint8
        w1_packed = torch.cat([gate_w, up_w], dim=0)      # [2*N, K//2]
        all_w1_packed.append(w1_packed)

        gate_bs = tensors[f"{ep}.gate_proj.weight_scale"]  # [N, K//BLOCK_SIZE] fp8
        up_bs = tensors[f"{ep}.up_proj.weight_scale"]      # [N, K//BLOCK_SIZE] fp8
        w1_bs = torch.cat([gate_bs, up_bs], dim=0)
        all_w1_blockscale.append(w1_bs.view(torch.uint8))

        # Use gate_proj's global scale for w1 (gate and up have same value)
        g1_alphas.append(tensors[f"{ep}.gate_proj.weight_scale_2"])

        # down_proj -> w2
        all_w2_packed.append(tensors[f"{ep}.down_proj.weight"])
        down_bs = tensors[f"{ep}.down_proj.weight_scale"]
        all_w2_blockscale.append(down_bs.view(torch.uint8))
        g2_alphas.append(tensors[f"{ep}.down_proj.weight_scale_2"])

        # Activation input scale (same for gate/up)
        a1_gscales.append(tensors[f"{ep}.gate_proj.input_scale"])

    w1_packed = torch.stack(all_w1_packed)          # [E, 2*N, K//2]
    w1_blockscale = torch.stack(all_w1_blockscale)  # [E, 2*N, K//BLOCK_SIZE]
    w1_global_scale = torch.stack(g1_alphas)        # [E]
    w2_packed = torch.stack(all_w2_packed)           # [E, K, N//2]
    w2_blockscale = torch.stack(all_w2_blockscale)   # [E, K, N//BLOCK_SIZE]
    w2_global_scale = torch.stack(g2_alphas)        # [E]
    # Use the first expert's input_scale as representative
    act_global_scale = a1_gscales[0].unsqueeze(0)   # [1]

    # ---- Attention linear global scale (only q_proj's weight_scale_2 is used) ----
    attn = f"{prefix}.self_attn"
    lin_global_scale = tensors[f"{attn}.q_proj.weight_scale_2"].unsqueeze(0)

    return (w1_packed, w1_blockscale, w1_global_scale,
            w2_packed, w2_blockscale, w2_global_scale,
            act_global_scale, lin_global_scale)


def main():
    device = torch.device("cuda:0")
    dtype = torch.bfloat16
    print(f"torch={torch.__version__}")
    print(f"triton={triton.__version__}")

    print(f"SYNCHRONIZE_LAYER={_SYNCHRONIZE_LAYER}")
    print(f"Layers={NUM_LAYERS}, E={E}, K={K}, N={N}, top_k={TOP_K}")

    kE2M1ToFloat_handle.val = kE2M1ToFloat_handle.val.to(device)

    # ---- Load checkpoint ----
    print(f"Downloading / loading {MODEL_ID}...", flush=True)
    tensors = load_checkpoint(device)
    print(f"Loaded {len(tensors)} tensors from checkpoint.", flush=True)

    # ---- Build all layers ----
    print("Building layers from checkpoint weights...", flush=True)
    layers = []
    for i in range(NUM_LAYERS):
        if (i + 1) % 10 == 0 or i == 0:
            print(f"  layer {i+1}/{NUM_LAYERS}", flush=True)

        (w1_packed, w1_blockscale, w1_global_scale,
         w2_packed, w2_blockscale, w2_global_scale,
         act_global_scale, lin_global_scale) = load_layer_weights(tensors, i, device)

        layers.append((w1_packed, w1_blockscale, w1_global_scale,
                        w2_packed, w2_blockscale, w2_global_scale,
                        act_global_scale, lin_global_scale))

    # Free the full checkpoint dict to save memory
    del tensors
    torch.cuda.synchronize()
    print("All layers built.\n", flush=True)

    # ---- Forward pass function (NO sync inside) ----
    def forward(hidden_states):
        M = hidden_states.shape[0]
        for (w1_packed, w1_scale_val, _g1_alphas,
             w2_packed, w2_scale_val, _g2_alphas,
             _a1_gscale, lin_global_scale) in layers:
            for _ in range(1):
                hidden_states, _ = ref_nvfp4_quant_dequant(hidden_states, lin_global_scale)

            # NOTE: no segfault/issue if commenting out the `dequantize_to_dtype` calls.

            # Dequantize w1 — result is a LOCAL variable
            w1_dequant = dequantize_to_dtype(
                tensor_fp4=w1_packed, tensor_sf=w1_scale_val,
                global_scale=_g1_alphas, dtype=hidden_states.dtype,
            )

            # Dequantize w2 — result is a LOCAL variable
            w2_dequant = dequantize_to_dtype(
                tensor_fp4=w2_packed, tensor_sf=w2_scale_val,
                global_scale=_g2_alphas, dtype=hidden_states.dtype,
            )

            # Activation QDQ
            hidden_states, _ = ref_nvfp4_quant_dequant(
                hidden_states, _a1_gscale
            )

        return hidden_states

    # ---- Prefill ----
    print("=== PREFILL (M=16) ===", flush=True)
    h = torch.randn(16, K, dtype=dtype, device=device)
    h = forward(h)
    torch.cuda.synchronize()
    print(f"  OK (norm={h.float().norm().item():.2f})\n", flush=True)

    # ---- Decode steps (M=1, no sync between steps) ----
    num_decode = 50
    print(f"=== DECODE ({num_decode} steps, M=1, no sync between steps) ===",
          flush=True)
    for step in range(num_decode):
        h = torch.randn(1, K, dtype=dtype, device=device)
        h = forward(h)
        # NO sync here — let GPU work queue build up across steps

    # Final sync to catch deferred errors
    torch.cuda.synchronize()
    print(f"  All {num_decode} decode steps completed.", flush=True)
    print("SUCCESS")


if __name__ == "__main__":
    main()
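
As an aside on the `[m, packed_k] -> [m, k]` shape change in `dequantize_to_dtype` above: each uint8 holds two 4-bit E2M1 values, and each block of `BLOCK_SIZE` values is multiplied by its block scale and a global scale. The following plain-Python sketch illustrates that arithmetic only. It is not vLLM's `break_fp4_bytes`; the helper names are hypothetical, and the nibble order (low nibble first) is an assumption.

```python
# E2M1 magnitudes indexed by the 3 low bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble):
    """Decode one 4-bit E2M1 code (bit 3 is the sign) to a Python float."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def unpack_fp4_bytes(packed):
    """Expand each byte into two floats. Assumption: low nibble comes first."""
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_row(packed, block_scales, global_scale, block_size=16):
    """Scale every block of `block_size` values by its block scale and the global scale."""
    values = unpack_fp4_bytes(packed)
    if len(values) != len(block_scales) * block_size:
        raise ValueError("expected one block scale per block_size values")
    return [
        value * block_scales[index // block_size] * global_scale
        for index, value in enumerate(values)
    ]
```

In the repro itself this work is done on the GPU as a chain of small elementwise tensor ops, so every call enqueues several kernels.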

Run the repro script with:

hf download nvidia/Qwen3-30B-A3B-NVFP4
CUDA_VISIBLE_DEVICES=0 python repro_nvfp4_moe_main.py

Example failure:

torch=2.12.0.dev20260413+rocm7.2
triton=3.7.0
SYNCHRONIZE_LAYER=False
Layers=10, E=128, K=2048, N=768, top_k=8
Downloading / loading nvidia/Qwen3-30B-A3B-NVFP4...
Fetching 18 files: 100%|████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 156633.76it/s]
Found 4 safetensors shard(s) in /scratch/huggingface_hub/models--nvidia--Qwen3-30B-A3B-NVFP4/snapshots/2538ded2a4edb247b4d2b4a8ba24e44bd4c017c3
Loaded 74835 tensors from checkpoint.
Building layers from checkpoint weights...
  layer 1/10
  layer 10/10
All layers built.

=== PREFILL (M=16) ===
WARNING 04-14 13:00:35 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=AMD_Instinct_MI325X.json
  OK (norm=0.00)

=== DECODE (50 steps, M=1, no sync between steps) ===
Segmentation fault (core dumped)

Example success with HIP_LAUNCH_BLOCKING=1:

torch=2.12.0.dev20260413+rocm7.2
triton=3.7.0
SYNCHRONIZE_LAYER=False
Layers=10, E=128, K=2048, N=768, top_k=8
Downloading / loading nvidia/Qwen3-30B-A3B-NVFP4...
Fetching 18 files: 100%|████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 246723.76it/s]
Found 4 safetensors shard(s) in /scratch/huggingface_hub/models--nvidia--Qwen3-30B-A3B-NVFP4/snapshots/2538ded2a4edb247b4d2b4a8ba24e44bd4c017c3
Loaded 74835 tensors from checkpoint.
Building layers from checkpoint weights...
  layer 1/10
  layer 10/10
All layers built.

=== PREFILL (M=16) ===
WARNING 04-14 13:01:19 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /felmarty/repos/vllm/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=AMD_Instinct_MI325X.json
  OK (norm=0.00)

=== DECODE (50 steps, M=1, no sync between steps) ===
  All 50 decode steps completed.
SUCCESS
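
Since the crash is intermittent, it may help to run the script several times per configuration and tally how each run ended. The sketch below is one way to do that with the standard library only. The helper names are hypothetical, and the script name is taken from the run command above.

```python
import os
import signal
import subprocess
import sys
from collections import Counter


def describe_exit(returncode):
    """Turn a subprocess return code into a short label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def tally_runs(cmd, extra_env, runs=5, timeout=None):
    """Run `cmd` `runs` times with `extra_env` added and count the outcomes."""
    env = {**os.environ, **extra_env}
    counts = Counter()
    for _ in range(runs):
        try:
            proc = subprocess.run(
                cmd,
                env=env,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hang is one of the reported failure modes, so count it separately.
            counts["timeout"] += 1
        else:
            counts[describe_exit(proc.returncode)] += 1
    return counts


if __name__ == "__main__":
    command = [sys.executable, "repro_nvfp4_moe_main.py"]
    configurations = [
        ("baseline", {}),
        ("HIP_LAUNCH_BLOCKING=1", {"HIP_LAUNCH_BLOCKING": "1"}),
        ("AMD_SERIALIZE_KERNEL=3", {"AMD_SERIALIZE_KERNEL": "3"}),
    ]
    for label, extra_env in configurations:
        print(label, dict(tally_runs(command, extra_env)))
```

A segmentation fault in the child shows up as `SIGSEGV`, and passing a `timeout` lets hangs be counted instead of blocking the loop.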

To reproduce the issue with torch nightly instead of torch==2.10.0+git8514f05, please run:

pip uninstall torch
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/rocm7.2
pip uninstall vllm mori
git clone https://github.com/vllm-project/vllm.git && cd vllm
export PYTORCH_ROCM_ARCH=gfx942
python setup.py develop

On Nvidia side, you can use:

FROM nvidia/cuda:13.0.2-cudnn-devel-ubuntu24.04

RUN apt update && apt install -y git wget nano curl

ENV PATH="/root/miniforge3/bin:${PATH}"
ARG PATH="/root/miniforge3/bin:${PATH}"

RUN wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh && \
    bash Miniforge3-Linux-x86_64.sh -b && \
    conda init

RUN pip install torch==2.11.0

ENV TORCH_CUDA_ARCH_LIST=9.0

RUN pip install accelerate datasets pytest nltk huggingface_hub

WORKDIR /workspace

and from there:

VLLM_USE_PRECOMPILED=1 pip install -vvv --editable . --no-build-isolation

With this setup, the same script runs successfully.

Do you have an idea what the issue could be? I suspect that stacking many kernel launches, possibly from a function like https://github.com/vllm-project/vllm/blob/f7e62e3d6618f64430262bf776079d0d89f20501/vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py#L89-L100 or https://github.com/vllm-project/vllm/blob/f7e62e3d6618f64430262bf776079d0d89f20501/vllm/model_executor/layers/quantization/utils/nvfp4_emulation_utils.py#L23-L40, is causing the crashes on ROCm, but I think ROCm should not crash with this code.
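
One way to probe the "many queued kernels" hypothesis would be to bound how many layers are enqueued between synchronizations. The script reads `SYNCHRONIZE_LAYER`, but `forward` as posted never consults it, so the sketch below shows how such a hook could be wired in. It is framework-agnostic, and the function and parameter names are hypothetical; in the repro, `sync` would be `torch.cuda.synchronize` and each `layer` would be the body of the per-layer loop.

```python
def run_with_bounded_queue(layers, x, sync, sync_every=1):
    """Apply `layers` in order, calling `sync()` after every `sync_every` layers.

    `layers` is a sequence of callables and `sync` is a zero-argument callable
    that blocks until previously queued work has finished.
    """
    if sync_every < 1:
        raise ValueError("sync_every must be at least 1")
    for index, layer in enumerate(layers, start=1):
        x = layer(x)
        if index % sync_every == 0:
            sync()
    return x
```

Sweeping `sync_every` upward from 1 might indicate whether the crash depends on how much work is queued without synchronization.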

rocgdb backtrace:

=== DECODE (50 steps, M=1, no sync between steps) ===

Thread 66 "python" received signal SIGSEGV, Segmentation fault.
[Switching to thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660))]
0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) ()
   from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
(gdb) thread apply all bt

Thread 76 (Thread 0x7fbd68ffc640 (LWP 3700672) "python"):
#0  0x00007ffff7ce7117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffff7cf1f7e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000000000529af0 in PyThread_acquire_lock_timed ()
#3  0x000000000063b117 in ?? ()
#4  0x000000000066a252 in ?? ()
#5  0x0000000000570bab in ?? ()
#6  0x0000000000562386 in PyObject_Vectorcall ()
#7  0x0000000000549fe0 in _PyEval_EvalFrameDefault ()
#8  0x000000000059961d in ?? ()
#9  0x00000000005991e6 in ?? ()
#10 0x00000000006a7499 in ?? ()
#11 0x00000000006a7448 in ?? ()
#12 0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#13 0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 67 (Thread 0x7ffda7fff640 (LWP 3700661) "python"):
#0  0x00007ffff7d709cf in ioctl () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ffec3b222b0 in hsakmt_ioctl () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#2  0x00007ffec3b17f53 in hsaKmtWaitOnMultipleEvents_ExtCtx () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#3  0x00007ffec3a9b999 in rocr::core::Signal::WaitAnyExceptions(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, long*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#4  0x00007ffec3a77248 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#5  0x00007ffec3ad34dd in rocr::os::ThreadTrampoline(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so
#6  0x00007ffff7ceaac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff7d7c8d0 in ?? () from /lib/x86_64-linux-gnu/libc.so.6

Thread 66 (Thread 0x7ffdad9ff640 (LWP 3700660) "python"):
#0  0x00007ffec3a76921 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /usr/local/lib/python3.12/dist-packages/torch/lib/libhsa-runtime64.so

Versions

MI325 + torch nightly

CMake version: version 3.31.10
Libc version: glibc-2.35

Python version: 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Instinct MI325X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.53211
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9555 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             4409.3750
CPU min MHz:                             1500.0000
BogoMIPS:                                6399.98
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.91
[pip3] torch==2.12.0.dev20260413+rocm7.2
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] triton==3.6.0
[pip3] triton_kernels==1.0.0
[pip3] triton-rocm==3.7.0+git282c8251
[conda] Could not collect

H100 + torch 2.11

PyTorch version: 2.11.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39

Python version: 3.13.12 | packaged by conda-forge | (main, Feb  5 2026, 05:53:46) [GCC 14.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-107-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.0.88
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 595.58.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.14.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  384
On-line CPU(s) list:                     0-383
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9654 96-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      2
Core(s) per socket:                      96
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      50%
CPU max MHz:                             3709.3569
CPU min MHz:                             1500.0000
BogoMIPS:                                4799.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap ibpb_exit_to_user
Virtualization:                          AMD-V
L1d cache:                               6 MiB (192 instances)
L1i cache:                               6 MiB (192 instances)
L2 cache:                                192 MiB (192 instances)
L3 cache:                                768 MiB (24 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-95,192-287
NUMA node1 CPU(s):                       96-191,288-383
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Vulnerable: Clear CPU buffers attempted, no microcode
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-nccl-cu13==2.28.9
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.11.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.26.0
[pip3] triton==3.6.0
[conda] numpy                                       2.2.6                                    pypi_0                pypi
[conda] nvidia-cublas                               13.1.0.3                                 pypi_0                pypi
[conda] nvidia-cuda-cupti                           13.0.85                                  pypi_0                pypi
[conda] nvidia-cuda-nvrtc                           13.0.88                                  pypi_0                pypi
[conda] nvidia-cuda-runtime                         13.0.96                                  pypi_0                pypi
[conda] nvidia-cudnn-cu13                           9.19.0.56                                pypi_0                pypi
[conda] nvidia-cudnn-frontend                       1.18.0                                   pypi_0                pypi
[conda] nvidia-cufft                                12.0.0.61                                pypi_0                pypi
[conda] nvidia-curand                               10.4.0.35                                pypi_0                pypi
[conda] nvidia-cusolver                             12.0.4.66                                pypi_0                pypi
[conda] nvidia-cusparse                             12.6.3.3                                 pypi_0                pypi
[conda] nvidia-cusparselt-cu13                      0.8.0                                    pypi_0                pypi
[conda] nvidia-nccl-cu13                            2.28.9                                   pypi_0                pypi
[conda] nvidia-nvjitlink                            13.0.88                                  pypi_0                pypi
[conda] nvidia-nvtx                                 13.0.85                                  pypi_0                pypi
[conda] torch                                       2.11.0                                   pypi_0                pypi
[conda] torch-c-dlpack-ext                          0.1.5                                    pypi_0                pypi
[conda] torchaudio                                  2.11.0                                   pypi_0                pypi
[conda] torchvision                                 0.26.0                                   pypi_0                pypi
[conda] triton                                      3.6.0                                    pypi_0                pypi

ROCm + MI325 + custom torch==2.10.0+git8514f05 from rocm/vllm-dev:nightly_main_20260413

PyTorch version: 2.10.0+git8514f05
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.2.53211

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version: 22.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.2.1 26084 f58b06dce1f9c15707c5f808fd002e18c2accf7e)
CMake version: version 3.31.10
Libc version: glibc-2.35

Python version: 3.12.13 (main, Mar  4 2026, 09:23:07) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-101-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration:  (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.53211
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9555 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             4409.3750
CPU min MHz:                             1500.0000
BogoMIPS:                                6399.98
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc amd_ibpb_ret arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d debug_swap
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.91
[pip3] torch==2.10.0+git8514f05
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] triton==3.6.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang

extent analysis

TL;DR

The crash is likely caused by a synchronization problem between asynchronously launched GPU kernels. Setting HIP_LAUNCH_BLOCKING=1 or AMD_SERIALIZE_KERNEL=3 resolves it, which suggests that the eventual fix involves kernel synchronization.

Guidance

  • The problem seems to be related to the execution of multiple GPU kernels without proper synchronization, leading to crashes and segmentation faults.
  • The fact that setting HIP_LAUNCH_BLOCKING=1 or AMD_SERIALIZE_KERNEL=3 resolves the issue suggests that the problem is related to the asynchronous execution of kernels.
  • To work around the issue, try synchronizing after each kernel launch with torch.cuda.synchronize() or torch.cuda.current_stream().synchronize(). Both calls also work on ROCm builds, where the torch.cuda API is backed by HIP.
  • Another possible solution is to use HIP_LAUNCH_BLOCKING=1 or AMD_SERIALIZE_KERNEL=3 as an environment variable to force synchronous kernel launches.
  • It's also worth investigating the kernel launches in the dequantize_to_dtype and ref_nvfp4_quant_dequant functions to ensure that they are properly synchronized.
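
One way to act on the guidance above is to make the synchronization interval configurable, so the decode loop can be run with a sync after every step, every few steps, or never, to narrow down where the crash starts to appear. The sketch below is hypothetical: run_steps, step_fn, sync_fn, and sync_every are illustrative names, not part of PyTorch, Triton, or vLLM. In a real run sync_fn would be torch.cuda.synchronize; a plain list stands in here so the sketch runs without a GPU.

```python
from typing import Callable


def run_steps(
    step_fn: Callable[[int], None],
    num_steps: int,
    sync_fn: Callable[[], None],
    sync_every: int = 1,
) -> int:
    """Run step_fn for num_steps steps, calling sync_fn every sync_every steps.

    Returns how many times sync_fn was called. sync_every=0 disables syncing,
    which matches the "no sync between steps" mode from the report.
    """
    syncs = 0
    for step in range(num_steps):
        step_fn(step)  # on a GPU this would only enqueue the step's kernels
        if sync_every > 0 and (step + 1) % sync_every == 0:
            sync_fn()  # on a GPU this would block until queued work finishes
            syncs += 1
    return syncs


# Stand-in demo: record calls instead of launching kernels.
calls = []
sync_count = run_steps(
    step_fn=lambda i: calls.append(("step", i)),
    num_steps=50,
    sync_fn=lambda: calls.append(("sync", None)),
    sync_every=10,
)
```

Shrinking sync_every from 50 toward 1 gives a rough idea of how many unsynchronized steps are needed before the fault shows up, which may help when writing a smaller reproducer.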

Example

import torch

# ...

def forward(hidden_states):
    # ...
    for (w1_packed, w1_scale_val, _g1_alphas,
         w2_packed, w2_scale_val, _g2_alphas,
         _a1_gscale, lin_global_scale) in layers:
        # ...
        w1_dequant = dequantize_to_dtype(
            tensor_fp4=w1_packed, tensor_sf=w1_scale_val,
            global_scale=_g1_alphas, dtype=hidden_states.dtype,
        )
        torch.cuda.synchronize()  # Add synchronization here
        # ...

Notes

  • The issue seems to be specific to the ROCm platform and does not occur on NVIDIA GPUs.
  • The problem may be related to the specific version of the ROCm platform or the PyTorch version being used.
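
Because the crash is reported only on ROCm, any workaround can be limited to ROCm builds. A minimal sketch, assuming the check is done through torch.version.hip, which is a version string on ROCm builds and None on CUDA builds; is_rocm_build is a hypothetical helper name, and it returns False when PyTorch is not installed.

```python
def is_rocm_build() -> bool:
    """Return True if the installed PyTorch was built against ROCm/HIP."""
    try:
        import torch
    except ImportError:
        return False  # no PyTorch installed, so there is nothing to work around
    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return getattr(torch.version, "hip", None) is not None


# Hypothetical usage: only pay the synchronization cost where it is needed.
needs_workaround = is_rocm_build()
```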

Recommendation

Apply the workaround by setting HIP_LAUNCH_BLOCKING=1 or AMD_SERIALIZE_KERNEL=3 as an environment variable to force synchronous kernel launches. This should resolve the issue until a more permanent fix can be implemented.
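
The recommendation above can be scripted by starting the workload in a child process whose environment already contains the variable, since such settings are generally read when the runtime initializes. This is a sketch under that assumption; serialized_env and run_with_workaround are hypothetical helper names, and the child command here only echoes the variable so the example runs without a GPU.

```python
import os
import subprocess
import sys


def serialized_env(variable: str = "HIP_LAUNCH_BLOCKING", value: str = "1") -> dict:
    """Return a copy of the current environment with one workaround variable set."""
    env = dict(os.environ)
    env[variable] = value
    return env


def run_with_workaround(argv, variable="HIP_LAUNCH_BLOCKING", value="1"):
    """Run argv in a child process that sees the variable from startup."""
    return subprocess.run(
        argv,
        env=serialized_env(variable, value),
        capture_output=True,
        text=True,
    )


# Demo: the child only prints the variable; replace argv with the real script.
result = run_with_workaround(
    [sys.executable, "-c", "import os; print(os.environ['AMD_SERIALIZE_KERNEL'])"],
    variable="AMD_SERIALIZE_KERNEL",
    value="3",
)
```

Forcing synchronous launches removes the overlap between host and device work, so this is best treated as a diagnostic or temporary setting rather than a permanent configuration.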
