pytorch - 💡(How to fix) Fix torch.compile not tracing streams properly [5 comments, 3 participants]

GitHub stats
pytorch/pytorch#177691 (fetched 2026-04-08 00:52:35)
Comments: 5 · Participants: 3 · Timeline events: 142 · Reactions: 0 · Assignees: none listed
Timeline (top): mentioned ×64, subscribed ×64, commented ×5, labeled ×5

Error Message

Traceback (most recent call last):
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 75, in <module>
    main()
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 67, in main
    multistream_operator(*inputs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1036, in compile_wrapper
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 28, in forward
    @torch.compile
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1210, in forward
    return compiled_fn(full_args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 582, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 785, in wrapper
    return compiled_fn(runtime_args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 989, in inner_fn
    outs = compiled_fn(args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 682, in __call__
    return self.current_callable(inputs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3444, in run
    out = model(new_inputs)
  File "/tmp/torchinductor_root/jw/cjwjaptw5j47m4zrumwv3lp72ghagh4gnm5qj2qims4pgof2vscx.py", line 117, in call
    torch.ops.streams.record_event.default(0, 1)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_ops.py", line 871, in __call__
    return self._op(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_compile.py", line 54, in inner
    return disable_fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 411, in __torch_dispatch__
    res = func(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_ops.py", line 871, in __call__
    return self._op(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 375, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_compile.py", line 54, in inner
    return disable_fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 410, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/variables/streams.py", line 128, in record_event
    stream.record_event(event)
RuntimeError: expected event to be a torch.Event object

---


Collecting environment information...
PyTorch version: 2.12.0.dev20260317+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-161-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: 
GPU models and configuration: 
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version: 580.95.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9575F 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             5008.0068
CPU min MHz:                             1500.0000
BogoMIPS:                                6589.90
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-nccl-cu13==2.29.3
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.12.0.dev20260317+cu130
[pip3] torchvision==0.26.0.dev20260317+cu130
[pip3] triton==3.6.0+git9844da95
[conda] Could not collect

🐛 Describe the bug

When I compile a PyTorch function that runs work on multiple CUDA streams and synchronizes them with CUDA events, the call fails with `RuntimeError: expected event to be a torch.Event object`. Here's the reproducer:

import torch
import torch.cuda

def _op(a: torch.Tensor) -> torch.Tensor:
    return torch.relu((a * 3)) * 2

@torch.compile(dynamic=False, fullgraph=True)
def compiled_binary_fn(a: torch.Tensor, b: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    return _op(a), _op(b)

class MultiStreamOperator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.aux_stream = torch.cuda.Stream(device=torch.device("cuda"))
        self.event_main = torch.cuda.Event()
        self.event_secondary = torch.cuda.Event()

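    @torch.compile
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Mark the point on the current stream that the aux stream must wait for.
        self.event_main.record()
        result_A = _op(a)
        with torch.cuda.stream(self.aux_stream):
            # Inside this block the aux stream is current: it waits for
            # event_main, computes result_B, then records event_secondary.
            self.event_main.wait()
            result_B = _op(b)
            self.event_secondary.record()
        # The original stream waits for the aux stream's work before returning.
        self.event_secondary.wait()
        return result_A, result_B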

def main():
    device = torch.device("cuda")
    size = 2048

    multistream_operator = MultiStreamOperator()

    inputs = [
        torch.randn(size, size, device=device),
        torch.randn(size, size, device=device),
    ]
    
    default_stream = torch.cuda.default_stream(device)
    torch.cuda.synchronize()

    # Warmup
    _op(inputs[0])
    _op(inputs[1])

    # let the cpu run ahead
    with torch.cuda.stream(default_stream):
        torch.cuda._sleep(10_000_000)  # ~10 ms

    for _ in range(3):
        with torch.cuda.nvtx.range("Single-Stream Operation"):
            compiled_binary_fn(*inputs)

    for _ in range(3):
        with torch.cuda.nvtx.range("Multi-Stream Operation"):
            multistream_operator(*inputs)

    torch.cuda.synchronize()

    print("Multi-stream compile test completed successfully.")


if __name__ == "__main__":
    main()

Here's the traceback:

Traceback (most recent call last):
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 75, in <module>
    main()
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 67, in main
    multistream_operator(*inputs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1778, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1789, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1036, in compile_wrapper
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/test_multistream_compile.py", line 28, in forward
    @torch.compile
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1210, in forward
    return compiled_fn(full_args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 582, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 138, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 785, in wrapper
    return compiled_fn(runtime_args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 989, in inner_fn
    outs = compiled_fn(args)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 682, in __call__
    return self.current_callable(inputs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_inductor/utils.py", line 3444, in run
    out = model(new_inputs)
  File "/tmp/torchinductor_root/jw/cjwjaptw5j47m4zrumwv3lp72ghagh4gnm5qj2qims4pgof2vscx.py", line 117, in call
    torch.ops.streams.record_event.default(0, 1)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_ops.py", line 871, in __call__
    return self._op(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_compile.py", line 54, in inner
    return disable_fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 411, in __torch_dispatch__
    res = func(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_ops.py", line 871, in __call__
    return self._op(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 375, in backend_impl
    result = self._backend_fns[device_type](*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_compile.py", line 54, in inner
    return disable_fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 1272, in _fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 410, in wrapped_fn
    return fn(*args, **kwargs)
  File "/root/eagle/vllm_centml_fork/.venv_pytorchnightly/lib/python3.10/site-packages/torch/_dynamo/variables/streams.py", line 128, in record_event
    stream.record_event(event)
RuntimeError: expected event to be a torch.Event object

Versions

<details><summary>Details</summary>

Collecting environment information...
PyTorch version: 2.12.0.dev20260317+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-161-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: 
GPU models and configuration: 
GPU 0: NVIDIA B200
GPU 1: NVIDIA B200
GPU 2: NVIDIA B200
GPU 3: NVIDIA B200
GPU 4: NVIDIA B200
GPU 5: NVIDIA B200
GPU 6: NVIDIA B200
GPU 7: NVIDIA B200

Nvidia driver version: 580.95.05
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9575F 64-Core Processor
CPU family:                              26
Model:                                   2
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU max MHz:                             5008.0068
CPU min MHz:                             1500.0000
BogoMIPS:                                6589.90
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d
Virtualization:                          AMD-V
L1d cache:                               6 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Versions of relevant libraries:
[pip3] numpy==2.2.6
[pip3] nvidia-cublas==13.1.0.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.19.0.56
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.0
[pip3] nvidia-nccl-cu13==2.29.3
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.12.0.dev20260317+cu130
[pip3] torchvision==0.26.0.dev20260317+cu130
[pip3] triton==3.6.0+git9844da95
[conda] Could not collect
</details>

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @Lucaskabela @jataylo

extent analysis

Fix Plan

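The traceback ends in Dynamo's stream handling (torch/_dynamo/variables/streams.py, record_event), which raises "RuntimeError: expected event to be a torch.Event object" when the compiled forward runs. On this nightly (2.12.0.dev20260317+cu130), torch.compile does not correctly handle the CUDA events used in this module. As a workaround, remove the @torch.compile decorator from the forward method of the MultiStreamOperator class so that the stream and event calls run in eager mode.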

Here are the steps to fix the issue:

  • Remove the @torch.compile decorator from the forward method.
  • If compilation is necessary, compile only the computation that does not touch streams or events (for example, the function called as _op) and keep the stream and event handling in eager code.

Example code:

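import torch

# `_op` is the operator from the issue's repro script; it is not shown in this section.

class MultiStreamOperator(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.aux_stream = torch.cuda.Stream(device=torch.device("cuda"))
        self.event_main = torch.cuda.Event()
        self.event_secondary = torch.cuda.Event()

    # No @torch.compile here: stream and event calls run in eager mode.
    def forward(self, a: torch.Tensor, b: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Mark the point on the main stream that the auxiliary stream must wait for.
        self.event_main.record()
        result_A = _op(a)
        with torch.cuda.stream(self.aux_stream):
            # Inside this block the current stream is aux_stream, so wait() and record() apply to it.
            self.event_main.wait()
            result_B = _op(b)
            self.event_secondary.record()
        # Make the main stream wait for the auxiliary work before returning.
        self.event_secondary.wait()
        return result_A, result_B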

Verification

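To verify the workaround, rerun the script (test_multistream_compile.py in the traceback). It should finish without the RuntimeError shown above. Because the work is split across two streams, also confirm that result_A and result_B match the values from a single-stream run.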
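
A run that finishes without an exception does not by itself show that the two streams were synchronized correctly, so it can help to compare the outputs against a plain single-stream reference. The sketch below is an illustration under assumptions: `check_matches_reference` is a hypothetical helper, and the tolerances are arbitrary defaults rather than values from the issue.

```python
import torch


def check_matches_reference(candidate_fn, reference_fn, inputs, rtol=1e-5, atol=1e-6):
    """Return True if candidate_fn and reference_fn agree on `inputs`.

    Raises AssertionError (from torch.testing.assert_close) on a mismatch.
    """
    got = candidate_fn(*inputs)
    expected = reference_fn(*inputs)
    if torch.cuda.is_available():
        # Wait for all queued GPU work so the comparison sees final values.
        torch.cuda.synchronize()
    for g, e in zip(got, expected):
        torch.testing.assert_close(g, e, rtol=rtol, atol=atol)
    return True
```

Here `candidate_fn` would be the multi-stream module and `reference_fn` a function that applies the same operator to both inputs sequentially on the default stream.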

Extra Tips

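  • The failure was observed on nightly 2.12.0.dev20260317+cu130. Stream and event handling under torch.compile may behave differently in other builds, so retest after upgrading before restoring the decorator.
  • torch.jit.script is sometimes suggested as an alternative, but TorchScript is no longer actively developed, so confirm that it handles your stream and event code before relying on it.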
  • Make sure to test your code thoroughly after making any changes to ensure that it works as expected.
