pytorch - ✅(Solved) Fix Incorrect timing running output code from torch.compile on non-cuda devices [1 pull requests, 2 comments, 3 participants]

GitHub stats
pytorch/pytorch#181954 (fetched 2026-05-01 05:33:05)
Comments: 2, Participants: 3, Timeline events: 51, Reactions: 1
Timeline (top): mentioned ×21, subscribed ×21, labeled ×5, commented ×2

Root Cause

The device parameter defaults to "cuda". This matters because the timed function, which print_performance calls, uses the device to synchronize so that the measured times and events are correct. If we benchmark the output code on another device such as XPU, the times are incorrect: synchronize(device) does nothing because device is still "cuda" even though the code is running on XPU.
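
To make the failure mode concrete, here is a minimal, torch-free sketch. FakeAsyncDevice and the local synchronize are hypothetical stand-ins, not PyTorch APIs: launches return immediately while the work finishes on a background thread, and synchronize is assumed to do nothing when asked for a backend other than the active one, mirroring the behavior described above.

```python
import threading
import time


class FakeAsyncDevice:
    """Hypothetical stand-in for an accelerator with asynchronous kernel launches."""

    def __init__(self, name, kernel_seconds):
        self.name = name
        self.kernel_seconds = kernel_seconds
        self._pending = []

    def launch(self):
        # Returns immediately; the "kernel" completes later on a worker thread.
        worker = threading.Thread(target=time.sleep, args=(self.kernel_seconds,))
        worker.start()
        self._pending.append(worker)

    def wait(self):
        # Block until all outstanding work has finished.
        for worker in self._pending:
            worker.join()
        self._pending.clear()


def synchronize(active_device, requested):
    # Assumption: synchronizing a backend that is not the active one does nothing.
    if requested == active_device.name:
        active_device.wait()


def timed(active_device, times, device="cuda"):
    synchronize(active_device, device)
    t0 = time.perf_counter()
    for _ in range(times):
        active_device.launch()
        synchronize(active_device, device)
    elapsed = time.perf_counter() - t0
    active_device.wait()  # drain leftover work so separate runs do not overlap
    return elapsed


xpu = FakeAsyncDevice("xpu", kernel_seconds=0.02)
wrong = timed(xpu, times=5)                # default "cuda": measures launch overhead only
right = timed(xpu, times=5, device="xpu")  # waits for every kernel to finish
print(f"default device: {wrong:.4f}s, actual device: {right:.4f}s")
```

With the default device the loop only measures how long it takes to enqueue work, so the reported time is far smaller than the real execution time.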

Fix Action


PR fix notes

PR #181957: [inductor] Pass correct device to print_performance

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #181957

Motivation

Fixes https://github.com/pytorch/pytorch/issues/181954. Pass the correct device to print_performance within benchmark_compiled_module; without this PR, the device is always "cuda".

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • test/inductor/test_kernel_benchmark.py (modified, +15/-1)
  • torch/_inductor/codegen/wrapper.py (modified, +2/-2)


🐛 Describe the bug

When running the output code from torch.compile (Inductor) on XPU, we found that timing and events are not properly captured in the benchmark. The problem is in the generated code: the benchmark_compiled_module function calls print_performance without the device parameter.

def benchmark_compiled_module(args, times=10, repeat=10):
    from torch._inductor.utils import print_performance
    fn = lambda: call(list(args))
    return print_performance(fn, times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.wrapper_benchmark import compiled_module_main
    args = get_args()
    compiled_module_main('None', lambda times, repeat: benchmark_compiled_module(args, times=times, repeat=repeat))

The device parameter defaults to "cuda". This matters because the timed function, which print_performance calls, uses the device to synchronize so that the measured times and events are correct. If we benchmark the output code on another device such as XPU, the times are incorrect: synchronize(device) does nothing because device is still "cuda" even though the code is running on XPU.

def timed(
    model: Callable[..., Any],
    example_inputs: Sequence[Any],
    times: int = 1,
    device: str = "cuda",
) -> float:
    synchronize(device)
    torch.manual_seed(1337)
    t0 = time.perf_counter()
    for _ in range(times):
        result = model(*example_inputs)
        synchronize(device)
    t1 = time.perf_counter()
    # GC the result after timing
    assert result is not None  # type: ignore[possibly-undefined]
    return t1 - t0


def print_performance(
    model: Callable[..., Any],
    example_inputs: Sequence[Any] = (),
    times: int = 10,
    repeat: int = 10,
    baseline: float = 1.0,
    device: str = "cuda",
) -> float:
    timings = torch.tensor(
        [timed(model, example_inputs, times, device) for _ in range(repeat)]
    )
    took = torch.median(timings) / times
    print(f"{took / baseline:.6f}")
    return took.item()

So we need to generate code to pass the actual device when calling print_performance from benchmark_compiled_module.
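
As a rough illustration of what "generate code to pass the actual device" could look like, the sketch below renders the benchmark helper as source text with the device type baked in. render_benchmark_fn is a hypothetical helper written for this example; it is not the code in torch/_inductor/codegen/wrapper.py, and the real generator may be structured differently.

```python
def render_benchmark_fn(device_type):
    """Return source for a benchmark helper that forwards `device_type`.

    Hypothetical illustration only; not the actual Inductor code generator.
    """
    return (
        "def benchmark_compiled_module(args, times=10, repeat=10):\n"
        "    from torch._inductor.utils import print_performance\n"
        "    fn = lambda: call(list(args))\n"
        "    return print_performance("
        f"fn, times=times, repeat=repeat, device={device_type!r})\n"
    )


source = render_benchmark_fn("xpu")
print(source)
```

The only difference from the generated code quoted above is the extra device keyword argument, so synchronize receives the backend the module was actually compiled for.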

@EikanWang

Versions

Collecting environment information...
PyTorch version: 2.12.0a0+gitbf51772
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04 LTS (x86_64)
GCC version: (Ubuntu 11.5.0-1ubuntu1~24.04.1) 11.5.0
Clang version: Could not collect
CMake version: version 4.3.1
Libc version: glibc-2.39

Python version: 3.12.3 (main, Mar 23 2026, 19:04:32) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-100-generic-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: True
XPU used to build PyTorch: 20260100
Intel GPU driver version:
  • level-zero: 1.28.2
  • intel-opencl-icd: 1:26.15.037981.20873-0embargo
Intel GPU models onboard: N/A
Intel GPU models detected:
  • [0] _XpuDeviceProperties(...)
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 2
Stepping: 0
BogoMIPS: 4389.68
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat vnmi pku ospke md_clear flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 71.5 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Indirect target selection: Vulnerable
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Not affected

Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] optree==0.19.0
[pip3] torch==2.12.0a0+gitbf51772
[pip3] triton-xxx==3.7.1+gite74175b2
[conda] Could not collect

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @gujinghui @fengyuan14 @guangyey

extent analysis

TL;DR

Pass the actual device to the print_performance function when calling it from benchmark_compiled_module to ensure accurate timing and event capture.

Guidance

  • Modify the benchmark_compiled_module function to accept and pass the device parameter to print_performance.
  • Verify that the device is passed correctly by checking the synchronize calls in timed, which receives the device forwarded by print_performance.
  • Ensure the device parameter is set to the actual device being used (e.g., "xpu") when calling benchmark_compiled_module.
  • Test the modified code to confirm accurate timing and event capture.
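
One way to check the propagation described in the guidance is to swap synchronize for a recording stub and inspect what it was called with. The sketch below is torch-free: timed and print_performance here are simplified local copies of the functions quoted in the issue (plain Python median instead of torch.median, no seeding, no printing), and the stub is a hypothetical test helper rather than anything from PyTorch.

```python
import time

sync_calls = []


def synchronize(device):
    # Recording stub (hypothetical): remembers which device was requested.
    sync_calls.append(device)


def timed(model, example_inputs, times=1, device="cuda"):
    synchronize(device)
    t0 = time.perf_counter()
    for _ in range(times):
        model(*example_inputs)
        synchronize(device)
    return time.perf_counter() - t0


def print_performance(model, example_inputs=(), times=10, repeat=10, device="cuda"):
    timings = sorted(timed(model, example_inputs, times, device) for _ in range(repeat))
    # Lower median of the repeats, converted to seconds per call.
    return timings[(len(timings) - 1) // 2] / times


print_performance(lambda: None, times=2, repeat=3, device="xpu")
devices_seen = set(sync_calls)
print(devices_seen)
```

If any entry other than the intended device shows up in devices_seen, some call site is still relying on the "cuda" default.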

Example

def benchmark_compiled_module(args, device, times=10, repeat=10):
    from torch._inductor.utils import print_performance
    fn = lambda: call(list(args))
    return print_performance(fn, times=times, repeat=repeat, device=device)
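
When editing generated code by hand, the caller still has to decide which device string to pass. A minimal sketch, assuming the benchmark arguments are tensors that expose .device.type: infer_device_type is a hypothetical helper introduced for this example, and the SimpleNamespace objects stand in for real tensors so the snippet runs without PyTorch.

```python
from types import SimpleNamespace


def infer_device_type(args, default="cpu"):
    """Return the first non-CPU device type found among `args` (hypothetical helper)."""
    for arg in args:
        device = getattr(arg, "device", None)
        device_type = getattr(device, "type", None)
        if device_type is not None and device_type != "cpu":
            return device_type
    return default


# Stand-ins for tensors; torch.Tensor objects expose `.device.type` the same way.
fake_args = [
    SimpleNamespace(device=SimpleNamespace(type="cpu")),
    SimpleNamespace(device=SimpleNamespace(type="xpu")),
    42,  # non-tensor arguments are skipped
]
device_type = infer_device_type(fake_args)
print(device_type)
```

The result could then be passed as the device argument of benchmark_compiled_module in the example above.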

Notes

The environment report shows a PyTorch 2.12.0a0 build without CUDA and with XPU available. The upstream fix (PR #181957) changes the code generator in torch/_inductor/codegen/wrapper.py, so editing already generated output by hand is only a workaround.

Recommendation

Apply the workaround by modifying the benchmark_compiled_module function to pass the actual device to print_performance, as this will ensure accurate timing and event capture for the benchmark.
