pytorch - ✅(Solved) Fix CUDA trace callbacks leak across tests in a shared process [1 pull requests, 1 participants]

GitHub stats
pytorch/pytorch#181865 · Fetched 2026-04-30 06:18:05
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Error Message

cuda trace hook execution failed: SystemError: <built-in function __import__> returned a result with an exception set
SystemError: <function NestedWrappedModule.__init__ ...> returned NULL without setting an exception

Fix Action

Fixed

PR fix notes

PR #181866: Support deactivating and reactivating GPU tracing

Description (problem / solution / changelog)

Fixes #181865

Adds explicit deactivation support for GPU tracing and uses it to prevent CUDA/XPU GPU trace callbacks from leaking into unrelated tests that run later in the same Python process.

Summary:

  • Add a C++ GPU trace deactivation path via GPUTrace::unset_trace() and torch._C._deactivate_gpu_trace().
  • Keep the first registered PyInterpreter pointer intact while using an atomic haveState flag to enable/disable GPU tracing safely; this preserves the existing first-interpreter registration semantics and allows _activate_gpu_trace() to re-enable tracing after _deactivate_gpu_trace().
  • Add callback clearing support to CallbackRegistry and expose clear_callbacks() from CUDA/XPU GPU trace modules.
  • Update test/test_cuda_trace.py to clear registered callbacks around each test and deactivate process-global GPU tracing when the test class finishes.
  • Update test/test_xpu.py::TestXpuTrace with the same cleanup pattern for parity. tearDown() clears per-test callback registrations so callbacks from one trace assertion do not affect the next one; tearDownClass() clears any remaining callbacks and deactivates the process-global GPU trace state after the trace test class is done.
  • Add a typing stub for torch._C._deactivate_gpu_trace().
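
The enable/disable behaviour in the second bullet can be illustrated with a small Python model. This is only a sketch of the semantics described above, not the C++ code in c10/core/impl/GPUTrace.cpp: the class name GpuTraceState and its method names are hypothetical, and a lock stands in for the atomic haveState flag.

```python
import threading


class GpuTraceState:
    """Simplified model of the flag-based enable/disable semantics.

    Hypothetical names; this is not the actual C++ implementation.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._interp = None       # first registered interpreter, never replaced
        self._have_state = False  # stands in for the atomic haveState flag

    def set_trace(self, interp: object) -> None:
        with self._lock:
            if self._interp is None:
                self._interp = interp  # only the first registration sticks
            self._have_state = True    # enable (or re-enable) tracing

    def unset_trace(self) -> None:
        with self._lock:
            self._have_state = False   # disable tracing but keep the pointer

    def get_trace(self):
        with self._lock:
            return self._interp if self._have_state else None


state = GpuTraceState()
first, second = object(), object()

state.set_trace(first)
state.set_trace(second)            # ignored: the first interpreter is kept
assert state.get_trace() is first

state.unset_trace()                # tracing is off, hooks would not fire
assert state.get_trace() is None

state.set_trace(second)            # tracing is back on, still the first interpreter
assert state.get_trace() is first
```

Keeping the pointer and toggling only a flag means deactivation never has to free or swap interpreter state, which is what lets a later activation simply turn tracing back on.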

XPU note: I do not have an XPU-enabled environment in this workspace (torch.xpu.is_available() and torch.xpu._is_compiled() are both false), so test/test_xpu.py::TestXpuTrace still needs to be validated by a contributor or CI job with XPU support.

Changed files

  • c10/core/impl/GPUTrace.cpp (modified, +6/-2)
  • c10/core/impl/GPUTrace.h (modified, +3/-6)
  • test/test_cuda_trace.py (modified, +22/-0)
  • test/test_xpu.py (modified, +9/-0)
  • torch/_C/__init__.pyi.in (modified, +1/-0)
  • torch/_utils.py (modified, +3/-0)
  • torch/csrc/autograd/init.cpp (modified, +1/-0)
  • torch/csrc/autograd/python_variable.cpp (modified, +4/-0)
  • torch/csrc/autograd/python_variable.h (modified, +1/-0)
  • torch/cuda/_gpu_trace.py (modified, +22/-0)
  • torch/xpu/_gpu_trace.py (modified, +22/-0)

Code Example

python -m pytest -sv test/test_cuda_trace.py \
  test/distributed/fsdp/test_fsdp_misc.py::TestFSDPMiscMultiThread::test_module_device_mismatches_device_id

---

cuda trace hook execution failed: SystemError: <built-in function __import__> returned a result with an exception set
SystemError: <function NestedWrappedModule.__init__ ...> returned NULL without setting an exception

🐛 Describe the bug

test/test_cuda_trace.py activates process-global GPU tracing and registers Python callback objects in torch.cuda._gpu_trace. The callbacks and the C++ GPU trace state are not reset after the test class finishes. When another CUDA test runs later in the same Python process, CUDA events still enter the GPU trace hook even though the later test did not opt into tracing.

This can make unrelated tests fail. One observed failure is when running CUDA trace tests followed by an FSDP multi-threaded test in the same pytest process:

python -m pytest -sv test/test_cuda_trace.py \
  test/distributed/fsdp/test_fsdp_misc.py::TestFSDPMiscMultiThread::test_module_device_mismatches_device_id

Observed error:

cuda trace hook execution failed: SystemError: <built-in function __import__> returned a result with an exception set
SystemError: <function NestedWrappedModule.__init__ ...> returned NULL without setting an exception

Expected behavior: GPU trace callbacks registered by CUDA trace tests should not affect unrelated tests that run later in the same process.

Versions

PyTorch version: 2.8.0+cu129
Is debug build: False
CUDA used to build PyTorch: 12.9
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 4.3.2
Libc version: glibc-2.39

Python version: 3.12.12 (main, Oct 28 2025, 12:10:49) [Clang 20.1.4 ] (64-bit runtime)
Python platform: Linux-5.4.0-100-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.9.41
CUDA_MODULE_LOADING set to: LAZY

Nvidia driver version: 575.51.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.10.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.10.1
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

Versions of relevant libraries:
[pip3] mypy_extensions==1.1.0
[pip3] nccl4py==0.1.1
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] nvtx==0.2.13
[pip3] optree==0.19.0
[pip3] torch==2.8.0+cu129
[pip3] torchdata==0.11.0
[pip3] torchvision==0.24.0+cu129
[pip3] triton==3.4.0
[conda] Could not collect

extent analysis

TL;DR

Clear the registered GPU trace callbacks after each trace test and deactivate process-global GPU tracing once the trace test class finishes, so that later tests in the same Python process no longer enter the trace hook.

Guidance

  • Identify the test class responsible for activating process-global GPU tracing and registering Python callback objects in torch.cuda._gpu_trace.
  • Ensure that the test class properly resets the GPU trace state and callbacks after it finishes, potentially by using a teardown method or a context manager.
  • Verify that the GPU trace callbacks are not affecting unrelated tests by running them in isolation and checking for any errors or unexpected behavior.
  • Consider using a pytest fixture or a context manager to manage the GPU trace state and callbacks, ensuring they are properly reset after each test.

Example

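import pytest
import torch
import torch.cuda._gpu_trace as gpu_trace

@pytest.fixture
def gpu_trace_session():
    # Activate process-global GPU tracing for the duration of the test.
    torch._C._activate_gpu_trace()
    try:
        yield gpu_trace
    finally:
        # Both cleanup calls are the ones PR #181866 describes adding; they
        # are not available in builds that predate that change, and the
        # zero-argument form of clear_callbacks() is an assumption.
        gpu_trace.clear_callbacks()
        torch._C._deactivate_gpu_trace()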
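
For unittest-style test classes, the same cleanup can be split between tearDown() and tearDownClass(), which is the pattern the PR notes describe for test/test_cuda_trace.py. The sketch below uses a hypothetical _FakeGpuTrace stand-in so that it runs without a GPU build; in real tests the stand-in calls would presumably map to the tracing and callback-clearing functions named in the PR.

```python
import unittest
from unittest import mock


class _FakeGpuTrace:
    """Hypothetical stand-in for process-global GPU trace state."""

    def __init__(self) -> None:
        self.active = False
        self.callbacks = []

    def activate(self) -> None:
        self.active = True

    def deactivate(self) -> None:
        self.active = False

    def register(self, cb) -> None:
        self.callbacks.append(cb)

    def clear_callbacks(self) -> None:
        self.callbacks.clear()

    def fire(self, *args) -> None:
        if self.active:
            for cb in list(self.callbacks):
                cb(*args)


gpu_trace = _FakeGpuTrace()


class TraceTests(unittest.TestCase):
    def setUp(self) -> None:
        gpu_trace.activate()
        self.mock = mock.MagicMock()

    def tearDown(self) -> None:
        # Per-test cleanup: callbacks from one test must not reach the next.
        gpu_trace.clear_callbacks()

    @classmethod
    def tearDownClass(cls) -> None:
        # Class-level cleanup: drop leftovers and switch tracing off.
        gpu_trace.clear_callbacks()
        gpu_trace.deactivate()

    def test_first(self) -> None:
        gpu_trace.register(self.mock)
        gpu_trace.fire(1)
        self.mock.assert_called_once_with(1)

    def test_second(self) -> None:
        gpu_trace.register(self.mock)
        self.assertEqual(len(gpu_trace.callbacks), 1)  # nothing left over
        gpu_trace.fire(2)
        self.mock.assert_called_once_with(2)


class UnrelatedTests(unittest.TestCase):
    def test_no_trace(self) -> None:
        # Runs later in the same process and sees no tracing state.
        self.assertFalse(gpu_trace.active)
        self.assertEqual(gpu_trace.callbacks, [])


def run_suite() -> unittest.TestResult:
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite(
        [
            loader.loadTestsFromTestCase(TraceTests),
            loader.loadTestsFromTestCase(UnrelatedTests),
        ]
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

Splitting the cleanup this way keeps tracing active across the whole trace class while still isolating individual tests from each other's callbacks.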

Notes

The cleanup calls available depend on the PyTorch build. PR #181866 describes adding clear_callbacks() to the CUDA/XPU GPU trace modules and torch._C._deactivate_gpu_trace(); builds without that change may need further investigation to reset the GPU trace state and callbacks.

Recommendation

Apply a workaround by resetting the GPU trace state and callbacks after each CUDA test class finishes, as described in the guidance section. This should help prevent unrelated tests from failing due to lingering GPU trace callbacks.
