pytorch/pytorch: [RFC] Adding Kernel-Level Profiling Support for PrivateUse1 Backends via Kineto Plugin [1 pull request, 6 comments, 5 participants]

GitHub stats
pytorch/pytorch#177978 (fetched 2026-04-08 01:07:44)
Comments: 6
Participants: 5
Timeline events: 65
Reactions: 0
Timeline (top): subscribed ×29, mentioned ×27, commented ×6, labeled ×2

This RFC adds kernel-level profiling support for PrivateUse1 backends through a Kineto plugin (PU1PTI). It delivers a vendor-facing C API (PrivateUse1TracingApi), dynamic plugin loading so vendors can integrate without patching Kineto, an OpenReg reference implementation, and vendor-facing documentation.



PR fix notes

PR #172154: privateuse1 backend integration with kineto

Description (problem / solution / changelog)

  1. Created privateuse1_profiler.h/.cpp: a registry that lets PrivateUse1 backends register IActivityProfiler factories via the REGISTER_PRIVATEUSE1_PROFILER(MyProfiler) macro, with a compile-time static_assert ensuring the class inherits from libkineto::IActivityProfiler.
    • This assumes that backends will depend on Kineto in order to use the IActivityProfiler interface. Today backends have to check their implementation into Kineto itself, so this is arguably a step up and a safe assumption.
    • As an alternative, PyTorch could define its own abstract interface that mirrors IActivityProfiler, then internally forward to Kineto.
  2. Kineto init paths: added onKinetoInit() calls in kineto_shim.cpp (user-triggered profiling via prepareTrace()) but not in kineto_client_interface.cpp (daemon mode via global_kineto_init()), with guards to ensure Kineto is initialized before forwarding.
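
The registry pattern in item 1 can be sketched roughly as follows. This is not the code from PR #172154: IActivityProfilerLike stands in for libkineto::IActivityProfiler, and the registry class, the RegisterSketch helper, and the REGISTER_PROFILER_SKETCH macro are hypothetical names chosen for illustration. The sketch only shows how a static_assert can reject classes that do not inherit from the interface while a static object performs the registration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Stand-in for libkineto::IActivityProfiler (hypothetical, reduced to one method).
struct IActivityProfilerLike {
  virtual ~IActivityProfilerLike() = default;
  virtual std::string name() const = 0;
};

using ProfilerFactory = std::function<std::unique_ptr<IActivityProfilerLike>()>;

// Process-local registry of profiler factories (hypothetical name).
class ProfilerRegistrySketch {
 public:
  static ProfilerRegistrySketch& instance() {
    static ProfilerRegistrySketch registry;  // constructed on first use
    return registry;
  }
  void add(ProfilerFactory factory) {
    std::lock_guard<std::mutex> lock(mutex_);
    factories_.push_back(std::move(factory));
  }
  // Instantiates one profiler per registered factory.
  std::vector<std::unique_ptr<IActivityProfilerLike>> createAll() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<std::unique_ptr<IActivityProfilerLike>> out;
    for (const auto& factory : factories_) out.push_back(factory());
    return out;
  }

 private:
  std::mutex mutex_;
  std::vector<ProfilerFactory> factories_;
};

// Registers a factory for T at static-initialization time.
template <typename T>
struct RegisterSketch {
  static_assert(std::is_base_of<IActivityProfilerLike, T>::value,
                "registered class must inherit from the profiler interface");
  RegisterSketch() {
    ProfilerRegistrySketch::instance().add(
        [] { return std::unique_ptr<IActivityProfilerLike>(new T()); });
  }
};

#define REGISTER_PROFILER_SKETCH(Cls) \
  static RegisterSketch<Cls> register_sketch_##Cls

// A backend-side profiler and its one-line registration.
struct MyProfiler : IActivityProfilerLike {
  std::string name() const override { return "MyProfiler"; }
};
REGISTER_PROFILER_SKETCH(MyProfiler);
```

Registering a class that does not derive from the interface fails at compile time, which is the property the PR description attributes to its static_assert.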

TODO

  1. [Done] Gate this behind a new ProfilerState::KINETO_PRIVATEUSE1 check
  2. [Done] Check how (if at all) Kineto build args need to change. Mostly they shouldn't, since PrivateUse1 won't need CUDA/ROCm/XPU etc.
  3. [Done] How does this break kineto's fbcode setup? Not applicable

Changed files

  • .ci/pytorch/test.sh (modified, +13/-0)
  • build_variables.bzl (modified, +1/-0)
  • caffe2/CMakeLists.txt (modified, +1/-0)
  • test/cpp/profiler/CMakeLists.txt (added, +25/-0)
  • test/cpp/profiler/test_privateuse1_profiler.cpp (added, +181/-0)
  • test/profiler/test_profiler.py (modified, +56/-0)
  • torch/_C/_profiler.pyi (modified, +1/-0)
  • torch/autograd/profiler.py (modified, +8/-7)
  • torch/csrc/autograd/profiler_kineto.cpp (modified, +14/-1)
  • torch/csrc/profiler/orchestration/observer.h (modified, +1/-0)
  • torch/csrc/profiler/python/init.cpp (modified, +2/-1)
  • torch/csrc/profiler/standalone/privateuse1_profiler.cpp (added, +73/-0)
  • torch/csrc/profiler/standalone/privateuse1_profiler.h (added, +113/-0)


🚀 The feature, motivation and pitch


Motivation

OpenReg is PyTorch's reference implementation for out-of-tree hardware backends. It demonstrates the device model, operator dispatch, memory management, streams, autograd, and more. However, a vendor that wants kernel-level profiling (individual kernels, runtime/driver events, memory operations, CPU-device correlation) has no reference showing how an out-of-tree accelerator can integrate with Kineto's profiling infrastructure.

Today, the only references for Kineto profiling integration are CUPTI (CUDA), XPUPTI (XPU), and AIUPTI (AIU). All three are tightly coupled to their respective vendor libraries, compiled into the Kineto source tree via #ifdef, and not designed as integration guides for external vendors.

At present, PrivateUse1 backends face the following gaps:

  • No IActivityProfiler implementation for PrivateUse1. Activity types (PRIVATEUSE1_RUNTIME, PRIVATEUSE1_DRIVER, CONCURRENT_KERNEL, GPU_MEMCPY, GPU_MEMSET) are defined in Kineto but nothing populates them.

  • No vendor-facing API. XPU/AIU use proprietary vendor APIs (XPUPTI, AIUPTI) directly — there is no generic, documented C API for PrivateUse1.

  • No dynamic loading mechanism. XPU/AIU use compile-time #ifdef registration in init.cpp; vendors must modify Kineto source to add a new backend.

  • No documentation for the IActivityProfiler contract, buffer protocol, or correlation flow from an external vendor's perspective.

  • No reference implementation. OpenReg only implements ProfilerStubs for fallback mode (KINETO_PRIVATEUSE1_FALLBACK), providing operator-level timing only.

Impact: vendors must reverse-engineer Kineto internals, leading to high integration cost, fragile implementations, and limited adoption.

This proposal addresses the gap by introducing a vendor-facing C API, a Kineto plugin that implements IActivityProfiler for PrivateUse1, dynamic plugin loading, and OpenReg as the reference implementation. Critically, this enables any out-of-tree accelerator vendor to integrate kernel-level profiling with Kineto without patching Kineto itself — vendors ship their own .so implementing the stable C API, and the plugin loader discovers it at runtime. This makes PrivateUse1 profiling truly seamless and out-of-tree, matching the same zero-patch philosophy that PrivateUse1 provides for device registration, operator dispatch, and memory management.

Goals:

  1. PU1PTI plugin: A minimal IActivityProfiler implementation in Kineto that produces CONCURRENT_KERNEL, GPU_MEMCPY, GPU_MEMSET, PRIVATEUSE1_RUNTIME, and PRIVATEUSE1_DRIVER activities. Registered via registerProfilerFactory().

  2. Vendor-facing C API (PrivateUse1TracingApi.h): A small, stable, ABI-versioned C API that vendors implement in their .so. Functions: enable/disable, buffer callbacks, correlation push/pop, flush, record iteration, version check.

  3. Dynamic plugin loading: Kineto discovers and loads vendor .so at runtime (KINETO_PLUGIN_PATH or standard paths) via dlopen/dlsym. No Kineto source changes per vendor.

  4. OpenReg reference implementation: Minimal working implementation of PrivateUse1TracingApi in OpenReg demonstrating how vendors integrate kernel-level profiling.

  5. Documentation: Vendor guide covering C API, buffer protocol, correlation, plugin lifecycle, and troubleshooting.

Non-goals: Production-grade plugin features, performance metrics (FLOPs, occupancy), advanced analysis (roofline, SASS), automatic profiler detection API, vendor-specific optimizations.

Proposed Implementation

Detailed design document: openreg_profiling_proposal contains the full implementation specification.

Architecture Overview

torch.profiler.profile(activities=[PrivateUse1])
Kineto Core (ActivityProfilerProxy → Controller → CuptiActivityProfiler)
        ├── addChildActivityProfiler()
PrivateUse1ActivityProfiler (implements IActivityProfiler)
        ├── PrivateUse1ActivityProfilerSession (start/stop/processTrace)
        ├── PrivateUse1ActivityApi (C++ wrapper, calls vendor C API)
        └── PrivateUse1PluginLoader (discovers & loads vendor .so)
Vendor .so (e.g. libopenreg_tracing.so)
        ├── Implements PrivateUse1TracingApi C functions
        ├── Kernel launch tracking
        ├── Runtime call tracking
        └── Memory operation tracking

The implementation has five components:

1. PrivateUse1TracingApi.h (C API header)

Defines the contract between Kineto and vendor implementations. Equivalent to what cupti.h is for CUDA, but open, Kineto-defined, and simplified:

  • pu1TracingEnable(activityKind) / pu1TracingDisable(activityKind) — activity control
  • pu1TracingSetCallbacks(requestFn, completeFn) — buffer management (producer-consumer pattern; completeFn omits ctx/streamId for simplicity)
  • pu1PushExternalCorrelationId(id) / pu1PopExternalCorrelationId() — single correlation stack (unlike CUDA's dual CUSTOM0/CUSTOM1)
  • pu1FlushAll() / pu1GetNextRecord(buffer, size, &record) — data retrieval
  • pu1GetVersion() — ABI compatibility
  • pu1ActivityRecord — one unified struct (kind, start_time, end_time, device_id, stream_id, correlation_id, name, metadata) for all activity types. Intentionally simpler than CUPTI's many per-type structs.
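
The producer-consumer buffer hand-off in the list above can be sketched as below. The RFC does not spell out the callback signatures, so RequestFn, CompleteFn, RecordSketch, and the explicit cursor argument to getNextRecord are all assumptions made for illustration; the real pu1* functions may look different. The vendor side asks the consumer for a buffer, appends fixed-size records, and hands the buffer back when it is full or on flush.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified record (hypothetical; the RFC's pu1ActivityRecord has more fields).
struct RecordSketch {
  int kind;
  uint64_t start_ns;
  uint64_t end_ns;
  int correlation_id;
};

// Assumed callback signatures, not taken from the RFC.
using RequestFn = void (*)(uint8_t** buffer, size_t* size);
using CompleteFn = void (*)(uint8_t* buffer, size_t validSize);

namespace vendor {  // producer: lives in the vendor .so
RequestFn g_request = nullptr;
CompleteFn g_complete = nullptr;
uint8_t* g_buf = nullptr;
size_t g_cap = 0;
size_t g_used = 0;

void setCallbacks(RequestFn request, CompleteFn complete) {
  g_request = request;
  g_complete = complete;
}

// Hands the current buffer (if any) back to the consumer.
void flushAll() {
  if (g_buf && g_complete) g_complete(g_buf, g_used);
  g_buf = nullptr;
  g_cap = g_used = 0;
}

// Appends one record, requesting a fresh buffer when needed.
void emit(const RecordSketch& rec) {
  if (!g_request) return;  // callbacks not installed: tracing is off
  if (g_buf && g_used + sizeof(rec) > g_cap) flushAll();
  if (!g_buf) {
    g_request(&g_buf, &g_cap);
    g_used = 0;
  }
  if (!g_buf || g_cap < sizeof(rec)) return;  // no usable space: drop record
  std::memcpy(g_buf + g_used, &rec, sizeof(rec));
  g_used += sizeof(rec);
}

// Copies out the record at *cursor and advances it; false when exhausted.
bool getNextRecord(const uint8_t* buffer, size_t validSize, size_t* cursor,
                   RecordSketch* out) {
  if (*cursor + sizeof(RecordSketch) > validSize) return false;
  std::memcpy(out, buffer + *cursor, sizeof(RecordSketch));
  *cursor += sizeof(RecordSketch);
  return true;
}
}  // namespace vendor

namespace consumer {  // consumer: the role the Kineto plugin would play
std::vector<RecordSketch> g_collected;

void onRequest(uint8_t** buffer, size_t* size) {
  *size = 2 * sizeof(RecordSketch);  // tiny on purpose to force a hand-over
  *buffer = new uint8_t[*size];
}

void onComplete(uint8_t* buffer, size_t validSize) {
  size_t cursor = 0;
  RecordSketch rec;
  while (vendor::getNextRecord(buffer, validSize, &cursor, &rec)) {
    g_collected.push_back(rec);
  }
  delete[] buffer;  // the consumer owns the buffers it handed out
}
}  // namespace consumer
```

Emitting three records with a two-record buffer triggers one hand-over when the buffer fills and a second one on the final flush, so ownership of each buffer always returns to the side that allocated it.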

2. PU1PTI Kineto Plugin

Implements IActivityProfiler and IActivityProfilerSession inside Kineto. Vendors do not implement these C++ interfaces — PU1PTI does it once and delegates to the vendor's C API. Components:

  • PrivateUse1ActivityProfiler — factory, returns sessions
  • PrivateUse1ActivityProfilerSession — start/stop, processTrace, correlation
  • PrivateUse1ActivityApi — calls vendor C functions via function pointers from loaded .so
  • PrivateUse1ActivityHandlers — converts pu1ActivityRecord → Kineto GenericTraceActivity
  • PrivateUse1ActivityBuffer — RAII buffer wrapper
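
As an illustration of the handler step (record in, trace activity out), here is a minimal sketch. It does not use Kineto's real GenericTraceActivity; TraceActivitySketch is a hypothetical stand-in with only a few plausible fields, and Pu1RecordSketch is a reduced version of the unified record described in the RFC. The activity type strings reuse the names listed in the Goals section.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

enum class Pu1KindSketch { Runtime, Driver, Kernel, Memcpy, Memset };

// Reduced form of the unified vendor record (hypothetical layout).
struct Pu1RecordSketch {
  Pu1KindSketch kind;
  uint64_t start_ns;
  uint64_t end_ns;
  int device_id;
  int stream_id;
  uint64_t correlation_id;
  const char* name;
};

// Hypothetical stand-in for a Kineto-side trace activity.
struct TraceActivitySketch {
  std::string type;
  std::string name;
  uint64_t start_ns = 0;
  uint64_t duration_ns = 0;
  int device = 0;
  int resource = 0;  // the stream the activity ran on
  uint64_t correlation = 0;
};

// Maps a record kind to the activity type name used in the RFC.
const char* activityTypeName(Pu1KindSketch kind) {
  switch (kind) {
    case Pu1KindSketch::Runtime: return "PRIVATEUSE1_RUNTIME";
    case Pu1KindSketch::Driver:  return "PRIVATEUSE1_DRIVER";
    case Pu1KindSketch::Kernel:  return "CONCURRENT_KERNEL";
    case Pu1KindSketch::Memcpy:  return "GPU_MEMCPY";
    case Pu1KindSketch::Memset:  return "GPU_MEMSET";
  }
  return "UNKNOWN";
}

// Converts one record; returns false for records with end < start.
bool toTraceActivity(const Pu1RecordSketch& rec, TraceActivitySketch* out) {
  if (rec.end_ns < rec.start_ns) return false;  // malformed timestamps
  out->type = activityTypeName(rec.kind);
  out->name = rec.name ? rec.name : "";
  out->start_ns = rec.start_ns;
  out->duration_ns = rec.end_ns - rec.start_ns;
  out->device = rec.device_id;
  out->resource = rec.stream_id;
  out->correlation = rec.correlation_id;
  return true;
}
```

Rejecting malformed records at this boundary is one possible design choice; it keeps vendor bugs from producing negative durations in the trace.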

3. Dynamic Plugin Loader

PrivateUse1PluginLoader discovers and loads vendor .so at Kineto initialization:

  • Discovery: KINETO_PLUGIN_PATH (colon-separated) → standard paths (/usr/lib/kineto/plugins/, /usr/local/lib/kineto/plugins/, ~/.local/lib/kineto/plugins/)
  • Loading: dlopen → dlsym("kineto_register_privateuse1_plugin") → pu1GetVersion() ABI check → call registration
  • Error handling: env unset → no session (zero overhead); dlopen/dlsym failure → log warning, return nullptr; ABI mismatch → log error, return nullptr
  • Vendor .so loaded once, remains loaded for process lifetime (no hot-reload)
  • Registration in init.cpp under #ifdef HAS_PRIVATEUSE1_PROFILER
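
The discovery and loading steps above might look roughly like this on a POSIX system. dlopen, dlsym, dlerror, and dlclose are the standard <dlfcn.h> calls. The symbol names come from the RFC, while the function-pointer signatures, the kExpectedAbiVersion value, and the helper names splitPathList and tryLoadPlugin are assumptions made for illustration.

```cpp
#include <dlfcn.h>

#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Assumed signatures; the RFC names the symbols but not their types.
using GetVersionFn = int (*)();
using RegisterFn = void (*)();

constexpr int kExpectedAbiVersion = 1;  // hypothetical ABI version

// Splits a colon-separated list (e.g. KINETO_PLUGIN_PATH), skipping empties.
std::vector<std::string> splitPathList(const std::string& value) {
  std::vector<std::string> out;
  size_t start = 0;
  while (start <= value.size()) {
    size_t end = value.find(':', start);
    if (end == std::string::npos) end = value.size();
    if (end > start) out.push_back(value.substr(start, end - start));
    start = end + 1;
  }
  return out;
}

// Loads one candidate .so. Returns the handle on success, nullptr otherwise.
void* tryLoadPlugin(const std::string& path) {
  void* handle = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    const char* err = dlerror();
    std::fprintf(stderr, "warning: dlopen(%s) failed: %s\n", path.c_str(),
                 err ? err : "unknown error");
    return nullptr;
  }
  auto registerPlugin = reinterpret_cast<RegisterFn>(
      dlsym(handle, "kineto_register_privateuse1_plugin"));
  auto getVersion =
      reinterpret_cast<GetVersionFn>(dlsym(handle, "pu1GetVersion"));
  if (!registerPlugin || !getVersion) {
    std::fprintf(stderr, "warning: %s lacks required symbols\n", path.c_str());
    dlclose(handle);
    return nullptr;
  }
  if (getVersion() != kExpectedAbiVersion) {
    std::fprintf(stderr, "error: %s has an incompatible ABI version\n",
                 path.c_str());
    dlclose(handle);
    return nullptr;
  }
  registerPlugin();
  return handle;  // intentionally never closed: loaded for process lifetime
}
```

A caller would read KINETO_PLUGIN_PATH with std::getenv, pass it through splitPathList, and try each candidate in order. Whether the standard paths are searched when the variable is unset is left to that caller, since the RFC also states that an unset variable means no session.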

Why not use REGISTER_PRIVATEUSE1_PROFILER (PR #172154) only? That macro provides process-local registration (consistent with other PrivateUse1 components) but requires the vendor to implement the full IActivityProfiler C++ interface, depend on Kineto's C++ ABI, and be linked into the process. The dynamic loader + C API gives vendors a simpler, ABI-stable surface without Kineto build dependency.

4. OpenReg Reference Implementation

Minimal working implementation of the C API in OpenReg:

  • openreg_tracing.h / tracing.cpp — implements all C API functions
  • Kernel launch hooks — records kernel name, timestamps, correlation
  • Runtime hooks — records runtime API calls
  • Memory hooks — records memcpy/memset
  • Builds as .so, loadable by the plugin loader
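
A kernel launch hook of the kind listed above could be sketched as follows. Everything here is hypothetical: the function names, the per-thread correlation stack, and the in-memory record list are illustrative choices, not OpenReg code. The point is only that the hook stamps each kernel record with the correlation ID currently on top of the single stack, which is what ties a device activity back to the CPU-side operator.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct KernelRecordSketch {
  std::string name;
  uint64_t start_ns = 0;
  uint64_t end_ns = 0;
  uint64_t correlation_id = 0;  // 0 means "no correlation active"
};

// Single correlation stack per thread (the RFC describes one stack, not two).
thread_local std::vector<uint64_t> t_correlationStack;

// Collected records; a real implementation would write into tracing buffers
// and would need synchronization. This sketch is single-threaded.
std::vector<KernelRecordSketch> g_kernelRecords;

void pushExternalCorrelationId(uint64_t id) { t_correlationStack.push_back(id); }

void popExternalCorrelationId() {
  if (!t_correlationStack.empty()) t_correlationStack.pop_back();
}

uint64_t nowNs() {
  using namespace std::chrono;
  return static_cast<uint64_t>(
      duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
          .count());
}

// Wraps a kernel launch: records name, timestamps, and active correlation.
template <typename Kernel>
void launchKernelTraced(const std::string& name, Kernel&& kernel) {
  KernelRecordSketch rec;
  rec.name = name;
  rec.correlation_id =
      t_correlationStack.empty() ? 0 : t_correlationStack.back();
  rec.start_ns = nowNs();
  std::forward<Kernel>(kernel)();  // run the (simulated) kernel
  rec.end_ns = nowNs();
  g_kernelRecords.push_back(std::move(rec));
}
```

In use, the profiler side would push an ID before dispatching an operator and pop it afterwards; any kernel launched in between inherits that ID, and kernels launched outside such a scope get 0.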

OpenReg is a reference for the contract and integration pattern. It does not need to run on every vendor's hardware. Other PrivateUse1 vendors (e.g. custom NPUs, OEM stacks, in-house accelerators) reimplement the same C API on their own stack.

5. Documentation

Vendor integration guide covering: C API contract, buffer protocol, correlation mechanism, plugin lifecycle, implementation steps, dynamic loading, troubleshooting, and OpenReg walkthrough.

Drawbacks

  • Extra indirection layer (PU1PTI + C API) between Kineto and vendor — negligible runtime cost but more code to maintain in Kineto
  • C API must be designed correctly upfront; changes require ABI version bump
  • Dynamic loading adds complexity (dlopen, ABI versioning, error handling)
  • Single-process scope initially; no multi-process plugin coordination

Alternatives

  • Vendors implement IActivityProfiler directly (PR #172154 path): Lower Kineto maintenance but each vendor re-implements boilerplate, depends on Kineto C++ ABI, and must be linked into the process.
  • Compile-time registration (like XPU/AIU): Simpler loading but requires vendor code in Kineto tree; not suitable for out-of-tree backends.
  • PyTorch-level profiling only (ProfilerStubs fallback): Already exists but limited to operator-level timing; no kernel visibility.

Additional context

  • Related PR: pytorch/pytorch#172154 — adds REGISTER_PRIVATEUSE1_PROFILER macro for process-local registration of PrivateUse1 profilers with Kineto. This RFC builds on that by adding the C API layer, dynamic loading, and reference implementation.
  • Kineto XPU plugin: third_party/kineto/libkineto/src/plugin/xpupti/
  • Kineto AIU plugin: third_party/kineto/libkineto/src/plugin/aiupti/
  • IActivityProfiler interface: third_party/kineto/libkineto/include/IActivityProfiler.h

cc @robieta @chaekit @guotuofeng @guyang3532 @dzhulgakov @davidberard98 @briancoutinho @sraikund16 @sanrise @mwootton @divyanshk @jiannanWang @scotts @ryanzhang22 @mikaylagawarecki @malfet

extent analysis

Fix Plan

To implement kernel-level profiling support for PrivateUse1 backends, follow these steps:

  1. Implement the PrivateUse1TracingApi C API:

    • Define the PrivateUse1TracingApi.h header with the required functions:
      • pu1TracingEnable(activityKind)
      • pu1TracingDisable(activityKind)
      • pu1TracingSetCallbacks(requestFn, completeFn)
      • pu1PushExternalCorrelationId(id)
      • pu1PopExternalCorrelationId()
      • pu1FlushAll()
      • pu1GetNextRecord(buffer, size, &record)
      • pu1GetVersion()
    • Implement these functions in the vendor's .so file.
  2. Create the PU1PTI Kineto Plugin:

    • Implement IActivityProfiler and IActivityProfilerSession in Kineto.
    • Create PrivateUse1ActivityProfiler, PrivateUse1ActivityProfilerSession, PrivateUse1ActivityApi, PrivateUse1ActivityHandlers, and PrivateUse1ActivityBuffer classes.
    • Use dlopen and dlsym to load the vendor's .so file.
  3. Implement Dynamic Plugin Loading:

    • Create PrivateUse1PluginLoader to discover and load vendor .so files at runtime.
    • Use KINETO_PLUGIN_PATH and standard paths to find the .so files.
  4. Implement OpenReg Reference Implementation:

    • Create a minimal working implementation of the C API in OpenReg.
    • Implement kernel launch hooks, runtime hooks, and memory hooks.
  5. Document the Vendor Integration Guide:

    • Write a guide covering the C API contract, buffer protocol, correlation mechanism, plugin lifecycle, implementation steps, dynamic loading, and troubleshooting.

Example Code

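// PrivateUse1TracingApi.h
// Illustrative sketch only: the RFC lists these functions but does not fix
// their exact signatures, so parameter and return types here are assumptions.
#ifndef PRIVATEUSE1_TRACING_API_H
#define PRIVATEUSE1_TRACING_API_H

#include <stdint.h>  /* uint64_t */

#ifdef __cplusplus
extern "C" {
#endif

/* Activity kinds the PU1PTI plugin is expected to populate. */
typedef enum {
    PRIVATEUSE1_RUNTIME,
    PRIVATEUSE1_DRIVER,
    CONCURRENT_KERNEL,
    GPU_MEMCPY,
    GPU_MEMSET
} ActivityKind;

/* One unified record for all activity types. */
typedef struct {
    ActivityKind kind;
    uint64_t start_time;
    uint64_t end_time;
    int device_id;
    int stream_id;
    int correlation_id;
    const char* name;
    const char* metadata;
} pu1ActivityRecord;

void pu1TracingEnable(ActivityKind activityKind);
void pu1TracingDisable(ActivityKind activityKind);
/* Callback types are left opaque here; the RFC does not define them. */
void pu1TracingSetCallbacks(void* requestFn, void* completeFn);
void pu1PushExternalCorrelationId(int id);
void pu1PopExternalCorrelationId(void);
void pu1FlushAll(void);
int pu1GetNextRecord(void* buffer, int size, pu1ActivityRecord* record);
int pu1GetVersion(void);  /* ABI compatibility check; return type assumed */

#ifdef __cplusplus
}  /* extern "C" */
#endif

#endif /* PRIVATEUSE1_TRACING_API_H */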
