pytorch: [RFC] Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator [1 pull request, 2 comments, 2 participants]

pytorch · 2026-04-22 23:31:05


GitHub stats

pytorch/pytorch#181191 • Fetched 2026-04-23 07:22:02


Author

weifengpy

Participants

ngimel

weifengpy

Timeline (top)

subscribed ×25 · mentioned ×24 · labeled ×6 · unsubscribed ×6

Fix Action

Proposed (PR open, not merged)

Proposed fix PR: Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator (https://github.com/pytorch/pytorch/pull/181189)

PR fix notes

PR #181189: Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator

Repository: pytorch/pytorch
Author: weifengpy
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/181189

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

-> #181189

proposal: https://github.com/pytorch/pytorch/issues/181191

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @azahed98

Changed files

agent_space/record_use_design_doc.md (added, +168/-0)
aten/src/ATen/native/cuda/RecordUse.cu (added, +19/-0)
aten/src/ATen/native/native_functions.yaml (modified, +5/-0)
c10/core/CachingDeviceAllocator.h (modified, +10/-0)
c10/cuda/CUDACachingAllocator.cpp (modified, +115/-2)
c10/cuda/CUDACachingAllocator.h (modified, +16/-0)
test/test_cuda.py (modified, +168/-0)
tools/autograd/gen_variable_type.py (modified, +1/-0)
torch/_dynamo/variables/streams.py (modified, +16/-0)
torch/_dynamo/variables/tensor.py (modified, +19/-0)
torch/_tensor_docs.py (modified, +66/-0)
torch/nested/_internal/ops.py (modified, +11/-0)
torch/overrides.py (modified, +1/-0)
torchgen/gen_functionalization_type.py (modified, +1/-0)
torchgen/native_function_generation.py (modified, +1/-0)


`Tensor.record_use`: precise cross-stream lifetime for the CUDA caching allocator

Prototype: https://github.com/pytorch/pytorch/pull/181189

Motivation

The CUDA caching allocator tracks only each block's allocation stream; cross-stream reads are invisible to it. FSDP2 bridges this with a five-step recipe, repeated in three subsystems (all-gather output, reduce-scatter input, all-reduce output) at ~40–80 LOC each:

# Producer on stream A (compute), consumer on stream B (comm).
stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)                 # e.g. reduce_scatter
consumer_event = stream_B.record_event()

# Keep a Python ref alive past this scope — the allocator would
# otherwise reclaim x's block before consumer_event fires.
stash[slot] = x

# ...later, after some path has waited for consumer_event...
with torch.cuda.stream(stream_B):          # critical: drop inside the
    stash[slot] = None                     # consumer's stream context

Three invariants make it work:

1. `consumer_event` is recorded before any unrelated work queues on `stream_B`. Delay it and the event captures too much, over-reserving memory.
2. `stash[slot]` stays alive exactly as long as `consumer_event` is pending. Shorter → use-after-free; longer → O(N_layers) memory leak.
3. The drop runs inside `with torch.cuda.stream(stream_B):`. The allocator attributes each free to the current stream at the moment of deletion, and only `stream_B`'s FIFO has absorbed `consumer_event`; drop without the wrapper and a later allocation can reuse the block while the consumer is still reading.

FSDP2 PRs #140044, #179443, and #180666 each traced to one of these going wrong in a new code path. The same recipe, with its own helper class, is reimplemented in activation-offloading hooks, non-FSDP collective libs, and user multi-stream code via cpp_extension.

The caching allocator already has the event-polling machinery (cuda_events, event_count, process_events()). The only user-facing entry into it today, Tensor.record_stream(stream), records at block-free time rather than at consumer-done — so production code doesn't use it and hand-rolls the recipe instead.

Proposal: `Tensor.record_use(stream)`

The same scene from the Motivation section, rewritten with record_use:

stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)
    x.record_use(stream_B)           # ← allocator-visible barrier
del x                                # any stream, any thread, any time

The consumer_event = stream_B.record_event(), the stash[slot] = x, the deferred with torch.cuda.stream(stream_B): stash[slot] = None, and the subsystem-specific NamedTuple that glued them together all go away. Drop order, drop site, and drop stream stop mattering.

record_use records a fresh CUDA event on stream at the point of the call and attaches it to the tensor's allocation block; the caching allocator will not reuse the block until every attached event has fired. Precise semantics:

Event recorded now. The allocator calls cudaEventRecord on stream at the call site. The caller is expected to place the call right after their consumer's last read of the tensor on stream.
No-op on allocation stream. record_use(tensor.alloc_stream) is a no-op — the allocation stream's FIFO already orders the read and the next allocation.
Accumulates. Multiple record_use calls attach multiple events. The block waits for all of them.
Composes with record_stream. A block may have both precise use_events and imprecise stream_uses. Both gate the free.
Thread-safe. Takes the same allocator mutex as record_stream.
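The semantics listed above can be sketched with a pure-Python toy model. This is not the allocator code from the PR: `ToyStream`, `ToyEvent`, and `ToyBlock` are hypothetical stand-ins, and "GPU progress" is simulated by a counter. The sketch only illustrates three of the rules (event recorded at the call, no-op on the allocation stream, accumulation across calls).

```python
from dataclasses import dataclass, field


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: work completes in FIFO order."""

    def __init__(self, name):
        self.name = name
        self.enqueued = 0   # work items submitted so far
        self.completed = 0  # work items the simulated GPU has finished

    def launch(self):
        self.enqueued += 1

    def record_event(self):
        # The event fires once all work enqueued before it has completed.
        return ToyEvent(self, self.enqueued)

    def run(self, n=1):
        self.completed = min(self.enqueued, self.completed + n)


@dataclass(frozen=True)
class ToyEvent:
    stream: ToyStream
    position: int

    def fired(self):
        return self.stream.completed >= self.position


@dataclass
class ToyBlock:
    alloc_stream: ToyStream
    use_events: list = field(default_factory=list)

    def record_use(self, stream):
        if stream is self.alloc_stream:
            return  # no-op: the allocation stream's FIFO already orders reuse
        self.use_events.append(stream.record_event())  # accumulates

    def reusable(self):
        # Reuse is gated on every attached event having fired.
        return all(event.fired() for event in self.use_events)


compute = ToyStream("compute")
comm_a = ToyStream("comm_a")
comm_b = ToyStream("comm_b")
block = ToyBlock(alloc_stream=compute)

block.record_use(compute)   # allocation stream: nothing is attached
comm_a.launch()
block.record_use(comm_a)    # event after comm_a's read
comm_b.launch()
block.record_use(comm_b)    # second event; the block waits for both
```

In this model the block becomes reusable only after both `comm_a.run()` and `comm_b.run()` have been called, mirroring "the block waits for all of them".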

Comparison

| Aspect | `record_stream` | hand-rolled recipe (prod today) | `record_use` (proposed) |
|---|---|---|---|
| Event is recorded | at block free time | at caller's end-of-use | at caller's end-of-use |
| Precision | conservative | precise | precise |
| Caller code | 1 line | 5-ish load-bearing lines | 1 line |
| Safety if misused | always safe | UAF if any step wrong | UAF if called before last read |
| Ref held alive by | caching allocator | user-owned Python stash | caching allocator |
| Drop-on-right-stream routing | `with stream(X): del` required | `with stream(X): del` required | automatic |
| Dynamo-traceable | yes | no (stream context + stash) | yes (custom op) |
| No-op on alloc stream | yes | caller must special-case | yes |
| Composable | one call per stream | one stash per concurrent use | one call per use point |
| Graph capture | yes (via deferral) | works but requires care | falls back to `record_stream` for PR 1 |
| Allocator overhead | 1 `cudaEventRecord` per stream at free | 1 user-level event per use | 1 `cudaEventRecord` per call |
| BC | existing | n/a (user code) | additive; `record_stream` untouched |
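The "Event is recorded" row is what drives the precision difference. A minimal sketch, assuming a stream can be modeled as a plain list of work items in FIFO order (the item names and the helper `work_gating_reuse` are made up for illustration): an event recorded right after the last read waits only for that read, while an event recorded later, when the tensor is freed, also waits for whatever unrelated work was queued in between.

```python
def work_gating_reuse(stream_log, event_index):
    """Work items that must finish before an event at event_index fires."""
    return stream_log[:event_index]


# Work enqueued on the consumer stream, in FIFO order (illustrative names).
stream_b = []
stream_b.append("reduce_scatter(x)")       # consumer's last read of x
record_use_event = len(stream_b)           # record_use: event recorded here

stream_b.append("all_gather(next_layer)")  # unrelated later work
stream_b.append("reduce_scatter(other)")   # unrelated later work
record_stream_event = len(stream_b)        # record_stream: event recorded when x is freed

precise = work_gating_reuse(stream_b, record_use_event)
conservative = work_gating_reuse(stream_b, record_stream_event)
```

Here `precise` contains only the read of `x`, whereas `conservative` also contains the two unrelated items, which is the over-reservation the table labels "conservative".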

Risk and landability

Things TL review should weigh:

BC is clean. record_stream semantics, Block::stream_uses semantics, insert_events() behavior, process_events() loop: all untouched. Every existing caller sees identical behavior.
User-visible footgun. record_use must be called after the consumer's last read or it's a UAF. This is the same hazard today's hand-rolled callers already carry; record_use inherits it, doesn't invent it. Docstring is the guard (no runtime checker — infeasible without new instrumentation).
Graph-capture path degrades silently to imprecise. Under capture, recordUse falls back to stream_uses. Correct, loses precision. Acceptable for PR 1; a precise-in-capture story needs a separate RFC coordinated with graph_capture_record_stream_reuse.
Move-only Block. use_events contains unique_ptrs, so Block becomes move-only. Required one std::move in AllocParams assignment — the only Block copy in the file. No other spots use Block by value.
Codegen plumbing. record_use needs entries in three allow-lists (FUNCTIONAL_OPS_THAT_CANNOT_GET_AN_OUT_VARIANT, MUTABLE_OPS_NOT_USING_FUNCTIONALIZATION, and the non-differentiable list in gen_variable_type.py). Mechanical, same pattern as record_stream.
Non-CUDA backends. DeviceAllocator::recordUse default delegates to recordStream, so XPU / MallocAsync / pluggable allocators compile and run unchanged. Precise implementations are clean mirrors of the CUDA path; deferred to follow-up PRs.
Allocator state footprint. One extra vector per Block, empty on blocks that never record. No heap growth in steady state for record_use-free workloads.
Snapshot visibility gap. use_events doesn't appear in SnapshotInfo. Open question below.
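The two fallback behaviors described above (non-CUDA backends delegating to `recordStream`, and the CUDA path degrading to `stream_uses` under graph capture) follow the same shape. The sketch below is a Python illustration of that delegation pattern only; the real code is C++ in the PR, and the class names, the `capturing` flag, and the `calls` log are assumptions introduced here.

```python
class ToyDeviceAllocator:
    """Hypothetical base class: record_use defaults to the imprecise path."""

    def __init__(self):
        self.calls = []  # log of what the allocator was asked to track

    def record_stream(self, block, stream):
        self.calls.append(("stream_use", block, stream))

    def record_use(self, block, stream):
        # Default: correct but imprecise, so backends work unchanged.
        self.record_stream(block, stream)


class ToyCudaAllocator(ToyDeviceAllocator):
    """Hypothetical precise override with a capture-time fallback."""

    def __init__(self, capturing=False):
        super().__init__()
        self.capturing = capturing

    def record_use(self, block, stream):
        if self.capturing:
            # Under graph capture: degrade to the imprecise path.
            super().record_use(block, stream)
            return
        self.calls.append(("use_event", block, stream))


class ToyXpuAllocator(ToyDeviceAllocator):
    """Hypothetical backend without a precise implementation yet."""
```

A backend that does not override `record_use` keeps compiling and running, at the cost of precision, which is the trade-off the list above calls acceptable for PR 1.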

Alternatives (why not…)

Make record_stream precise. BC-breaking: any caller that today calls it earlier than their last read would silently become a UAF.
keep_alive_until(event) (caller supplies event). Forces boilerplate at the common call site; could be added later as a companion.
record_stream(stream, precise=True). Behavior-changing flag on a public API is hard to grep and hard to review.
Storage-level API. record_stream is a Tensor method; mirror that surface.
Name. Placeholder. Candidates: record_stream_precise, stream_barrier, keep_alive_until, done_using_on. Bikeshed before public-API commitment.

Scope

PR 1: native CUDA allocator, Tensor.record_use, docs, dynamo custom op, 5 unit tests.

Follow-ups: CUDAMallocAsyncAllocator native path; c10/xpu mirror; FSDP2 migration (collapses StreamHandoff to a one-liner); capture-precise (separate RFC); torch.compile eager-mode fast path.

Non-goals: replacing record_stream; changing stream_uses semantics; runtime misuse detection.

Open questions

Name. record_use vs alternatives above.
Pathological per-block event accumulation. Tight-loop record_use without intervening frees grows the per-block vector. Worth a cap + TORCH_WARN_ONCE, or trust users?
Snapshot visibility. Should use_events surface in SnapshotInfo / memory-viz tooling? Free is already attributed correctly via process_events; it's the stash window that's currently invisible.
Capture story. Long term, should precise events participate in captured graphs' dependency structure, or always defer to post-capture? Concrete use cases would help decide.
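For the event-accumulation question, one possible reading of "cap + TORCH_WARN_ONCE" is a soft threshold that warns a single time but never drops an event, since dropping one would weaken the lifetime guarantee. The sketch below uses Python's `warnings` module as a stand-in for `TORCH_WARN_ONCE`; the threshold value and the function name `attach_use_event` are hypothetical, not taken from the PR.

```python
import warnings

USE_EVENT_WARN_THRESHOLD = 4  # hypothetical value, chosen only for the example
_warned_once = False


def attach_use_event(use_events, event):
    """Append an event to a block's list, warning once past the threshold."""
    global _warned_once
    use_events.append(event)  # never dropped: every event still gates reuse
    if len(use_events) > USE_EVENT_WARN_THRESHOLD and not _warned_once:
        _warned_once = True
        warnings.warn(
            "block has accumulated many use events; "
            "record_use may be called in a tight loop without frees",
            RuntimeWarning,
        )
```

A hard cap that discards or coalesces events would need its own safety argument, so this sketch deliberately stops at diagnostics.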

References

Tensor.record_stream docstring (torch/_tensor_docs.py).
Jane Xu, "FSDP & CUDACachingAllocator: an outsider newb perspective" (dev-discuss.pytorch.org).
cudaMallocAsync / cudaFreeAsync (driver API: cuMemAllocAsync / cuMemFreeAsync): allocator-aware-of-streams approach; driver-level, doesn't compose with the native caching allocator's pool semantics.

Appendix: prototype

Implemented and passing locally. 5 new record_use tests green + 3 existing record_stream tests green (no regression). Build clean on the native CUDA allocator path. File-by-file diff, code, and build plumbing live in the PR.

extent analysis

TL;DR

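The proposed Tensor.record_use(stream) records a CUDA event on the given stream at the call site and attaches it to the tensor's allocation block; the caching allocator does not reuse the block until every attached event has fired. It replaces the hand-rolled event, stash, and deferred-drop recipe with one line. The prototype lives in PR #181189, which is open and not merged.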

Guidance

To implement the Tensor.record_use method, modify the Tensor class to include a new method that records a CUDA event on the specified stream and attaches it to the tensor's allocation block.
Update the caching allocator to wait for all attached events to fire before reusing the block, ensuring precise cross-stream lifetime management.
Test the new method thoroughly to ensure it works correctly in various scenarios, including multiple streams and concurrent usage.
Consider adding documentation and examples to help users understand the correct usage of Tensor.record_use and its benefits over the existing record_stream method.

Example

# Example usage of Tensor.record_use
stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)
    x.record_use(stream_B)  # Records a CUDA event on stream_B
del x  # The block will not be reused until the event on stream_B fires

Notes

The implementation of Tensor.record_use should ensure thread safety by taking the same allocator mutex as record_stream.
The method should also handle the case where the allocation stream is the same as the stream on which the event is recorded, in which case it should be a no-op.
The record_use method should accumulate events, allowing multiple calls to attach multiple events to the same block.

Recommendation

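Treat Tensor.record_use as a proposal under review rather than an available fix: PR #181189 is open and unmerged. Compared with record_stream it is precise instead of conservative, but it is not safer: record_stream is always safe, whereas record_use is a use-after-free if called before the consumer's last read. Until the proposal lands, existing code relies on record_stream or the hand-rolled recipe described in the Motivation section.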
