pytorch - ✅(Solved) Fix [RFC] Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator [1 pull requests, 2 comments, 2 participants]

GitHub stats
pytorch/pytorch#181191 (fetched 2026-04-23 07:22:02)
Comments: 2 · Participants: 2 · Timeline events: 65 · Reactions: 0
Timeline (top): subscribed ×25, mentioned ×24, labeled ×6, unsubscribed ×6

Fix Action

Fixed

PR fix notes

PR #181189: Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #181189

proposal: https://github.com/pytorch/pytorch/issues/181191

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @azahed98

Changed files

  • agent_space/record_use_design_doc.md (added, +168/-0)
  • aten/src/ATen/native/cuda/RecordUse.cu (added, +19/-0)
  • aten/src/ATen/native/native_functions.yaml (modified, +5/-0)
  • c10/core/CachingDeviceAllocator.h (modified, +10/-0)
  • c10/cuda/CUDACachingAllocator.cpp (modified, +115/-2)
  • c10/cuda/CUDACachingAllocator.h (modified, +16/-0)
  • test/test_cuda.py (modified, +168/-0)
  • tools/autograd/gen_variable_type.py (modified, +1/-0)
  • torch/_dynamo/variables/streams.py (modified, +16/-0)
  • torch/_dynamo/variables/tensor.py (modified, +19/-0)
  • torch/_tensor_docs.py (modified, +66/-0)
  • torch/nested/_internal/ops.py (modified, +11/-0)
  • torch/overrides.py (modified, +1/-0)
  • torchgen/gen_functionalization_type.py (modified, +1/-0)
  • torchgen/native_function_generation.py (modified, +1/-0)


Tensor.record_use: precise cross-stream lifetime for the CUDA caching allocator

Prototype: https://github.com/pytorch/pytorch/pull/181189

Motivation

The CUDA caching allocator tracks only each block's allocation stream; cross-stream reads are invisible to it. FSDP2 bridges this with a five-step recipe, repeated in three subsystems (all-gather output, reduce-scatter input, all-reduce output) at ~40–80 LOC each:

# Producer on stream A (compute), consumer on stream B (comm).
stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)                 # e.g. reduce_scatter
consumer_event = stream_B.record_event()

# Keep a Python ref alive past this scope — the allocator would
# otherwise reclaim x's block before consumer_event fires.
stash[slot] = x

# ...later, after some path has waited for consumer_event...
with torch.cuda.stream(stream_B):          # critical: drop inside the
    stash[slot] = None                     # consumer's stream context

Three invariants make it work:

  • consumer_event is recorded before any unrelated work queues on stream_B. Delay it and the event captures too much, over-reserving memory.
  • stash[slot] lifetime matches consumer_event pending. Shorter → use-after-free; longer → O(N_layers) memory leak.
  • The drop runs inside with torch.cuda.stream(stream_B):. The allocator attributes each free to the current stream at the moment of deletion, and only stream_B's FIFO has absorbed consumer_event; drop without the wrapper and a later allocation reuses the block while the consumer is still reading.

FSDP2 PRs #140044, #179443, and #180666 each traced to one of these going wrong in a new code path. The same recipe, with its own helper class, is reimplemented in activation-offloading hooks, non-FSDP collective libs, and user multi-stream code via cpp_extension.

The caching allocator already has the event-polling machinery (cuda_events, event_count, process_events()). The only user-facing entry into it today, Tensor.record_stream(stream), records at block-free time rather than at consumer-done — so production code doesn't use it and hand-rolls the recipe instead.

Proposal: Tensor.record_use(stream)

The same scene from the Motivation section, rewritten with record_use:

stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)
    x.record_use(stream_B)           # ← allocator-visible barrier
del x                                # any stream, any thread, any time

The consumer_event = stream_B.record_event(), the stash[slot] = x, the deferred with torch.cuda.stream(stream_B): stash[slot] = None, and the subsystem-specific NamedTuple that glued them together all go away. Drop order, drop site, and drop stream stop mattering.

record_use records a fresh CUDA event on stream at the point of the call and attaches it to the tensor's allocation block; the caching allocator will not reuse the block until every attached event has fired. Precise semantics:

  1. Event recorded now. The allocator calls cudaEventRecord on stream at the call site. The caller is expected to place the call right after their consumer's last read of the tensor on stream.
  2. No-op on allocation stream. record_use(tensor.alloc_stream) is a no-op — the allocation stream's FIFO already orders the read and the next allocation.
  3. Accumulates. Multiple record_use calls attach multiple events. The block waits for all of them.
  4. Composes with record_stream. A block may have both precise use_events and imprecise stream_uses. Both gate the free.
  5. Thread-safe. Takes the same allocator mutex as record_stream.
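
The first three rules above can be sketched with a toy model in plain Python. This is an illustration only, not the allocator's implementation: Stream, Event, Block, and ToyAllocator are hypothetical names, a stream is reduced to a pair of counters, and the record_stream composition (rule 4) and the mutex (rule 5) are left out.

```python
from dataclasses import dataclass, field


class Stream:
    """Toy FIFO stream: launch() enqueues work, advance() completes it in order."""

    def __init__(self, name):
        self.name = name
        self.enqueued = 0    # work items submitted so far
        self.completed = 0   # work items the (imaginary) device has finished

    def launch(self, n=1):
        self.enqueued += n

    def advance(self, n=1):
        self.completed = min(self.enqueued, self.completed + n)


@dataclass
class Event:
    stream: Stream
    position: int  # value of stream.enqueued when the event was recorded

    def fired(self):
        return self.stream.completed >= self.position


@dataclass
class Block:
    alloc_stream: Stream
    use_events: list = field(default_factory=list)
    freed: bool = False


class ToyAllocator:
    def record_use(self, block, stream):
        if stream is block.alloc_stream:
            return  # rule 2: no-op on the allocation stream
        # rule 1: the event is recorded now; rule 3: events accumulate
        block.use_events.append(Event(stream, stream.enqueued))

    def free(self, block):
        block.freed = True

    def reusable(self, block):
        # A freed block may be reused only after every attached event has fired.
        return block.freed and all(e.fired() for e in block.use_events)


compute, comm = Stream("compute"), Stream("comm")
allocator = ToyAllocator()
x = Block(alloc_stream=compute)

comm.launch()                    # consumer kernel reads x on comm
allocator.record_use(x, comm)    # event sits right after the consumer
comm.launch(3)                   # unrelated work queued later on comm
allocator.free(x)                # corresponds to `del x`
print(allocator.reusable(x))     # False: the consumer has not finished
comm.advance(1)                  # the consumer kernel completes
print(allocator.reusable(x))     # True: unrelated work does not hold the block
```

In this model the block becomes reusable as soon as the consumer finishes, even though three unrelated items are still pending on the consumer stream.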

Comparison

| Aspect | record_stream | hand-rolled recipe (prod today) | record_use (proposed) |
| --- | --- | --- | --- |
| Event is recorded | at block free time | at caller's end-of-use | at caller's end-of-use |
| Precision | conservative | precise | precise |
| Caller code | 1 line | 5-ish load-bearing lines | 1 line |
| Safety if misused | always safe | UAF if any step wrong | UAF if called before last read |
| Ref held alive by | caching allocator | user-owned Python stash | caching allocator |
| Drop-on-right-stream routing | with stream(X): del required | with stream(X): del required | automatic |
| Dynamo-traceable | yes | no (stream context + stash) | yes (custom op) |
| No-op on alloc stream | yes | caller must special-case | yes |
| Composable | one call per stream | one stash per concurrent use | one call per use point |
| Graph capture | yes (via deferral) | works but requires care | falls back to record_stream for PR 1 |
| Allocator overhead | 1 cudaEventRecord per stream at free | 1 user-level event per use | 1 cudaEventRecord per call |
| BC | existing | n/a (user code) | additive; record_stream untouched |
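
The "Event is recorded" and "Precision" rows can be made concrete with a small worked example. The sketch below is a simplification: it treats the consumer stream as a plain list and an event as a position in that list, and it assumes the tensor is freed only after the unrelated work has been queued. The function name event_positions is hypothetical.

```python
def event_positions(unrelated_items):
    """Return (record_use_pos, record_stream_pos) as FIFO positions on the consumer stream."""
    stream_b = []                              # work queued on the consumer stream, in order

    stream_b.append("consumer_kernel(x)")      # last read of x
    record_use_pos = len(stream_b)             # record_use: event right after the last read

    stream_b.extend(f"unrelated_{i}" for i in range(unrelated_items))
    record_stream_pos = len(stream_b)          # record_stream: event recorded when x is freed

    return record_use_pos, record_stream_pos


use_pos, stream_pos = event_positions(unrelated_items=3)
# The block stays reserved until the stream reaches the event's position.
print(use_pos, stream_pos, stream_pos - use_pos)  # 1 4 3
```

With three unrelated items queued before the free, the event recorded at free time holds the block for three extra work items that never touch x. That gap is what "conservative" means in the table.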

Risk and landability

Things TL review should weigh:

  1. BC is clean. record_stream semantics, Block::stream_uses semantics, insert_events() behavior, process_events() loop: all untouched. Every existing caller sees identical behavior.
  2. User-visible footgun. record_use must be called after the consumer's last read or it's a UAF. This is the same hazard today's hand-rolled callers already carry; record_use inherits it, doesn't invent it. Docstring is the guard (no runtime checker — infeasible without new instrumentation).
  3. Graph-capture path degrades silently to imprecise. Under capture, recordUse falls back to stream_uses. Correct, loses precision. Acceptable for PR 1; a precise-in-capture story needs a separate RFC coordinated with graph_capture_record_stream_reuse.
  4. Move-only Block. use_events contains unique_ptrs, so Block becomes move-only. Required one std::move in AllocParams assignment — the only Block copy in the file. No other spots use Block by value.
  5. Codegen plumbing. record_use needs entries in three allow-lists (FUNCTIONAL_OPS_THAT_CANNOT_GET_AN_OUT_VARIANT, MUTABLE_OPS_NOT_USING_FUNCTIONALIZATION, and the non-differentiable list in gen_variable_type.py). Mechanical, same pattern as record_stream.
  6. Non-CUDA backends. DeviceAllocator::recordUse default delegates to recordStream, so XPU / MallocAsync / pluggable allocators compile and run unchanged. Precise implementations are clean mirrors of the CUDA path; deferred to follow-up PRs.
  7. Allocator state footprint. One extra vector per Block, empty on blocks that never record. No heap growth in steady state for record_use-free workloads.
  8. Snapshot visibility gap. use_events doesn't appear in SnapshotInfo. Open question below.

Alternatives (why not…)

  • Make record_stream precise. BC-breaking: any caller that today calls it earlier than their last read would silently become a UAF.
  • keep_alive_until(event) (caller supplies event). Forces boilerplate at the common call site; could be added later as a companion.
  • record_stream(stream, precise=True). Behavior-changing flag on a public API is hard to grep and hard to review.
  • Storage-level API. record_stream is a Tensor method; mirror that surface.
  • Name. Placeholder. Candidates: record_stream_precise, stream_barrier, keep_alive_until, done_using_on. Bikeshed before public-API commitment.

Scope

PR 1: native CUDA allocator, Tensor.record_use, docs, dynamo custom op, 5 unit tests.

Follow-ups: CUDAMallocAsyncAllocator native path; c10/xpu mirror; FSDP2 migration (collapses StreamHandoff to a one-liner); capture-precise (separate RFC); torch.compile eager-mode fast path.

Non-goals: replacing record_stream; changing stream_uses semantics; runtime misuse detection.

Open questions

  1. Name. record_use vs alternatives above.
  2. Pathological per-block event accumulation. Tight-loop record_use without intervening frees grows the per-block vector. Worth a cap + TORCH_WARN_ONCE, or trust users?
  3. Snapshot visibility. Should use_events surface in SnapshotInfo / memory-viz tooling? Free is already attributed correctly via process_events; it's the stash window that's currently invisible.
  4. Capture story. Long term, should precise events participate in captured graphs' dependency structure, or always defer to post-capture? Concrete use cases would help decide.
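
For question 2, one possible shape of "cap + warn once" is sketched below in Python. Everything here is an assumption made for illustration: the threshold, the helper name append_use_event, and the choice to prune already-fired events and keep every unfired one (dropping an unfired event would risk early reuse). The RFC does not specify what should happen at the cap.

```python
import warnings

USE_EVENT_SOFT_CAP = 4   # hypothetical threshold, kept small for illustration
_warned_once = False     # mimics the once-per-process behavior of TORCH_WARN_ONCE


def append_use_event(use_events, event, is_fired):
    """Append `event` to a block's list, pruning fired events and warning once past the cap."""
    global _warned_once
    # Drop events that have already fired; they no longer gate reuse.
    use_events[:] = [e for e in use_events if not is_fired(e)]
    # Never drop an unfired event: always append, even past the cap.
    use_events.append(event)
    if len(use_events) > USE_EVENT_SOFT_CAP and not _warned_once:
        _warned_once = True
        warnings.warn(
            f"record_use: more than {USE_EVENT_SOFT_CAP} pending events on one block"
        )
    return len(use_events)
```

A soft cap of this kind only reports the pattern; it does not bound the vector when none of the events have fired.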

References

  • Tensor.record_stream docstring (torch/_tensor_docs.py).
  • Jane Xu, "FSDP & CUDACachingAllocator: an outsider newb perspective" (dev-discuss.pytorch.org).
  • cudaMallocAsync / cuMemFreeAsync: allocator-aware-of-streams approach; driver-level, doesn't compose with the native caching allocator's pool semantics.

Appendix: prototype

Implemented and passing locally. 5 new record_use tests green + 3 existing record_stream tests green (no regression). Build clean on the native CUDA allocator path. File-by-file diff, code, and build plumbing live in the PR.

extent analysis

TL;DR

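The proposed Tensor.record_use(stream) records a CUDA event on stream at the call site and attaches it to the tensor's allocation block; the caching allocator does not reuse the block until every attached event has fired. It replaces the hand-rolled stash-and-drop recipe with one line and leaves record_stream unchanged.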

Guidance

  • Call x.record_use(stream) right after the consumer's last read of x on stream. Calling it before the last read is a use-after-free; there is no runtime check.
  • After the call, x can be dropped from any stream or thread; no stash, consumer event, or with torch.cuda.stream(...) wrapper is needed around the drop.
  • Under CUDA graph capture, and on non-CUDA or pluggable allocators, the prototype falls back to record_stream behavior: still correct, but imprecise.
  • The prototype PR adds docstrings (torch/_tensor_docs.py) and unit tests (test/test_cuda.py); the name record_use is a placeholder pending review.

Example

# Example usage of Tensor.record_use
stream_B.wait_event(producer_event)
with torch.cuda.stream(stream_B):
    y = consumer_kernel(x)
    x.record_use(stream_B)  # Records a CUDA event on stream_B
del x  # The block will not be reused until the event on stream_B fires

Notes

  • Per the RFC, record_use is thread-safe: it takes the same allocator mutex as record_stream.
  • record_use on the tensor's allocation stream is a no-op, because that stream's FIFO already orders the read before the next allocation.
  • Multiple record_use calls accumulate: each attaches its own event, and the block waits for all of them.

Recommendation

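Where code currently hand-rolls the cross-stream stash recipe, Tensor.record_use gives the same end-of-use precision in one line. It is not safer than record_stream: record_stream is conservative and always safe, while record_use is a use-after-free if called before the consumer's last read. The API is a prototype (PR #181189) and its name is still open.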
