pytorch - ✅(Solved) Fix [inductor] Stream handling in torch.compile with reduce-overhead mode is broken [1 pull requests, 1 participants]

GitHub stats
pytorch/pytorch#180396 · Fetched 2026-04-16 06:34:52
Comments: 0 · Participants: 1 · Timeline events: 98 · Reactions: 0
Timeline (top): mentioned ×45, subscribed ×45, labeled ×6, assigned ×1

Error Message

torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing

Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Raised by the reproduction script in the bug description below when run as CUDA_LAUNCH_BLOCKING=1 python test_streams.py.

PR fix notes

PR #180497: [user-streams] Fix cudagraphs compatibility with current stream

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/180396

Issue: When torch.compile(fn, mode="reduce-overhead") captures a CUDA graph, custom stream ops (event record/wait) resolve the "current stream" from the external object registry. But the registry was populated during the dynamo bytecode prologue with the trace-time default stream, not the cudagraph capture stream. This caused cudaErrorStreamCaptureUnsupported.

Fix: Register the current stream at index 0 in the external object registry at trace time. The inductor wrapper emits set_external_object_by_index(0, torch.cuda.current_stream()) at runtime, so during cudagraph capture, index 0 resolves to the actual capture stream instead of the stale default stream.

Real code changes (~30 lines across 4 files):

  1. graph_bytecode_inputs.py (+9): Added the CURRENT_STREAM_INDEX = 0 constant and set_external_object_by_index(), which updates an entry at runtime and keeps the object alive via the existing keep_alive list.
  2. variables/streams.py (+~20): SymbolicStreamState.__init__ now registers the current stream at index 0 when the registry is fresh. Simplified _get_stream_arg to just return user_object_index (no conditional logic). Added back cur_stream_id(), which output_graph.py needs.
  3. variables/builder.py (+4/-4): When wrapping a stream with CurrentStreamSource, use CURRENT_STREAM_INDEX instead of allocating a new index.
  4. codegen/wrapper.py (+7): Emit set_external_object_by_index(0, torch.cuda.current_stream()) at the top of the wrapper so custom ops see the actual runtime stream (the capture stream during cudagraph recording).
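
The registry mechanism in the list above can be modeled in plain Python. This is only a toy sketch, not PyTorch's implementation: ToyRegistry, FakeStream, and event_record are hypothetical names made up for illustration, and only CURRENT_STREAM_INDEX, set_external_object_by_index, and the keep_alive list come from the PR description. It tries to show why an entry filled once at trace time goes stale, and how overwriting index 0 at the start of each run avoids that.

```python
from dataclasses import dataclass, field
from typing import Any, List

CURRENT_STREAM_INDEX = 0  # name taken from the PR description


@dataclass(frozen=True)
class FakeStream:
    """Hypothetical stand-in for a CUDA stream; no GPU is involved."""
    name: str


@dataclass
class ToyRegistry:
    """Hypothetical model of an index-based external object registry."""
    objects: List[Any] = field(default_factory=list)
    keep_alive: List[Any] = field(default_factory=list)

    def register(self, obj: Any) -> int:
        # Append the object and return the index it was stored at.
        self.objects.append(obj)
        self.keep_alive.append(obj)
        return len(self.objects) - 1

    def set_external_object_by_index(self, index: int, obj: Any) -> None:
        # Overwrite an existing slot and keep the new object referenced.
        self.objects[index] = obj
        self.keep_alive.append(obj)

    def get(self, index: int) -> Any:
        return self.objects[index]


def event_record(registry: ToyRegistry) -> str:
    """Pretend custom op: reports which stream index 0 resolves to."""
    return registry.get(CURRENT_STREAM_INDEX).name


default_stream = FakeStream("default")
capture_stream = FakeStream("capture")

# Trace time: the current stream is registered first, so it lands at index 0.
registry = ToyRegistry()
assert registry.register(default_stream) == CURRENT_STREAM_INDEX

# Without a runtime update, the op still resolves the trace-time stream.
stale = event_record(registry)

# With the update performed at the start of the run, it resolves the live stream.
registry.set_external_object_by_index(CURRENT_STREAM_INDEX, capture_stream)
fresh = event_record(registry)

print(stale, fresh)
```

In this model the stale lookup corresponds to the failure described in the issue (an event recorded on a stream that is not capturing), while the refreshed lookup corresponds to the behavior the PR aims for.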

Stack from ghstack (oldest at bottom):

  • -> #180497

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela @azahed98

Changed files

  • test/dynamo/test_streams.py (modified, +90/-90)
  • test/functorch/test_leaf_function.py (modified, +6/-6)
  • test/inductor/test_user_streams.py (modified, +98/-34)
  • torch/_dynamo/graph_bytecode_inputs.py (modified, +9/-0)
  • torch/_dynamo/variables/builder.py (modified, +7/-1)
  • torch/_dynamo/variables/streams.py (modified, +35/-10)
  • torch/_inductor/codegen/wrapper.py (modified, +6/-3)
  • torch/_inductor/cudagraph_trees.py (modified, +21/-0)


🐛 Describe the bug

When the user function forks work onto a side stream, torch.compile with mode="reduce-overhead" fails at capture time with torch.AcceleratorError, because an operation that is not permitted during stream capture is issued.

MRE:

'''
Run as `CUDA_LAUNCH_BLOCKING=1 python test_streams.py`

Based on my understanding of `reduce-overhead`, it should either capture the graph,
partition out the unsupported operation, or fall back to eager, but not raise an error
'''

import torch

s1 = torch.cuda.Stream()
ev = torch.cuda.Event()
ev2 = torch.cuda.Event()

def stream_fn(x, y):
    ev.record()                    # record on capture stream
    with torch.cuda.stream(s1):
        ev.wait()                  # s1 waits for capture stream
        z = x * 2                  # runs on s1
        ev2.record()               # record completion on s1
    ev2.wait()                     # capture stream waits for s1
    return z + y                   # runs on capture stream


x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()

eager_result = stream_fn(x, y)  # Eager

compiled_fn = torch.compile(stream_fn, mode="reduce-overhead")

compiled_fn(x, y)  # Warmup
compiled_fn(x, y)  # Capture

# torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
# Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
# Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
compiled_fn(x, y)  # Run

This should be allowed under CUDA graph capture since it cleanly forks off of a capturing stream and joins back.
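
The fork-and-join shape described here can be tried with a manual capture outside torch.compile. The following is a minimal sketch, assuming a CUDA device is available; it follows the branch-out and rejoin pattern from PyTorch's CUDA graphs notes and uses Stream.wait_stream in place of the explicit events in the MRE. The helper name fork_join_capture is made up for illustration, and the sketch does not exercise the torch.compile path that this issue is about.

```python
import torch


def fork_join_capture(n: int = 1000):
    """Capture x * 2 on a side stream and z + y on the capturing stream."""
    if not torch.cuda.is_available():
        return None  # nothing to demonstrate without a CUDA device

    x = torch.randn(n, n, device="cuda")
    y = torch.randn(n, n, device="cuda")
    side = torch.cuda.Stream()
    graph = torch.cuda.CUDAGraph()

    with torch.cuda.graph(graph):
        # Inside the context, the current stream is the capturing stream.
        capture_stream = torch.cuda.current_stream()
        side.wait_stream(capture_stream)   # fork: side branches off the capture
        with torch.cuda.stream(side):
            z = x * 2                      # captured on the side stream
        capture_stream.wait_stream(side)   # join: rejoin before capture ends
        out = z + y                        # captured on the capturing stream

    graph.replay()
    torch.cuda.synchronize()
    # Compare the replayed result against the same computation run eagerly.
    return bool(torch.allclose(out, x * 2 + y))


if __name__ == "__main__":
    print(fork_join_capture())
```

If the side stream did not wait on the capturing stream first, or the capturing stream did not wait on the side stream before the context exits, the capture would be expected to fail, which is why the report stresses that the fork is clean.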

Originally discovered by @kshitij12345.

Versions

N/A

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

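The root cause is in torch.compile's stream handling and is addressed by PR #180497. Until that fix is available, a possible workaround is to modify stream_fn so that it avoids the operations that fail during CUDA graph capture.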

Guidance

  • Verify that the error occurs due to the use of torch.cuda.Event and torch.cuda.Stream within the compiled function.
  • Consider restructuring stream_fn to minimize or avoid the use of unsupported operations during capture, potentially by using alternative synchronization methods.
  • Investigate the use of TORCH_USE_CUDA_DSA to enable device-side assertions for more detailed error information.
  • Review the CUDA graph capture documentation to ensure that the current implementation aligns with the expected behavior.

Example

# Example of alternative synchronization using `torch.cuda.synchronize()`
def stream_fn(x, y):
    with torch.cuda.stream(s1):
        z = x * 2
    ...  # the rest of this example is cut off in the original
