pytorch - ✅(Solved) Fix [inductor] Stream handling in torch.compile with reduce-overhead mode is broken [1 pull requests, 1 participants]

pytorch · 2026-04-14 22:46:03

GitHub stats

pytorch/pytorch#180396 • Fetched 2026-04-16 06:34:52

Author

tirthasheshpatel

Participants

tirthasheshpatel

Assignees

mlazos

Timeline (top)

mentioned ×45 · subscribed ×45 · labeled ×6 · assigned ×1

Error Message

torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing

Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

PR fix notes

PR #180497: [user-streams] Fix cudagraphs compatibility with current stream

Repository: pytorch/pytorch
Author: mlazos
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/180497

Description (problem / solution / changelog)

Fixes https://github.com/pytorch/pytorch/issues/180396

Issue: When torch.compile(fn, mode="reduce-overhead") captures a CUDA graph, custom stream ops (event record/wait) resolve the "current stream" from the external object registry. But the registry was populated during the dynamo bytecode prologue with the trace-time default stream, not the cudagraph capture stream. This caused cudaErrorStreamCaptureUnsupported.

Fix: Register the current stream at index 0 in the external object registry at trace time. The inductor wrapper emits set_external_object_by_index(0, torch.cuda.current_stream()) at runtime, so during cudagraph capture, index 0 resolves to the actual capture stream instead of the stale default stream.

Real code changes (~30 lines across 4 files):

graph_bytecode_inputs.py (+9): Added the CURRENT_STREAM_INDEX = 0 constant and set_external_object_by_index(), which updates an entry at runtime and keeps the object alive via the existing keep_alive list.
variables/streams.py (+~20): SymbolicStreamState.__init__ now registers the current stream at index 0 when the registry is fresh. Simplified _get_stream_arg to just return user_object_index (no conditional logic). Added back cur_stream_id(), which output_graph.py needs.
variables/builder.py (+4/-4): When wrapping a stream with CurrentStreamSource, use CURRENT_STREAM_INDEX instead of allocating a new index.
codegen/wrapper.py (+7): Emit set_external_object_by_index(0, torch.cuda.current_stream()) at the top of the wrapper so custom ops see the actual runtime stream (the capture stream during cudagraph recording).
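
The registry mechanism described in these notes can be illustrated with a small, framework-free sketch. Everything in it is hypothetical: `Registry`, `FakeStream` and `run_custom_op` are stand-ins invented for illustration, and only the names `CURRENT_STREAM_INDEX` and `set_external_object_by_index` are borrowed from the PR description. The actual code in torch/_dynamo/graph_bytecode_inputs.py may look quite different.

```python
# Toy model of an index-based external object registry (not PyTorch's code).
from dataclasses import dataclass, field

# Name taken from the PR notes: slot reserved for the current stream.
CURRENT_STREAM_INDEX = 0


@dataclass(frozen=True)
class FakeStream:
    """Hypothetical stand-in for a CUDA stream handle."""
    name: str


@dataclass
class Registry:
    """Hypothetical registry mapping small integer indices to live objects."""
    objects: list = field(default_factory=list)
    keep_alive: list = field(default_factory=list)

    def register(self, obj) -> int:
        # Append the object and return the index it was stored at.
        self.objects.append(obj)
        self.keep_alive.append(obj)
        return len(self.objects) - 1

    def set_external_object_by_index(self, index: int, obj) -> None:
        # Overwrite an existing slot at runtime and keep the new object alive.
        self.objects[index] = obj
        self.keep_alive.append(obj)

    def get(self, index: int):
        return self.objects[index]


def run_custom_op(registry: Registry, stream_index: int) -> str:
    """Pretend custom op: reports which stream it would act on."""
    return registry.get(stream_index).name


default_stream = FakeStream("default")
capture_stream = FakeStream("capture")

# Trace time: the current stream is registered first, so it lands at index 0.
registry = Registry()
assert registry.register(default_stream) == CURRENT_STREAM_INDEX

# Without a runtime refresh, the op resolves the stale trace-time stream.
stale = run_custom_op(registry, CURRENT_STREAM_INDEX)

# With the refresh described in the PR, the wrapper updates index 0 on entry,
# so the op resolves whatever stream is current (here, the capture stream).
registry.set_external_object_by_index(CURRENT_STREAM_INDEX, capture_stream)
fresh = run_custom_op(registry, CURRENT_STREAM_INDEX)

print(stale, fresh)
```

The point of the fixed index is that code emitted at trace time can keep referring to slot 0 while the object stored in that slot changes from call to call.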

Stack from ghstack (oldest at bottom):

-> #180497

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela @azahed98

Changed files

test/dynamo/test_streams.py (modified, +90/-90)
test/functorch/test_leaf_function.py (modified, +6/-6)
test/inductor/test_user_streams.py (modified, +98/-34)
torch/_dynamo/graph_bytecode_inputs.py (modified, +9/-0)
torch/_dynamo/variables/builder.py (modified, +7/-1)
torch/_dynamo/variables/streams.py (modified, +35/-10)
torch/_inductor/codegen/wrapper.py (modified, +6/-3)
torch/_inductor/cudagraph_trees.py (modified, +21/-0)

🐛 Describe the bug

When there's a stream fork in the user function, torch.compile with reduce-overhead mode fails with torch.AcceleratorError at capture time, reporting that an operation is not permitted while the stream is capturing.

MRE:

'''
Run as `CUDA_LAUNCH_BLOCKING=1 python test_streams.py`

Based on my understanding of `reduce-overhead`, it should either capture,
partition unsupported operation or fallback to eager but not error
'''

import torch

s1 = torch.cuda.Stream()
ev = torch.cuda.Event()
ev2 = torch.cuda.Event()

def stream_fn(x, y):
    ev.record()                    # record on capture stream
    with torch.cuda.stream(s1):
        ev.wait()                  # s1 waits for capture stream
        z = x * 2                  # runs on s1
        ev2.record()               # record completion on s1
    ev2.wait()                     # capture stream waits for s1
    return z + y                   # runs on capture stream


x = torch.randn(1000, 1000).cuda()
y = torch.randn(1000, 1000).cuda()

eager_result = stream_fn(x, y)  # Eager

compiled_fn = torch.compile(stream_fn, mode="reduce-overhead")

compiled_fn(x, y)  # Warmup
compiled_fn(x, y)  # Capture

# torch.AcceleratorError: CUDA error: operation not permitted when stream is capturing
# Search for `cudaErrorStreamCaptureUnsupported' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
# Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
compiled_fn(x, y)  # Run

This should be allowed under CUDA graph capture since it cleanly forks off of a capturing stream and joins back.
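
The fork-and-join claim can be made concrete with a toy model of capture membership. This is a simplified sketch, not CUDA or PyTorch code: `ToyCapture`, `CaptureError` and the stream and event names are hypothetical. It models only two rules: a stream that waits on an event recorded in a capturing stream joins the capture, and every forked stream must be joined back before capture ends.

```python
# Toy model of stream-capture fork/join bookkeeping (illustration only).
class CaptureError(RuntimeError):
    """Raised by the toy model when capture ends with unjoined streams."""


class ToyCapture:
    def __init__(self, origin: str):
        self.capturing = {origin}   # streams that are part of the capture
        self.unjoined = set()       # forked streams not yet joined back
        self.events = {}            # event -> capturing stream it was recorded on
        self.graph = []             # (stream, kernel) pairs captured so far

    def record(self, event: str, stream: str) -> None:
        # Only events recorded on a capturing stream matter to the model.
        self.events[event] = stream if stream in self.capturing else None

    def wait(self, event: str, stream: str) -> None:
        src = self.events.get(event)
        if src is None:
            return  # event is outside the capture; not modelled here
        if stream not in self.capturing:
            # Fork: waiting on a captured event pulls this stream in.
            self.capturing.add(stream)
            self.unjoined.add(stream)
        else:
            # Join: a capturing stream now depends on the work in src.
            self.unjoined.discard(src)

    def launch(self, kernel: str, stream: str) -> None:
        if stream in self.capturing:
            self.graph.append((stream, kernel))

    def end(self) -> list:
        if self.unjoined:
            raise CaptureError(f"unjoined streams: {sorted(self.unjoined)}")
        return list(self.graph)


# Same shape as stream_fn in the MRE: fork to s1, then join back.
cap = ToyCapture("main")
cap.record("ev", "main")
cap.wait("ev", "s1")        # s1 forks off the capturing stream
cap.launch("mul", "s1")
cap.record("ev2", "s1")
cap.wait("ev2", "main")     # main joins s1 back
cap.launch("add", "main")
graph = cap.end()

# Dropping the join leaves s1 dangling, so the toy capture fails.
bad = ToyCapture("main")
bad.record("ev", "main")
bad.wait("ev", "s1")
bad.launch("mul", "s1")
try:
    bad.end()
    missing_join_fails = False
except CaptureError:
    missing_join_fails = True
```

In this model the MRE's pattern captures both kernels, which is the behaviour the issue expects from reduce-overhead mode.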

originally discovered by @kshitij12345

Versions

N/A

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

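According to PR #180497, the root cause is inside PyTorch: event record/wait ops resolve a stale trace-time stream instead of the cudagraph capture stream. The PR (open and not merged when this page was fetched) addresses this. As a workaround, stream_fn can be modified to avoid custom stream and event operations during CUDA graph capture.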

Guidance

Verify that the error is triggered by the torch.cuda.Event and torch.cuda.Stream operations inside the compiled function.
Consider restructuring stream_fn to minimize or avoid custom stream and event operations during capture, potentially by using alternative synchronization methods.
Investigate the use of TORCH_USE_CUDA_DSA, as suggested in the error text, to enable device-side assertions for more detailed error information.
Review the CUDA graph capture documentation to check which operations are permitted while a stream is capturing.

Example

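# Example of alternative synchronization using `torch.cuda.synchronize()`
# (assumed completion; a device-wide sync is not capturable, so eager only)
def stream_fn(x, y):
    s1.wait_stream(torch.cuda.current_stream())  # s1 sees prior work on x
    with torch.cuda.stream(s1):
        z = x * 2                  # runs on s1
    torch.cuda.synchronize()       # block until s1 has finished
    return z + y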
