pytorch - 💡(How to fix) Fix Torch.compile crashes with multi-stream and triton kernels [3 comments, 2 participants]

pytorch · 2026-05-01 00:47:33


GitHub stats

pytorch/pytorch#182084•Fetched 2026-05-01 05:32:34

Timeline: 163 events (top: mentioned ×76, subscribed ×76, labeled ×7, commented ×3)



🐛 Describe the bug

I hit this bug while testing PyTorch 2.12's multi-stream feature in vLLM, serving DeepSeek-V4-Pro.

The problem is that the Inductor codegen for fused Triton kernels emits an integer raw-stream handle whose variable name can collide with, and overwrite, the variable that holds a torch.cuda.Stream object in the generated wrapper. Reproducing it needs multiple GPUs: the handle's name mirrors the CUDA device index, so the tensors must live on cuda:1 for the handle to be named "stream1" and alias the multi-stream "stream1" label.
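The clobbering can be imitated without a GPU. The sketch below is only an illustration of the mechanism: `FakeStream`, `FakeStreamContext`, and `fake_get_raw_stream` are hypothetical stand-ins invented here, not PyTorch APIs, and the function body only mimics the shape of the generated wrapper described in this report.

```python
class FakeStream:
    """Hypothetical stand-in for torch.cuda.Stream: it only carries .device."""

    def __init__(self, device):
        self.device = device


class FakeStreamContext:
    """Hypothetical stand-in for torch.cuda.stream(...): reads .device on entry."""

    def __init__(self, stream):
        self.stream = stream

    def __enter__(self):
        # An int has no .device attribute, so this line raises AttributeError.
        _ = self.stream.device
        return self

    def __exit__(self, *exc):
        return False


def fake_get_raw_stream(device_idx):
    """Hypothetical stand-in for the raw handle lookup: returns a plain int."""
    return 0


def call_like_generated_wrapper():
    # Aux stream object, bound to the name `stream1`.
    stream1 = FakeStream("cuda:1")
    with FakeStreamContext(stream1):
        pass  # first aux region works

    # Triton launch on device 1 rebinds the SAME name to an int handle.
    stream1 = fake_get_raw_stream(1)

    try:
        with FakeStreamContext(stream1):  # second aux region gets the int
            pass
    except AttributeError as e:
        return str(e)
    return None


result = call_like_generated_wrapper()
print(result)
```

The first aux region succeeds and the second one fails, which matches the aux-region -> default Triton -> aux-region schedule the reproducer forces.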

Reproducer:

#!/usr/bin/env python3
"""
Minimal repro for a PyTorch 2.12 Inductor multi-stream codegen bug.

The wrapper codegen sets up an aux stream as:

    stream1 = get_external_object_by_index(N)        # torch.cuda.Stream

Triton kernel launches inside `with torch.cuda.stream(default_stream):`
emit a separate int handle named `stream<device_idx>`:

    stream1 = get_raw_stream(1)                      # int (cudaStream_t)
    triton_kernel.run(..., stream=stream1)

When the device index matches the aux Stream object's codegen index
(both named `stream1`), the int rebind clobbers the function-scope name.
A subsequent `with torch.cuda.stream(stream1):` then receives the int
and StreamContext.__enter__ crashes:

    AttributeError: 'int' object has no attribute 'device'

To trigger:
  - Bind tensors to cuda:N where N >= 1 (so the int handle is `stream<N>`,
    matching the first aux Stream object's variable name).
  - Force the schedule to be: aux-region -> default Triton -> aux-region,
    via data dependencies, so Inductor cannot reorder the second aux-region
    before the clobbering Triton launch.

Run with:
    TORCH_COMPILE_DEBUG=1 python3 inductor_stream_repro.py
to dump generated code to ./torch_compile_debug/ for inspection.
"""

import torch

assert torch.cuda.is_available(), "CUDA required"
assert torch.cuda.device_count() >= 2, (
    "this repro needs cuda:1; the bug is specific to a non-zero device index"
)
print(f"torch.__version__ = {torch.__version__}")

device = torch.device("cuda:1")
aux = torch.cuda.Stream(device=device)
ev0 = torch.cuda.Event()
ev1 = torch.cuda.Event()


def model(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # Region 1 (aux): produces `a`. Eager mm + event ops force a real
    # `with torch.cuda.stream(stream1):` block in the codegen.
    ev0.record()
    with torch.cuda.stream(aux):
        ev0.wait()
        a = torch.mm(x, w)
        ev1.record()

    # Region 2 (default): a Triton-fusable pointwise that DEPENDS on `a`,
    # so it must execute after Region 1. This is the launch that emits
    # `stream1 = get_raw_stream(1)` inside `with torch.cuda.stream(default_stream)`,
    # clobbering the function-scope `stream1`.
    ev1.wait()
    c = (a.sin() * a.cos() + 0.1).relu()

    # Region 3 (aux again): depends on `c`, so it cannot be reordered
    # before Region 2. By the time we re-enter `with torch.cuda.stream(aux)`,
    # the function-scope `stream1` has been overwritten with an int.
    with torch.cuda.stream(aux):
        d = torch.mm(c, w.t())

    return d


M, K = 256, 1024
x = torch.randn(M, K, device=device, dtype=torch.float32)
w = torch.randn(K, M, device=device, dtype=torch.float32)

# Eager.
y_eager = model(x, w)
torch.cuda.synchronize()
print(f"eager:    ok, |y| = {y_eager.norm().item():.4f}")

# Compiled.
compiled = torch.compile(model, fullgraph=True)
try:
    y_compiled = compiled(x, w)
    torch.cuda.synchronize()
    print(f"compiled: ok, |y| = {y_compiled.norm().item():.4f}")
    print("DID NOT REPRODUCE — pattern may differ from the in-tree case.")
except AttributeError as e:
    print(f"REPRODUCED: AttributeError: {e}")
    import traceback

    traceback.print_exc()

Traceback:

(.venv) root@censored/:~# TORCH_COMPILE_DEBUG=1 python3 inductor_stream_repro.py
torch.__version__ = 2.12.0+cu130
eager:    ok, |y| = 2593.1619
W0430 17:39:33.423000 1535718 vllm/.venv/lib/python3.12/site-packages/torch/_inductor/debug.py:530] [0/0] model__0_inference_0 debug trace: censored//torch_compile_debug/run_2026_04_30_17_39_29_650949-pid_1535718/torchinductor/model__0_inference_0.0
REPRODUCED: AttributeError: 'int' object has no attribute 'device'
Traceback (most recent call last):
  File "censored/inductor_stream_repro.py", line 85, in <module>
    y_compiled = compiled(x, w)
                 ^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1047, in compile_wrapper
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "censored//inductor_stream_repro.py", line 48, in model
    def model(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1297, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1273, in forward
    return compiled_fn(full_args)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 777, in runtime_wrapper
    all_outs = compiled_invoker.run(args, on_before_call=exit_prologue)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 523, in run
    return call_func_at_runtime_with_args(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
                            ^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 850, in wrapper
    return compiled_fn(runtime_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1055, in inner_fn
    outs = compiled_fn(args)
           ^^^^^^^^^^^^^^^^^
  File "censored/vllm/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 725, in __call__
    return self.current_callable(inputs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/torchinductor_bchislett/sq/csq6v6kldanm56skobiuvku7tnppgavpnskrdybulllhcotnwbm6.py", line 140, in call
    with torch.cuda.stream(stream1):
  File "censored//vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py", line 755, in __enter__
    if self.src_prev_stream.device != cur_stream.device:
                                      ^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'device'
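To check a dumped wrapper for the same pattern, a small text scan may help. This is a rough sketch that assumes the generated code uses the two assignment forms quoted in the reproducer's docstring (`streamN = get_raw_stream(...)` and `streamN = get_external_object_by_index(...)`); `find_stream_name_collisions` is a hypothetical helper, and the sample text is hand-written for illustration, not real Inductor output.

```python
import re
from collections import defaultdict

# Matches `streamN = get_raw_stream(` or `streamN = get_external_object_by_index(`.
ASSIGN_RE = re.compile(
    r"^\s*(stream\d+)\s*=\s*(get_raw_stream|get_external_object_by_index)\(",
    re.MULTILINE,
)


def find_stream_name_collisions(source):
    """Return names assigned from both producers somewhere in `source`."""
    producers = defaultdict(set)
    for name, producer in ASSIGN_RE.findall(source):
        producers[name].add(producer)
    return sorted(name for name, seen in producers.items() if len(seen) > 1)


# Hand-written sample shaped like the pattern in the report (index 0 is made up).
sample = """
def call(args):
    stream1 = get_external_object_by_index(0)
    with torch.cuda.stream(stream1):
        pass
    stream1 = get_raw_stream(1)
    with torch.cuda.stream(stream1):
        pass
"""

collisions = find_stream_name_collisions(sample)
print(collisions)
```

To use it on a real run, read the generated Python file named in the traceback (or the one under ./torch_compile_debug/) into a string and pass it to the function.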

Versions

(.venv) root@bia0087:~# curl -sL https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py | python
Collecting environment information...
PyTorch version: 2.12.0+cu130
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 4.3.2
Libc version: glibc-2.39

Python version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.14.0-37-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.0.88
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA B300 SXM6 AC
GPU 1: NVIDIA B300 SXM6 AC
GPU 2: NVIDIA B300 SXM6 AC
GPU 3: NVIDIA B300 SXM6 AC
GPU 4: NVIDIA B300 SXM6 AC
GPU 5: NVIDIA B300 SXM6 AC
GPU 6: NVIDIA B300 SXM6 AC
GPU 7: NVIDIA B300 SXM6 AC

Nvidia driver version: 580.126.09
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.14.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.14.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Caching allocator config: N/A

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  256
On-line CPU(s) list:                     0-255
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) 6776P
CPU family:                              6
Model:                                   173
Thread(s) per core:                      2
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
CPU(s) scaling MHz:                      24%
CPU max MHz:                             4600.0000
CPU min MHz:                             800.0000
BogoMIPS:                                4600.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               6 MiB (128 instances)
L1i cache:                               8 MiB (128 instances)
L2 cache:                                256 MiB (128 instances)
L3 cache:                                672 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-63,128-191
NUMA node1 CPU(s):                       64-127,192-255
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Vulnerable
Vulnerability Spectre v1:                Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:                Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Vulnerable

Versions of relevant libraries:
[pip3] numpy==2.3.5
[pip3] nvidia-cublas==13.1.1.3
[pip3] nvidia-cuda-cupti==13.0.85
[pip3] nvidia-cuda-nvrtc==13.0.88
[pip3] nvidia-cuda-runtime==13.0.96
[pip3] nvidia-cudnn-cu13==9.20.0.48
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft==12.0.0.61
[pip3] nvidia-curand==10.4.0.35
[pip3] nvidia-cusolver==12.0.4.66
[pip3] nvidia-cusparse==12.6.3.3
[pip3] nvidia-cusparselt-cu13==0.8.1
[pip3] nvidia-nccl-cu13==2.29.7
[pip3] nvidia-nvjitlink==13.0.88
[pip3] nvidia-nvtx==13.0.85
[pip3] torch==2.12.0+cu130
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.11.0
[pip3] torchvision==0.27.0+cu130
[pip3] triton==3.7.0
[conda] Could not collect

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @nWEIdia @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The name collision is in Inductor's generated wrapper code, not in the user's model function, so renaming variables in model (where the stream is already called aux) cannot fix it. A real fix has to give the raw Triton stream handle and the Stream object distinct names in the codegen.

Guidance

Confirm the collision in the generated wrapper: run with TORCH_COMPILE_DEBUG=1 and look in ./torch_compile_debug/ for stream1 being assigned both from get_external_object_by_index(...) and from get_raw_stream(1).
Do not expect a rename in model to help: stream1 exists only in generated code, and the user-side stream is already named aux.
As a possible workaround, not verified in the issue, keep the compiled tensors on device index 0 (for example by exposing the target GPU as cuda:0 through CUDA_VISIBLE_DEVICES), since the report says the collision needs a non-zero device index.
After any change, rerun the reproducer and check that the compiled call completes and matches the eager result.

Example

stream1 = get_external_object_by_index(N)        # torch.cuda.Stream (aux stream)
# ...
stream1 = get_raw_stream(1)                      # int (cudaStream_t), overwrites the Stream
triton_kernel.run(..., stream=stream1)
# ...
with torch.cuda.stream(stream1):                 # receives the int and raises AttributeError

This sketch of the generated wrapper is pieced together from the reproducer's docstring and the traceback. Both assignments target stream1, and the name belongs to Inductor's codegen, not to the user's model.

Notes

The diagnosis is the reporter's, taken from the reproducer's docstring and the traceback. The device-index workaround is an inference from that description and has not been tested here. If the crash persists on device index 0, the collision has another trigger and needs further investigation.

Recommendation

Treat this as an Inductor codegen bug to be fixed upstream (pytorch/pytorch#182084). Until a fix lands, avoid the colliding configuration (compiled multi-stream code on a non-zero device index) or run the affected region in eager mode, which completes without error in the reproducer.
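Because the report ties the collision to a non-zero device index, one avenue worth trying is to make the target GPU appear as cuda:0 to the process. This is an untested assumption, not something the issue confirms. `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for this and has to be set before CUDA is initialized; `expose_single_gpu` is a hypothetical helper name, and the demo call writes to a throwaway dict so it has no side effects.

```python
import os


def expose_single_gpu(physical_index, environ=None):
    """Set CUDA_VISIBLE_DEVICES so one physical GPU shows up as cuda:0.

    Call this before importing torch (or at least before any CUDA call),
    otherwise the setting has no effect on the already-initialized runtime.
    """
    env = os.environ if environ is None else environ
    env["CUDA_VISIBLE_DEVICES"] = str(physical_index)
    return env


# Demo against a plain dict instead of the real process environment.
fake_env = expose_single_gpu(1, environ={})
print(fake_env)
```

In a real script, `expose_single_gpu(1)` would come first, and the reproducer would then use `torch.device("cuda:0")` instead of `cuda:1`. Whether this avoids the crash in a given setup has to be checked by rerunning the compiled model.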
