pytorch: `torch.cuda.ExternalStream(0, device=...)` silently returns a fresh pooled stream instead of wrapping handle `0x0`



🐛 Describe the bug

torch.cuda.ExternalStream(stream_ptr, device=...) is documented as wrapping "an externally allocated CUDA stream". That holds for every handle except the legacy default-stream sentinel 0x0 (the NULL stream). When stream_ptr == 0, the call silently returns a freshly allocated, pooled CUDA stream whose cuda_stream attribute is not 0, instead of wrapping the NULL stream.

The companion API torch.cuda.get_stream_from_external(0, device=...) (introduced in 2.7) wraps 0x0 correctly. A workaround therefore exists, and the lower-level binding behind get_stream_from_external is fine; only the ExternalStream constructor path is affected. CuPy's analogous constructor is also faithful: cp.cuda.ExternalStream(0).ptr == 0.

Reproducer

import torch

dev = 0
print(f"torch version: {torch.__version__}")
print(f"current_stream(device={dev}).cuda_stream    = {torch.cuda.current_stream(device=dev).cuda_stream:#x}")

# Bug: ExternalStream(0) does not wrap handle 0.
es0 = torch.cuda.ExternalStream(0x0, device=dev)
print(f"ExternalStream(0).cuda_stream                = {es0.cuda_stream:#x}")
print(f"ExternalStream(0) wraps handle 0?           : {es0.cuda_stream == 0}")

# Re-construct: a different fresh stream each time.
es0_again = torch.cuda.ExternalStream(0x0, device=dev)
print(f"second ExternalStream(0).cuda_stream         = {es0_again.cuda_stream:#x}")

# Counter-cases that work correctly:
es_pt = torch.cuda.ExternalStream(0x2, device=dev)
print(f"ExternalStream(0x2).cuda_stream              = {es_pt.cuda_stream:#x}")  # 0x2

real = torch.cuda.Stream(device=dev)
es_real = torch.cuda.ExternalStream(real.cuda_stream, device=dev)
print(f"ExternalStream(real_stream).cuda_stream      = {es_real.cuda_stream:#x}  (matches real: {es_real.cuda_stream == real.cuda_stream})")

# Workaround (torch >= 2.7).
fixed = torch.cuda.get_stream_from_external(0x0, device=dev)
print(f"get_stream_from_external(0).cuda_stream      = {fixed.cuda_stream:#x}")

Observed output (on torch == 2.10.0+cu130)

No exception is raised — the failure is silent. Output:

torch version: 2.10.0+cu130
current_stream(device=0).cuda_stream    = 0x0
ExternalStream(0).cuda_stream                = 0x43f49520
ExternalStream(0) wraps handle 0?           : False
second ExternalStream(0).cuda_stream         = 0x43f53060
ExternalStream(0x2).cuda_stream              = 0x2
ExternalStream(real_stream).cuda_stream      = 0x43f53220  (matches real: True)
get_stream_from_external(0).cuda_stream      = 0x0

The two interesting lines:

  • ExternalStream(0).cuda_stream returns 0x43f49520, not 0x0, and each subsequent call returns a different fresh handle (0x43f53060, ...).
  • get_stream_from_external(0).cuda_stream correctly returns 0x0, so the underlying C path that wraps external handles is fine; only the ExternalStream constructor path is wrong.
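
Until the constructor is fixed, the workaround can be folded into a small helper. The following is a minimal sketch, not part of PyTorch: the name wrap_external_stream and its cuda parameter are hypothetical, and it assumes only the two calls shown in the reproducer (torch.cuda.ExternalStream and torch.cuda.get_stream_from_external).

```python
def wrap_external_stream(stream_ptr, device=None, cuda=None):
    """Wrap an external CUDA stream handle, including the NULL stream (0x0).

    Hypothetical helper. `cuda` defaults to `torch.cuda`; the parameter
    exists only so the routing logic can be exercised without a GPU.
    """
    if cuda is None:
        import torch  # imported lazily so the helper can be tested with a stub
        cuda = torch.cuda

    if stream_ptr == 0:
        # ExternalStream(0, ...) falls through to the stream pool, so route
        # the NULL handle through get_stream_from_external (torch >= 2.7).
        if not hasattr(cuda, "get_stream_from_external"):
            raise RuntimeError(
                "wrapping handle 0x0 needs torch.cuda.get_stream_from_external "
                "(torch >= 2.7)"
            )
        return cuda.get_stream_from_external(0, device=device)

    # Non-zero handles are wrapped correctly by the existing constructor.
    return cuda.ExternalStream(stream_ptr, device=device)
```

On an affected build, wrap_external_stream(0, device=0).cuda_stream would be expected to report 0x0, matching the get_stream_from_external line in the observed output above.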

Expected output

ExternalStream(0).cuda_stream should be 0x0

Root cause

The C++ constructor THCPStream_pynew in torch/csrc/cuda/Stream.cpp selects which CUDAStream to build via a three-way ternary:

at::cuda::CUDAStream stream = (stream_id || device_index || device_type)
    ? at::cuda::CUDAStream::unpack3(stream_id, device_index, device_type)
    : stream_ptr ? at::cuda::getStreamFromExternal(
                       reinterpret_cast<cudaStream_t>(stream_ptr), current_device)
                 : at::cuda::getStreamFromPool(priority);

Whether you hit the bug depends on which kwargs the Python wrapper forwards.

Path A: ExternalStream(0, device=0) — broken

The Python wrapper at torch/cuda/streams.py forwards only stream_ptr:

class ExternalStream(Stream):
    def __new__(cls, stream_ptr, device=None, **kwargs):
        with torch.cuda.device(device):
            return super().__new__(cls, stream_ptr=stream_ptr, **kwargs)

The other kwargs reach the C++ layer at their declared defaults of 0:

stream_id=0, device_index=0, device_type=0, stream_ptr=0

PyArg_ParseTupleAndKeywords with the format string "|iLLLK" does not track whether a kwarg was supplied, so passing stream_ptr=0 explicitly is indistinguishable from omitting it. The outer condition evaluates to (0 || 0 || 0) == false, so control falls through to the inner ternary. There stream_ptr is also 0, so the inner condition is false as well and the call lands in getStreamFromPool, which returns a fresh pooled stream with an unrelated handle.
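
The fall-through can be reproduced without a GPU by modelling the nested ternary in plain Python. This is an illustrative sketch only: both function names are hypothetical, the return values are just labels for the C++ branches quoted above, and the second function shows one possible way to tell "omitted" apart from "explicitly 0", not the fix PyTorch has adopted.

```python
def select_stream_source(stream_id=0, device_index=0, device_type=0, stream_ptr=0):
    """Model of the nested ternary: return the name of the branch taken."""
    if stream_id or device_index or device_type:
        return "unpack3"
    if stream_ptr:  # a handle of 0 is falsy, exactly like an omitted kwarg
        return "getStreamFromExternal"
    return "getStreamFromPool"


_UNSET = object()  # sentinel meaning "the caller did not pass stream_ptr"


def select_stream_source_tracked(stream_id=0, device_index=0, device_type=0,
                                 stream_ptr=_UNSET):
    """Hypothetical variant that checks presence instead of truthiness."""
    if stream_id or device_index or device_type:
        return "unpack3"
    if stream_ptr is not _UNSET:  # an explicit 0 still counts as supplied
        return "getStreamFromExternal"
    return "getStreamFromPool"
```

With these definitions, select_stream_source(stream_ptr=0) takes the getStreamFromPool branch while select_stream_source(stream_ptr=0x2) takes getStreamFromExternal, which mirrors the observed output. The tracked variant sends an explicit 0 to getStreamFromExternal and reserves the pool for calls that omit stream_ptr.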

Versions

I confirm the code path is identical at every released version I checked, through current main:

| Version | torch/cuda/streams.py::ExternalStream.__new__ | torch/csrc/cuda/Stream.cpp::THCPStream_pynew ternary |
| --- | --- | --- |
| v2.10.0 | only forwards stream_ptr (streams.py) | falls to getStreamFromPool when all kwargs are 0 (Stream.cpp) |
| v2.11.0 | identical (streams.py) | identical (Stream.cpp) |
| v2.12.0-rc9 | identical (streams.py) | identical (Stream.cpp) |
| main | identical (streams.py) | identical (Stream.cpp) |
