pytorch - ✅ (Solved) Fix Library teardown leaves stale _OpNamespace cache, breaking CIA op enumeration (TypeError: 'CustomDecompTable' object is not a mapping) [1 pull request, 1 participant]

pytorch 2026-04-28 18:11:08

GitHub stats

pytorch/pytorch#181765 • Fetched 2026-04-29 06:11:04

Author

aorenste

Participants

aorenste

Timeline (top)

mentioned ×7, subscribed ×7, labeled ×6, closed ×1

Destroying a torch.library.Library (DEF mode) after its op has been called leaves torch.ops in a poisoned state: the _OpNamespace still lists the namespace and OpOverloadPacket._overload_names still lists 'default', but getattr(packet, 'default') raises AttributeError because the underlying op has been deregistered from the C++ dispatcher.

Any subsequent code that walks all CompositeImplicitAutograd ops crashes — most visibly, {**core_aten_decompositions()} at torch/_inductor/decomposition.py:108 surfaces the underlying AttributeError as the misleading TypeError: 'CustomDecompTable' object is not a mapping (CPython's DICT_UPDATE opcode rewrites AttributeError that way).
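The masking described above can be reproduced without torch. The sketch below uses a hypothetical FailingTable class (not part of PyTorch) whose keys() raises AttributeError, standing in for the lazy decomposition table. On CPython versions where DICT_UPDATE behaves as the issue describes, the dict-spread should report a TypeError, while dict(...) should let the original AttributeError through.

```python
class FailingTable:
    """Hypothetical mapping-like object whose keys() fails with AttributeError."""

    def keys(self):
        # Stand-in for the stale-overload lookup failing during enumeration.
        raise AttributeError("has no overload name 'default'")

    def __getitem__(self, key):
        raise KeyError(key)


def error_from(fn):
    """Run fn and return (exception type name, message), or None if it succeeds."""
    try:
        fn()
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


# Dict-spread goes through DICT_UPDATE, which rewrites AttributeError.
spread = error_from(lambda: {**FailingTable()})
# The dict() constructor does not perform that rewrite.
direct = error_from(lambda: dict(FailingTable()))

print(spread)
print(direct)
```

Comparing the two results shows why the reported TypeError points at the wrong place: the exception type and message change only on the dict-spread path.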

Error Message

import torch
from torch.library import Library, impl

my_lib = Library('my_lib', 'DEF')
my_lib.define('my_func() -> None')

@impl(my_lib, 'my_func', '')
def my_func(): pass

torch.ops.my_lib.my_func()   # required
del my_lib                    # required

from torch._inductor.cudagraph_trees import reset_cudagraph_trees
# TypeError: 'CustomDecompTable' object is not a mapping

Fix / Workaround

_collect_all_valid_cia_ops_for_namespace skips overloads that no longer resolve (cheapest, most defensive).
Library.__del__ cleans up the corresponding _OpNamespace cache and torch.ops._dir entries (most correct, but trickier — multiple Library objects can share a namespace).
OpOverloadPacket.overloads() re-queries the dispatcher rather than returning the cached _overload_names.

PR fix notes

PR #181785: Fix Library finalizer not clearing torch.ops cache

Repository: pytorch/pytorch
Author: aorenste
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/181785

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

-> #181785

Calling del lib on a torch.library.Library ran the weakref finalizer but left the cached OpOverloadPacket on torch.ops.<ns>.<name> in place. The C++ side had been reset, but OpOverloadPacket._overload_names still listed 'default' while getattr(packet, 'default') now raised AttributeError. Any later code that walked all CompositeImplicitAutograd ops would hit that -- most visibly, {**core_aten_decompositions(), **inductor_decompositions} at torch/_inductor/decomposition.py:108, where CPython's DICT_UPDATE opcode rewrote the AttributeError as the misleading TypeError: 'CustomDecompTable' object is not a mapping.

Library._destroy() already did this cleanup, so the explicit-teardown path (e.g. _scoped_library) was fine; only the gc-driven finalizer path was broken. Extracted the cleanup into _clear_torch_ops_cache() and called it from _del_library() as well.
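The shape of that change can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyLibrary, OPS_CACHE, and _clear_ops_cache are hypothetical stand-ins, not the code in torch/library.py. The point it shows is that one shared cleanup helper is reachable from both the explicit-teardown path and the gc-driven finalizer path.

```python
import gc
import weakref

# Hypothetical stand-in for the Python-side torch.ops cache.
OPS_CACHE = {}


def _clear_ops_cache(op_names):
    """Shared cleanup: drop every cached entry this library registered."""
    for qualname in op_names:
        OPS_CACHE.pop(qualname, None)


class ToyLibrary:
    def __init__(self, ns):
        self.ns = ns
        self._op_names = set()
        # The finalizer captures the set of names, not self, so it does not
        # keep the library alive. It runs at most once.
        self._finalizer = weakref.finalize(self, _clear_ops_cache, self._op_names)

    def define(self, name):
        qualname = f"{self.ns}::{name}"
        self._op_names.add(qualname)
        OPS_CACHE[qualname] = object()

    def destroy(self):
        # Explicit teardown reuses the same cleanup as the finalizer.
        self._finalizer()


lib = ToyLibrary("my_lib")
lib.define("my_func")
del lib          # gc-driven path: the finalizer clears the cache entry
gc.collect()
print("my_lib::my_func" in OPS_CACHE)
```

Because both paths call the same helper, the cache cannot be left populated by one path and cleared by the other, which is the asymmetry the PR description identifies.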

This was hitting slow-gradcheck periodic in test_torch.py: test_storage_preserve_nonhermetic_in_hermetic_context creates a DEF-mode Library, calls its op, then lets it go out of scope; subsequent CUDA tests then crashed in before_cuda_memory_leak_check -> torch._dynamo.reset() -> the lazy import chain above.

Fixes #181765. The flaky-test disabler had been auto-creating per-test issues (#181343, #181537, #181684, #119515) papering over this -- those tests should be re-enabled once this lands.

Authored with Claude.

Changed files

test/test_python_dispatch.py (modified, +22/-0)
torch/library.py (modified, +24/-15)


Summary

Repro

import torch
from torch.library import Library, impl

my_lib = Library('my_lib', 'DEF')
my_lib.define('my_func() -> None')

@impl(my_lib, 'my_func', '')
def my_func(): pass

torch.ops.my_lib.my_func()   # required
del my_lib                    # required

from torch._inductor.cudagraph_trees import reset_cudagraph_trees
# TypeError: 'CustomDecompTable' object is not a mapping

The actual error, hidden by the dict-spread, is at torch/_export/utils.py:1368 in _collect_all_valid_cia_ops_for_namespace:

AttributeError: The underlying op of 'my_lib.my_func' has no overload name 'default'

How this manifests in CI

slow-gradcheck periodic jobs (e.g. job 73398436183 in run 25055128899) hit this in test_torch.py after test_storage_preserve_nonhermetic_in_hermetic_context poisons the namespace. Subsequent CUDA tests then fail in before_cuda_memory_leak_check → torch._dynamo.reset() → lazy import chain into torch._inductor.decomposition.

The trigger landed in #181040, which added the before_cuda_memory_leak_check hook calling torch._dynamo.reset(). The latent bug in CIA op enumeration predates that — likely from #137650 (lazy decomp table) — but only became reachable mid-suite after #181040.

The flaky-test disabler has been auto-creating per-test "DISABLED ..." issues (#181343, #181684, #181537, ...) that paper over the symptom without addressing the root cause.

Possible fixes

_collect_all_valid_cia_ops_for_namespace skips overloads that no longer resolve (cheapest, most defensive).
Library.__del__ cleans up the corresponding _OpNamespace cache and torch.ops._dir entries (most correct, but trickier — multiple Library objects can share a namespace).
OpOverloadPacket.overloads() re-queries the dispatcher rather than returning the cached _overload_names.
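The first and third options can be contrasted on a toy model of the stale cache. Everything here is a hypothetical stand-in: REGISTRY plays the role of the C++ dispatcher and ToyPacket the role of an OpOverloadPacket, and none of it is PyTorch code. The sketch shows a cached overload list going stale, a defensive enumeration that skips names that no longer resolve, and a lookup that re-queries the registry instead.

```python
# Hypothetical stand-in for the dispatcher's view of registered overloads.
REGISTRY = {"my_lib::my_func": {"default"}}


class ToyPacket:
    def __init__(self, qualname):
        self.qualname = qualname
        # Cached once at creation, like the _overload_names described above.
        self._overload_names = sorted(REGISTRY[qualname])

    def overloads(self):
        """Return the cached names, which may be stale."""
        return list(self._overload_names)

    def live_overloads(self):
        """Option 3: re-query the registry on every call."""
        return sorted(REGISTRY.get(self.qualname, ()))

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in REGISTRY.get(self.qualname, ()):
            return f"{self.qualname}.{name}"
        raise AttributeError(
            f"The underlying op of '{self.qualname}' has no overload name '{name}'"
        )


def resolvable_overloads(packet):
    """Option 1: skip cached overload names that no longer resolve."""
    resolved = []
    for name in packet.overloads():
        try:
            resolved.append(getattr(packet, name))
        except AttributeError:
            continue
    return resolved


packet = ToyPacket("my_lib::my_func")
before = resolvable_overloads(packet)
del REGISTRY["my_lib::my_func"]   # simulate the library being torn down
after = resolvable_overloads(packet)
print(before, after, packet.overloads(), packet.live_overloads())
```

Skipping unresolvable names keeps enumeration alive but leaves the stale cache in place, whereas re-querying removes the staleness at the cost of a lookup on every call.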

Independently, torch/_inductor/decomposition.py:108 should arguably use dict(core_aten_decompositions()) instead of {**...} so the real error isn't masked.

cc @anjali411 @chauhang @penguinwu @bdhirsh @bobrenjc93 @jansel @tugsbayasgalan

extent analysis

TL;DR

To fix the issue, modify _collect_all_valid_cia_ops_for_namespace to skip overloads that no longer resolve, or implement a cleanup mechanism in Library.__del__ to remove the corresponding _OpNamespace cache and torch.ops._dir entries.

Guidance

Identify and skip overloads that no longer resolve in _collect_all_valid_cia_ops_for_namespace to prevent the AttributeError.
Consider implementing a cleanup mechanism in Library.__del__ to remove the corresponding _OpNamespace cache and torch.ops._dir entries when a torch.library.Library is destroyed.
Modify torch/_inductor/decomposition.py:108 to use dict(core_aten_decompositions()) instead of {**...} to prevent masking the real error.
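Keep the history in mind when triaging: #181040, which added the before_cuda_memory_leak_check hook calling torch._dynamo.reset(), only made the bug reachable mid-suite. The latent bug in CIA op enumeration predates it, likely from #137650 (lazy decomp table).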

Example

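# Sketch of skipping overloads that no longer resolve.
# Not the upstream implementation: the CIA-specific filtering is omitted.
def _collect_all_valid_cia_ops_for_namespace(op_namespace):
    valid_ops = set()
    # Assumes iterating an _OpNamespace yields op names, not op objects.
    for op_name in op_namespace:
        op_packet = getattr(op_namespace, op_name)
        for overload_name in op_packet.overloads():
            try:
                op_overload = getattr(op_packet, overload_name)
            except AttributeError:
                # Stale cache entry: the op was deregistered from the dispatcher.
                continue
            valid_ops.add(op_overload)
    return valid_ops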

Notes

The provided fixes may have varying levels of complexity and correctness. Implementing a cleanup mechanism in Library.__del__ may be more correct but also trickier due to the possibility of multiple Library objects sharing a namespace.

Recommendation

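As a local workaround, modify _collect_all_valid_cia_ops_for_namespace to skip overloads that no longer resolve; it is the cheapest and most defensive option and prevents the AttributeError from crashing enumeration. Note that PR #181785 takes the cleanup route instead: it extracts the cache cleanup into _clear_torch_ops_cache() and calls it from _del_library() as well as Library._destroy().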
