pytorch - ✅(Solved) Fix [vllm] [2.12 regression] torch.library.Library.impl("aten::bmm", ..., "CUDA") now fails with "already a kernel registered from python" [2 pull requests]


In torch 2.12.0, calling torch.library.Library.impl("aten::bmm", fn, "CUDA") raises:

RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

The same call succeeds on torch 2.11.0. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Error Message

File ".../vllm/model_executor/layers/batch_invariant.py", line 974, in enable_batch_invariant_mode
    _batch_invariant_LIB.impl("aten::bmm", bmm_batch_invariant, "CUDA")
File ".../torch/library.py", line 333, in impl
    raise RuntimeError(
RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

Root Cause

Torch 2.12 added a torch._native subsystem that registers a built-in Triton aten::bmm outer-product kernel for CUDA at import time (https://github.com/pytorch/pytorch/pull/179082). When vLLM later calls lib.impl("aten::bmm", ..., "CUDA"), the duplicate-registration check in torch/library.py finds the key already in _impls and raises the RuntimeError. Torch 2.11.0 has no such built-in registration, which is why the same call succeeds there.

Fix Action

Pass allow_override=True to the lib.impl() call for aten::bmm in vllm/model_executor/layers/batch_invariant.py, as done in vLLM PR #40562. According to that PR, allow_override is the intended API for replacing an existing dispatcher-level kernel and has been available since torch 2.8.

PR fix notes

PR #40562: [Bugfix][Torch 2.12] Fix batch_invariant test with allow_override for torch 2.12 upgrade

Description (problem / solution / changelog)

Purpose

Fixes https://github.com/pytorch/pytorch/issues/180905

Torch 2.12 added a new torch._native subsystem that registers a built-in Triton aten::bmm outer-product kernel for CUDA at import time in https://github.com/pytorch/pytorch/pull/179082. When vLLM later calls lib.impl("aten::bmm", ..., "CUDA"), the duplicate-registration check in torch/library.py finds the key already in _impls and raises a RuntimeError.
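The failure mode described above can be modelled in a few lines of plain Python. This is only a toy sketch of a registration table with a duplicate check; `ToyDispatcher`, `builtin_bmm`, and `batch_invariant_bmm` are hypothetical names, and the real logic in torch/library.py is more involved.

```python
from typing import Callable, Dict


class ToyDispatcher:
    """Toy model of a kernel table with a duplicate-registration check.

    Hypothetical illustration only; not torch's actual implementation.
    """

    def __init__(self) -> None:
        self._impls: Dict[str, Callable] = {}

    def impl(self, op_name: str, fn: Callable, dispatch_key: str,
             *, allow_override: bool = False) -> None:
        key = f"{op_name}/{dispatch_key}"
        # A second registration for the same key is rejected unless the
        # caller explicitly opts in to replacing the existing kernel.
        if key in self._impls and not allow_override:
            raise RuntimeError(f"already a kernel registered from python for {key}")
        self._impls[key] = fn

    def call(self, op_name: str, dispatch_key: str, *args):
        return self._impls[f"{op_name}/{dispatch_key}"](*args)


def builtin_bmm(a, b):
    return "builtin"


def batch_invariant_bmm(a, b):
    return "batch_invariant"


dispatcher = ToyDispatcher()

# Step 1: the framework registers its own kernel first (at import time).
dispatcher.impl("aten::bmm", builtin_bmm, "CUDA")

# Step 2: a downstream project registers the same op and key without opting in.
try:
    dispatcher.impl("aten::bmm", batch_invariant_bmm, "CUDA")
    first_attempt = "registered"
except RuntimeError as exc:
    first_attempt = f"rejected: {exc}"

# Step 3: the same registration succeeds once the override is explicit.
dispatcher.impl("aten::bmm", batch_invariant_bmm, "CUDA", allow_override=True)
active = dispatcher.call("aten::bmm", "CUDA", None, None)
```

In this model the order of registration is what matters: the downstream call only fails because something else claimed the same op and dispatch key first.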

Added allow_override=True to the lib.impl() call for aten::bmm in vllm/model_executor/layers/batch_invariant.py:969. This is the intended API for replacing an existing dispatcher-level kernel; it is the same parameter torch's own built-in kernels use. The parameter has been available since torch 2.8, so there is no backwards-compatibility concern.
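For projects that must also run on torch releases older than the one that introduced allow_override, one possible approach is to pass the keyword only when the installed `impl` method accepts it. The sketch below is an assumption, not part of the PR: `impl_with_override` is a hypothetical helper, and the two stub classes stand in for old and new `Library` objects so the example runs without torch installed.

```python
import inspect


def impl_with_override(lib, op_name, fn, dispatch_key):
    """Register fn via lib.impl, requesting an override when supported.

    Hypothetical helper: it inspects the bound impl method and passes
    allow_override=True only if that keyword exists in its signature.
    """
    params = inspect.signature(lib.impl).parameters
    if "allow_override" in params:
        lib.impl(op_name, fn, dispatch_key, allow_override=True)
    else:
        lib.impl(op_name, fn, dispatch_key)


class OldStyleLib:
    """Stub for a library object whose impl has no allow_override keyword."""

    def __init__(self):
        self.calls = []

    def impl(self, op_name, fn, dispatch_key=""):
        self.calls.append((op_name, dispatch_key, None))


class NewStyleLib:
    """Stub for a library object whose impl accepts allow_override."""

    def __init__(self):
        self.calls = []

    def impl(self, op_name, fn, dispatch_key="", *, allow_override=False):
        self.calls.append((op_name, dispatch_key, allow_override))


def bmm_stub(a, b):
    return None


old_lib, new_lib = OldStyleLib(), NewStyleLib()
impl_with_override(old_lib, "aten::bmm", bmm_stub, "CUDA")
impl_with_override(new_lib, "aten::bmm", bmm_stub, "CUDA")
```

With a real `torch.library.Library` instance the call would look like `impl_with_override(lib, "aten::bmm", bmm_batch_invariant, "CUDA")`. Since the PR states the parameter has existed since torch 2.8, passing allow_override=True directly is simpler whenever older versions are out of scope.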

Test Plan

pytest tests/v1/determinism/test_batch_invariance.py -v

Test Result

============================================ 9 passed, 17 warnings in 445.26s (0:07:25) =============================================

Changed files

  • vllm/model_executor/layers/batch_invariant.py (modified, +6/-2)


Summary

In torch 2.12.0, calling torch.library.Library.impl("aten::bmm", fn, "CUDA") raises:

RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

The same call succeeds on torch 2.11.0. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Environment

  • torch: 2.12.0+cu130 (test channel)
  • torchvision: 0.27.0+cu130
  • triton: 3.7.0
  • CUDA: 13.0
  • Python: 3.12
  • GPU: reproduces on H100 and B200

Reproduction

vLLM registers several python-level overrides in one function. Only aten::bmm fails:

lib = torch.library.Library("batch_invariance", "FRAGMENT")
lib.impl("aten::_log_softmax", _log_softmax_batch_invariant, "CUDA")  # OK
lib.impl("aten::softmax",       softmax_batch_invariant,     "CUDA")  # OK
lib.impl("aten::_softmax",      softmax_batch_invariant,     "CUDA")  # OK
lib.impl("aten::mean.dim",      mean_batch_invariant,        "CUDA")  # OK
lib.impl("aten::bmm",           bmm_batch_invariant,         "CUDA")  # FAILS on 2.12

Source: vllm/model_executor/layers/batch_invariant.py:974 (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/batch_invariant.py#L974)

Traceback

File ".../vllm/model_executor/layers/batch_invariant.py", line 974, in enable_batch_invariant_mode
    _batch_invariant_LIB.impl("aten::bmm", bmm_batch_invariant, "CUDA")
File ".../torch/library.py", line 333, in impl
    raise RuntimeError(
RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

Question

Did torch 2.12 add a built-in python-level override of aten::bmm on CUDA (or tighten duplicate-registration validation)? If so, what's the intended migration path for downstream projects that previously registered a python override of aten::bmm?

Affected test suites (all fail at EngineCore init)

  • tests/v1/determinism/test_batch_invariance.py — 7 tests, on both H100 and B200
  • Distributed DP Tests (2 GPU / 4 GPU)
  • Distributed Tests (2 GPUs)(H100)
  • Model Runner V2 Distributed (2 GPUs)

Links

cc @anjali411 @chauhang @penguinwu @bdhirsh @bobrenjc93 @aorenste

extent analysis

TL;DR

Torch 2.12 registers a built-in aten::bmm kernel for CUDA at import time, so vLLM's own registration is rejected as a duplicate. Passing allow_override=True to the lib.impl() call for aten::bmm resolves the error (vLLM PR #40562).

Guidance

  • The built-in aten::bmm kernel comes from the new torch._native subsystem; see https://github.com/pytorch/pytorch/pull/179082 for the change that introduced it.
  • If the custom override is still needed, as it is for vLLM's batch-invariant mode, keep it and pass allow_override=True to lib.impl() so it replaces the built-in kernel.
  • According to the PR, allow_override has been available since torch 2.8, so the change does not break older supported torch versions.
  • Re-run the affected test suites after the change, for example: pytest tests/v1/determinism/test_batch_invariance.py -v

Example

The fix in PR #40562 amounts to one extra keyword argument on the failing call: lib.impl("aten::bmm", bmm_batch_invariant, "CUDA", allow_override=True).

Notes

The fix touches only vllm/model_executor/layers/batch_invariant.py (+6/-2). The PR reports 9 passed for tests/v1/determinism/test_batch_invariance.py after the change.

Recommendation

Apply the fix: add allow_override=True to the aten::bmm registration so that vLLM's batch-invariant kernel replaces the built-in one on torch 2.12. This should let the affected test suites get past EngineCore init and unblock the upgrade to torch 2.12.
