pytorch - ✅(Solved) Fix [vllm] [2.12 regression] torch.library.Library.impl("aten::bmm", ..., "CUDA") now fails with "already a kernel registered from python" [2 pull requests]

pytorch, 2026-04-20 20:30:35


In torch 2.12.0, calling torch.library.Library.impl("aten::bmm", fn, "CUDA") raises:

RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

The same call succeeds on torch 2.11.0. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Error Message

File ".../vllm/model_executor/layers/batch_invariant.py", line 974, in enable_batch_invariant_mode
    _batch_invariant_LIB.impl("aten::bmm", bmm_batch_invariant, "CUDA")
File ".../torch/library.py", line 333, in impl
    raise RuntimeError(
RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

Root Cause

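Torch 2.12 added a torch._native subsystem that registers a built-in Triton aten::bmm kernel for CUDA at import time (https://github.com/pytorch/pytorch/pull/179082). When vLLM later calls lib.impl("aten::bmm", ..., "CUDA"), the duplicate-registration check in torch/library.py finds the key already in _impls and raises:

RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

The same call succeeds on torch 2.11.0, which has no such import-time registration. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).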

Fix Action

Fix / Workaround

Pass allow_override=True to the lib.impl() call for aten::bmm in vllm/model_executor/layers/batch_invariant.py (vllm-project/vllm PR #40562). According to the PR, this is the intended API for replacing an existing dispatcher-level kernel, and the parameter has been available since torch 2.8.

PR fix notes

PR #40562: [Bugfix][Torch 2.12] Fix batch_invariant test with allow_override for torch 2.12 upgrade

Repository: vllm-project/vllm
Author: Lucaskabela
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40562

Description (problem / solution / changelog)

Purpose

Fixes https://github.com/pytorch/pytorch/issues/180905

Torch 2.12 added a new torch._native subsystem that registers a built-in Triton aten::bmm outer-product kernel for CUDA at import time in https://github.com/pytorch/pytorch/pull/179082. When vLLM later calls lib.impl("aten::bmm", ..., "CUDA"), the duplicate-registration check in torch/library.py finds the key already in _impls and raises a RuntimeError.
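The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not PyTorch source: the ToyLibrary class, the key format, and the error wording are hypothetical, and the way allow_override bypasses the check is assumed from the PR description rather than read from torch/library.py.

```python
from typing import Callable, Dict


class ToyLibrary:
    """Hypothetical stand-in for the Python-side registration bookkeeping."""

    # Shared by every instance, mimicking a process-wide table of
    # kernels that were registered from Python.
    _impls: Dict[str, Callable] = {}

    def impl(self, op_name, fn, dispatch_key, *, allow_override=False):
        key = f"{op_name}/{dispatch_key}"
        # Duplicate-registration check: a second registration for the same
        # (operator, dispatch key) pair is rejected unless explicitly allowed.
        if key in ToyLibrary._impls and not allow_override:
            raise RuntimeError(
                f"already a kernel registered from python for {key}"
            )
        ToyLibrary._impls[key] = fn


def builtin_bmm(a, b):
    return "builtin"


def custom_bmm(a, b):
    return "custom"


# Simulates the import-time registration that torch 2.12 is reported to do.
ToyLibrary().impl("aten::bmm", builtin_bmm, "CUDA")
```

In this model a later ToyLibrary().impl("aten::bmm", custom_bmm, "CUDA") raises RuntimeError, while the same call with allow_override=True replaces the stored kernel.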

Added allow_override=True to the lib.impl() call for aten::bmm in vllm/model_executor/layers/batch_invariant.py:969. This is the intended API for replacing an existing dispatcher-level kernel — it's the same parameter torch's own built-in kernels use. The parameter has been available since torch 2.8, so no backwards compatibility concern.
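The change described above might look like the following. This is a sketch under assumptions, not the actual diff from the PR: the helper name register_bmm_override is hypothetical, lib is expected to be the torch.library.Library instance from the reproduction, and kernel stands for vLLM's bmm_batch_invariant.

```python
from typing import Any, Callable


def register_bmm_override(lib: Any, kernel: Callable[..., Any]) -> None:
    """Register `kernel` as the CUDA implementation of aten::bmm.

    allow_override=True tells Library.impl() that replacing a kernel
    already registered from Python is intentional. Per the PR description
    the parameter is available since torch 2.8.
    """
    lib.impl("aten::bmm", kernel, "CUDA", allow_override=True)
```

The helper takes the library object as an argument so it can be exercised without a GPU; in vLLM the equivalent call sits inline in enable_batch_invariant_mode.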

Test Plan

pytest tests/v1/determinism/test_batch_invariance.py -v

Test Result

============================================ 9 passed, 17 warnings in 445.26s (0:07:25) =============================================


Changed files

vllm/model_executor/layers/batch_invariant.py (modified, +6/-2)



Summary

In torch 2.12.0, calling torch.library.Library.impl("aten::bmm", fn, "CUDA") raises:

RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

The same call succeeds on torch 2.11.0. This is blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).

Environment

torch: 2.12.0+cu130 (test channel)
torchvision: 0.27.0+cu130
triton: 3.7.0
CUDA: 13.0
Python: 3.12
GPU: reproduces on H100 and B200

Reproduction

vLLM registers several python-level overrides in one function. Only aten::bmm fails:

lib = torch.library.Library("batch_invariance", "FRAGMENT")
lib.impl("aten::_log_softmax", _log_softmax_batch_invariant, "CUDA")  # OK
lib.impl("aten::softmax",       softmax_batch_invariant,     "CUDA")  # OK
lib.impl("aten::_softmax",      softmax_batch_invariant,     "CUDA")  # OK
lib.impl("aten::mean.dim",      mean_batch_invariant,        "CUDA")  # OK
lib.impl("aten::bmm",           bmm_batch_invariant,         "CUDA")  # FAILS on 2.12

Source: vllm/model_executor/layers/batch_invariant.py:974 — https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/batch_invariant.py#L974

Traceback

File ".../vllm/model_executor/layers/batch_invariant.py", line 974, in enable_batch_invariant_mode
    _batch_invariant_LIB.impl("aten::bmm", bmm_batch_invariant, "CUDA")
File ".../torch/library.py", line 333, in impl
    raise RuntimeError(
RuntimeError: This is not allowed since there's already a kernel registered from python overriding bmm's behavior for CUDA dispatch key and aten namespace.

Question

Did torch 2.12 add a built-in python-level override of aten::bmm on CUDA (or tighten duplicate-registration validation)? If so, what's the intended migration path for downstream projects that previously registered a python override of aten::bmm?

Affected test suites (all fail at EngineCore init)

tests/v1/determinism/test_batch_invariance.py — 7 tests, on both H100 and B200
Distributed DP Tests (2 GPU / 4 GPU)
Distributed Tests (2 GPUs)(H100)
Model Runner V2 Distributed (2 GPUs)

extent analysis

TL;DR

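The issue can be resolved by passing allow_override=True to the Library.impl() call for aten::bmm on CUDA. Torch 2.12 registers its own Python-level aten::bmm CUDA kernel at import time, so a second registration is rejected unless the override is explicit.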

Guidance

Confirm the cause: torch 2.12's torch._native subsystem registers a built-in Triton aten::bmm kernel for CUDA at import time (pytorch/pytorch#179082).
Decide whether the custom override is still needed. If the built-in kernel provides the required behavior, remove the custom registration.
If the custom override is still needed, as with vLLM's batch-invariant kernel, pass allow_override=True to lib.impl(). Per PR #40562, the parameter has been available since torch 2.8.
Re-run the affected test suites on torch 2.12 after the change, starting with tests/v1/determinism/test_batch_invariance.py.

Example

PR #40562 changes the aten::bmm registration in vllm/model_executor/layers/batch_invariant.py so that the lib.impl() call passes allow_override=True.
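For downstream code that must also run on torch releases that predate allow_override, one option is to detect the parameter at runtime instead of comparing version strings. The helper names below are hypothetical, and the fallback branch assumes the older release has no conflicting built-in registration. vLLM's PR does not need this, since it states there is no backwards compatibility concern.

```python
import inspect
from typing import Any, Callable


def supports_allow_override(impl_method: Callable[..., Any]) -> bool:
    """Return True if `impl_method` accepts an `allow_override` parameter."""
    try:
        params = inspect.signature(impl_method).parameters
    except (TypeError, ValueError):
        # Signature not introspectable; assume the parameter is missing.
        return False
    return "allow_override" in params


def register_override(lib: Any, op_name: str, kernel: Callable[..., Any],
                      dispatch_key: str) -> None:
    """Register `kernel`, requesting an explicit override when supported."""
    if supports_allow_override(lib.impl):
        lib.impl(op_name, kernel, dispatch_key, allow_override=True)
    else:
        # Older API: plain registration, with no way to request an override.
        lib.impl(op_name, kernel, dispatch_key)
```

A call such as register_override(lib, "aten::bmm", bmm_batch_invariant, "CUDA") would then work with either signature of Library.impl().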

Notes

The migration path given in PR #40562 is allow_override=True, which the PR describes as the intended API for replacing an existing dispatcher-level kernel. The PR was still open and unmerged when this page was captured.

Recommendation

Apply the fix from PR #40562: keep the custom Python-level override of aten::bmm for CUDA and register it with allow_override=True. The PR reports 9 passed for tests/v1/determinism/test_batch_invariance.py with this change, which unblocks the upgrade to torch 2.12.
