pytorch: Audit license-files glob in pyproject.toml for over-collection



🚀 The feature, motivation and pitch

#180237 migrates PyTorch's bundled-license shipping mechanism from the legacy setup.py-concatenated LICENSE to the PEP 639 license-files field. As part of that PR's review, the question came up of whether the (old and new) license-files glob accurately represents the distribution's licensing surface.

The PR uses a minimal recursive glob that is designed to be identical to the old collection logic:

license-files = [
    "LICENSE",
    "third_party/**/LICENSE",
    "third_party/**/LICENSE.txt",
    "third_party/**/LICENSE.rst",
    "third_party/**/COPYING.BSD",
]

This matches ~108 files per built wheel. Inspection shows it over-collects in several categories:

| Category | Examples |
| --- | --- |
| Test fixtures (most concerning -- one is GPL-3.0) | kineto/libkineto/third_party/dynolog/third_party/cpr/test/LICENSE (GPL-3.0), fbgemm/fbgemm_gpu/test/quantize/mx/LICENSE |
| Test frameworks | googletest/LICENSE and 10+ copies of googletest/, googlemock/, gtest/, doctest/ across nested submodules |
| Documentation pages (not license text) | composable_kernel/docs/LICENSE.rst and copies in aiter, fbgemm, flash-attention, mslk; fbgemm/fbgemm_gpu/docs/src/general/LICENSE.rst; NVTX/tools/docs/github-markdown-css/LICENSE |
| Build/lint tools (Python sources, not in any binary) | hipify_torch/LICENSE.txt (3 copies), kineto/.../json/third_party/cpplint/LICENSE |
| Language bindings PyTorch doesn't ship | flatbuffers/dart/LICENSE, flatbuffers/swift/LICENSE, cutlass/python/LICENSE.txt (4 copies), NVTX/python/LICENSE.txt, nccl/bindings/nccl4py/LICENSE.txt |
| Example / sample code | prometheus-cpp/3rdparty/civetweb/examples/rest/cJSON/LICENSE, two duktape-*/LICENSE.txt copies under civetweb |
| dynolog's transitive deps (never compiled) | dynolog/third_party/{cpr,DCGM,fmt,pfs,prometheus-cpp}/LICENSE* and their nested transitive deps |
| Redundant duplicates | NVTX/docs/LICENSE.txt and NVTX/python/LICENSE.txt are byte-identical copies of NVTX/LICENSE.txt |

After conservative exclusion, ~61 of the 108 files represent code that actually ships in the wheel; the remainder are over-collection artifacts of the recursive glob.
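
A count and split like this can be reproduced against a local checkout. The following is a minimal sketch, not the procedure used to arrive at the figures above. It assumes `pathlib.Path.glob` behaves closely enough to PEP 639 globbing for these five patterns. The helper names (`expand_license_globs`, `is_probably_shipped`, `split_by_exclusion`) and the `NON_SHIPPED_DIRS` set are hypothetical, with the directory names drawn loosely from the category table. A component-based filter cannot catch every category (dynolog's transitive deps and hipify_torch need per-project knowledge), so treat its output as a starting point for manual review.

```python
from pathlib import Path, PurePosixPath

# The five patterns from the license-files field quoted in this issue.
LICENSE_GLOBS = [
    "LICENSE",
    "third_party/**/LICENSE",
    "third_party/**/LICENSE.txt",
    "third_party/**/LICENSE.rst",
    "third_party/**/COPYING.BSD",
]

# Assumption: directory names that suggest a license file covers code
# that is not part of the wheel. This set is illustrative only.
NON_SHIPPED_DIRS = {
    "test", "tests", "docs", "examples",
    "python", "dart", "swift",
    "googletest", "googlemock", "gtest", "doctest", "cpplint",
}


def expand_license_globs(repo_root, patterns=LICENSE_GLOBS):
    """Return sorted, de-duplicated POSIX-style relative paths matched by the patterns."""
    root = Path(repo_root)
    matched = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                matched.add(path.relative_to(root).as_posix())
    return sorted(matched)


def is_probably_shipped(rel_path):
    """Heuristic: keep a file unless one of its parent directories is in NON_SHIPPED_DIRS."""
    parent_dirs = PurePosixPath(rel_path).parts[:-1]
    return not any(part in NON_SHIPPED_DIRS for part in parent_dirs)


def split_by_exclusion(paths):
    """Partition paths into (kept, dropped) using the heuristic above."""
    kept = [p for p in paths if is_probably_shipped(p)]
    dropped = [p for p in paths if not is_probably_shipped(p)]
    return kept, dropped


if __name__ == "__main__":
    kept, dropped = split_by_exclusion(expand_license_globs("."))
    print(f"kept {len(kept)}, dropped {len(dropped)}")
```

Run from the repository root with submodules checked out; the counts depend on which submodules are present, so they may not match the ~108 and ~61 figures exactly.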

This is pre-existing behaviour. The old setup.py concat_license_files mechanism that #180237 replaces walked third_party/ with the same broad criteria and produced an equivalently over-collected LICENSE blob in every release wheel. The PEP 639 layout just makes it visible per-file rather than buried in concatenated text.

The technical layout migration (#180237) is in scope to land first; this issue is the follow-up to discuss and implement the audit of which files belong in the set.

cc @malfet @atalman @tinglvv @nWEIdia @rgommers @seemethere

Alternatives

In rough order of investment:

  1. Tighten the recursive globs to a finite depth. Five explicit-depth patterns (third_party/*/LICENSE* through third_party/*/*/*/*/*/LICENSE*) drop all files at path depth >= 8, which removes the GPL-3.0 cpr/test/LICENSE and most dynolog-transitive over-collection. Doesn't address shallower over-collection (googletest at depth 3, cutlass/python at depth 4, etc.).

  2. Enumerate license-files explicitly per shipped submodule. Matches NumPy's pattern. ~61 paths, sorted, each line auditable. Most precise; adds maintenance burden whenever a vendored submodule moves or a new one is added.

  3. Compute license-files dynamically. Use dynamic = ["license-files"] with a build-backend hook that walks third_party/ and applies exclude patterns. Setuptools 77+ may support this; scikit-build-core may need a custom metadata provider. Most flexible, biggest infra investment.

  4. Pair the chosen approach with the SPDX license expression update. @rgommers' review on #180237 also asked for the SPDX license field to reflect the actual distribution licenses; that's coupled to whichever subset of files we decide ships, so should be tackled in the same pass.
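
The depth arithmetic in alternative 1 can be checked mechanically. This is a minimal sketch, assuming `*` matches within a single path component only (as PEP 639 globs are understood to behave). The helper names `depth_patterns` and `matches_any` are hypothetical. Matching is done component by component, because `fnmatch` on a whole path would let `*` cross `/`.

```python
from fnmatch import fnmatchcase


def depth_patterns(max_depth=5, root="third_party", leaf="LICENSE*"):
    """Build explicit-depth patterns: root/*/leaf through root/*/.../*/leaf."""
    return [
        "/".join([root] + ["*"] * depth + [leaf])
        for depth in range(1, max_depth + 1)
    ]


def matches_any(path, patterns):
    """True if path matches some pattern, comparing one path component at a time."""
    parts = path.split("/")
    for pattern in patterns:
        pattern_parts = pattern.split("/")
        if len(parts) == len(pattern_parts) and all(
            fnmatchcase(part, pat) for part, pat in zip(parts, pattern_parts)
        ):
            return True
    return False


patterns = depth_patterns()

# Paths quoted in this issue, written relative to the repository root.
cpr_test = "third_party/kineto/libkineto/third_party/dynolog/third_party/cpr/test/LICENSE"
googletest = "third_party/googletest/LICENSE"
cutlass_python = "third_party/cutlass/python/LICENSE.txt"

for path in (cpr_test, googletest, cutlass_python):
    print(matches_any(path, patterns), path)
```

The deepest pattern has seven components, so the nine-component cpr/test path falls outside every pattern, while the googletest and cutlass/python paths still match. That is the shallower over-collection this alternative leaves in place.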

Additional context

Why cpr/test deserves a specific call-out

third_party/kineto/libkineto/third_party/dynolog/third_party/cpr/test/LICENSE is GPL-3.0. Including it in the distribution's license-files set would, on a strict reading of PEP 639, imply the distribution contains GPL-3.0 code.

It does not. The full chain of gates:

  1. PyTorch only invokes libkineto's CMakeLists, not dynolog's top-level CMakeLists. So dynolog_lib (the full dynolog with cpr) is never built.
  2. libkineto only links the dynolog_ipcfabric_lib target, which is add_library(... INTERFACE) -- an empty INTERFACE library with no compiled sources.
  3. Even if dynolog_lib were built, add_subdirectory(third_party/cpr) is gated by if(USE_ODS_GRAPH_API), a Facebook-internal flag that PyTorch never enables.
  4. Even if cpr were built, CPR_BUILD_TESTS defaults to OFF, so cpr/test/CMakeLists.txt is never added.

So no GPL-3.0 code from cpr/test is ever compiled, linked, or executed in any PyTorch build -- but the LICENSE file is structurally present in the source tree because of recursive submodule vendoring, and the recursive license-files glob picks it up.
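
Files like this one can be surfaced early in an audit by scanning the collected license texts for a GPL-family title. The sketch below is a heuristic for flagging candidates for manual review, not a license classifier. The regular expression and the helper name `flag_copyleft` are assumptions for illustration, and a match says nothing about whether the covered code is compiled or shipped.

```python
import re
from pathlib import Path

# Assumption: GPL-family license texts carry this title near the top.
# Texts that merely mention the GPL (e.g. dual-licensing notes) will also match.
GPL_TITLE = re.compile(
    r"GNU\s+(?:LESSER\s+|AFFERO\s+)?GENERAL\s+PUBLIC\s+LICENSE",
    re.IGNORECASE,
)


def flag_copyleft(rel_paths, repo_root="."):
    """Return the subset of rel_paths whose text contains a GPL-family title."""
    flagged = []
    for rel in rel_paths:
        text = (Path(repo_root) / rel).read_text(encoding="utf-8", errors="replace")
        if GPL_TITLE.search(text):
            flagged.append(rel)
    return flagged
```

Each flagged path still needs the kind of build-graph analysis shown in the chain of gates above before drawing any conclusion about the wheel's contents.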

Scope of this issue

Discussion of which approach to take and on which timeline. Implementation is a separate PR (or set of PRs), ideally after #180237 has landed so the PEP 639 layout is the stable baseline.

Refs

  • #180237 (the PEP 639 layout migration)
  • #158104 (earlier PEP 639 attempt, reverted twice for an unrelated torchvision-on-macOS issue)
  • rgommers' inline review on #180237 (pyproject.toml:55) re: SPDX expression accuracy
