pytorch - ✅(Solved) Fix [CI][CUDA][GB200] Unit Test Failure with TritonTensorDescriptorTestCUDA Detected in GB200 Nightly [1 pull requests, 2 comments, 1 participants]

GitHub stats
pytorch/pytorch#181174 · Fetched 2026-04-23 07:22:11
Comments: 2 · Participants: 1 · Timeline: 190 · Reactions: 0
Timeline (top): mentioned ×90, subscribed ×90, labeled ×7, commented ×2

Fix Action

Fixed

PR fix notes

PR #179729: [inductor] Raise split reduction threshold from 8K to 524K on Blackwell+

Description (problem / solution / changelog)

The split reduction heuristic was too aggressive for small-batch, moderate-reduction workloads (e.g. entropy/softmax over vocab=32K). With the old threshold of 8192, a batch=1 entropy computation over 32K vocab was split into 8 kernel launches (4 CTAs + tree-reduce for each of the 3 reduction steps), making torch.compile 2x slower than eager.

Raising the threshold to 524K on Blackwell (SM >= 10.0) avoids unnecessary splitting for practical softmax/entropy/layernorm vocab sizes while preserving splits for truly large reductions (>1M) where single-CTA throughput becomes a bottleneck. Older architectures retain the original 8192 threshold.
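The architecture-dependent threshold described above can be sketched as follows. This is an illustration only, not the code in torch/_inductor/choices.py: the function names, the reading of "524K" as 2**19 = 524288, and the strict "greater than" comparison are all assumptions made for the example.

```python
# Illustrative sketch only. Names, exact constants, and the comparison
# are assumptions, not the contents of torch/_inductor/choices.py.

LEGACY_THRESHOLD = 8192        # "8K" threshold kept on older architectures
BLACKWELL_THRESHOLD = 524_288  # assumed exact value of "524K" (2**19)


def split_reduction_threshold(sm_major: int) -> int:
    """Return the reduction size above which a split is considered."""
    # Blackwell and newer report a compute capability major version >= 10.
    return BLACKWELL_THRESHOLD if sm_major >= 10 else LEGACY_THRESHOLD


def should_split(reduction_numel: int, sm_major: int) -> bool:
    """Hypothetical predicate: split only above the per-architecture threshold."""
    return reduction_numel > split_reduction_threshold(sm_major)


if __name__ == "__main__":
    for n in (32 * 1024, 512 * 1024, 1024 * 1024, 8 * 1024 * 1024):
        print(n, "pre-Blackwell:", should_split(n, 9), "Blackwell:", should_split(n, 10))
```

On a real machine the major version could come from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple. Under these assumptions a 32K reduction splits on older GPUs but not on Blackwell, which matches the behavior the PR describes.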

Benchmark on GB200 (global sum, batch=1):

  • n=32K: split 0.047ms → nosplit 0.036ms (-25%)
  • n=524K: split 0.040ms → nosplit 0.032ms (-19%)
  • n=1M: split 0.042ms → nosplit 0.050ms (+18%, still splits with new threshold)
  • n=8M: split 0.042ms → nosplit 0.595ms (regression, still splits with new threshold)

Fixes https://github.com/pytorch/pytorch/issues/179697

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

  • torch/_inductor/choices.py (modified, +6/-1)

We noticed the following failures with the nightly GB200 signal:

TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda

AssertionError: Scalars are not equal!

Expected 2 but got 1. Absolute difference: 1 Relative difference: 0.5

To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda
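For repeated runs, the command above can be wrapped in a small script. This is a minimal sketch: the helper names `build_command` and `run_test` are hypothetical, and it assumes the script is pointed at a PyTorch checkout with a working CUDA build.

```python
import subprocess
import sys

# Test file and test ID copied from the issue text.
TEST_FILE = "test/inductor/test_torchinductor_strided_blocks.py"
TEST_ID = (
    "TritonBlockPointerTestGPU."
    "test_reduction_prefer_nd_tiling_False_view_size6_"
    "num_block_pointers_3_num_triton_kernels_2_cuda"
)


def build_command(test_id: str = TEST_ID) -> list[str]:
    """Build the argv list equivalent to the command quoted in the issue."""
    return [sys.executable, TEST_FILE, test_id]


def run_test(repo_dir: str, test_id: str = TEST_ID) -> int:
    """Run one test from the base repo dir and return its exit code."""
    return subprocess.run(build_command(test_id), cwd=repo_dir).returncode


if __name__ == "__main__":
    # Pass the path to the PyTorch checkout as the first argument.
    sys.exit(run_test(sys.argv[1] if len(sys.argv) > 1 else "."))
```

A non-zero exit code indicates the test failed or could not be run; the same helper accepts any of the other failing test IDs listed in the report.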

TritonBlockPointerTestGPU.test_welford_non_block_pointer_cuda
TritonTensorDescriptorTestCUDA.test_reduction_prefer_nd_tiling_False_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size1_num_block_pointers_3_num_triton_kernels_2_reduction_op1_cuda
TritonTensorDescriptorTestCUDA.test_welford_non_block_pointer_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_2d_welford_reduction_size1_expected_num_block_pointers_7_expected_num_triton_kernels_2_expect_fallback_False_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size4_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonTensorDescriptorTestCUDA.test_reduction_prefer_nd_tiling_True_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size4_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda

Versions: failing as of https://github.com/pytorch/pytorch/commit/17b157385f3f37f05e53b37ea130bf1ce3fe7650 (this commit should reproduce the failures)

Last known good: https://github.com/pytorch/pytorch/commit/c4ec73b4b52e7c878e3c2522cac61e035ee72520

Disclaimer: the triton wheel used (upstream vs. nvidia-internally built) may differ slightly, which could also play a role; we will follow up on that.

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @mruberry @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @Aidyn-A @atalman @malfet

extent analysis

TL;DR

Revert to the last known good commit (c4ec73b4b52e7c878e3c2522cac61e035ee72520) to potentially resolve the AssertionError in the nightly GB200 signal tests.

Guidance

  • Run the test TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda to reproduce the issue and verify the fix.
  • Compare the results of the test between the current commit (17b157385f3f37f05e53b37ea130bf1ce3fe7650) and the last known good commit to confirm the fix.
  • Investigate potential differences in the triton wheel used (upstream vs. nvidia-internally built) as a possible cause of the issue.
  • Review the changes made between the last known good commit and the current commit to identify the potential root cause of the AssertionError.
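The review step in the guidance above can be narrowed by listing only the commits between the two hashes. The sketch below shells out to `git log`; the helper names are hypothetical, and restricting the search to `torch/_inductor` is an assumption based on the failing tests living under test/inductor, so widen the path if nothing relevant shows up.

```python
import subprocess

# Commit hashes copied from the issue text.
GOOD = "c4ec73b4b52e7c878e3c2522cac61e035ee72520"  # last known good
BAD = "17b157385f3f37f05e53b37ea130bf1ce3fe7650"   # first known failing


def log_command(good: str = GOOD, bad: str = BAD,
                path: str = "torch/_inductor") -> list[str]:
    """Build a `git log` command for commits in good..bad touching `path`."""
    return ["git", "log", "--oneline", f"{good}..{bad}", "--", path]


def list_candidate_commits(repo_dir: str) -> list[str]:
    """Return one summary line per candidate commit (requires a git checkout)."""
    result = subprocess.run(
        log_command(), cwd=repo_dir, check=True,
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

If the candidate list is long, `git bisect` with the reproducing test as the check is the usual next step.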

Notes

The fix is not guaranteed, as there is a possibility that the triton wheel used might be slightly different. Further investigation is needed to confirm the root cause of the issue.

Recommendation

Apply workaround: Revert to the last known good commit (c4ec73b4b52e7c878e3c2522cac61e035ee72520) to potentially resolve the AssertionError. This is recommended because it is a known stable version, and reverting to it may provide a temporary solution until the root cause is identified and fixed.
