pytorch - ✅(Solved) Fix [CI][CUDA][GB200] Unit Test Failure with TritonTensorDescriptorTestCUDA Detected in GB200 Nightly [1 pull requests, 2 comments, 1 participants]

pytorch, 2026-04-22 21:35:31

GitHub stats

pytorch/pytorch#181174 • Fetched 2026-04-23 07:22:11

Author

nWEIdia

Participants

nWEIdia

Timeline (top)

mentioned ×90, subscribed ×90, labeled ×7, commented ×2

Fix Action

Fixed

Fixed by PR: [inductor] Raise split reduction threshold from 8K to 524K on Blackwell+ (https://github.com/pytorch/pytorch/pull/179729)

PR fix notes

PR #179729: [inductor] Raise split reduction threshold from 8K to 524K on Blackwell+

Repository: pytorch/pytorch
Author: liqiangxl
State: closed | merged: False
Link: https://github.com/pytorch/pytorch/pull/179729

Description (problem / solution / changelog)

The split reduction heuristic was too aggressive for small-batch, moderate-reduction workloads (e.g. entropy/softmax over vocab=32K). With the old threshold of 8192, a batch=1 entropy computation over 32K vocab was split into 8 kernel launches (4 CTAs + tree-reduce for each of the 3 reduction steps), making torch.compile 2x slower than eager.

Raising the threshold to 524K on Blackwell (SM >= 10.0) avoids unnecessary splitting for practical softmax/entropy/layernorm vocab sizes while preserving splits for truly large reductions (>1M) where single-CTA throughput becomes a bottleneck. Older architectures retain the original 8192 threshold.
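The architecture-dependent threshold described above can be sketched as follows. This is a simplified illustration, not the code in torch/_inductor/choices.py: the function and constant names are hypothetical, "524K" is assumed to mean 2**19 = 524288, and the strict `>` comparison is an assumption. The real heuristic likely weighs other factors as well, such as batch size.

```python
# Hypothetical sketch; names are illustrative and NOT Inductor's actual API.
LEGACY_SPLIT_THRESHOLD = 8192        # threshold quoted for older architectures
BLACKWELL_SPLIT_THRESHOLD = 524_288  # assumption: "524K" means 2**19


def split_reduction_threshold(sm_major: int, sm_minor: int = 0) -> int:
    """Return the reduction size above which a reduction would be split."""
    # Blackwell and newer are identified in the PR text as SM >= 10.0.
    if (sm_major, sm_minor) >= (10, 0):
        return BLACKWELL_SPLIT_THRESHOLD
    return LEGACY_SPLIT_THRESHOLD


def should_split(reduction_numel: int, sm_major: int, sm_minor: int = 0) -> bool:
    """Simplified decision: split only when the reduction exceeds the threshold."""
    # Assumption: strict comparison, so a reduction exactly at the threshold is not split.
    return reduction_numel > split_reduction_threshold(sm_major, sm_minor)


if __name__ == "__main__":
    for numel in (32 * 1024, 512 * 1024, 1024 * 1024):
        print(numel, "SM 9.0:", should_split(numel, 9), "SM 10.0:", should_split(numel, 10))
```

Under these assumptions, a 32K-element reduction is split on SM 9.0 but not on SM 10.0, while a 1M-element reduction is split on both.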

Benchmark on GB200 (global sum, batch=1):
n=32K: split 0.047ms → nosplit 0.036ms (-25%)
n=524K: split 0.040ms → nosplit 0.032ms (-19%)
n=1M: split 0.042ms → nosplit 0.050ms (+18%, still splits with new threshold)
n=8M: split 0.042ms → nosplit 0.595ms (regression, still splits with new threshold)
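The relative changes can be recomputed from the quoted timings with a short script. The helper name `pct_change` is illustrative. Because the timings above are rounded to three decimals, the recomputed percentages may differ by a point or two from the quoted ones, which were presumably derived from unrounded measurements.

```python
def pct_change(split_ms: float, nosplit_ms: float) -> float:
    """Relative change of the no-split time versus the split time, in percent."""
    return (nosplit_ms - split_ms) / split_ms * 100.0


# (split_ms, nosplit_ms) pairs copied from the GB200 benchmark above.
timings = {
    "32K": (0.047, 0.036),
    "524K": (0.040, 0.032),
    "1M": (0.042, 0.050),
    "8M": (0.042, 0.595),
}

if __name__ == "__main__":
    for n, (split_ms, nosplit_ms) in timings.items():
        print(f"n={n}: {pct_change(split_ms, nosplit_ms):+.1f}%")
```

A negative value means that not splitting is faster, and a positive value means that splitting is faster.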

Fixes https://github.com/pytorch/pytorch/issues/179697

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Changed files

torch/_inductor/choices.py (modified, +6/-1)

RAW_BUFFER

We noticed the following failures with the nightly GB200 signal:

TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda

AssertionError: Scalars are not equal!

Expected 2 but got 1. Absolute difference: 1 Relative difference: 0.5

To execute this test, run the following from the base repo dir: python test/inductor/test_torchinductor_strided_blocks.py TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda

TritonBlockPointerTestGPU.test_welford_non_block_pointer_cuda
TritonTensorDescriptorTestCUDA.test_reduction_prefer_nd_tiling_False_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_2d_reduction_odd_shapes_view_size1_num_block_pointers_3_num_triton_kernels_2_reduction_op1_cuda
TritonTensorDescriptorTestCUDA.test_welford_non_block_pointer_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_2d_welford_reduction_size1_expected_num_block_pointers_7_expected_num_triton_kernels_2_expect_fallback_False_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size4_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonTensorDescriptorTestCUDA.test_reduction_prefer_nd_tiling_True_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_True_view_size4_num_block_pointers_3_num_triton_kernels_2_cuda
TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size5_num_block_pointers_2_num_triton_kernels_2_cuda

Versions: failing as of https://github.com/pytorch/pytorch/commit/17b157385f3f37f05e53b37ea130bf1ce3fe7650 (this commit should reproduce the failures)

Last known good: https://github.com/pytorch/pytorch/commit/c4ec73b4b52e7c878e3c2522cac61e035ee72520

Disclaimer: there is some chance that the Triton wheel used (upstream vs. NVIDIA-internal build) is slightly different; we will follow up on this.

cc @ptrblck @msaroufim @eqy @jerryzh168 @tinglvv @mruberry @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @Aidyn-A @atalman @malfet

extent analysis

TL;DR

Revert to the last known good commit (c4ec73b4b52e7c878e3c2522cac61e035ee72520) to potentially resolve the AssertionError in the nightly GB200 signal tests.

Guidance

Run the test TritonBlockPointerTestGPU.test_reduction_prefer_nd_tiling_False_view_size6_num_block_pointers_3_num_triton_kernels_2_cuda to reproduce the issue and verify the fix.
Compare the results of the test between the current commit (17b157385f3f37f05e53b37ea130bf1ce3fe7650) and the last known good commit to confirm the fix.
Investigate potential differences in the triton wheel used (upstream vs. nvidia-internally built) as a possible cause of the issue.
Review the changes made between the last known good commit and the current commit to identify the potential root cause of the AssertionError.
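One way to carry out the comparison between the two commits is `git bisect`. The sketch below only builds and prints the commands and does not run them. It assumes a PyTorch checkout whose Python-side changes take effect without a rebuild; if compiled components changed between the two commits, a rebuild step would be needed at each bisect step. The helper name `bisect_commands` is illustrative.

```python
import shlex

# Commit hashes and test ID copied from the issue text.
GOOD = "c4ec73b4b52e7c878e3c2522cac61e035ee72520"
BAD = "17b157385f3f37f05e53b37ea130bf1ce3fe7650"
TEST_FILE = "test/inductor/test_torchinductor_strided_blocks.py"
TEST_ID = (
    "TritonBlockPointerTestGPU."
    "test_reduction_prefer_nd_tiling_False_view_size6_"
    "num_block_pointers_3_num_triton_kernels_2_cuda"
)


def bisect_commands(good: str, bad: str, test_file: str, test_id: str) -> list[list[str]]:
    """Build (but do not execute) the git commands for an automated bisect."""
    return [
        # `git bisect start <bad> <good>` marks both endpoints in one step.
        ["git", "bisect", "start", bad, good],
        # `git bisect run` marks a commit bad when the command exits non-zero.
        ["git", "bisect", "run", "python", test_file, test_id],
        # Return the checkout to where it was before bisecting.
        ["git", "bisect", "reset"],
    ]


if __name__ == "__main__":
    for cmd in bisect_commands(GOOD, BAD, TEST_FILE, TEST_ID):
        print(shlex.join(cmd))
```

The printed commands would be run from the base repository directory, as the reproduction instructions in the issue require.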

Notes

Reverting is not guaranteed to fix the failures, because the Triton wheel in use (upstream vs. NVIDIA-internal build) may differ slightly. Further investigation is needed to confirm the root cause.

Recommendation

Apply workaround: Revert to the last known good commit (c4ec73b4b52e7c878e3c2522cac61e035ee72520) to potentially resolve the AssertionError. That commit is the last one reported to pass these tests, so reverting to it may serve as a temporary measure until the root cause is identified and fixed.
