pytorch: [Inductor][CPU] test_linear_thread_factors produces incorrect results with k-slicing on AVX2

GitHub stats
pytorch/pytorch#180739 · Fetched 2026-04-19 15:03:56
Comments: 0 · Participants: 1 · Timeline events: 39 · Reactions: 0
Timeline (top): mentioned ×18, subscribed ×18, labeled ×3

test_linear_thread_factors in test/inductor/test_cpu_select_algorithm.py fails consistently on linux.2xlarge.avx2 CI machines with:

AssertionError: Incorrect result from choice DataProcessorChoiceCallerWrapper(<CppTemplateCaller>)

Two test variants fail:

  • test_linear_thread_factors_batch_size_1024_in_features_1024_out_features_1024_bias_True_cpu_float32
  • test_linear_thread_factors_dynamic_shapes_batch_size_1024_in_features_1024_out_features_1024_bias_False_cpu_float32

Root Cause

This is a pre-existing bug exposed when test_cpu_select_algorithm.py was added to CI in commit 358117c166b (April 9, #172618).

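The test uses @set_num_threads(56) and @inductor_config.patch({"cpp.gemm_thread_factors": "4,2,7"}); the three factors multiply to 4 × 2 × 7 = 56, matching the thread count. With M=N=K=1024 and a K-factor of 7, this activates k-slicing: the K accumulation is split across 7 thread groups, and their partial results are combined by an inter-thread reduction.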
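To make the k-slicing idea concrete, here is a minimal pure-Python sketch. It assumes the factor string is ordered M, N, K (consistent with the issue calling 7 the K-factor) and that K is split into contiguous, near-equal slices. The helper names parse_thread_factors, k_slices and matmul_k_sliced are hypothetical, and the real Inductor template partitions work in register and cache blocks rather than element ranges.

```python
from math import ceil


def parse_thread_factors(spec):
    """Parse a factor string such as "4,2,7" into (m, n, k) integers.

    Assumption: the order is M, N, K, as the issue text implies.
    """
    m, n, k = (int(part) for part in spec.split(","))
    return m, n, k


def k_slices(K, k_factor):
    """Split range(K) into at most k_factor contiguous slices."""
    step = ceil(K / k_factor)
    return [(start, min(start + step, K)) for start in range(0, K, step)]


def matmul_k_sliced(A, B, k_factor):
    """Compute A @ B by summing one partial product per K slice."""
    M, K, N = len(A), len(B), len(B[0])
    partials = []
    for k0, k1 in k_slices(K, k_factor):
        # Each slice stands in for one thread group accumulating its share of K.
        partials.append(
            [[sum(A[i][k] * B[k][j] for k in range(k0, k1)) for j in range(N)]
             for i in range(M)]
        )
    # Stand-in for the inter-thread reduction: add the partial tiles elementwise.
    return [[sum(p[i][j] for p in partials) for j in range(N)] for i in range(M)]
```

With K=1024 and a K-factor of 7 the slices are uneven (six of length 147 and a final one of length 142), so any reduction logic has to handle a shorter last slice. In floating point, the sliced sum can also differ slightly from a single-pass sum because the additions happen in a different order.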

The autotuning VERIFY mechanism compares the output of the CPU template kernel against the ATen/MKL reference (atol=1e-4, rtol=1e-4 for float32) and finds a mismatch.
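For a sense of how tight these tolerances are, the sketch below applies the usual allclose-style criterion, |actual - expected| <= atol + rtol * |expected|, which is what torch.allclose documents. Whether the VERIFY path applies exactly this formula is an assumption here, and is_close and all_close are hypothetical helper names.

```python
def is_close(actual, expected, atol=1e-4, rtol=1e-4):
    """Allclose-style check for one pair of values (assumed criterion)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def all_close(actual, expected, atol=1e-4, rtol=1e-4):
    """Apply is_close to every pair from two equal-length sequences."""
    return all(is_close(a, e, atol, rtol) for a, e in zip(actual, expected))
```

Under this criterion an error of 0.005 passes on a value near 100 but an error of 0.001 fails on a value near 1, so the allowed error scales with the magnitude of the reference value.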

Potential issues in the k-slicing codegen (torch/_inductor/codegen/cpp_gemm_template.py):

  1. Jinja variable scoping (lines 321-324): tile_acc and tile_Y reference m_start/m_end/n_start/n_end which are redefined in the reduction block with different semantics.
  2. Latent mxn_cache_block_id indexing bug (lines 278, 310): When Nc_blocks > 1, the formula (mc / Mc_blocks) * num_Nc_blocks + nc uses raw nc instead of nc / Nc_blocks.
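The second item can be illustrated with a small sketch. It assumes, without having checked cpp_gemm_template.py, that mc and nc advance in steps of Mc_blocks and Nc_blocks over num_Mc_blocks × num_Nc_blocks cache blocks. The function cache_block_ids and its normalize_nc flag are hypothetical; only the two index formulas come from the issue.

```python
def cache_block_ids(num_Mc_blocks, num_Nc_blocks, Mc_blocks, Nc_blocks, normalize_nc):
    """List the cache-block id computed for each (mc, nc) loop iteration.

    normalize_nc=False uses raw nc, the form the issue flags as a latent bug;
    normalize_nc=True uses nc // Nc_blocks.
    """
    ids = []
    # Assumed loop structure: mc and nc step by the cache-block sizes.
    for mc in range(0, num_Mc_blocks * Mc_blocks, Mc_blocks):
        for nc in range(0, num_Nc_blocks * Nc_blocks, Nc_blocks):
            col = nc // Nc_blocks if normalize_nc else nc
            ids.append((mc // Mc_blocks) * num_Nc_blocks + col)
    return ids
```

Under these assumptions the two formulas agree when Nc_blocks is 1, which would explain why the bug stays latent. With Nc_blocks greater than 1 the raw form skips ids and can exceed num_Mc_blocks * num_Nc_blocks - 1.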

CI Impact

Blocks inductor_avx2 shards 1 and 2 in the periodic CI (inductor-periodic workflow).

Reproduction

Run on an AVX2 machine with 56+ cores:

OMP_NUM_THREADS=56 python test/inductor/test_cpu_select_algorithm.py TestSelectAlgorithmCPU.test_linear_thread_factors_batch_size_1024_in_features_1024_out_features_1024_bias_True_cpu_float32

Key Files

  • test/inductor/test_cpu_select_algorithm.py (line 2240)
  • torch/_inductor/codegen/cpp_gemm_template.py (lines 195-332, k-slicing codegen)
  • torch/_inductor/select_algorithm.py (lines 4687-4688, VERIFY comparison)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The failing test test_linear_thread_factors is most likely fixed by correcting the k-slicing codegen, specifically the Jinja variable scoping and the latent mxn_cache_block_id indexing bug.

Guidance

  • Review the k-slicing codegen in torch/_inductor/codegen/cpp_gemm_template.py to ensure correct variable scoping and indexing.
  • Verify that the tile_acc and tile_Y variables are correctly referencing the m_start/m_end/n_start/n_end variables, and that the reduction block is correctly handling the redefined variables.
  • Investigate the mxn_cache_block_id indexing formula to ensure it is correctly handling the case where Nc_blocks > 1.
  • Run the reproduction command to test the fix and verify that the test passes.

Example

No fix snippet is provided: the issue identifies candidate problems in the existing code but does not state an explicit fix.

Notes

The fix likely requires a close reading of the k-slicing codegen and of what test_linear_thread_factors actually exercises. It should also be verified on platforms and configurations beyond AVX2 to confirm correctness.

Recommendation

Apply a workaround by modifying the test_linear_thread_factors test to avoid activating k-slicing, or address the potential issues in the k-slicing codegen to fix the root cause of the problem. The latter approach is recommended as it will provide a more robust and long-term solution.
