pytorch - 💡(How to fix) Fix [Inductor][CPU] test_linear_thread_factors produces incorrect results with k-slicing on AVX2 [1 participants]

pytorch, 2026-04-18 13:30:48


pytorch/pytorch#180739•Fetched 2026-04-19 15:03:56


Author

NikhilAPatel

Participants

NikhilAPatel

Timeline (top)

mentioned ×18, subscribed ×18, labeled ×3

test_linear_thread_factors in test/inductor/test_cpu_select_algorithm.py fails consistently on linux.2xlarge.avx2 CI machines with:

AssertionError: Incorrect result from choice DataProcessorChoiceCallerWrapper(<CppTemplateCaller>)

Two test variants fail:

test_linear_thread_factors_batch_size_1024_in_features_1024_out_features_1024_bias_True_cpu_float32
test_linear_thread_factors_dynamic_shapes_batch_size_1024_in_features_1024_out_features_1024_bias_False_cpu_float32

Error Message

AssertionError: Incorrect result from choice DataProcessorChoiceCallerWrapper(<CppTemplateCaller>)

Root Cause

This is a pre-existing bug exposed when test_cpu_select_algorithm.py was added to CI in commit 358117c166b (April 9, #172618).

The test uses @set_num_threads(56) and @inductor_config.patch({"cpp.gemm_thread_factors": "4,2,7"}). With M=N=K=1024 and K-factor=7, this activates k-slicing — splitting K accumulation across 7 thread groups with inter-thread reduction.
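The k-slicing scheme described above can be illustrated with a minimal pure-Python sketch. This is not the Inductor template: the function names, the even chunking of K, and the small matrix sizes are assumptions made for illustration only. Each of the k_factor groups accumulates a partial product over its own K range, and a final reduction sums the partials. Note that 4 × 2 × 7 = 56, which matches the thread count set by the test.

```python
import math


def matmul_reference(a, b):
    """Plain matmul: a is M x K, b is K x N (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_k_sliced(a, b, k_factor):
    """Split the K loop into k_factor slices, then reduce the partial results.

    Hypothetical illustration: in a real kernel each slice would run on its
    own thread group and the reduction would need synchronization.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = math.ceil(k / k_factor)
    partials = []
    for g in range(k_factor):
        k_start = min(g * chunk, k)
        k_end = min(k_start + chunk, k)
        # Partial accumulation restricted to this group's K range.
        partials.append(
            [[sum(a[i][p] * b[p][j] for p in range(k_start, k_end))
              for j in range(n)] for i in range(m)]
        )
    # Inter-group reduction: sum the partial outputs elementwise.
    out = [[0.0] * n for _ in range(m)]
    for part in partials:
        for i in range(m):
            for j in range(n):
                out[i][j] += part[i][j]
    return out
```

With float32 inputs the sliced result can differ from the reference in the last bits because the summation order changes, so a correct k-sliced kernel is expected to match only within a tolerance, not bit-for-bit.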

The autotuning VERIFY mechanism compares the CPU template kernel output against ATen/MKL reference (atol=1e-4, rtol=1e-4 for float32) and finds a mismatch.
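To make the tolerances concrete, here is a minimal sketch of an elementwise closeness check using the same inequality as torch.allclose, |actual - expected| <= atol + rtol * |expected|. The helper name is_close is hypothetical, and the actual VERIFY code path in select_algorithm.py may be implemented differently.

```python
def is_close(actual, expected, atol=1e-4, rtol=1e-4):
    """Return True if every element satisfies the allclose inequality.

    Hypothetical helper; defaults mirror the float32 tolerances quoted above.
    """
    if len(actual) != len(expected):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y)
               for x, y in zip(actual, expected))
```

Because the relative term scales with the magnitude of the reference value, larger outputs tolerate larger absolute errors. A failure under these tolerances therefore suggests more than ordinary float32 rounding noise.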

Potential issues in the k-slicing codegen (torch/_inductor/codegen/cpp_gemm_template.py):

Jinja variable scoping (lines 321-324): tile_acc and tile_Y reference m_start/m_end/n_start/n_end which are redefined in the reduction block with different semantics.
Latent mxn_cache_block_id indexing bug (lines 278, 310): When Nc_blocks > 1, the formula (mc / Mc_blocks) * num_Nc_blocks + nc uses raw nc instead of nc / Nc_blocks.
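The second suspect can be illustrated with a small sketch that enumerates the cache block IDs produced by both forms of the formula. The loop structure (mc and nc advancing in steps of Mc_blocks and Nc_blocks) and the function name are assumptions for illustration; the issue lists this only as a potential problem, not a confirmed cause.

```python
import math


def cache_block_ids(m_blocks, n_blocks, mc_blocks, nc_blocks, divide_nc):
    """Enumerate block IDs for every (mc, nc) cache block.

    divide_nc=False uses raw nc (the formula quoted in the issue);
    divide_nc=True uses nc // nc_blocks (the suggested form).
    """
    num_nc_blocks = math.ceil(n_blocks / nc_blocks)
    ids = []
    for mc in range(0, m_blocks, mc_blocks):      # assumed loop stride
        for nc in range(0, n_blocks, nc_blocks):  # assumed loop stride
            n_index = nc // nc_blocks if divide_nc else nc
            ids.append((mc // mc_blocks) * num_nc_blocks + n_index)
    return ids
```

Under these assumptions, with 4 x 4 register blocks and 2 x 2 cache blocks the raw-nc form yields IDs [0, 2, 2, 4]: two blocks collide on ID 2 and ID 4 falls outside the four valid slots. The divided form yields [0, 1, 2, 3]. When Nc_blocks is 1 the two forms agree, which is why the bug would stay latent.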


CI Impact

Blocks inductor_avx2 shards 1 and 2 in the periodic CI (inductor-periodic workflow).

Reproduction

Run on an AVX2 machine with 56+ cores:

OMP_NUM_THREADS=56 python test/inductor/test_cpu_select_algorithm.py TestSelectAlgorithmCPU.test_linear_thread_factors_batch_size_1024_in_features_1024_out_features_1024_bias_True_cpu_float32

Key Files

test/inductor/test_cpu_select_algorithm.py (line 2240)
torch/_inductor/codegen/cpp_gemm_template.py (lines 195-332, k-slicing codegen)
torch/_inductor/select_algorithm.py (lines 4687-4688, VERIFY comparison)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

extent analysis

TL;DR

The issue does not confirm a fix. It names two suspects in the k-slicing codegen that are the most likely places to look: the Jinja variable scoping of tile_acc and tile_Y, and the latent mxn_cache_block_id indexing bug.

Guidance

Review the k-slicing codegen in torch/_inductor/codegen/cpp_gemm_template.py to ensure correct variable scoping and indexing.
Verify that tile_acc and tile_Y still see the intended m_start/m_end/n_start/n_end values, given that the reduction block redefines those variables with different semantics.
Check whether the mxn_cache_block_id formula should use nc / Nc_blocks instead of raw nc when Nc_blocks > 1.
Run the reproduction command to test the fix and verify that the test passes.

Example

No code snippet is provided as the issue requires a review of the existing code and potential fixes are not explicitly stated.

Notes

The fix may require a deeper understanding of the k-slicing codegen and the specific requirements of the test_linear_thread_factors test. Additionally, the fix may need to be verified on multiple platforms and configurations to ensure correctness.

Recommendation

Either work around the failure by changing test_linear_thread_factors so that it does not activate k-slicing, or fix the k-slicing codegen itself. The latter is preferable because it addresses the root cause instead of hiding it.
