pytorch - ✅(Solved) Fix viable/strict has been blocked for 5+ days [2 pull requests, 1 comment, 1 participant]

GitHub stats
pytorch/pytorch#178396 · Fetched 2026-04-08 01:30:36
Comments: 1
Participants: 1
Timeline: 27
Reactions: 0
Timeline (top): subscribed ×13, labeled ×6, mentioned ×3, added_to_project_v2 ×1

Error Message

Error looks like

Root Cause

Breakages to TestHOPCUDA.

Fix Action

Mitigation

In progress, continuing to revert incoming changes.

PR fix notes

PR #177922: [inductor] Add inline_asm_elementwise higher-order operator

Description (problem / solution / changelog)

Stack from ghstack (oldest at bottom):

  • -> #177922

Adds a higher-order operator for inline PTX assembly that works in both eager and compiled modes:

  • Eager: JIT compiles CUDA kernels via Jiterator with inline asm
  • Compiled: Lowers to tl.inline_asm_elementwise in Triton via Inductor

This enables using PTX instructions not exposed through standard PyTorch ops (e.g., cvt.rn.satfinite.e2m1x2.f32 for NVFP4 quantization) while maintaining bitwise equivalence between eager and compiled execution.

Example usage:

import torch
from torch._higher_order_ops import inline_asm_elementwise

# x and y are assumed here to be float32 CUDA tensors of the same shape.
result = inline_asm_elementwise(
    x, y,
    asm_str="add.f32 $0, $1, $2;",
    constraints="=f,f,f",
    dtype=torch.float32,
)

Notes:

Jiterator has limited support for the following:

  • multiple outputs
  • different output dtype than input
  • pack != 1

In these cases, today, we error in Jiterator (eager) and succeed in Inductor (compiled).

We inherit the output striding behavior from Jiterator in Inductor/compilation (which follows eager pointwise ops).
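The three limitations listed above can be written down as a small predicate. This is only an illustrative sketch: `AsmCall` and `eager_limitations` are hypothetical names made up for this example and are not part of PyTorch or of the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsmCall:
    """Hypothetical description of one inline-asm call (not a PyTorch type)."""
    num_outputs: int
    input_dtype: str
    output_dtype: str
    pack: int


def eager_limitations(call: AsmCall) -> list[str]:
    """List which of the documented limitations apply to this call.

    An empty list means none of the three listed cases is hit; a non-empty
    list means the PR notes say the eager path errors today.
    """
    reasons = []
    if call.num_outputs != 1:
        reasons.append("multiple outputs")
    if call.output_dtype != call.input_dtype:
        reasons.append("output dtype differs from input dtype")
    if call.pack != 1:
        reasons.append("pack != 1")
    return reasons
```

A check like this could, for instance, decide whether a test should expect an error in eager mode while still expecting the compiled path to succeed.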

Inductor details:

  • Block size is not guaranteed to be a multiple of pack. In particular, at the end of a persistent reduction it's possible that xblock == 1. For this case, I've added a Triton helper that pads the Triton tensor to a multiple of pack and then splits out the actual output afterwards. I suspect this is unlikely to occur, but it's better to handle it anyway.

  • Inductor computes bf16/fp16 values in fp32 inside the kernel. For asm targeting these dtypes, we cast prior to invoking the asm (even with emulated precision casts, we still compute in fp32 and just add extra casts).
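The pad-then-trim idea from the first bullet can be sketched on plain Python lists. This is not the Triton helper added in torch/_inductor/runtime/triton_helpers.py; `padded_length` and `apply_packed` are hypothetical names, and the fill value is an assumption chosen for illustration.

```python
def padded_length(n: int, pack: int) -> int:
    """Smallest multiple of `pack` that is >= n (ceiling division)."""
    return -(-n // pack) * pack


def apply_packed(values, pack, fn, fill=0.0):
    """Apply `fn` to groups of exactly `pack` elements.

    The input is padded with `fill` up to a multiple of `pack`, `fn` is
    called on each full group, and the padding is trimmed from the result.
    """
    n = len(values)
    padded = list(values) + [fill] * (padded_length(n, pack) - n)
    out = []
    for i in range(0, len(padded), pack):
        out.extend(fn(padded[i:i + pack]))
    return out[:n]
```

With a block of one element and pack == 2, the single value is padded to a group of two, processed, and the extra slot is discarded, which mirrors the xblock == 1 case described above.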

Otherwise it works more or less as the existing lowering.

reland of https://github.com/pytorch/pytorch/pull/175814

Written with Claude Code.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela @mlazos

Changed files

  • test/higher_order_ops/test_inline_asm_elementwise.py (added, +585/-0)
  • torch/_dynamo/variables/higher_order_ops.py (modified, +26/-0)
  • torch/_higher_order_ops/__init__.py (modified, +2/-0)
  • torch/_higher_order_ops/inline_asm_elementwise.py (added, +299/-0)
  • torch/_inductor/codegen/common.py (modified, +1/-0)
  • torch/_inductor/codegen/triton.py (modified, +49/-4)
  • torch/_inductor/dtype_propagation.py (modified, +7/-1)
  • torch/_inductor/lowering.py (modified, +37/-0)
  • torch/_inductor/ops_handler.py (modified, +1/-0)
  • torch/_inductor/runtime/triton_helpers.py (modified, +24/-0)
  • torch/testing/_internal/hop_db.py (modified, +28/-0)
RAW_BUFFER

Current Status

Under investigation. The core breakage from the weekend has been reverted, and additional test breakages on main are being reverted.

Error looks like

viable/strict is still on https://github.com/pytorch/pytorch/commit/958d381444ebcad946b965a08545106898420f00.

Incident timeline (all times pacific)

The core breakage began with https://github.com/pytorch/pytorch/pull/177922, which was reverted on the evening of Mar 24.

User impact

viable/strict has lagged, so the branch has no commits from the last 5 days.

Root cause

Breakages to TestHOPCUDA.

Mitigation

In progress, continuing to revert incoming changes.

Prevention/followups

(will update when resolved)

cc @seemethere @malfet @pytorch/pytorch-dev-infra @mruberry

extent analysis

Fix Plan

The fix is to revert the recent changes that broke TestHOPCUDA so that viable/strict can move forward to recent commits again.

Steps to Fix

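  • Revert the changes from https://github.com/pytorch/pytorch/pull/177922, which introduced the core breakage (the issue reports this revert landed on the evening of Mar 24).
  • Keep reverting the remaining TestHOPCUDA breakages on main until viable/strict can pick up commits from the last 5 days.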
  • Example code to revert changes:
git revert <commit-hash>
git push origin <branch-name>

Replace <commit-hash> with the hash of the commit that introduced the breakage and <branch-name> with the name of the branch being updated.

Verification

Verify that the fix worked by checking the following:

  • The viable/strict branch has been updated with recent commits.
  • TestHOPCUDA is passing without errors.
  • The core breakage has been resolved and the code is functioning as expected.
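The first check above (that viable/strict has caught up) can be approximated by measuring how old the branch head is. The sketch below is an assumption-laden illustration: `lag_days` and `branch_head_time` are hypothetical helper names, and it assumes a local clone in which origin/viable/strict has been fetched.

```python
import subprocess
from datetime import datetime, timezone


def lag_days(head_commit_time: datetime, now: datetime) -> float:
    """Age of the branch head in days, relative to `now`."""
    return (now - head_commit_time).total_seconds() / 86400.0


def branch_head_time(ref: str = "origin/viable/strict") -> datetime:
    """Committer time of `ref`, read from the local clone via git."""
    out = subprocess.run(
        ["git", "log", "-1", "--format=%ct", ref],  # %ct = committer Unix time
        check=True,
        capture_output=True,
        text=True,
    ).stdout.strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc)
```

In a fetched clone, `lag_days(branch_head_time(), datetime.now(timezone.utc))` falling well below the 5 days named in the issue title would suggest the branch is moving again. It does not by itself confirm that TestHOPCUDA passes.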

Extra Tips

  • Regularly review and test changes before merging them into the main branch to prevent similar breakages.
  • Consider implementing automated testing and continuous integration to catch errors early and prevent them from reaching production.
