pytorch: `_common_pointwise_single_dim_strategy` should include `Partial()` as a valid output placement for unary ops


🐛 Describe the bug

`_common_pointwise_single_dim_strategy` in `torch/distributed/tensor/_ops/_pointwise_ops.py` only generates `Shard(i)` placements as output strategies for pointwise ops. It never generates `Partial()` as a valid output placement.

This means that unary pointwise ops like `convert_element_type` (dtype cast) cannot preserve a `Partial()` input placement on their output. If a tensor has placement `(Partial(), Partial())` and passes through a dtype cast, the DTensor op strategy forces a reduce before the cast because `Partial()` is not among the valid output placements.
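The effect described above can be illustrated with a toy model. This is only a sketch: `ToyShard`, `ToyPartial`, `toy_unary_strategies`, and `needs_reduce_before_op` are hypothetical names invented for illustration and are not the DTensor API. The real strategy function has a different signature and also deals with replication, which is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyShard:
    """Stand-in for a sharded placement along one tensor dim (hypothetical)."""
    dim: int


@dataclass(frozen=True)
class ToyPartial:
    """Stand-in for a pending-reduction placement (hypothetical)."""
    reduce_op: str = "sum"


def toy_unary_strategies(ndim, allow_partial):
    """List (output_placement, input_placement) pairs for a unary pointwise op."""
    # Behaviour described in the issue: only Shard(i) -> Shard(i) is offered.
    strategies = [(ToyShard(d), ToyShard(d)) for d in range(ndim)]
    if allow_partial:
        # Proposed addition: let a partial input stay partial on the output.
        strategies.append((ToyPartial(), ToyPartial()))
    return strategies


def needs_reduce_before_op(input_placement, strategies):
    """True when no strategy accepts the input placement unchanged."""
    return all(inp != input_placement for _, inp in strategies)


current = toy_unary_strategies(ndim=2, allow_partial=False)
proposed = toy_unary_strategies(ndim=2, allow_partial=True)

print("reduce forced today:      ", needs_reduce_before_op(ToyPartial(), current))
print("reduce forced if proposed:", needs_reduce_before_op(ToyPartial(), proposed))
```

In this toy model, a partial input finds no matching entry in the current strategy list, which corresponds to the forced reduce before the cast.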

This is semantically incorrect: casting a partial-sum tensor from bf16 to f32 produces a valid f32 partial-sum tensor, so the reduction should be deferrable past the cast.
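The claim that a bf16→f32 cast keeps a partial sum valid can be checked without any distributed setup. The sketch below emulates bf16 and f32 rounding in pure Python. The helpers `to_bf16` and `to_f32` are illustrative, not PyTorch functions, and they assume finite inputs of moderate magnitude. The squaring line is an added contrast, not part of the original report: it shows that a nonlinear unary op would not commute with a pending sum in the same way.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    Emulation only: bf16 is treated as the top 16 bits of a float32.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Per-rank partial sums held in bf16 (one value per rank, for simplicity).
partials_bf16 = [to_bf16(v) for v in (0.1, 0.2, 0.3, 0.4)]

# Casting each rank's value to f32 is exact: every bf16 value is also an f32 value.
partials_f32 = [to_f32(p) for p in partials_bf16]
cast_is_exact = partials_f32 == partials_bf16

# So the logical (fully reduced) value is unchanged by casting before reducing.
same_logical_value = sum(partials_f32) == sum(partials_bf16)

# Contrast: squaring each partial does not give the square of the sum.
square_commutes = sum(p * p for p in partials_bf16) == sum(partials_bf16) ** 2

print(cast_is_exact, same_logical_value, square_commutes)
```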

Impact in practice: In AutoParallel's sharding optimizer for LLaMA-3 8B on a 2D mesh (DP=8, TP=8), backward weight gradients are `P(sum)P(sum)` in bf16 and need to reach `S(0)S(0)` in f32 (after a dtype cast for mixed-precision training). The ideal path is to cast bf16→f32 first, then do a single fused reduce-scatter in f32 for numerical accuracy. But because `Partial()` can't pass through the dtype cast node, the optimizer is forced to split the reduction into one reduce-scatter in bf16 before the cast and one in f32 after. The reduction therefore cannot be fused into a single collective, and part of the gradient reduction happens in the lower-precision dtype.
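A rough numerical sketch of the two paths on an 8×8 grid of ranks follows. Several things here are assumptions made for illustration: the split path is assumed to reduce along the TP dimension in bf16 (the report does not say which mesh dimension goes first), every rank holds the same scalar, and reductions are modelled as sequential sums with rounding after each add. Real collectives use other reduction orders, so the exact error will differ.

```python
import math
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to the nearest bfloat16 value (emulated, finite inputs only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reduce_sum(values, round_fn):
    """Sum sequentially, rounding the accumulator to the target dtype each step."""
    acc = round_fn(values[0])
    for v in values[1:]:
        acc = round_fn(acc + v)
    return acc


DP, TP = 8, 8
# One bf16 partial gradient value per rank on the DP x TP mesh.
grid = [[to_bf16(0.1) for _ in range(TP)] for _ in range(DP)]

# Split path: reduce along TP in bf16, cast, then reduce along DP in f32.
row_sums_bf16 = [reduce_sum(row, to_bf16) for row in grid]
split = reduce_sum([to_f32(s) for s in row_sums_bf16], to_f32)

# Fused path: cast every partial to f32 first, then reduce everything in f32.
fused = reduce_sum([to_f32(v) for row in grid for v in row], to_f32)

# High-precision reference for the logical gradient value.
reference = math.fsum(v for row in grid for v in row)
split_error = abs(split - reference)
fused_error = abs(fused - reference)

print(f"split error: {split_error:.6g}, fused error: {fused_error:.6g}")
```

In this model the fused f32 path has the smaller error, which matches the motivation given for casting before reducing.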

Versions

PyTorch 2.13.0.dev20260509+cu130

cc @wanchaol @tianyu-l @wz337 @XilunWu @d4l3k @pragupta @SherlockNoMad @ppwwyyxx @weifengpy
