pytorch - 💡(How to fix) Fix native_batch_norm_backward extremely slow on MPS for 3D tensors (slower than CPU) [6 comments, 3 participants]

GitHub stats
pytorch/pytorch#178492 (fetched 2026-04-08 01:30:10)
Comments: 6 · Participants: 3 · Timeline events: 18 · Reactions: 0
Timeline (top): commented ×6, labeled ×5, subscribed ×4, mentioned ×3

🐛 Describe the bug

3D CNN training on MPS is slower than CPU due to native_batch_norm_backward consuming ~70% of training time for 5D tensors (batch, channels, D, H, W). 2D training on MPS works excellently (9–12× speedup), so this appears specific to the 3D batch norm backward kernel.

Profiling result

The bottleneck is aten::native_batch_norm_backward for 3D (5D) tensors on MPS. Forward pass and all other 3D ops run at reasonable speed.
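The issue does not show how the profile was collected. One way to look for the same hotspot is sketched below, assuming torch.profiler is used on an isolated BatchNorm3d layer. The helper name profile_bn3d_backward and the small tensor shape are illustrative choices, not taken from the issue. Because MPS executes asynchronously, the CPU-side times in this table may not reflect actual GPU time, so treat them as a rough indication only.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity


def profile_bn3d_backward(device_name="cpu", shape=(2, 8, 16, 16, 16), steps=3):
    """Profile forward + backward of a single BatchNorm3d layer (hypothetical helper)."""
    device = torch.device(device_name)
    bn = nn.BatchNorm3d(shape[1]).to(device)
    x = torch.randn(*shape, device=device, requires_grad=True)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(steps):
            bn(x).sum().backward()
        if device_name == "mps":
            # Wait for queued MPS work before the profiler stops recording
            torch.mps.synchronize()
    return prof.key_averages()


# Use MPS when present, otherwise fall back to CPU so the sketch still runs
device_name = "mps" if torch.backends.mps.is_available() else "cpu"
averages = profile_bn3d_backward(device_name)
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))
```

If the issue's diagnosis holds, the batch norm backward entry should rank near the top on MPS once the shape is raised toward the reproducer's 64×64×64 volumes.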

Benchmark results

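| Test | Device | Time (s) | Throughput | Speedup vs. CPU |
|------|--------|----------|------------|-----------------|
| 2D ResNet18 | CPU | 45.7 | 21 img/s | |
| 2D ResNet18 | MPS | 5.1 | 188 img/s | 8.9× |
| 3D ResNet18 | CPU | 119.8 | 1.0 vol/s | |
| 3D ResNet18 | MPS | 154.9 | 0.8 vol/s | 0.77× (slower!) |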

Tested on both PyTorch 2.7.1 and 2.10.0 — no improvement in 3D MPS performance.
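As a sanity check on the benchmark numbers, the speedup figures are consistent with CPU time divided by MPS time. That reading is an assumption, since the issue does not define the column. The second calculation is only a rough bound: it assumes the ~70% share applies to the 154.9 s MPS run and that removing the batch norm backward cost would leave everything else unchanged.

```python
# Timings in seconds, copied from the benchmark table in the issue.
TIMES = {
    ("2D ResNet18", "cpu"): 45.7,
    ("2D ResNet18", "mps"): 5.1,
    ("3D ResNet18", "cpu"): 119.8,
    ("3D ResNet18", "mps"): 154.9,
}


def speedup(test):
    """CPU time divided by MPS time (assumed definition of the Speedup column)."""
    return TIMES[(test, "cpu")] / TIMES[(test, "mps")]


def mps_time_without_bn_backward(bn_share=0.70):
    """Hypothetical 3D MPS time if batch norm backward cost nothing."""
    return TIMES[("3D ResNet18", "mps")] * (1.0 - bn_share)


for test in ("2D ResNet18", "3D ResNet18"):
    print(f"{test}: {speedup(test):.2f}x")

best_case = mps_time_without_bn_backward()
cpu_3d = TIMES[("3D ResNet18", "cpu")]
print(f"3D MPS best case: {best_case:.1f}s ({cpu_3d / best_case:.1f}x vs CPU)")
```

The ratios come out at roughly 8.96 and 0.77, in line with the reported 8.9× and 0.77×. Under the stated assumptions, removing the batch norm backward cost would put the 3D MPS run at about 46 s, which is why this single kernel decides whether MPS beats CPU here.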

Reproducer

import time
import torch
import torch.nn as nn

assert torch.backends.mps.is_available(), "MPS not available"

# Minimal 3D model with BatchNorm3d
class Simple3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.Conv3d(32, 64, 3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, 4),
        )
    def forward(self, x):
        return self.net(x)

def bench(device_name, steps=20):
    device = torch.device(device_name)
    model = Simple3DCNN().to(device)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    x = torch.randn(4, 1, 64, 64, 64, device=device)
    y = torch.randint(0, 4, (4,), device=device)

    # Warmup
    for _ in range(3):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if device_name == "mps":
        torch.mps.synchronize()

    start = time.perf_counter()
    for _ in range(steps):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if device_name == "mps":
        torch.mps.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{device_name.upper():>4s}: {elapsed:.2f}s ({4*steps/elapsed:.1f} vol/s)")

bench("cpu")
bench("mps")

Expected behavior

MPS should be faster than (or at least comparable to) CPU for 3D batch norm backward, similar to how 2D batch norm achieves excellent MPS speedups.

Environment

- PyTorch: 2.10.0 (also tested 2.7.1 — same behavior)
- OS: macOS 26.1 / 26.2
- Hardware: MacBook Pro M1 Pro (32 GB) and Mac mini M4 Pro (64 GB) — same issue on both
- Python: 3.12

cc: #154052 (context from the "most requested MPS ops" discussion)


### Versions

native_batch_norm_backward extremely slow on MPS for 3D tensors (slower than CPU)

cc @jerryzh168 @kulinseth @malfet @DenisVieriu97 @jhavukainen @aditvenk

extent analysis

Fix Plan

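To address the performance issue with native_batch_norm_backward on MPS for 3D tensors, the following steps are worth trying:

  • Update PyTorch: Test the latest release or a nightly build. The issue reports the same slowdown on both 2.7.1 and 2.10.0, so an upgrade helps only if a newer build changes the MPS batch norm backward path.
  • Use torch.nn.functional.batch_norm: Call torch.nn.functional.batch_norm directly instead of going through the nn.BatchNorm3d module. Because nn.BatchNorm3d calls this function internally, the same backward kernel is likely to run, so treat this as a diagnostic experiment rather than a fix.
  • Avoid native_batch_norm_backward: There is no documented switch for disabling a single MPS kernel, and PYTORCH_ENABLE_MPS_FALLBACK=1 only applies to ops that have no MPS implementation. Bypassing the kernel therefore means changing the model or the device, for example running the affected layers on CPU for comparison.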

Here's an example of how you can modify the Simple3DCNN model to use torch.nn.functional.batch_norm:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Simple3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm3d(32)
        self.conv2 = nn.Conv3d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm3d(64)
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(64, 4)

    def forward(self, x):
        # Apply conv1 first: bn1 expects 32 channels, not the 1-channel input
        x = F.relu(F.batch_norm(self.conv1(x), self.bn1.running_mean, self.bn1.running_var, weight=self.bn1.weight, bias=self.bn1.bias, training=self.training, momentum=0.1, eps=1e-5))
        x = F.relu(F.batch_norm(self.conv2(x), self.bn2.running_mean, self.bn2.running_var, weight=self.bn2.weight, bias=self.bn2.bias, training=self.training, momentum=0.1, eps=1e-5))
        x = self.avg_pool(x)
        x = self.flatten(x)
        x = self.linear(x)
        return x

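Note that setting the momentum argument of torch.nn.BatchNorm3d to None, as shown here, only switches the running statistics to a cumulative moving average. It does not disable the native_batch_norm_backward kernel, so it should not be expected to change backward-pass speed: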

self.bn1 = nn.BatchNorm3d(32, momentum=None)
self.bn2 = nn.BatchNorm3d(64, momentum=None)
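A different experiment, offered as an unverified sketch: batch norm statistics are computed per channel over all remaining dimensions, so a 5D input can be reshaped to 4D, normalized through the 2D code path, and reshaped back with the same numerical result. Whether this avoids the slow MPS path is an assumption based only on the issue's observation that 2D training is fast. BatchNorm3dVia2d is a hypothetical name, not a PyTorch class.

```python
import torch
import torch.nn as nn


class BatchNorm3dVia2d(nn.BatchNorm2d):
    """Hypothetical stand-in for nn.BatchNorm3d that runs the 4D batch norm path."""

    def forward(self, x):
        n, c, d, h, w = x.shape
        # Fold depth into height: per-channel statistics are unchanged
        y = super().forward(x.reshape(n, c, d * h, w))
        return y.reshape(n, c, d, h, w)


# Small CPU demo: compare against the stock 3D module
torch.manual_seed(0)
reference = nn.BatchNorm3d(32)
candidate = BatchNorm3dVia2d(32)
x = torch.randn(2, 32, 8, 8, 8)
max_diff = (reference(x) - candidate(x)).abs().max().item()
print(f"max abs difference vs nn.BatchNorm3d: {max_diff:.2e}")
```

Parameter and buffer shapes match those of nn.BatchNorm3d, so the layer can be swapped into the reproducer in place of nn.BatchNorm3d(32) and nn.BatchNorm3d(64) before rerunning the benchmark.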

Verification

To verify that the fix worked, you can run the benchmark again and compare the results:

bench("cpu")
bench("mps")

If the performance issue is resolved, you should see a significant improvement in the throughput and speedup on MPS.
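To check a change at the level of the suspected kernel rather than the whole model, a smaller microbenchmark can time only BatchNorm3d forward plus backward. This is a minimal sketch: time_bn3d is a hypothetical helper, and the default shape is an assumption chosen to run quickly. Raising the spatial size to 64×64×64 matches the input that the first BatchNorm3d layer sees in the reproducer.

```python
import time
import torch
import torch.nn as nn


def time_bn3d(device_name, shape=(4, 32, 32, 32, 32), steps=10, warmup=3):
    """Average seconds per forward+backward of one BatchNorm3d layer."""
    device = torch.device(device_name)
    bn = nn.BatchNorm3d(shape[1]).to(device)
    x = torch.randn(*shape, device=device, requires_grad=True)

    def sync():
        # MPS runs asynchronously, so wait for queued work before reading the clock
        if device_name == "mps":
            torch.mps.synchronize()

    for _ in range(warmup):
        bn(x).sum().backward()
    sync()
    start = time.perf_counter()
    for _ in range(steps):
        bn(x).sum().backward()
    sync()
    return (time.perf_counter() - start) / steps


# Always time CPU; add MPS only when the backend is available
devices = ["cpu"] + (["mps"] if torch.backends.mps.is_available() else [])
results = {name: time_bn3d(name) for name in devices}
for name, seconds in results.items():
    print(f"{name.upper():>4s}: {seconds * 1000:.2f} ms per step")
```

Comparing the CPU and MPS numbers before and after a change shows whether the batch norm layer itself got faster, independent of the convolutions and the optimizer.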

Extra Tips

  • Make sure to test your model on different hardware configurations to ensure that the fix works across
