pytorch - 💡(How to fix) Fix native_batch_norm_backward extremely slow on MPS for 3D tensors (slower than CPU) [6 comments, 3 participants]

pytorch · 2026-03-26 08:33:10


pytorch/pytorch#178492•Fetched 2026-04-08 01:30:10




## 🐛 Describe the bug

3D CNN training on MPS is **slower than CPU** due to `native_batch_norm_backward` consuming ~70% of training time for 5D tensors (batch, channels, D, H, W).
2D training on MPS works excellently (9–12× speedup), so this appears specific to the 3D batch norm backward kernel.

### Profiling result

The bottleneck is `aten::native_batch_norm_backward` for 3D (5D) tensors on MPS. Forward pass and all other 3D ops run at reasonable speed.
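
A per-operator breakdown like the one described above can be collected with `torch.profiler`. The sketch below is not the reporter's profiling script: the small model, the tensor size, and the helper names `pick_device` and `profile_bn3d_step` are illustrative assumptions, and it falls back to CPU when MPS is unavailable. MPS kernels run asynchronously, so CPU-side timings may attribute GPU work to whichever call ends up waiting; treat the table as a rough guide only.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile


def pick_device():
    # Use MPS when present; otherwise fall back to CPU so the sketch still runs.
    return torch.device("mps" if torch.backends.mps.is_available() else "cpu")


def profile_bn3d_step(size=16, steps=3):
    """Profile a few forward/backward passes through Conv3d + BatchNorm3d."""
    device = pick_device()
    model = nn.Sequential(
        nn.Conv3d(1, 8, 3, padding=1),
        nn.BatchNorm3d(8),
        nn.ReLU(),
    ).to(device)
    x = torch.randn(2, 1, size, size, size, device=device)

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(steps):
            model(x).sum().backward()
        if device.type == "mps":
            torch.mps.synchronize()  # wait for queued GPU work before stopping
    return prof.key_averages()


averages = profile_bn3d_step()
# Operators sorted by the time spent in the op itself (excluding children).
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))
op_names = {event.key for event in averages}
```

Increasing `size` toward the 64³ volumes used in the reproducer should make the batch norm rows easier to compare against the convolution rows.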

### Benchmark results

| Test | Device | Time (s) | Throughput | Speedup |
|------|--------|-----------|------------|---------|
| 2D ResNet18 | CPU | 45.7 | 21 img/s | — |
| 2D ResNet18 | MPS | 5.1 | 188 img/s | **8.9×** |
| 3D ResNet18 | CPU | 119.8 | 1.0 vol/s | — |
| 3D ResNet18 | MPS | 154.9 | 0.8 vol/s | **0.77×** (slower!) |

Tested on both PyTorch 2.7.1 and 2.10.0 — no improvement in 3D MPS performance.

## Reproducer

```python
import time
import torch
import torch.nn as nn

assert torch.backends.mps.is_available(), "MPS not available"

# Minimal 3D model with BatchNorm3d
class Simple3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, 3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(),
            nn.Conv3d(32, 64, 3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, 4),
        )
    def forward(self, x):
        return self.net(x)

def bench(device_name, steps=20):
    device = torch.device(device_name)
    model = Simple3DCNN().to(device)
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss()
    x = torch.randn(4, 1, 64, 64, 64, device=device)
    y = torch.randint(0, 4, (4,), device=device)

    # Warmup
    for _ in range(3):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if device_name == "mps":
        torch.mps.synchronize()

    start = time.perf_counter()
    for _ in range(steps):
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if device_name == "mps":
        torch.mps.synchronize()
    elapsed = time.perf_counter() - start
    print(f"{device_name.upper():>4s}: {elapsed:.2f}s ({4*steps/elapsed:.1f} vol/s)")

bench("cpu")
bench("mps")
```

## Expected behavior

MPS should be faster than (or at least comparable to) CPU for 3D batch norm backward, similar to how 2D batch norm achieves excellent MPS speedups.

## Environment

- PyTorch: 2.10.0 (also tested 2.7.1 — same behavior)
- OS: macOS 26.1 / 26.2
- Hardware: MacBook Pro M1 Pro (32 GB) and Mac mini M4 Pro (64 GB) — same issue on both
- Python: 3.12

cc: #154052 (context from the "most requested MPS ops" discussion)

---

### Versions

native_batch_norm_backward extremely slow on MPS for 3D tensors (slower than CPU)

cc @jerryzh168 @kulinseth @malfet @DenisVieriu97 @jhavukainen @aditvenk

extent analysis

Fix Plan

To address the performance issue with native_batch_norm_backward on MPS for 3D tensors, we can try the following steps:

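- Update PyTorch: Try the latest release or a nightly build, since updates often include MPS performance improvements and bug fixes. The report saw the same slowdown on both 2.7.1 and 2.10.0, so upgrading alone may not help.
- Use torch.nn.functional.batch_norm: Calling torch.nn.functional.batch_norm directly instead of the nn.BatchNorm3d module is a quick experiment, but nn.BatchNorm3d calls the same function internally, so both are expected to reach the same backward kernel.
- Disable native_batch_norm_backward: As far as we know, there is no documented switch that turns this kernel off on MPS (PYTORCH_ENABLE_MPS_FALLBACK=1 only covers ops that MPS does not implement). A workaround has to route around the op, for example by computing batch norm from basic tensor operations.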

Here's an example of how you can modify the Simple3DCNN model to use torch.nn.functional.batch_norm:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Simple3DCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm3d(32)
        self.conv2 = nn.Conv3d(32, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm3d(64)
        self.avg_pool = nn.AdaptiveAvgPool3d(1)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(64, 4)

    def forward(self, x):
        # Each convolution runs first; its output is then normalized with the
        # parameters and running statistics of the matching BatchNorm3d module.
        x = F.relu(F.batch_norm(self.conv1(x), self.bn1.running_mean, self.bn1.running_var, weight=self.bn1.weight, bias=self.bn1.bias, training=self.training, momentum=0.1, eps=1e-5))
        x = F.relu(F.batch_norm(self.conv2(x), self.bn2.running_mean, self.bn2.running_var, weight=self.bn2.weight, bias=self.bn2.bias, training=self.training, momentum=0.1, eps=1e-5))
        x = self.avg_pool(x)
        x = self.flatten(x)
        x = self.linear(x)
        return x
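
One way to make sure autograd never dispatches `native_batch_norm_backward` is to compute batch norm from basic tensor operations, so the backward pass is assembled from the gradients of `mean`, `var`, and elementwise ops. The module below is a minimal sketch under that assumption; `ManualBatchNorm3d` is a hypothetical name, not a PyTorch class, and whether it is actually faster on MPS has not been measured here. It typically uses more memory than the fused op because intermediate tensors are kept for backward.

```python
import torch
import torch.nn as nn


class ManualBatchNorm3d(nn.Module):
    """Hypothetical batch norm for (N, C, D, H, W) input built from basic ops."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        reduce_dims = (0, 2, 3, 4)  # every dimension except channels
        if self.training:
            mean = x.mean(dim=reduce_dims)
            var = x.var(dim=reduce_dims, unbiased=False)  # biased, used to normalize
            with torch.no_grad():
                # Running variance uses the unbiased estimate, as nn.BatchNorm3d does.
                # Assumes more than one value per channel (count > 1).
                count = x.numel() // x.shape[1]
                self.running_mean.lerp_(mean.detach(), self.momentum)
                self.running_var.lerp_(var.detach() * count / (count - 1), self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        shape = (1, -1, 1, 1, 1)  # broadcast per-channel values over N, D, H, W
        x_hat = (x - mean.view(shape)) * torch.rsqrt(var.view(shape) + self.eps)
        return x_hat * self.weight.view(shape) + self.bias.view(shape)


layer = ManualBatchNorm3d(8)
out = layer(torch.randn(2, 8, 4, 4, 4))
print(out.shape)
```

The sketch omits `num_batches_tracked` and the `momentum=None` mode, so its `state_dict` is not a drop-in match for `nn.BatchNorm3d`.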

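Setting momentum=None on torch.nn.BatchNorm3d is sometimes suggested as well, but it only switches the running statistics to a cumulative moving average. It does not disable the native_batch_norm_backward kernel, so the following change is unlikely to affect backward speed: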

self.bn1 = nn.BatchNorm3d(32, momentum=None)
self.bn2 = nn.BatchNorm3d(64, momentum=None)
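
Because per-channel batch statistics are taken over every non-channel element, they do not depend on how the spatial dimensions are laid out. Another experiment is therefore to fold depth into height and normalize a 4D view of the 5D tensor. The sketch below assumes that the 4D path is the fast one on MPS, as the 2D numbers in the report suggest; that assumption is untested here, and `FoldedBatchNorm3d` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


class FoldedBatchNorm3d(nn.BatchNorm2d):
    """Hypothetical helper: batch norm for 5D input computed on a 4D view."""

    def forward(self, x):
        if x.dim() != 5:
            raise ValueError(f"expected 5D input (N, C, D, H, W), got {x.dim()}D")
        n, c, d, h, w = x.shape
        # Merge depth into height; per-channel statistics are unchanged because
        # they are taken over every non-channel element either way.
        y = super().forward(x.reshape(n, c, d * h, w))
        return y.reshape(n, c, d, h, w)


# Parameter and buffer names match nn.BatchNorm3d, so weights can be copied over.
reference = nn.BatchNorm3d(8)
folded = FoldedBatchNorm3d(8)
folded.load_state_dict(reference.state_dict())

x = torch.randn(2, 8, 4, 6, 6)
max_diff = (reference(x) - folded(x)).abs().max().item()
print(f"max abs difference vs nn.BatchNorm3d: {max_diff:.2e}")
```

In the `Simple3DCNN` model this would replace `nn.BatchNorm3d(32)` and `nn.BatchNorm3d(64)`. Note that `reshape` may copy when the input is not contiguous, which would add overhead of its own.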

Verification

To verify that the fix worked, you can run the benchmark again and compare the results:

bench("cpu")
bench("mps")

If the performance issue is resolved, you should see a significant improvement in the throughput and speedup on MPS.
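
The full benchmark mixes convolution, pooling, and optimizer time, so it can help to time the batch norm layer in isolation as well. The following is a rough sketch, not part of the original report: the helper names and the small default shape are assumptions chosen so that it runs quickly, and it only measures MPS when that backend is available.

```python
import time

import torch
import torch.nn as nn


def sync(device):
    # MPS work is queued asynchronously; wait for it before reading the clock.
    if device.type == "mps":
        torch.mps.synchronize()


def time_forward_backward(layer, x, steps=5, warmup=2):
    """Average seconds per forward + backward pass through `layer`."""
    x = x.detach().clone().requires_grad_(True)
    grad_out = torch.randn_like(x)
    for _ in range(warmup):
        layer(x).backward(grad_out)
    sync(x.device)
    start = time.perf_counter()
    for _ in range(steps):
        layer(x).backward(grad_out)
    sync(x.device)
    return (time.perf_counter() - start) / steps


def compare_bn_layouts(device, shape=(2, 16, 16, 16, 16)):
    """Time BatchNorm3d on 5D input against BatchNorm2d on the same data in 4D."""
    n, c, d, h, w = shape
    x5 = torch.randn(shape, device=device)
    x4 = x5.reshape(n, c, d * h, w)  # same elements, 4D layout
    return {
        "BatchNorm3d, 5D input": time_forward_backward(nn.BatchNorm3d(c).to(device), x5),
        "BatchNorm2d, 4D input": time_forward_backward(nn.BatchNorm2d(c).to(device), x4),
    }


devices = [torch.device("cpu")]
if torch.backends.mps.is_available():
    devices.append(torch.device("mps"))

results = {dev.type: compare_bn_layouts(dev) for dev in devices}
for dev_type, timings in results.items():
    for label, seconds in timings.items():
        print(f"{dev_type:>3s} | {label}: {seconds * 1e3:.2f} ms per step")
```

Passing `shape=(4, 32, 64, 64, 64)` should correspond to the input of the first `BatchNorm3d` layer in the reproducer. If the 5D row on MPS stays far slower than the 4D row, the slowdown is still present.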

Extra Tips

Make sure to test your model on different hardware configurations to ensure that the fix works across
