pytorch - 💡(How to fix) Fix CPU: 2x CPU utilization regression in 2.9+ with no proportional wall-clock improvement [9 comments, 5 participants]

🐛 Describe the bug

My application runs multiple threads — GPU inference, CPU preprocessing, and encoding. After upgrading from 2.8 to 2.9, CPU sits at 100% across all cores and the pipeline is actually slower due to thread contention. Downgrading to 2.8 fixes it.

I traced it to F.interpolate (bilinear, on CPU tensors, though I think it could be any CPU tensor op). At the default ATen thread setting (24 on my machine), 2.9+ uses 2x the CPU% for ~20% less wall-clock time per call. In a multi-threaded app that's a net negative: the cores are starved and everything else stalls.

Both 2.8 and 2.9 report torch.get_num_threads() == 24, so the thread pool size didn't change. The kernels themselves just parallelize more aggressively in 2.9. I'm honestly not sure whether it's a bug or a feature.

Results

Machine: 24-core CPU (i9-13900K)

ATen threads	2.8 median	2.8 CPU%	2.9 median	2.9 CPU%	2.10 median	2.10 CPU%
1	104.9ms	94%	104.6ms	91%	105.9ms	93%
2	39.2ms	165%	37.5ms	182%	39.2ms	182%
4	21.9ms	318%	20.1ms	395%	19.3ms	376%
8	17.0ms	429%	15.3ms	753%	15.9ms	753%
12	15.9ms	612%	14.0ms	1189%	13.9ms	1113%
16	13.2ms	884%	11.6ms	1513%	11.3ms	1491%
24	11.8ms	1123%	9.1ms	2270%	8.7ms	2265%

At default settings both versions report 24 ATen threads, but 2.9 actually saturates all of them — at 24 threads it burns 2270% CPU for a 23% wall-clock gain over 2.8. That's terrible efficiency and in a real multi-threaded application the other threads get starved, making the whole pipeline slower.
2.8 at 16 threads does 13.2ms at 884% CPU. 2.9 at 12 threads does 14.0ms at 1189%: roughly the same speed but 34% more CPU, even with fewer threads in use.
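
One way to put a number on the efficiency loss is to multiply the median wall-clock time by the measured CPU%, which approximates the CPU time consumed per workload pass. The sketch below does this for the 2.8 and 2.9 columns of the table above. The helper names (`cpu_ms`, `cost_table`) are illustrative, and the result is only approximate because the timing and the CPU% were sampled in separate loops.

```python
# Rows copied from the Results table: threads -> (median ms, CPU %).
RESULTS = {
    "2.8": {1: (104.9, 94), 2: (39.2, 165), 4: (21.9, 318), 8: (17.0, 429),
            12: (15.9, 612), 16: (13.2, 884), 24: (11.8, 1123)},
    "2.9": {1: (104.6, 91), 2: (37.5, 182), 4: (20.1, 395), 8: (15.3, 753),
            12: (14.0, 1189), 16: (11.6, 1513), 24: (9.1, 2270)},
}


def cpu_ms(median_ms, cpu_percent):
    """Approximate CPU time (core-milliseconds) spent per workload pass."""
    return median_ms * cpu_percent / 100.0


def cost_table(results):
    """Return {version: {threads: core-ms per pass}} for every table row."""
    return {
        version: {nt: cpu_ms(*row) for nt, row in rows.items()}
        for version, rows in results.items()
    }


if __name__ == "__main__":
    costs = cost_table(RESULTS)
    for nt in RESULTS["2.8"]:
        old, new = costs["2.8"][nt], costs["2.9"][nt]
        print(f"{nt:>2} threads: 2.8 {old:6.1f} core-ms, "
              f"2.9 {new:6.1f} core-ms, ratio {new / old:.2f}x")
```

At 24 threads this works out to roughly 133 core-ms per pass on 2.8 versus roughly 207 on 2.9, i.e. about 1.56x the CPU work for the 23% wall-clock gain.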

Reproduction

You can remove the CPU% measurement if that makes it simpler to run.

import os
import sys
import threading
import time

import psutil
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

proc = psutil.Process(os.getpid())
# Prime psutil's CPU counter (the first non-blocking call returns a meaningless 0.0).
proc.cpu_percent()

NUM_OPS = 25
IN_C, IN_H, IN_W = 3, 256, 256
OUT_H, OUT_W = 570, 540
CPU_MEASURE_SECONDS = 6


def workload(sources):
    # One pass = NUM_OPS bilinear upscales of a uint8 image converted to float.
    for src in sources:
        F.interpolate(
            src.unsqueeze(0).float(),
            size=(OUT_H, OUT_W),
            mode="bilinear",
            align_corners=False,
        )


def measure_cpu_percent(sources, duration):
    # Run the workload in a loop for `duration` seconds while a background
    # thread samples process-wide CPU% every 0.5s; return the mean sample.
    stop = threading.Event()
    cpu_samples: list[float] = []

    def monitor():
        while not stop.is_set():
            cpu_samples.append(proc.cpu_percent(interval=0.5))

    mon = threading.Thread(target=monitor)
    mon.start()

    t0 = time.perf_counter()
    while time.perf_counter() - t0 < duration:
        workload(sources)

    stop.set()
    mon.join()
    return sum(cpu_samples) / len(cpu_samples) if cpu_samples else 0


if __name__ == "__main__":
    print(f"PyTorch   {torch.__version__}")
    print(f"Python    {sys.executable}")
    print(f"ATen thr  {torch.get_num_threads()}")
    print(f"Workload  {NUM_OPS}x F.interpolate bilinear ({IN_C},{IN_H},{IN_W}) -> ({OUT_H},{OUT_W})")
    print()

    sources = [torch.randint(0, 255, (IN_C, IN_H, IN_W), dtype=torch.uint8) for _ in range(NUM_OPS)]

    print(f"{'threads':>7}  {'median':>9}  {'IQR':>13}  {'CPU%':>7}")
    print(f"{'-------':>7}  {'---------':>9}  {'-------------':>13}  {'------':>7}")
    for nt in [1, 2, 4, 8, 12, 16, 24]:
        torch.set_num_threads(nt)

        timer = Timer(
            stmt="workload(sources)",
            globals={"workload": workload, "sources": sources},
            num_threads=nt,
        )
        result = timer.blocked_autorange(min_run_time=3)

        cpu_pct = measure_cpu_percent(sources, CPU_MEASURE_SECONDS)

        median_ms = result.median * 1000
        iqr_ms = result.iqr * 1000
        print(f"{nt:>7}  {median_ms:>8.1f}ms  (IQR={iqr_ms:>5.2f}ms)  {cpu_pct:>6.0f}%")

Versions

PyTorch version: 2.8.0+cu128
CUDA used to build PyTorch: 12.8
OS: Microsoft Windows 11 Pro (10.0.26200 64-bit)
Python version: 3.13.12 (64-bit runtime)
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 595.79
CPU: 13th Gen Intel(R) Core(TM) i9-13900K

and

PyTorch version: 2.9.1+cu130
CUDA used to build PyTorch: 13.0
OS: Microsoft Windows 11 Pro (10.0.26200 64-bit)
Python version: 3.13.11 (64-bit runtime)
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 595.79
CPU: 13th Gen Intel(R) Core(TM) i9-13900K

cc @jerryzh168 @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @albanD @pragupta @mruberry @jbschlosser @walterddr @mikaylagawarecki

extent analysis

Fix Plan

To address the issue of high CPU usage in PyTorch 2.9, we can try the following steps:

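Set the number of intra-op threads: Call torch.set_num_threads() with a value below the default so that kernels such as F.interpolate leave cores free for the rest of the pipeline. Per the Results table, 2.9 at 8 threads (15.3ms, 753% CPU) stays below the CPU usage of 2.8 at its default of 24 threads (11.8ms, 1123% CPU), while 2.9 at 16 threads (1513% CPU) already exceeds it.
Do not rely on torch.set_num_interop_threads() for this: It controls inter-op parallelism (how many independent operations can run concurrently), not how many threads a single kernel uses, so it is unlikely to reduce the CPU usage of an individual F.interpolate call.

Here's an example of how to set the number of threads:

import torch

# Cap intra-op parallelism. 8 is taken from the Results table above and is
# only a starting point; re-measure on your own hardware.
torch.set_num_threads(8)

You can experiment with different values to find the optimal number of threads for your specific use case.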
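
As a starting point for that experiment, the measurements in the Results table can be searched for the fastest thread count that stays within a CPU budget. This is a minimal sketch: `pick_threads` and the budget value are hypothetical, and the numbers are the 2.9 column from the reporter's machine, so they would need to be re-measured on other hardware.

```python
# 2.9 column of the Results table: threads -> (median ms, CPU %).
MEASURED_2_9 = {1: (104.6, 91), 2: (37.5, 182), 4: (20.1, 395), 8: (15.3, 753),
                12: (14.0, 1189), 16: (11.6, 1513), 24: (9.1, 2270)}


def pick_threads(measured, cpu_budget_percent):
    """Return the fastest thread count whose CPU% fits the budget, or None."""
    within = {nt: row for nt, row in measured.items()
              if row[1] <= cpu_budget_percent}
    if not within:
        return None
    # Lowest median wall-clock time among the candidates that fit the budget.
    return min(within, key=lambda nt: within[nt][0])


if __name__ == "__main__":
    # Budget: the CPU% that 2.8 used at its default of 24 threads.
    print(pick_threads(MEASURED_2_9, cpu_budget_percent=1123))
```

With a budget of 1123% this selects 8 threads (15.3ms), which suggests that matching 2.8's CPU footprint on 2.9 costs some wall-clock time compared with 2.8's 11.8ms.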

Verification

To verify that the fix worked, you can run the provided benchmarking code with the modified thread settings and check the CPU usage and performance metrics.

# Run the benchmarking code with the modified thread settings
for nt in [1, 2, 4, 8, 12, 16, 24]:
    torch.set_num_threads(nt)
    # ... (rest of the benchmarking code)
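
If installing psutil is inconvenient, a rough process-wide CPU% can be derived from the standard library by dividing the change in `time.process_time()` by the change in wall-clock time. The helper below is a sketch with a hypothetical name (`cpu_percent_of`). It relies on `time.process_time()` being process-wide, as the Python documentation describes, and it has not been validated against the psutil numbers in the table.

```python
import time


def cpu_percent_of(fn, duration=1.0):
    """Call fn() repeatedly for about `duration` seconds; return (CPU %, passes).

    CPU % is process CPU time divided by wall-clock time, so a value near
    100 means one fully busy core and 2400 means 24 fully busy cores.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    passes = 0
    while time.perf_counter() - wall_start < duration:
        fn()
        passes += 1
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return 100.0 * cpu / wall, passes


if __name__ == "__main__":
    # Stand-in workload; in the benchmark this would be
    # `lambda: workload(sources)` after torch.set_num_threads(nt).
    pct, passes = cpu_percent_of(lambda: sum(range(100_000)), duration=0.5)
    print(f"{pct:.0f}% CPU over {passes} passes")
```

Because it brackets the whole loop rather than sampling every 0.5s, this gives a single average instead of a series of samples.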

Extra Tips

Be cautious when setting the number of threads: too many threads can degrade performance through context switching and contention with the application's other threads.
torch.set_num_interop_threads() is not a substitute for torch.set_num_threads(): it governs inter-op parallelism, whereas the CPU usage measured in this issue comes from intra-op parallelism inside a single op.
Re-run the benchmark after every change, since the best setting depends on the hardware and on what else the application is running.
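
To keep a thread cap from leaking into unrelated code, the cap can be scoped to the CPU-heavy call and restored afterwards. The context manager below is a sketch, not an official PyTorch utility: `intraop_threads` and `resize` are hypothetical names, only `torch.get_num_threads()` and `torch.set_num_threads()` are used, and it assumes a build where `set_num_threads` can be called repeatedly, as the benchmark in this issue does. It should be entered on the same thread that runs the op; how the setting interacts with other threads is not covered here.

```python
from contextlib import contextmanager

import torch
import torch.nn.functional as F


@contextmanager
def intraop_threads(n):
    """Temporarily set the intra-op thread count, restoring it on exit."""
    previous = torch.get_num_threads()
    torch.set_num_threads(n)
    try:
        yield
    finally:
        torch.set_num_threads(previous)


def resize(src, size=(570, 540), threads=8):
    """Bilinear resize of a (C, H, W) uint8 image with a capped thread count."""
    with intraop_threads(threads):
        return F.interpolate(
            src.unsqueeze(0).float(),
            size=size,
            mode="bilinear",
            align_corners=False,
        )


if __name__ == "__main__":
    image = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
    out = resize(image)
    print(out.shape)
```

The default of 8 threads mirrors the 2.9 row of the Results table that stays under 2.8's default CPU usage; treat it as a placeholder to tune.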
