pytorch - 💡(How to fix) Fix CPU: 2x CPU utilization regression in 2.9+ with no proportional wall-clock improvement [9 comments, 5 participants]

GitHub stats
pytorch/pytorch#177906 · Fetched 2026-04-08 01:03:00
Comments: 9 · Participants: 5 · Timeline events: 142 · Reactions: 0
Timeline (top): mentioned ×60, subscribed ×60, labeled ×11, commented ×9


🐛 Describe the bug

My application runs multiple threads: GPU inference, CPU preprocessing, and encoding. After upgrading from 2.8 to 2.9, CPU usage sits at 100% across all cores and the pipeline is actually slower due to thread contention. Downgrading to 2.8 fixes it.

I traced it to F.interpolate (bilinear, on CPU tensors), though I think it could affect any CPU tensor op. At the default ATen thread setting (24 on my machine), 2.9+ uses 2x the CPU% for ~20% less wall-clock time per call. In a multi-threaded app that's a net negative: the other threads are starved of cores and everything else stalls.

Both 2.8 and 2.9 report torch.get_num_threads() == 24, so the thread pool size didn't change. The kernels themselves appear to parallelize more aggressively in 2.9. Honestly, I'm not sure whether it's a bug or a feature.
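
Since the reported thread count is identical in both versions, it may help to compare the rest of the threading configuration of the two installs side by side. The following is a minimal sketch, not part of the original report: it assumes torch.__config__.parallel_info() and torch.get_num_interop_threads() are available in both versions, and threading_report is a hypothetical helper name.

```python
import torch


def threading_report() -> str:
    """Collect the threading-related settings of the current PyTorch install."""
    lines = [
        f"torch version    : {torch.__version__}",
        f"intra-op threads : {torch.get_num_threads()}",
        f"inter-op threads : {torch.get_num_interop_threads()}",
        # Build-level details, such as OpenMP/MKL settings and the ATen
        # parallel backend, as reported by PyTorch itself.
        torch.__config__.parallel_info(),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(threading_report())
```

Running this once under 2.8 and once under 2.9 and diffing the output would show whether anything besides the kernels changed, for example the OpenMP runtime or related environment variables.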

Results

Machine: 24-core CPU (i9-13900K)

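| ATen threads | 2.8 median | 2.8 CPU% | 2.9 median | 2.9 CPU% | 2.10 median | 2.10 CPU% |
|---|---|---|---|---|---|---|
| 1 | 104.9ms | 94% | 104.6ms | 91% | 105.9ms | 93% |
| 2 | 39.2ms | 165% | 37.5ms | 182% | 39.2ms | 182% |
| 4 | 21.9ms | 318% | 20.1ms | 395% | 19.3ms | 376% |
| 8 | 17.0ms | 429% | 15.3ms | 753% | 15.9ms | 753% |
| 12 | 15.9ms | 612% | 14.0ms | 1189% | 13.9ms | 1113% |
| 16 | 13.2ms | 884% | 11.6ms | 1513% | 11.3ms | 1491% |
| 24 | 11.8ms | 1123% | 9.1ms | 2270% | 8.7ms | 2265% |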

At default settings both versions report 24 ATen threads, but 2.9 actually saturates all of them: at 24 threads it burns 2270% CPU for a 23% wall-clock gain over 2.8 (9.1ms vs 11.8ms). That's poor efficiency, and in a real multi-threaded application the other threads get starved, making the whole pipeline slower.
2.8 at 16 threads does 13.2ms at 884% CPU. 2.9 at 12 threads does 14.0ms at 1189%: roughly the same speed but 34% more CPU, even with fewer threads in use.
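
The efficiency argument can be made concrete by turning each table row into CPU time per workload run, that is, median wall time multiplied by the average number of busy cores. This is only a rough sketch: the figures are copied from the Results table, the timing and the CPU% were measured in separate loops, and cpu_ms is a hypothetical helper.

```python
# (ATen threads, median ms, CPU %) copied from the Results table.
RESULTS = {
    "2.8": [(1, 104.9, 94), (2, 39.2, 165), (4, 21.9, 318), (8, 17.0, 429),
            (12, 15.9, 612), (16, 13.2, 884), (24, 11.8, 1123)],
    "2.9": [(1, 104.6, 91), (2, 37.5, 182), (4, 20.1, 395), (8, 15.3, 753),
            (12, 14.0, 1189), (16, 11.6, 1513), (24, 9.1, 2270)],
    "2.10": [(1, 105.9, 93), (2, 39.2, 182), (4, 19.3, 376), (8, 15.9, 753),
             (12, 13.9, 1113), (16, 11.3, 1491), (24, 8.7, 2265)],
}


def cpu_ms(median_ms: float, cpu_percent: float) -> float:
    """Approximate CPU time per run: wall time x average busy cores."""
    return median_ms * cpu_percent / 100.0


if __name__ == "__main__":
    for version, rows in RESULTS.items():
        for threads, median_ms, cpu_percent in rows:
            cost = cpu_ms(median_ms, cpu_percent)
            print(f"{version:>5} {threads:>3} threads: {cost:7.1f} CPU-ms per run")
```

At 24 threads this works out to about 133 CPU-ms per run on 2.8 versus about 207 CPU-ms on 2.9, so the faster call costs roughly half again as much total CPU.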

Reproduction

You can remove the CPU% measurement if that makes the script simpler to run.

import os
import sys
import threading
import time

import psutil
import torch
import torch.nn.functional as F
from torch.utils.benchmark import Timer

proc = psutil.Process(os.getpid())
# Prime psutil: the first cpu_percent() call only sets the baseline for later calls.
proc.cpu_percent()

NUM_OPS = 25  # images resized per workload() call
IN_C, IN_H, IN_W = 3, 256, 256
OUT_H, OUT_W = 570, 540
CPU_MEASURE_SECONDS = 6


def workload(sources):
    # Convert each uint8 image to float and upscale it with bilinear interpolation.
    for src in sources:
        F.interpolate(
            src.unsqueeze(0).float(),
            size=(OUT_H, OUT_W),
            mode="bilinear",
            align_corners=False,
        )


def measure_cpu_percent(sources, duration):
    # Run the workload for `duration` seconds while a monitor thread samples
    # process-wide CPU% (100% == one fully busy core).
    stop = threading.Event()
    cpu_samples: list[float] = []

    def monitor():
        while not stop.is_set():
            cpu_samples.append(proc.cpu_percent(interval=0.5))

    mon = threading.Thread(target=monitor)
    mon.start()

    t0 = time.perf_counter()
    while time.perf_counter() - t0 < duration:
        workload(sources)

    stop.set()
    mon.join()
    # Average of the 0.5s samples; 0 if no sample was collected.
    return sum(cpu_samples) / len(cpu_samples) if cpu_samples else 0


if __name__ == "__main__":
    print(f"PyTorch   {torch.__version__}")
    print(f"Python    {sys.executable}")
    print(f"ATen thr  {torch.get_num_threads()}")
    print(f"Workload  {NUM_OPS}x F.interpolate bilinear ({IN_C},{IN_H},{IN_W}) -> ({OUT_H},{OUT_W})")
    print()

    sources = [torch.randint(0, 255, (IN_C, IN_H, IN_W), dtype=torch.uint8) for _ in range(NUM_OPS)]

    print(f"{'threads':>7}  {'median':>9}  {'IQR':>13}  {'CPU%':>7}")
    print(f"{'-------':>7}  {'---------':>9}  {'-------------':>13}  {'------':>7}")
    for nt in [1, 2, 4, 8, 12, 16, 24]:
        torch.set_num_threads(nt)

        timer = Timer(
            stmt="workload(sources)",
            globals={"workload": workload, "sources": sources},
            num_threads=nt,
        )
        result = timer.blocked_autorange(min_run_time=3)

        cpu_pct = measure_cpu_percent(sources, CPU_MEASURE_SECONDS)

        median_ms = result.median * 1000
        iqr_ms = result.iqr * 1000
        print(f"{nt:>7}  {median_ms:>8.1f}ms  (IQR={iqr_ms:>5.2f}ms)  {cpu_pct:>6.0f}%")

Versions

PyTorch 2.8 environment:

PyTorch version: 2.8.0+cu128
CUDA used to build PyTorch: 12.8
OS: Microsoft Windows 11 Pro (10.0.26200 64-bit)
Python version: 3.13.12 (64-bit runtime)
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 595.79
CPU: 13th Gen Intel(R) Core(TM) i9-13900K

PyTorch 2.9 environment:

PyTorch version: 2.9.1+cu130
CUDA used to build PyTorch: 13.0
OS: Microsoft Windows 11 Pro (10.0.26200 64-bit)
Python version: 3.13.11 (64-bit runtime)
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 595.79
CPU: 13th Gen Intel(R) Core(TM) i9-13900K

cc @jerryzh168 @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @aditew01 @albanD @pragupta @mruberry @jbschlosser @walterddr @mikaylagawarecki

extent analysis

Fix Plan

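This is a workaround rather than a fix: it bounds how many cores each CPU op may use, but it does not restore the 2.8 behavior. To limit CPU usage on PyTorch 2.9+, try the following:

  • Cap intra-op threads: call torch.set_num_threads() with a value below the default. This is the setting the benchmark above varies, and it controls how many threads a single op such as F.interpolate uses. In the reported results, 2.9 at 12 threads (14.0ms, 1189% CPU) uses about as much CPU as 2.8 at 24 threads (11.8ms, 1123% CPU) but is slower, so expect a trade-off rather than parity.
  • Do not rely on torch.set_num_interop_threads() for this: it sizes the separate inter-op pool (used, for example, by the JIT interpreter), not the threads inside a single kernel. It can also only be called once, before any inter-op parallel work has started.

Here's an example of how to set the number of threads:

import torch

# Cap intra-op parallelism (threads used inside a single CPU op).
# 16 is only a starting point; the default on the reporter's machine was 24.
torch.set_num_threads(16)

# Confirm the setting took effect.
print(torch.get_num_threads())

Experiment with different values to find the best balance for your pipeline, leaving enough cores free for the other threads in the application.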
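
One way to apply a cap only around the CPU-heavy step, rather than for the whole run, is a small context manager. This is a sketch under assumptions: limit_intra_op_threads and resize are hypothetical helpers, the value 8 is a placeholder rather than a recommendation, and whether torch.set_num_threads() affects only the calling thread or the whole process depends on the threading backend of your build, so verify that in your own application.

```python
import contextlib

import torch
import torch.nn.functional as F


@contextlib.contextmanager
def limit_intra_op_threads(n: int):
    """Temporarily cap intra-op parallelism, restoring the old value on exit."""
    previous = torch.get_num_threads()
    torch.set_num_threads(n)
    try:
        yield
    finally:
        torch.set_num_threads(previous)


def resize(src: torch.Tensor) -> torch.Tensor:
    """Same bilinear resize as the benchmark workload in the report."""
    return F.interpolate(
        src.unsqueeze(0).float(),
        size=(570, 540),
        mode="bilinear",
        align_corners=False,
    )


if __name__ == "__main__":
    image = torch.randint(0, 256, (3, 256, 256), dtype=torch.uint8)
    with limit_intra_op_threads(8):  # placeholder value; tune for your pipeline
        out = resize(image)
    print(out.shape)
```

The try/finally ensures the previous thread count is restored even if the wrapped op raises.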

Verification

To verify the change, rerun the benchmark above with the new thread setting and compare both the median time and the CPU% against the Results table.

# Run the benchmarking code with the modified thread settings
for nt in [1, 2, 4, 8, 12, 16, 24]:
    torch.set_num_threads(nt)
    # ... (rest of the benchmarking code)
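
If psutil is not available, a rough cross-check of the CPU% column is to compare process CPU time with wall-clock time using only the standard library. The sketch below relies on time.process_time() reporting process-wide user plus system CPU time, so work done by all threads is included. Both measure_cores_used and the stand-in workload are hypothetical; in practice you would pass the benchmark's workload(sources) call instead.

```python
import time
from typing import Callable


def measure_cores_used(fn: Callable[[], object], min_seconds: float = 1.0) -> dict:
    """Run fn repeatedly and report wall time, CPU time and average busy cores."""
    calls = 0
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while True:
        fn()
        calls += 1
        wall = time.perf_counter() - wall_start
        if wall >= min_seconds:
            break
    cpu = time.process_time() - cpu_start
    return {
        "calls": calls,
        "wall_ms_per_call": 1000.0 * wall / calls,
        "cpu_ms_per_call": 1000.0 * cpu / calls,
        # 1.0 means one fully busy core; multiply by 100 to compare with CPU%.
        "avg_cores": cpu / wall,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with `lambda: workload(sources)` from the benchmark.
    stats = measure_cores_used(lambda: sum(i * i for i in range(100_000)))
    print(stats)
```

A drop in avg_cores without a matching rise in wall_ms_per_call would indicate that the thread cap is saving CPU at little cost in latency.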

Extra Tips

  • Be cautious when setting the number of threads: oversubscribing cores degrades performance through context switching and contention, which is what the report describes for the multi-threaded pipeline.
  • torch.set_num_interop_threads() is not a finer-grained alternative to torch.set_num_threads(); the two control different thread pools (inter-op and intra-op, respectively).
  • Measure CPU usage alongside wall-clock time when tuning, since in the reported results a faster call can still cost far more total CPU.
