pytorch - ✅(Solved) Fix Significant runtime overhead for standalone_compile [1 pull requests, 8 comments, 4 participants]

pytorch • 2026-03-17 15:59:45

GitHub stats

pytorch/pytorch#177655•Fetched 2026-04-08 00:57:11

Timeline (top)

mentioned ×25 • subscribed ×25 • labeled ×10 • commented ×8

Fix Action

Fixed

Fixed by PR: grouping assert_size_stride to Improve runtime overhead for standalone compile (https://github.com/pytorch/pytorch/pull/177698)

PR fix notes

PR #177698: grouping assert_size_stride to Improve runtime overhead for standalone compile

Repository: pytorch/pytorch
Author: trichmo
State: open | merged: False
Link: https://github.com/pytorch/pytorch/pull/177698

Description (problem / solution / changelog)

Contributes to #177655

Codegen utilizes a new assert_size_stride_grouped function to combine all assert_size_stride calls at the beginning of a kernel so they share launch overhead.
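The call-shape change described above can be sketched in plain Python. This is only an illustration under stated assumptions: the source does not show the signature of assert_size_stride_grouped, so `check_one`, `check_grouped`, and `FakeTensor` below are hypothetical stand-ins, and the comparison is simplified to exact equality of sizes and strides. The point is that the generated wrapper makes one call covering every input instead of one call per input.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Spec = Tuple[Tuple[int, ...], Tuple[int, ...]]  # (expected size, expected stride)


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only carries layout metadata."""
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]


def check_one(t: FakeTensor, size: Sequence[int], stride: Sequence[int], name: str) -> None:
    """Per-tensor check: one call (and its overhead) for each input."""
    if t.shape != tuple(size) or t.strides != tuple(stride):
        raise AssertionError(
            f"{name}: expected size={tuple(size)}, stride={tuple(stride)}, "
            f"got size={t.shape}, stride={t.strides}"
        )


def check_grouped(tensors: Sequence[FakeTensor], specs: Sequence[Spec]) -> None:
    """Grouped check: a single call validates every input in one pass."""
    if len(tensors) != len(specs):
        raise AssertionError("number of tensors and specs differ")
    for idx, (t, (size, stride)) in enumerate(zip(tensors, specs)):
        check_one(t, size, stride, f"arg{idx}")


args = [FakeTensor((8, 16), (16, 1)), FakeTensor((16,), (1,))]
specs = [((8, 16), (16, 1)), ((16,), (1,))]

# Before: the wrapper issues one check per input.
for i, (t, (size, stride)) in enumerate(zip(args, specs)):
    check_one(t, size, stride, f"arg{i}")

# After: the wrapper issues a single grouped check.
check_grouped(args, specs)
```

In the actual PR the grouped check is implemented in torch/csrc/dynamo/guards.cpp, so the saving would come from crossing into native code once rather than once per input; the Python loop here only mirrors the structure.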

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela

Changed files

torch/_C/_dynamo/guards.pyi (modified, +5/-0)
torch/_inductor/codegen/wrapper.py (modified, +14/-1)
torch/csrc/dynamo/guards.cpp (modified, +101/-0)

To reproduce:

MODEL="meta-llama/meta-llama-3-8b"
vllm serve $MODEL \
  -cc.cudagraph_mode=none \
  --profiler-config.profiler=torch \
  --profiler-config.torch_profiler_dir=./trace \
  --profiler-config.torch_profiler_with_stack=false \
  --port 2345
  
vllm bench serve \
  --dataset-name random \
  --ignore-eos \
  --input-len 512 \
  --output-len 4 \
  --profile \
  --model $MODEL \
  --max-concurrency 8 \
  --num-prompts 8 \
  --port 2345

We have turned off CUDA graphs to demonstrate the overhead. Empirically, we have sometimes seen the overhead in models with CUDA graphs on (when we are beyond the max cudagraph size), which has been surprising to me.

Zooming into the decode sections, this gives a trace that looks like the following (on my 8xH100 machine):

The space between the blue bars is entirely CPU-side work. We're looking at 89us.

Setting torch_profiler_with_stack=true gives us some more visibility into what is going on. The region looks a bit larger due to profiler overhead (138us):

Getting from invoking the compiled artifact returned by standalone_compile to the first kernel call in the Inductor output code takes a significant part of that time, 83us:

The AOT autograd runtime wrappers also take some time (19us), though not a lot of it:
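As a rough worked breakdown of the figures quoted above, the shares can be computed directly. This assumes both the 83us and the 19us were measured inside the same 138us profiled region; the report does not say whether the 19us is nested inside the 83us, so the two are not summed here. The `share` helper is introduced only for this illustration.

```python
# Figures quoted in the report, in microseconds, with stack collection enabled.
PROFILED_REGION_US = 138
INVOKE_TO_FIRST_KERNEL_US = 83
AOT_WRAPPERS_US = 19


def share(part_us: float, whole_us: float) -> float:
    """Return part_us as a percentage of whole_us, rounded to one decimal."""
    return round(100 * part_us / whole_us, 1)


invoke_share = share(INVOKE_TO_FIRST_KERNEL_US, PROFILED_REGION_US)
aot_share = share(AOT_WRAPPERS_US, PROFILED_REGION_US)

print(f"invoke -> first kernel: {invoke_share}% of the profiled region")
print(f"AOT autograd wrappers:  {aot_share}% of the profiled region")
```

Under that assumption, the path to the first kernel is roughly 60% of the profiled region and the AOT autograd wrappers roughly 14%.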

Some links if you work at meta:

cc @jerryzh168 @chauhang @penguinwu @avikchaudhuri @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @yushangdi @benjaminglass1 @jataylo @iupaikov-amd

extent analysis

Fix Plan

To address the CPU-side overhead, we'll focus on the path from invoking the compiled artifact to the first kernel call, which dominates the measured gap, and note the smaller cost of the AOT autograd runtime wrappers.

Step-by-Step Solution

Locate the overhead in the decode sections:
- Run the vllm serve and vllm bench serve commands from the repro with CUDA graphs disabled, so that the CPU-side gaps between kernels are visible in the trace.
- Set torch_profiler_with_stack=true to attribute the gap to individual frames, keeping in mind that stack collection inflates the measured region (138us versus 89us in the report).
Reduce invocation overhead:
- Inspect the path from invoking the artifact returned by standalone_compile to the first kernel call in the Inductor output code, which accounted for 83us in the report.
- PR #177698 targets this path: codegen emits a single assert_size_stride_grouped call in place of the individual assert_size_stride calls at the beginning of a kernel.
Minimize AOT autograd runtime wrapper overhead:
- The report measured 19us in the AOT autograd runtime wrappers, a smaller share than the invocation path. The changed files listed for the PR do not touch these wrappers, so this cost remains to be investigated separately.

Example Code

To inspect where the time goes, let's assume we're using PyTorch and the torch.profiler module to profile the model. The following snippet only collects a trace; it does not itself reduce the overhead. The model and inputs are small placeholders:

import torch
import torch.profiler

# Stand-in model and inputs (placeholders; substitute your own)
model = torch.nn.Linear(16, 16)
dataset = [torch.randn(8, 16) for _ in range(10)]

# Profile 2 cycles of: 1 skipped step, 1 warmup step, 3 recorded steps
with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=3,
        repeat=2
    ),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs')
) as p:
    # Run the model and advance the profiler schedule after each batch
    for batch in dataset:
        output = model(batch)
        p.step()

This code snippet profiles the model using the torch.profiler module and writes each recorded cycle as a trace file under ./logs, in a format the TensorBoard profiler plugin can read.

Verification

To verify that the fix worked, rerun the vllm serve and vllm bench serve commands on a PyTorch build that includes the change, and compare the decode-section trace with the original. If the fix is effective, the CPU-side gap between kernels (89us in the report, without stack collection) should shrink.
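As a coarse cross-check alongside the trace, the host-side cost of a call can be timed directly. The helper below is a minimal sketch, not part of PyTorch or vLLM: `mean_call_latency_us` is a hypothetical name, and it measures wall-clock time on the CPU only. When the callable launches GPU work asynchronously, it captures launch-side cost rather than kernel execution time.

```python
import time
from typing import Callable


def mean_call_latency_us(fn: Callable[[], object], warmup: int = 10, iters: int = 1000) -> float:
    """Return the mean wall-clock time per call of fn, in microseconds."""
    # Warm up so one-time costs (caches, lazy initialization) are excluded.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed_s = time.perf_counter() - start
    return elapsed_s / iters * 1e6


# Usage: time a trivial callable to see the floor of the measurement itself.
baseline_us = mean_call_latency_us(lambda: None)
print(f"empty-call baseline: {baseline_us:.3f} us")
```

Running the same helper on the compiled callable before and after the change gives a before/after pair to compare, with the empty-call baseline indicating how much of the number is measurement noise.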

Extra Tips

Use the torch.profiler module to profile the model and identify performance bottlenecks.
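Disable CUDA graphs (-cc.cudagraph_mode=none) while profiling so that the CPU-side overhead is exposed; the report notes that the overhead can also show up with CUDA graphs on when the max cudagraph size is exceeded.
Leave torch_profiler_with_stack off when measuring absolute gaps, since stack collection adds profiler overhead (138us versus 89us in the report).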
