pytorch - ✅(Solved) Fix Significant runtime overhead for standalone_compile [1 pull request, 8 comments, 4 participants]

GitHub stats
pytorch/pytorch#177655 · Fetched 2026-04-08 00:57:11
Comments: 8 · Participants: 4 · Timeline: 73 · Reactions: 1
Timeline (top): mentioned ×25, subscribed ×25, labeled ×10, commented ×8

Fix Action

Fixed

PR fix notes

PR #177698: grouping assert_size_stride to Improve runtime overhead for standalone compile

Description (problem / solution / changelog)

Contributes to #177655

Codegen utilizes a new assert_size_stride_grouped function to combine all assert_size_stride calls at the beginning of a kernel so they share launch overhead.
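
The grouping idea can be illustrated with a minimal pure-Python sketch. This is not the code from the PR: `FakeTensor`, `check_size_stride`, and `check_size_stride_grouped` are hypothetical stand-ins, and the real signature of assert_size_stride_grouped is not shown on this page. The sketch only shows the shape of the change: one call that validates a list of (tensor, size, stride) triples replaces N separate calls, so any fixed per-call cost is paid once.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class FakeTensor:
    """Hypothetical stand-in that carries only the metadata being checked."""
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]


def check_size_stride(t: FakeTensor, size: Sequence[int], stride: Sequence[int]) -> None:
    """Per-tensor check: the caller pays one call per input."""
    if t.shape != tuple(size) or t.strides != tuple(stride):
        raise AssertionError(
            f"expected size {tuple(size)} / stride {tuple(stride)}, "
            f"got size {t.shape} / stride {t.strides}"
        )


def check_size_stride_grouped(
    checks: Iterable[Tuple[FakeTensor, Sequence[int], Sequence[int]]]
) -> None:
    """Grouped check: a single call validates every (tensor, size, stride) triple."""
    for t, size, stride in checks:
        check_size_stride(t, size, stride)


a = FakeTensor(shape=(2, 3), strides=(3, 1))
b = FakeTensor(shape=(4,), strides=(1,))

# Before: one call per input.
check_size_stride(a, (2, 3), (3, 1))
check_size_stride(b, (4,), (1,))

# After: one call for all inputs.
check_size_stride_grouped([(a, (2, 3), (3, 1)), (b, (4,), (1,))])
```

In pure Python the grouped version saves little, since the loop still runs in the interpreter. The benefit described in the PR would come from crossing into the native implementation once instead of once per tensor.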

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela

Changed files

  • torch/_C/_dynamo/guards.pyi (modified, +5/-0)
  • torch/_inductor/codegen/wrapper.py (modified, +14/-1)
  • torch/csrc/dynamo/guards.cpp (modified, +101/-0)


to repro:

MODEL="meta-llama/meta-llama-3-8b"
vllm serve $MODEL \
  -cc.cudagraph_mode=none \
  --profiler-config.profiler=torch \
  --profiler-config.torch_profiler_dir=./trace \
  --profiler-config.torch_profiler_with_stack=false \
  --port 2345
  
vllm bench serve \
  --dataset-name random \
  --ignore-eos \
  --input-len 512 \
  --output-len 4 \
  --profile \
  --model $MODEL \
  --max-concurrency 8 \
  --num-prompts 8 \
  --port 2345

We have turned off cudagraphs to demonstrate the overhead. Empirically, we have sometimes seen this overhead in models with cudagraphs on (when we are beyond the max cudagraph size), which has been surprising to me.

Zooming into the decode sections, this gives a trace that looks like the following (on my 8xH100 machine):

<img width="303" height="388" alt="Image" src="https://github.com/user-attachments/assets/78847b46-afa2-47f9-9dd8-8ee85bc183b3" />

The space between the blue bars is entirely CPU-side work. We're looking at 89us.
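
A gap like this can also be computed from an exported trace rather than read off the timeline. The sketch below assumes Chrome-trace-style complete events ("ph" equal to "X") with "ts" and "dur" in microseconds, and assumes GPU kernels carry the category "kernel"; both are assumptions about the trace file, and `kernel_gaps_us` is a hypothetical helper, not part of any profiler API.

```python
from typing import Dict, List


def kernel_gaps_us(events: List[Dict], kernel_cat: str = "kernel") -> List[float]:
    """Return the idle gaps (in microseconds) between consecutive kernel events.

    Assumes complete events ("ph" == "X") with "ts" and "dur" in microseconds.
    The category name used to select kernels is an assumption.
    """
    kernels = sorted(
        (e for e in events if e.get("ph") == "X" and e.get("cat") == kernel_cat),
        key=lambda e: e["ts"],
    )
    gaps: List[float] = []
    prev_end = None
    for e in kernels:
        if prev_end is not None:
            # Overlapping kernels produce a zero gap rather than a negative one.
            gaps.append(max(0.0, e["ts"] - prev_end))
        end = e["ts"] + e["dur"]
        prev_end = end if prev_end is None else max(prev_end, end)
    return gaps


# Synthetic events chosen to mirror the 89us gap described above.
sample = [
    {"ph": "X", "cat": "kernel", "ts": 0.0, "dur": 10.0},
    {"ph": "X", "cat": "cpu_op", "ts": 12.0, "dur": 80.0},
    {"ph": "X", "cat": "kernel", "ts": 99.0, "dur": 5.0},
]
print(kernel_gaps_us(sample))  # [89.0]
```

For a real trace, the event list would typically come from the "traceEvents" key of the exported JSON file, assuming the standard Chrome trace layout.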

Setting torch_profiler_with_stack=true gives us some more visibility into what is going on. The region looks a bit larger due to profiler overhead (138us):

<img width="772" height="470" alt="Image" src="https://github.com/user-attachments/assets/2c7fb93a-87b5-47e0-b24a-a3e658e3e479" />

but getting from invoking the compiled artifact returned by standalone_compile to the first kernel call in the Inductor output code takes a significant part of that time, 83us:

<img width="615" height="388" alt="Image" src="https://github.com/user-attachments/assets/2068183c-53de-42d6-aa33-c9c28bd7e7a4" />

The AOT autograd runtime wrappers also take some time (19us), though not a lot of it:

<img width="658" height="429" alt="Image" src="https://github.com/user-attachments/assets/a5ac1f08-0a8b-4f0f-89e6-eae8e2a68d63" />

Some links if you work at meta:

cc @jerryzh168 @chauhang @penguinwu @avikchaudhuri @zhxchen17 @tugsbayasgalan @angelayi @suo @ydwu4 @desertfire @yushangdi @benjaminglass1 @jataylo @iupaikov-amd

extent analysis

Fix Plan

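To address the CPU-side overhead seen in the decode sections, focus on the time between invoking the compiled artifact returned by standalone_compile and the first kernel call (83us in the stack-enabled trace) and, to a lesser extent, on the AOT autograd runtime wrappers (19us).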

Step-by-Step Solution

  1. Isolate the overhead in the decode sections:

    • Run the vllm serve and vllm bench serve commands from the repro with cudagraphs turned off so the CPU-side gaps between kernels are visible.
    • Zoom into the decode sections of the trace and measure the gap between consecutive kernels (89us in the report).
  2. Reduce invocation overhead:

    • Profile with torch_profiler_with_stack=true to see what runs between the call into the compiled artifact returned by standalone_compile and the first kernel.
    • Batch repeated per-call checks where possible. PR #177698 contributes to this by combining the assert_size_stride calls at the beginning of a kernel into a single assert_size_stride_grouped call.
  3. Minimize AOT autograd runtime wrapper overhead:

    • Review the runtime wrappers, which account for about 19us in the stack-enabled trace, for work that can be skipped at call time.
    • Treat this as lower priority than step 2, since the wrappers are a smaller share of the measured gap.
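
One way to quantify the invocation overhead in step 2 outside of a full serving run is a small wall-clock microbenchmark. The sketch below is a generic helper under stated assumptions: `per_call_overhead_us` is a hypothetical name, and `noop` stands in for whatever callable is being measured (for example, a compiled artifact invoked with fixed inputs).

```python
import time
from typing import Callable


def per_call_overhead_us(
    fn: Callable[[], object], iters: int = 10_000, warmup: int = 100
) -> float:
    """Return the average wall-clock time per call of fn, in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches and any lazy initialization before timing
    start = time.perf_counter_ns()
    for _ in range(iters):
        fn()
    elapsed_ns = time.perf_counter_ns() - start
    return elapsed_ns / iters / 1_000.0


def noop() -> None:
    """Placeholder callable; replace with the function under test."""
    return None


baseline_us = per_call_overhead_us(noop)
print(f"per-call overhead of noop: {baseline_us:.3f}us")
```

For callables that launch GPU work asynchronously, this measures only the CPU time until the call returns, which is the quantity of interest here. Subtracting the `noop` baseline removes the cost of the timing loop itself.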

Example Code

To locate the overhead outside of vLLM, the torch.profiler module can be used to profile a PyTorch model directly. The following snippet only collects a trace; it does not change the model's performance:

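import torch
import torch.profiler

# Placeholder model and inputs; substitute the real model and dataset.
model = torch.nn.Linear(16, 16)
dataset = [torch.randn(8, 16) for _ in range(10)]

# wait=1, warmup=1, active=3, repeat=2 needs (1 + 1 + 3) * 2 = 10 steps.
with torch.profiler.profile(
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=3,
        repeat=2
    ),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs')
) as p:
    # Run the model with profiling
    with torch.no_grad():
        for batch in dataset:
            output = model(batch)
            p.step()  # advance the profiler schedule once per iteration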

This snippet profiles the model with torch.profiler and writes the traces to ./logs in a format that TensorBoard can load.

Verification

To verify the fix, rerun the vllm serve and vllm bench serve commands from the repro on a build that includes PR #177698 and compare the new trace with the original. In the decode sections, the CPU-side gap between kernels (89us in the original report) should shrink.
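
A before/after comparison is easier to read as a single number. The sketch below compares the median gap from two runs; `gap_reduction` is a hypothetical helper, and the values in the example are placeholders chosen for illustration, not measurements from this issue.

```python
from statistics import median
from typing import Sequence, Tuple


def gap_reduction(
    before_us: Sequence[float], after_us: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (median_before, median_after, fractional_reduction)."""
    if not before_us or not after_us:
        raise ValueError("both gap lists must be non-empty")
    med_before = median(before_us)
    med_after = median(after_us)
    if med_before <= 0:
        raise ValueError("median of the baseline gaps must be positive")
    return med_before, med_after, (med_before - med_after) / med_before


# Placeholder values for illustration only, not measured results.
before = [100.0, 80.0, 90.0]
after = [45.0, 45.0, 45.0]
med_before, med_after, reduction = gap_reduction(before, after)
print(f"median gap: {med_before}us -> {med_after}us ({reduction:.0%} lower)")
```

The median is used rather than the mean so that a few unusually long gaps, such as those around prefill or scheduling boundaries, do not dominate the comparison.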

Extra Tips

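  • Keep torch_profiler_with_stack=false when measuring absolute times; enabling stacks inflated the measured region from 89us to 138us in the report.
  • Turn off cudagraphs while investigating, as the repro does, so the CPU-side overhead is visible. The report notes that it can also appear with cudagraphs on when running beyond the max cudagraph size.
  • Enable stack collection only when attributing time to specific callers, such as the AOT autograd runtime wrappers or the path into the Inductor output code.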
