pytorch: [RFC] torch.compile + CUDA graph capture: Inductor JIT has no capture-state awareness, causes CPU→GPU copy error on first call



Problem

When a `@torch.compile`-decorated function is called for the first time inside a CUDA graph capture, Inductor JIT fires mid-capture and performs internal CPU→GPU bookkeeping. CUDA graph capture rejects this:

```
RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture unless the CPU tensor is pinned.
```

On ROCm/hipGraph the failure is worse — the illegal operation is silently swallowed and the captured graph produces wrong results at replay (see #155684).

Root cause

Inductor/Dynamo has zero awareness of CUDA graph capture state. The dispatch path in `eval_frame.py` fires compilation unconditionally on first call, with no check of `is_current_stream_capturing()`. There is no path where a cache miss + active capture results in a safe fallback.

Motivating case

vllm-project/vllm#42241 — `@torch.compile` on `rearrange_mixed_qkv` in GDN linear attention breaks CUDA graph capture in the Qwen3.5 speculative decoding path. The workaround requires:

  1. Manual warmup calls before capture begins (incomplete — doesn't cover all code paths, e.g. target vs draft model in speculative decoding)
  2. A runtime guard `if is_current_stream_capturing(): return eager_impl()` as a safety net for uncovered warmup cases

This pattern is being replicated across vLLM's codebase wherever `torch.compile` + CUDA graphs intersect.
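The runtime guard from item 2 can be sketched as a small wrapper. This is a minimal illustration, not vLLM's actual code: `capture_guarded` and `_stream_is_capturing` are hypothetical names, and the only PyTorch calls assumed are `torch.cuda.is_available()` and `torch.cuda.is_current_stream_capturing()`. The capture predicate is injectable so the routing logic can be exercised without a GPU.

```python
from typing import Any, Callable, Optional


def _stream_is_capturing() -> bool:
    """Default predicate: True only while a CUDA/HIP graph capture is active."""
    import torch  # lazy import so the wrapper itself has no hard torch dependency

    return torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()


def capture_guarded(
    compiled_fn: Callable[..., Any],
    eager_fn: Callable[..., Any],
    is_capturing: Optional[Callable[[], bool]] = None,
) -> Callable[..., Any]:
    """Route calls to eager_fn during graph capture, otherwise to compiled_fn.

    Hypothetical helper: mirrors the `if is_current_stream_capturing(): return
    eager_impl()` safety net described above.
    """
    check = is_capturing if is_capturing is not None else _stream_is_capturing

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if check():
            # Capture is active: never enter the compiled path, so no JIT can fire.
            return eager_fn(*args, **kwargs)
        return compiled_fn(*args, **kwargs)

    return wrapper
```

Assuming an eager function named `rearrange_eager`, usage might look like `rearrange = capture_guarded(torch.compile(rearrange_eager), rearrange_eager)`. Note that this guard sends every call made during capture to the eager path, even when a compiled artifact already exists, so warmup before capture still matters for performance.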

Validated on ROCm (gfx1201 / RDNA4)

Tested on AMD Radeon RX 9070 XT with PyTorch 2.13.0.dev20260418+rocm7.1 / HIP 7.1.52802:

  • `is_current_stream_capturing()` correctly returns `True` during hipGraph capture ✓
  • Inductor JIT firing mid-capture was not blocked — hipGraph silently accepted the compilation and replay completed without error
  • This confirms the ROCm failure mode: no crash, no warning, silent capture of unexpected operations

On NVIDIA, the same scenario raises `RuntimeError: Cannot copy between CPU and CUDA tensors during CUDA graph capture` immediately. The ROCm silent-acceptance makes missed warmup cases significantly harder to detect.

Proposed fix

At the Dynamo cache-miss dispatch point, check capture state and fall back to eager if compilation hasn't happened yet:

| State | Action |
| --- | --- |
| cache miss + capture active | eager fallback (safe, no JIT fires) |
| cache miss + no capture | compile normally |
| cache hit + anything | run binary (already safe) |

This would make `@torch.compile` automatically safe to call during capture without requiring user-level warmup or stream-capture guards. Existing code that already warms up correctly would be unaffected (cache hit path).

A user who wants optimal eager performance can still write a separate eager implementation — but they would no longer need to for correctness.
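The state/action table above can be modelled in a few lines of plain Python. This is a toy model for illustration only: it does not reproduce Dynamo's `eval_frame.py`, and `Action`, `choose_action`, and `ToyDispatcher` are hypothetical names. The single `_compiled` slot stands in for Dynamo's guarded code cache, and `compile_fn` stands in for whatever backend compilation would run.

```python
from enum import Enum
from typing import Any, Callable, Optional


class Action(Enum):
    EAGER_FALLBACK = "eager fallback"
    COMPILE = "compile normally"
    RUN_COMPILED = "run binary"


def choose_action(cache_hit: bool, capturing: bool) -> Action:
    """Encode the proposed table: a cache hit always runs the compiled code;
    a cache miss compiles only when no graph capture is active."""
    if cache_hit:
        return Action.RUN_COMPILED
    if capturing:
        return Action.EAGER_FALLBACK
    return Action.COMPILE


class ToyDispatcher:
    """Toy stand-in for the cache-miss dispatch point (not real Dynamo code)."""

    def __init__(
        self,
        fn: Callable[..., Any],
        compile_fn: Callable[[Callable[..., Any]], Callable[..., Any]],
        is_capturing: Callable[[], bool],
    ) -> None:
        self._fn = fn
        self._compile = compile_fn
        self._is_capturing = is_capturing
        self._compiled: Optional[Callable[..., Any]] = None  # single-entry "cache"

    def __call__(self, *args: Any) -> Any:
        action = choose_action(self._compiled is not None, self._is_capturing())
        if action is Action.EAGER_FALLBACK:
            # Cache miss during capture: run the original function, compile nothing.
            return self._fn(*args)
        if action is Action.COMPILE:
            self._compiled = self._compile(self._fn)
        return self._compiled(*args)
```

In this model, a first call made during capture leaves the cache empty, so a later call outside capture still compiles as usual, and code that warmed up before capture always takes the cache-hit path.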

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @amjames @azahed98
