vllm - ✅ (Solved) fix(compilation): fix piecewise CUDA graph bugs with splitting_ops [1 pull request, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#37363, fetched 2026-04-08 00:53:17
Comments: 1 · Participants: 1 · Timeline events: 1 (commented ×1) · Reactions: 0

Root Cause

Two bugs in piecewise CUDA graph compilation that surface when a splitting_op produces multiple outputs or allocates new output tensors.

Bug 1 (backends.py): cycle in split_graph(). The getitem nodes of a multi-output splitting_op were assigned to the same subgraph, creating a dependency cycle between subgraphs that causes torch.fx.passes.split_module to raise.

Bug 2 (cuda_graph.py): stale tensor addresses during replay. When a splitting_op allocates new tensors (e.g. via torch.bmm), the next piece's CUDA graph replays with the addresses recorded at capture time, which are now stale. The result is silent data corruption.

Both are general vLLM issues, not tied to any specific model. They surface whenever a splitting_op produces multiple outputs or allocates new tensors.

Fix Action

PR fix notes

PR #37361: fix(compilation): fix piecewise CUDA graph bugs with splitting_ops

Description (problem / solution / changelog)

Purpose

Fix two bugs in piecewise CUDA graph compilation that surface when a splitting_op produces multiple outputs or allocates new output tensors.

Bug 1 (backends.py): cycle in split_graph(). The getitem nodes of a multi-output splitting_op were assigned to the same subgraph, creating a dependency cycle. Fix: assign them to the next subgraph instead.
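To make the "dependency cycle between subgraphs" concrete, here is a minimal pure-Python sketch. It is not vLLM code: the node names, the `partition_edges` and `has_cycle` helpers, and in particular the `bad` assignment are assumptions chosen for illustration. The `bad` mapping is one hypothetical way a getitem mis-assignment could produce a cycle and may differ from the exact case the PR hit. The `good` mapping follows the fix as described: getitem nodes go to the subgraph after the splitting op.

```python
from collections import defaultdict


def partition_edges(deps, node_to_subgraph):
    """Return the set of (producer_subgraph, consumer_subgraph) edges."""
    edges = set()
    for node, inputs in deps.items():
        for src in inputs:
            a, b = node_to_subgraph[src], node_to_subgraph[node]
            if a != b:
                edges.add((a, b))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the subgraph-level dependency graph."""
    graph = defaultdict(list)
    for a, b in edges:
        graph[a].append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every subgraph starts WHITE (unvisited)

    def visit(u):
        color[u] = GRAY
        for v in graph[u]:
            if color[v] == GRAY or (color[v] == WHITE and visit(v)):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in list(graph))


# Toy graph: pre -> split_op -> (getitem_0, getitem_1) -> post
deps = {
    "pre": [],
    "split_op": ["pre"],
    "getitem_0": ["split_op"],
    "getitem_1": ["split_op"],
    "post": ["getitem_0", "getitem_1"],
}

# Hypothetical mis-assignment: the getitem nodes share subgraph 0 with `pre`,
# so subgraph 0 both feeds subgraph 1 and consumes its outputs.
bad = {"pre": 0, "split_op": 1, "getitem_0": 0, "getitem_1": 0, "post": 2}

# Assignment described by the fix: getitem nodes go to the next subgraph.
good = {"pre": 0, "split_op": 1, "getitem_0": 2, "getitem_1": 2, "post": 2}

bad_cyclic = has_cycle(partition_edges(deps, bad))
good_cyclic = has_cycle(partition_edges(deps, good))
```

With the `good` assignment, subgraph-level edges only point forward (0 to 1, 1 to 2), which is the property a module splitter needs in order to order the pieces.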

Bug 2 (cuda_graph.py): stale tensor addresses during replay. When a splitting_op allocates new tensors (e.g. via torch.bmm), the next piece's CUDA graph replays with stale addresses. Fix: save input tensor references at capture time (input_buffers) and copy new data into them before replay.
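The stale-address problem and the copy-before-replay fix can be modelled without a GPU. The sketch below is a pure-Python stand-in, not vLLM or CUDA code: `ToyGraph`, `replay_without_fix`, and `replay_with_fix` are hypothetical names, and a Python list plays the role of a tensor whose identity models its device address.

```python
class ToyGraph:
    """Stand-in for a captured CUDA graph: it remembers the buffer object it
    saw at capture time and always reads from that buffer on replay."""

    def __init__(self, input_buffer):
        # Models the address baked into the graph at capture time.
        self.input_buffer = input_buffer

    def replay(self):
        return [2 * x for x in self.input_buffer]


def replay_without_fix(graph, new_input):
    # The new input lives in a different object (a "new address"),
    # so the graph never sees it and recomputes on old data.
    return graph.replay()


def replay_with_fix(graph, new_input):
    # Copy fresh data into the capture-time buffer, then replay.
    if new_input is not graph.input_buffer:
        graph.input_buffer[:] = new_input
    return graph.replay()


captured_input = [1, 2, 3]
graph = ToyGraph(captured_input)

fresh = [10, 20, 30]  # a newly allocated "tensor"
stale = replay_without_fix(graph, fresh)  # computed from [1, 2, 3]
fixed = replay_with_fix(graph, fresh)     # computed from [10, 20, 30]
```

In PyTorch the in-place copy would typically be `buffer.copy_(new_tensor)`, which writes new data while keeping the buffer's address unchanged.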

Duplicate-work check

No existing open PR addresses these specific piecewise CUDA graph bugs with splitting_ops.

AI Disclosure

This PR was developed with AI assistance (Claude). All code has been reviewed and understood by the human submitter.

Test Plan

python -m pytest tests/compile/test_piecewise_cudagraph_fixes.py -v --noconftest

Test Result

test_splitting_op_getitem_assigned_to_next_subgraph PASSED
test_cudagraph_entry_input_buffers_populated       PASSED

2 passed in 4.86s


Changed files

  • tests/compile/test_piecewise_cudagraph_fixes.py (added, +140/-0)
  • vllm/compilation/backends.py (modified, +10/-2)
  • vllm/compilation/cuda_graph.py (modified, +43/-6)
Fix

PR: https://github.com/vllm-project/vllm/pull/37361

extent analysis

Fix Plan

To address both bugs, modify split_graph() in backends.py so that it no longer produces a subgraph cycle, and update the replay path in cuda_graph.py so that it no longer reads from stale tensor addresses.

Step-by-Step Solution

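  • In backends.py, update split_graph() so that the getitem nodes of a multi-output splitting_op are assigned to the next subgraph rather than the same one. The sketch below is illustrative and is not the PR's code; assign_subgraph_ids and is_splitting_op are hypothetical names:
import torch.fx as fx

def assign_subgraph_ids(graph: fx.Graph, is_splitting_op) -> dict:
    """Map each compute node to a subgraph id (illustrative sketch only)."""
    node_to_id, subgraph_id = {}, 0
    for node in graph.nodes:
        if node.op in ("placeholder", "output"):
            continue
        if is_splitting_op(node):
            subgraph_id += 1  # the splitting op gets a subgraph of its own
            node_to_id[node] = subgraph_id
            subgraph_id += 1  # everything after it starts the next subgraph
        else:
            # operator.getitem nodes that unpack a splitting op land here,
            # i.e. in the next subgraph
            node_to_id[node] = subgraph_id
    return node_to_id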
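  • In cuda_graph.py, save references to the input tensors at capture time (input_buffers) and copy new data into them before replay, so the graph never reads a stale address. Illustrative sketch; apart from input_buffers, the names are hypothetical:
import torch

class CapturedPiece:
    """Holds a captured graph plus the inputs it was captured with."""
    def __init__(self, graph: torch.cuda.CUDAGraph, inputs, output):
        self.graph = graph
        self.input_buffers = list(inputs)  # addresses recorded by the graph
        self.output = output

    def replay(self, *args: torch.Tensor):
        for buf, new in zip(self.input_buffers, args):
            if buf.data_ptr() != new.data_ptr():
                buf.copy_(new)  # fresh data, capture-time address
        self.graph.replay()
        return self.output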

Temporary Workaround

If the above changes are not feasible, a temporary workaround is to disable piecewise CUDA graph compilation for models that use splitting_op with multiple outputs or new tensor allocations.
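As a rough sketch of that workaround, vLLM's `enforce_eager` engine option (also available as the `--enforce-eager` CLI flag) turns off CUDA graph capture. This is broader than disabling only the piecewise path and usually costs throughput, so treat it as a stopgap. Whether a narrower switch exists depends on your vLLM version, and the model name below is a placeholder.

```python
# Keyword arguments intended for vllm.LLM(...). `enforce_eager=True` disables
# CUDA graph capture altogether, which is broader than turning off only the
# piecewise path (an assumption-laden stopgap, not the PR's recommendation).
workaround_kwargs = {"enforce_eager": True}

# Usage (requires vLLM and a GPU; the model name is a placeholder):
# from vllm import LLM
# llm = LLM(model="<your-model>", **workaround_kwargs)
```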

Verification

To verify the fix, run the following tests:

  • Test piecewise CUDA graph compilation with a model that uses splitting_op with multiple outputs.
  • Test piecewise CUDA graph compilation with a model that uses splitting_op with new tensor allocations.
  • Verify that the fix does not introduce any performance regressions.

Extra Tips

  • When working with CUDA graphs, it's essential to ensure that tensor addresses are properly updated to avoid silent data corruption.
  • Consider adding additional tests to cover different scenarios and edge cases.
