vllm - ✅ (Solved) fix(compilation): fix piecewise CUDA graph bugs with splitting_ops [1 pull request, 1 comment, 1 participant]

vllm, 2026-03-18 01:08:50

vllm-project/vllm#37363•Fetched 2026-04-08 00:53:17

Author: Complexity-ML
Participants: Complexity-ML

Two bugs in piecewise CUDA graph compilation that surface when a splitting_op produces multiple outputs or allocates new output tensors.

Bug 1 (backends.py): cycle in split_graph(). The getitem nodes of a multi-output splitting_op were assigned to the same subgraph as the splitting_op, creating a dependency cycle that causes torch.fx.passes.split_module to raise.

Bug 2 (cuda_graph.py): stale tensor addresses during replay. When a splitting_op allocates new tensors (e.g. via torch.bmm), the next piece's CUDA graph replays with the addresses recorded at capture time, which are now stale. The result is silent data corruption.

Both are general vLLM issues, not tied to any specific model. They surface whenever a splitting_op produces multiple outputs or allocates new tensors.
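
The stale-address failure described as Bug 2 can be illustrated without a GPU. The sketch below is a toy model in plain Python, not vLLM or PyTorch code: CapturedGraph is a hypothetical stand-in for a captured CUDA graph that always reads from the buffer object it was recorded against, and Python lists stand in for tensors.

```python
class CapturedGraph:
    """Toy stand-in for a captured CUDA graph (hypothetical, not a real API).

    It remembers which buffer object it was recorded against and always
    reads from that same object on replay.
    """

    def __init__(self, input_buffer):
        self.input_buffer = input_buffer  # "address" fixed at capture time

    def replay(self):
        return [2 * v for v in self.input_buffer]


def run_stale(graph, fresh_input):
    # Buggy pattern: fresh_input lives in a new object, so the graph
    # never sees it and replays over the old contents.
    return graph.replay()


def run_refreshed(graph, fresh_input):
    # Fix pattern: copy the fresh values into the captured buffer in
    # place, then replay.
    graph.input_buffer[:] = fresh_input
    return graph.replay()


captured_input = [1, 2, 3]
graph = CapturedGraph(captured_input)

new_input = [10, 20, 30]  # newly allocated, like the output of torch.bmm
stale = run_stale(graph, new_input)
fresh = run_refreshed(graph, new_input)

print(stale)  # computed from the old buffer contents
print(fresh)  # computed from the new values
```

Note that run_stale raises no error: it returns a plausible-looking result computed from old data, which is why this class of bug shows up as silent corruption rather than a crash.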

Fix Action

Fix

PR: https://github.com/vllm-project/vllm/pull/37361

PR fix notes

PR #37361: fix(compilation): fix piecewise CUDA graph bugs with splitting_ops

Repository: vllm-project/vllm
Author: Complexity-ML
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37361

Description (problem / solution / changelog)

Purpose

Fix two bugs in piecewise CUDA graph compilation that surface when a splitting_op produces multiple outputs or allocates new output tensors.

Bug 1 (backends.py): cycle in split_graph(). The getitem nodes of a multi-output splitting_op were assigned to the same subgraph, creating a dependency cycle. Fix: assign them to the next subgraph instead.
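
To make the "next subgraph" rule concrete, here is a small self-contained sketch in plain Python. It is a toy model, not the vLLM implementation: Node, assign_subgraphs, and the getitem_to_next_subgraph flag are hypothetical names introduced only to contrast the two assignment rules on a five-node graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    inputs: tuple = ()
    is_splitting_op: bool = False
    is_getitem: bool = False


def assign_subgraphs(nodes, getitem_to_next_subgraph=True):
    """Map node name -> subgraph id for a topologically ordered node list."""
    ids = {}
    current = 0
    for node in nodes:
        if node.is_splitting_op:
            current += 1  # the splitting op gets a subgraph of its own
            ids[node.name] = current
            current += 1  # later nodes start the next subgraph
        elif node.is_getitem and not getitem_to_next_subgraph:
            # Rule described as buggy: getitem stays with its producer.
            ids[node.name] = ids[node.inputs[0]]
        else:
            # Rule described as the fix: getitem is treated like any other
            # node and lands in the subgraph after the splitting op.
            ids[node.name] = current
    return ids


nodes = [
    Node("x_proj"),
    Node("split_op", inputs=("x_proj",), is_splitting_op=True),
    Node("out0", inputs=("split_op",), is_getitem=True),
    Node("out1", inputs=("split_op",), is_getitem=True),
    Node("add", inputs=("out0", "out1")),
]

before = assign_subgraphs(nodes, getitem_to_next_subgraph=False)
after = assign_subgraphs(nodes, getitem_to_next_subgraph=True)
print(before)
print(after)
```

Under the first rule the two getitem nodes share subgraph 1 with split_op; under the second they move to subgraph 2 together with their consumer, so the splitting op is alone in its subgraph.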

Duplicate-work check

No existing open PR addresses these specific piecewise CUDA graph bugs with splitting_ops.

AI Disclosure

This PR was developed with AI assistance (Claude). All code has been reviewed and understood by the human submitter.

Test Plan

python -m pytest tests/compile/test_piecewise_cudagraph_fixes.py -v --noconftest

Test Result

test_splitting_op_getitem_assigned_to_next_subgraph PASSED
test_cudagraph_entry_input_buffers_populated       PASSED

2 passed in 4.86s

Changed files

tests/compile/test_piecewise_cudagraph_fixes.py (added, +140/-0)
vllm/compilation/backends.py (modified, +10/-2)
vllm/compilation/cuda_graph.py (modified, +43/-6)

extent analysis

Fix Plan

To address both bugs, two places need changes: split_graph() in backends.py, so that the subgraph assignment no longer produces a cycle, and the replay logic in cuda_graph.py, so that a piece does not replay with stale tensor addresses.

Step-by-Step Solution

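In backends.py, update split_graph() so that the getitem nodes of a multi-output splitting_op are assigned to the next subgraph rather than to the splitting_op's own subgraph. The following is an illustrative sketch of that assignment rule using torch.fx, not the code from the PR; assign_subgraph_ids and is_splitting_op are hypothetical names:

import torch.fx as fx

def assign_subgraph_ids(graph: fx.Graph, is_splitting_op) -> dict[fx.Node, int]:
    # Sketch only: is_splitting_op is a caller-supplied predicate on fx.Node.
    node_to_id: dict[fx.Node, int] = {}
    subgraph_id = 0
    for node in graph.nodes:
        if node.op in ("placeholder", "output"):
            continue
        if is_splitting_op(node):
            subgraph_id += 1  # the splitting op sits alone in its own subgraph
            node_to_id[node] = subgraph_id
            subgraph_id += 1  # nodes after it belong to the next subgraph
            continue
        # A getitem on a splitting op's output lands here, in the next subgraph,
        # instead of inheriting the splitting op's subgraph id.
        node_to_id[node] = subgraph_id
    return node_to_id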

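In cuda_graph.py, update the replay logic so that a piece never replays against addresses that a preceding splitting_op has invalidated by allocating new tensors. One way to do this, sketched here under the assumption that each captured piece keeps the input tensors it was captured with (replay_piece and input_buffers are hypothetical names, not the PR's code), is to copy fresh inputs into those buffers before replaying:

import torch

def replay_piece(graph: torch.cuda.CUDAGraph, input_buffers, new_inputs):
    # input_buffers: tensors whose addresses were recorded at capture time.
    for buf, new in zip(input_buffers, new_inputs):
        if isinstance(new, torch.Tensor) and new.data_ptr() != buf.data_ptr():
            # The splitting op allocated a fresh tensor; copy it into the
            # captured buffer so the replay reads current data.
            buf.copy_(new)
    graph.replay()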

Temporary Workaround

If the above changes are not feasible, a temporary workaround is to disable piecewise CUDA graph compilation for models that use splitting_op with multiple outputs or new tensor allocations.

Verification

To verify the fix, run the following tests:

Test piecewise CUDA graph compilation with a model that uses splitting_op with multiple outputs.
Test piecewise CUDA graph compilation with a model that uses splitting_op with new tensor allocations.
Verify that the fix does not introduce any performance regressions.
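
Alongside the tests above, one cheap structural check is that a node-to-subgraph assignment yields an acyclic dependency graph between subgraphs. The sketch below uses only the Python standard library (graphlib); the edge list and both assignments are made-up illustrations of what a partition cycle looks like, not a reproduction of the vLLM graph, and partition_order and has_partition_cycle are hypothetical helper names.

```python
from graphlib import CycleError, TopologicalSorter


def partition_order(edges, node_to_partition):
    """Return a valid partition execution order, or raise CycleError."""
    deps = {p: set() for p in node_to_partition.values()}
    for src, dst in edges:
        p_src, p_dst = node_to_partition[src], node_to_partition[dst]
        if p_src != p_dst:
            deps[p_dst].add(p_src)  # p_dst must run after p_src
    return list(TopologicalSorter(deps).static_order())


def has_partition_cycle(edges, node_to_partition):
    try:
        partition_order(edges, node_to_partition)
        return False
    except CycleError:
        return True


# Data-flow edges (producer, consumer) of a small hypothetical graph.
edges = [("a", "split"), ("split", "g0"), ("g0", "b"), ("b", "c")]

# Acyclic assignment: partitions depend on each other as 0 -> 1 -> 2.
good = {"a": 0, "split": 1, "g0": 2, "b": 2, "c": 2}

# Cyclic assignment: "b" is placed back in partition 1 although it consumes
# "g0" from partition 2, so partitions 1 and 2 depend on each other.
bad = {"a": 0, "split": 1, "g0": 2, "b": 1, "c": 2}

print(has_partition_cycle(edges, good))
print(has_partition_cycle(edges, bad))
```

A check of this shape could run in a unit test before the graph is handed to the splitter, turning a downstream exception into a targeted assertion failure.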

Extra Tips

When working with CUDA graphs, it's essential to ensure that tensor addresses are properly updated to avoid silent data corruption.
Consider adding additional tests to cover different scenarios and edge cases.
