[vllm] [2.12 regression] PyTorch Fullgraph Smoke Test: full-cudagraph LLM diverges from piecewise-cudagraph LLM at batch=123

atalman · 2026-05-01T14:57:07Z

pytorch/pytorch#182125•Fetched 2026-05-02 05:27:03

Under torch 2.12.0 + triton 3.7.0, vLLM's test_full_cudagraph[123-10-llm_pair10] fails because the full-cudagraph LLM produces different output than the piecewise-cudagraph LLM at batch_size=123, max_tokens=10 for the same prompt and greedy sampling:

AssertionError: assert ' jumps over ...y dog\n\n# 10' == ' jumps over ...y dog\n\n# 1.'

Diff at the last 2 characters:

piecewise: ... jumps over the lazy dog\n\n# 1.
full_cg:   ... jumps over the lazy dog\n\n# 10
                                          ^^^

Both modes use temperature=0.0, top_p=1.0 (purely greedy), so the result must be bit-identical on a correct implementation. On torch 2.11 they were. On torch 2.12, the full-cudagraph path diverges from piecewise at this specific batch size. Blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).
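To make the "purely greedy" point concrete, here is a minimal sketch of greedy selection over a toy logit vector. The logit values are invented for illustration and do not come from the failing model. It only shows why a greedy comparison is a strict check: greedy decoding takes the argmax at every step, so any numeric difference between two execution paths, however small, can flip the chosen token when two logits are close.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (the first index wins ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for three candidate tokens; values are illustrative only.
reference = [2.5, 7.125, 7.124]
perturbed = [2.5, 7.125, 7.126]  # third logit shifted by 0.002

print(greedy_pick(reference))  # index chosen on the reference path
print(greedy_pick(perturbed))  # index chosen after the small shift
```

This does not say where such a difference would come from in this issue; it only illustrates why matching text under greedy decoding implies matching argmax decisions at every step.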

Fix / Workaround

This was originally suspected to share a root cause with pytorch/pytorch#182124 (the AsyncTP correctness divergence). We verified that #182124 is fixed by reverting pytorch/pytorch#176994 ("[Inductor] Improve materialization heuristic for a chain of computations") on the test wheel. However, this issue (#182125) is NOT fixed by that revert — the # 1. vs # 10 divergence at batch_size=123 reproduces with #176994 reverted.

Summary

AssertionError: assert ' jumps over ...y dog\n\n# 10' == ' jumps over ...y dog\n\n# 1.'

Diff at the last 2 characters:

piecewise: ... jumps over the lazy dog\n\n# 1.
full_cg:   ... jumps over the lazy dog\n\n# 10
                                          ^^^

Environment

torch: 2.12.0+cu130 (test channel)
triton: 3.7.0
CUDA: 13.0
Python: 3.12.13
GPU: NVIDIA H100 (mithril-h100-pool)

Reproduction

Failing test:

tests/compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::test_full_cudagraph[123-10-llm_pair10]

Test parametrization: batch_size=123, max_tokens=10, llm_pair10 (the 11th fixture combo). The test does:

prompts = ["the quick brown fox"] * batch_size  # 123 copies of the same prompt
sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens, top_p=1.0)

piecewise_responses = piecewise_llm.generate(prompts, sampling_params)
full_responses = full_cudagraph_llm.generate(prompts, sampling_params)

for piecewise_res, full_res in zip(piecewise_responses, full_responses):
    assert piecewise_res.outputs[0].text.lower() == full_res.outputs[0].text.lower()

The two LLMs differ only in cudagraph mode (FULL vs PIECEWISE). With greedy decoding the outputs must match exactly. They don't on torch 2.12 at batch_size=123.

The other 9 batch-size/max-tokens combinations in the parametrize list ((1,10), (7,10), (16,10), (25,10), (32,10), (45,10), (64,10), (8,5), (8,30)) all pass; only (123, 10) fails.
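When triaging a mismatch like this, it can help to print where two outputs first differ rather than relying on the truncated assertion message. The helper below is a hypothetical sketch (`first_divergence` is not part of vLLM or the test), applied to the output tails quoted in this issue; the leading part of the real outputs is elided in the report and is not reconstructed here.

```python
def first_divergence(a: str, b: str):
    """Return the index of the first differing character, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Output tails as quoted in the issue (the start of each output is elided there).
piecewise = " jumps over the lazy dog\n\n# 1."
full_cg = " jumps over the lazy dog\n\n# 10"

idx = first_divergence(piecewise, full_cg)
print(idx, repr(piecewise[idx:]), repr(full_cg[idx:]))
```

Inside the test loop, the same helper could be called on `piecewise_res.outputs[0].text` and `full_res.outputs[0].text` before the assertion to log the divergence point for every prompt in the batch.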

Reproducibility on torch 2.12 branch

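Reproduces on every test-PR run since 2026-04-29:

2026-04-29: https://buildkite.com/vllm/ci/builds/63890#019ddfd6-931c-4edb-b3e7-928eaee434f6
2026-05-01: https://buildkite.com/vllm/ci/bu…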

Passes on every recent main build:

2026-04-30 daily: https://buildkite.com/vllm/ci/builds/63914
2026-05-01 nightly: https://buildkite.com/vllm/ci/builds/63994

Diagnosis request

A correctly-implemented full-cudagraph rewrite must produce bit-identical outputs vs piecewise on greedy decoding — they're the same compute graph, only the CUDA-graph capture/replay differs. The fact that this fails specifically at batch_size=123 (a non-power-of-two padded batch) and only on torch 2.12 suggests:

1. A change in cudagraph capture/replay semantics in torch 2.12 that affects how padded batches are handled.
2. A change in the inductor-compiled kernel for the padded shape that produces different kernel selection between full-graph vs piecewise compile boundaries.
3. A change in how torch 2.12 cudagraph handles tensor strides for the padded dim (the 123 likely gets padded to 128 internally; that boundary may now be handled differently).
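The padding hypothesis above can be sketched as follows. This assumes that a batch is rounded up to the smallest captured graph size that fits it. The list of capture sizes below is hypothetical (powers of two chosen for illustration); the real list comes from vLLM's compilation config and may differ.

```python
import bisect


def padded_batch_size(batch_size, capture_sizes):
    """Smallest capture size >= batch_size, or None if none is large enough."""
    sizes = sorted(capture_sizes)
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


# Hypothetical capture sizes; the actual values are an assumption here.
capture_sizes = [1, 2, 4, 8, 16, 32, 64, 128]

for bs in (1, 7, 16, 25, 32, 45, 64, 123):
    print(bs, "->", padded_batch_size(bs, capture_sizes))
```

Under this assumption, 123 is the only size in the parametrize list that lands in the 128 bucket, which is why the padded boundary is a natural place to look.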

Could a maintainer compare cudagraph capture state at batch=123 between torch 2.11 and torch 2.12 for this model?

Links

vLLM PR: vllm-project/vllm#40077
Umbrella: pytorch/pytorch#180899

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

Update 2026-05-01: root cause still unknown

Reverting pytorch/pytorch#176994 fixes pytorch/pytorch#182124 but not this issue, so the root cause here is something different and still unknown. We are continuing to investigate other 2.12 cherry-picks and vLLM-side commits as candidates.

Reproduction (current state of investigation)

Hardware: 1× H100 80GB (mithril-h100-pool in CI; any H100 should work).

Image (pinned torch 2.12.0+cu130 RC + matching vLLM build):

docker pull public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf

Run:

docker run --rm -it \
    --gpus '"device=0"' \
    --shm-size=4g \
    -e HF_TOKEN=<your-hf-token> \
    public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf \
    bash

# inside the container
cd /vllm-workspace/tests
rm -rf /home/dev/.cache/vllm/torch_compile_cache    # ensure cold compile

# CRITICAL: run the FULL llm_pair10 sequence in one pytest invocation —
# the failure only triggers after the LLM instances accumulate state
# across the prior batch sizes (1, 7, 16, 25, 32, 45, 64) before reaching 123.
pytest -v -s 'compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph' -k 'llm_pair10' 2>&1 \
    | grep -E "(test_full_cudagraph\[|PASSED|FAILED|AssertionError|jumps over)"

Expected fail signature at [123-10-llm_pair10]:

AssertionError: assert ' jumps over ...y dog\n\n# 10' == ' jumps over ...y dog\n\n# 1.'
  - # 1.
  + # 10

The earlier batch sizes (1, 7, 16, 25, 32, 45, 64) PASS; only [123-10-llm_pair10] fails.

What we've ruled out

❌ Hardware variation: bit-identical CI failure on mithril-h100-pool H100 80GB; reproduced (only when running the full llm_pair10 parametrization sequence) on a separate H100 80GB dev box → not a CI-agent-specific quirk.
❌ AOT cache state: cold compile (no cache) still reproduces the divergence in CI; only the single-test-in-isolation invocation skips the issue (because it doesn't accumulate state from the earlier batch sizes).
❌ MIG slicing: confirmed neither CI nor dev box uses MIG; both run on full GPUs with MIG M.: Disabled.
❌ Memory pressure: gpu_memory_utilization=0.5 and 0.9 both fail in CI; not memory-allocator state.
❌ pytorch/pytorch#176994 (Inductor materialization heuristic): reverting on the installed torch fixes pytorch/pytorch#182124 but NOT this issue.

Open candidates

Other commits that landed on release/2.12 between CI builds 63658 (passing) and 63890 (first failure):

f813f7732d Fix dynamic shape tile issue (#181793) — touches torch/_inductor/tiling_utils.py (tile selection at non-power-of-two batch — 123 → padded to 128 matches this hypothesis)
7c927dd255 [dynamo] Disable recursive dict tag optimization — touches torch/_dynamo/config.py
dea39b1260 warn instead of error on fullgraph=True fallback to eager — should be benign

vLLM-side commits in the same window also worth considering:

c2fb01331 [Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraphWrapper
6f20f81cb shape_invariants → shape_id refactor in dynamic_arg_dims

Diagnosis request

The divergence pattern (only batch_size=123 after running other batch sizes through the same shared LLM pair) suggests state accumulation in cudagraph capture/replay or in the inductor compile cache. Could a maintainer trace what state is mutated by the earlier batch sizes that affects the codegen/replay at batch=123?

Update 2026-05-01 (later): confirmed local reproduction + scope is broader than originally reported

Reproduced on a 1×H100 80GB dev box (using 1 of 2 GPUs in a 2×H100 node) inside the same vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf image. Running the full TestFullCUDAGraph class (matches CI's invocation of pytest tests/compile/fullgraph/test_full_cudagraph.py) produces 19 failures across 4 different llm_pair configurations and many batch sizes — much broader than CI's single [123-10-llm_pair10] failure suggested.

Failing parametrizations on local repro

| llm_pair | Failing `[batch_size-max_tokens]` | Divergence pattern |
|---|---|---|
| `llm_pair0` | `[7-10]`, `[16-10]`, `[32-10]`, `[45-10]`, `[64-10]`, `[123-10]` | piecewise: `dog"\n\n def` vs full: `dog"\n\n #` |
| `llm_pair1` | `[16-10]`, `[32-10]`, `[64-10]`, `[123-10]` | mixed direction (`def` vs `#`, also `#` vs `def`) |
| `llm_pair2` | `[32-10]`, `[45-10]`, `[64-10]`, `[123-10]` | piecewise: `def` vs full: `#` |
| `llm_pair3` | `[32-10]`, `[45-10]`, `[64-10]`, `[123-10]`, `[8-30]` | `def`/`#` divergence, plus at (8, 30) divergence inside generated `def test_is_palindrome`: piecewise `assertequal(is_palind` vs full `asserttrue(is_palind` |

Final pytest summary:

============== 19 failed, 121 passed, 40 skipped, 48 warnings in 940.81s ==============

llm_pair0..llm_pair9 are alternative (model, backend_config, use_inductor_graph_partition) tuples. The fact that llm_pair0..llm_pair3 ALL diverge (and not just llm_pair10 as CI showed) means this is not specific to one model/backend combination — it's a general full-cudagraph-vs-piecewise codegen drift in vLLM's compile path, surfaced after enough warmup.

Properties confirmed

State accumulation required: single-test invocations (e.g., pytest -k '[123-10-llm_pair10]' alone) PASS on the same hardware. Failures only appear when the class-scoped llm_pair fixture has been reused across prior (batch_size, max_tokens) parametrizations, indicating allocator/cudagraph-pool/inductor state mutates between runs in a way that affects later compilations or replays.
Greedy decoding: temperature=0.0, top_p=1.0 — output must be bit-identical between piecewise and full cudagraph by construction.
Reproducible: bit-identical divergence text on repeated runs.
Hardware: H100 80GB (full GPU, MIG disabled). Reproduces on standalone dev box, not just CI's mithril-h100-pool.
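One way to narrow down the state-accumulation property is to find the shortest run of earlier parametrizations that still makes `[123-10-llm_pair10]` fail. The sketch below only builds pytest node-id lists for each prefix of the prior batch sizes. It assumes that passing several node IDs to one pytest invocation reuses the class-scoped `llm_pair` fixture and runs them in the order given, which should be checked against the actual collection order. The helper names are hypothetical.

```python
# Test path relative to /vllm-workspace/tests, as used in the repro commands.
TEST = (
    "compile/fullgraph/test_full_cudagraph.py"
    "::TestFullCUDAGraph::test_full_cudagraph"
)
# Batch sizes that run before 123 in the parametrize list (all with max_tokens=10).
PRIOR = [1, 7, 16, 25, 32, 45, 64]


def node_ids(batch_sizes, pair="llm_pair10", max_tokens=10):
    """Build pytest node IDs for the given batch sizes."""
    return [f"{TEST}[{bs}-{max_tokens}-{pair}]" for bs in batch_sizes]


def prefix_experiments(prior=PRIOR, target=123):
    """Yield one node-id list per prefix of `prior`, each ending with `target`."""
    for n in range(len(prior) + 1):
        yield node_ids(prior[:n] + [target])


for ids in prefix_experiments():
    print("pytest -v -s " + " ".join(f"'{i}'" for i in ids))
```

The first printed command is the isolated run that passes, and the last is the full sequence that fails; the first prefix in between that fails would show which earlier batch size is needed to trigger the divergence.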

Reproduction recipe (matches what triggered the failures above)

docker run --rm -it \
    --gpus '"device=0"' \
    --shm-size=4g \
    -e HF_TOKEN=<your-hf-token> \
    public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf \
    bash

# inside the container
cd /vllm-workspace/tests
rm -rf /home/dev/.cache/vllm/torch_compile_cache /tmp/torchinductor_*

# Run the full test class — matches CI shard execution; class-scoped llm_pair
# fixture is reused across all (batch_size, max_tokens) parametrizations.
CUDA_VISIBLE_DEVICES=0 pytest -v -s \
    'compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph' 2>&1 \
    | tee /tmp/fullgraph_repro.log

Expect ~15 minutes runtime; multiple FAILED lines in the short test summary at the end.
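To turn the saved log into a per-`llm_pair` failure list like the table above, a small parser along these lines could be used. It is a sketch that assumes the usual pytest line shapes (`FAILED <node id> - ...` in the short summary, or `<node id> FAILED` in verbose output); the sample lines are illustrative and were not copied from the actual log.

```python
import re
from collections import defaultdict

# Matches e.g. "test_full_cudagraph[123-10-llm_pair10]".
PARAM_RE = re.compile(r"test_full_cudagraph\[(\d+)-(\d+)-(llm_pair\d+)\]")


def failures_by_pair(lines):
    """Group failing (batch_size, max_tokens) tuples by llm_pair id."""
    grouped = defaultdict(set)
    for line in lines:
        if "FAILED" not in line:
            continue
        match = PARAM_RE.search(line)
        if match:
            bs, mt, pair = match.groups()
            # A set collapses duplicates (verbose line plus summary line).
            grouped[pair].add((int(bs), int(mt)))
    return {pair: sorted(params) for pair, params in grouped.items()}


# Illustrative lines in pytest's usual format (not taken from the real log).
prefix = "compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::"
sample = [
    f"{prefix}test_full_cudagraph[1-10-llm_pair10] PASSED",
    f"{prefix}test_full_cudagraph[123-10-llm_pair10] FAILED",
    f"FAILED {prefix}test_full_cudagraph[123-10-llm_pair10] - AssertionError",
    f"FAILED {prefix}test_full_cudagraph[32-10-llm_pair3] - AssertionError",
    f"FAILED {prefix}test_full_cudagraph[8-30-llm_pair3] - AssertionError",
]

print(failures_by_pair(sample))
```

On a real run, the same function can be fed the log directly, for example `failures_by_pair(open("/tmp/fullgraph_repro.log"))`.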

Open questions for maintainers

1. What state mutates between batch-size parametrizations on the SAME llm_pair fixture that causes later batch sizes to produce divergent piecewise-vs-full-cudagraph output? Candidates: cudagraph capture pool fragmentation, inductor compile cache state, allocator reuse pattern.
2. Why does the same divergence pattern (def vs #) appear across llm_pair0..llm_pair3 (different backends)? This suggests the bug is in a layer common to all backends, likely cudagraph capture or torch.compile / inductor.
3. Bisect candidates still open (pytorch/pytorch#176994 has been ruled out: the revert that fixed pytorch/pytorch#182124 does NOT fix this issue):
   - f813f7732d Fix dynamic shape tile issue (#181793): touches torch/_inductor/tiling_utils.py. Tile selection at a non-power-of-two batch size fits this hypothesis (123 is likely padded to 128).
   - 7c927dd255 [dynamo] Disable recursive dict tag optimization
   - c2fb01331 (vLLM) [Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraphWrapper
   - 6f20f81cb (vLLM) shape_invariants → shape_id refactor in dynamic_arg_dims

A single-cherry-pick revert experiment is in progress; will update once we know which is the cause.

extent analysis

TL;DR

No fix is known yet. The most likely path to one is to bisect the open candidates and revert the specific commit responsible for the divergence between the full-cudagraph and piecewise-cudagraph LLMs on torch 2.12; the suspected areas are cudagraph capture/replay and inductor compile state.

Guidance

Investigate state accumulation: Determine what state is mutated between batch-size parametrizations on the same llm_pair fixture that causes later batch sizes to produce divergent output.
Bisect candidates: Test reverts of individual commits, such as f813f7732d, 7c927dd255, c2fb01331, and 6f20f81cb, to identify the root cause of the issue.
Verify cudagraph capture: Compare cudagraph capture state at batch=123 between torch 2.11 and torch 2.12 for the same model to identify any differences in capture or replay semantics.
Check inductor compile cache: Investigate the inductor compile cache state to see if there are any changes in how padded batches are handled or if there are any issues with kernel selection.

Example

No specific code example is provided, as the issue is related to the interaction between torch, triton, and the specific model being used. However, the reproduction recipe provided in the issue can be used to test potential fixes.

Notes

The issue is specific to torch 2.12 and does not occur on torch 2.11. The divergence pattern is consistent across different llm_pair configurations and batch sizes, suggesting a general issue with full-cudagraph-vs-piecewise codegen drift in vLLM's compile path.

Recommendation

Apply a workaround by reverting the specific commit that introduced the change in cudagraph capture or inductor compile cache state, once it is identified through bisecting and testing.
