pytorch/pytorch: [vllm] [2.12 regression] PyTorch Fullgraph Smoke Test: full-cudagraph LLM diverges from piecewise-cudagraph LLM at batch=123

GitHub stats: pytorch/pytorch#182125, fetched 2026-05-02 05:27:03
Comments: 0 · Participants: 1 · Timeline: 186 · Reactions: 0
Timeline (top): mentioned ×88, subscribed ×88, labeled ×8, cross-referenced ×2

Under torch 2.12.0 + triton 3.7.0, vLLM's test_full_cudagraph[123-10-llm_pair10] fails because the full-cudagraph LLM produces different output than the piecewise-cudagraph LLM at batch_size=123, max_tokens=10 for the same prompt and greedy sampling:

AssertionError: assert ' jumps over ...y dog\n\n# 10' == ' jumps over ...y dog\n\n# 1.'

The outputs differ only in the final character:

piecewise: ... jumps over the lazy dog\n\n# 1.
full_cg:   ... jumps over the lazy dog\n\n# 10
                                             ^

Both modes use temperature=0.0, top_p=1.0 (purely greedy), so the result must be bit-identical on a correct implementation. On torch 2.11 they were. On torch 2.12, the full-cudagraph path diverges from piecewise at this specific batch size. Blocking the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).


Environment

  • torch: 2.12.0+cu130 (test channel)
  • triton: 3.7.0
  • CUDA: 13.0
  • Python: 3.12.13
  • GPU: NVIDIA H100 (mithril-h100-pool)

Reproduction

Failing test:

tests/compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::test_full_cudagraph[123-10-llm_pair10]

Test parametrization: batch_size=123, max_tokens=10, llm_pair10 (the 11th fixture combo). The test does:

prompts = ["the quick brown fox"] * batch_size  # 123 copies of the same prompt
sampling_params = SamplingParams(temperature=0.0, max_tokens=max_tokens, top_p=1.0)

piecewise_responses = piecewise_llm.generate(prompts, sampling_params)
full_responses = full_cudagraph_llm.generate(prompts, sampling_params)

for piecewise_res, full_res in zip(piecewise_responses, full_responses):
    assert piecewise_res.outputs[0].text.lower() == full_res.outputs[0].text.lower()

The two LLMs differ only in cudagraph mode (FULL vs PIECEWISE). With greedy decoding the outputs must match exactly. They don't on torch 2.12 at batch_size=123.

The other 9 batch-size/max-tokens combinations in the parametrize list ((1,10), (7,10), (16,10), (25,10), (32,10), (45,10), (64,10), (8,5), (8,30)) all pass; only (123, 10) fails.

Reproducibility on torch 2.12 branch

Reproduces on every test-PR run since 2026-04-29.

Passes on every recent main build.

Diagnosis request

A correctly-implemented full-cudagraph rewrite must produce bit-identical outputs vs piecewise on greedy decoding — they're the same compute graph, only the CUDA-graph capture/replay differs. The fact that this fails specifically at batch_size=123 (a non-power-of-two padded batch) and only on torch 2.12 suggests:

  1. A change in cudagraph capture/replay semantics in torch 2.12 that affects how padded batches are handled.
  2. A change in the inductor-compiled kernel for the padded shape that produces different kernel selection between full-graph vs piecewise compile boundaries.
  3. A change in how torch 2.12 cudagraph handles tensor strides for the padded dim (the 123 likely gets padded to 128 internally; that boundary may now be handled differently).
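
The padding hypothesis can be made concrete with a short sketch. If CUDA graphs are captured only for a fixed set of batch sizes, a request batch is rounded up to the next captured size and the extra rows are padding. The capture-size list below is hypothetical and chosen only for illustration; the real list depends on the vLLM configuration used by the test, is not stated in this issue, and may be finer-grained.

```python
from bisect import bisect_left

# Hypothetical capture sizes for illustration only; the real list depends on
# the vLLM configuration used by the test and is not stated in this issue.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]


def padded_batch_size(batch_size: int, capture_sizes: list[int]) -> int:
    """Return the smallest capture size that can hold batch_size."""
    sizes = sorted(capture_sizes)
    position = bisect_left(sizes, batch_size)
    if position == len(sizes):
        raise ValueError(f"batch_size {batch_size} exceeds largest capture size")
    return sizes[position]


# Batch sizes from the llm_pair10 sequence described in this issue.
for batch_size in (1, 7, 16, 25, 32, 45, 64, 123):
    padded = padded_batch_size(batch_size, CAPTURE_SIZES)
    print(f"batch={batch_size:>3} -> padded={padded:>3} (pad rows: {padded - batch_size})")
```

Under this assumed list, 123 rounds up to 128, which is the boundary the hypotheses above refer to. Whether the padded rows are actually involved in the divergence is not established.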

Could a maintainer compare cudagraph capture state at batch=123 between torch 2.11 and torch 2.12 for this model?

Links

  • vLLM PR: vllm-project/vllm#40077
  • Umbrella: pytorch/pytorch#180899

cc @mcarilli @ezyang @eellison @penguinwu @BoyuanFeng @chauhang @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo


Update 2026-05-01: root cause still unknown

This was originally suspected to share a root cause with pytorch/pytorch#182124 (the AsyncTP correctness divergence). We verified that #182124 is fixed by reverting pytorch/pytorch#176994 ("[Inductor] Improve materialization heuristic for a chain of computations") on the test wheel. However, this issue (#182125) is NOT fixed by that revert — the # 1. vs # 10 divergence at batch_size=123 reproduces with #176994 reverted.

So the root cause of this issue is different and still unknown. We are continuing to investigate other 2.12 cherry-picks and vLLM-side commits as candidates.

Reproduction (current state of investigation)

Hardware: 1× H100 80GB (mithril-h100-pool in CI; any H100 should work).

Image (pinned torch 2.12.0+cu130 RC + matching vLLM build):

docker pull public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf

Run:

docker run --rm -it \
    --gpus '"device=0"' \
    --shm-size=4g \
    -e HF_TOKEN=<your-hf-token> \
    public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf \
    bash

# inside the container
cd /vllm-workspace/tests
rm -rf /home/dev/.cache/vllm/torch_compile_cache    # ensure cold compile

# CRITICAL: run the FULL llm_pair10 sequence in one pytest invocation —
# the failure only triggers after the LLM instances accumulate state
# across the prior batch sizes (1, 7, 16, 25, 32, 45, 64) before reaching 123.
pytest -v -s 'compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph' -k 'llm_pair10' 2>&1 \
    | grep -E "(test_full_cudagraph\[|PASSED|FAILED|AssertionError|jumps over)"

Expected fail signature at [123-10-llm_pair10]:

AssertionError: assert ' jumps over ...y dog\n\n# 10' == ' jumps over ...y dog\n\n# 1.'
  - # 1.
  + # 10

The earlier batch sizes (1, 7, 16, 25, 32, 45, 64) PASS; only [123-10-llm_pair10] fails.

What we've ruled out

  • Hardware variation: bit-identical CI failure on mithril-h100-pool H100 80GB; reproduced (only when running the full llm_pair10 parametrization sequence) on a separate H100 80GB dev box → not a CI-agent-specific quirk.
  • AOT cache state: cold compile (no cache) still reproduces the divergence in CI; only the single-test-in-isolation invocation skips the issue (because it doesn't accumulate state from the earlier batch sizes).
  • MIG slicing: confirmed neither CI nor dev box uses MIG; both run on full GPUs with MIG M.: Disabled.
  • Memory pressure: gpu_memory_utilization=0.5 and 0.9 both fail in CI, so the failure does not depend on the memory-utilization setting.
  • pytorch/pytorch#176994 (Inductor materialization heuristic): reverting on the installed torch fixes pytorch/pytorch#182124 but NOT this issue.

Open candidates

Other commits that landed on release/2.12 between 63658 (passing) and 63890 (first failure):

  • f813f7732d Fix dynamic shape tile issue (#181793) — touches torch/_inductor/tiling_utils.py (tile selection at non-power-of-two batch — 123 → padded to 128 matches this hypothesis)
  • 7c927dd255 [dynamo] Disable recursive dict tag optimization — touches torch/_dynamo/config.py
  • dea39b1260 warn instead of error on fullgraph=True fallback to eager — should be benign

vLLM-side commits in the same window also worth considering:

  • c2fb01331 [Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraphWrapper
  • 6f20f81cb shape_invariants → shape_id refactor in dynamic_arg_dims
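
If the torch-side candidates can be applied in order on top of the last passing build, the search for the offending commit can be organized as a binary search over prefixes of that list. The sketch below is only an illustration under several assumptions: the relative order of the commits is assumed, exactly one commit is assumed to introduce the failure, and `fails_with` is a hypothetical callback standing in for "rebuild torch with these commits and run the test class". The loop at the end is a simulation that exercises the search and is not a finding about any commit.

```python
from collections.abc import Callable, Sequence

# torch-side candidates from the list above. Their relative order on the
# release branch is an ASSUMPTION made for this illustration.
CANDIDATES = ["f813f7732d", "7c927dd255", "dea39b1260"]


def first_bad_commit(
    commits: Sequence[str],
    fails_with: Callable[[Sequence[str]], bool],
) -> str:
    """Return the first commit whose inclusion makes the test fail.

    Assumes the baseline (no commits applied) passes, the full list fails,
    and once a prefix fails every longer prefix fails too.
    """
    if not commits:
        raise ValueError("need at least one candidate commit")
    low, high = 0, len(commits)  # prefix of length low passes, length high fails
    while high - low > 1:
        middle = (low + high) // 2
        if fails_with(commits[:middle]):
            high = middle
        else:
            low = middle
    return commits[high - 1]


def make_simulated_check(culprit: str) -> Callable[[Sequence[str]], bool]:
    """Simulation only: pretend the test fails whenever `culprit` is applied."""
    return lambda applied: culprit in applied


# Exercise the search with each candidate as the pretend culprit.
for pretend_culprit in CANDIDATES:
    found = first_bad_commit(CANDIDATES, make_simulated_check(pretend_culprit))
    print(f"simulated culprit {pretend_culprit} -> search returns {found}")
```

The vLLM-side commits would need a separate search, since they live in a different repository and are built independently of the torch wheel.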

Diagnosis request

The divergence pattern (only batch_size=123 after running other batch sizes through the same shared LLM pair) suggests state accumulation in cudagraph capture/replay or in the inductor compile cache. Could a maintainer trace what state is mutated by the earlier batch sizes that affects the codegen/replay at batch=123?


Update 2026-05-01 (later): confirmed local reproduction + scope is broader than originally reported

Reproduced on a 1×H100 80GB dev box (using 1 of 2 GPUs in a 2×H100 node) inside the same vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf image. Running the full TestFullCUDAGraph class (matches CI's invocation of pytest tests/compile/fullgraph/test_full_cudagraph.py) produces 19 failures across 4 different llm_pair configurations and many batch sizes — much broader than CI's single [123-10-llm_pair10] failure suggested.

Failing parametrizations on local repro

| llm_pair | Failing [batch_size-max_tokens] | Divergence pattern |
|---|---|---|
| llm_pair0 | [7-10], [16-10], [32-10], [45-10], [64-10], [123-10] | piecewise: dog"\n\n def vs full: dog"\n\n # |
| llm_pair1 | [16-10], [32-10], [64-10], [123-10] | mixed direction (def vs #, also # vs def) |
| llm_pair2 | [32-10], [45-10], [64-10], [123-10] | piecewise: def vs full: # |
| llm_pair3 | [32-10], [45-10], [64-10], [123-10], [8-30] | def/# divergence, plus at (8, 30) divergence inside generated def test_is_palindrome: piecewise assertequal(is_palind vs full asserttrue(is_palind |

Final pytest summary:

============== 19 failed, 121 passed, 40 skipped, 48 warnings in 940.81s ==============

llm_pair0..llm_pair9 are alternative (model, backend_config, use_inductor_graph_partition) tuples. The fact that llm_pair0..llm_pair3 ALL diverge (and not just llm_pair10 as CI showed) means this is not specific to one model/backend combination — it's a general full-cudagraph-vs-piecewise codegen drift in vLLM's compile path, surfaced after enough warmup.

Properties confirmed

  • State accumulation required: single-test invocations (e.g., pytest -k '[123-10-llm_pair10]' alone) PASS on the same hardware. Failures only appear when the class-scoped llm_pair fixture has been reused across prior (batch_size, max_tokens) parametrizations, indicating allocator/cudagraph-pool/inductor state mutates between runs in a way that affects later compilations or replays.
  • Greedy decoding: temperature=0.0, top_p=1.0 — output must be bit-identical between piecewise and full cudagraph by construction.
  • Reproducible: bit-identical divergence text on repeated runs.
  • Hardware: H100 80GB (full GPU, MIG disabled). Reproduces on standalone dev box, not just CI's mithril-h100-pool.
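
Because the failure needs state from earlier parametrizations, one way to narrow it down is to rerun the target test preceded by progressively more of the earlier batch sizes and see where it starts failing. The sketch below only builds the pytest command lines; it does not run them. The helper names are hypothetical, and it assumes pytest runs explicitly listed node IDs in the order given, which is worth confirming with `--collect-only` before relying on it.

```python
import shlex

TEST_FILE = "compile/fullgraph/test_full_cudagraph.py"
NODE_PREFIX = f"{TEST_FILE}::TestFullCUDAGraph::test_full_cudagraph"

# Batch sizes that run before 123 in the llm_pair10 sequence, per this issue.
PRIOR_BATCH_SIZES = [1, 7, 16, 25, 32, 45, 64]


def node_id(batch_size: int, max_tokens: int, pair: str) -> str:
    """Build a pytest node ID such as ...test_full_cudagraph[123-10-llm_pair10]."""
    return f"{NODE_PREFIX}[{batch_size}-{max_tokens}-{pair}]"


def prefix_experiments(
    pair: str = "llm_pair10",
    target: tuple[int, int] = (123, 10),
) -> list[list[str]]:
    """Return one pytest argv per experiment.

    Experiment k runs the last k prior batch sizes, then the target test.
    """
    experiments = []
    for keep in range(len(PRIOR_BATCH_SIZES) + 1):
        prior = PRIOR_BATCH_SIZES[len(PRIOR_BATCH_SIZES) - keep:]
        ids = [node_id(size, 10, pair) for size in prior]
        ids.append(node_id(target[0], target[1], pair))
        experiments.append(["pytest", "-v", "-s", *ids])
    return experiments


for argv in prefix_experiments():
    print(shlex.join(argv))
```

Each command should be run from /vllm-workspace/tests in a fresh process with the compile caches cleared, so that state from one experiment cannot leak into the next.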

Reproduction recipe (matches what triggered the failures above)

docker run --rm -it \
    --gpus '"device=0"' \
    --shm-size=4g \
    -e HF_TOKEN=<your-hf-token> \
    public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:18b9904b59f96fe5d28c8e7acdf6dec19c84acaf \
    bash

# inside the container
cd /vllm-workspace/tests
rm -rf /home/dev/.cache/vllm/torch_compile_cache /tmp/torchinductor_*

# Run the full test class — matches CI shard execution; class-scoped llm_pair
# fixture is reused across all (batch_size, max_tokens) parametrizations.
CUDA_VISIBLE_DEVICES=0 pytest -v -s \
    'compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph' 2>&1 \
    | tee /tmp/fullgraph_repro.log

Expect ~15 minutes runtime; multiple FAILED lines in the short test summary at the end.
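
To turn the saved log into a per-fixture list of failures, a short script can group the failing parametrizations by llm_pair. This is a sketch under one assumption about log format: it relies on lines where the word FAILED and the test node ID appear together, as in pytest's short test summary. With `-s`, other output may be interleaved, so verbose per-test lines may not always match. The sample log lines are illustrative, not copied from a real run.

```python
import re
from collections import defaultdict

# Matches node IDs such as test_full_cudagraph[123-10-llm_pair10].
NODE_RE = re.compile(
    r"test_full_cudagraph\[(?P<batch>\d+)-(?P<tokens>\d+)-(?P<pair>llm_pair\d+)\]"
)


def failures_by_pair(log_text: str) -> dict[str, list[tuple[int, int]]]:
    """Group failing (batch_size, max_tokens) cases by llm_pair id."""
    grouped: dict[str, set[tuple[int, int]]] = defaultdict(set)
    for line in log_text.splitlines():
        if "FAILED" not in line:
            continue
        match = NODE_RE.search(line)
        if match:
            case = (int(match["batch"]), int(match["tokens"]))
            # A set removes duplicates when a failure is reported twice.
            grouped[match["pair"]].add(case)
    return {pair: sorted(cases) for pair, cases in sorted(grouped.items())}


# Illustrative lines in the shape pytest prints; not taken from a real log.
SAMPLE_LOG = """\
compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::test_full_cudagraph[64-10-llm_pair10] PASSED
compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::test_full_cudagraph[123-10-llm_pair10] FAILED
FAILED compile/fullgraph/test_full_cudagraph.py::TestFullCUDAGraph::test_full_cudagraph[123-10-llm_pair10] - AssertionError
"""

print(failures_by_pair(SAMPLE_LOG))
```

For the real log, the same function can be called on the contents of /tmp/fullgraph_repro.log, for example `failures_by_pair(open("/tmp/fullgraph_repro.log").read())`.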

Open questions for maintainers

  1. What state mutates between batch-size parametrizations on the SAME llm_pair fixture that causes later batch sizes to produce divergent piecewise-vs-full-cudagraph output? Candidates: cudagraph capture pool fragmentation, inductor compile cache state, allocator reuse pattern.
  2. Why does the same divergence pattern (def vs #) appear across llm_pair0..llm_pair3 (different backends)? Suggests the bug is in a layer common to all backends — likely cudagraph capture or torch.compile / inductor.
  3. Bisect candidates still open (after pytorch/pytorch#176994 was ruled out — the same revert that fixed pytorch/pytorch#182124 does NOT fix this issue):
    • f813f7732d Fix dynamic shape tile issue (#181793) — touches torch/_inductor/tiling_utils.py. Tile selection at non-power-of-two batch (123 → padded to 128 boundary lines up).
    • 7c927dd255 [dynamo] Disable recursive dict tag optimization
    • c2fb01331 (vLLM) [Bugfix][Compile] Fix gc.collect/empty_cache patch arity in CUDAGraphWrapper
    • 6f20f81cb (vLLM) shape_invariants → shape_id refactor in dynamic_arg_dims

A single-cherry-pick revert experiment is in progress; will update once we know which is the cause.

Extended analysis

TL;DR

The root cause is still unknown. The practical path to a fix is to bisect the remaining torch 2.12 and vLLM candidate commits, then revert or patch whichever one introduces the divergence between the full-cudagraph and piecewise-cudagraph LLMs.

Guidance

  1. Investigate state accumulation: Determine what state is mutated between batch-size parametrizations on the same llm_pair fixture that causes later batch sizes to produce divergent output.
  2. Bisect candidates: Test reverts of individual commits, such as f813f7732d, 7c927dd255, c2fb01331, and 6f20f81cb, to identify the root cause of the issue.
  3. Verify cudagraph capture: Compare cudagraph capture state at batch=123 between torch 2.11 and torch 2.12 for the same model to identify any differences in capture or replay semantics.
  4. Check inductor compile cache: Investigate the inductor compile cache state to see if there are any changes in how padded batches are handled or if there are any issues with kernel selection.

Example

No specific code example is provided, as the issue is related to the interaction between torch, triton, and the specific model being used. However, the reproduction recipe provided in the issue can be used to test potential fixes.

Notes

The issue is specific to torch 2.12 and does not occur on torch 2.11. A similar def-vs-# divergence appears across llm_pair0..llm_pair3 and several batch sizes (with mixed direction on llm_pair1), suggesting a general full-cudagraph-vs-piecewise codegen drift in vLLM's compile path rather than a problem with one model/backend combination.

Recommendation

Apply a workaround by reverting the specific commit that introduced the change in cudagraph capture or inductor compile cache state, once it is identified through bisecting and testing.
