pytorch: [vllm] [2.12 regression] AsyncTP TP=2 (mp backend) on Llama-3.2-1B-Instruct: outputs differ between fuse_gemm_comms=True/False

pytorch · 2026-05-01 14:57:06

GitHub stats

pytorch/pytorch#182124 • Fetched 2026-05-02 05:27:05

Author

atalman

Participants

atalman

Timeline (top)

mentioned ×30, subscribed ×30, labeled ×10, cross-referenced ×2

Under torch 2.12.0 + triton 3.7.0, vLLM's test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct] fails because the fuse_gemm_comms=True reference run produces different outputs than the fuse_gemm_comms=False comparison run on TP=2 with the mp distributed backend:

AssertionError: Results for model='meta-llama/Llama-3.2-1B-Instruct' are not the same.

Sample divergence (only the 5th–10th tokens differ — cessive vs zahl):

ref_result:    text = [' Receeba Pt modelAndViewcessive', ' Receeba错误 modelAndViewcessive']  (fuse_gemm_comms=True)
compare_result:text = [' Receeba Pt modelAndViewzahl',    ' Receeba Pt modelAndViewzahl']     (fuse_gemm_comms=False)

The two configurations are expected to produce identical outputs on TP=2, because the AsyncTP fusion pass is supposed to be a numerically equivalent rewrite. On torch 2.11 they did. On torch 2.12 the AsyncTP-fused version diverges from the unfused baseline. This blocks the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).
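
To make a mismatch like the one above easier to read, a small helper can report where two output strings first differ. This is only a sketch: the helper name first_divergence is hypothetical and not part of the vLLM test suite, and it works on characters rather than tokens, so mapping the position back to a token index would need the tokenizer's output.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(ref: str, cmp: str) -> Optional[int]:
    """Return the index of the first differing character, or None if equal."""
    sentinel = object()  # marks "ran past the end" of the shorter string
    pairs = zip_longest(ref, cmp, fillvalue=sentinel)
    for i, (a, b) in enumerate(pairs):
        if a != b:
            return i
    return None


# Outputs for the first prompt, copied from the sample divergence in this report.
ref_text = " Receeba Pt modelAndViewcessive"  # fuse_gemm_comms=True
cmp_text = " Receeba Pt modelAndViewzahl"     # fuse_gemm_comms=False

idx = first_divergence(ref_text, cmp_text)
print("shared prefix:", repr(ref_text[:idx]))
print("ref tail:", repr(ref_text[idx:]), "| compare tail:", repr(cmp_text[idx:]))
```

For the strings above, the shared prefix is ' Receeba Pt modelAndView' and the tails are 'cessive' and 'zahl'.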

Environment

torch: 2.12.0+cu130 (test channel)
triton: 3.7.0
CUDA: 13.0
Python: 3.12.13
GPU: 2× NVIDIA H100
Distributed backend: mp (multiprocess)

Reproduction

Failing test:

tests/compile/correctness_e2e/test_async_tp.py::test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct]

The test runs the same model under two configurations:

ref_args:     --tensor-parallel-size 2 --distributed-executor-backend mp \
              --compilation_config '{"mode": 3, "compile_sizes": [2, 4, 8], "splitting_ops": [], "pass_config": {"fuse_gemm_comms": true}}'
compare_args: --tensor-parallel-size 2 --distributed-executor-backend mp

On torch 2.11 the two produce identical outputs; on torch 2.12 they diverge for meta-llama/Llama-3.2-1B-Instruct.
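
The only difference between the two runs is the compilation config. The sketch below rebuilds both argument lists from the values quoted above to make that explicit. How the real test assembles its arguments is not shown in this report, so the variable names here are assumptions for illustration.

```python
import json
import shlex

# Values copied from ref_args above; the dict layout mirrors the quoted JSON.
compilation_config = {
    "mode": 3,
    "compile_sizes": [2, 4, 8],
    "splitting_ops": [],
    "pass_config": {"fuse_gemm_comms": True},
}

# Arguments shared by both configurations.
common_args = [
    "--tensor-parallel-size", "2",
    "--distributed-executor-backend", "mp",
]

# Reference run: common arguments plus the AsyncTP fusion config.
ref_args = common_args + ["--compilation_config", json.dumps(compilation_config)]
# Comparison run: common arguments only (no fusion pass requested).
compare_args = list(common_args)

print(shlex.join(ref_args))
print(shlex.join(compare_args))
```

Building the JSON with json.dumps avoids hand-quoting mistakes such as writing Python's True instead of JSON's true.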

Reproducibility on torch 2.12 branch

Reproduces on every test-PR run since 2026-04-29:

Passes on every recent main build (Llama-3.2-1B-Instruct):

2026-04-30 daily: https://buildkite.com/vllm/ci/builds/63914
2026-05-01 nightly: https://buildkite.com/vllm/ci/builds/63994

Diagnosis request

The test compares two semantically equivalent compilations of the same model. The AsyncTP fusion pass (fuse_gemm_comms=True) in vllm.compilation.passes rewrites all-reduce + matmul sequences into fused operators that are supposed to be numerically equivalent. On torch 2.12, the fused path diverges from the baseline starting at token ~5. Likely causes worth investigating:

- A change in the all-reduce kernel's reduction order on H100 in torch 2.12 (the AsyncTP fused path depends on a specific order).
- A change in the matmul kernel selected by the inductor compiler (compile_sizes=[2,4,8]) on torch 2.12.
- A change in splitting_ops=[] semantics when combined with fuse_gemm_comms.
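
Several of these hypotheses come down to the same property: floating-point addition is not associative, so two kernels that sum the same values in a different order can return slightly different results. The snippet below is a generic illustration in plain Python floats, not a reproduction of any torch kernel.

```python
# Same three values, two groupings of the additions.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

The difference is a single rounding step, which is why such changes normally pass tolerance-based checks but can fail a test that requires identical generated text.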

Links

vLLM PR: vllm-project/vllm#40077
Umbrella: pytorch/pytorch#180899

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo

Update 2026-05-01: confirmed root cause is pytorch/pytorch#176994

Bisected with a manual revert. Reverting the functional changes from pytorch/pytorch#176994 / commit c8b7b24124f4 ("[Inductor] Improve materialization heuristic for a chain of computations") on the installed torch makes this test PASS.

Repro of the fix

Reverted only the functional files (kept tests intact):
- torch/_inductor/graph.py (the _build_estimated_effective_users helper, the _estimated_effective_users cache field, the _count_effective_users method, and the call-site changes that pass has_non_fusible_users into mark_reuse / should_realize_on_reuse)
- torch/_inductor/ir.py (the mark_reuse / should_realize_on_reuse signature additions of has_non_fusible_users and the size-aware cost-model body in should_realize_on_reuse)

Applied as a reverse patch on the test image's bundled torch:

cd /usr/local/lib/python3.12/dist-packages
patch -R -p1 < revert_176994.patch

Cleared the inductor compile cache and reran the test on 2 GPUs:

rm -rf /home/dev/.cache/vllm/torch_compile_cache
CUDA_VISIBLE_DEVICES=0,1 pytest -v -s \
  'compile/correctness_e2e/test_async_tp.py::test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct]'

Result: PASSED (vs deterministic FAIL on the unmodified torch with cessive/zahl divergence at token 5).

Conclusion

The new size-aware materialization heuristic introduced by #176994 changes which intermediate buffers Inductor materializes and which it inlines for the Llama-3.2-1B-Instruct model under TP=2. The AsyncTP fusion pass (fuse_gemm_comms=True) and its baseline (fuse_gemm_comms=False) produce graphs that differ in users / has_non_fusible_users counts at certain residual-add points, so the new heuristic decides differently for the same buffer in the two paths. The resulting chain is: different inductor codegen → different reduction order in the all-reduce + matmul fusion → divergent output starting at token 5.
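
The last link in that chain, from a rounding-sized numeric difference to different generated text, can be illustrated with a toy decoding step. The sketch assumes greedy (argmax) sampling, which this report does not confirm for the test, and the logit values are invented for illustration rather than taken from the model.

```python
def greedy_token(logits):
    """One greedy decoding step: return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


# Hypothetical logits with two nearly tied candidates (indices 1 and 2).
baseline = [0.10, 2.5000001, 2.5000000, -1.0]
# The same logits after a tiny perturbation of the kind a different
# summation order could introduce.
perturbed = [0.10, 2.5000000, 2.5000001, -1.0]

print(greedy_token(baseline), greedy_token(perturbed))  # 1 2
```

Once one token differs, every later step is conditioned on a different prefix, so the two outputs generally do not reconverge.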

Suggested fix (same recommendation as for the related pytorch/pytorch#182125)

1. Make the heuristic produce identical decisions across semantically equivalent compile paths (the AsyncTP-fused path and the unfused baseline must produce numerically identical outputs by contract; the new heuristic broke that contract).
2. Gate the new heuristic behind a config flag (config.size_aware_materialization defaulting to off) so users running both paths can opt out for correctness.
3. Revert #176994 from the 2.12 release branch until option 1 or 2 lands. The materialization optimization is a perf win on transformer residuals, but breaking AsyncTP correctness on a release branch isn't the right trade-off.

Note on related issue

pytorch/pytorch#182125 (PyTorch Fullgraph Smoke Test divergence at batch=123) was originally suspected to share this root cause. It does not — the same revert clears #182124 but not #182125. #182125 has a different (still-unknown) root cause, likely state-accumulation in cudagraph capture/replay when running multiple batch sizes through the same shared LLM pair.

extent analysis

TL;DR

The most likely fix for the issue is to revert the functional changes from pytorch/pytorch#176994, which introduced a new size-aware materialization heuristic that breaks the correctness contract of the AsyncTP fusion pass.

Guidance

- Revert the changes from pytorch/pytorch#176994 to restore the previous materialization heuristic.
- Apply the revert as a reverse patch on the test image's bundled torch.
- Clear the inductor compile cache and rerun the test to verify the fix.
- Consider gating the new heuristic behind a config flag to allow users to opt out for correctness.

Example

No code snippet is provided as the issue is related to a specific pytorch commit and revert.

Notes

The revert of pytorch/pytorch#176994 is confirmed to fix the issue, but it may not be the only solution. The introduction of a config flag to gate the new heuristic may also be a viable solution.

Recommendation

Apply the workaround by reverting the functional changes from pytorch/pytorch#176994, as it is confirmed to fix the issue and restore the correctness of the AsyncTP fusion pass.
