pytorch - [vllm] [2.12 regression] AsyncTP TP=2 (mp backend) on Llama-3.2-1B-Instruct: outputs differ between fuse_gemm_comms=True/False

GitHub stats
pytorch/pytorch#182124 (fetched 2026-05-02 05:27:05)
Comments: 0 · Participants: 1 · Timeline: 72 · Reactions: 0
Timeline (top): mentioned ×30, subscribed ×30, labeled ×10, cross-referenced ×2

Under torch 2.12.0 + triton 3.7.0, vLLM's test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct] fails because the fuse_gemm_comms=True reference run produces different outputs than the fuse_gemm_comms=False comparison run on TP=2 with the mp distributed backend:

AssertionError: Results for model='meta-llama/Llama-3.2-1B-Instruct' are not the same.

Sample divergence (only the 5th–10th tokens differ — cessive vs zahl):

ref_result:    text = [' Receeba Pt modelAndViewcessive', ' Receeba错误 modelAndViewcessive']  (fuse_gemm_comms=True)
compare_result:text = [' Receeba Pt modelAndViewzahl',    ' Receeba Pt modelAndViewzahl']     (fuse_gemm_comms=False)

The two configurations are expected to produce identical outputs on TP=2, because the AsyncTP fusion pass is supposed to be a numerically equivalent rewrite. On torch 2.11 they did; on torch 2.12 the AsyncTP-fused version diverges from the unfused baseline. This blocks the torch 2.12 upgrade for vLLM (vllm-project/vllm#40077).
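
To locate where two generated strings part ways, a small helper like the one below can be used. It is a minimal sketch that compares decoded text character by character (not token IDs, which the test output above does not show); the function name `first_divergence` is illustrative and not part of vLLM.

```python
def first_divergence(ref: str, compare: str) -> int:
    """Return the index of the first differing character, or -1 if equal."""
    for i, (a, b) in enumerate(zip(ref, compare)):
        if a != b:
            return i
    # No mismatch in the overlapping part: either equal, or one is a prefix.
    return -1 if len(ref) == len(compare) else min(len(ref), len(compare))


# First sample from each run, copied from the failure output above.
ref_text = " Receeba Pt modelAndViewcessive"   # fuse_gemm_comms=True
compare_text = " Receeba Pt modelAndViewzahl"  # fuse_gemm_comms=False

idx = first_divergence(ref_text, compare_text)
print(idx, repr(ref_text[idx:]), repr(compare_text[idx:]))
```

The shared prefix ends right after `modelAndView`, which matches the `cessive` vs `zahl` split reported in the issue.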



Environment

  • torch: 2.12.0+cu130 (test channel)
  • triton: 3.7.0
  • CUDA: 13.0
  • Python: 3.12.13
  • GPU: 2× NVIDIA H100
  • Distributed backend: mp (multiprocess)

Reproduction

Failing test:

tests/compile/correctness_e2e/test_async_tp.py::test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct]

The test runs the same model under two configurations:

ref_args:     --tensor-parallel-size 2 --distributed-executor-backend mp \
              --compilation_config '{"mode": 3, "compile_sizes": [2, 4, 8], "splitting_ops": [], "pass_config": {"fuse_gemm_comms": true}}'
compare_args: --tensor-parallel-size 2 --distributed-executor-backend mp

On torch 2.11 the two produce identical outputs; on torch 2.12 they diverge for meta-llama/Llama-3.2-1B-Instruct.
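
The two argument sets differ only in the compilation config. The sketch below rebuilds them programmatically so the JSON stays well-formed; the helper name `build_args` is an assumption for illustration, while the flag names and values are taken from the issue text.

```python
import json
from typing import List

# Arguments shared by the reference and comparison runs.
BASE_ARGS = ["--tensor-parallel-size", "2", "--distributed-executor-backend", "mp"]


def build_args(fuse_gemm_comms: bool) -> List[str]:
    """Return the CLI args for the fused (True) or baseline (False) run."""
    args = list(BASE_ARGS)
    if fuse_gemm_comms:
        compilation_config = {
            "mode": 3,
            "compile_sizes": [2, 4, 8],
            "splitting_ops": [],
            "pass_config": {"fuse_gemm_comms": True},
        }
        args += ["--compilation_config", json.dumps(compilation_config)]
    return args


ref_args = build_args(fuse_gemm_comms=True)
compare_args = build_args(fuse_gemm_comms=False)
print(ref_args)
print(compare_args)
```

Note that the baseline run passes no compilation config at all, so the comparison covers the whole compiled configuration, not only the fusion pass.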

Reproducibility on torch 2.12 branch

Reproduces on every test-PR run since 2026-04-29.

Passes on every recent main build (Llama-3.2-1B-Instruct).

Diagnosis request

The test compares two semantically equivalent compilations of the same model. The AsyncTP fusion pass (fuse_gemm_comms=True) in vllm.compilation.passes rewrites all-reduce + matmul sequences into fused operators that are supposed to be numerically equivalent. On torch 2.12, the fused path diverges from the baseline starting at around token 5. Likely causes worth investigating:

  1. A change in the all-reduce kernel's reduction order on H100 in torch 2.12 (the AsyncTP fused path depends on a specific order).
  2. A change in the matmul kernel selected by the inductor compiler (compile_sizes=[2,4,8]) on torch 2.12.
  3. A change in splitting_ops=[] semantics when combined with fuse_gemm_comms.

Links

  • vLLM PR: vllm-project/vllm#40077
  • Umbrella: pytorch/pytorch#180899

cc @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci @aditvenk @weifengpy @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo


Update 2026-05-01: confirmed root cause is pytorch/pytorch#176994

Bisected with a manual revert. Reverting the functional changes from pytorch/pytorch#176994 / commit c8b7b24124f4 ("[Inductor] Improve materialization heuristic for a chain of computations") on the installed torch makes this test PASS.

Repro of the fix

  1. Reverted only the functional files (kept tests intact):

    • torch/_inductor/graph.py (the _build_estimated_effective_users helper, the _estimated_effective_users cache field, the _count_effective_users method, and the call-site changes that pass has_non_fusible_users into mark_reuse / should_realize_on_reuse)
    • torch/_inductor/ir.py (the mark_reuse / should_realize_on_reuse signature additions of has_non_fusible_users and the size-aware cost-model body in should_realize_on_reuse)
  2. Applied as a reverse patch on the test image's bundled torch:

    cd /usr/local/lib/python3.12/dist-packages
    patch -R -p1 < revert_176994.patch
  3. Cleared the inductor compile cache and reran the test on 2 GPUs:

    rm -rf /home/dev/.cache/vllm/torch_compile_cache
    CUDA_VISIBLE_DEVICES=0,1 pytest -v -s \
      'compile/correctness_e2e/test_async_tp.py::test_async_tp_pass_correctness[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct]'
  4. Result: PASSED (vs deterministic FAIL on the unmodified torch with cessive/zahl divergence at token 5).
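
Step 3 can be scripted so the cache is always cleared before the rerun. This is a minimal sketch: the cache path and test ID are copied from the steps above, the function names are illustrative, and it presumably needs to be run from vLLM's tests/ directory because the test path is relative.

```python
import os
import shutil
import subprocess

# Path and test ID copied from the issue; adjust for your environment.
CACHE_DIR = "/home/dev/.cache/vllm/torch_compile_cache"
TEST_ID = (
    "compile/correctness_e2e/test_async_tp.py::test_async_tp_pass_correctness"
    "[False-mp-True-2-meta-llama/Llama-3.2-1B-Instruct]"
)


def clear_compile_cache(path: str = CACHE_DIR) -> None:
    """Delete the compile cache so stale artifacts are not reused."""
    shutil.rmtree(path, ignore_errors=True)


def build_repro_command(gpus: str = "0,1"):
    """Return (env, argv) for the two-GPU pytest invocation."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    argv = ["pytest", "-v", "-s", TEST_ID]
    return env, argv


def run_repro() -> int:
    """Clear the cache, run the test, and return pytest's exit code."""
    clear_compile_cache()
    env, argv = build_repro_command()
    return subprocess.run(argv, env=env).returncode
```

Calling `run_repro()` before and after applying the reverse patch should give a nonzero and a zero exit code respectively, if the bisection result holds in your environment.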

Conclusion

The new size-aware materialization heuristic introduced by #176994 changes which intermediate buffers Inductor materializes vs inlines for the Llama-3.2-1B-Instruct model under TP=2. The AsyncTP fusion pass (fuse_gemm_comms=True) and its baseline (fuse_gemm_comms=False) produce graphs that differ in users / has_non_fusible_users counts at certain residual-add points, so the new heuristic decides differently for the same buffer in the two paths → different inductor codegen → different reduction order in the all-reduce + matmul fusion → divergent output starting at token 5.
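
The last two links in that chain rest on floating-point addition not being associative. The toy example below uses made-up numbers (not the model's logits) to show how a different summation order changes a result in the last bit, and how that can flip a greedy pick between two near-tied candidates.

```python
from typing import Dict

vals = [0.1, 0.2, 0.3]

# Same three addends, two different reduction orders.
left_first = (vals[0] + vals[1]) + vals[2]
right_first = vals[0] + (vals[1] + vals[2])
print(left_first == right_first)  # False: the sums differ in the last bit


def greedy_pick(logits: Dict[str, float]) -> str:
    """Return the key with the largest value (first key wins on a tie)."""
    return max(logits, key=logits.get)


# Toy "logits": one candidate is fixed, the other depends on reduction order.
pick_a = greedy_pick({"zahl": 0.6, "cessive": left_first})
pick_b = greedy_pick({"zahl": 0.6, "cessive": right_first})
print(pick_a, pick_b)
```

Once one token differs under greedy decoding, every later token is conditioned on a different prefix, which is consistent with the outputs diverging from a single position onward.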

Suggested fix (same recommendation as for the related pytorch/pytorch#182125)

  1. Make the heuristic produce identical decisions across semantically-equivalent compile paths (the AsyncTP-fused path and the unfused baseline must produce numerically-identical outputs by contract; the new heuristic broke that contract).
  2. Gate the new heuristic behind a config flag (config.size_aware_materialization defaulting to off) so users running both paths can opt out for correctness.
  3. Revert #176994 from the 2.12 release branch until option 1 or option 2 above lands. The materialization optimization is a perf win on transformer residuals, but breaking AsyncTP correctness on a release branch isn't the right trade-off.
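
Option 2 could look roughly like the sketch below. Everything here is a toy stand-in: `ToyConfig`, `should_realize`, the thresholds, and both heuristic bodies are hypothetical and do not reproduce Inductor's code; only the flag name `size_aware_materialization` and the `has_non_fusible_users` input are borrowed from the issue text.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical flag from the suggestion above; not an existing
    # torch._inductor.config option. Off by default, as proposed.
    size_aware_materialization: bool = False


def should_realize(users: int, has_non_fusible_users: bool,
                   numel: int, cfg: ToyConfig) -> bool:
    """Toy decision: should a buffer be materialized rather than inlined?"""
    legacy = users > 1  # toy stand-in for the pre-#176994 behavior
    if not cfg.size_aware_materialization:
        return legacy
    # Toy stand-in for a size-aware rule that also looks at the user kind.
    return legacy and (has_non_fusible_users or numel >= 1024)


off, on = ToyConfig(), ToyConfig(size_aware_materialization=True)

# Same buffer seen from two graphs that differ only in has_non_fusible_users.
for cfg in (off, on):
    fused = should_realize(2, False, 256, cfg)
    baseline = should_realize(2, True, 256, cfg)
    print(cfg.size_aware_materialization, fused, baseline)
```

With the flag off, the decision in this toy ignores the inputs that differ between the two graphs, so both paths agree; with it on, they can disagree, which is the kind of split the conclusion describes.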

Note on related issue

pytorch/pytorch#182125 (PyTorch Fullgraph Smoke Test divergence at batch=123) was originally suspected to share this root cause. It does not — the same revert clears #182124 but not #182125. #182125 has a different (still-unknown) root cause, likely state-accumulation in cudagraph capture/replay when running multiple batch sizes through the same shared LLM pair.

extent analysis

TL;DR

The most likely fix for the issue is to revert the functional changes from pytorch/pytorch#176994, which introduced a new size-aware materialization heuristic that breaks the correctness contract of the AsyncTP fusion pass.

Guidance

  • Revert the changes from pytorch/pytorch#176994 to restore the previous materialization heuristic.
  • Apply the revert as a reverse patch on the test image's bundled torch.
  • Clear the inductor compile cache and rerun the test to verify the fix.
  • Consider gating the new heuristic behind a config flag to allow users to opt out for correctness.

Example

No code snippet is provided as the issue is related to a specific pytorch commit and revert.

Notes

The revert of pytorch/pytorch#176994 is confirmed to fix the issue, but it may not be the only solution. The introduction of a config flag to gate the new heuristic may also be a viable solution.

Recommendation

Apply the workaround by reverting the functional changes from pytorch/pytorch#176994, as it is confirmed to fix the issue and restore the correctness of the AsyncTP fusion pass.
