pytorch - 💡(How to fix) Fix [vllm] [2.12 regression][multimodal] Qwen2-Audio text-then-audio_embeds: prompt_embeds vs raw-text outputs diverge under --enforce-eager [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
pytorch/pytorch#184431Fetched 2026-05-20 03:38:40
View on GitHub
Comments
0
Participants
1
Timeline
27
Reactions
0
Author
Participants
Timeline (top)
mentioned ×9subscribed ×9labeled ×8cross-referenced ×1

On torch==2.12.0 + triton==3.7.0, the Qwen2-Audio "text vs prompt_embeds should yield identical outputs when mixed with audio_embeds" test diverges. The same model + prompt + temperature=0.0 produces a refusal-like reply for the text path and the expected semantic reply for the embeds path:

>       assert text_out == embeds_out
E       assert "I'm sorry, b...e an accurate" == 'This audio c...hroughout its'
E         - This audio contains music playing in the background throughout its
E         + I'm sorry, but I cannot provide an accurate

The test runs with --enforce-eager, so torch.compile / Inductor / Triton are not in the path — this is plain eager-mode numerical drift between two equivalent content shapes on the Qwen2-Audio multimodal pipeline.

  • Passes on every one of the last 7 main Full CI run builds with torch==2.11.0 (#66525, #66556, #66569, #66603, #66633, #66759, #66835).
  • Fails deterministically on the torch 2.12.0 / triton 3.7.0 / torchvision 0.27.0 upgrade PR (vllm-project/vllm#42848, build #66553).

Only the audio_first=False parameterization (text-then-audio_embeds, i.e. text/embeds part before the audio part in the message) fails — the audio_first=True (audio_embeds-then-text) parameterization passes. This makes it likely a positional/ordering interaction in the multimodal feature merge under torch 2.12.

Root Cause

On torch==2.12.0 + triton==3.7.0, the Qwen2-Audio "text vs prompt_embeds should yield identical outputs when mixed with audio_embeds" test diverges. The same model + prompt + temperature=0.0 produces a refusal-like reply for the text path and the expected semantic reply for the embeds path:

>       assert text_out == embeds_out
E       assert "I'm sorry, b...e an accurate" == 'This audio c...hroughout its'
E         - This audio contains music playing in the background throughout its
E         + I'm sorry, but I cannot provide an accurate

The test runs with --enforce-eager, so torch.compile / Inductor / Triton are not in the path — this is plain eager-mode numerical drift between two equivalent content shapes on the Qwen2-Audio multimodal pipeline.

  • Passes on every one of the last 7 main Full CI run builds with torch==2.11.0 (#66525, #66556, #66569, #66603, #66633, #66759, #66835).
  • Fails deterministically on the torch 2.12.0 / triton 3.7.0 / torchvision 0.27.0 upgrade PR (vllm-project/vllm#42848, build #66553).

Only the audio_first=False parameterization (text-then-audio_embeds, i.e. text/embeds part before the audio part in the message) fails — the audio_first=True (audio_embeds-then-text) parameterization passes. This makes it likely a positional/ordering interaction in the multimodal feature merge under torch 2.12.

Code Example

>       assert text_out == embeds_out
E       assert "I'm sorry, b...e an accurate" == 'This audio c...hroughout its'
E         - This audio contains music playing in the background throughout its
E         + I'm sorry, but I cannot provide an accurate

---

vllm serve Qwen/Qwen2-Audio-7B-Instruct \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio": 1}' \
  --enable-prompt-embeds \
  --enable-mm-embeds

---

pytest -x tests/entrypoints/openai/chat_completion/test_chat_completion_with_mixed_audio_embeds.py::test_text_content_and_prompt_embeds_match_with_audio_embeds
RAW_BUFFERClick to expand / collapse

Summary

On torch==2.12.0 + triton==3.7.0, the Qwen2-Audio "text vs prompt_embeds should yield identical outputs when mixed with audio_embeds" test diverges. The same model + prompt + temperature=0.0 produces a refusal-like reply for the text path and the expected semantic reply for the embeds path:

>       assert text_out == embeds_out
E       assert "I'm sorry, b...e an accurate" == 'This audio c...hroughout its'
E         - This audio contains music playing in the background throughout its
E         + I'm sorry, but I cannot provide an accurate

The test runs with --enforce-eager, so torch.compile / Inductor / Triton are not in the path — this is plain eager-mode numerical drift between two equivalent content shapes on the Qwen2-Audio multimodal pipeline.

  • Passes on every one of the last 7 main Full CI run builds with torch==2.11.0 (#66525, #66556, #66569, #66603, #66633, #66759, #66835).
  • Fails deterministically on the torch 2.12.0 / triton 3.7.0 / torchvision 0.27.0 upgrade PR (vllm-project/vllm#42848, build #66553).

Only the audio_first=False parameterization (text-then-audio_embeds, i.e. text/embeds part before the audio part in the message) fails — the audio_first=True (audio_embeds-then-text) parameterization passes. This makes it likely a positional/ordering interaction in the multimodal feature merge under torch 2.12.

Environment

  • torch==2.12.0+cu130
  • triton==3.7.0
  • torchvision==0.27.0
  • CUDA 13.0 driver 570.133.20
  • GPU: H100 (Buildkite bk-gpu-1-queue-ci)
  • Python 3.12
  • vLLM commit 47af9e1bc6cad1987cb87c967a3a001aecc2124e (PR vllm-project/vllm#42848)
  • Model: Qwen/Qwen2-Audio-7B-Instruct, --dtype bfloat16, --enforce-eager

Reproduction

vllm serve Qwen/Qwen2-Audio-7B-Instruct \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio": 1}' \
  --enable-prompt-embeds \
  --enable-mm-embeds

then run:

pytest -x tests/entrypoints/openai/chat_completion/test_chat_completion_with_mixed_audio_embeds.py::test_text_content_and_prompt_embeds_match_with_audio_embeds

The test sends two chat.completions.create(temperature=0.0, max_tokens=10) calls with identical semantic content — one with the user text passed as a raw text part, the other with the same text passed as a prompt_embeds part. Both are mixed with the same audio_embeds part. With audio_first=False (text/embeds before audio), the two responses diverge on torch 2.12.

Failing test

tests/entrypoints/openai/chat_completion/test_chat_completion_with_mixed_audio_embeds.py::test_text_content_and_prompt_embeds_match_with_audio_embeds[text-then-audio_embeds]

Diagnosis question

Both --enforce-eager and temperature=0.0 are set, so this is not a compile-graph or sampling-noise issue. The most likely surfaces touched by torch 2.12 here are:

  • attention kernels in the audio encoder / language-model fusion
  • bfloat16 reductions in the prompt-embeds vs raw-text embedding paths
  • positional embedding handling when the multimodal token is after a text token

Is this an intended numerical change in one of the bf16 / attention / SDPA code paths between 2.11 and 2.12, or is it a regression?

Links

cc @drisspg @liangel-02 @howardzhang-cv

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING