pytorch: [vllm] [2.12 regression][multimodal] Qwen2-Audio text-then-audio_embeds: prompt_embeds vs raw-text outputs diverge under --enforce-eager

atalman · 2026-05-19T20:41:59Z

## Summary

On `torch==2.12.0` + `triton==3.7.0`, the Qwen2-Audio "text vs `prompt_embeds` should yield identical outputs when mixed with `audio_embeds`" test diverges. The same model + prompt + `temperature=0.0` produces a refusal-like reply for the text path and the expected semantic reply for the embeds path:

```
>       assert text_out == embeds_out
E       assert "I'm sorry, b...e an accurate" == 'This audio c...hroughout its'
E         - This audio contains music playing in the background throughout its
E         + I'm sorry, but I cannot provide an accurate
```

The test runs with `--enforce-eager`, so `torch.compile` / Inductor / Triton are not in the path. This is plain eager-mode numerical drift between two equivalent content shapes on the Qwen2-Audio multimodal pipeline.

- Passes on every one of the last 7 main `Full CI run` builds with `torch==2.11.0` (#66525, #66556, #66569, #66603, #66633, #66759, #66835).
- Fails deterministically on the torch 2.12.0 / triton 3.7.0 / torchvision 0.27.0 upgrade PR (vllm-project/vllm#42848, build #66553).

Only the `audio_first=False` parameterization (`text-then-audio_embeds`, i.e. text/embeds part *before* the audio part in the message) fails; the `audio_first=True` (`audio_embeds-then-text`) parameterization passes. This makes it likely a positional/ordering interaction in the multimodal feature merge under torch 2.12.
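To make "numerical drift" concrete, here is a minimal sketch of how a change in bfloat16 accumulation order alone can flip a greedy (`temperature=0.0`) choice. It is an illustration only, not a claim about what actually changed in torch 2.12: the values, the `rival` logit, and the token labels are all made up, and bfloat16 rounding is emulated in pure Python rather than taken from torch.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                   # keep only the bfloat16 part
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Accumulate left to right, rounding to bfloat16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def greedy_pick(logits):
    """temperature=0.0 decoding: take the highest logit."""
    return max(logits, key=logits.get)


terms = [256.0, 1.0, 1.0, 1.0, 1.0]    # made-up partial sums for one logit
logit_fwd = bf16_sum(terms)            # each +1.0 is rounded away at 256
logit_rev = bf16_sum(reversed(terms))  # small terms add up first, then survive

rival = 258.0  # made-up logit of a competing token
print(logit_fwd, greedy_pick({"token_a": logit_fwd, "token_b": rival}))
print(logit_rev, greedy_pick({"token_a": logit_rev, "token_b": rival}))
```

The same five terms sum to 256.0 in one order and 260.0 in the other, so the greedy pick switches between `token_b` and `token_a`. With near-tied first-token logits, a reordering of this kind would be enough to send the two requests down entirely different replies.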
## Environment

- `torch==2.12.0+cu130`
- `triton==3.7.0`
- `torchvision==0.27.0`
- CUDA 13.0 driver 570.133.20
- GPU: H100 (Buildkite `bk-gpu-1-queue-ci`)
- Python 3.12
- vLLM commit `47af9e1bc6cad1987cb87c967a3a001aecc2124e` (PR vllm-project/vllm#42848)
- Model: `Qwen/Qwen2-Audio-7B-Instruct`, `--dtype bfloat16`, `--enforce-eager`

## Reproduction

```
vllm serve Qwen/Qwen2-Audio-7B-Instruct \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --max-num-seqs 4 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio": 1}' \
  --enable-prompt-embeds \
  --enable-mm-embeds
```

then run:

```
pytest -x tests/entrypoints/openai/chat_completion/test_chat_completion_with_mixed_audio_embeds.py::test_text_content_and_prompt_embeds_match_with_audio_embeds
```

The test sends two `chat.completions.create(temperature=0.0, max_tokens=10)` calls with identical semantic content: one with the user text passed as a raw `text` part, the other with the same text passed as a `prompt_embeds` part. Both are mixed with the same `audio_embeds` part. With `audio_first=False` (text/embeds before audio), the two responses diverge on torch 2.12.

## Failing test

`tests/entrypoints/openai/chat_completion/test_chat_completion_with_mixed_audio_embeds.py::test_text_content_and_prompt_embeds_match_with_audio_embeds[text-then-audio_embeds]`

## Diagnosis question

Both `--enforce-eager` and `temperature=0.0` are set, so this is not a compile-graph or sampling-noise issue. The most likely surfaces touched by torch 2.12 here are:

- attention kernels in the audio encoder / language-model fusion
- bfloat16 reductions in the prompt-embeds vs raw-text embedding paths
- positional embedding handling when the multimodal token is *after* a text token

Is this an intended numerical change in one of the bf16 / attention / SDPA code paths between 2.11 and 2.12, or is it a regression?
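As a small triage aid, the sketch below locates where the two replies first differ. The helper name `first_divergence` is hypothetical (it is not part of the vLLM test), and the two strings are the full replies shown in the pytest diff, compared word by word rather than by model tokens.

```python
def first_divergence(a, b):
    """Return the index of the first differing item, or None if a == b.

    If one sequence is a strict prefix of the other, the length of the
    shorter sequence is returned.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) == len(b):
        return None
    return min(len(a), len(b))


# Replies copied from the assertion diff in this report.
text_out = "I'm sorry, but I cannot provide an accurate"
embeds_out = "This audio contains music playing in the background throughout its"

idx = first_divergence(text_out.split(), embeds_out.split())
print(f"first differing word index: {idx}")
```

The replies already differ at index 0, which suggests the two paths disagree on the very first generated token, so the prefill step (embedding merge, positions, first attention pass) seems a more likely place to look than drift accumulated over decode steps.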
## Links

- vLLM PR: vllm-project/vllm#42848
- Failing build: https://buildkite.com/vllm/ci/builds/66553#019e4024-9b1c-4e65-a46f-6c67750912f3
- Passing main baseline (torch 2.11): https://buildkite.com/vllm/ci/builds/66835 (an


pytorch/pytorch#184431•Fetched 2026-05-20 03:38:40


