pytorch - ✅(Solved) Fix [vllm] [2.12 regression][FLASH_ATTN] test_cascade_attention divergence: extra "you" in generated Fibonacci output [1 pull requests, 2 comments, 1 participants]

pytorch/pytorch#182700 (fetched 2026-05-07 03:30:37)

Under torch 2.12.0+cu130 + triton 3.7.0, vLLM's tests/v1/e2e/general/test_cascade_attention.py::test_cascade_attention[FLASH_ATTN] deterministically diverges from its reference output by a single inserted word. The cascade-attention path produces:

got:      "Sure, I can help you with that. The Fibonacci sequence is a series ..."
expected: "Sure, I can help with that. The Fibonacci sequence is a series ..."

(rest of the response is byte-identical). The test is a deterministic-output check (response.outputs[0].text == ref_output), so a single-token divergence trips the assertion.

Reproducible: failed on first run AND retry of the same build.
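
To make the failure mode concrete, here is a minimal sketch of how an exact-match check reacts to one inserted word. The helper name `first_divergence` is hypothetical and is not part of the vLLM test; the two strings are the truncated prefixes quoted above.

```python
from itertools import zip_longest


def first_divergence(got: str, expected: str):
    """Return (word_index, got_word, expected_word) at the first mismatch, or None."""
    pairs = zip_longest(got.split(), expected.split())
    for index, (got_word, expected_word) in enumerate(pairs):
        if got_word != expected_word:
            return index, got_word, expected_word
    return None


# Truncated prefixes copied from the report (the trailing "..." is dropped).
got = "Sure, I can help you with that. The Fibonacci sequence is a series"
expected = "Sure, I can help with that. The Fibonacci sequence is a series"

# An exact string comparison fails as soon as one word differs.
print("exact match:", got == expected)
print("first divergence:", first_divergence(got, expected))
```

Because the comparison is on the whole string, everything after the inserted word is shifted, which is why a single extra token is enough to fail the assertion even though the remaining text is identical.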

Fix Action

Fixed

PR fix notes

PR #41181: [Bugfix] Fix RuntimeError: Already borrowed by adding thread-safe Hugging Face fast-tokenizer wrappers

Description (problem / solution / changelog)

Purpose

Adds a thread-safe Hugging Face fast-tokenizer wrapper to fix the RuntimeError: Already borrowed concurrency issue reported in #40949.

  • Uses a tokenizer pool that dispatches each call to a borrowing method to a free, deep-copied tokenizer instance.

Also fixes concurrency issues with the bad_words sampling parameter and removes the need to deepcopy in the multimodal processor.

Limitations:

  • Mutation is not propagated to tokenizers in the pool.
  • Adjacent method calls could happen on different deep copies.
  • Direct access to _tokenizer is not supported by the pool (this is done by FastIncrementalDetokenizer).
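
The pool idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `ToyTokenizer` and `TokenizerPool` are hypothetical names, and the toy tokenizer only imitates the "Already borrowed" failure with a flag instead of using a real Hugging Face tokenizer.

```python
import copy
import queue
import threading
import time
from contextlib import contextmanager


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that must not be used concurrently."""

    def __init__(self):
        self._borrowed = False

    def encode(self, text: str) -> list:
        # Imitate the "Already borrowed" error when two threads overlap.
        if self._borrowed:
            raise RuntimeError("Already borrowed")
        self._borrowed = True
        try:
            time.sleep(0.0005)  # pretend to do some work
            return [len(word) for word in text.split()]
        finally:
            self._borrowed = False


class TokenizerPool:
    """Hands each call a free deep copy, so no copy is shared between threads."""

    def __init__(self, tokenizer, size: int):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(copy.deepcopy(tokenizer))

    @contextmanager
    def borrow(self):
        tokenizer = self._free.get()  # blocks until a copy is free
        try:
            yield tokenizer
        finally:
            self._free.put(tokenizer)

    def encode(self, text: str) -> list:
        with self.borrow() as tokenizer:
            return tokenizer.encode(text)


pool = TokenizerPool(ToyTokenizer(), size=4)
results, errors = [], []


def worker():
    try:
        for _ in range(50):
            results.append(pool.encode("the quick brown fox"))
    except RuntimeError as exc:
        errors.append(exc)


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(f"{len(results)} calls completed, {len(errors)} errors")
```

The sketch also makes the listed limitations visible: the copies are taken once at construction, so a later mutation of the original object never reaches them, and two consecutive `pool.encode` calls may be served by different copies.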

Test Plan

pytest tests/models/multimodal/processing/test_common.py

benchmark.py

python benchmark.py --prompt-length 512 --iterations 5000 --warmup 5000 --mixed
vllm serve Qwen/Qwen3-4B-Instruct-2507-FP8 --renderer_num_workers 4 --api-server-count=4

vllm serve deepseek-ai/DeepSeek-V4-Flash --renderer_num_workers 4   --api-server-count=4 --trust-remote-code --tensor-parallel-size=2 --max-model-len 4096 --kv-cache-dtype fp8

vllm serve deepseek-ai/DeepSeek-OCR --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4

vllm serve Qwen/Qwen-VL-Chat --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4 --trust-remote-code --hf-overrides '{"architectures": ["QwenVLForConditionalGeneration"]}'

stress_send.py

python stress_send.py -n 5000 -c 500 [--mm] [--bad-word]

Test Result

Unit tests pass

Benchmark:

Loading tokenizer from meta-llama/Llama-3.1-8B-Instruct …
Prompt: 522 tokens, 2236 chars
Config: iterations=5000  warmup=5000  threads=[1, 2, 8]  mixed=True  truncation_max_length=1044


=== 1 thread(s) ===
  raw (no wrapper)                mean=0.526 ms  median=0.490 ms  p99=0.855 ms  total=2627.7 ms  wall=2629.1 ms  n=5000
  lock wrapper                    mean=0.502 ms  median=0.485 ms  p99=0.682 ms  total=2510.0 ms  wall=2511.4 ms  n=5000
  copy wrapper (threading.local)  mean=0.546 ms  median=0.524 ms  p99=0.761 ms  total=2730.2 ms  wall=2746.4 ms  n=5000
  copy wrapper (dict)             mean=0.527 ms  median=0.520 ms  p99=0.699 ms  total=2634.0 ms  wall=2635.5 ms  n=5000
  queue wrapper                   mean=0.537 ms  median=0.534 ms  p99=0.603 ms  total=2686.5 ms  wall=2688.0 ms  n=5000

=== 2 thread(s) ===
  raw (no wrapper)                failures=2418
  lock wrapper                    mean=0.983 ms  median=0.527 ms  p99=1.106 ms  total=4913.6 ms  wall=2702.8 ms  n=5000
  copy wrapper (threading.local)  mean=0.564 ms  median=0.545 ms  p99=0.925 ms  total=2818.5 ms  wall=1448.9 ms  n=5000
  copy wrapper (dict)             mean=0.596 ms  median=0.554 ms  p99=1.095 ms  total=2978.9 ms  wall=1497.5 ms  n=5000
  queue wrapper                   mean=0.575 ms  median=0.550 ms  p99=0.999 ms  total=2873.8 ms  wall=1471.6 ms  n=5000

=== 8 thread(s) ===
  raw (no wrapper)                failures=2495
  lock wrapper                    mean=2.364 ms  median=0.493 ms  p99=0.763 ms  total=11819.0 ms  wall=2549.5 ms  n=5000
  copy wrapper (threading.local)  mean=0.633 ms  median=0.543 ms  p99=0.692 ms  total=3166.1 ms  wall=470.1 ms  n=5000
  copy wrapper (dict)             mean=0.618 ms  median=0.552 ms  p99=1.125 ms  total=3088.6 ms  wall=444.9 ms  n=5000
  queue wrapper                   mean=0.741 ms  median=0.570 ms  p99=3.086 ms  total=3702.6 ms  wall=471.7 ms  n=5000
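
One way to read the numbers above is to convert wall time into throughput and compare each wrapper against its own single-thread run. The short calculation below only reuses the wall-clock values printed in the benchmark output; the variable names are illustrative.

```python
N_CALLS = 5000  # n=5000 in every row of the output above

# Wall-clock milliseconds copied from the 1-thread and 8-thread sections.
wall_ms = {
    "lock wrapper": (2511.4, 2549.5),
    "copy wrapper (threading.local)": (2746.4, 470.1),
    "copy wrapper (dict)": (2635.5, 444.9),
    "queue wrapper": (2688.0, 471.7),
}

throughput_8 = {}  # calls per second with 8 threads
speedup_8 = {}     # 1-thread wall time divided by 8-thread wall time

for name, (wall_1, wall_8) in wall_ms.items():
    throughput_8[name] = N_CALLS / (wall_8 / 1000.0)
    speedup_8[name] = wall_1 / wall_8
    print(f"{name:32s} {throughput_8[name]:8.0f} calls/s  speedup x{speedup_8[name]:.2f}")
```

On these figures the lock wrapper finishes the 8-thread run in about the same wall time as its single-thread run, while the copy and queue wrappers finish more than five times faster.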
<details> <summary>Qwen3-4B-Instruct-2507-FP8</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1916.1 req/s)
Latency  : p50=0.296s  p99=0.475

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1918.9 req/s)
Latency  : p50=0.307s  p99=0.549s
</details> <details> <summary>DeepSeek-V4-Flash</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 13.53s  (369.5 req/s)
Latency  : p50=1.335s  p99=2.478s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 12.95s  (386.1 req/s)
Latency  : p50=1.340s  p99=2.374s
</details> <details> <summary>Qwen-VL-Chat</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.30s  (485.6 req/s)
Latency  : p50=1.088s  p99=1.420s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.32s  (484.6 req/s)
Latency  : p50=1.125s  p99=1.492s
</details> <details> <summary>DeepSeek-OCR</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 11.14s  (448.9 req/s)
Latency  : p50=1.063s  p99=1.565s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.54s  (474.5 req/s)
Latency  : p50=1.040s  p99=1.805s
</details>

AI Assistance

Made with Cursor

cc @sfeng33 @bbrowning



Changed files

  • vllm/multimodal/processing/context.py (modified, +0/-22)
  • vllm/renderers/base.py (modified, +2/-11)
  • vllm/renderers/hf.py (modified, +15/-1)
  • vllm/tokenizers/__init__.py (modified, +2/-0)
  • vllm/tokenizers/hf.py (modified, +88/-1)

Raw issue body

Environment

  • torch: 2.12.0+cu130 (test channel)
  • triton: 3.7.0
  • CUDA: 13.0
  • Python: 3.12
  • Hardware: 1× H100
  • vLLM: release_212_tests branch, commit 62574c091f
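
Before attempting a reproduction, it may help to confirm that the local environment matches the versions listed above. The following is a small sketch using only the standard library; the helper names are hypothetical, and it checks only the two package versions from the list, not CUDA, hardware, or the vLLM commit.

```python
from importlib import metadata

# Versions reported in the Environment list above.
REPORTED = {"torch": "2.12.0+cu130", "triton": "3.7.0"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_environment(reported=REPORTED):
    """Map each package to (reported, installed, matches)."""
    rows = {}
    for package, wanted in reported.items():
        found = installed_version(package)
        rows[package] = (wanted, found, found == wanted)
    return rows


for package, (wanted, found, matches) in compare_environment().items():
    status = "ok" if matches else "MISMATCH"
    print(f"{package}: reported {wanted}, installed {found} [{status}]")
```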

Reproducibility

| Build | Commit | Date | e2e Core (1 GPU) |
|---|---|---|---|
| 64468 | 54590d5162 | 2026-05-05 | passed |
| 64577 (run 1) | 62574c091f | 2026-05-06 | failed (https://buildkite.com/vllm/ci/builds/64577#019dfadf-f056-4bdd-b214-d07868ed74f3) |
| 64577 (retry) | 62574c091f | 2026-05-06 | failed, same signature (https://buildkite.com/vllm/ci/builds/64577#019dfeb6-57c3-46dd-b318-6d623869bfc7) |

Passes on every recent main build (torch 2.11):

  • 64551 (2026-05-05 daily): passed
  • 64616 (2026-05-05 nightly): passed

The regression appeared between 64468 (passing) and 64577 (failing), in a 17-commit rebase from main. The strongest suspect is:

  • 2228fe686 [Attention] Move FA3→FA4 upgrade into get_flash_attn_version() (#40815)

This commit changed how the FA version is selected. The failing test parametrization is specifically [FLASH_ATTN], which points to the FA path as the likely trigger.

Other commits in the delta worth a look if FA3→FA4 isn't it:

  • 01b9b5af6 [Attention] Minor refactor: layer takes ownership of the MLA prefill backend (#41744)
  • 48954de23 Fix DeepGEMM ep_scatter output address overflow (#39213)
  • 84bd8a3c1 Remove unnecessary runtime asserts from linear layers (#41729)
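
If none of the suspects stands out, the 17-commit window is small enough to bisect. The sketch below shows the search itself on placeholder labels; `c01` to `c17` and the choice of `c09` as the culprit are purely hypothetical, and in practice `is_bad` would check out the commit and run the failing test (or one would use `git bisect` directly).

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary-search the first bad commit.

    Assumes commits are in chronological order, every commit before the
    culprit is good, and the culprit and everything after it are bad.
    """
    low, high = 0, len(commits) - 1
    while low < high:
        middle = (low + high) // 2
        if is_bad(commits[middle]):
            high = middle       # culprit is at middle or earlier
        else:
            low = middle + 1    # culprit is after middle
    return commits[low]


# Placeholder labels for the 17 commits in the rebase window.
commits = [f"c{i:02d}" for i in range(1, 18)]
HYPOTHETICAL_CULPRIT = "c09"
test_runs = []


def is_bad(commit: str) -> bool:
    # Stand-in for "check out this commit and run test_cascade_attention".
    test_runs.append(commit)
    return commit >= HYPOTHETICAL_CULPRIT


culprit = first_bad(commits, is_bad)
print(f"first bad commit: {culprit} after {len(test_runs)} test runs")
```

For 17 commits this needs at most five test runs, which is cheaper than inspecting each suspect by hand.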

Reproduction

.venv/bin/python -m pytest tests/v1/e2e/general/test_cascade_attention.py::test_cascade_attention -k FLASH_ATTN -v

Diagnosis request

Could pytorch / vllm confirm whether the FA3→FA4 path on torch 2.12 produces a different KV / attention numerical result for the test_cascade_attention workload than 2.11? A single-token change at this position (after help) suggests a subtle softmax / scoring drift rather than a structural bug.
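
To illustrate why a small numerical drift is a plausible explanation, here is a toy sketch of greedy decoding over a near-tied pair of logits. The vocabulary and logit values are invented for illustration and are not measurements from the failing test.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def greedy_pick(logits, vocab):
    """Return the token with the highest logit, as greedy decoding would."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return vocab[best]


vocab = ["with", "you", "me"]

# Hypothetical logits at the position after "help": "with" and "you" nearly tie.
reference_logits = [4.2001, 4.2000, 1.0]
# The same position with the "with" logit shifted down by 2e-4.
drifted_logits = [4.1999, 4.2000, 1.0]

reference_token = greedy_pick(reference_logits, vocab)
drifted_token = greedy_pick(drifted_logits, vocab)
probability_shift = abs(softmax(reference_logits)[0] - softmax(drifted_logits)[0])

print("reference:", reference_token)
print("drifted:  ", drifted_token)
print(f"change in P('with'): {probability_shift:.2e}")
```

In this toy case the probability of "with" barely moves, yet the argmax flips, which is the kind of behavior an exact-match test on greedy output would surface.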

cc @drisspg @liangel-02 @howardzhang-cv
