pytorch - ✅(Solved) Fix [vllm] [2.12 regression][FLASH_ATTN] test_cascade_attention divergence: extra "you" in generated Fibonacci output [1 pull requests, 2 comments, 1 participants]

pytorch, 2026-05-06 20:10:15


pytorch/pytorch#182700•Fetched 2026-05-07 03:30:37


Author

atalman

Participants

atalman


Under torch 2.12.0+cu130 + triton 3.7.0, vLLM's tests/v1/e2e/general/test_cascade_attention.py::test_cascade_attention[FLASH_ATTN] deterministically diverges from its reference output by a single inserted word. The cascade-attention path produces:

got:      "Sure, I can help you with that. The Fibonacci sequence is a series ..."
expected: "Sure, I can help with that. The Fibonacci sequence is a series ..."

The rest of the response is byte-identical. The test is a deterministic-output check (response.outputs[0].text == ref_output), so a single-token divergence trips the assertion.

Reproducible: failed on first run AND retry of the same build.
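
A word-level diff is enough to locate the divergence. The following is a minimal sketch using Python's standard `difflib`. The two strings are the truncated excerpts quoted above, and `inserted_words` is a hypothetical helper name, not part of the vLLM test.

```python
import difflib

# Truncated excerpts copied from the report above.
got = "Sure, I can help you with that. The Fibonacci sequence is a series"
expected = "Sure, I can help with that. The Fibonacci sequence is a series"


def inserted_words(expected_text: str, got_text: str) -> list[str]:
    """Return the words that the diff marks as added or changed in got_text."""
    exp_words = expected_text.split()
    got_words = got_text.split()
    matcher = difflib.SequenceMatcher(a=exp_words, b=got_words, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" opcodes cover words found only on the got side.
        if tag in ("insert", "replace"):
            extra.extend(got_words[j1:j2])
    return extra


print(inserted_words(expected, got))
```

Splitting on whitespace compares words, not model tokens. Comparing the token id sequences directly would need the model's tokenizer.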

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix RuntimeError: Already borrowed by adding thread-safe Hugging Face fast-tokenizer wrappers (https://github.com/vllm-project/vllm/pull/41181)

PR fix notes

PR #41181: [Bugfix] Fix `RuntimeError: Already borrowed` by adding thread-safe Hugging Face fast-tokenizer wrappers

Repository: vllm-project/vllm
Author: yzong-rh
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/41181

Description (problem / solution / changelog)

Purpose

Adds a thread-safe wrapper around Hugging Face fast tokenizers to fix the `RuntimeError: Already borrowed` concurrency issue reported in #40949.

The wrapper uses a tokenizer pool that dispatches each call to a borrowing method to a free deep-copied tokenizer instance.

This also fixes concurrency issues with the `bad_words` sampling parameter and removes the need to use deepcopy for the multimodal processor.

Limitations:

Mutation is not propagated to tokenizers in the pool.
Adjacent method calls could happen on different deep copies.
Direct access to _tokenizer is not supported by the pool (this is done by FastIncrementalDetokenizer).
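
The pool idea described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the PR: `FakeTokenizer` and `TokenizerPool` are hypothetical names, and the fake tokenizer only imitates a non-reentrant `encode` call by raising the same error message.

```python
import copy
import queue
import threading
from contextlib import contextmanager


class FakeTokenizer:
    """Hypothetical stand-in for a tokenizer that must not be used concurrently."""

    def __init__(self):
        self._busy = False

    def encode(self, text: str) -> list[int]:
        if self._busy:
            # Imitates the failure mode; a real fast tokenizer raises from Rust.
            raise RuntimeError("Already borrowed")
        self._busy = True
        try:
            return [sum(map(ord, word)) % 1000 for word in text.split()]
        finally:
            self._busy = False


class TokenizerPool:
    """Hands each call a free deep copy, so no instance is shared mid-call."""

    def __init__(self, tokenizer, size: int):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(copy.deepcopy(tokenizer))

    @contextmanager
    def borrow(self):
        tok = self._free.get()  # blocks until some copy is free
        try:
            yield tok
        finally:
            self._free.put(tok)  # always return the copy to the pool

    def encode(self, text: str) -> list[int]:
        with self.borrow() as tok:
            return tok.encode(text)


def run_demo(num_threads: int = 8, calls: int = 200) -> set:
    """Encode the same text from many threads and collect the distinct results."""
    pool = TokenizerPool(FakeTokenizer(), size=4)
    results, lock = set(), threading.Lock()

    def worker():
        for _ in range(calls):
            ids = tuple(pool.encode("hello world"))
            with lock:
                results.add(ids)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_demo())
```

The sketch also shows the first limitation: the copies are made once in `__init__`, so changing the original tokenizer afterwards does not reach them.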

Test Plan

pytest tests/models/multimodal/processing/test_common.py

benchmark.py

python benchmark.py --prompt-length 512 --iterations 5000 --warmup 5000 --mixed

vllm serve Qwen/Qwen3-4B-Instruct-2507-FP8 --renderer_num_workers 4 --api-server-count=4

vllm serve deepseek-ai/DeepSeek-V4-Flash --renderer_num_workers 4   --api-server-count=4 --trust-remote-code --tensor-parallel-size=2 --max-model-len 4096 --kv-cache-dtype fp8

vllm serve deepseek-ai/DeepSeek-OCR --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4

vllm serve Qwen/Qwen-VL-Chat --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4 --trust-remote-code --hf-overrides '{"architectures": ["QwenVLForConditionalGeneration"]}'

stress_send.py

python stress_send.py -n 5000 -c 500 [--mm] [--bad-word]

Test Result

Unit tests pass

Benchmark:

Loading tokenizer from meta-llama/Llama-3.1-8B-Instruct …
Prompt: 522 tokens, 2236 chars
Config: iterations=5000  warmup=5000  threads=[1, 2, 8]  mixed=True  truncation_max_length=1044


=== 1 thread(s) ===
  raw (no wrapper)                mean=0.526 ms  median=0.490 ms  p99=0.855 ms  total=2627.7 ms  wall=2629.1 ms  n=5000
  lock wrapper                    mean=0.502 ms  median=0.485 ms  p99=0.682 ms  total=2510.0 ms  wall=2511.4 ms  n=5000
  copy wrapper (threading.local)  mean=0.546 ms  median=0.524 ms  p99=0.761 ms  total=2730.2 ms  wall=2746.4 ms  n=5000
  copy wrapper (dict)             mean=0.527 ms  median=0.520 ms  p99=0.699 ms  total=2634.0 ms  wall=2635.5 ms  n=5000
  queue wrapper                   mean=0.537 ms  median=0.534 ms  p99=0.603 ms  total=2686.5 ms  wall=2688.0 ms  n=5000

=== 2 thread(s) ===
  raw (no wrapper)                failures=2418
  lock wrapper                    mean=0.983 ms  median=0.527 ms  p99=1.106 ms  total=4913.6 ms  wall=2702.8 ms  n=5000
  copy wrapper (threading.local)  mean=0.564 ms  median=0.545 ms  p99=0.925 ms  total=2818.5 ms  wall=1448.9 ms  n=5000
  copy wrapper (dict)             mean=0.596 ms  median=0.554 ms  p99=1.095 ms  total=2978.9 ms  wall=1497.5 ms  n=5000
  queue wrapper                   mean=0.575 ms  median=0.550 ms  p99=0.999 ms  total=2873.8 ms  wall=1471.6 ms  n=5000

=== 8 thread(s) ===
  raw (no wrapper)                failures=2495
  lock wrapper                    mean=2.364 ms  median=0.493 ms  p99=0.763 ms  total=11819.0 ms  wall=2549.5 ms  n=5000
  copy wrapper (threading.local)  mean=0.633 ms  median=0.543 ms  p99=0.692 ms  total=3166.1 ms  wall=470.1 ms  n=5000
  copy wrapper (dict)             mean=0.618 ms  median=0.552 ms  p99=1.125 ms  total=3088.6 ms  wall=444.9 ms  n=5000
  queue wrapper                   mean=0.741 ms  median=0.570 ms  p99=3.086 ms  total=3702.6 ms  wall=471.7 ms  n=5000
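
One way to read the 8-thread rows: `total` appears to be the sum of per-call latencies (it matches mean × n), while `wall` is elapsed time, so calls per second follow from n / wall. The sketch below uses only the wall times reported above. The dictionary and function names are illustrative.

```python
# Wall-clock times in ms for n=5000 calls at 8 threads, copied from the table above.
WALL_MS_8_THREADS = {
    "lock wrapper": 2549.5,
    "copy wrapper (threading.local)": 470.1,
    "copy wrapper (dict)": 444.9,
    "queue wrapper": 471.7,
}
N_CALLS = 5000


def calls_per_second(wall_ms: float, n: int = N_CALLS) -> float:
    """Throughput implied by completing n calls in wall_ms milliseconds."""
    return n / (wall_ms / 1000.0)


def speedup_vs_lock(name: str) -> float:
    """Ratio of the lock wrapper's wall time to the named wrapper's wall time."""
    return WALL_MS_8_THREADS["lock wrapper"] / WALL_MS_8_THREADS[name]


for name, wall in WALL_MS_8_THREADS.items():
    print(f"{name:32s} {calls_per_second(wall):9.0f} calls/s  "
          f"{speedup_vs_lock(name):.2f}x vs lock")
```

By this measure the queue wrapper finishes the same 5000 calls about 5.4 times faster than the lock wrapper at 8 threads, even though their median per-call latencies are close.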

<details> <summary>Qwen3-4B-Instruct-2507-FP8</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1916.1 req/s)
Latency  : p50=0.296s  p99=0.475s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1918.9 req/s)
Latency  : p50=0.307s  p99=0.549s

</details> <details> <summary>DeepSeek-V4-Flash</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 13.53s  (369.5 req/s)
Latency  : p50=1.335s  p99=2.478s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 12.95s  (386.1 req/s)
Latency  : p50=1.340s  p99=2.374s

</details> <details> <summary>Qwen-VL-Chat</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.30s  (485.6 req/s)
Latency  : p50=1.088s  p99=1.420s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.32s  (484.6 req/s)
Latency  : p50=1.125s  p99=1.492s

</details> <details> <summary>DeepSeek-OCR</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 11.14s  (448.9 req/s)
Latency  : p50=1.063s  p99=1.565s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.54s  (474.5 req/s)
Latency  : p50=1.040s  p99=1.805s

</details>

AI Assistance

Made with Cursor

cc @sfeng33 @bbrowning


Changed files

vllm/multimodal/processing/context.py (modified, +0/-22)
vllm/renderers/base.py (modified, +2/-11)
vllm/renderers/hf.py (modified, +15/-1)
vllm/tokenizers/__init__.py (modified, +2/-0)
vllm/tokenizers/hf.py (modified, +88/-1)


Environment

torch: 2.12.0+cu130 (test channel)
triton: 3.7.0
CUDA: 13.0
Python: 3.12
Hardware: 1× H100
vLLM: release_212_tests branch, commit 62574c091f

Reproducibility

| Build | Commit | Date | e2e Core (1 GPU) |
| --- | --- | --- | --- |
| 64468 | 54590d5162 | 2026-05-05 | passed |
| 64577 (run 1) | 62574c091f | 2026-05-06 | failed: https://buildkite.com/vllm/ci/builds/64577#019dfadf-f056-4bdd-b214-d07868ed74f3 |
| 64577 (retry) | 62574c091f | 2026-05-06 | failed (same signature): https://buildkite.com/vllm/ci/builds/64577#019dfeb6-57c3-46dd-b318-6d623869bfc7 |

Passes on every recent main build (torch 2.11):

64551 (2026-05-05 daily): passed
64616 (2026-05-05 nightly): passed

The regression appeared between 64468 (passing) and 64577 (failing), in a 17-commit rebase from main. The strongest suspect is:

2228fe686 [Attention] Move FA3→FA4 upgrade into get_flash_attn_version() (#40815)

This commit changed how the FA version is selected. Only the [FLASH_ATTN] parametrization of the test fails, which points to the FA path as the trigger.

Other commits in the delta worth a look if FA3→FA4 isn't it:

01b9b5af6 [Attention] Minor refactor: layer takes ownership of the MLA prefill backend (#41744)
48954de23 Fix DeepGEMM ep_scatter output address overflow (#39213)
84bd8a3c1 Remove unnecessary runtime asserts from linear layers (#41729)

Reproduction

.venv/bin/python -m pytest tests/v1/e2e/general/test_cascade_attention.py::test_cascade_attention -k FLASH_ATTN -v

Diagnosis request

Could pytorch / vllm confirm whether the FA3→FA4 path on torch 2.12 produces a different KV / attention numerical result for the test_cascade_attention workload than 2.11? A single-token change at this position (after "help") suggests a subtle softmax / scoring drift rather than a structural bug.
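
To illustrate why a small numerical drift could surface as one changed token, here is a minimal sketch of greedy selection over a near-tie. The logit values and the drift are made up for illustration, and the assumption that the test decodes greedily is not confirmed by the report.

```python
import math


def softmax(logits: dict) -> dict:
    """Numerically stable softmax over a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def greedy(logits: dict) -> str:
    """Pick the highest-logit token, as greedy decoding would."""
    return max(logits, key=logits.get)


# Hypothetical logits for the position after "help": two candidates nearly tied.
reference = {" with": 12.000, " you": 11.999, " me": 7.5}

# Hypothetical drift of 0.002 on one logit, e.g. from a different kernel's rounding.
drifted = dict(reference)
drifted[" you"] += 0.002

print(greedy(reference), softmax(reference))
print(greedy(drifted), softmax(drifted))
```

The probabilities barely move between the two cases, but the argmax flips. Logging the top-2 logits at the diverging position on torch 2.11 and 2.12 would show whether the failing test sits on such a near-tie.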
