vllm - ✅(Solved) Fix [Performance]: Regression in Gemma3 MM throughput of ~5% [1 pull requests, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#43078 (fetched 2026-05-20 03:40:01)
Comments: 4
Participants: 4
Timeline: 9
Reactions: 1
Timeline (top): commented ×4, subscribed ×2, cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #41181: [Bugfix] Fix RuntimeError: Already borrowed by adding thread-safe Hugging Face fast-tokenizer wrappers

Description (problem / solution / changelog)

Purpose

Adds a thread-safe wrapper for Hugging Face fast tokenizers to fix the RuntimeError: Already borrowed concurrency issue reported in #40949.

  • Uses a tokenizer pool that dispatches each call to a borrowing method to a free deep-copied tokenizer instance.

Also fixes concurrency issues with the bad_words sampling parameter and removes the need to deepcopy the multimodal processor.

Limitations:

  • Mutations are not propagated to the tokenizers in the pool.
  • Adjacent method calls may run on different deep copies.
  • Direct access to _tokenizer is not supported by the pool (FastIncrementalDetokenizer accesses it directly).
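
The pool idea described above can be illustrated with a minimal sketch. This is not the PR's code: `FakeTokenizer`, `TokenizerPool`, and the `call` method are hypothetical names invented for this example, and a trivial stand-in replaces the real Hugging Face tokenizer so the snippet runs without extra dependencies. It only shows the general technique of handing each call to a free deep copy and returning the copy afterwards.

```python
import copy
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeTokenizer:
    """Stand-in for a fast tokenizer (hypothetical, not a Hugging Face class)."""

    def encode(self, text):
        # Toy "tokenization": one integer per whitespace-separated word.
        return [len(word) for word in text.split()]


class TokenizerPool:
    """Hypothetical pool: every call borrows one deep copy, then returns it."""

    def __init__(self, tokenizer, size=4):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(copy.deepcopy(tokenizer))

    def call(self, method, *args, **kwargs):
        tok = self._free.get()  # blocks until some copy is free
        try:
            return getattr(tok, method)(*args, **kwargs)
        finally:
            self._free.put(tok)  # always hand the copy back


pool = TokenizerPool(FakeTokenizer(), size=4)
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(
        executor.map(lambda text: pool.call("encode", text), ["a bb ccc"] * 100)
    )
```

Because each call may land on a different copy, the sketch shares the first two limitations listed above: state changed on one copy is invisible to the others, and two consecutive calls are not guaranteed to use the same instance.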

Test Plan

pytest tests/models/multimodal/processing/test_common.py

benchmark.py

python benchmark.py --prompt-length 512 --iterations 5000 --warmup 5000 --mixed
vllm serve Qwen/Qwen3-4B-Instruct-2507-FP8 --renderer_num_workers 4 --api-server-count=4

vllm serve deepseek-ai/DeepSeek-V4-Flash --renderer_num_workers 4   --api-server-count=4 --trust-remote-code --tensor-parallel-size=2 --max-model-len 4096 --kv-cache-dtype fp8

vllm serve deepseek-ai/DeepSeek-OCR --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4

vllm serve Qwen/Qwen-VL-Chat --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4 --trust-remote-code --hf-overrides '{"architectures": ["QwenVLForConditionalGeneration"]}'

stress_send.py

python stress_send.py -n 5000 -c 500 [--mm] [--bad-word]

Test Result

Unit tests pass

Benchmark:

Loading tokenizer from meta-llama/Llama-3.1-8B-Instruct …
Prompt: 522 tokens, 2236 chars
Config: iterations=5000  warmup=5000  threads=[1, 2, 8]  mixed=True  truncation_max_length=1044


=== 1 thread(s) ===
  raw (no wrapper)                mean=0.526 ms  median=0.490 ms  p99=0.855 ms  total=2627.7 ms  wall=2629.1 ms  n=5000
  lock wrapper                    mean=0.502 ms  median=0.485 ms  p99=0.682 ms  total=2510.0 ms  wall=2511.4 ms  n=5000
  copy wrapper (threading.local)  mean=0.546 ms  median=0.524 ms  p99=0.761 ms  total=2730.2 ms  wall=2746.4 ms  n=5000
  copy wrapper (dict)             mean=0.527 ms  median=0.520 ms  p99=0.699 ms  total=2634.0 ms  wall=2635.5 ms  n=5000
  queue wrapper                   mean=0.537 ms  median=0.534 ms  p99=0.603 ms  total=2686.5 ms  wall=2688.0 ms  n=5000

=== 2 thread(s) ===
  raw (no wrapper)                failures=2418
  lock wrapper                    mean=0.983 ms  median=0.527 ms  p99=1.106 ms  total=4913.6 ms  wall=2702.8 ms  n=5000
  copy wrapper (threading.local)  mean=0.564 ms  median=0.545 ms  p99=0.925 ms  total=2818.5 ms  wall=1448.9 ms  n=5000
  copy wrapper (dict)             mean=0.596 ms  median=0.554 ms  p99=1.095 ms  total=2978.9 ms  wall=1497.5 ms  n=5000
  queue wrapper                   mean=0.575 ms  median=0.550 ms  p99=0.999 ms  total=2873.8 ms  wall=1471.6 ms  n=5000

=== 8 thread(s) ===
  raw (no wrapper)                failures=2495
  lock wrapper                    mean=2.364 ms  median=0.493 ms  p99=0.763 ms  total=11819.0 ms  wall=2549.5 ms  n=5000
  copy wrapper (threading.local)  mean=0.633 ms  median=0.543 ms  p99=0.692 ms  total=3166.1 ms  wall=470.1 ms  n=5000
  copy wrapper (dict)             mean=0.618 ms  median=0.552 ms  p99=1.125 ms  total=3088.6 ms  wall=444.9 ms  n=5000
  queue wrapper                   mean=0.741 ms  median=0.570 ms  p99=3.086 ms  total=3702.6 ms  wall=471.7 ms  n=5000
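
One way to read the table above is to compare wall time at 1 thread against wall time at 8 threads for each wrapper. The short sketch below does that arithmetic with the wall values copied from the benchmark output; the variable names are chosen for this example only.

```python
# Wall times in ms, copied from the benchmark output: (1 thread, 8 threads).
wall_ms = {
    "lock wrapper": (2511.4, 2549.5),
    "copy wrapper (threading.local)": (2746.4, 470.1),
    "copy wrapper (dict)": (2635.5, 444.9),
    "queue wrapper": (2688.0, 471.7),
}

# Speedup = wall time with 1 thread / wall time with 8 threads.
speedup = {name: one / eight for name, (one, eight) in wall_ms.items()}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

The lock wrapper stays just under 1x because calls are serialized, while the copy-based and queue-based wrappers land between roughly 5.5x and 6x on 8 threads.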
<details> <summary>Qwen3-4B-Instruct-2507-FP8</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1916.1 req/s)
Latency  : p50=0.296s  p99=0.475

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1918.9 req/s)
Latency  : p50=0.307s  p99=0.549s
</details> <details> <summary>DeepSeek-V4-Flash</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 13.53s  (369.5 req/s)
Latency  : p50=1.335s  p99=2.478s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 12.95s  (386.1 req/s)
Latency  : p50=1.340s  p99=2.374s
</details> <details> <summary>Qwen-VL-Chat</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.30s  (485.6 req/s)
Latency  : p50=1.088s  p99=1.420s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.32s  (484.6 req/s)
Latency  : p50=1.125s  p99=1.492s
</details> <details> <summary>DeepSeek-OCR</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 11.14s  (448.9 req/s)
Latency  : p50=1.063s  p99=1.565s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.54s  (474.5 req/s)
Latency  : p50=1.040s  p99=1.805s
</details>

AI Assistance

Made with Cursor

cc @sfeng33 @bbrowning


<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
</details>

Changed files

  • vllm/multimodal/processing/context.py (modified, +0/-22)
  • vllm/renderers/base.py (modified, +2/-11)
  • vllm/renderers/hf.py (modified, +15/-1)
  • vllm/tokenizers/__init__.py (modified, +2/-0)
  • vllm/tokenizers/hf.py (modified, +88/-1)


Report of performance regression

PR #41181 appears to have introduced a regression in Gemma3 multimodal throughput. It is small, but especially noticeable with quantised models:

vllm bench throughput --model RedHatAI/gemma-3-12B-it-quantized.w8a8 --backend vllm-chat --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --num-prompts 128 --override-generation-config '{"temperature": "0.0", "top_p": "1.0"}' --max-num-batched-tokens 12288

Prior to this PR, e.g. at commit 6fca51815706fbb0735849b95c0dafa05da20f9f, I get: Throughput: 0.73 requests/s, 347.12 total tokens/s, 93.45 output tokens/s

With this PR, at commit 20dcd984f9a49b8dc69c400486eed50953cb16cf, I get: Throughput: 0.68 requests/s, 324.38 total tokens/s, 87.33 output tokens/s

These results were generated on 96 Neoverse V2 cores. I have observed the regression on x86 as well.
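
For reference, the relative drop implied by the two runs above can be worked out directly. The sketch below uses only the reported figures; the helper name `relative_drop` is chosen for this example.

```python
# Throughput figures as reported for the two commits.
before = {"requests/s": 0.73, "total tokens/s": 347.12, "output tokens/s": 93.45}
after = {"requests/s": 0.68, "total tokens/s": 324.38, "output tokens/s": 87.33}


def relative_drop(old, new):
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100.0


drops = {name: relative_drop(before[name], after[name]) for name in before}

for name, pct in drops.items():
    print(f"{name}: {pct:.1f}% lower")
```

On these numbers all three metrics come out between about 6.5% and 6.9% lower. The requests/s figure is reported to only two decimals, so the token-based metrics give the more precise estimate.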

I haven't raised this as a bug as it may be that this is considered an acceptable cost of thread safety. Is this considered acceptable or something that needs a fix?
