Fix Action

Fixed

Fixed by PR: [Bugfix] Fix RuntimeError: Already borrowed by adding thread-safe Hugging Face fast-tokenizer wrappers (https://github.com/vllm-project/vllm/pull/41181)

PR fix notes

PR #41181: [Bugfix] Fix `RuntimeError: Already borrowed` by adding thread-safe Hugging Face fast-tokenizer wrappers

Repository: vllm-project/vllm
Author: yzong-rh
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/41181

Description (problem / solution / changelog)

Purpose

Adds a thread-safe Hugging Face fast-tokenizer wrapper to fix the `RuntimeError: Already borrowed` concurrency issue reported in #40949.

The wrapper uses a tokenizer pool: each call to a borrowing method is dispatched to a free deep-copied tokenizer instance.
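The pool idea can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `TokenizerPool`, `borrow`, and `ToyTokenizer` are hypothetical names, and the toy class merely stands in for a Hugging Face fast tokenizer so the example runs without extra dependencies.

```python
import copy
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class ToyTokenizer:
    """Stand-in for a Hugging Face fast tokenizer (illustration only)."""

    def encode(self, text):
        return [ord(ch) for ch in text]


class TokenizerPool:
    """Hypothetical pool that gives each caller exclusive use of one deep copy."""

    def __init__(self, tokenizer, size=4):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(copy.deepcopy(tokenizer))

    @contextmanager
    def borrow(self):
        tok = self._free.get()  # blocks until some copy is free
        try:
            yield tok
        finally:
            self._free.put(tok)  # always return the copy, even on error

    def encode(self, text):
        # Each call runs on whichever copy happens to be free.
        with self.borrow() as tok:
            return tok.encode(text)


# Eight threads share a pool of two copies; no copy is used by two threads at once.
pool = TokenizerPool(ToyTokenizer(), size=2)
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(pool.encode, ["ab"] * 100))
```

Because a copy is taken from the queue for the duration of a call and returned afterwards, two threads never touch the same instance concurrently, which is the condition that triggers `Already borrowed` on a shared fast tokenizer.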

This also fixes concurrency issues with the `bad_words` sampling parameter and removes the need to use `deepcopy` for the multimodal processor.

Limitations:

Mutation is not propagated to tokenizers in the pool.
Adjacent method calls could happen on different deep copies.
Direct access to `_tokenizer` is not supported by the pool (FastIncrementalDetokenizer accesses it directly).
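The first limitation follows directly from how deep copies behave. The sketch below assumes a toy tokenizer with a mutable vocabulary; `ToyTokenizer` and its `add_tokens` method are illustrative stand-ins, not the Hugging Face API or the PR's code.

```python
import copy


class ToyTokenizer:
    """Stand-in tokenizer with a mutable vocabulary (illustration only)."""

    def __init__(self):
        self.vocab = {"hello": 0, "world": 1}

    def add_tokens(self, tokens):
        for token in tokens:
            self.vocab.setdefault(token, len(self.vocab))

    def encode(self, text):
        # Unknown words map to -1 in this toy example.
        return [self.vocab.get(word, -1) for word in text.split()]


original = ToyTokenizer()
pooled = [copy.deepcopy(original) for _ in range(2)]  # copies made up front

# Mutating the original after the copies exist does not reach the copies.
original.add_tokens(["<extra>"])

original_ids = original.encode("hello <extra>")
pooled_ids = pooled[0].encode("hello <extra>")
```

Here `original_ids` contains the new token's id, while `pooled_ids` still treats `<extra>` as unknown. Any mutation made after the pool is built would need to be applied to every copy, or the pool rebuilt.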

Test Plan

pytest tests/models/multimodal/processing/test_common.py

benchmark.py

python benchmark.py --prompt-length 512 --iterations 5000 --warmup 5000 --mixed

vllm serve Qwen/Qwen3-4B-Instruct-2507-FP8 --renderer_num_workers 4 --api-server-count=4

vllm serve deepseek-ai/DeepSeek-V4-Flash --renderer_num_workers 4 --api-server-count=4 --trust-remote-code --tensor-parallel-size=2 --max-model-len 4096 --kv-cache-dtype fp8

vllm serve deepseek-ai/DeepSeek-OCR --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4

vllm serve Qwen/Qwen-VL-Chat --renderer_num_workers 4 --mm-processor-cache-gb 0 --api-server-count=4 --trust-remote-code --hf-overrides '{"architectures": ["QwenVLForConditionalGeneration"]}'

stress_send.py

python stress_send.py -n 5000 -c 500 [--mm] [--bad-word]

Test Result

Unit tests pass.

Benchmark:

Loading tokenizer from meta-llama/Llama-3.1-8B-Instruct …
Prompt: 522 tokens, 2236 chars
Config: iterations=5000  warmup=5000  threads=[1, 2, 8]  mixed=True  truncation_max_length=1044


=== 1 thread(s) ===
  raw (no wrapper)                mean=0.526 ms  median=0.490 ms  p99=0.855 ms  total=2627.7 ms  wall=2629.1 ms  n=5000
  lock wrapper                    mean=0.502 ms  median=0.485 ms  p99=0.682 ms  total=2510.0 ms  wall=2511.4 ms  n=5000
  copy wrapper (threading.local)  mean=0.546 ms  median=0.524 ms  p99=0.761 ms  total=2730.2 ms  wall=2746.4 ms  n=5000
  copy wrapper (dict)             mean=0.527 ms  median=0.520 ms  p99=0.699 ms  total=2634.0 ms  wall=2635.5 ms  n=5000
  queue wrapper                   mean=0.537 ms  median=0.534 ms  p99=0.603 ms  total=2686.5 ms  wall=2688.0 ms  n=5000

=== 2 thread(s) ===
  raw (no wrapper)                failures=2418
  lock wrapper                    mean=0.983 ms  median=0.527 ms  p99=1.106 ms  total=4913.6 ms  wall=2702.8 ms  n=5000
  copy wrapper (threading.local)  mean=0.564 ms  median=0.545 ms  p99=0.925 ms  total=2818.5 ms  wall=1448.9 ms  n=5000
  copy wrapper (dict)             mean=0.596 ms  median=0.554 ms  p99=1.095 ms  total=2978.9 ms  wall=1497.5 ms  n=5000
  queue wrapper                   mean=0.575 ms  median=0.550 ms  p99=0.999 ms  total=2873.8 ms  wall=1471.6 ms  n=5000

=== 8 thread(s) ===
  raw (no wrapper)                failures=2495
  lock wrapper                    mean=2.364 ms  median=0.493 ms  p99=0.763 ms  total=11819.0 ms  wall=2549.5 ms  n=5000
  copy wrapper (threading.local)  mean=0.633 ms  median=0.543 ms  p99=0.692 ms  total=3166.1 ms  wall=470.1 ms  n=5000
  copy wrapper (dict)             mean=0.618 ms  median=0.552 ms  p99=1.125 ms  total=3088.6 ms  wall=444.9 ms  n=5000
  queue wrapper                   mean=0.741 ms  median=0.570 ms  p99=3.086 ms  total=3702.6 ms  wall=471.7 ms  n=5000
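The source of `benchmark.py` is not included here, so the following is only a guess at what two of the benchmarked strategies look like: a single tokenizer guarded by a lock, and one deep copy per thread stored in `threading.local`. All class names are hypothetical, and `ToyTokenizer` stands in for a real fast tokenizer.

```python
import copy
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyTokenizer:
    """Stand-in for a Hugging Face fast tokenizer (illustration only)."""

    def encode(self, text):
        return [ord(ch) for ch in text]


class LockWrapper:
    """One shared tokenizer; calls are serialized by a lock."""

    def __init__(self, tokenizer):
        self._tok = tokenizer
        self._lock = threading.Lock()

    def encode(self, text):
        with self._lock:
            return self._tok.encode(text)


class ThreadLocalCopyWrapper:
    """One lazily created deep copy per thread; calls run in parallel."""

    def __init__(self, tokenizer):
        self._proto = tokenizer
        self._local = threading.local()
        self._copy_lock = threading.Lock()

    def _get(self):
        tok = getattr(self._local, "tok", None)
        if tok is None:
            # Guard the copy so the prototype is never read by two threads at once.
            with self._copy_lock:
                tok = copy.deepcopy(self._proto)
            self._local.tok = tok
        return tok

    def encode(self, text):
        return self._get().encode(text)


lock_wrapped = LockWrapper(ToyTokenizer())
copy_wrapped = ThreadLocalCopyWrapper(ToyTokenizer())
with ThreadPoolExecutor(max_workers=8) as executor:
    lock_results = list(executor.map(lock_wrapped.encode, ["hi"] * 50))
    copy_results = list(executor.map(copy_wrapped.encode, ["hi"] * 50))
```

This matches the shape of the numbers above: with 8 threads the lock wrapper's wall time stays near the single-thread figure (2549.5 ms) because calls are serialized, while the copy and queue wrappers finish in roughly 450 to 470 ms.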

<details> <summary>Qwen3-4B-Instruct-2507-FP8</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1916.1 req/s)
Latency  : p50=0.296s  p99=0.475s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen3-4B-Instruct-2507-FP8  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 2.61s  (1918.9 req/s)
Latency  : p50=0.307s  p99=0.549s

</details> <details> <summary>DeepSeek-V4-Flash</summary>

Before:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 13.53s  (369.5 req/s)
Latency  : p50=1.335s  p99=2.478s

After:

Sending 5000 text-only requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-V4-Flash  mode=text
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 12.95s  (386.1 req/s)
Latency  : p50=1.340s  p99=2.374s

</details> <details> <summary>Qwen-VL-Chat</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.30s  (485.6 req/s)
Latency  : p50=1.088s  p99=1.420s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: Qwen/Qwen-VL-Chat  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.32s  (484.6 req/s)
Latency  : p50=1.125s  p99=1.492s

</details> <details> <summary>DeepSeek-OCR</summary>

Before:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 11.14s  (448.9 req/s)
Latency  : p50=1.063s  p99=1.565s

After:

Sending 5000 multimodal requests (concurrency=500) to http://localhost:8000 bad_words=['hello world']
Model: deepseek-ai/DeepSeek-OCR  mode=multimodal
==================================================
Requests : 5000/5000 ok, 0 errors
Wall time: 10.54s  (474.5 req/s)
Latency  : p50=1.040s  p99=1.805s

</details>

AI Assistance

Made with Cursor

cc @sfeng33 @bbrowning

Changed files

vllm/multimodal/processing/context.py (modified, +0/-22)
vllm/renderers/base.py (modified, +2/-11)
vllm/renderers/hf.py (modified, +15/-1)
vllm/tokenizers/__init__.py (modified, +2/-0)
vllm/tokenizers/hf.py (modified, +88/-1)

Code Example

vllm bench throughput --model RedHatAI/gemma-3-12B-it-quantized.w8a8 --backend vllm-chat --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --num-prompts 128 --override-generation-config '{"temperature": "0.0", "top_p": "1.0"}' --max-num-batched-tokens 12288

---

With this PR, commit 20dcd984f9a49b8dc69c400486eed50953cb16cf, I get:

vllm bench throughput --model RedHatAI/gemma-3-12B-it-quantized.w8a8 --backend vllm-chat --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --num-prompts 128 --override-generation-config '{"temperature": "0.0", "top_p": "1.0"}' --max-num-batched-tokens 12288

vllm - ✅(Solved) Fix [Performance]: Regression in Gemma3 MM throughput of ~5% [1 pull requests, 4 comments, 4 participants]
