vllm - 💡(How to fix) Fix [Feature][Performance]: Kimi K2.5: 2.4x tokenization overhead from slow HF pipeline [1 participants]

vllm, 2026-04-11 22:25:57


GitHub stats

vllm-project/vllm#39590 • Fetched 2026-04-12 13:24:31


Author

liuzijing2014



🚀 The feature, motivation and pitch

[Performance] Kimi K2.5: 2.4x tokenization overhead from the slow HF pipeline (fixed by bypassing it)

TLDR

Inferact/Kimi-K2.5-NVFP4 now ships an updated tokenization_kimi.py that fixes a 2.4x tokenization overhead in vLLM. No model weight changes. No vLLM code changes needed. Tokenization output is identical (verified on 69k+ comparisons across 6 code paths, zero mismatches).

The fix overrides _encode_plus to call tiktoken (Rust) directly, bypassing an unnecessary Python pipeline (tokenize() → _tokenize() → convert_tokens_to_ids()) that was converting int→str→int on every token.

Problem

Kimi K2.5's TikTokenTokenizer has a fast encode() that calls tiktoken (Rust) directly, but vLLM never uses it. The AsyncMicrobatchTokenizer.encode() routes through HF's __call__() → _encode_plus(), which forces a slow Python pipeline:

_encode_plus → get_input_ids:
    → tokenize(text):
        → tokens_trie.split(text)                 ← 76% of overhead: O(n*k) trie scan
        → _tokenize(segment):
            → self.encode(segment) [tiktoken/Rust → list[int]]
            → [self.decoder[t] for t ...]          ← 3%: int→str
    → convert_tokens_to_ids(strings)               ← 21%: str→int
→ prepare_for_model(ids, truncation, special tokens)

Kimi's encode(text) already produces the final list[int] with proper trie split included — the tokenize() → _tokenize() → convert_tokens_to_ids() chain is pure overhead.
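To make the redundancy concrete, here is a minimal sketch of the int→str→int round trip. It uses a hypothetical four-entry vocabulary and a toy greedy matcher in place of tiktoken. None of the names below come from tokenization_kimi.py; they only illustrate why the extra conversions cannot change the result.

```python
# Toy illustration only: a hypothetical 4-entry vocabulary, not Kimi's real one.
VOCAB = {"The": 0, " quick": 1, " brown": 2, " fox": 3}   # str -> int
DECODER = {i: s for s, i in VOCAB.items()}                 # int -> str


def fast_encode(text: str) -> list[int]:
    """Stand-in for the Rust-backed encode(): returns ids directly."""
    ids, pos = [], 0
    while pos < len(text):
        # Greedy longest match against the toy vocabulary.
        match = max((s for s in VOCAB if text.startswith(s, pos)), key=len)
        ids.append(VOCAB[match])
        pos += len(match)
    return ids


def slow_pipeline(text: str) -> list[int]:
    """Mimics tokenize() followed by convert_tokens_to_ids(): ids -> strings -> ids."""
    tokens = [DECODER[i] for i in fast_encode(text)]   # int -> str
    return [VOCAB[t] for t in tokens]                  # str -> int


text = "The quick brown fox"
print(fast_encode(text) == slow_pipeline(text))
```

Both functions return the same ids. The second one simply does two extra dictionary lookups per token, which is the per-token cost the issue attributes to the HF slow path.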

Benchmark

Tokens	Before (original `__call__`)	After (patched `__call__`)	Speedup
1k	1.64 ms	0.71 ms	2.32x
16k	24.97 ms	10.46 ms	2.39x
64k	99.41 ms	41.55 ms	2.39x
120k	186.79 ms	77.71 ms	2.40x

Consistent ~2.4x, scales linearly with input length. At 120k tokens, ~109ms wasted per request.
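The speedup and per-request overhead can be recomputed from the table with a few lines of Python. This sketch uses only the rounded millisecond figures printed above, so a recomputed ratio may differ from the reported one in the last digit.

```python
# Figures copied from the benchmark table above: (before_ms, after_ms).
rows = {
    "1k": (1.64, 0.71),
    "16k": (24.97, 10.46),
    "64k": (99.41, 41.55),
    "120k": (186.79, 77.71),
}


def overhead_ms(before: float, after: float) -> float:
    """Time spent in the slow pipeline that the patched path avoids."""
    return before - after


for size, (before, after) in rows.items():
    ratio = before / after
    print(f"{size:>4}: {ratio:.2f}x, {overhead_ms(before, after):.2f} ms overhead per request")
```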

Environment: Python 3.12, transformers 4.57.6, tiktoken 0.12.0, aarch64 Linux.

Root Cause (instrumented call stack)

vllm/renderers/base.py:399         await tokenizer.encode(prompt, **kwargs)
vllm/utils/async_utils.py:63       return (await self(prompt, **kwargs)).input_ids
  → async_utils.py:118               self.tokenizer(text, **kwargs)       # __call__
  → tokenization_utils_base.py:2996  __call__ → encode_plus
  → tokenization_utils.py:743        _encode_plus():
      → line 767                        tokens = self.tokenize(text)       # slow chain starts
      → tokenization_utils.py:661         tokens_trie.split(text)          # 76% overhead
      → tokenization_utils.py:697         self._tokenize(segment)
      → tokenization_kimi.py:283            [self.decoder[t] for t in self.encode(text)]
                                            # encode(text) → fast tiktoken → list[int]
                                            # then int→str for each token (3%)
      → line 768                        self.convert_tokens_to_ids(tokens)  # str→int (21%)
      → line 803                        self.prepare_for_model(ids, ...)    # truncation etc.

Note: the kwargs (add_special_tokens, truncation, max_length) are consumed by _encode_plus as named parameters. They never reach Kimi's encode(). The fast tiktoken path runs inside _tokenize() — but its list[int] output is converted to strings and back, and the entire input is scanned by the trie first.

Proposed Fix

The fix lives entirely in the model repository's files; no vLLM code change is required. Override _encode_plus in tokenization_kimi.py to call self.encode(text) directly, skipping the tokenize() → _tokenize() → convert_tokens_to_ids() chain. prepare_for_model() still handles truncation and special tokens correctly.

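# Module-level imports used by the override (skip if tokenization_kimi.py already has them).
from transformers.tokenization_utils_base import PaddingStrategy, TruncationStrategy

def _encode_plus(self, text, text_pair=None, add_special_tokens=True,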
                 padding_strategy=PaddingStrategy.DO_NOT_PAD,
                 truncation_strategy=TruncationStrategy.DO_NOT_TRUNCATE,
                 max_length=None, stride=0, is_split_into_words=False,
                 pad_to_multiple_of=None, padding_side=None,
                 return_tensors=None, return_token_type_ids=None,
                 return_attention_mask=None, return_overflowing_tokens=False,
                 return_special_tokens_mask=False, return_offsets_mapping=False,
                 return_length=False, verbose=True, **kwargs):
    # Fast path: call self.encode() directly → tiktoken (Rust).
    # Bypasses tokenize() → _tokenize() → convert_tokens_to_ids() overhead.
    if isinstance(text, str) and text_pair is None and not is_split_into_words:
        first_ids = self.encode(text)
    else:
        return super()._encode_plus(
            text, text_pair=text_pair, add_special_tokens=add_special_tokens,
            padding_strategy=padding_strategy, truncation_strategy=truncation_strategy,
            max_length=max_length, stride=stride, is_split_into_words=is_split_into_words,
            pad_to_multiple_of=pad_to_multiple_of, padding_side=padding_side,
            return_tensors=return_tensors, return_token_type_ids=return_token_type_ids,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_length=return_length, verbose=verbose, **kwargs)

    return self.prepare_for_model(
        first_ids, add_special_tokens=add_special_tokens,
        padding=padding_strategy.value, truncation=truncation_strategy.value,
        max_length=max_length, stride=stride,
        pad_to_multiple_of=pad_to_multiple_of, padding_side=padding_side,
        return_tensors=return_tensors, prepend_batch_axis=True,
        return_attention_mask=return_attention_mask,
        return_token_type_ids=return_token_type_ids,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_length=return_length, verbose=verbose)

Why this is safe:

Truncation: Handled by prepare_for_model(), not post-hoc.
Special tokens: build_inputs_with_special_tokens() called inside prepare_for_model().
Trie split skipped: Kimi's encode() calls tiktoken with allowed_special="all", so special tokens in the input are handled natively by Rust — the Python trie scan is redundant.
Complex inputs: Text pairs, pre-split words fall back to super()._encode_plus().
Both paths fixed: __call__ and encode(**kwargs) both funnel through _encode_plus.
No vLLM changes needed.
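The override follows a "fast path with fallback" pattern: take the shortcut only for the simple case, and hand everything else to the parent class unchanged. The sketch below shows that pattern on its own. BaseTokenizer, FastPathTokenizer, and encode_fast are hypothetical stand-ins, not the Hugging Face or Kimi classes.

```python
class BaseTokenizer:
    """Hypothetical stand-in for the generic (slow) tokenizer base class."""

    def _encode_plus(self, text, text_pair=None, is_split_into_words=False):
        words = list(text) if is_split_into_words else text.split()
        if text_pair is not None:
            words = words + text_pair.split()
        # Toy "token ids": the length of each word.
        return {"path": "slow", "input_ids": [len(w) for w in words]}


class FastPathTokenizer(BaseTokenizer):
    def encode_fast(self, text: str) -> list[int]:
        # Toy replacement for a Rust-backed encode().
        return [len(w) for w in text.split()]

    def _encode_plus(self, text, text_pair=None, is_split_into_words=False):
        # Fast path: a single plain string with no pair and no pre-splitting.
        if isinstance(text, str) and text_pair is None and not is_split_into_words:
            return {"path": "fast", "input_ids": self.encode_fast(text)}
        # Everything else keeps the original behaviour.
        return super()._encode_plus(
            text, text_pair=text_pair, is_split_into_words=is_split_into_words
        )


tok = FastPathTokenizer()
print(tok._encode_plus("a bb ccc"))
print(tok._encode_plus("a bb", text_pair="ccc"))
```

Because the guard is conservative, any input shape the shortcut does not recognise behaves exactly as it did before the patch.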

Measured with patch applied:

Prompt	Original	Patched	Speedup
1k	1.34 ms	0.58 ms	2.30x
16k	19.00 ms	8.05 ms	2.36x
64k	75.66 ms	31.87 ms	2.37x

Correctness Verification

Tested on a 100-conversation ShareGPT dataset (11,122 text segments + 400 synthetic special-token edge cases), 6 code paths each:

__call__(text, add_special_tokens=True, truncation=True, max_length=131073)
encode(text, add_special_tokens=True, truncation=True, max_length=131073)
encode(text) (no kwargs)
__call__ with active truncation to 50 tokens
encode with active truncation to 50 tokens
add_special_tokens=False

69,132 total comparisons. Zero mismatches. Before-fix and after-fix produce identical token IDs in every case.
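The total follows from the setup: (11,122 + 400) segments × 6 code paths = 69,132 comparisons. A harness of this kind could look roughly like the sketch below. The two encoder functions and the sample texts are hypothetical placeholders; in a real check they would be the original and patched tokenizers and the ShareGPT segments.

```python
from itertools import product


def count_mismatches(reference, candidate, texts, kwarg_sets):
    """Run both callables over every (text, kwargs) pair and count differences."""
    total = mismatches = 0
    for text, kwargs in product(texts, kwarg_sets):
        total += 1
        if reference(text, **kwargs) != candidate(text, **kwargs):
            mismatches += 1
    return total, mismatches


# Toy encoders standing in for the original and patched tokenizers.
def reference(text, max_length=None):
    return [ord(c) for c in text][:max_length]


def candidate(text, max_length=None):
    return list(map(ord, text))[:max_length]


texts = ["hello", "<|hypothetical_special|> edge case", ""]
kwarg_sets = [{}, {"max_length": 50}, {"max_length": 3}]
total, mismatches = count_mismatches(reference, candidate, texts, kwarg_sets)
print(f"{total} comparisons, {mismatches} mismatches")
```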

Reproduction

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Kimi-K2.5-NVFP4", trust_remote_code=True)
prompt = "The quick brown fox jumps over the lazy dog. " * 1300

for _ in range(3):  # warmup
    tokenizer.encode(prompt)
    tokenizer(prompt, add_special_tokens=True, truncation=True, max_length=131073)

t0 = time.perf_counter()
fast = tokenizer.encode(prompt)
t_fast = time.perf_counter() - t0

t0 = time.perf_counter()
slow = tokenizer(prompt, add_special_tokens=True, truncation=True, max_length=131073).input_ids
t_slow = time.perf_counter() - t0

print(f"encode():  {t_fast*1000:.1f} ms | __call__(): {t_slow*1000:.1f} ms | {t_slow/t_fast:.1f}x | match={fast==slow}")
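A single timed call can be noisy. One possible refinement is to take the median over several runs. The helper below is a sketch, not part of the original issue; median_ms is a hypothetical name, and str.split stands in for the tokenizer so the snippet runs without downloading a model.

```python
import statistics
import time


def median_ms(fn, *args, repeats=20, **kwargs):
    """Call fn repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - t0) * 1000)
    return statistics.median(samples)


# Stand-in workload; with the real tokenizer you would pass tokenizer.encode
# and tokenizer (that is, __call__) together with the prompt instead.
prompt = "The quick brown fox jumps over the lazy dog. " * 1300
print(f"str.split: {median_ms(prompt.split):.3f} ms")
```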

cc @zixi-qi @zhewenl @ywang96


extent analysis

TL;DR

Override the _encode_plus method in tokenization_kimi.py to directly call self.encode(text) and bypass the slow Python pipeline.

Guidance

Identify the performance bottleneck in the tokenization process, which is the unnecessary Python pipeline (tokenize() → _tokenize() → convert_tokens_to_ids()) that converts int→str→int on every token.
Verify that the proposed fix correctly handles truncation, special tokens, and complex inputs by checking the prepare_for_model() function and the build_inputs_with_special_tokens() function.
Test the fix using the provided reproduction code and verify that the speedup is consistent with the expected 2.4x improvement.
Ensure that the fix does not introduce any regressions by running the correctness verification tests on a large dataset.

Example

The script in the Reproduction section above can be used to test the fix.

Notes

The fix assumes that tokenization_kimi.py can be modified and that self.encode(text) already performs the full tokenization, including special-token handling. The fast path applies only to a single plain-string input; text pairs and pre-split words fall back to super()._encode_plus(), so those cases see no speedup. Other input types may need separate testing.

Recommendation

Apply the workaround by overriding the _encode_plus method in tokenization_kimi.py to call self.encode(text) directly. The issue reports a consistent ~2.4x speedup and zero token-ID mismatches across 69,132 comparisons.
