vllm: [Feature][Performance]: Kimi K2.5: 2.4x tokenization overhead from slow HF pipeline

GitHub stats: vllm-project/vllm#39590, fetched 2026-04-12 13:24:31. Comments: 0 · Participants: 1 · Timeline events: 7 (mentioned ×3, subscribed ×3, labeled ×1) · Reactions: 0


🚀 The feature, motivation and pitch

[Performance] Kimi K2.5: 2.4x tokenization overhead from the HF pipeline (fixed by bypassing it)

TLDR

Inferact/Kimi-K2.5-NVFP4 now ships an updated tokenization_kimi.py that fixes a 2.4x tokenization overhead in vLLM. No model weight changes. No vLLM code changes needed. Tokenization output is identical (verified on 69k+ comparisons across 6 code paths, zero mismatches).

The fix overrides _encode_plus to call tiktoken (Rust) directly, bypassing an unnecessary Python pipeline (tokenize() → _tokenize() → convert_tokens_to_ids()) that was converting int→str→int on every token.

Problem

Kimi K2.5's TikTokenTokenizer has a fast encode() that calls tiktoken (Rust) directly, but vLLM never uses it. The AsyncMicrobatchTokenizer.encode() routes through HF's __call__() → _encode_plus(), which forces a slow Python pipeline:

_encode_plus → get_input_ids:
    → tokenize(text):
        → tokens_trie.split(text)                 ← 76% of overhead: O(n*k) trie scan
        → _tokenize(segment):
            → self.encode(segment) [tiktoken/Rust → list[int]]
            → [self.decoder[t] for t ...]          ← 3%: int→str
    → convert_tokens_to_ids(strings)               ← 21%: str→int
→ prepare_for_model(ids, truncation, special tokens)

Kimi's encode(text) already produces the final list[int] with proper trie split included; the tokenize() → _tokenize() → convert_tokens_to_ids() chain is pure overhead.
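The round trip described above can be illustrated with a toy vocabulary. This is a minimal sketch, not Kimi's tokenizer: TOY_VOCAB, toy_encode, slow_path, and fast_path are hypothetical names, and the whitespace split only stands in for the role tiktoken plays.

```python
# Toy illustration of the int -> str -> int round trip (hypothetical names,
# not the real Kimi tokenizer). The whitespace split stands in for tiktoken.
TOY_VOCAB = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
TOY_DECODER = {idx: tok for tok, idx in TOY_VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    """Stand-in for the fast encoder: text -> token IDs in one step."""
    return [TOY_VOCAB[word] for word in text.split()]


def slow_path(text: str) -> list[int]:
    """Mimics tokenize() -> convert_tokens_to_ids(): IDs -> strings -> IDs."""
    token_strings = [TOY_DECODER[i] for i in toy_encode(text)]  # int -> str
    return [TOY_VOCAB[s] for s in token_strings]                # str -> int


def fast_path(text: str) -> list[int]:
    """Uses the IDs from the encoder directly, with no string detour."""
    return toy_encode(text)


if __name__ == "__main__":
    sample = "the quick brown fox"
    print(slow_path(sample) == fast_path(sample))
```

Both paths return the same IDs; the slow one simply does two extra per-token dictionary lookups, which is the kind of work the issue classifies as overhead.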

Benchmark

Tokens | Before (original __call__) | After (patched __call__) | Speedup
1k     | 1.64 ms                    | 0.71 ms                  | 2.32x
16k    | 24.97 ms                   | 10.46 ms                 | 2.39x
64k    | 99.41 ms                   | 41.55 ms                 | 2.39x
120k   | 186.79 ms                  | 77.71 ms                 | 2.40x

The speedup is a consistent ~2.4x, and the overhead scales linearly with input length. At 120k tokens, about 109 ms is wasted per request (186.79 ms - 77.71 ms).
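The speedup and per-request savings can be recomputed from the benchmark figures. The sketch below copies the four rows from the issue; the helper names speedup and saved_ms are hypothetical. Because it works from the rounded millisecond values, a recomputed ratio may differ from the reported Speedup column in the last digit.

```python
# Benchmark rows copied from the issue: (tokens, before_ms, after_ms).
ROWS = [
    ("1k", 1.64, 0.71),
    ("16k", 24.97, 10.46),
    ("64k", 99.41, 41.55),
    ("120k", 186.79, 77.71),
]


def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of the original latency to the patched latency."""
    return before_ms / after_ms


def saved_ms(before_ms: float, after_ms: float) -> float:
    """Milliseconds of tokenization time removed per request."""
    return before_ms - after_ms


if __name__ == "__main__":
    for tokens, before, after in ROWS:
        print(f"{tokens:>5}: {speedup(before, after):.2f}x, "
              f"{saved_ms(before, after):.2f} ms saved")
```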

Environment: Python 3.12, transformers 4.57.6, tiktoken 0.12.0, aarch64 Linux.

Root Cause (instrumented call stack)

vllm/renderers/base.py:399         await tokenizer.encode(prompt, **kwargs)
vllm/utils/async_utils.py:63       return (await self(prompt, **kwargs)).input_ids
  → async_utils.py:118               self.tokenizer(text, **kwargs)       # __call__
  → tokenization_utils_base.py:2996  __call__ → encode_plus
  → tokenization_utils.py:743        _encode_plus():
      → line 767                        tokens = self.tokenize(text)       # slow chain starts
      → tokenization_utils.py:661         tokens_trie.split(text)          # 76% overhead
      → tokenization_utils.py:697         self._tokenize(segment)
      → tokenization_kimi.py:283            [self.decoder[t] for t in self.encode(text)]
                                            # encode(text) → fast tiktoken → list[int]
                                            # then int→str for each token (3%)
      → line 768                        self.convert_tokens_to_ids(tokens)  # str→int (21%)
      → line 803                        self.prepare_for_model(ids, ...)    # truncation etc.

Note: the kwargs (add_special_tokens, truncation, max_length) are consumed by _encode_plus as named parameters. They never reach Kimi's encode(). The fast tiktoken path runs inside _tokenize() — but its list[int] output is converted to strings and back, and the entire input is scanned by the trie first.
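The issue does not say which tool produced the per-stage percentages. One way to get a comparable per-function breakdown is Python's built-in cProfile, sketched below. The helper profile_call and the toy functions are hypothetical stand-ins, and the percentages it yields on a real tokenizer are not guaranteed to match the ones quoted above.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, limit=15, **kwargs):
    """Run fn under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit)
    return result, stream.getvalue()


def toy_split(text):
    """Hypothetical stand-in for a Python-side scan over the whole input."""
    return text.split()


def toy_pipeline(text):
    """Hypothetical stand-in for a multi-stage tokenization call."""
    return [len(piece) for piece in toy_split(text)]


if __name__ == "__main__":
    _, report = profile_call(toy_pipeline, "the quick brown fox " * 1000)
    print(report)
```

With a real tokenizer loaded, the same helper could presumably be called as profile_call(tokenizer, prompt) to see which functions dominate the cumulative time.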

Proposed Fix

The fix lives entirely in the model repository's files; no vLLM code change is needed. Override _encode_plus in tokenization_kimi.py to call self.encode(text) directly, skipping the tokenize() → _tokenize() → convert_tokens_to_ids() chain. prepare_for_model() still handles truncation and special tokens correctly.

def _encode_plus(self, text, text_pair=None, add_special_tokens=True,
                 padding_strategy=PaddingStrategy.DO_NOT_PAD,
                 truncation_strategy=TruncationStrategy.DO_NOT_TRUNCATE,
                 max_length=None, stride=0, is_split_into_words=False,
                 pad_to_multiple_of=None, padding_side=None,
                 return_tensors=None, return_token_type_ids=None,
                 return_attention_mask=None, return_overflowing_tokens=False,
                 return_special_tokens_mask=False, return_offsets_mapping=False,
                 return_length=False, verbose=True, **kwargs):
    # Fast path: call self.encode() directly → tiktoken (Rust).
    # Bypasses tokenize() → _tokenize() → convert_tokens_to_ids() overhead.
    if isinstance(text, str) and text_pair is None and not is_split_into_words:
        first_ids = self.encode(text)
    else:
        return super()._encode_plus(
            text, text_pair=text_pair, add_special_tokens=add_special_tokens,
            padding_strategy=padding_strategy, truncation_strategy=truncation_strategy,
            max_length=max_length, stride=stride, is_split_into_words=is_split_into_words,
            pad_to_multiple_of=pad_to_multiple_of, padding_side=padding_side,
            return_tensors=return_tensors, return_token_type_ids=return_token_type_ids,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_length=return_length, verbose=verbose, **kwargs)

    return self.prepare_for_model(
        first_ids, add_special_tokens=add_special_tokens,
        padding=padding_strategy.value, truncation=truncation_strategy.value,
        max_length=max_length, stride=stride,
        pad_to_multiple_of=pad_to_multiple_of, padding_side=padding_side,
        return_tensors=return_tensors, prepend_batch_axis=True,
        return_attention_mask=return_attention_mask,
        return_token_type_ids=return_token_type_ids,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_length=return_length, verbose=verbose)

Why this is safe:

  • Truncation: Handled by prepare_for_model(), not post-hoc.
  • Special tokens: build_inputs_with_special_tokens() called inside prepare_for_model().
  • Trie split skipped: Kimi's encode() calls tiktoken with allowed_special="all", so special tokens in the input are handled natively by Rust — the Python trie scan is redundant.
  • Complex inputs: Text pairs, pre-split words fall back to super()._encode_plus().
  • Both paths fixed: __call__ and encode(**kwargs) both funnel through _encode_plus.
  • No vLLM changes needed.
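The fast-path-with-fallback shape of the override can be shown in isolation. The classes below are hypothetical toys, not Hugging Face or Kimi code; they only demonstrate that a single plain string takes the direct route while text pairs and pre-split words are delegated to the parent implementation.

```python
class ToyBaseTokenizer:
    """Hypothetical stand-in for the generic (slow) base-class pipeline."""

    vocab = {"hello": 0, "world": 1}

    def encode_ids(self, text):
        """Stand-in for a direct encoder: text -> token IDs."""
        return [self.vocab[word] for word in text.split()]

    def _encode_plus(self, text, text_pair=None, is_split_into_words=False):
        # Generic path: supports pre-split words and text pairs.
        words = text if is_split_into_words else text.split()
        ids = [self.vocab[word] for word in words]
        if text_pair is not None:
            ids += [self.vocab[word] for word in text_pair.split()]
        return {"input_ids": ids, "path": "generic"}


class ToyFastTokenizer(ToyBaseTokenizer):
    def _encode_plus(self, text, text_pair=None, is_split_into_words=False):
        # Fast path only for the simple case: one plain string.
        if isinstance(text, str) and text_pair is None and not is_split_into_words:
            return {"input_ids": self.encode_ids(text), "path": "fast"}
        # Everything else falls back to the generic implementation.
        return super()._encode_plus(
            text, text_pair=text_pair, is_split_into_words=is_split_into_words
        )


if __name__ == "__main__":
    tok = ToyFastTokenizer()
    print(tok._encode_plus("hello world"))
    print(tok._encode_plus("hello", text_pair="world"))
```

Keeping the guard narrow means any input shape the fast path was not verified against keeps its original behavior.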

Measured with patch applied:

Prompt | Original | Patched  | Speedup
1k     | 1.34 ms  | 0.58 ms  | 2.30x
16k    | 19.00 ms | 8.05 ms  | 2.36x
64k    | 75.66 ms | 31.87 ms | 2.37x

Correctness Verification

Tested on a 100-conversation ShareGPT dataset (11,122 text segments + 400 synthetic special-token edge cases), 6 code paths each:

  • __call__(text, add_special_tokens=True, truncation=True, max_length=131073)
  • encode(text, add_special_tokens=True, truncation=True, max_length=131073)
  • encode(text) (no kwargs)
  • __call__ with active truncation to 50 tokens
  • encode with active truncation to 50 tokens
  • add_special_tokens=False

69,132 total comparisons. Zero mismatches. Before-fix and after-fix produce identical token IDs in every case.
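The total follows from the counts above: (11,122 + 400) segments × 6 code paths = 69,132 comparisons. The issue does not include its verification script, so the following is only a sketch of how such a before/after comparison might be organized. count_mismatches and the two toy encoders are hypothetical; in practice the two callables would wrap the unpatched and patched tokenizers.

```python
def count_mismatches(reference, candidate, texts, kwargs_list):
    """Call both callables on every (text, kwargs) pair.

    Returns (total_comparisons, list_of_mismatching_cases).
    """
    total = 0
    mismatches = []
    for text in texts:
        for kwargs in kwargs_list:
            total += 1
            if reference(text, **kwargs) != candidate(text, **kwargs):
                mismatches.append((text, kwargs))
    return total, mismatches


# Hypothetical toy encoders standing in for the unpatched and patched tokenizer.
def toy_reference(text, max_length=None):
    ids = [ord(ch) for ch in text]
    return ids[:max_length] if max_length is not None else ids


def toy_candidate(text, max_length=None):
    return list(map(ord, text))[:max_length]


if __name__ == "__main__":
    texts = ["hello", "<|special|> world", ""]
    kwargs_list = [{}, {"max_length": 3}]
    total, bad = count_mismatches(toy_reference, toy_candidate, texts, kwargs_list)
    print(f"{total} comparisons, {len(bad)} mismatches")
```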

Reproduction

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nvidia/Kimi-K2.5-NVFP4", trust_remote_code=True)
prompt = "The quick brown fox jumps over the lazy dog. " * 1300

for _ in range(3):  # warmup
    tokenizer.encode(prompt)
    tokenizer(prompt, add_special_tokens=True, truncation=True, max_length=131073)

t0 = time.perf_counter()
fast = tokenizer.encode(prompt)
t_fast = time.perf_counter() - t0

t0 = time.perf_counter()
slow = tokenizer(prompt, add_special_tokens=True, truncation=True, max_length=131073).input_ids
t_slow = time.perf_counter() - t0

print(f"encode():  {t_fast*1000:.1f} ms | __call__(): {t_slow*1000:.1f} ms | {t_slow/t_fast:.1f}x | match={fast==slow}")
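The script above times each path once, so a single noisy run can skew the ratio. A small repeated-timing helper is one way to reduce that. This is a sketch with hypothetical names (time_call, toy_encode), and the toy encoder is only there so the block runs without downloading a model.

```python
import statistics
import time


def time_call(fn, *args, repeats=20, **kwargs):
    """Return (min_ms, median_ms) wall-clock time over repeated calls."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return min(samples), statistics.median(samples)


def toy_encode(text):
    """Hypothetical stand-in for tokenizer.encode."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    best, median = time_call(toy_encode, "The quick brown fox. " * 1300)
    print(f"min {best:.3f} ms | median {median:.3f} ms")
```

With the real tokenizer, the two paths could be compared with time_call(tokenizer.encode, prompt) and time_call(tokenizer, prompt, add_special_tokens=True, truncation=True, max_length=131073).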

cc @zixi-qi @zhewenl @ywang96



extent analysis

TL;DR

Override the _encode_plus method in tokenization_kimi.py to directly call self.encode(text) and bypass the slow Python pipeline.

Guidance

  • Identify the performance bottleneck in the tokenization process: the unnecessary Python pipeline (tokenize() → _tokenize() → convert_tokens_to_ids()) that converts int→str→int on every token.
  • Verify that the proposed fix correctly handles truncation, special tokens, and complex inputs by checking the prepare_for_model() function and the build_inputs_with_special_tokens() function.
  • Test the fix using the provided reproduction code and verify that the speedup is consistent with the expected 2.4x improvement.
  • Ensure that the fix does not introduce any regressions by running the correctness verification tests on a large dataset.

Example

The script in the Reproduction section above can be used to test the fix.

Notes

The fix assumes that tokenization_kimi.py can be modified and that self.encode(text) produces the same token IDs as the full pipeline. The fast path covers only a single plain string; text pairs and pre-split words fall back to super()._encode_plus() and keep the original cost. Inputs outside the six verified code paths may need further testing.

Recommendation

Override the _encode_plus method in tokenization_kimi.py to call self.encode(text) directly. The issue reports a consistent ~2.4x speedup and zero token-ID mismatches across 69,132 comparisons.
