vllm - ✅(Solved) Fix [Bug]: Kimi K2.5 multimodal inference broken — media_placeholder_token_id mismatch with runtime tokenizer [1 pull requests, 1 participants]

vllm · 2026-04-08 03:09:56

GitHub stats

Issue: vllm-project/vllm#39261 (fetched 2026-04-09 07:52:16)

Author: pstefa1707

Participants: pstefa1707

Timeline (top): referenced ×2, cross-referenced ×1, labeled ×1

Error Message

AssertionError: Failed to apply prompt replacement for mm_items['vision_chunk'][0]

Root Cause

KimiK25Config.media_placeholder_token_id is set to 163605, but at runtime the tokenizer maps <|media_pad|> to token ID 163602, and ID 163605 resolves to [UNK]. When _get_prompt_updates builds a PromptReplacement targeting [163605], it searches the tokenized input for an ID that never appears there: the actual <|media_pad|> tokens are encoded as 163602, so the target is never found and the assertion fires.

Fix Action

Fix proposed (not yet merged at fetch time)

Proposed in PR: fix(kimi_k25): resolve media_placeholder_token_id from tokenizer (https://github.com/vllm-project/vllm/pull/39344). The PR state recorded below is open and unmerged.

PR fix notes

PR #39344: fix(kimi_k25): resolve media_placeholder_token_id from tokenizer

Repository: vllm-project/vllm
Author: r266-tech
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39344

Description (problem / solution / changelog)

Summary

Kimi-K2.5 multimodal inference (images/video) is completely broken because KimiK25Config.media_placeholder_token_id (163605) disagrees with the tokenizer's actual mapping for <|media_pad|> (163602).

Root cause: Kimi-K2.5 is the only major model that doesn't ship a tokenizer.json, forcing transformers to auto-convert from a slow tiktoken-based tokenizer. This auto-conversion silently compacts special token ID gaps, shifting <|media_pad|> from 163605 to 163602.

Fix:

In KimiK25ProcessingInfo.__init__, resolve the correct token ID from the tokenizer via convert_tokens_to_ids("<|media_pad|>") and patch the config if they disagree (with a warning log)
In _get_prompt_updates, use the already-resolved self.info.media_token_id instead of re-reading from config

This ensures the correct token ID is used throughout the processing pipeline regardless of whether the upstream model's config.json has the right value.

Fixes #39261

Test plan

Run existing Kimi-K2.5 multimodal tests to verify they pass
Test with moonshotai/Kimi-K2.5 model + image input to verify the assertion error no longer occurs
Verify text-only Kimi-K2.5 inference is unaffected (no regression)

Changed files

vllm/model_executor/models/kimi_k25.py (modified, +28/-3)

Your current environment

The output of python collect_env.py:

Collecting environment information...
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
OS                           : Ubuntu 24.04.1 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Python version               : 3.12.3 (64-bit runtime)
Is CUDA available            : True
CUDA runtime version         : 12.8.93
GPU models and configuration :
GPU 0-7: NVIDIA H100 80GB HBM3
Nvidia driver version        : 570.148.08
cuDNN version                : 9.20.0
vLLM Version                 : 0.19.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled

[pip3] transformers==5.5.0
[pip3] torch==2.10.0+cu128
[pip3] flash-attn==2.8.3+cu128torch2.10
[pip3] transformer-engine-torch==2.12.0

🐛 Describe the bug

Kimi K2.5 multimodal inference (images/video) is completely broken. Any request with image or video input fails with:

AssertionError: Failed to apply prompt replacement for mm_items['vision_chunk'][0]

Why this happens

vllm/transformers_utils/configs/kimi_k25.py hardcodes:

media_placeholder_token_id: int = 163605

This value comes from Kimi K2.5's config.json, which was written for the slow TikTokenTokenizer. However, transformers v5 auto-converts the slow tokenizer to a fast TokenizersBackend, compacting gaps in the special token ID range. After compaction, <|media_pad|> is at 163602 and [UNK] moves down to occupy 163605.
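To make the compaction effect concrete, here is a toy sketch. It is not the conversion code in transformers, and the token names and IDs are made up for illustration (the helper compact_special_ids is hypothetical). It only shows how renumbering special tokens consecutively can move a placeholder token to a lower ID while a different token lands on the old one.

```python
# Toy illustration only: token names and IDs are invented, and this is not
# the actual slow-to-fast conversion logic used by transformers.

def compact_special_ids(special_tokens: dict[str, int]) -> dict[str, int]:
    """Reassign special tokens to consecutive IDs, preserving their order."""
    ordered = sorted(special_tokens.items(), key=lambda kv: kv[1])
    first_id = ordered[0][1]
    return {tok: first_id + i for i, (tok, _) in enumerate(ordered)}


# Hypothetical "slow tokenizer" layout with gaps between special token IDs.
slow_ids = {
    "<|a|>": 100,
    "<|b|>": 103,
    "<|media_pad|>": 105,
    "<|c|>": 106,
    "<|d|>": 107,
    "[UNK]": 110,
}

fast_ids = compact_special_ids(slow_ids)
id_to_token = {i: tok for tok, i in fast_ids.items()}

# A config written against the slow layout still holds the old ID.
stale_config_id = slow_ids["<|media_pad|>"]

print(fast_ids["<|media_pad|>"])      # 102: the placeholder moved down
print(id_to_token[stale_config_id])   # [UNK]: a different token now sits at 105
```

In this toy layout the placeholder drops by three and [UNK] ends up on the stale ID, which mirrors the 163605 to 163602 shift described above without claiming to reproduce the real vocabulary.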

In KimiK25MultiModalProcessor._get_prompt_updates, the PromptReplacement target is set using this stale config value:

def _get_prompt_updates(self, ...):
    hf_config = self.info.get_hf_config()
    media_token_id = hf_config.media_placeholder_token_id  # 163605 = [UNK]
    ...
    return [
        PromptReplacement(
            modality="vision_chunk",
            target=[media_token_id],  # looking for 163605, never found
            replacement=get_replacement,
        ),
    ]

The chat template tokenizes <|media_pad|> as 163602, so the PromptReplacement target [163605] is never present in the tokenized input_ids.
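The failure mode can be sketched without vLLM. The helper below, find_token_run, is a hypothetical stand-in for the target search and is not vLLM's implementation. The two placeholder IDs come from this report, while the surrounding token IDs are arbitrary.

```python
# Simplified stand-in for a prompt-replacement target search.
# Assumption: matching is modeled as a plain contiguous subsequence search.

def find_token_run(input_ids: list[int], target: list[int]) -> list[int]:
    """Return every start index where `target` occurs contiguously."""
    n = len(target)
    return [
        i
        for i in range(len(input_ids) - n + 1)
        if input_ids[i : i + n] == target
    ]


# 163602 is what the runtime tokenizer emits for <|media_pad|>;
# the other IDs are arbitrary filler for the text around it.
input_ids = [11, 22, 163602, 33, 44]

stale_matches = find_token_run(input_ids, [163605])   # ID taken from the config
actual_matches = find_token_run(input_ids, [163602])  # ID taken from the tokenizer

print(stale_matches)   # []  -> nothing to replace, so the replacement cannot be applied
print(actual_matches)  # [2] -> the placeholder is found at index 2
```

With the stale ID the search returns no positions, which corresponds to the condition that makes the replacement fail for mm_items['vision_chunk'][0].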

How to reproduce

We ran all of these locally and confirmed the results.

1. Verify the token ID mismatch (no GPU needed):

from transformers import AutoTokenizer, AutoConfig

tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
cfg = AutoConfig.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

# config.json says 163605, but the tokenizer says 163602
print(f"config.media_placeholder_token_id = {cfg.media_placeholder_token_id}")  # 163605
print(f"tokenizer('<|media_pad|>') = {tok.convert_tokens_to_ids('<|media_pad|>')}")  # 163602
print(f"token at 163605 = {tok.convert_ids_to_tokens(163605)}")  # [UNK]

Output:

config.media_placeholder_token_id = 163605
tokenizer('<|media_pad|>') = 163602
token at 163605 = [UNK]

2. Verify the PromptReplacement target is absent from tokenized input (no GPU needed):

from transformers import AutoTokenizer, AutoConfig

tok = AutoTokenizer.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)
cfg = AutoConfig.from_pretrained("moonshotai/Kimi-K2.5", trust_remote_code=True)

# Tokenize a VL prompt using the chat template
messages = [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "placeholder"}},
    {"type": "text", "text": "Describe this image"},
]}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_ids = tok.encode(text, add_special_tokens=False)

config_target = cfg.media_placeholder_token_id  # 163605
actual_media_id = tok.convert_tokens_to_ids("<|media_pad|>")  # 163602

print(f"Config target {config_target} ([UNK]) in input_ids: {config_target in input_ids}")  # False
print(f"Actual <|media_pad|> {actual_media_id} in input_ids: {actual_media_id in input_ids}")  # True

Output:

Config target 163605 ([UNK]) in input_ids: False
Actual <|media_pad|> 163602 in input_ids: True

The PromptReplacement target (163605) is never present in the tokenized input. The actual <|media_pad|> tokens are at 163602, but vLLM is searching for the wrong ID.

3. Multimodal inference fails (requires GPU + full model):

from vllm import LLM, SamplingParams

llm = LLM(
    model="moonshotai/Kimi-K2.5",
    trust_remote_code=True,
    tensor_parallel_size=8,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(
    {
        "prompt": "<|im_user|>user<|im_middle|><|media_pad|>Describe this image<|im_end|><|im_assistant|>assistant<|im_middle|>",
        "multi_modal_data": {
            "image": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/300px-PNG_transparency_demonstration_1.png",
        },
    },
    sampling_params,
)

AssertionError: Failed to apply prompt replacement for mm_items['vision_chunk'][0]

Suggested Fix

_get_prompt_updates should use self.info.media_token_id (which is resolved from the tokenizer in KimiK25ProcessingInfo.__init__) instead of re-reading from hf_config.media_placeholder_token_id:

def _get_prompt_updates(self, ...):
    media_token_id = self.info.media_token_id  # resolved from tokenizer
    ...

Additionally, KimiK25ProcessingInfo.__init__ should cross-check the config value against the tokenizer and override it if they disagree:

def __init__(self, ctx):
    ...
    tokenizer = self.get_tokenizer()
    correct_id = tokenizer.convert_tokens_to_ids("<|media_pad|>")
    if isinstance(correct_id, int) and correct_id != self.media_token_id:
        self.media_token_id = correct_id
        self.media_token = tokenizer.decode(correct_id)
        self.hf_processor.media_token_id = correct_id
        self.hf_config.media_placeholder_token_id = correct_id
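The cross-check can be exercised in isolation. The sketch below is a minimal, hedged version of that logic, written as a free function with a stub tokenizer so it runs without the model. StubTokenizer and resolve_media_token_id are hypothetical names, not part of vLLM or transformers, and the warning mirrors the warning log that the PR description mentions.

```python
import logging
from typing import Optional

logger = logging.getLogger("kimi_k25_sketch")


class StubTokenizer:
    """Hypothetical stand-in exposing only convert_tokens_to_ids."""

    def __init__(self, vocab: dict[str, int]) -> None:
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> Optional[int]:
        # Assumption: unknown tokens yield None here. A real tokenizer may
        # return its unknown-token ID instead.
        return self._vocab.get(token)


def resolve_media_token_id(
    tokenizer, config_id: int, token: str = "<|media_pad|>"
) -> int:
    """Prefer the tokenizer's ID for `token`; fall back to the config value."""
    resolved = tokenizer.convert_tokens_to_ids(token)
    if not isinstance(resolved, int):
        return config_id  # tokenizer cannot resolve it, keep the config value
    if resolved != config_id:
        logger.warning(
            "media token id mismatch: config=%d, tokenizer=%d; using tokenizer value",
            config_id,
            resolved,
        )
    return resolved


tok = StubTokenizer({"<|media_pad|>": 163602})
print(resolve_media_token_id(tok, 163605))                # 163602
print(resolve_media_token_id(StubTokenizer({}), 163605))  # 163605 (fallback)
```

Keeping the resolution in one place means later code can read the already-resolved value instead of consulting the config again, which is the intent of the change to _get_prompt_updates.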

Related Issues

vllm-project/vllm-ascend#6934 — Kimi K2.5: "Attempted to assign 4225 multimodal tokens to 1 placeholder" (likely same root cause — wrong placeholder ID means media tokens aren't found during embedding merge)

Before submitting a new issue...

Searched existing issues — no prior report of this specific token ID mismatch
This is a vLLM bug, not a transformers bug (though the upstream model's config.json is also wrong, vLLM should resolve token IDs from the tokenizer rather than trusting config values)

Additional context

The upstream model repo (moonshotai/Kimi-K2.5) has this bug in its config.json. Kimi K2.5 is the only major model that doesn't ship a tokenizer.json, forcing transformers to auto-convert from a slow tiktoken-based tokenizer to a fast tokenizer — which silently compacts gaps in the special token IDs, shifting <|media_pad|> from 163605 to 163602 and moving [UNK] into 163605's slot. All other major models (Qwen, DeepSeek, Llama, Gemma) ship tokenizer.json and avoid this entirely.

extent analysis

TL;DR

Update the KimiK25MultiModalProcessor._get_prompt_updates method to use self.info.media_token_id instead of hf_config.media_placeholder_token_id to fix the token ID mismatch issue.

Guidance

Verify the token ID mismatch by running the provided code snippets to confirm that the config.json value differs from the tokenizer's mapping.
Update KimiK25ProcessingInfo.__init__ to cross-check the config value against the tokenizer and override it if they disagree.
Apply the suggested fix to _get_prompt_updates to use the correct media_token_id resolved from the tokenizer.
Test the updated code with multimodal inference to ensure the AssertionError is resolved.

Example

def _get_prompt_updates(self, ...):
    media_token_id = self.info.media_token_id  # resolved from tokenizer
    ...

Notes

The issue is specific to Kimi K2.5 due to its config.json and the auto-conversion of the slow tiktoken-based tokenizer to a fast tokenizer, which compacts gaps in special token IDs. Other models that ship tokenizer.json are not affected.

Recommendation

Apply the suggested workaround to update KimiK25MultiModalProcessor._get_prompt_updates and KimiK25ProcessingInfo.__init__ to resolve the token ID mismatch issue, as the upstream model repo's config.json is incorrect and vLLM should resolve token IDs from the tokenizer.
