vllm - ✅(Solved) Fix [Bug]: Failed to call /chat/completions after /tokenize for same multimodal query [1 pull requests, 1 participants]

vllm · 2026-03-30 12:19:07


GitHub stats

vllm-project/vllm#38543 • Fetched 2026-04-08 01:53:26

Author: sergey-zinchenko
Participants: sergey-zinchenko
Timeline (top): cross-referenced ×2, closed ×1, labeled ×1

Error Message

When calling the /tokenize endpoint with a multimodal message (e.g. an image), and then calling /v1/chat/completions with the same message, the chat completion request fails with an internal server error.

Fix Action

Fixed

Fixed by PR: [Bugfix] Skip multimodal processor cache in /tokenize to prevent stale SenderCache entries (https://github.com/vllm-project/vllm/pull/38545)

PR fix notes

PR #38545: [Bugfix] Skip multimodal processor cache in /tokenize to prevent stale SenderCache entries

Repository: vllm-project/vllm
Author: sergey-zinchenko
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38545

Description (problem / solution / changelog)

Purpose

Fix /v1/chat/completions failing after /tokenize is called with the same multimodal input (e.g. an image).

The /tokenize endpoint runs the full multimodal processing pipeline, which populates both the HF processor cache and the SenderCache. Since tokenization never sends a request to the engine core, those SenderCache entries are never consumed. A subsequent /v1/chat/completions request with the same image hits the processor cache, gets a stale SenderCache reference, and the engine fails with an internal error.
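The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not vLLM's actual code: ToySenderCache, ToyEngine, tokenize and chat are hypothetical names, and the sketch assumes the sender side replaces an already-seen item with a bare reference that the engine side is expected to resolve.

```python
class ToySenderCache:
    """Front-end side (hypothetical): tracks items it believes the engine holds."""

    def __init__(self):
        self._seen = set()

    def prepare(self, key, data):
        # First sighting: ship the full payload and remember the key.
        if key not in self._seen:
            self._seen.add(key)
            return {"key": key, "data": data}
        # Repeat sighting: ship only a reference, assuming the engine has it.
        return {"key": key, "data": None}


class ToyEngine:
    """Engine side (hypothetical): stores payloads and resolves references."""

    def __init__(self):
        self._store = {}

    def receive(self, msg):
        if msg["data"] is not None:
            self._store[msg["key"]] = msg["data"]
        # A reference to an item that never arrived raises KeyError.
        return self._store[msg["key"]]


def tokenize(sender, key, data):
    # Runs preprocessing (which populates the sender cache) but sends nothing.
    sender.prepare(key, data)
    return len(data)


def chat(sender, engine, key, data):
    # Runs the same preprocessing, then forwards the result to the engine.
    return engine.receive(sender.prepare(key, data))


sender, engine = ToySenderCache(), ToyEngine()
tokenize(sender, "img-hash", b"pixels")
try:
    chat(sender, engine, "img-hash", b"pixels")
    outcome = "ok"
except KeyError:
    outcome = "stale reference"
print(outcome)  # prints "stale reference"
```

In this toy model, calling chat on a fresh sender and engine pair succeeds, which mirrors the report that the chat request only fails when /tokenize saw the same image first.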

Fixes https://github.com/vllm-project/vllm/issues/38543

Approach

Added a ContextVar flag skip_mm_processor_cache in processor.py. When set to True, BaseMultiModalProcessor.apply() calls _apply_hf_processor directly instead of _cached_apply_hf_processor, so nothing is written to or read from the cache.
In the /tokenize endpoint (serving.py), the flag is set before preprocessing and reset in a finally block. This ensures multimodal data is processed correctly for token counting but doesn't pollute the SenderCache.
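The approach can be sketched roughly as follows. Only the flag name skip_mm_processor_cache and the method names apply, _apply_hf_processor and _cached_apply_hf_processor come from the PR description; ToyProcessor, its dict cache and the tokenize helper are hypothetical stand-ins, so this shows the ContextVar pattern rather than the code in the PR.

```python
from contextvars import ContextVar

# Flag name taken from the PR description; everything else is illustrative.
skip_mm_processor_cache = ContextVar("skip_mm_processor_cache", default=False)


class ToyProcessor:
    """Hypothetical stand-in for a multimodal processor with a cache."""

    def __init__(self):
        self.cache = {}

    def _apply_hf_processor(self, item):
        # Uncached path: always recompute, never touch the cache.
        return f"processed:{item}"

    def _cached_apply_hf_processor(self, item):
        # Cached path: compute once, then reuse the stored result.
        if item not in self.cache:
            self.cache[item] = self._apply_hf_processor(item)
        return self.cache[item]

    def apply(self, item):
        # When the flag is set, bypass the cache for both reads and writes.
        if skip_mm_processor_cache.get():
            return self._apply_hf_processor(item)
        return self._cached_apply_hf_processor(item)


def tokenize(processor, item):
    # Set the flag for the duration of preprocessing, then always restore it.
    token = skip_mm_processor_cache.set(True)
    try:
        return processor.apply(item)
    finally:
        skip_mm_processor_cache.reset(token)


processor = ToyProcessor()
tokenize(processor, "image")
cache_after_tokenize = dict(processor.cache)
processor.apply("image")  # stands in for a later chat completion request
cache_after_chat = dict(processor.cache)
print(cache_after_tokenize, cache_after_chat)
```

A ContextVar keeps the flag scoped to the current execution context, so concurrent requests handled in other asyncio tasks or threads should not observe it, and resetting it in the finally block restores the previous value even if preprocessing raises.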

Test Plan

.venv/bin/python -m pytest tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py -v

Test Result

Both tests pass — tokenize followed by chat completion (sync and async) succeeds without errors.

Changed files

tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py (added, +133/-0)
vllm/entrypoints/serve/tokenize/serving.py (modified, +36/-28)
vllm/multimodal/processing/processor.py (modified, +20/-5)


Your current environment

vLLM main branch (latest), tested with `Qwen/Qwen2.5-VL-3B-Instruct`.

🐛 Describe the bug

How to reproduce

Start a vLLM server with a VLM:

vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
  --dtype bfloat16 --max-model-len 4096 --enforce-eager \
  --limit-mm-per-prompt '{"image": 1}'

Send a /tokenize request with an image, then send the exact same payload to /v1/chat/completions. The second call fails.

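import requests

# RemoteOpenAIServer and the server / local_asset_server fixtures come from
# vLLM's test suite; MODEL_NAME matches the model served above.
MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct"


def test_tokenize_then_chat_completion_with_image(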
    server: RemoteOpenAIServer,
    local_asset_server,
):
    """Tokenize a multimodal message, then send the same message to chat
    completions.  The chat completion must succeed (not 500)."""

    image_url = local_asset_server.url_for("stop_sign.jpg")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Describe this image briefly."},
            ],
        }
    ]

    # Step 1: tokenize (this triggers multimodal processing in the renderer)
    tok_resp = requests.post(
        server.url_for("tokenize"),
        json={"model": MODEL_NAME, "messages": messages},
    )
    tok_resp.raise_for_status()
    tok_data = tok_resp.json()
    assert tok_data["count"] > 0, "Tokenization must return tokens"

    # Step 2: chat completion with the SAME multimodal message
    chat_resp = requests.post(
        server.url_for("v1/chat/completions"),
        json={
            "model": MODEL_NAME,
            "messages": messages,
            "max_tokens": 10,
            "temperature": 0.0,
        },
    )

    assert chat_resp.status_code == 200, (
        f"Chat completion failed after tokenize: "
        f"status={chat_resp.status_code}, body={chat_resp.text}"
    )
    chat_data = chat_resp.json()
    assert chat_data["choices"][0]["message"]["content"], (
        "Chat completion must produce non-empty content"
    )

extent analysis

Fix Plan

The fix changes the /tokenize endpoint, not /v1/chat/completions. Per PR #38545, /tokenize bypasses the multimodal processor cache so that it leaves no unconsumed SenderCache entries behind.

Add a ContextVar flag skip_mm_processor_cache in vllm/multimodal/processing/processor.py. When it is True, BaseMultiModalProcessor.apply() calls _apply_hf_processor directly instead of _cached_apply_hf_processor.
In the /tokenize endpoint (vllm/entrypoints/serve/tokenize/serving.py), set the flag before preprocessing and reset it in a finally block.

Example code snippet (an illustrative sketch of the endpoint side only, not the code from the PR; preprocess and request are hypothetical placeholders):

# In the /tokenize endpoint
token = skip_mm_processor_cache.set(True)
try:
    # Multimodal preprocessing runs here without reading or writing the cache
    result = preprocess(request)  # hypothetical placeholder
finally:
    # Always restore the previous value, even if preprocessing raises
    skip_mm_processor_cache.reset(token)

# /v1/chat/completions is unchanged and keeps using the cached path

Verification

To verify the fix, re-run the test file from the PR's test plan and check that the chat completion request sent after /tokenize returns a 200 status code with non-empty content.

# Re-run the test case
.venv/bin/python -m pytest tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py -v

Extra Tips

Do not let /tokenize write to the multimodal processor cache: the entries it creates are never consumed by the engine core and go stale.
Reset the skip_mm_processor_cache flag in a finally block so that a failed /tokenize request cannot leave the flag set.
Test both the sync and async paths, as the PR's test file does.
