vllm - ✅(Solved) Fix [Bug]: Failed to call /chat/completions after /tokenize for same multimodal query [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38543 · Fetched 2026-04-08 01:53:26
Comments: 0
Participants: 1
Timeline: 4
Reactions: 0
Timeline (top): cross-referenced ×2, closed ×1, labeled ×1

Error Message

When calling the /tokenize endpoint with a multimodal message (e.g. an image), and then calling /v1/chat/completions with the same message, the chat completion request fails with an internal server error.

Fix Action

Fixed

PR fix notes

PR #38545: [Bugfix] Skip multimodal processor cache in /tokenize to prevent stale SenderCache entries

Description (problem / solution / changelog)

Purpose

Fix /v1/chat/completions failing after /tokenize is called with the same multimodal input (e.g. an image).

The /tokenize endpoint runs the full multimodal processing pipeline, which populates both the HF processor cache and the SenderCache. Since tokenization never sends a request to the engine core, those SenderCache entries are never consumed. A subsequent /v1/chat/completions request with the same image hits the processor cache, gets a stale SenderCache reference, and the engine fails with an internal error.
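
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, assuming the sender-side cache mirrors a receiver-side cache in the engine and sends only a hash reference for items it believes were already delivered. The class and function names are hypothetical and are not vLLM code.

```python
from typing import Optional, Tuple


class ToySenderCache:
    """Front-end side: remembers hashes it believes were already sent."""

    def __init__(self) -> None:
        self.seen = set()

    def prepare(self, mm_hash: str, data: bytes) -> Tuple[str, Optional[bytes]]:
        if mm_hash in self.seen:
            return mm_hash, None  # send only a reference, no payload
        self.seen.add(mm_hash)
        return mm_hash, data


class ToyReceiverCache:
    """Engine side: stores payloads it has actually received."""

    def __init__(self) -> None:
        self.store = {}

    def resolve(self, mm_hash: str, data: Optional[bytes]) -> bytes:
        if data is not None:
            self.store[mm_hash] = data
            return data
        # Raises KeyError when the sender's entry was never delivered.
        return self.store[mm_hash]


def tokenize(sender: ToySenderCache, mm_hash: str, data: bytes) -> None:
    # Populates the sender cache, but nothing is forwarded to the engine.
    sender.prepare(mm_hash, data)


def chat(
    sender: ToySenderCache, receiver: ToyReceiverCache, mm_hash: str, data: bytes
) -> bytes:
    return receiver.resolve(*sender.prepare(mm_hash, data))
```

In this model, calling `tokenize` and then `chat` with the same hash raises `KeyError`, because the sender emits a bare reference that the receiver cannot resolve. Calling `chat` twice without `tokenize` works, since the first call delivers the payload.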

Fixes https://github.com/vllm-project/vllm/issues/38543

Approach

  • Added a ContextVar flag skip_mm_processor_cache in processor.py. When set to True, BaseMultiModalProcessor.apply() calls _apply_hf_processor directly instead of _cached_apply_hf_processor, so nothing is written to or read from the cache.
  • In the /tokenize endpoint (serving.py), the flag is set before preprocessing and reset in a finally block. This ensures multimodal data is processed correctly for token counting but doesn't pollute the SenderCache.
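
One property of a ContextVar worth noting here is that a value set inside one asyncio task is not visible to other tasks, so a flag set by a tokenize request should not affect chat requests running at the same time. The following is a minimal sketch of that behavior. `ToyProcessor`, `tokenize`, `chat`, and `demo` are hypothetical stand-ins, and the `default=False` is an assumption. None of this is the actual vLLM source.

```python
import asyncio
from contextvars import ContextVar

# Hypothetical stand-in for the flag described above; caching stays on by default.
skip_mm_processor_cache: ContextVar[bool] = ContextVar(
    "skip_mm_processor_cache", default=False
)


class ToyProcessor:
    """Toy stand-in for a multimodal processor with a result cache."""

    def __init__(self) -> None:
        self.cache = {}

    def _process(self, item: str) -> str:
        return f"processed:{item}"

    def apply(self, item: str) -> str:
        if skip_mm_processor_cache.get():
            # Bypass: neither read from nor write to the cache.
            return self._process(item)
        if item not in self.cache:
            self.cache[item] = self._process(item)
        return self.cache[item]


async def tokenize(proc: ToyProcessor, item: str) -> str:
    token = skip_mm_processor_cache.set(True)
    try:
        await asyncio.sleep(0)  # yield so other tasks run while the flag is set
        return proc.apply(item)
    finally:
        skip_mm_processor_cache.reset(token)


async def chat(proc: ToyProcessor, item: str) -> str:
    await asyncio.sleep(0)
    return proc.apply(item)


async def demo():
    proc = ToyProcessor()
    await tokenize(proc, "image-a")
    after_tokenize = sorted(proc.cache)
    # Run both concurrently: the flag set inside tokenize() stays task-local.
    await asyncio.gather(tokenize(proc, "image-b"), chat(proc, "image-c"))
    return after_tokenize, sorted(proc.cache)
```

Running `asyncio.run(demo())` leaves the cache empty after the tokenize call, and only the item processed by `chat` ends up cached even though a tokenize task had the flag set concurrently.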

Test Plan

.venv/bin/python -m pytest tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py -v

Test Result

Both tests pass — tokenize followed by chat completion (sync and async) succeeds without errors.


Changed files

  • tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py (added, +133/-0)
  • vllm/entrypoints/serve/tokenize/serving.py (modified, +36/-28)
  • vllm/multimodal/processing/processor.py (modified, +20/-5)

Code Example

vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
     --dtype bfloat16 --max-model-len 4096 --enforce-eager \
     --limit-mm-per-prompt '{"image": 1}'

---

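import requests

# vLLM test helper; this import path is an assumption.
from tests.utils import RemoteOpenAIServer

# Assumed from the serve command above.
MODEL_NAME = "Qwen/Qwen2.5-VL-3B-Instruct"


# `server` and `local_asset_server` are pytest fixtures.
def test_tokenize_then_chat_completion_with_image(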
    server: RemoteOpenAIServer,
    local_asset_server,
):
    """Tokenize a multimodal message, then send the same message to chat
    completions.  The chat completion must succeed (not 500)."""

    image_url = local_asset_server.url_for("stop_sign.jpg")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Describe this image briefly."},
            ],
        }
    ]

    # Step 1: tokenize (this triggers multimodal processing in the renderer)
    tok_resp = requests.post(
        server.url_for("tokenize"),
        json={"model": MODEL_NAME, "messages": messages},
    )
    tok_resp.raise_for_status()
    tok_data = tok_resp.json()
    assert tok_data["count"] > 0, "Tokenization must return tokens"

    # Step 2: chat completion with the SAME multimodal message
    chat_resp = requests.post(
        server.url_for("v1/chat/completions"),
        json={
            "model": MODEL_NAME,
            "messages": messages,
            "max_tokens": 10,
            "temperature": 0.0,
        },
    )

    assert chat_resp.status_code == 200, (
        f"Chat completion failed after tokenize: "
        f"status={chat_resp.status_code}, body={chat_resp.text}"
    )
    chat_data = chat_resp.json()
    assert chat_data["choices"][0]["message"]["content"], (
        "Chat completion must produce non-empty content"
    )

Your current environment

vLLM main branch (latest), tested with `Qwen/Qwen2.5-VL-3B-Instruct`.

🐛 Describe the bug

When calling the /tokenize endpoint with a multimodal message (e.g. an image), and then calling /v1/chat/completions with the same message, the chat completion request fails with an internal server error.

How to reproduce

  1. Start a vLLM server with a VLM:

    vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
      --dtype bfloat16 --max-model-len 4096 --enforce-eager \
      --limit-mm-per-prompt '{"image": 1}'
  2. Send a /tokenize request with an image, then send the exact same payload to /v1/chat/completions. The second call fails.
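
Outside the pytest fixtures, step 2 can be approximated with a small standalone script. This is a sketch under assumptions: `BASE_URL` assumes the server listens on the default `vllm serve` address, the image URL is whatever reachable image you supply, and the helper names are hypothetical. It uses only the standard library.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default `vllm serve` address
MODEL = "Qwen/Qwen2.5-VL-3B-Instruct"


def build_messages(image_url: str) -> list:
    """Build the multimodal message sent to both endpoints."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": "Describe this image briefly."},
            ],
        }
    ]


def post_json(path: str, payload: dict) -> tuple:
    """POST a JSON payload and return (status_code, response_body)."""
    req = urllib.request.Request(
        f"{BASE_URL}/{path}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # Error responses (such as a 500) arrive as exceptions; report them as data.
        return err.code, err.read().decode()


def reproduce(image_url: str) -> tuple:
    """Call /tokenize, then /v1/chat/completions, with the same messages."""
    messages = build_messages(image_url)
    tok_status, _ = post_json("tokenize", {"model": MODEL, "messages": messages})
    chat_status, _ = post_json(
        "v1/chat/completions",
        {"model": MODEL, "messages": messages, "max_tokens": 10, "temperature": 0.0},
    )
    return tok_status, chat_status
```

Calling `reproduce(...)` with an image URL the server can fetch returns the two status codes. On an affected build the second is expected to be an error status rather than 200.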

extent analysis

Fix Plan

The fix, as described in PR #38545, changes the /tokenize path rather than /v1/chat/completions: tokenization bypasses the multimodal processor cache, so it cannot leave stale SenderCache entries behind.

  • Add a ContextVar flag skip_mm_processor_cache in processor.py. When it is True, BaseMultiModalProcessor.apply() calls _apply_hf_processor directly instead of _cached_apply_hf_processor.
  • In the /tokenize endpoint (serving.py), set the flag before preprocessing and reset it in a finally block.

Example code snippet (a simplified sketch of the pattern, not the actual vLLM source; the default value and the elided preprocessing call are assumptions):

# processor.py: flag read by BaseMultiModalProcessor.apply()
from contextvars import ContextVar

skip_mm_processor_cache: ContextVar[bool] = ContextVar(
    "skip_mm_processor_cache", default=False
)

# serving.py: inside the /tokenize handler
token = skip_mm_processor_cache.set(True)
try:
    ...  # preprocess the request as usual; the processor cache is bypassed
finally:
    skip_mm_processor_cache.reset(token)

Verification

To verify the fix, re-run the test file from the PR's test plan and check that the chat completion request returns a 200 status code after /tokenize has been called with the same message.

# Re-run the test case
.venv/bin/python -m pytest tests/entrypoints/serve/tokenize/test_tokenize_then_chat_vlm.py -v

Extra Tips

  • The change is confined to /tokenize. According to the PR notes, the flag is only set there, so /v1/chat/completions should keep using the processor cache.
  • Because the flag is reset in a finally block, a failed /tokenize request does not leave cache skipping enabled for later requests.
  • With the flag set, /tokenize neither reads from nor writes to the cache, so it reprocesses the multimodal input on every call. That trades some repeated work for correctness.
