vllm - 💡(How to fix) Fix [RFC]: Enable prompt_embeds content parts in Chat Completions API [3 comments, 3 participants]

GitHub stats: vllm-project/vllm#39504, fetched 2026-04-11 06:13:11
Comments: 3 · Participants: 3 · Timeline events: 12 · Reactions: 3
Timeline (top): mentioned ×4, subscribed ×4, commented ×3, labeled ×1


Motivation.

vLLM supports prompt_embeds (pre-computed embeddings) in the Completions API (/v1/completions) via the --enable-prompt-embeds flag. This allows users to bypass the model's embedding layer by providing a serialized tensor of shape (num_tokens, hidden_size).

However, the Chat Completions API (/v1/chat/completions) does not support prompt_embeds. Users who need to mix pre-computed embeddings with plain text content in a multi-turn conversation are forced to apply chat templates, tokenize, and embed manually outside of vLLM in order to use the completions endpoint. They also cannot take advantage of /v1/chat/completions features such as tool call parsing.

This RFC proposes adding prompt_embeds as a new content part type in the Chat Completions API, allowing users to interleave pre-computed embeddings with text within any message role.

Proposed Change.

API Surface

A new content part type "prompt_embeds" is added to chat messages, following the same pattern as "image_embeds" and other multimodal content parts:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Let's work on this task!"},
        {"type": "prompt_embeds", "data": "<base64_encoded_tensor>"}
      ]
    }
  ]
}

The data field contains a base64-encoded serialized torch.Tensor of shape (num_tokens, hidden_size), identical to the existing Completions API prompt_embeds format.
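
As a rough illustration of the request shape described above, the sketch below builds a chat request body with one text part and one prompt_embeds part. The helper name build_chat_request is hypothetical, and the tensor bytes are a stand-in: in practice they would presumably come from serializing a torch.Tensor of shape (num_tokens, hidden_size), for example with torch.save into an in-memory buffer.

```python
import base64
import json


def build_chat_request(text: str, serialized_tensor: bytes) -> dict:
    """Build a chat request mixing a text part with a prompt_embeds part.

    `serialized_tensor` is assumed to hold an already-serialized tensor;
    this helper only handles the base64 encoding and the JSON layout.
    """
    encoded = base64.b64encode(serialized_tensor).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "prompt_embeds", "data": encoded},
                ],
            }
        ]
    }


# Stand-in bytes (assumption): not a real serialized tensor.
fake_tensor_bytes = b"\x00\x01\x02\x03"
request = build_chat_request("Let's work on this task!", fake_tensor_bytes)
body = json.dumps(request)  # JSON body that would be POSTed to /v1/chat/completions
```

Keeping the encoding in one helper makes it easy to add further prompt_embeds parts to the same content list, since each part carries its own independent base64 payload.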

The feature remains gated behind --enable-prompt-embeds.

Multiple prompt_embeds parts can appear in a single message or across messages (system, user, assistant), and can be freely interleaved with text parts.

Design Overview

The key challenge is that chat templates expect text, but prompt_embeds are pre-computed embeddings. The approach is:

Placeholder substitution: During message parsing, each prompt_embeds part is replaced with N (where N is the tensor's num_tokens dimension) copies of a dedicated placeholder token (<prompt_embeds>, registered as a special token in the tokenizer).

Template rendering: The chat template sees only text (including placeholder tokens) and renders normally.

Position detection: After tokenization, the placeholder token IDs are located in the token sequence to determine exact positions.

Mask construction: A full-length is_token_ids mask and embedding tensor are built, mapping each position to either the model's embedding layer (mask = True) or the pre-computed embedding (mask = False).

GPU pipeline: The existing enable_prompt_embeds forward pass handles the rest; no changes to gpu_model_runner.py are needed.
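
The position detection and mask construction steps can be sketched in plain Python as follows. This is a minimal sketch, not the proposed implementation: the function names and the placeholder ID 9999 are assumptions chosen for illustration, and real code would operate on tensors rather than lists.

```python
def find_placeholder_runs(token_ids, placeholder_id):
    """Return (start_pos, length) for each consecutive run of placeholder IDs."""
    runs = []
    start = None
    for pos, tok in enumerate(token_ids):
        if tok == placeholder_id:
            if start is None:
                start = pos  # a new run begins here
        elif start is not None:
            runs.append((start, pos - start))  # the run ended just before pos
            start = None
    if start is not None:
        # The sequence ended while still inside a run.
        runs.append((start, len(token_ids) - start))
    return runs


def build_is_token_ids_mask(token_ids, placeholder_id):
    """True where the embedding layer is used, False for pre-computed rows."""
    return [tok != placeholder_id for tok in token_ids]


PLACEHOLDER_ID = 9999  # hypothetical ID of the <prompt_embeds> token
token_ids = [1, 15, 27, 9999, 9999, 9999, 42, 9999, 9999, 2]

runs = find_placeholder_runs(token_ids, PLACEHOLDER_ID)
mask = build_is_token_ids_mask(token_ids, PLACEHOLDER_ID)
print(runs)  # [(3, 3), (7, 2)]
```

Note that two prompt_embeds parts placed back to back would show up as a single run, so a real implementation would likely need the per-tensor num_tokens values to split runs correctly.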

Placeholder Token

A dedicated <prompt_embeds> special token is registered via tokenizer.add_special_tokens() at startup (when --enable-prompt-embeds is set). Special tokens are matched before BPE/WordPiece, so they always encode to exactly 1 token ID, never split into subwords. This approach:

  • Guarantees 1:1 placeholder-to-position mapping.
  • Avoids collision with chat template structural tokens (e.g., eos_token).
  • Works across all tokenizer families.
  • Has precedent in vLLM (DeepSeek VL2 uses the same API at vllm/transformers_utils/processors/deepseek_vl2.py).
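
To illustrate why matching special tokens before sub-word splitting gives a 1:1 mapping, here is a toy encoder. It is not a real tokenizer and does not use any tokenizer library: splitting ordinary text into single characters is a stand-in for BPE/WordPiece, and the ID 0 for the placeholder is an assumption.

```python
import re

SPECIAL = "<prompt_embeds>"
SPECIAL_ID = 0  # hypothetical ID reserved for the placeholder token


def toy_encode(text, vocab):
    """Encode text, matching the special token before any sub-word splitting."""
    ids = []
    # The capturing group keeps each special token as its own piece.
    for piece in re.split(f"({re.escape(SPECIAL)})", text):
        if piece == SPECIAL:
            ids.append(SPECIAL_ID)  # always exactly one ID, never split
        else:
            # Toy sub-word step: one ID per character, assigned on first use.
            for ch in piece:
                ids.append(vocab.setdefault(ch, len(vocab) + 1))
    return ids


vocab = {}
num_tokens = 3
ids = toy_encode("Hi" + SPECIAL * num_tokens + "!", vocab)
print(ids)  # [1, 2, 0, 0, 0, 3]
```

Because the placeholder is consumed whole before the character-level step runs, the number of placeholder IDs in the output always equals the number of copies injected into the text.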

Data Flow

1. Request: messages[i].content[j] = {"type": "prompt_embeds", "data": "<base64>"}

2. parse_prompt_embeds()
  - safe_load_prompt_embeds() -> tensor (num_tokens, hidden_size)
  - tracker.add("prompt_embeds", tensor)
  - placeholder_str = "<prompt_embeds>" * num_tokens
  - inject placeholder_str as text into conversation
  
3. apply_chat_template(tokenize=True) -> token_ids (with placeholder token IDs)

4. _build_prompt_embeds_positions()
    - find consecutive chunks of placeholder token ID
    - return [(start_pos, length), ...] per tensor

5. Build full-length prompt_embeds tensor + is_token_ids mask.

6. EmbedsInput(prompt_embeds=full_tensor, prompt_token_ids=token_ids, is_token_ids=mask)

7. EngineCoreRequest -> Request -> InputBatch.add_request()
  - token_ids_cpu = prompt_token_ids
  - is_token_ids = per-position mask
  - req_prompt_embeds = full embedding tensor
  
8. gpu_model_runner forward pass (existing path, no changes)
    - is_token_ids=True  -> model.embed_input_ids()
    - is_token_ids=False -> pre-computed embeddings from buffer
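
Steps 5 and 8 of the flow above can be sketched with nested lists standing in for tensors. The function names, the placeholder ID 99, and the dictionary used as an embedding table are all assumptions for illustration; the actual code paths in vLLM operate on GPU tensors and are not reproduced here.

```python
def build_full_embeds(token_ids, placeholder_id, tensors, hidden_size):
    """Step 5 sketch: scatter pre-computed rows into a full-length buffer."""
    # Flatten the rows of all tensors in the order they appear in the request.
    rows = [row for tensor in tensors for row in tensor]
    positions = [i for i, tok in enumerate(token_ids) if tok == placeholder_id]
    if len(rows) != len(positions):
        raise ValueError("placeholder count does not match embedding rows")
    # Positions backed by real token IDs keep a zero row that is never read.
    full = [[0.0] * hidden_size for _ in token_ids]
    for pos, row in zip(positions, rows):
        full[pos] = list(row)
    is_token_ids = [tok != placeholder_id for tok in token_ids]
    return full, is_token_ids


def resolve_inputs(token_ids, is_token_ids, full, embedding_table):
    """Step 8 sketch: use the embedding layer or the pre-computed row."""
    return [
        embedding_table[tok] if use_ids else full[pos]
        for pos, (tok, use_ids) in enumerate(zip(token_ids, is_token_ids))
    ]


PLACEHOLDER_ID = 99  # hypothetical placeholder token ID
token_ids = [5, 99, 99, 7]
tensors = [[[0.1, 0.2], [0.3, 0.4]]]  # one tensor of shape (2, 2)
embedding_table = {5: [1.0, 1.0], 7: [2.0, 2.0]}  # toy embedding layer

full, is_token_ids = build_full_embeds(token_ids, PLACEHOLDER_ID, tensors, hidden_size=2)
inputs = resolve_inputs(token_ids, is_token_ids, full, embedding_table)
print(inputs)  # [[1.0, 1.0], [0.1, 0.2], [0.3, 0.4], [2.0, 2.0]]
```

The explicit count check matters because a chat template that drops or duplicates content would otherwise silently misalign embeddings with positions.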

Prefix Caching

The existing prefix caching mechanism works correctly for prompts that mix token IDs and pre-computed embeddings, without modification:

  • request.all_token_ids includes both real and placeholder token IDs (the primary hash key).
  • _gen_prompt_embeds_extra_hash_keys() adds a SHA256 of the embedding tensor as an extra key.

Outcome: same tokens + same embeddings produce a cache hit.
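
The cache-hit condition can be illustrated with a simplified key built from the token IDs plus a SHA256 digest of the embedding bytes. This is only a sketch of the idea: the function name cache_key is hypothetical, and vLLM's actual block-level hashing is more involved than a single tuple.

```python
import hashlib


def cache_key(token_ids, embeds_bytes):
    """Combine token IDs (primary key) with a digest of the embeddings (extra key)."""
    extra = hashlib.sha256(embeds_bytes).hexdigest()
    return (tuple(token_ids), extra)


token_ids = [5, 99, 99, 7]  # 99 is a hypothetical placeholder token ID
embeds_a = b"embedding-bytes-a"  # stand-in for serialized tensor data
embeds_b = b"embedding-bytes-b"

same = cache_key(token_ids, embeds_a) == cache_key(token_ids, embeds_a)
different = cache_key(token_ids, embeds_a) == cache_key(token_ids, embeds_b)
print(same, different)  # True False
```

The extra key is what prevents a false hit: two requests with identical placeholder token IDs but different embeddings share the primary key and differ only in the digest.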

Feedback Period.

1 week

CC List.

@qthequartermasterman @Nan2018

Any Other Things.

  • This builds on the existing --enable-prompt-embeds infrastructure. The GPU pipeline, the secure tensor loading (safe_load_prompt_embeds with weights_only=True), and the configuration flag are all reused.
  • The implementation follows the same content part pattern used for image_url, input_audio, audio_embeds, and image_embeds. No new abstractions are introduced.


extent analysis

TL;DR

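To add support for prompt_embeds in the Chat Completions API, implement a new content part type "prompt_embeds" that is replaced by placeholder tokens during message parsing. The chat template then renders normally, and the pre-computed embeddings are mapped to the placeholder positions after tokenization.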

Guidance

  • Register a dedicated <prompt_embeds> special token in the tokenizer using tokenizer.add_special_tokens() when --enable-prompt-embeds is set.
  • Have parse_prompt_embeds() replace each prompt_embeds part with num_tokens placeholder tokens and inject them into the conversation as text.
  • Have _build_prompt_embeds_positions() find consecutive chunks of placeholder token IDs and return their (start_pos, length) positions.
  • Build a full-length prompt_embeds tensor and is_token_ids mask to handle pre-computed embeddings.
  • Verify that the existing prefix caching mechanism works correctly with mixed mode without modification.

Example

{
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Let's work on this task!"},
        {"type": "prompt_embeds", "data": "<base64_encoded_tensor>"}
      ]
    }
  ]
}

Notes

The implementation should follow the same content part pattern used for other multimodal content parts, such as image_embeds, and reuse the existing --enable-prompt-embeds infrastructure.

Recommendation

Apply the proposed changes to add support for prompt_embeds in the Chat Completions API, as it provides a consistent and efficient way to handle pre-computed embeddings in multi-turn conversations.
