vllm - ✅(Solved) Fix [Bug]: `/v2/embed` with `input_type` returns misleading 400 on nemotron-embed-vl [1 pull requests, 1 participants]

vllm2026-04-22 11:30:21

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#40616•Fetched 2026-04-23 07:23:55

View on GitHub

Comments

Participants

Timeline

Reactions

Author

oliverholworthy

Participants

oliverholworthy

Timeline (top)

cross-referenced ×1labeled ×1

The Cohere /v2/embed path synthesizes [system, user] messages internally when input_type maps to a task_instructions prefix (#38362). Chat templates that guard against messages | length > 1 then reject this 2-message input with an error that
reads like a caller-side problem. Today this breaks nvidia/llama-nemotron-embed-vl-1b-v2.

Error Message

The Cohere /v2/embed path synthesizes [system, user] messages internally when input_type maps to a task_instructions prefix (#38362). Chat templates that guard against messages | length > 1 then reject this 2-message input with an error that
{"error":{"message":"Embedding models should only embed one message at a time","type":"BadRequestError","code":400}} l back to inlining the task prefix into user content (pre-#38362 behavior) or return an error that names the real cause.

The error string "Embedding models should only embed one message at a time" is repeated across the three guarded templates and reads as universal — worth rewording independently of the fix.

Root Cause

EmbedIOProcessor._mixed_input_to_messages (vllm/entrypoints/pooling/embed/io_processor.py:298-321) builds a 2-message list [{role:"system", content:task_prefix}, {role:"user", content:text}] whenever a task_prefix is present. _pre_process_coh\ ere_online then routes that through _batch_render_chat, where the template's {% if messages | length > 1 %}{{ raise_exception(...) }}{% endif %} guard fires. The caller never sent messages and has no way to know the server is responsible for the 2-message shape.

Fix Action

Fixed

Fixed by PR: [Feature] Batch embedding support for /v1/embeddings messages format (https://github.com/vllm-project/vllm/pull/40042)

PR fix notes

PR #40042: [Feature] Batch embedding support for /v1/embeddings messages format

Repository: vllm-project/vllm
Author: oliverholworthy
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/40042

Description (problem / solution / changelog)

Purpose

Each message in the /v1/embeddings messages list now produces a separate embedding, enabling efficient batching for multimodal embedding models.

Previously, all messages were concatenated into a single conversation producing one embedding. This forced models like nemotron-embed-vl to guard against multiple messages in their chat templates and prevented batch embedding via the messages format.

The change is scoped to EmbedIOProcessor only — other pooling endpoints (classification, scoring) retain existing multi-message-as-conversation behavior. The downstream pipeline (engine submission, result collection, response building) already supported multiple engine inputs.

Test Plan

Unit tests added in tests/entrypoints/pooling/embed/test_io_processor.py covering:
- Multiple messages produce multiple engine inputs
- Single message produces one engine input
- Untrusted chat template rejection
- Prompt extras passthrough (mm_processor_kwargs, cache_salt)
- Chunked processing runs after chat preprocessing
Integration tests updated in tests/entrypoints/pooling/embed/test_online.py:
- Single message produces 1 embedding matching the completion path
- Multiple messages produce multiple embeddings (one per message)
- Each batch embedding matches its individual single-message request
- add_generation_prompt, continue_final_message, add_special_tokens flags still work
- Conflicting flags error still works

# Unit tests
.venv/bin/python -m pytest tests/entrypoints/pooling/embed/test_io_processor.py -v

# Integration tests
.venv/bin/python -m pytest tests/entrypoints/pooling/embed/test_online.py -v

Test Result

Unit tests pass. Integration tests require a running vLLM server with embedding model support.

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

</details>

Changed files

tests/entrypoints/pooling/embed/test_io_processor.py (modified, +164/-0)
tests/entrypoints/pooling/embed/test_online.py (modified, +63/-38)
vllm/entrypoints/pooling/embed/io_processor.py (modified, +51/-0)

Code Example

vllm serve nvidia/llama-nemotron-embed-vl-1b-v2 \
  --trust-remote-code --max-model-len 10240 \
  --chat-template examples/pooling/embed/template/nemotron_embed_vl.jinja

---

curl -sS -X POST http://localhost:8000/v2/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "texts": ["machine learning", "the cat sat on the mat"],
    "input_type": "query",
    "embedding_types": ["float"]
  }'

---

{"error":{"message":"Embedding models should only embed one message at a time","type":"BadRequestError","code":400}}

RAW_BUFFERClick to expand / collapse

Your current environment

n/a - not an env specifc issue

🐛 Describe the bug

[Bug] `/v2/embed` with `input_type` returns misleading 400 on nemotron-embed-vl

Summary

Reproducer

vllm serve nvidia/llama-nemotron-embed-vl-1b-v2 \
  --trust-remote-code --max-model-len 10240 \
  --chat-template examples/pooling/embed/template/nemotron_embed_vl.jinja

curl -sS -X POST http://localhost:8000/v2/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2",
    "texts": ["machine learning", "the cat sat on the mat"],
    "input_type": "query",
    "embedding_types": ["float"]
  }'

{"error":{"message":"Embedding models should only embed one message at a time","type":"BadRequestError","code":400}}

Root cause

Scope

The failure requires both: model declares task_instructions and is served with a length-guarded template. Bundled templates today:

Template	Guards length > 1?
`examples/pooling/embed/template/nemotron_embed_vl.jinja`	yes
`examples/pooling/embed/template/vlm2vec_phi3v.jinja`	yes
`examples/pooling/embed/template/vlm2vec_qwen2vl.jinja`	yes
`examples/pooling/embed/template/dse_qwen2_vl.jinja`	no (composes)
`vllm/transformers_utils/chat_templates/template_basic.jinja`	no (concatenates)

Nemotron-embed-vl is the only model that currently hits both conditions. Future multimodal embedding models that declare task_instructions and ship with a guarded template will break on arrival.

Proposed direction

Move the multi-message policy out of jinja and into EmbedIOProcessor. Declare per-model/template capability (e.g. supports_multi_message_chat) and check it in _pre_process_cohere_online before rendering. For incompatible combinations, either fal
l back to inlining the task prefix into user content (pre-#38362 behavior) or return an error that names the real cause.

Related observations (not this issue)

The error string "Embedding models should only embed one message at a time" is repeated across the three guarded templates and reads as universal — worth rewording independently of the fix.
template_basic.jinja silently concatenates multi-message input for CLIP/Siglip/ColPali/PaliGemma/Chameleon, which is a separate footgun that may deserve its own issue.

Out of scope

Changing the /v1/embeddings messages contract (single conversation → single embedding). See #40042 discussion.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Move the multi-message policy out of templates and into EmbedIOProcessor to fix the misleading 400 error on /v2/embed with input_type for models like nvidia/llama-nemotron-embed-vl-1b-v2.

Guidance

Identify models that declare task_instructions and are served with length-guarded templates, as these are prone to the error.
Modify EmbedIOProcessor to declare per-model/template capability (e.g., supports_multi_message_chat) and check it before rendering.
For incompatible combinations, either fall back to inlining the task prefix into user content or return an error that names the real cause.
Review and adjust template guards and error messages to provide clearer feedback to callers.

Example

No specific code example is provided due to the complexity of the issue, but the fix involves adjusting the EmbedIOProcessor and potentially the template rendering logic.

Notes

This solution focuses on addressing the immediate issue with nvidia/llama-nemotron-embed-vl-1b-v2 and similar models. It does not address the broader issue of silent concatenation in certain templates or the contract of /v1/embeddings, which are noted as out of scope or deserving of separate issues.

Recommendation

Apply the workaround by modifying EmbedIOProcessor to handle multi-message policies, as this directly addresses the root cause of the misleading 400 error without requiring a version upgrade or more invasive changes.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#orchestration issue #cache issue #memory leak #API versioning #request timeout

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.