vllm - ✅(Solved) Fix [RFC]: Create separation between independent HTTP API definitions [1 pull requests, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#37204 (fetched 2026-04-08 00:48:43)
Comments: 4
Participants: 4
Timeline: 13
Reactions: 1
Timeline (top): commented ×4, subscribed ×4, mentioned ×3, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #37074: [Feature][Frontend] add support for Cohere Embed v2 API

Description (problem / solution / changelog)

Purpose

Implementation for RFC #37000

Add a Cohere-compatible /v2/embed endpoint (https://docs.cohere.com/reference/embed) to vLLM's pooling API. This is an adapter layer on top of the existing OpenAI /v1/embeddings pipeline, which maximizes code reuse. Any embedding model that already works with /v1/embeddings can be used with /v2/embed.

Key features of the Cohere API compared to existing embedding endpoint:

  • support for embedding_type with automatic conversion for: float, binary, ubinary, base64
  • support for input_type with dynamic prompt prefixes loaded from the model's task_instructions (in config.json) or prompts (in config_sentence_transformers.json)
  • support for truncation, which truncates the input using one of several strategies (END, START, NONE) when it exceeds the maximum number of tokens.
  • support for output_dimension
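The embedding type conversions listed above can be illustrated with a minimal sketch. This is not the PR's implementation: the function names are hypothetical, and the conventions are assumptions (one sign bit per dimension packed most significant bit first, "binary" as the unsigned byte shifted by 128, and "base64" as base64-encoded little-endian float32 bytes).

```python
import base64
import struct


def pack_sign_bits(values):
    """Pack one bit per dimension (1 where value > 0) into unsigned bytes, MSB first."""
    packed = []
    for start in range(0, len(values), 8):
        chunk = values[start:start + 8]
        byte = 0
        for v in chunk:
            byte = (byte << 1) | (1 if v > 0 else 0)
        byte <<= 8 - len(chunk)  # pad a short final chunk with zero bits
        packed.append(byte)
    return packed


def convert_embedding(values, embedding_type):
    """Convert a float embedding to the requested type (assumed conventions)."""
    if embedding_type == "float":
        return [float(v) for v in values]
    if embedding_type == "ubinary":
        return pack_sign_bits(values)
    if embedding_type == "binary":
        # Assumption: signed bytes are the unsigned bytes shifted into [-128, 127].
        return [b - 128 for b in pack_sign_bits(values)]
    if embedding_type == "base64":
        # Assumption: raw little-endian float32 bytes, base64-encoded.
        raw = struct.pack(f"<{len(values)}f", *values)
        return base64.b64encode(raw).decode("ascii")
    raise ValueError(f"unsupported embedding type: {embedding_type}")


vec = [0.5, -0.1, 0.2, 0.9, -0.3, -0.7, 0.1, -0.2]
print(convert_embedding(vec, "ubinary"))  # -> [178]
print(convert_embedding(vec, "binary"))   # -> [50]
```

Packed binary forms shrink an embedding to one bit per dimension, which is why an API would offer them alongside full-precision floats.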

There was only one slightly disruptive change:

  1. Add truncation_side field to TokenizeParams and all pooling request protocols to support left-side truncation (for truncate START)
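To make the relationship between the truncation strategies and truncation_side concrete, here is a minimal sketch over a list of token IDs. The helper name and its error handling are hypothetical and do not reflect vLLM's TokenizeParams; the sketch only assumes that END drops tokens from the right, START drops them from the left, and NONE rejects over-long input.

```python
def truncate_tokens(token_ids, max_tokens, truncate="END"):
    """Apply a Cohere-style truncation strategy to a list of token IDs."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    if truncate == "END":
        # Right-side truncation: keep the first max_tokens tokens.
        return list(token_ids[:max_tokens])
    if truncate == "START":
        # Left-side truncation: keep the last max_tokens tokens.
        return list(token_ids[-max_tokens:])
    if truncate == "NONE":
        raise ValueError(
            f"input has {len(token_ids)} tokens, exceeding the limit of {max_tokens}"
        )
    raise ValueError(f"unknown truncation strategy: {truncate}")


print(truncate_tokens([1, 2, 3, 4, 5], 3, "END"))    # -> [1, 2, 3]
print(truncate_tokens([1, 2, 3, 4, 5], 3, "START"))  # -> [3, 4, 5]
```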

Test Plan

Added a bunch of new tests

tests/entrypoints/pooling/embed/test_protocol.py
tests/entrypoints/pooling/embed/test_io_processor.py
tests/entrypoints/pooling/embed/test_cohere_online.py
tests/entrypoints/pooling/embed/test_cohere_online_vision.py
tests/entrypoints/pooling/embed/test_cohere_openai_parity.py

Test Result

All new tests pass ✅


<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • docs/serving/openai_compatible_server.md (modified, +134/-0)
  • tests/entrypoints/pooling/embed/test_cohere_online.py (added, +310/-0)
  • tests/entrypoints/pooling/embed/test_cohere_online_vision.py (added, +135/-0)
  • tests/entrypoints/pooling/embed/test_cohere_openai_parity.py (added, +102/-0)
  • tests/entrypoints/pooling/embed/test_io_processor.py (added, +208/-0)
  • tests/entrypoints/pooling/embed/test_protocol.py (added, +129/-0)
  • vllm/entrypoints/pooling/base/protocol.py (modified, +9/-1)
  • vllm/entrypoints/pooling/classify/protocol.py (modified, +2/-0)
  • vllm/entrypoints/pooling/embed/api_router.py (modified, +26/-5)
  • vllm/entrypoints/pooling/embed/io_processor.py (modified, +302/-17)
  • vllm/entrypoints/pooling/embed/protocol.py (modified, +167/-3)
  • vllm/entrypoints/pooling/embed/serving.py (modified, +53/-11)
  • vllm/entrypoints/pooling/pooling/protocol.py (modified, +3/-0)
  • vllm/entrypoints/pooling/score/protocol.py (modified, +2/-0)
  • vllm/entrypoints/pooling/typing.py (modified, +2/-0)
  • vllm/renderers/params.py (modified, +24/-2)

Motivation.

Right now we have overloaded the "OpenAI-compatible" API server with APIs that come from many different sources. As a recent example, PR #37074 proposes support for v2/embed from Cohere and adds it to our existing API server. If OpenAI were to introduce its own v2/embed API in the future, the two definitions would collide on the same path, and we could not support OpenAI's version without breaking backwards compatibility.

https://docs.vllm.ai/en/latest/serving/openai_compatible_server/#supported-apis

In the "Supported APIs" section of the docs, the APIs in question are under "In addition, we have the following custom APIs:".

Another reason this is problematic is that we regularly receive security reports that stem from APIs within the same server having drastically different security considerations. For example:

  • Some APIs support API tokens, while others do not
  • Some APIs are intended for end users, while others are for internal usage only and would be a major security risk if exposed.

Having cleaner separation of the APIs would help us maintain cleaner security boundaries between APIs based on their intended usage and exposure.

Proposed Change.

Split APIs based on their source into their own HTTP endpoints. Do this in accordance with the vLLM deprecation policy.

  • Step 1 Allow APIs to run on their own endpoint (port number), but not by default. Announce that this will change in the future.
  • Step 2 Run APIs on separate port numbers by default, but allow re-combining them.
  • Step 3 Run APIs on separate port numbers.

Expected API endpoints would include, at least:

  • OpenAI APIs only
    • Specific APIs may include vLLM custom extensions, but the APIs are explicitly OpenAI's definition
  • Cohere APIs
    • v1/rerank also includes some extensions from Jina's API, but is still compatible with Cohere's definition of v1/rerank
    • new v2/embed would go here (#37074)
  • vLLM custom APIs -- APIs that are entirely custom to vLLM

A more complete spec would be produced prior to implementation if there is general support for the direction.

Also note that if someone would like to include all of these APIs under the same endpoint, even after Step 3 above, it is still easy to do that via a proxy in front of vLLM, which would almost always be included in a production deployment.
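The routing decision such a proxy would make can be sketched as a small prefix table. The upstream addresses, port numbers, and the fallback to the vLLM custom group are assumptions for illustration; a real deployment would express the same mapping in its proxy's own configuration.

```python
# Hypothetical upstream addresses, one per API group.
UPSTREAMS = {
    "openai": "http://127.0.0.1:8000",
    "cohere": "http://127.0.0.1:8001",
    "vllm": "http://127.0.0.1:8002",
}

# Most specific prefixes first, so /v1/rerank wins over the generic /v1/ rule.
ROUTES = [
    ("/v1/rerank", "cohere"),
    ("/v2/embed", "cohere"),
    ("/v1/", "openai"),
]


def pick_upstream(path: str) -> str:
    """Return the upstream base URL for a request path."""
    for prefix, group in ROUTES:
        if path.startswith(prefix):
            return UPSTREAMS[group]
    # Anything not claimed above is treated as a vLLM custom API (assumption).
    return UPSTREAMS["vllm"]


print(pick_upstream("/v1/embeddings"))  # -> http://127.0.0.1:8000
print(pick_upstream("/v2/embed"))       # -> http://127.0.0.1:8001
```

Keeping the table ordered from most to least specific is what allows Cohere's v1/rerank to coexist with OpenAI's other v1 paths behind a single public endpoint.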

Feedback Period.

TBD

CC List.

No response

Any Other Things.

No response


extent analysis

Fix Plan

To address the overloaded API server and the resulting security concerns, we will split APIs based on their source into separate HTTP endpoints. Here are the steps:

  • Step 1: Allow APIs to run on their own endpoint (port number) but not by default.
    • Create a new configuration option to enable separate endpoints.
    • Update the API server to support multiple endpoints.
  • Step 2: Run APIs on separate port numbers by default, but allow re-combining them.
    • Update the default configuration to use separate endpoints.
    • Add an option to combine APIs into a single endpoint.
  • Step 3: Run APIs on separate port numbers.
    • Remove the option to combine APIs into a single endpoint.
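The configuration option described in these steps might look like the following sketch. The class name, field names, and port numbers are all hypothetical, since the RFC leaves the complete spec to a later stage; the sketch only shows how a single flag could switch between a combined port and one port per API group.

```python
from dataclasses import dataclass, field

# API groups named in the RFC.
API_GROUPS = ("openai", "cohere", "vllm")


@dataclass
class EndpointConfig:
    """Hypothetical settings controlling how API groups map to ports."""

    base_port: int = 8000
    # Step 1: opt-in (False by default). Step 2 would flip the default to True.
    separate_endpoints: bool = False
    # Optional explicit port per group, applied only in separate mode.
    port_overrides: dict = field(default_factory=dict)

    def ports(self) -> dict:
        """Return the port each API group should listen on."""
        if not self.separate_endpoints:
            # Combined mode: every group shares the base port.
            return {group: self.base_port for group in API_GROUPS}
        # Separate mode: consecutive ports starting at base_port, then overrides.
        assigned = {group: self.base_port + i for i, group in enumerate(API_GROUPS)}
        assigned.update(self.port_overrides)
        return assigned


print(EndpointConfig().ports())
# -> {'openai': 8000, 'cohere': 8000, 'vllm': 8000}
print(EndpointConfig(separate_endpoints=True).ports())
# -> {'openai': 8000, 'cohere': 8001, 'vllm': 8002}
```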

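Illustrative sketch in Python using Flask (vLLM's own server is built on FastAPI, so this only demonstrates the one-app-per-port idea; routes, handlers, and port numbers are placeholders):

import threading

from flask import Flask, jsonify

# One Flask app per API source, so each can be bound to its own port.
openai_app = Flask('openai')
cohere_app = Flask('cohere')
vllm_app = Flask('vllm')


@openai_app.route('/v1/embeddings', methods=['POST'])
def openai_embeddings():
    # Placeholder for the OpenAI embeddings API implementation.
    return jsonify({'api': 'openai', 'endpoint': '/v1/embeddings'})


@cohere_app.route('/v1/rerank', methods=['POST'])
def cohere_rerank():
    # Placeholder for the Cohere rerank API implementation.
    return jsonify({'api': 'cohere', 'endpoint': '/v1/rerank'})


@vllm_app.route('/custom-api', methods=['POST'])
def vllm_custom_api():
    # Placeholder for a vLLM custom API implementation.
    return jsonify({'api': 'vllm', 'endpoint': '/custom-api'})


if __name__ == '__main__':
    # app.run() blocks, so each server is started in its own thread;
    # calling run() three times in sequence would only start the first app.
    servers = [(openai_app, 5001), (cohere_app, 5002), (vllm_app, 5003)]
    threads = [
        threading.Thread(target=app.run, kwargs={'port': port})
        for app, port in servers
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()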

Verification

To verify that the fix worked, test each API source on its separate endpoint and port number. Ensure that each API source is only accessible on its designated port and that security boundaries are maintained.

Extra Tips

  • Use a proxy server to combine APIs into a single endpoint if needed.
  • Document the new endpoint structure and configuration options.
  • Test thoroughly to ensure that the separate endpoints do not introduce any new security vulnerabilities.
