vllm: [Feature]: Enable simultaneous generate and embed endpoints in a single vLLM instance

vllm-project/vllm#37971 (fetched 2026-04-08 01:22:22)


🚀 The feature, motivation and pitch

Feature Request: Enable simultaneous generate and embed endpoints in a single vLLM instance

As referenced in https://github.com/vllm-project/vllm/issues/11905, it would be highly beneficial to support running a single vLLM instance that exposes both generate and embed endpoints concurrently.

Although that issue is marked as resolved via https://github.com/vllm-project/vllm/issues/33118, the current solution does not fully cover this use case. The workaround uses the generate endpoint to extract hidden states, which are written to disk (configured via kv_connector_extra_config and shared_storage_path). The embed endpoint itself remains unavailable when the task is set to generate.

This approach introduces additional complexity and overhead, and does not provide a true embedding API experience.
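
To illustrate the extra work the workaround leaves to the caller: once per-token hidden states have been extracted, they still have to be pooled into a single vector before they can stand in for an embedding. The following is a minimal sketch, assuming the hidden states have already been loaded into a list of per-token float vectors (the on-disk format is not described in this issue, so loading is omitted). Mean pooling followed by L2 normalization is an assumption here; a given model may expect a different strategy, such as last-token pooling. The function names are hypothetical.

```python
import math


def mean_pool(hidden_states):
    """Average per-token vectors into one vector of the same width."""
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one token vector")
    dim = len(hidden_states[0])
    sums = [0.0] * dim
    for token_vec in hidden_states:
        if len(token_vec) != dim:
            raise ValueError("all token vectors must have the same length")
        for i, value in enumerate(token_vec):
            sums[i] += value
    return [s / len(hidden_states) for s in sums]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def hidden_states_to_embedding(hidden_states):
    """Assumed pipeline: mean pooling, then L2 normalization."""
    return l2_normalize(mean_pool(hidden_states))
```

A native embed endpoint would perform this pooling server-side, which is part of what the request means by a true embedding API experience.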

Requested improvement

Enable native support for the embed endpoint alongside the generate endpoint within the same vLLM instance. This would allow users to:

  • Generate text and compute embeddings without maintaining separate deployments
  • Use both base models and LoRA-adapted models seamlessly for both tasks
  • Avoid disk-based intermediate steps and associated latency/overhead

Such functionality would significantly simplify deployment and improve efficiency for applications that require both text generation and embedding capabilities.
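
From the client side, the requested behaviour would amount to sending both kinds of request to one base URL. The sketch below builds such requests against the OpenAI-compatible routes /v1/completions and /v1/embeddings using only the standard library. The base URL and the model name are assumptions for illustration, and per this issue a server started for generation does not currently answer the embeddings route.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: a single server instance
MODEL = "my-model"  # hypothetical model name


def build_request(path, payload):
    """Build a JSON POST request for the given route without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_request(prompt, max_tokens=32):
    """Request body for text generation."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return build_request("/v1/completions", payload)


def embedding_request(text):
    """Request body for embeddings, sent to the same server."""
    return build_request("/v1/embeddings", {"model": MODEL, "input": text})


def send(req):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling send(completion_request("Hello")) and send(embedding_request("Hello")) against the same instance is the usage pattern the feature request describes.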

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To expose generate and embed endpoints from a single vLLM instance, the serving code would need to support both concurrently. The steps below are a proposed plan, not an existing vLLM feature:

  • Update the vLLM instance to use a shared model for both generate and embed tasks
  • Modify the generate endpoint to return the hidden states in memory instead of writing to disk
  • Implement a new embed endpoint that uses the shared model to compute embeddings
  • Update the routing logic to handle both endpoints

Example Code

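# Illustrative sketch only. vLLM does not export a `VLLM` class or a
# `shared_model` option, so `SharedModel` below is a hypothetical stand-in
# for one engine that serves both tasks from a single set of weights.
from flask import Flask, request


class SharedModel:
    """Hypothetical placeholder; wire in a real inference backend here."""

    def generate(self, text, task):
        # Should return hidden states held in memory, not written to disk.
        raise NotImplementedError

    def embed(self, text):
        # Should return one embedding vector for the input text.
        raise NotImplementedError


model = SharedModel()

# Generate helper: returns hidden states in memory
def generate(text, task):
    return model.generate(text, task)

# Embed helper: uses the same shared model
def embed(text):
    return model.embed(text)

# Routing logic that handles both endpoints
app = Flask(__name__)

@app.route('/generate', methods=['POST'])
def generate_endpoint():
    text = request.json['text']
    task = request.json['task']
    hidden_states = generate(text, task)
    return {'hidden_states': hidden_states}

@app.route('/embed', methods=['POST'])
def embed_endpoint():
    text = request.json['text']
    embeddings = embed(text)
    return {'embeddings': embeddings}

if __name__ == '__main__':
    # Port 5000 matches the curl commands in the Verification section.
    app.run(port=5000)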

Verification

To verify that the fix worked, you can test the generate and embed endpoints using a tool like curl or a REST client. For example:

curl -X POST -H "Content-Type: application/json" -d '{"text": "Hello World", "task": "generate"}' http://localhost:5000/generate
curl -X POST -H "Content-Type: application/json" -d '{"text": "Hello World"}' http://localhost:5000/embed

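With a working backend behind both routes, each request should return a JSON object: /generate responds with a hidden_states field and /embed with an embeddings field.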
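
The response check can also be scripted. The sketch below validates the shape of each response body, assuming the field names hidden_states and embeddings used in the example code and assuming the embedding is returned as a flat list of numbers. The helper names are hypothetical.

```python
import json


def is_number_list(value):
    """True for a non-empty list of ints or floats (booleans excluded)."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in value
        )
    )


def check_generate_response(body):
    """The /generate response should be a JSON object with hidden_states."""
    data = json.loads(body)
    return isinstance(data, dict) and "hidden_states" in data


def check_embed_response(body):
    """The /embed response should carry a non-empty numeric embeddings list."""
    data = json.loads(body)
    return isinstance(data, dict) and is_number_list(data.get("embeddings"))
```

Passing the raw output of each curl command to the matching helper gives a quick pass or fail signal without inspecting the JSON by hand.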
