vllm - 💡(How to fix) Fix [Feature]: Enable simultaneous generate and embed endpoints in a single vLLM instance [1 participants]

vllm • 2026-03-24 07:14:44

GitHub stats

vllm-project/vllm#37971•Fetched 2026-04-08 01:22:22

Author

biba10

Participants

biba10

Timeline (top)

labeled ×1

Fix Action

Fix / Workaround

Although this issue is marked as resolved via https://github.com/vllm-project/vllm/issues/33118, the current solution does not fully meet the expected use case. The workaround relies on using the generate endpoint to extract hidden states, which are written to disk (configured via kv_connector_extra_config and shared_storage_path). In this setup, the embed endpoint remains unavailable when the task is set to generate.
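A minimal sketch of how the workaround's settings could be assembled, assuming the storage path is nested under `kv_connector_extra_config` as the paragraph above suggests. The connector name is a hypothetical placeholder because the issue does not name it, and the example path is likewise an assumption.

```python
import json

# Hypothetical placeholder: the issue text does not name the connector class.
CONNECTOR_NAME = "<hidden-states-connector>"


def build_kv_transfer_config(storage_path: str) -> str:
    """Return a JSON string holding the connector settings for the workaround."""
    config = {
        "kv_connector": CONNECTOR_NAME,
        # The issue mentions both of these keys; nesting them is an assumption.
        "kv_connector_extra_config": {"shared_storage_path": storage_path},
    }
    return json.dumps(config)


# Example path chosen for illustration only.
print(build_kv_transfer_config("/tmp/hidden_states"))
```

A JSON string of this shape is what vLLM's `--kv-transfer-config` option takes. The hidden states still land on disk under `shared_storage_path`, which is the overhead this request wants to avoid.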

🚀 The feature, motivation and pitch

Feature Request: Enable simultaneous `generate` and `embed` endpoints in a single vLLM instance

As referenced in https://github.com/vllm-project/vllm/issues/11905, it would be highly beneficial to support running a single vLLM instance that exposes both generate and embed endpoints concurrently.

The current workaround, which extracts hidden states through the generate endpoint and writes them to disk, introduces additional complexity and overhead, and does not provide a true embedding API experience.

Requested improvement

Enable native support for the embed endpoint alongside the generate endpoint within the same vLLM instance. This would allow users to:

- Generate text and compute embeddings without maintaining separate deployments
- Use both base models and LoRA-adapted models seamlessly for both tasks
- Avoid disk-based intermediate steps and associated latency/overhead

Such functionality would significantly simplify deployment and improve efficiency for applications that require both text generation and embedding capabilities.
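To make the request concrete, here is a hedged sketch of what client code might look like if one instance served both tasks. It assumes the "generate" and "embed" endpoints map to the OpenAI-compatible routes `/v1/completions` and `/v1/embeddings`; the base URL and model name are placeholders, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

# Assumed values for illustration only.
BASE_URL = "http://localhost:8000"
MODEL = "my-model"  # hypothetical model name


def build_request(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request against one server."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Both requests target the same base URL, i.e. the same instance.
completion_req = build_request(
    "/v1/completions",
    {"model": MODEL, "prompt": "Hello World", "max_tokens": 16},
)
embedding_req = build_request(
    "/v1/embeddings",
    {"model": MODEL, "input": "Hello World"},
)
```

Today the second request would fail on an instance started for generation, which is the gap this issue describes.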

Alternatives

No response

Additional context

No response

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

Serving generate and embed endpoints from a single vLLM instance would require changes inside vLLM so that both endpoints run concurrently on the same loaded model. A possible outline, not an existing feature:

1. Have the instance load one shared model that serves both generate and embed tasks
2. Modify the generate path to return hidden states in memory instead of writing them to disk
3. Implement an embed endpoint that uses the shared model to compute embeddings
4. Update the routing logic to handle both endpoints

Example Code

# Conceptual sketch only: `VLLM` and `shared_model` are not part of vLLM's
# public Python API, so a hypothetical backend class stands in for a model
# that serves both tasks. Only the Flask routing below is concrete.
from flask import Flask, jsonify, request


class SharedModelBackend:
    """Hypothetical placeholder for one model instance serving both tasks."""

    def generate(self, text, task):
        # A real backend would return hidden states in memory (as a list)
        # instead of writing them to disk.
        raise NotImplementedError("plug in the generation call here")

    def embed(self, text):
        # A real backend would return the embedding as a list of floats.
        raise NotImplementedError("plug in the embedding call here")


backend = SharedModelBackend()
app = Flask(__name__)


# Route both endpoints to the same backend instance
@app.route('/generate', methods=['POST'])
def generate_endpoint():
    payload = request.get_json()
    hidden_states = backend.generate(payload['text'], payload.get('task', 'generate'))
    return jsonify({'hidden_states': hidden_states})


@app.route('/embed', methods=['POST'])
def embed_endpoint():
    payload = request.get_json()
    embeddings = backend.embed(payload['text'])
    return jsonify({'embeddings': embeddings})

Verification

To verify that the fix worked, you can test the generate and embed endpoints using a tool like curl or a REST client. For example:

curl -X POST -H "Content-Type: application/json" -d '{"text": "Hello World", "task": "generate"}' http://localhost:5000/generate
curl -X POST -H "Content-Type: application/json" -d '{"text": "Hello World"}' http://localhost:5000/embed

With a working model backend behind both routes, each request should return a JSON body: `hidden_states` from `/generate` and `embeddings` from `/embed`.
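The same checks can be scripted. The sketch below assumes the response keys used in the example code (`hidden_states` and `embeddings`) and only validates the shape of already-parsed JSON bodies; the helper names are hypothetical.

```python
def is_embedding_vector(value) -> bool:
    """True if value is a non-empty list of ints/floats (bools excluded)."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(
            isinstance(x, (int, float)) and not isinstance(x, bool)
            for x in value
        )
    )


def check_embed_response(body: dict) -> bool:
    """Check that an /embed response carries a usable embedding vector."""
    return is_embedding_vector(body.get("embeddings"))


def check_generate_response(body: dict) -> bool:
    """Check that a /generate response carries the hidden_states key."""
    return "hidden_states" in body
```

Pass each helper the decoded JSON from the matching curl call; a `False` result points to a missing key or a malformed vector.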

#api #environment setup #docker error #permission error #memory optimization
