vllm - 💡(How to fix) Fix Support SentenceTransformer Dense projection layers for embedding models (stella_en_1.5B_v5) [1 participants]

Q: Expected behavior

vLLM should detect the `modules.json` in SentenceTransformer models and load any `Dense` projection modules, applying them as part of the pooling step.

vllm · 2026-04-11 16:20:51

GitHub stats

vllm-project/vllm#39579 • Fetched 2026-04-12 13:24:39

Author

tomlue

Participants

tomlue

`stella_en_1.5B_v5` (and other SentenceTransformer models with a `2_Dense_*` projection layer) cannot be served correctly via vLLM's `--runner pooling` mode. vLLM only loads weights from the root `model.safetensors` and ignores the `modules.json` file and the `2_Dense_*` subdirectories that SentenceTransformer models use for linear projection layers.

Root Cause

The server starts and returns embeddings, but they are 1536-dim instead of 1024-dim, and do not match sentence-transformers output because 2_Dense_1024/model.safetensors is never loaded.

Fix Action

Workaround

Currently, the only correct ways to serve stella are sentence-transformers directly or Infinity (for which no ARM64 image is available). sentence-transformers reaches ~46 vecs/s versus vLLM's ~2950 vecs/s, a 64x throughput gap that makes stella impractical for high-volume pipelines.
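To put the throughput gap in concrete terms, the arithmetic can be sketched as follows. The two rates are the figures quoted in the report; the corpus size of ten million vectors and the helper names are hypothetical, chosen only for illustration:

```python
def speedup(fast_rate, slow_rate):
    """Ratio between two throughput figures given in vectors per second."""
    return fast_rate / slow_rate


def hours_to_embed(num_vectors, vectors_per_second):
    """Wall-clock hours needed to embed num_vectors at a steady rate."""
    return num_vectors / vectors_per_second / 3600


ST_RATE = 46          # vecs/s, sentence-transformers figure from the report
VLLM_RATE = 2950      # vecs/s, vLLM figure from the report
CORPUS = 10_000_000   # hypothetical corpus size, not from the report

print(round(speedup(VLLM_RATE, ST_RATE), 1))         # throughput ratio
print(round(hours_to_embed(CORPUS, ST_RATE), 1))     # hours with sentence-transformers
print(round(hours_to_embed(CORPUS, VLLM_RATE), 1))   # hours with vLLM
```

Under these assumptions the same job takes about 60 hours at the sentence-transformers rate and under one hour at the vLLM rate.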

Code Example

vllm serve dunzhang/stella_en_1.5B_v5 \
  --runner pooling \
  --trust-remote-code \
  --dtype bfloat16 \
  --override-pooler-config '{"pooling_type": "MEAN"}'
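Once a server started with this command is running, the dimension mismatch can be confirmed from its response. The following is a minimal sketch, assuming the server returns embeddings in the OpenAI-style `data[i].embedding` layout; the helper names `embedding_dims` and `check_dims` and the sample payload are illustrative and not part of vLLM:

```python
import json


def embedding_dims(response_body: str) -> list[int]:
    """Return the vector length of every embedding in an OpenAI-style response."""
    payload = json.loads(response_body)
    return [len(item["embedding"]) for item in payload["data"]]


def check_dims(response_body: str, expected: int) -> bool:
    """True only if the response holds at least one embedding and all have the expected length."""
    dims = embedding_dims(response_body)
    return bool(dims) and all(d == expected for d in dims)


# Stand-in for a real response body. Per the report, a live server would
# return 1536 floats per input for stella_en_1.5B_v5 instead of 1024.
sample = json.dumps({"data": [{"index": 0, "embedding": [0.0] * 1536}]})

print(embedding_dims(sample))     # vector lengths found in the response
print(check_dims(sample, 1024))   # whether they match the expected size
```

In practice the string passed to `check_dims` would be the body of a real embeddings request rather than the synthetic `sample`.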

Summary

Impact

stella_en_1.5B_v5 is the highest-performing open embedding model in the 1–2B parameter range on MTEB retrieval (nDCG@10: 61.01 — beats text-embedding-004's 55.70 and e5-mistral-7b's 56.89). It's widely used for RAG/semantic search pipelines.

Without the projection layer, vLLM returns raw 1536-dim mean-pool vectors instead of the correct normalized 1024-dim embeddings. This produces embeddings that do not match sentence-transformers output and likely degrades retrieval quality significantly.
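The missing step can be sketched in a few lines. This is a pure-Python illustration, assuming the Dense module is a plain linear layer (weight, optional bias, identity activation) applied after mean pooling and followed by L2 normalization. The function names and the toy 4-to-2 dimensions are hypothetical stand-ins for the 1536-to-1024 projection; none of this is vLLM or sentence-transformers code:

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    return [sum(col) / len(kept) for col in zip(*kept)]


def dense_project(vec, weight, bias=None):
    """Apply y = W x + b, with weight stored as out_features rows of in_features."""
    out = [sum(w * x for w, x in zip(row, vec)) for row in weight]
    if bias is not None:
        out = [y + b for y, b in zip(out, bias)]
    return out


def l2_normalize(vec):
    """Scale the vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


tokens = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]  # last token is padding and is ignored
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # toy 4 -> 2 projection
bias = [0.0, 1.0]

pooled = mean_pool(tokens, mask)  # the stage at which vLLM stops, per the report
embedding = l2_normalize(dense_project(pooled, weight, bias))
print(len(pooled), len(embedding))
```

The point of the sketch is the ordering: the projection changes the vector size, so skipping it yields vectors of a different dimension that cannot be compared with sentence-transformers output.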

Related

#10119: stella not supported (closed as not planned, June 2025)
#22614: unmerged PR attempting to add generic ST Dense projection loading (closed stale, Nov 2025)

Request

Either:

Revive and merge a cleaned-up version of #22614
Add a SentenceTransformerDensePooler to the vLLM pooling infrastructure that reads modules.json and loads projection layers
Or provide a supported path to pass custom projection weights via --override-pooler-config
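As a rough illustration of the second option, the sketch below resolves which projection layers a loader would need and what output dimension results. It assumes each Dense directory holds a `config.json` with `in_features`, `out_features` and `bias` keys, as sentence-transformers writes them. `DenseSpec`, `load_dense_specs` and `output_dim` are hypothetical names, not existing vLLM APIs, and no weights are loaded:

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DenseSpec:
    path: str
    in_features: int
    out_features: int
    bias: bool


def load_dense_specs(model_dir):
    """Read modules.json and return a DenseSpec per Dense module, in idx order."""
    root = Path(model_dir)
    modules = json.loads((root / "modules.json").read_text(encoding="utf-8"))
    specs = []
    for module in sorted(modules, key=lambda m: m["idx"]):
        if not module["type"].endswith(".Dense"):
            continue
        cfg_path = root / module["path"] / "config.json"
        cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
        specs.append(DenseSpec(module["path"], cfg["in_features"],
                               cfg["out_features"], cfg.get("bias", True)))
    return specs


def output_dim(hidden_size, specs):
    """Chain the projections and return the final embedding dimension."""
    dim = hidden_size
    for spec in specs:
        if spec.in_features != dim:
            raise ValueError(f"{spec.path} expects {spec.in_features}, got {dim}")
        dim = spec.out_features
    return dim


# Demo on a synthetic directory that mimics the layout described in the report.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2_Dense_1024").mkdir()
    (root / "modules.json").write_text(json.dumps([
        {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
        {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
        {"idx": 2, "name": "2", "path": "2_Dense_1024", "type": "sentence_transformers.models.Dense"},
    ]), encoding="utf-8")
    (root / "2_Dense_1024" / "config.json").write_text(json.dumps(
        {"in_features": 1536, "out_features": 1024, "bias": True}), encoding="utf-8")
    specs = load_dense_specs(root)
    final_dim = output_dim(1536, specs)

print(final_dim)
```

A real pooler would additionally load the weights from each Dense directory and apply them after pooling; that part depends on vLLM internals and is left out here.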

extent analysis

TL;DR

The most likely fix is to implement a custom pooler that loads the SentenceTransformer model's projection layers from the modules.json file.

Guidance

Verify that the modules.json file and 2_Dense_* subdirectories are present in the model directory and contain the necessary projection layer weights.
Consider reviving and merging the unmerged PR #22614, which attempted to add generic ST Dense projection loading, as a potential solution.
Alternatively, explore adding a SentenceTransformerDensePooler to the vLLM pooling infrastructure to read modules.json and load projection layers.
Investigate providing a supported path to pass custom projection weights via --override-pooler-config as a possible workaround.
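The first check in this list can be scripted. The following is a minimal sketch, assuming a locally downloaded model directory, the usual `modules.json` layout (a list of entries with `path` and `type` fields), and weights stored as `model.safetensors` inside each Dense directory as in the report; `find_dense_modules` is a hypothetical helper name:

```python
import json
import tempfile
from pathlib import Path

DENSE_TYPE = "sentence_transformers.models.Dense"


def find_dense_modules(model_dir):
    """Return (path, has_weights) for each Dense module listed in modules.json."""
    root = Path(model_dir)
    modules_file = root / "modules.json"
    if not modules_file.is_file():
        return []
    modules = json.loads(modules_file.read_text(encoding="utf-8"))
    found = []
    for module in modules:
        if module.get("type") == DENSE_TYPE:
            weights = root / module["path"] / "model.safetensors"
            found.append((module["path"], weights.is_file()))
    return found


# Demo on a synthetic directory; point find_dense_modules at a real
# model snapshot to run the same check on downloaded files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2_Dense_1024").mkdir()
    (root / "2_Dense_1024" / "model.safetensors").touch()
    (root / "modules.json").write_text(json.dumps([
        {"idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer"},
        {"idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling"},
        {"idx": 2, "name": "2", "path": "2_Dense_1024", "type": DENSE_TYPE},
    ]), encoding="utf-8")
    result = find_dense_modules(root)

print(result)
```

An empty result means the model has no Dense module to load; an entry whose second field is false means `modules.json` references a projection whose weights are missing from the directory.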

Example

No code example of the fix itself is provided, because it requires a custom implementation inside vLLM's pooling infrastructure.

Notes

The current workaround of using sentence-transformers directly or Infinity has a significant throughput gap compared to vLLM, making it impractical for high-volume pipelines. The requested changes would require modifications to the vLLM pooling infrastructure or the addition of custom pooler functionality.

Recommendation

Revive and merge a cleaned-up version of #22614, the unmerged PR that attempted to load SentenceTransformer Dense projection layers. This would fix the problem in vLLM itself rather than work around it, and would let high-volume pipelines keep vLLM's throughput instead of falling back to sentence-transformers or Infinity.
