vllm: Support SentenceTransformer Dense projection layers for embedding models (stella_en_1.5B_v5)

GitHub stats
vllm-project/vllm#39579 (fetched 2026-04-12 13:24:39)
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0

Summary

stella_en_1.5B_v5 (and other SentenceTransformer models with a 2_Dense_* projection layer) cannot be served correctly via vLLM's --runner pooling mode. vLLM only loads weights from the root model.safetensors and ignores the modules.json / 2_Dense_* subdirectories that SentenceTransformer models use for linear projection layers.

Impact

stella_en_1.5B_v5 is the highest-performing open embedding model in the 1–2B parameter range on MTEB retrieval (nDCG@10: 61.01 — beats text-embedding-004's 55.70 and e5-mistral-7b's 56.89). It's widely used for RAG/semantic search pipelines.

Without the projection layer, vLLM returns raw 1536-dim mean-pool vectors instead of the correct normalized 1024-dim embeddings. This produces embeddings that do not match sentence-transformers output and likely degrades retrieval quality significantly.
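
To make the missing step concrete, here is a minimal pure-Python sketch of the pipeline described above: mean pooling, then a Dense projection, then L2 normalization. The toy sizes (hidden size 3 standing in for 1536, output size 2 standing in for 1024) and all weight values are made up for illustration. Whether the real Dense layer uses a bias or an activation is an assumption to check against the model's own Dense config.

```python
import math


def mean_pool(token_vectors):
    """Average per-token hidden states into a single vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[i] for tok in token_vectors) / count for i in range(dim)]


def dense_project(vector, weight, bias):
    """Apply y = W x + b, with weight shaped (out_features, in_features)."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weight, bias)
    ]


def l2_normalize(vector):
    """Scale the vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Toy data: two tokens with hidden size 3 (illustrative values only).
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # shape (2, 3), hypothetical
bias = [0.0, 0.0]

pooled = mean_pool(tokens)                       # what vLLM currently returns
projected = dense_project(pooled, weight, bias)  # the step that is skipped
embedding = l2_normalize(projected)              # final unit-length vector
```

Stopping at `pooled` yields a vector of the hidden size; only after `dense_project` does the output have the smaller projected size, which is the 1536 versus 1024 mismatch reported here.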

Reproduction

vllm serve dunzhang/stella_en_1.5B_v5 \
  --runner pooling \
  --trust-remote-code \
  --dtype bfloat16 \
  --override-pooler-config '{"pooling_type": "MEAN"}'

The server starts and returns embeddings, but they are 1536-dim instead of 1024-dim, and do not match sentence-transformers output because 2_Dense_1024/model.safetensors is never loaded.
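
One way to confirm the symptom is to check the length of the returned vectors. The sketch below assumes the server is reachable at http://localhost:8000 and exposes the OpenAI-compatible /v1/embeddings route; the helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.request

EXPECTED_DIM = 1024  # output size implied by the 2_Dense_1024 directory name


def fetch_embedding(text, base_url="http://localhost:8000",
                    model="dunzhang/stella_en_1.5B_v5"):
    """POST one input to /v1/embeddings and return the parsed JSON response."""
    payload = json.dumps({"model": model, "input": [text]}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def embedding_dims(response):
    """Return the vector length of every embedding in the response."""
    return [len(item["embedding"]) for item in response["data"]]


def projection_applied(response, expected_dim=EXPECTED_DIM):
    """True only if every returned vector has the projected dimension."""
    return all(dim == expected_dim for dim in embedding_dims(response))
```

With a running server, `projection_applied(fetch_embedding("hello"))` would be expected to return False for the behavior described in this issue, because the vectors come back with 1536 entries.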

Expected behavior

vLLM should detect the modules.json in SentenceTransformer models and load any Dense projection modules, applying them as part of the pooling step.
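
The detection step could look roughly like the sketch below. It is not vLLM code. It assumes the usual SentenceTransformer layout, in which modules.json is a JSON list of entries with "idx", "path", and "type" keys and Dense layers have the type "sentence_transformers.models.Dense". The function name is hypothetical.

```python
import json
from pathlib import Path

# Type string assumed to mark Dense modules in modules.json.
DENSE_TYPE = "sentence_transformers.models.Dense"


def find_dense_modules(model_dir):
    """Return Dense module directories listed in modules.json, in pipeline order.

    Returns an empty list when modules.json is absent, so models that are not
    SentenceTransformer checkpoints are left untouched.
    """
    modules_file = Path(model_dir) / "modules.json"
    if not modules_file.is_file():
        return []
    modules = json.loads(modules_file.read_text(encoding="utf-8"))
    dense = [m for m in modules if m.get("type") == DENSE_TYPE]
    dense.sort(key=lambda m: m.get("idx", 0))
    return [Path(model_dir) / m["path"] for m in dense]
```

For a local copy of the model, `find_dense_modules(model_dir)` would be expected to return the 2_Dense_1024 directory, whose model.safetensors holds the projection weights a loader would then need to read.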

Related

  • #10119 — stella not supported (closed as not planned, June 2025)
  • #22614 — unmerged PR attempting to add generic ST Dense projection loading (closed stale, Nov 2025)

Workaround

Currently the only correct way to serve stella is via sentence-transformers directly or Infinity (ARM64 image not available). Sentence-transformers gives ~46 vecs/s vs vLLM's ~2950 vecs/s — a 64x throughput gap that makes stella impractical for high-volume pipelines.

Request

Any one of the following:

  1. Revive and merge a cleaned-up version of #22614
  2. Add a SentenceTransformerDensePooler to the vLLM pooling infrastructure that reads modules.json and loads projection layers
  3. Provide a supported path to pass custom projection weights via --override-pooler-config

extent analysis

TL;DR

The most likely fix is a pooler that reads modules.json, loads the Dense projection weights from the 2_Dense_* subdirectories (for stella, 2_Dense_1024/model.safetensors), and applies them after mean pooling.

Guidance

  • Verify that the modules.json file and 2_Dense_* subdirectories are present in the model directory and contain the necessary projection layer weights.
  • Consider reviving and merging the unmerged PR #22614, which attempted to add generic ST Dense projection loading, as a potential solution.
  • Alternatively, explore adding a SentenceTransformerDensePooler to the vLLM pooling infrastructure to read modules.json and load projection layers.
  • Investigate providing a supported path to pass custom projection weights via --override-pooler-config as a possible workaround.

Example

A complete fix needs changes inside vLLM's weight loading and pooling code, so no drop-in patch is given here.

Notes

The current workarounds are both limited. Serving through sentence-transformers directly reaches roughly 46 vecs/s, against roughly 2950 vecs/s for vLLM (about a 64x gap), and Infinity has no ARM64 image. Either gap makes stella impractical for high-volume pipelines. Any real fix requires changes to the vLLM pooling infrastructure or a new custom pooler.

Recommendation

Revive and merge a cleaned-up version of #22614. This is a fix rather than a workaround: that PR already attempted generic loading of SentenceTransformer Dense projection layers, so it is probably the shortest path to correct stella embeddings at vLLM throughput.
