langchain - ✅(Solved) Fix perf: HuggingFaceEmbeddings causes excessive device-to-cpu transfers per batch [1 pull requests, 1 comments, 1 participants]

GitHub stats
langchain-ai/langchain#36126 · Fetched 2026-04-08 01:08:07
Comments: 1 · Participants: 1 · Timeline events: 4 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Root Cause

HuggingFaceEmbeddings is significantly slower than calling sentence_transformers directly. Profiling reveals two compounding root causes.

Fix Action

Fixed

PR fix notes

PR #36127: perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline

Description (problem / solution / changelog)

Motivation

HuggingFaceEmbeddings can be much slower than calling sentence_transformers or transformers directly. Profiling revealed two compounding root causes — neither of which was obvious from the code alone.

Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)

sentence_transformers.encode() defaults to convert_to_numpy=True, which calls .cpu().float().numpy() inside every micro-batch iteration. On MPS (Apple Silicon) and CUDA, each .cpu() flushes the hardware command buffer — a full synchronisation point. For 1,000 texts at batch_size=32 that means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.
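The batch count quoted above follows from ceiling division. The sketch below reproduces it; the per-sync figure is only a rough average implied by the ~640 ms total, not a separately measured number.

```python
import math

n_texts = 1_000
batch_size = 32

# On the numpy path, each micro-batch triggers one device-to-CPU sync.
n_batches = math.ceil(n_texts / batch_size)

# Rough average cost per sync implied by the ~640 ms figure
# (an estimate derived from the quoted total, not a measurement).
overhead_ms = 640
per_sync_ms = overhead_ms / n_batches

print(n_batches)    # 32
print(per_sync_ms)  # 20.0
```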

Root cause 2 — re-embedding per storage batch (Chroma)

Chroma.from_texts calls create_batches(documents=texts) without pre-computed embeddings and then calls add_texts() (which calls embed_documents()) for each ChromaDB storage batch. For a corpus that spans multiple ChromaDB batches (default max_batch_size ≈ 5,461), embeddings are recomputed from scratch for each slice. update_documents already does the right thing — embed once, batch only the storage — but from_texts did not follow the same pattern.
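As a toy model of the two patterns, the sketch below counts how often `embed_documents` is invoked when embedding happens inside the storage loop versus once up front. `CountingEmbedder`, `slices`, and both helper functions are hypothetical stand-ins written for this illustration; they do not reproduce Chroma's internals and only model the number of embedding calls.

```python
class CountingEmbedder:
    """Hypothetical stand-in for an embedding model; counts embed_documents calls."""

    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += 1
        return [[float(len(t))] for t in texts]  # dummy 1-D "embedding"


def slices(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_per_storage_batch(embedder, texts, max_batch_size):
    """Old pattern: the embedder is called inside the storage loop."""
    stored = []
    for chunk in slices(texts, max_batch_size):
        stored.extend(embedder.embed_documents(chunk))
    return stored


def embed_once_then_batch(embedder, texts, max_batch_size):
    """Fixed pattern: embed once, then batch only the storage writes."""
    all_embeddings = embedder.embed_documents(texts)
    stored = []
    for chunk in slices(all_embeddings, max_batch_size):
        stored.extend(chunk)
    return stored


texts = [f"doc {i}" for i in range(12_000)]
old, new = CountingEmbedder(), CountingEmbedder()
result_old = embed_per_storage_batch(old, texts, 5_461)
result_new = embed_once_then_batch(new, texts, 5_461)

print(old.calls, new.calls)      # 3 1
print(result_old == result_new)  # True
```

Both patterns store the same vectors; the difference in this model is only how many times the embedding pipeline is entered.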

Fixes #36126

Changes

langchain_huggingface/embeddings/huggingface.py

  • Default to convert_to_tensor=True in _embed so all micro-batch outputs stay on the model's device and are torch.cat'd there. This reduces device→CPU transfers from N_batches to 1 (at the very end).
  • Final conversion uses .cpu().numpy().tolist() — one sync, then numpy's C-implemented tolist(). (PyTorch's Tensor.tolist() is a Python-level loop and is significantly slower for 2-D float arrays.)
  • Add batch_size: int = 32 as a first-class field (sentence-transformers already uses 32 as its default; this just surfaces it for easy tuning without knowing the internal encode_kwargs API). Users can still override both fields via encode_kwargs.
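One way the defaults and the user overrides described above could be combined is sketched here. `build_encode_kwargs` is a hypothetical helper, not a function from the PR; the rule that an explicit `convert_to_numpy=True` or `convert_to_tensor=False` suppresses the tensor default is an assumption about how the override is honoured.

```python
def build_encode_kwargs(user_kwargs, batch_size=32):
    """Hypothetical helper: merge defaults with user-supplied encode_kwargs."""
    kwargs = {"batch_size": batch_size}

    # Assumption: an explicit request for the numpy path disables the tensor default.
    wants_numpy = (
        user_kwargs.get("convert_to_numpy") is True
        or user_kwargs.get("convert_to_tensor") is False
    )
    if not wants_numpy:
        kwargs["convert_to_tensor"] = True

    # User-supplied values always win over the defaults.
    kwargs.update(user_kwargs)
    return kwargs


print(build_encode_kwargs({}))
# {'batch_size': 32, 'convert_to_tensor': True}
print(build_encode_kwargs({"convert_to_numpy": True, "batch_size": 128}))
# {'batch_size': 128, 'convert_to_numpy': True}
```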

langchain_chroma/vectorstores.py

  • In from_texts: pre-compute all embeddings once with embedding.embed_documents(texts), then pass embeddings=all_embeddings to create_batches and write directly via _collection.add — matching the pattern already used in update_documents.

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)

| Config | Time | vs baseline |
|---|---|---|
| baseline: convert_to_numpy=True, batch_size=32 | 1.157 s | |
| fixed: convert_to_tensor=True, batch_size=32 | 0.631 s | 1.84× faster |
| fixed: convert_to_tensor=True, batch_size=128 | 0.642 s | 1.80× faster |
| direct sentence_transformers (reference) | 0.594 s | 1.95× faster |

With the Chroma fix applied on top (for datasets > 5,461 docs) the combined improvement is proportional to the number of storage batches.

Backward compatibility

  • embed_documents / embed_query return types are unchanged.
  • Users who explicitly pass encode_kwargs={"convert_to_tensor": False} or encode_kwargs={"convert_to_numpy": True} get the original numpy path (the hasattr(embeddings, "cpu") branch handles this correctly).
  • Chroma.from_texts / from_documents signatures are unchanged.
  • For datasets with a single ChromaDB batch (< 5,461 docs, the common case) the Chroma change is functionally identical to the previous behaviour.

Areas requiring careful review

  1. hasattr(embeddings, "cpu") duck-typing — used to distinguish a torch.Tensor from a numpy.ndarray without importing torch at the module level. This should be safe but reviewers should verify edge cases (e.g. IPEX models, custom pooling layers that return unusual objects).
  2. Chroma _collection.add bypass: from_texts now writes directly to _collection.add rather than going through add_texts. The metadata and document handling is delegated to create_batches (same as update_documents). Please verify this covers all the metadata edge cases that add_texts handled.
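The duck-typing check in item 1 can be exercised without torch or numpy installed. In the sketch below, `FakeTensor` and `FakeArray` are hypothetical stand-ins that mimic only the methods the check relies on, and `to_python_lists` is an illustrative name rather than the PR's actual function.

```python
class FakeArray:
    """Stand-in for numpy.ndarray: has tolist() but no cpu()."""

    def __init__(self, rows):
        self._rows = rows

    def tolist(self):
        return [list(row) for row in self._rows]


class FakeTensor:
    """Stand-in for torch.Tensor: has cpu() and numpy()."""

    def __init__(self, rows):
        self._rows = rows
        self.cpu_calls = 0

    def cpu(self):
        self.cpu_calls += 1
        return self

    def numpy(self):
        return FakeArray(self._rows)


def to_python_lists(embeddings):
    """Convert encoder output to nested lists without importing torch."""
    if hasattr(embeddings, "cpu"):  # tensor-like: one transfer, then numpy
        embeddings = embeddings.cpu().numpy()
    return embeddings.tolist()      # array-like path


tensor = FakeTensor([(0.1, 0.2), (0.3, 0.4)])
print(to_python_lists(tensor))                  # [[0.1, 0.2], [0.3, 0.4]]
print(tensor.cpu_calls)                         # 1
print(to_python_lists(FakeArray([(1.0, 2.0)])))  # [[1.0, 2.0]]
```

An object with neither `cpu()` nor `tolist()` (a plain Python list, for example) would raise `AttributeError` in this sketch, which is the kind of edge case the review note asks about.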

Test plan

  • 13 new unit tests (tests/unit_tests/test_embeddings.py) — all passing, no network required
  • All existing unit tests pass
  • Benchmark script (libs/partners/huggingface/scripts/benchmark_embeddings.py) confirms 1.84× wall-clock improvement on M4

🤖 This contribution was developed with the assistance of Claude Code (Anthropic). The root-cause analysis, implementation, and tests were designed collaboratively.

Changed files

  • libs/partners/chroma/langchain_chroma/vectorstores.py (modified, +20/-4)
  • libs/partners/huggingface/langchain_huggingface/embeddings/huggingface.py (modified, +26/-5)
  • libs/partners/huggingface/pyproject.toml (modified, +3/-0)
  • libs/partners/huggingface/scripts/benchmark_embeddings.py (added, +109/-0)
  • libs/partners/huggingface/scripts/check_imports.py (modified, +2/-2)
  • libs/partners/huggingface/tests/unit_tests/test_embeddings.py (added, +235/-0)
  • libs/partners/huggingface/uv.lock (modified, +9/-3)
RAW_BUFFER

Problem

HuggingFaceEmbeddings is significantly slower than calling sentence_transformers directly. Profiling reveals two compounding root causes.

Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)

sentence_transformers.encode() defaults to convert_to_numpy=True, which calls .cpu().float().numpy() inside every micro-batch iteration. On MPS (Apple Silicon) and CUDA, each .cpu() flushes the hardware command buffer — a full synchronisation point.

For 1,000 texts at batch_size=32, this means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.

Root cause 2 — re-embedding per storage batch (Chroma)

Chroma.from_texts calls create_batches(documents=texts) without pre-computed embeddings, then calls add_texts() (which calls embed_documents()) for each ChromaDB storage batch. For corpora spanning multiple ChromaDB batches (default max_batch_size ≈ 5,461), embeddings are recomputed from scratch for each slice.

update_documents already does the right thing — embed once, batch only the storage — but from_texts did not follow the same pattern.

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)

| Config | Time | vs baseline |
|---|---|---|
| baseline: convert_to_numpy=True, batch_size=32 | 1.157 s | |
| fixed: convert_to_tensor=True, batch_size=32 | 0.631 s | 1.84× faster |
| fixed: convert_to_tensor=True, batch_size=128 | 0.642 s | 1.80× faster |
| direct sentence_transformers (reference) | 0.594 s | 1.95× faster |

Proposed fix

  1. Default to convert_to_tensor=True in HuggingFaceEmbeddings._embed so outputs stay on-device across all micro-batches, with a single device→CPU transfer at the end.
  2. Pre-compute all embeddings once in Chroma.from_texts before passing them to create_batches, matching the pattern already in update_documents.

extent analysis

Fix Plan

To address the performance issues with HuggingFaceEmbeddings, we will implement the following steps:

  • Modify HuggingFaceEmbeddings._embed to default to convert_to_tensor=True
  • Pre-compute embeddings in Chroma.from_texts before creating batches

Code Changes

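# In HuggingFaceEmbeddings._embed (simplified sketch, not the actual diff)
def _embed(self, texts):
    # ...
    # Keep every micro-batch output on the model's device.
    outputs = self.model.encode(texts, convert_to_tensor=True)
    # One device-to-CPU transfer at the very end, then numpy's tolist().
    if hasattr(outputs, "cpu"):
        outputs = outputs.cpu().numpy()
    return outputs.tolist()

# In Chroma.from_texts (simplified sketch; other parameters omitted)
@classmethod
def from_texts(cls, texts, embedding, **kwargs):
    # Embed the whole corpus once with the embedding object the caller passed in.
    all_embeddings = embedding.embed_documents(texts)
    # Batch only the storage writes, reusing the pre-computed embeddings.
    batches = create_batches(documents=texts, embeddings=all_embeddings)
    # ...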

Verification

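To verify the fix, run the benchmark script (libs/partners/huggingface/scripts/benchmark_embeddings.py) before and after the change and compare wall-clock times against the baseline. On the M4 MacBook Air setup in the benchmark table, the expected result is roughly 0.63 s versus 1.157 s for 1,000 texts; other hardware will give different absolute numbers.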
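A minimal timing helper in the same spirit is sketched below. It is not the PR's benchmark script; `best_of`, `speedup`, and the two placeholder workloads are hypothetical names, and the workloads would be replaced by real embedding calls.

```python
import time


def best_of(fn, repeats=3):
    """Return the fastest wall-clock time of fn() over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Placeholder workloads; swap in the baseline and fixed embedding calls.
def baseline_workload():
    sum(i * i for i in range(200_000))


def candidate_workload():
    sum(i * i for i in range(100_000))


ratio = speedup(best_of(baseline_workload), best_of(candidate_workload))
print(f"{ratio:.2f}x faster")

# Sanity check against the reference row of the benchmark table.
print(round(speedup(1.157, 0.594), 2))  # 1.95
```

Taking the minimum over a few repeats reduces noise from warm-up and background activity, which matters when the differences are a few hundred milliseconds.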

Extra Tips

  • When working with large datasets, it's essential to minimize device→CPU memory transfers to optimize performance.
  • Pre-computing embeddings before creating batches can significantly reduce processing time, especially for large corpora.
  • Tune batch_size against your own hardware rather than assuming larger is faster: in the benchmark above, raising it from 32 to 128 made no meaningful difference (0.631 s vs 0.642 s).
