langchain - ✅(Solved) Fix perf: HuggingFaceEmbeddings causes excessive device-to-cpu transfers per batch [1 pull requests, 1 comments, 1 participants]

airbagdeer · 2026-03-20T17:33:14Z



GitHub stats

langchain-ai/langchain#36126•Fetched 2026-04-08 01:08:07


Author

airbagdeer

Participants

airbagdeer

Timeline (top)

commented ×1, cross-referenced ×1, labeled ×1, referenced ×1

Root Cause

HuggingFaceEmbeddings is significantly slower than calling sentence_transformers directly. Profiling reveals two compounding root causes.

Fix Action

Fixed

Fixed by PR: perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline (https://github.com/langchain-ai/langchain/pull/36127)

PR fix notes

PR #36127: perf(huggingface,chroma): reduce device-to-cpu transfer overhead in embedding pipeline

Repository: langchain-ai/langchain
Author: airbagdeer
State: closed | merged: False
Link: https://github.com/langchain-ai/langchain/pull/36127

Description (problem / solution / changelog)

Motivation

HuggingFaceEmbeddings can be much slower than calling sentence_transformers or transformers directly. Profiling revealed two compounding root causes — neither of which was obvious from the code alone.

Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)

sentence_transformers.encode() defaults to convert_to_numpy=True, which calls .cpu().float().numpy() inside every micro-batch iteration. On MPS (Apple Silicon) and CUDA, each .cpu() flushes the hardware command buffer — a full synchronisation point. For 1,000 texts at batch_size=32 that means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.
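The "32 synchronisations" figure is a ceiling division over the micro-batches. A minimal sketch, assuming one device→CPU sync per micro-batch on the numpy path and a single sync when outputs stay on the device (the helper name `sync_points` is illustrative and not part of either library):

```python
import math


def sync_points(n_texts: int, batch_size: int, per_batch_transfer: bool) -> int:
    """Count device-to-CPU synchronisations for one encode() call.

    Assumes one sync per micro-batch when every batch is converted to numpy,
    and a single sync when outputs stay on the device until the end.
    """
    if n_texts == 0:
        return 0
    n_batches = math.ceil(n_texts / batch_size)
    return n_batches if per_batch_transfer else 1


print(sync_points(1000, 32, per_batch_transfer=True))   # 32
print(sync_points(1000, 32, per_batch_transfer=False))  # 1
```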

Root cause 2 — re-embedding per storage batch (Chroma)

Chroma.from_texts calls create_batches(documents=texts) without pre-computed embeddings and then calls add_texts() (which calls embed_documents()) for each ChromaDB storage batch. For a corpus that spans multiple ChromaDB batches (default max_batch_size ≈ 5,461), embeddings are recomputed from scratch for each slice. update_documents already does the right thing — embed once, batch only the storage — but from_texts did not follow the same pattern.

Fixes #36126

Changes

`langchain_huggingface/embeddings/huggingface.py`

Default to convert_to_tensor=True in _embed so all micro-batch outputs stay on the model's device and are torch.cat'd there. This reduces device→CPU transfers from N_batches to 1 (at the very end).
Final conversion uses .cpu().numpy().tolist() — one sync, then numpy's C-implemented tolist(). (PyTorch's Tensor.tolist() is a Python-level loop and is significantly slower for 2-D float arrays.)
Add batch_size: int = 32 as a first-class field (sentence-transformers already uses 32 as its default; this just surfaces it for easy tuning without knowing the internal encode_kwargs API). Users can still override both fields via encode_kwargs.

`langchain_chroma/vectorstores.py`

In from_texts: pre-compute all embeddings once with embedding.embed_documents(texts), then pass embeddings=all_embeddings to create_batches and write directly via _collection.add — matching the pattern already used in update_documents.
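The "embed once, batch only the storage" pattern can be illustrated without ChromaDB. The sketch below is a simplified model, not the library code: `CountingEmbedding`, `slices`, and the two functions are hypothetical, the store is a plain list, and only the number of `embed_documents` calls is measured.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 5461  # approximate ChromaDB default quoted above


class CountingEmbedding:
    """Stand-in embedding model that records how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += 1
        return [[float(len(t))] for t in texts]


def slices(n: int, size: int) -> Iterator[slice]:
    """Yield consecutive slices of at most `size` items covering range(n)."""
    for start in range(0, n, size):
        yield slice(start, start + size)


def embed_per_storage_batch(texts, embedding, store) -> None:
    """Per-batch pattern: every storage batch triggers its own embedding call."""
    for s in slices(len(texts), MAX_BATCH_SIZE):
        store.extend(zip(texts[s], embedding.embed_documents(texts[s])))


def embed_once_then_batch(texts, embedding, store) -> None:
    """Embed-once pattern: one embedding call, then batch only the writes."""
    all_embeddings = embedding.embed_documents(texts)
    for s in slices(len(texts), MAX_BATCH_SIZE):
        store.extend(zip(texts[s], all_embeddings[s]))


texts = [f"doc {i}" for i in range(12_000)]
per_batch, once = CountingEmbedding(), CountingEmbedding()
store_a, store_b = [], []
embed_per_storage_batch(texts, per_batch, store_a)
embed_once_then_batch(texts, once, store_b)
print(per_batch.calls, once.calls)  # 3 1
print(store_a == store_b)           # True
```

Both variants store the same pairs; the difference is how many times the embedding model is entered, which matters when each call carries fixed overhead.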

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)

| Config | Time | vs baseline |
|---|---|---|
| baseline: `convert_to_numpy=True`, `batch_size=32` | 1.157 s | (baseline) |
| fixed: `convert_to_tensor=True`, `batch_size=32` | 0.631 s | 1.84× faster |
| fixed: `convert_to_tensor=True`, `batch_size=128` | 0.642 s | 1.80× faster |
| direct `sentence_transformers` (reference) | 0.594 s | 1.95× faster |

With the Chroma fix applied on top (for datasets > 5,461 docs) the combined improvement is proportional to the number of storage batches.

Backward compatibility

embed_documents / embed_query return types are unchanged.
Users who explicitly pass encode_kwargs={"convert_to_tensor": False} or encode_kwargs={"convert_to_numpy": True} get the original numpy path (the hasattr(embeddings, "cpu") branch handles this correctly).
Chroma.from_texts / from_documents signatures are unchanged.
For datasets with a single ChromaDB batch (< 5,461 docs, the common case) the Chroma change is functionally identical to the previous behaviour.
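One way the override behaviour described above could be arranged is a small merge of defaults and user-supplied `encode_kwargs`. This is a guess at the shape of the logic, not the PR's implementation; `resolve_encode_kwargs` is a hypothetical helper, and the rule that an explicit `convert_to_numpy=True` switches off the tensor default is an assumption drawn from the compatibility notes.

```python
from typing import Any, Dict, Optional


def resolve_encode_kwargs(
    batch_size: int = 32,
    encode_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge the new defaults with user overrides; user values always win."""
    user = dict(encode_kwargs or {})
    resolved: Dict[str, Any] = {"batch_size": batch_size, "convert_to_tensor": True}
    resolved.update(user)
    # Assumption: asking for numpy output opts out of the tensor default,
    # unless the caller also set convert_to_tensor explicitly.
    if user.get("convert_to_numpy") and "convert_to_tensor" not in user:
        resolved["convert_to_tensor"] = False
    return resolved


print(resolve_encode_kwargs())
print(resolve_encode_kwargs(encode_kwargs={"convert_to_numpy": True}))
print(resolve_encode_kwargs(encode_kwargs={"batch_size": 128}))
```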

Areas requiring careful review

hasattr(embeddings, "cpu") duck-typing — used to distinguish a torch.Tensor from a numpy.ndarray without importing torch at the module level. This should be safe but reviewers should verify edge cases (e.g. IPEX models, custom pooling layers that return unusual objects).
Chroma _collection.add bypass — from_texts now writes directly to _collection.add rather than going through add_texts. The metadata and document handling is delegated to create_batches (same as update_documents). Please verify this covers all the metadata edge cases that add_texts handled.

Test plan

13 new unit tests (tests/unit_tests/test_embeddings.py) — all passing, no network required
All existing unit tests pass
Benchmark script (libs/partners/huggingface/scripts/benchmark_embeddings.py) confirms 1.84× wall-clock improvement on M4

🤖 This contribution was developed with the assistance of Claude Code (Anthropic). The root-cause analysis, implementation, and tests were designed collaboratively.

Changed files

libs/partners/chroma/langchain_chroma/vectorstores.py (modified, +20/-4)
libs/partners/huggingface/langchain_huggingface/embeddings/huggingface.py (modified, +26/-5)
libs/partners/huggingface/pyproject.toml (modified, +3/-0)
libs/partners/huggingface/scripts/benchmark_embeddings.py (added, +109/-0)
libs/partners/huggingface/scripts/check_imports.py (modified, +2/-2)
libs/partners/huggingface/tests/unit_tests/test_embeddings.py (added, +235/-0)
libs/partners/huggingface/uv.lock (modified, +9/-3)


Problem

HuggingFaceEmbeddings is significantly slower than calling sentence_transformers directly. Profiling reveals two compounding root causes.

Root cause 1 — per-batch device→CPU memory transfers (HuggingFace)

For 1,000 texts at batch_size=32, this means 32 synchronisations instead of 1, adding ~640 ms of pure transfer overhead on an M4 MacBook Air.

Root cause 2 — re-embedding per storage batch (Chroma)

Chroma.from_texts calls create_batches(documents=texts) without pre-computed embeddings, then calls add_texts() (which calls embed_documents()) for each ChromaDB storage batch. For corpora spanning multiple ChromaDB batches (default max_batch_size ≈ 5,461), embeddings are recomputed from scratch for each slice.

update_documents already does the right thing — embed once, batch only the storage — but from_texts did not follow the same pattern.

Benchmark (BAAI/bge-small-en-v1.5, 1,000 texts, M4 MacBook Air)

| Config | Time | vs baseline |
|---|---|---|
| baseline: `convert_to_numpy=True`, `batch_size=32` | 1.157 s | (baseline) |
| fixed: `convert_to_tensor=True`, `batch_size=32` | 0.631 s | 1.84× faster |
| fixed: `convert_to_tensor=True`, `batch_size=128` | 0.642 s | 1.80× faster |
| direct `sentence_transformers` (reference) | 0.594 s | 1.95× faster |

Proposed fix

Default to convert_to_tensor=True in HuggingFaceEmbeddings._embed so outputs stay on-device across all micro-batches, with a single device→CPU transfer at the end.
Pre-compute all embeddings once in Chroma.from_texts before passing them to create_batches, matching the pattern already in update_documents.

extent analysis

Fix Plan

To address the performance issues with HuggingFaceEmbeddings, we will implement the following steps:

Modify HuggingFaceEmbeddings._embed to default to convert_to_tensor=True
Pre-compute embeddings in Chroma.from_texts before creating batches

Code Changes

# In HuggingFaceEmbeddings._embed (sketch; unrelated code elided)
def _embed(self, texts):
    # ...
    # Keep every micro-batch on the model's device; convert once at the end.
    outputs = self.model.encode(texts, convert_to_tensor=True)
    # ...

# In Chroma.from_texts (sketch; a classmethod, other parameters elided)
def from_texts(cls, texts, embedding):
    # Embed the whole corpus once, using the embedding object the caller passed in.
    embeddings = embedding.embed_documents(texts)
    # Batch only the storage step, reusing the pre-computed embeddings.
    batches = create_batches(documents=texts, embeddings=embeddings)
    # ...

Verification

To verify the fix, run the benchmark test with the modified code and compare the results to the baseline. The expected outcome is a significant reduction in processing time, similar to the results shown in the benchmark table.
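A comparison like this needs only a small timing harness. The sketch below is a generic stdlib helper, not the PR's `benchmark_embeddings.py`; `best_time` and `speedup` are illustrative names. On MPS or CUDA the timed callable should end with a CPU transfer or an explicit device synchronisation so that queued work is included in the measurement.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


# Reproducing one ratio from the benchmark table using the rounded timings.
print(f"{speedup(1.157, 0.642):.2f}x")  # 1.80x
```

In practice `fn` would wrap a call such as `embed_documents(texts)` on the baseline and on the modified code, with the same texts and model for both.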

Extra Tips

When working with large datasets, it's essential to minimize device→CPU memory transfers to optimize performance.
Pre-computing embeddings before creating batches can significantly reduce processing time, especially for large corpora.
Consider adjusting the batch_size parameter to find the optimal balance between memory usage and processing time.
