llamaIndex - ✅(Solved) Fix [Bug]: Node.hash uses MetadataMode.ALL, causing unnecessary re-embeds when volatile file-stat metadata changes [1 pull requests, 1 comments, 1 participants]

GitHub stats
run-llama/llama_index#21461 · Fetched 2026-04-24 05:51:29
Comments: 1 · Participants: 1 · Timeline events: 3 (cross-referenced ×2, commented ×1) · Reactions: 0

Root Cause

PR #18303 correctly addressed #17871 (metadata changes not being detected by IngestionPipeline) by adding metadata to the hash. The fix used MetadataMode.ALL rather than MetadataMode.EMBED. Since excluded_embed_metadata_keys exists precisely to mark fields as volatile, operational, or not content-relevant, including them in the hash defeats the purpose of that mechanism.

Fix Action

Fix / Workaround

llama-index-core==0.14.21 (current at time of writing). Bug introduced in #18303, merged into llama-index-core prior to v0.12.20 release.

PR fix notes

PR #21462: fix: exclude volatile metadata from Node/TextNode hashing and IngestionCache keys to prevent unnecessary re-embeds

Description (problem / solution / changelog)

Description

Fixes the unnecessary re-embed behavior described in #21461.

Three parallel hashing code paths in llama-index-core include metadata without respecting excluded_embed_metadata_keys. The combined effect is that any volatile metadata change (file-stat timestamps, sizes, etc.) invalidates caches and re-triggers embedding for byte-identical content.

This PR changes all three sites to use MetadataMode.EMBED, which respects excluded_embed_metadata_keys. The mechanism already exists in the codebase for exactly the purpose of marking fields as not-content-relevant.

Meaningful metadata changes (user-added tags, custom enrichments, any field not explicitly excluded by the reader) are still tracked. The fix preserves the intent of #18303 (detect content-relevant metadata changes) while eliminating the churn on fields that were explicitly excluded from embedding.
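
The difference between the two modes can be sketched with the standard library alone. This is a simplified model, not the llama-index implementation: ToyNode, metadata_str, and node_hash are hypothetical names, and the only behavior assumed from the description above is that the EMBED mode drops keys listed in excluded_embed_metadata_keys while ALL keeps every key.

```python
from dataclasses import dataclass, field
from hashlib import sha256


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node; not the llama-index class."""

    text: str
    metadata: dict = field(default_factory=dict)
    excluded_embed_metadata_keys: list = field(default_factory=list)

    def metadata_str(self, mode: str) -> str:
        # "all" keeps every key; "embed" drops keys excluded from embedding.
        skip = set(self.excluded_embed_metadata_keys) if mode == "embed" else set()
        return "\n".join(
            f"{k}: {v}" for k, v in self.metadata.items() if k not in skip
        )

    def node_hash(self, mode: str) -> str:
        identity = self.text + self.metadata_str(mode)
        return sha256(identity.encode("utf-8")).hexdigest()


node = ToyNode(
    text="hello world",
    metadata={"tag": "v1", "last_modified_date": "2026-04-22"},
    excluded_embed_metadata_keys=["last_modified_date"],
)
before_all = node.node_hash("all")
before_embed = node.node_hash("embed")

node.metadata["last_modified_date"] = "2026-04-23"  # volatile field only
volatile_changes_all = node.node_hash("all") != before_all
volatile_changes_embed = node.node_hash("embed") != before_embed

node.metadata["tag"] = "v2"  # content-relevant field
tag_changes_embed = node.node_hash("embed") != before_embed

print(volatile_changes_all, volatile_changes_embed, tag_changes_embed)
# prints: True False True
```

In this model the volatile field flips the hash only under "all", while the tag change is still detected under "embed", which is the contract the PR describes.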

Fixes #21461

The three changes

  1. Node.hash in llama-index-core/llama_index/core/schema.py:

    -        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
    +        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
  2. TextNode.hash in llama-index-core/llama_index/core/schema.py. Previously built doc_identity as str(self.text) + str(self.metadata), which included all metadata regardless of exclusions. Updated to use get_metadata_str(mode=MetadataMode.EMBED) so volatile fields do not participate:

    -        doc_identity = str(self.text) + str(self.metadata)
    +        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
    +        doc_identity = str(self.text) + metadata_str
  3. get_transformation_hash in llama-index-core/llama_index/core/ingestion/pipeline.py. This is the key used by IngestionCache to skip transformations on repeated input. Previously used MetadataMode.ALL, causing cache misses on volatile metadata change. Updated to MetadataMode.EMBED:

    -        [str(node.get_content(metadata_mode=MetadataMode.ALL)) for node in nodes]
    +        [str(node.get_content(metadata_mode=MetadataMode.EMBED)) for node in nodes]

Without change #3, even with #1 and #2 applied, IngestionCache would still miss on the volatile-metadata change and re-run the whole transformation chain including the embedder. The three changes together are the minimal set.
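
Why the cache key matters independently of the node hashes can be shown with a toy cache. The sketch below is an assumption-laden model, not the real get_transformation_hash: content_for_key, transformation_key, and count_embed_runs are hypothetical helpers, and the key is simply the node content concatenated with a transformation name.

```python
from hashlib import sha256


def content_for_key(node: dict, mode: str) -> str:
    # "all" prepends every metadata field; "embed" skips the excluded ones.
    skip = set(node["excluded"]) if mode == "embed" else set()
    meta = "\n".join(
        f"{k}: {v}" for k, v in node["metadata"].items() if k not in skip
    )
    return meta + "\n\n" + node["text"]


def transformation_key(nodes: list, transformation_name: str, mode: str) -> str:
    nodes_str = "".join(content_for_key(n, mode) for n in nodes)
    return sha256((nodes_str + transformation_name).encode("utf-8")).hexdigest()


def count_embed_runs(mode: str) -> int:
    """Run the toy pipeline on two days and count cache misses."""
    cache: dict = {}
    runs = 0
    for day in ("2026-04-22", "2026-04-23"):  # same text, new timestamp
        node = {
            "text": "hello world",
            "metadata": {"tag": "v1", "last_modified_date": day},
            "excluded": ["last_modified_date"],
        }
        key = transformation_key([node], "ToyEmbedder", mode)
        if key not in cache:  # cache miss: the expensive step runs
            runs += 1
            cache[key] = "embedding"
    return runs


runs_all = count_embed_runs("all")
runs_embed = count_embed_runs("embed")
print(runs_all, runs_embed)
# prints: 2 1
```

With the "all" key the second day misses the cache even though the text is identical; with the "embed" key it hits.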

Test coverage

Adds a regression test in llama-index-core/tests/ingestion/test_pipeline.py that:

  1. Constructs a Document with both content-relevant metadata (tag) and volatile metadata (last_modified_date).
  2. Runs an IngestionPipeline with a mock embedding model whose get_text_embedding_batch is wrapped to count invocations.
  3. First ingest: embedder is called.
  4. Re-ingest after mutating only the volatile field (which is listed in excluded_embed_metadata_keys): asserts the embedder is NOT called again.
  5. Re-ingest after mutating a content-relevant field (tag): asserts the embedder IS called again, preserving the #18303 fix for #17871.

This covers the minimal behavior contract for both directions.

Why MetadataMode.EMBED is the right choice

  • It respects excluded_embed_metadata_keys, which readers already use to mark fields that should not participate in the embedded text.
  • The intent of excluding a field from embedding is, by construction, that the field is not content-relevant. By the same reasoning, it should not invalidate the cached embedding.
  • SimpleDirectoryReader and other readers (S3Reader, GCSReader, etc.) already populate excluded_embed_metadata_keys with volatile file-stat and API-datetime fields by default. These changes make the hashing paths align with that existing intent, without any reader-side modifications.
  • Users who want a particular field to always participate in the hash can simply keep it out of excluded_embed_metadata_keys.

Potential concerns and responses

"This changes hash semantics and could break users depending on the old behavior." The change makes these hashes less aggressive at invalidation. They only flip on fields the reader hasn't marked excluded. Any previously-valid invalidation (on non-excluded fields) still occurs. The only change in behavior is that excluded-from-embed fields no longer cause unnecessary re-embeds. If a deployment relied on the old churn behavior as a form of time-based re-embedding, they can opt back in by removing specific fields from excluded_embed_metadata_keys.

"Why not just document the current behavior?" The current behavior silently charges real API tokens to the embedding provider on every cross-day ingestion run that touches any file. Documentation doesn't fix the cost; it just explains it. The reproducer repo demonstrates real OpenAI billed usage under realistic workflows.

"What about custom file_metadata callables that don't set excluded_embed_metadata_keys?" Those callers retain full control. Any field they include in metadata without excluding it will still participate in the hash. The fix is opt-in by virtue of respecting the existing exclusion mechanism.

Reproducer

External reproducer repo with 5 progressive levels of evidence (hash comparison, counting embedder, real OpenAI API end-to-end, reader format families, S3 end-to-end): https://github.com/stirelli/llamaindex-embedding-churn

Level 3 (the real-OpenAI E2E test) goes from 100% overhead to 0% overhead after this PR is applied. Verified locally with llama-index-core==0.14.21.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • I have performed a self-review
  • I added a regression test
  • Existing tests still pass (test_pipeline_update_metadata was the existing coverage for the inverse behavior, which still passes since its metadata fields are not in excluded_embed_metadata_keys)
  • Verified locally with the external reproducer
  • I ran uv run make format; uv run make lint (blocked on this macOS x86_64 machine by an unrelated tree-sitter-language-pack wheel availability issue; CI will cover lint/format)

Actual diffs to commit

File 1: llama-index-core/llama_index/core/schema.py (Node.hash)

     @property
     def hash(self) -> str:
         """
         Generate a hash representing the state of the node.

         The hash is generated based on the available resources (audio, image, text or video) and its metadata.
         """
         doc_identities = []
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
         if metadata_str:
             doc_identities.append(metadata_str)

File 1: llama-index-core/llama_index/core/schema.py (TextNode.hash)

     @property
     def hash(self) -> str:
-        doc_identity = str(self.text) + str(self.metadata)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)
+        doc_identity = str(self.text) + metadata_str
         return str(sha256(doc_identity.encode("utf-8", "surrogatepass")).hexdigest())

File 2: llama-index-core/llama_index/core/ingestion/pipeline.py

 def get_transformation_hash(
     nodes: Sequence[BaseNode], transformation: TransformComponent
 ) -> str:
     """Get the hash of a transformation."""
     nodes_str = "".join(
-        [str(node.get_content(metadata_mode=MetadataMode.ALL)) for node in nodes]
+        [str(node.get_content(metadata_mode=MetadataMode.EMBED)) for node in nodes]
     )

File 3: llama-index-core/tests/ingestion/test_pipeline.py

Append the following regression test after test_pipeline_update_metadata:

def test_pipeline_does_not_re_embed_on_volatile_metadata_change() -> None:
    """Regression test: fields listed in `excluded_embed_metadata_keys` must
    NOT trigger re-embedding on upsert. This covers the churn scenario where
    volatile metadata such as file-stat timestamps or sizes would otherwise
    force redundant embedding API calls for byte-identical content.

    Companion to `test_pipeline_update_metadata`, which verifies the inverse:
    a change to a content-relevant (non-excluded) metadata field DOES trigger
    re-processing.
    """
    from unittest.mock import MagicMock

    embed_model = MockEmbedding(embed_dim=4)
    # Wrap the batch method so we can count invocations.
    embed_model.get_text_embedding_batch = MagicMock(  # type: ignore[method-assign]
        wraps=embed_model.get_text_embedding_batch
    )

    pipeline = IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=256, chunk_overlap=0),
            embed_model,
        ],
        docstore=SimpleDocumentStore(),
    )

    doc = Document(
        id_="0",
        text="hello world, this is a short test document for the pipeline",
        metadata={"tag": "v1", "last_modified_date": "2026-04-22"},
        excluded_embed_metadata_keys=["last_modified_date"],
    )

    # Phase 1: first ingest (should embed).
    pipeline.run(documents=[doc])
    embed_calls_after_phase1 = embed_model.get_text_embedding_batch.call_count
    assert embed_calls_after_phase1 >= 1

    # Phase 2: mutate only the volatile field, which is in
    # `excluded_embed_metadata_keys`. This must NOT re-embed.
    doc.metadata["last_modified_date"] = "2026-04-23"
    pipeline.run(documents=[doc])
    assert embed_model.get_text_embedding_batch.call_count == embed_calls_after_phase1, (
        "Re-embedding was triggered by a change to metadata listed in "
        "excluded_embed_metadata_keys. This re-introduces the churn bug."
    )

    # Phase 3: mutate a non-excluded (content-relevant) field. This SHOULD
    # re-embed, preserving the behavior added in #18303 for #17871.
    doc.metadata["tag"] = "v2"
    pipeline.run(documents=[doc])
    assert embed_model.get_text_embedding_batch.call_count > embed_calls_after_phase1, (
        "Content-relevant metadata change was not detected. The fix must still "
        "invalidate the cache when a non-excluded field changes."
    )

Changed files

  • llama-index-core/llama_index/core/ingestion/pipeline.py (modified, +1/-1)
  • llama-index-core/llama_index/core/schema.py (modified, +3/-2)
  • llama-index-core/tests/ingestion/test_pipeline.py (modified, +57/-0)

Code Example

import os
from datetime import datetime, timedelta
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.embeddings import BaseEmbedding
from pydantic import PrivateAttr
from pathlib import Path

class CountingEmbedding(BaseEmbedding):
    _call_count: int = PrivateAttr(default=0)
    @classmethod
    def class_name(cls): return "Counting"
    def _get_text_embedding(self, text):
        self._call_count += 1
        return [0.0] * 1536
    def _get_query_embedding(self, q): return self._get_text_embedding(q)
    async def _aget_text_embedding(self, t): return self._get_text_embedding(t)
    async def _aget_query_embedding(self, q): return self._get_text_embedding(q)
    def _get_text_embeddings(self, ts): return [self._get_text_embedding(t) for t in ts]
    async def _aget_text_embeddings(self, ts): return self._get_text_embeddings(ts)

docs_dir = Path("/tmp/churn_repro")
docs_dir.mkdir(exist_ok=True)
(docs_dir / "doc.md").write_text("Hello, world." * 100)
os.utime(docs_dir / "doc.md", ((datetime.now() - timedelta(days=1)).timestamp(),) * 2)

embedder = CountingEmbedding()
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), embedder],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
)

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 1 (initial):  embed_calls={embedder._call_count}")

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 2 (no change): embed_calls={embedder._call_count}  # cache works")

(docs_dir / "doc.md").touch()  # bump mtime to today
pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 3 (post-touch): embed_calls={embedder._call_count}  # bug fires")

---

# llama-index-core/llama_index/core/schema.py, Node.hash @property
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)

---

# llama-index-core/llama_index/core/readers/file/base.py
excluded_embed_metadata_keys=[
    "file_name", "file_type", "file_size",
    "creation_date", "last_modified_date", "last_accessed_date",
],

Bug Description

Since PR #18303 (merged 2025-03-30), Node.hash includes metadata via MetadataMode.ALL. That mode ignores excluded_embed_metadata_keys, so every metadata field ends up in the hash. This includes file-stat-derived fields that SimpleDirectoryReader populates via default_file_metadata_func (last_modified_date, creation_date, last_accessed_date, file_size).

Under IngestionPipeline._handle_upserts, a hash mismatch triggers vector_store.delete() followed by a full re-embed. Combined with MetadataMode.ALL, this means any modification to volatile filesystem metadata will trigger a full re-embedding of the document's chunks on the next ingestion run, even when the text content is byte-identical.
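
The upsert decision described above can be modeled in a few lines. This is a toy sketch under stated assumptions, not the body of IngestionPipeline._handle_upserts: handle_upserts here is a hypothetical function, stored_hashes stands in for the docstore, and the deleted list stands in for calls to vector_store.delete().

```python
def handle_upserts(incoming: dict, stored_hashes: dict, deleted: list) -> list:
    """Return the doc ids that must be (re-)embedded; toy logic only."""
    to_embed = []
    for doc_id, new_hash in incoming.items():
        old_hash = stored_hashes.get(doc_id)
        if old_hash is None:
            to_embed.append(doc_id)  # never seen before
        elif old_hash != new_hash:
            deleted.append(doc_id)  # stale vectors are removed first
            to_embed.append(doc_id)  # then every chunk is embedded again
        stored_hashes[doc_id] = new_hash
    return to_embed


stored: dict = {}
deleted: list = []
first = handle_upserts({"doc.md": "hash-a"}, stored, deleted)
unchanged = handle_upserts({"doc.md": "hash-a"}, stored, deleted)
touched = handle_upserts({"doc.md": "hash-b"}, stored, deleted)
print(first, unchanged, touched, deleted)
# prints: ['doc.md'] [] ['doc.md'] ['doc.md']
```

The third call represents a run where only volatile metadata moved the hash: the model deletes and re-embeds even though nothing about the text changed.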

The trigger is silent: no warning, no log, no indication that the pipeline is re-embedding unchanged content. The cost scales linearly with corpus size, modification rate, and re-indexing frequency.
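
The linear scaling can be written out as simple arithmetic. Every number below is an illustrative placeholder, not a measurement from the issue or the reproducer repo, and wasted_tokens is a hypothetical helper.

```python
def wasted_tokens(
    num_files: int,
    touched_percent: int,
    tokens_per_file: int,
    runs: int,
) -> int:
    """Tokens re-embedded for files whose text did not change (toy model)."""
    touched_files = num_files * touched_percent // 100
    return touched_files * tokens_per_file * runs


# Placeholder inputs: 10,000 files, 5% touched between runs,
# 2,000 tokens per file, 30 ingestion runs.
per_run = wasted_tokens(10_000, 5, 2_000, runs=1)
per_30_runs = wasted_tokens(10_000, 5, 2_000, runs=30)
print(per_run, per_30_runs)
# prints: 1000000 30000000
```

Doubling any one input doubles the result, which is what "scales linearly" means here.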

Root cause

PR #18303 correctly addressed #17871 (metadata changes not being detected by IngestionPipeline) by adding metadata to the hash. The fix used MetadataMode.ALL rather than MetadataMode.EMBED. Since excluded_embed_metadata_keys exists precisely to mark fields as volatile, operational, or not content-relevant, including them in the hash defeats the purpose of that mechanism.

Scope

The bug manifests under these conditions:

  • SimpleDirectoryReader over a local filesystem. last_modified_date is formatted as "%Y-%m-%d" (date only), so the trigger is cross-day modifications. Any file modified between scheduled ingestion runs re-embeds on the next run.
  • Any reader that populates temporal metadata with sub-day precision (via a custom file_metadata callable, or via get_resource_info() code paths). Every modification triggers re-embedding, at the format's precision.
  • Manually-constructed Documents with volatile metadata. Same as above.
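
The effect of timestamp precision on the trigger can be checked with the standard library. The sketch assumes only what the list above states: a date-only "%Y-%m-%d" string for the local-filesystem case and a finer-grained timestamp for the other cases. The variable and function names are illustrative.

```python
from datetime import datetime, timedelta

first_run = datetime(2026, 4, 22, 9, 0, 0)
same_day_touch = first_run + timedelta(hours=5)  # 14:00 on the same day
next_day_touch = first_run + timedelta(days=1)


def as_date(ts: datetime) -> str:
    # Date-only precision, as described for last_modified_date.
    return ts.strftime("%Y-%m-%d")


def as_timestamp(ts: datetime) -> str:
    # Sub-day precision, as a custom metadata callable might produce.
    return ts.isoformat()


date_changed_same_day = as_date(first_run) != as_date(same_day_touch)
date_changed_next_day = as_date(first_run) != as_date(next_day_touch)
timestamp_changed_same_day = as_timestamp(first_run) != as_timestamp(same_day_touch)

print(date_changed_same_day, date_changed_next_day, timestamp_changed_same_day)
# prints: False True True
```

A date-only field changes only when a modification crosses midnight, while a full timestamp changes on every modification.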

The bug does not manifest for SimpleDirectoryReader over fsspec cloud backends (s3fs, gcsfs, adlfs) in the current default load_data() path. This is because default_file_metadata_func (readers/file/base.py:164-168) queries POSIX-style stat keys (mtime, atime, created) that fsspec backends don't emit. The fix for that separate gap (using fs.modified(path) instead) would simultaneously activate this bug for every fsspec-backed reader.

Reproducers

Five progressive reproducers in this repo: https://github.com/stirelli/llamaindex-embedding-churn

Minimal reproduction (no API calls, no cost):

import os
from datetime import datetime, timedelta
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.core.embeddings import BaseEmbedding
from pydantic import PrivateAttr
from pathlib import Path

class CountingEmbedding(BaseEmbedding):
    _call_count: int = PrivateAttr(default=0)
    @classmethod
    def class_name(cls): return "Counting"
    def _get_text_embedding(self, text):
        self._call_count += 1
        return [0.0] * 1536
    def _get_query_embedding(self, q): return self._get_text_embedding(q)
    async def _aget_text_embedding(self, t): return self._get_text_embedding(t)
    async def _aget_query_embedding(self, q): return self._get_text_embedding(q)
    def _get_text_embeddings(self, ts): return [self._get_text_embedding(t) for t in ts]
    async def _aget_text_embeddings(self, ts): return self._get_text_embeddings(ts)

docs_dir = Path("/tmp/churn_repro")
docs_dir.mkdir(exist_ok=True)
(docs_dir / "doc.md").write_text("Hello, world." * 100)
os.utime(docs_dir / "doc.md", ((datetime.now() - timedelta(days=1)).timestamp(),) * 2)

embedder = CountingEmbedding()
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), embedder],
    docstore=SimpleDocumentStore(),
    vector_store=SimpleVectorStore(),
)

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 1 (initial):  embed_calls={embedder._call_count}")

pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 2 (no change): embed_calls={embedder._call_count}  # cache works")

(docs_dir / "doc.md").touch()  # bump mtime to today
pipeline.run(documents=SimpleDirectoryReader(input_dir=str(docs_dir)).load_data())
print(f"Phase 3 (post-touch): embed_calls={embedder._call_count}  # bug fires")

Observed output with the bug present:

  • Phase 1 (initial): some positive number N of embed calls (depends on chunking).
  • Phase 2 (no change): still N. Cache hit, no new calls.
  • Phase 3 (post-touch): the embed count doubles to 2N. This is an unnecessary re-embed of byte-identical content, triggered by a single touch that moved mtime across a calendar day. The expected count is still N.

Proposed fix

Three options to consider. Each preserves the intent of #18303 (detect meaningful metadata changes) to varying degrees.

Option A (recommended): Use MetadataMode.EMBED instead of MetadataMode.ALL

One line change in Node.hash:

# llama-index-core/llama_index/core/schema.py, Node.hash @property
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)

This respects excluded_embed_metadata_keys, the mechanism already present in the codebase for exactly this purpose. SimpleDirectoryReader already adds the volatile file-stat fields to excluded_embed_metadata_keys by default:

# llama-index-core/llama_index/core/readers/file/base.py
excluded_embed_metadata_keys=[
    "file_name", "file_type", "file_size",
    "creation_date", "last_modified_date", "last_accessed_date",
],

so the fix correctly excludes volatile metadata from hash computation without requiring any reader-side changes.

Meaningful metadata changes (user-added tags, custom enrichments) remain tracked because excluded_embed_metadata_keys contains only the fields the reader explicitly marked as excluded from embedding. That matches the intent of #17871.

Pros:

  • One-line diff.
  • Reuses existing infrastructure (no new API surface).
  • Preserves the #17871 fix for content-relevant metadata changes.

Cons:

  • Conflates two concerns in excluded_embed_metadata_keys: "don't send to embedder" and "don't invalidate cached embedding." In practice these align (if a field isn't sent to the embedder, the embedding output won't change when the field changes, so re-embedding is waste), but a maintainer may prefer a dedicated mechanism.

Option B: Remove metadata from the hash entirely (content-only hash)

Revert the metadata inclusion added by #18303. The hash becomes a pure function of the text content.

Pros:

  • Cleanest semantics. No ambiguity about what triggers re-embed.
  • Smallest possible surface for the hash.

Cons:

  • Regresses #17871. Users who rely on metadata updates triggering re-embed (the original bug #18303 was designed to fix) lose that behavior.
  • Shifts the burden back to users to explicitly invalidate documents when metadata changes.
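
The trade-off of Option B can be seen in a minimal sketch. content_only_hash is a hypothetical function used for illustration; it is not code from llama-index or from the PR.

```python
from hashlib import sha256


def content_only_hash(text: str, metadata: dict) -> str:
    # Metadata is accepted but deliberately ignored in this variant.
    return sha256(text.encode("utf-8")).hexdigest()


base = content_only_hash("hello world", {"tag": "v1", "last_modified_date": "2026-04-22"})
touched = content_only_hash("hello world", {"tag": "v1", "last_modified_date": "2026-04-23"})
retagged = content_only_hash("hello world", {"tag": "v2", "last_modified_date": "2026-04-22"})

touch_detected = touched != base
tag_change_detected = retagged != base
print(touch_detected, tag_change_detected)
# prints: False False
```

The volatile change is ignored, as desired, but so is the tag change, which is the #17871 regression listed under Cons.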

Option C: Add a dedicated excluded_hash_metadata_keys field

Introduce a new attribute on Node / Document that controls which metadata fields participate in the hash, independent of excluded_embed_metadata_keys.

Pros:

  • Cleanest separation of concerns. Embedding exclusion and hash exclusion are different user-facing levers.
  • Supports edge cases (e.g., a field excluded from embedding but wanted in the hash, or vice versa).

Cons:

  • API surface growth. Every reader that populates volatile metadata would need to populate the new field too.
  • More breaking for downstream users who have already built around the current two-field model.
  • Defaults have to be chosen carefully (should it default to excluded_embed_metadata_keys? to empty? something else?).
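
One way Option C could look is sketched below. excluded_hash_metadata_keys does not exist in llama-index; it is the hypothetical field proposed above, and the fallback to excluded_embed_metadata_keys when it is unset is just one of the candidate defaults mentioned in the last bullet. toy_hash is likewise an illustrative helper.

```python
from hashlib import sha256
from typing import Optional


def toy_hash(
    text: str,
    metadata: dict,
    excluded_embed_metadata_keys: list,
    excluded_hash_metadata_keys: Optional[list] = None,  # hypothetical field
) -> str:
    # Assumed default: reuse the embed exclusions when no hash list is given.
    if excluded_hash_metadata_keys is None:
        skip = set(excluded_embed_metadata_keys)
    else:
        skip = set(excluded_hash_metadata_keys)
    kept = sorted((k, str(v)) for k, v in metadata.items() if k not in skip)
    return sha256((text + repr(kept)).encode("utf-8")).hexdigest()


embed_excluded = ["last_modified_date", "source_url"]
old = {"last_modified_date": "2026-04-22", "source_url": "https://example.com/a"}
new = {"last_modified_date": "2026-04-23", "source_url": "https://example.com/b"}

# Fallback default: both fields are skipped, so the change goes unnoticed.
fallback_detects = toy_hash("hi", old, embed_excluded) != toy_hash(
    "hi", new, embed_excluded
)
# Explicit hash list: source_url stays out of the embedding but in the hash.
explicit_detects = toy_hash(
    "hi", old, embed_excluded, ["last_modified_date"]
) != toy_hash("hi", new, embed_excluded, ["last_modified_date"])

print(fallback_detects, explicit_detects)
# prints: False True
```

The second case is the edge case named under Pros: a field excluded from embedding but still wanted in the hash.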

My recommendation

Option A. It fixes the bug with a one-line change, preserves the #17871 fix, and uses existing infrastructure. Options B and C are legitimate but come with larger trade-offs. Happy to be overruled by whichever option the maintainers prefer.

Impact

  • Production deployments using scheduled SimpleDirectoryReader ingestion over local corpora that change between runs re-embed modified files on every cross-day ingestion cycle, silently.
  • Users adding custom file_metadata callables that include timestamps (a natural pattern) experience re-embedding on every source modification at the format's precision.
  • get_resource_info() code paths in S3Reader, GCSReader, and similar readers populate sub-second datetime fields and would fire this bug on every source update.
  • Users who manually construct Documents with any volatile metadata field experience the same behavior.

Version

llama-index-core==0.14.21 (current at time of writing). Bug introduced in #18303, merged into llama-index-core prior to v0.12.20 release.

Steps to reproduce

See reproducer above, or run the repo at https://github.com/stirelli/llamaindex-embedding-churn. It has five progressive levels of evidence, including end-to-end verification against a real OpenAI API key for billed-cost confirmation.

I'd be happy to

Open a PR with the one-line fix plus a unit test demonstrating the regression is prevented, if the maintainers agree with the approach. Alternative fixes, such as adding a hash_mode parameter to give users control, are also possible. Happy to discuss the trade-off here first.

extent analysis

TL;DR

The most likely fix for the bug is to use MetadataMode.EMBED instead of MetadataMode.ALL in Node.hash to respect excluded_embed_metadata_keys and prevent unnecessary re-embedding of unchanged content.

Guidance

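  • Locate the Node.hash property in llama-index-core/llama_index/core/schema.py and change the mode argument from MetadataMode.ALL to MetadataMode.EMBED. PR #21462 applies the same change to TextNode.hash and get_transformation_hash, because IngestionCache still misses if only Node.hash is changed.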
  • Verify that excluded_embed_metadata_keys is correctly populated with volatile metadata fields, such as file-stat-derived fields, to ensure they are excluded from the hash computation.
  • Test the fix using the provided reproducer or by running the llamaindex-embedding-churn repo to confirm that the bug is resolved.
  • Consider the trade-offs of alternative fixes, such as removing metadata from the hash entirely or introducing a dedicated excluded_hash_metadata_keys field.

Example

# llama-index-core/llama_index/core/schema.py, Node.hash @property
-        metadata_str = self.get_metadata_str(mode=MetadataMode.ALL)
+        metadata_str = self.get_metadata_str(mode=MetadataMode.EMBED)

Notes

The fix assumes that excluded_embed_metadata_keys is correctly populated with volatile metadata fields. If this is not the case, additional changes may be required to ensure that these fields are excluded from the hash computation.

Recommendation

Apply the fix by using MetadataMode.EMBED instead of MetadataMode.ALL in Node.hash (and, per PR #21462, in TextNode.hash and get_transformation_hash). It is a small change that preserves the intent of the original fix (#18303) and reuses existing infrastructure.
