LlamaIndex - ✅ (Solved) Fix [Bug]: IngestionPipeline permanently overwrites docstore_strategy to DUPLICATES_ONLY when no vector store is attached [1 pull request, 4 comments, 2 participants]

GitHub stats

run-llama/llama_index#20823 (fetched 2026-04-08 00:30:44)

  • Comments: 4
  • Participants: 2
  • Timeline: 10 events
  • Reactions: 0
  • Timeline (top): commented ×4, labeled ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #20824: fix(core): preserve docstore_strategy across pipeline runs when no vector store is attached

Description (problem / solution / changelog)

Description

Fixes #20823

Fixed this by computing a local effective_strategy per run instead of mutating instance state, and added a UserWarning so users know when the fallback is happening. I also added regression tests for both sync and async paths.
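The pattern described above can be sketched with a toy class. This is a minimal illustration under stated assumptions, not the code from the PR: `Strategy`, `ToyPipeline`, and `_effective_strategy` are hypothetical stand-ins for `DocstoreStrategy`, `IngestionPipeline`, and whatever the real implementation uses internally.

```python
import warnings
from enum import Enum
from typing import Optional


class Strategy(Enum):
    # Hypothetical stand-in for llama_index's DocstoreStrategy.
    UPSERTS = "upserts"
    UPSERTS_AND_DELETE = "upserts_and_delete"
    DUPLICATES_ONLY = "duplicates_only"


class ToyPipeline:
    """Toy model of the pattern; not the real IngestionPipeline."""

    def __init__(self, strategy: Strategy, vector_store: Optional[object] = None):
        self.strategy = strategy
        self.vector_store = vector_store

    def _effective_strategy(self) -> Strategy:
        # Decide the strategy for this run only; never assign to self.strategy.
        needs_store = self.strategy in (Strategy.UPSERTS, Strategy.UPSERTS_AND_DELETE)
        if needs_store and self.vector_store is None:
            warnings.warn(
                f"{self.strategy.name} needs a vector store; "
                "falling back to DUPLICATES_ONLY for this run.",
                UserWarning,
                stacklevel=2,
            )
            return Strategy.DUPLICATES_ONLY
        return self.strategy

    def run(self) -> Strategy:
        effective = self._effective_strategy()
        # ... a real pipeline would deduplicate and transform documents here ...
        return effective
```

Because the fallback lives in a local value, attaching a vector store later and calling `run()` again picks up the configured strategy without any reset step.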

Also, 11 tests in the same file fired the new warning since they used the default docstore_strategy=UPSERTS without a vector store. I fixed this by explicitly setting docstore_strategy=DocstoreStrategy.DUPLICATES_ONLY on each.
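Why setting the strategy explicitly silences the warning can be shown with the standard library alone. The sketch below assumes a hypothetical helper, `resolve_strategy`, that mimics the fallback-with-warning behaviour; it is not part of llama_index, and the real tests in `test_pipeline.py` may look different.

```python
import warnings
from enum import Enum


class Strategy(Enum):
    # Hypothetical stand-in for DocstoreStrategy.
    UPSERTS = "upserts"
    DUPLICATES_ONLY = "duplicates_only"


def resolve_strategy(configured, has_vector_store):
    """Hypothetical helper mirroring the fallback-with-warning behaviour."""
    if configured is Strategy.UPSERTS and not has_vector_store:
        warnings.warn("UPSERTS without a vector store; using DUPLICATES_ONLY", UserWarning)
        return Strategy.DUPLICATES_ONLY
    return configured


def count_user_warnings(configured, has_vector_store):
    # Record every warning raised while resolving the strategy.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        resolve_strategy(configured, has_vector_store)
    return sum(issubclass(w.category, UserWarning) for w in caught)


# Default-style configuration: UPSERTS with no vector store triggers the warning.
default_case = count_user_warnings(Strategy.UPSERTS, has_vector_store=False)
# Explicit DUPLICATES_ONLY: nothing to fall back from, so no warning.
explicit_case = count_user_warnings(Strategy.DUPLICATES_ONLY, has_vector_store=False)
```

In a pytest suite the same check is usually written with `pytest.warns(UserWarning)` around the call that is expected to warn.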

Changed files

  • llama-index-core/llama_index/core/ingestion/pipeline.py (modified, +62/-41)
  • llama-index-core/tests/ingestion/test_pipeline.py (modified, +39/-1)

Bug Description

When an IngestionPipeline is configured with DocstoreStrategy.UPSERTS or DocstoreStrategy.UPSERTS_AND_DELETE and run() is called without a vector store attached, the pipeline permanently overwrites self.docstore_strategy with DUPLICATES_ONLY. The mutation persists for all later runs, even after a vector store is attached, so document updates never clean up stale embeddings in the vector store.

Version

0.14.15

Steps to Reproduce

  from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy
  from llama_index.core.storage.docstore import SimpleDocumentStore
  from llama_index.core.vector_stores import SimpleVectorStore
  from llama_index.core.schema import Document

  # Configure UPSERTS, but attach no vector store yet.
  pipeline = IngestionPipeline(
      transformations=[],
      docstore=SimpleDocumentStore(),
      docstore_strategy=DocstoreStrategy.UPSERTS,
  )

  # First run without a vector store: this is where the strategy gets overwritten.
  pipeline.run(documents=[Document(doc_id="doc1", text="hello")])

  print(pipeline.docstore_strategy)

  # Attach a vector store afterwards and update the same document.
  pipeline.vector_store = SimpleVectorStore()
  pipeline.run(documents=[Document(doc_id="doc1", text="updated content")])

  print(pipeline.docstore_strategy)

Relevant Logs/Tracebacks

# Expected: DocstoreStrategy.UPSERTS
# Actual: DocstoreStrategy.DUPLICATES_ONLY
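The mechanism behind this result can be reproduced without llama_index. The following is a simplified model under stated assumptions: `Strategy` and `BuggyPipeline` are hypothetical stand-ins, and the real `IngestionPipeline.run()` does considerably more than this.

```python
from enum import Enum


class Strategy(Enum):
    # Hypothetical stand-in for DocstoreStrategy.
    UPSERTS = "upserts"
    DUPLICATES_ONLY = "duplicates_only"


class BuggyPipeline:
    """Toy model of the bug: a per-run fallback written to instance state."""

    def __init__(self, strategy, vector_store=None):
        self.strategy = strategy
        self.vector_store = vector_store

    def run(self):
        if self.strategy is Strategy.UPSERTS and self.vector_store is None:
            # The problematic step: the fallback overwrites the configured value.
            self.strategy = Strategy.DUPLICATES_ONLY
        return self.strategy


pipeline = BuggyPipeline(Strategy.UPSERTS)
first = pipeline.run()            # no vector store: falls back and mutates
pipeline.vector_store = object()  # attach a store afterwards
second = pipeline.run()           # the configured UPSERTS is already lost
print(first, second)
```

Both runs report DUPLICATES_ONLY, because the second run reads the value the first run wrote rather than the value the user configured.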

extent analysis

Fix Plan

Fix Name

Prevent DocstoreStrategy mutation when no vector store is attached.

Steps to Fix

  1. Update IngestionPipeline class:

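    • In the IngestionPipeline class, check that a vector store is attached before running the pipeline with the UPSERTS or UPSERTS_AND_DELETE strategy.
    • If no vector store is attached, raise a ValueError with a descriptive message. This is a stricter alternative to the approach in PR #20824, which falls back for the current run only and emits a UserWarning.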

    class IngestionPipeline:
        # ...

        def run(self, documents):
            # Fail fast: both upsert strategies need a vector store to clean up stale embeddings.
            upsert_strategies = (DocstoreStrategy.UPSERTS, DocstoreStrategy.UPSERTS_AND_DELETE)
            if self.docstore_strategy in upsert_strategies and self.vector_store is None:
                raise ValueError("Vector store is required for UPSERTS or UPSERTS_AND_DELETE strategy")
            # ...

  2. Update SimpleIngestionPipeline (if applicable):

    • If you have a custom SimpleIngestionPipeline class that inherits from IngestionPipeline, make sure it keeps the same check, for example by delegating to the parent run().

    class SimpleIngestionPipeline(IngestionPipeline):
        # ...

        def run(self, documents):
            # Delegating to the parent keeps the vector-store check and returns its result.
            return super().run(documents)

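Verification

  1. Run the reproduction test case with the updated code.
  2. Verify that docstore_strategy remains DocstoreStrategy.UPSERTS after attaching a vector store and running the pipeline again.

  from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy
  from llama_index.core.storage.docstore import SimpleDocumentStore
  from llama_index.core.vector_stores import SimpleVectorStore
  from llama_index.core.schema import Document

  pipeline = IngestionPipeline(
      transformations=[],
      docstore=SimpleDocumentStore(),
      docstore_strategy=DocstoreStrategy.UPSERTS,
  )

  pipeline.run(documents=[Document(doc_id="doc1", text="hello")])

  print(pipeline.docstore_strategy)  # Should print: DocstoreStrategy.UPSERTS

  pipeline.vector_store = SimpleVectorStore()
  pipeline.run(documents=[Document(doc_id="doc1", text="updated content")])

  print(pipeline.docstore_strategy)  # Should print: DocstoreStrategy.UPSERTS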
