llamaIndex - ✅(Solved) Fix BaseExtractor crashes entire pipeline on transient LLM errors [1 pull requests, 2 comments, 2 participants]

GitHub stats
run-llama/llama_index#20692 · Fetched 2026-04-08 00:31:27
Comments: 2
Participants: 2
Timeline: 15
Reactions: 0
Timeline (top): referenced ×5, mentioned ×3, subscribed ×3, commented ×2

Error Message

When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because BaseExtractor.aprocess_nodes() calls aextract() with no error handling at all -- a single failed node kills the whole batch. None of the standard extractors (Title, Keyword, QA, Summary) has per-node error handling.

Root Cause

When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because BaseExtractor.aprocess_nodes() calls aextract() with no error handling at all -- a single failed node kills the whole batch.

Fix Action

Fixed

PR fix notes

PR #20693: Add retry and error handling to BaseExtractor

Description (problem / solution / changelog)

Fixes #20692. Related: #20054.

What changed

BaseExtractor.aprocess_nodes() calls aextract() with zero error handling. One transient LLM failure (rate limit, Azure content filter, network blip) crashes the whole ingestion pipeline. This is a real problem at scale -- the reporter in #20054 hits it every ~15,000 nodes.

This adds three opt-in fields to BaseExtractor:

| Field | Default | Behaviour |
| --- | --- | --- |
| max_retries | 0 | No retry (current behaviour) |
| retry_backoff | 1.0 | Base delay in seconds, exponential (1s, 2s, 4s, ...) |
| on_extraction_error | "raise" | "raise" = propagate error (current), "skip" = log warning + return empty metadata |

All defaults preserve current behaviour. The retry logic lives in a single _aextract_with_retry() private method called from aprocess_nodes(). Every extractor that inherits from BaseExtractor (Title, Keyword, QA, Summary, etc.) gets this for free.

Example

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

Retries up to 3x with exponential backoff (2s, 4s, 8s). If all retries fail, logs a warning and continues with empty metadata instead of crashing the pipeline.
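
The quoted delays follow from multiplying the base delay by a power of two on each attempt. The sketch below only reproduces that arithmetic. It assumes the delay before retry n (counting from zero) is retry_backoff * 2**n, which matches both the 1s, 2s, 4s and the 2s, 4s, 8s sequences given in the PR. The helper backoff_delays is hypothetical and not part of llama_index.

```python
def backoff_delays(max_retries: int, retry_backoff: float) -> list[float]:
    """Seconds slept before each retry (hypothetical helper, for illustration only)."""
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]


print(backoff_delays(3, 2.0))  # [2.0, 4.0, 8.0]
print(backoff_delays(3, 1.0))  # [1.0, 2.0, 4.0]
print(backoff_delays(0, 1.0))  # [] (the default: no retry, so no delay)
```

With max_retries=3 and retry_backoff=2.0, a node that never succeeds would wait about 14 seconds in total before the "skip" or "raise" policy applies.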

What did NOT change

  • run_jobs() in async_utils.py -- changing its error semantics would affect the entire codebase
  • Individual extractors -- they inherit resilience automatically
  • DocumentContextExtractor -- already has its own retry logic, no conflict
  • aextract() signature or return type -- fully backwards compatible

Testing

$ python3 -m pytest llama-index-core/tests/extractors/ -v
12 passed in 0.44s

7 new tests covering:

  • Default behaviour (raises on error, no retry)
  • Retry succeeds after transient failure
  • Skip policy returns empty metadata
  • Retries exhausted then raises
  • Retries exhausted then skips
  • Exponential backoff delays verified via mock
  • No-retry mode calls aextract exactly once

5 existing DocumentContextExtractor tests still pass.
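
Several of these cases, including "backoff delays verified via mock", could be checked roughly as sketched below. This is not the PR's test file: FlakyExtractor and run_with_mocked_sleep are hypothetical stand-ins written for this example, and the shape of "empty metadata" (one empty dict per node) is an assumption.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FlakyExtractor:
    """Hypothetical stand-in, not the llama_index BaseExtractor."""

    def __init__(self, failures, max_retries=0, retry_backoff=1.0,
                 on_extraction_error="raise"):
        self.failures = failures  # number of calls that fail before one succeeds
        self.calls = 0
        self.max_retries = max_retries
        self.retry_backoff = retry_backoff
        self.on_extraction_error = on_extraction_error

    async def aextract(self, nodes):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("simulated rate limit")
        return [{"title": n} for n in nodes]

    async def aextract_with_retry(self, nodes):
        for attempt in range(self.max_retries + 1):
            try:
                return await self.aextract(nodes)
            except Exception:
                if attempt < self.max_retries:
                    await asyncio.sleep(self.retry_backoff * (2 ** attempt))
                elif self.on_extraction_error == "skip":
                    return [{} for _ in nodes]  # assumed shape of empty metadata
                else:
                    raise


def run_with_mocked_sleep(extractor, nodes):
    """Run the retry loop with asyncio.sleep patched out; return (result, delays)."""
    with patch("asyncio.sleep", new_callable=AsyncMock) as fake_sleep:
        result = asyncio.run(extractor.aextract_with_retry(nodes))
    return result, [call.args[0] for call in fake_sleep.call_args_list]


# Two transient failures, then success on the third call.
ok, delays = run_with_mocked_sleep(
    FlakyExtractor(failures=2, max_retries=3, retry_backoff=2.0), ["a"]
)
# Every call fails; the "skip" policy returns empty metadata after 4 attempts.
skipped, skip_delays = run_with_mocked_sleep(
    FlakyExtractor(failures=10, max_retries=3, retry_backoff=2.0,
                   on_extraction_error="skip"),
    ["a", "b"],
)
print(ok, delays)
print(skipped, skip_delays)
```

Patching asyncio.sleep keeps the test fast while still recording the delay each retry requested.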

Changed files

  • llama-index-core/llama_index/core/extractors/interface.py (modified, +62/-1)
  • llama-index-core/tests/extractors/test_extractor_resilience.py (added, +118/-0)

Code Example

from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because BaseExtractor.aprocess_nodes() calls aextract() with no error handling at all -- a single failed node kills the whole batch.

This is the scenario described in #20054. The reporter hits this about every 15,000 nodes with Azure OpenAI guardrails.

Root cause

  1. aprocess_nodes() calls await self.aextract(new_nodes) on line 129 of interface.py with no try/catch
  2. run_jobs() in async_utils.py uses asyncio.gather() without return_exceptions=True, so one failed job kills the batch
  3. None of the standard extractors (Title, Keyword, QA, Summary) have per-node error handling
  4. Only DocumentContextExtractor has any resilience, but it's a hardcoded 5-retry with 60s backoff that only catches rate limit errors
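
Point 2 rests on standard asyncio.gather() behaviour, which can be seen in isolation. The sketch below uses only the standard library. The extract coroutine is a made-up stand-in for one extraction job, not llama_index code, and it does not reproduce run_jobs() itself.

```python
import asyncio


async def extract(node_id: int) -> dict:
    """Stand-in for one extraction job; node 2 simulates a transient LLM error."""
    await asyncio.sleep(0)
    if node_id == 2:
        raise RuntimeError("simulated content-filter rejection")
    return {"title": f"node-{node_id}"}


async def run_strict() -> str:
    # Default gather(): the first exception propagates to the caller.
    try:
        await asyncio.gather(*(extract(i) for i in range(4)))
    except RuntimeError as exc:
        return f"batch failed: {exc}"
    return "batch ok"


async def run_tolerant() -> list:
    # return_exceptions=True: failures come back in place, in input order.
    return await asyncio.gather(
        *(extract(i) for i in range(4)), return_exceptions=True
    )


strict_result = asyncio.run(run_strict())
tolerant_result = asyncio.run(run_tolerant())
print(strict_result)                               # batch failed: simulated content-filter rejection
print([type(r).__name__ for r in tolerant_result])  # ['dict', 'dict', 'RuntimeError', 'dict']
```

In the tolerant variant the caller has to check each result for exceptions, which is one reason a change to run_jobs() would ripple through every caller.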

Proposed fix

Add three configurable fields to BaseExtractor that all extractors inherit automatically:

  • max_retries (default 0 -- current behaviour, no retry)
  • retry_backoff (default 1.0s, exponential backoff)
  • on_extraction_error ("raise" or "skip" -- "raise" is current behaviour)

The retry logic lives in a single _aextract_with_retry() method called from aprocess_nodes(). Fully backwards compatible since all defaults match existing behaviour.

Example usage for someone hitting the Azure guardrail issue:

from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

This would retry up to 3 times with exponential backoff (2s, 4s, 8s), and if all retries fail, log a warning and continue with empty metadata instead of crashing.

extent analysis

Fix Plan

To address the issue of the ingestion pipeline crashing due to a single failed node, we will implement a retry mechanism with exponential backoff in the BaseExtractor class. This will allow extractors to inherit the retry logic automatically.

Step-by-Step Solution

  1. Add configurable fields to BaseExtractor:

    • max_retries: The maximum number of retries (default 0).
    • retry_backoff: The initial backoff time in seconds (default 1.0s).
    • on_extraction_error: The behavior on extraction error ("raise" or "skip", default "raise").
  2. Implement the _aextract_with_retry() method:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class BaseExtractor:
    # ... existing code ...

    async def _aextract_with_retry(self, nodes):
        """Call aextract() with retries, then apply on_extraction_error."""
        for attempt in range(self.max_retries + 1):
            try:
                return await self.aextract(nodes)
            except Exception as e:
                if attempt < self.max_retries:
                    # Exponential backoff: retry_backoff, then 2x, 4x, ...
                    delay = self.retry_backoff * (2 ** attempt)
                    logger.warning(
                        f"Extraction failed ({e!r}); retrying in {delay}s "
                        f"(attempt {attempt + 1}/{self.max_retries + 1})"
                    )
                    await asyncio.sleep(delay)
                elif self.on_extraction_error == "skip":
                    logger.warning(
                        f"Skipping extraction after {self.max_retries + 1} attempts: {e!r}"
                    )
                    # Empty metadata, one dict per node (assumed shape).
                    return [{} for _ in nodes]
                else:
                    raise
```

  3. Modify aprocess_nodes() to use _aextract_with_retry():

```python
async def aprocess_nodes(self, nodes):
    # ... existing code ...
    # Replaces the bare `await self.aextract(new_nodes)` call. Do not wrap this
    # in gather(..., return_exceptions=True): that would swallow the "raise" policy.
    metadata_list = await self._aextract_with_retry(new_nodes)
    # ... existing code ...
```

  4. Optionally, update run_jobs() in async_utils.py to use return_exceptions=True. Note that PR #20693 deliberately leaves run_jobs() unchanged, because changing its error semantics would affect the entire codebase:

```python
async def run_jobs(jobs):
    # Simplified sketch: failed jobs come back as exception objects in the result list.
    return await asyncio.gather(*jobs, return_exceptions=True)
```

### Verification
To verify the fix, run the ingestion pipeline with a node that is likely to fail (e.g., a node that triggers the Azure guardrail issue). With max_retries set and on_extraction_error="skip", the pipeline should no longer crash; it should log a warning and continue with empty metadata for the failed extraction.

### Example Usage
```python
from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)
```
