llamaIndex - ✅(Solved) Fix BaseExtractor crashes entire pipeline on transient LLM errors [1 pull requests, 2 comments, 2 participants]

llamaIndex 2026-02-12 21:43:57


GitHub stats

run-llama/llama_index#20692•Fetched 2026-04-08 00:31:27


Author

debu-sinha

Participants

debu-sinha

dosubot[bot]

Timeline (top)

referenced ×5, mentioned ×3, subscribed ×3, commented ×2

Error Message

When an LLM call fails during metadata extraction (e.g. Azure content safety false positive, rate limit, transient network error), the entire ingestion pipeline crashes. This happens because BaseExtractor.aprocess_nodes() calls aextract() with no error handling at all, so a single failed node kills the whole batch. None of the standard extractors (Title, Keyword, QA, Summary) has per-node error handling.

Root Cause

Fix Action

Fixed

Fixed by PR: Add retry and error handling to BaseExtractor (https://github.com/run-llama/llama_index/pull/20693)

PR fix notes

PR #20693: Add retry and error handling to BaseExtractor

Repository: run-llama/llama_index
Author: debu-sinha
State: closed | merged: True
Link: https://github.com/run-llama/llama_index/pull/20693

Description (problem / solution / changelog)

Fixes #20692 Related: #20054

What changed

BaseExtractor.aprocess_nodes() calls aextract() with zero error handling. One transient LLM failure (rate limit, Azure content filter, network blip) crashes the whole ingestion pipeline. This is a real problem at scale -- the reporter in #20054 hits it every ~15,000 nodes.

This adds three opt-in fields to BaseExtractor:

| Field | Default | Behaviour |
| --- | --- | --- |
| `max_retries` | `0` | No retry (current behaviour) |
| `retry_backoff` | `1.0` | Base delay in seconds, exponential (1s, 2s, 4s, ...) |
| `on_extraction_error` | `"raise"` | `"raise"` = propagate error (current), `"skip"` = log warning + return empty metadata |

All defaults preserve current behaviour. The retry logic lives in a single _aextract_with_retry() private method called from aprocess_nodes(). Every extractor that inherits from BaseExtractor (Title, Keyword, QA, Summary, etc.) gets this for free.
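The interplay of the three fields can be sketched outside of LlamaIndex. The snippet below is a minimal standalone illustration, not the code from the PR: `extract_with_retry` and `make_flaky_extract` are hypothetical names, and returning one empty dict per node for the `"skip"` policy is an assumption based on the "return empty metadata" wording above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def extract_with_retry(extract, nodes, max_retries=0, retry_backoff=1.0,
                             on_extraction_error="raise"):
    """Hypothetical stand-in for the private retry helper described above."""
    for attempt in range(max_retries + 1):
        try:
            return await extract(nodes)
        except Exception as exc:
            if attempt < max_retries:
                # Exponential backoff: retry_backoff * 1, 2, 4, ...
                delay = retry_backoff * (2 ** attempt)
                logger.warning("attempt %d failed (%s); retrying in %.3fs",
                               attempt + 1, exc, delay)
                await asyncio.sleep(delay)
            elif on_extraction_error == "skip":
                logger.warning("giving up after %d attempts; returning empty metadata",
                               max_retries + 1)
                # Assumption: "empty metadata" means one empty dict per node.
                return [{} for _ in nodes]
            else:
                raise


def make_flaky_extract(failures):
    """Build a fake extractor that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def extract(nodes):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("simulated transient LLM error")
        return [{"title": f"title for {n}"} for n in nodes]

    return extract, state
```

With the defaults (`max_retries=0`, `on_extraction_error="raise"`) the helper calls `extract` once and re-raises, which mirrors the "current behaviour" rows in the table.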

Example

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

Retries up to 3x with exponential backoff (2s, 4s, 8s). If all retries fail, logs a warning and continues with empty metadata instead of crashing the pipeline.
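The quoted schedule follows from doubling the base delay after each failed attempt. A small sketch, assuming the delay before retry `n` (counting from zero) is `retry_backoff * 2**n`; `backoff_delays` is a hypothetical helper name:

```python
def backoff_delays(max_retries, retry_backoff):
    """Delays slept before each retry, assuming retry_backoff * 2**attempt."""
    return [retry_backoff * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays(max_retries=3, retry_backoff=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0
```

Under that assumption, a batch that keeps failing waits 14 seconds in total before the `"skip"` policy takes over.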

What did NOT change

run_jobs() in async_utils.py -- changing its error semantics would affect the entire codebase
Individual extractors -- they inherit resilience automatically
DocumentContextExtractor -- already has its own retry logic, no conflict
aextract() signature or return type -- fully backwards compatible

Testing

$ python3 -m pytest llama-index-core/tests/extractors/ -v
12 passed in 0.44s

7 new tests covering:

Default behaviour (raises on error, no retry)
Retry succeeds after transient failure
Skip policy returns empty metadata
Retries exhausted then raises
Retries exhausted then skips
Exponential backoff delays verified via mock
No-retry mode calls aextract exactly once

5 existing DocumentContextExtractor tests still pass.
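One way to check backoff delays via a mock, as the test list mentions, is to patch `asyncio.sleep` so the test records the requested delays instead of waiting. The sketch below is not taken from test_extractor_resilience.py; `retry` is a hypothetical compact helper that exists only to make the example self-contained.

```python
import asyncio
from unittest.mock import AsyncMock, patch


async def retry(call, max_retries, retry_backoff):
    """Hypothetical compact retry helper used only for this test sketch."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception:
            if attempt == max_retries:
                raise
            await asyncio.sleep(retry_backoff * (2 ** attempt))


def run_backoff_check():
    # Fails twice, then succeeds on the third call.
    call = AsyncMock(side_effect=[RuntimeError("429"), RuntimeError("429"), "ok"])
    # Patch asyncio.sleep so delays are recorded rather than actually slept.
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = asyncio.run(retry(call, max_retries=3, retry_backoff=2.0))
    delays = [c.args[0] for c in fake_sleep.await_args_list]
    return result, delays, call.await_count
```

Because the sleep is mocked, the check runs instantly while still asserting on the exact delay sequence.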

Changed files

llama-index-core/llama_index/core/extractors/interface.py (modified, +62/-1)
llama-index-core/tests/extractors/test_extractor_resilience.py (added, +118/-0)

Code Example

from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

This is the scenario described in #20054. The reporter hits this about every 15,000 nodes with Azure OpenAI guardrails.

Root cause

1. aprocess_nodes() calls await self.aextract(new_nodes) on line 129 of interface.py with no try/except
2. run_jobs() in async_utils.py uses asyncio.gather() without return_exceptions=True, so one failed job kills the batch
3. None of the standard extractors (Title, Keyword, QA, Summary) have per-node error handling
4. Only DocumentContextExtractor has any resilience, but it's a hardcoded 5-retry with 60s backoff that only catches rate limit errors
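The `asyncio.gather()` behaviour behind the second point can be reproduced with plain asyncio. This is a simplified illustration, not `run_jobs()` itself; `job`, `gather_default` and `gather_collect` are hypothetical names.

```python
import asyncio


async def job(i):
    """Simulated extraction job; job 2 fails like a transient LLM error."""
    await asyncio.sleep(0)
    if i == 2:
        raise RuntimeError(f"job {i} failed")
    return i


async def gather_default():
    # The first exception propagates to the caller; other results are lost.
    return await asyncio.gather(*(job(i) for i in range(4)))


async def gather_collect():
    # Exceptions are returned in place, so the caller can inspect each slot.
    return await asyncio.gather(*(job(i) for i in range(4)), return_exceptions=True)
```

Running `asyncio.run(gather_default())` raises `RuntimeError`, while `asyncio.run(gather_collect())` returns a list whose third element is the exception object and whose other elements are the successful results.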

Proposed fix

Add three configurable fields to BaseExtractor that all extractors inherit automatically:

max_retries (default 0 -- current behaviour, no retry)
retry_backoff (default 1.0s, exponential backoff)
on_extraction_error ("raise" or "skip" -- "raise" is current behaviour)

The retry logic lives in a single _aextract_with_retry() method called from aprocess_nodes(). Fully backwards compatible since all defaults match existing behaviour.

Example usage for someone hitting the Azure guardrail issue:

from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)

This would retry up to 3 times with exponential backoff (2s, 4s, 8s), and if all retries fail, log a warning and continue with empty metadata instead of crashing.

extent analysis

Fix Plan

To address the issue of the ingestion pipeline crashing due to a single failed node, we will implement a retry mechanism with exponential backoff in the BaseExtractor class. This will allow extractors to inherit the retry logic automatically.

Step-by-Step Solution

1. **Add configurable fields to `BaseExtractor`**:
   - `max_retries`: The maximum number of retries (default `0`).
   - `retry_backoff`: The base backoff time in seconds, doubled after each failed attempt (default `1.0`).
   - `on_extraction_error`: The behaviour once retries are exhausted (`"raise"` or `"skip"`, default `"raise"`).

2. **Implement `_aextract_with_retry()`**:
```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class BaseExtractor:
    # ... existing code ...

    async def _aextract_with_retry(self, nodes):
        """Call aextract(), retrying with exponential backoff on failure."""
        for attempt in range(self.max_retries + 1):
            try:
                return await self.aextract(nodes)
            except Exception:
                if attempt < self.max_retries:
                    # Exponential backoff: retry_backoff * 1, 2, 4, ...
                    delay = self.retry_backoff * (2 ** attempt)
                    logger.warning(
                        "Extraction failed (attempt %d/%d); retrying in %.1fs",
                        attempt + 1, self.max_retries + 1, delay,
                    )
                    await asyncio.sleep(delay)
                elif self.on_extraction_error == "skip":
                    logger.warning(
                        "Skipping extraction after %d attempts", self.max_retries + 1
                    )
                    # Sketch assumption: empty metadata is one empty dict per node
                    return [{} for _ in nodes]
                else:
                    raise
```

3. **Modify `aprocess_nodes()` to use `_aextract_with_retry()`**:
```python
async def aprocess_nodes(self, nodes):
    # ... existing code ...
    # Previously: await self.aextract(new_nodes)
    metadata_list = await self._aextract_with_retry(new_nodes)
    # ... existing code ...
```

4. **(Optional) Update `run_jobs()` in `async_utils.py` to use `return_exceptions=True`**:
```python
async def run_jobs(jobs):
    # Simplified sketch: failed jobs come back as exception objects
    return await asyncio.gather(*jobs, return_exceptions=True)
```

The merged PR deliberately left `run_jobs()` unchanged, because changing its error semantics would affect the entire codebase. With `return_exceptions=True`, every caller has to check each result for an exception instance.

### Verification
To verify the fix, run the ingestion pipeline with a node that is likely to fail (e.g., a node that triggers the Azure guardrail issue). With retries enabled and `on_extraction_error="skip"`, the pipeline should no longer crash; it should log a warning and continue with empty metadata for the failed extraction.

### Example Usage
```python
from llama_index.core.extractors import TitleExtractor

extractor = TitleExtractor(
    llm=llm,
    max_retries=3,
    retry_backoff=2.0,
    on_extraction_error="skip",
)
```
