langchain - EnsembleRetriever: arank_fusion uses different normalization logic than rank_fusion, causing ValidationError in async

I noticed that EnsembleRetriever has two separate methods for normalizing retriever output into Document objects — rank_fusion (sync) and arank_fusion (async) — and they use different conditions for the wrapping logic.

rank_fusion (sync):

Document(page_content=cast("str", doc)) if isinstance(doc, str) else doc  # type: ignore[unreachable]

Only wraps values that are already strings.

arank_fusion (async):

Document(page_content=doc) if not isinstance(doc, Document) else doc

Wraps anything that isn't a Document — including None, integers, or other objects — by passing them straight into Document(page_content=...).

When a custom retriever returns a non-string, non-Document value through the async path, Pydantic rejects it with a ValidationError because page_content must be a string. The sync path doesn't crash — it passes the value through unchanged.
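The difference between the two conditions can be seen without LangChain installed. The following is a minimal sketch, not the library code: the `Document` dataclass is a hypothetical stand-in that raises `ValueError` where the real model would raise Pydantic's `ValidationError`, and `normalize_sync` / `normalize_async` are illustrative names that only copy the two conditions quoted above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Document:
    """Hypothetical stand-in for langchain_core's Document (not the real class)."""

    page_content: str

    def __post_init__(self) -> None:
        # Mimic the string check that Pydantic performs on the real model.
        if not isinstance(self.page_content, str):
            raise ValueError("page_content must be a string")


def normalize_sync(doc: Any) -> Any:
    # Condition quoted for rank_fusion: wrap only bare strings.
    return Document(page_content=doc) if isinstance(doc, str) else doc


def normalize_async(doc: Any) -> Any:
    # Condition quoted for arank_fusion: wrap anything that is not a Document.
    return Document(page_content=doc) if not isinstance(doc, Document) else doc


for value in ["text", Document("already wrapped"), 42, None]:
    try:
        async_result = repr(normalize_async(value))
    except ValueError as exc:
        async_result = f"raises {type(exc).__name__}"
    print(f"{value!r}: sync={normalize_sync(value)!r}, async={async_result}")
```

For strings and existing documents the two functions agree; they diverge only for values such as `42` or `None`, which is exactly the case the reproduction exercises.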

Error Message

pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
  page_content
    Input should be a valid string [type=string_type, ...]

Root Cause

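arank_fusion wraps every value that is not already a Document, so a value such as 42 or None is passed directly to Document(page_content=...). Because page_content must be a string, Pydantic raises a ValidationError. rank_fusion wraps only str values and leaves everything else unchanged, so the same retriever output does not trigger validation on the sync path.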

Fix Action

Fix / Workaround

Change the wrapping condition in arank_fusion to match rank_fusion, so that only bare strings are wrapped in Document and all other values pass through unchanged. Until that change is made, a custom retriever avoids the error if _aget_relevant_documents returns only Document objects, because arank_fusion leaves existing Document instances untouched.
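A retriever-side workaround is to make sure a custom retriever only ever returns Document objects, so neither fusion path has anything to wrap. The sketch below assumes a hypothetical helper named `to_documents` (it is not part of LangChain) and falls back to a small stand-in class when `langchain_core` is not installed.

```python
from typing import Any, Iterable, List

try:
    from langchain_core.documents import Document
except ImportError:
    from dataclasses import dataclass, field

    @dataclass
    class Document:  # minimal stand-in so the sketch runs without LangChain
        page_content: str
        metadata: dict = field(default_factory=dict)


def to_documents(items: Iterable[Any]) -> List[Document]:
    """Hypothetical helper: return Documents only, failing early on other types."""
    docs: List[Document] = []
    for item in items:
        if isinstance(item, Document):
            docs.append(item)  # already a Document, keep as-is
        elif isinstance(item, str):
            docs.append(Document(page_content=item))  # wrap bare strings
        else:
            # Fail inside the retriever, with a message that names the bad type.
            raise TypeError(f"cannot convert {type(item).__name__} to Document")
    return docs
```

Calling such a helper on the result of both `_get_relevant_documents` and `_aget_relevant_documents` would surface an unexpected value in the retriever itself rather than inside the ensemble.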

Submission checklist

  • This is a bug, not a usage question.
  • I added a clear and descriptive title that summarizes this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • This is not related to the langchain-community package.
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Package (Required)

  • langchain

Language

  • Python

LangChain / Python version info

langchain==0.3.x (latest)
langchain-core==0.3.x (latest)

Reproduction

import asyncio

from langchain_core.retrievers import BaseRetriever
from langchain.retrievers import EnsembleRetriever


class WeirdRetriever(BaseRetriever):
    """Custom retriever that returns a value that is neither a Document nor a str."""

    def _get_relevant_documents(self, query, *, run_manager=None):
        return [42]  # not a Document, not a string

    async def _aget_relevant_documents(self, query, *, run_manager=None):
        return [42]  # same value through the async path


ensemble = EnsembleRetriever(retrievers=[WeirdRetriever()], weights=[1.0])

# Sync: no crash, 42 passes through
ensemble.invoke("test")

# Async: raises ValidationError because Document(page_content=42) fails Pydantic validation
asyncio.run(ensemble.ainvoke("test"))

Error:

pydantic_core._pydantic_core.ValidationError: 1 validation error for Document
  page_content
    Input should be a valid string [type=string_type, ...]

Expected behavior

Both rank_fusion and arank_fusion should apply the same normalization. The async path should match the sync path and only wrap bare strings, not arbitrary non-Document values.

The fix is a one-line change in arank_fusion:

# before
Document(page_content=doc) if not isinstance(doc, Document) else doc

# after
Document(page_content=cast("str", doc)) if isinstance(doc, str) else doc  # type: ignore[unreachable]

This keeps the behavior consistent between the two methods.
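One way to keep the two paths from drifting apart again would be to route both through a single helper. This is only a sketch under assumptions: `wrap_bare_strings`, `fuse_sync`, and `fuse_async` are hypothetical names, the `Document` dataclass is a stand-in for the real class, and the ranking step of the actual methods is omitted.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Document:
    """Hypothetical stand-in for langchain_core's Document."""

    page_content: str


def wrap_bare_strings(docs: List[Any]) -> List[Any]:
    """Hypothetical shared helper: wrap str values, pass everything else through."""
    return [Document(page_content=d) if isinstance(d, str) else d for d in docs]


def fuse_sync(doc_lists: List[List[Any]]) -> List[List[Any]]:
    # Stand-in for the normalization step of the sync method.
    return [wrap_bare_strings(docs) for docs in doc_lists]


async def fuse_async(doc_lists: List[List[Any]]) -> List[List[Any]]:
    # Stand-in for the normalization step of the async method; same helper.
    return [wrap_bare_strings(docs) for docs in doc_lists]


sample = [["text", 42, Document("kept")]]
# A parity check like this could serve as a regression test.
assert fuse_sync(sample) == asyncio.run(fuse_async(sample))
```

With a shared helper, a later change to the wrapping rule applies to both paths at once, and the parity assertion documents that expectation.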
