langchain - 💡(How to fix) Fix Tool integration: anybrowse for Cloudflare-bypass web scraping in LangChain agents [3 comments, 3 participants]

GitHub stats
langchain-ai/langchain#36134 · Fetched 2026-04-08 01:08:02
Comments: 3 · Participants: 3 · Reactions: 0
Timeline: 5 events (commented ×3, closed ×1, labeled ×1)

LangChain's web loaders, including WebBaseLoader, fail on Cloudflare-protected sites: most major news outlets, LinkedIn, Amazon, and government pages. These failures are silent, which breaks research chains without raising an error.

Proposed tool

anybrowse uses real residential Chrome to bypass Cloudflare and return clean markdown. It could fit as a BaseTool or a BaseLoader. As a tool:

from langchain.tools import tool
import requests

@tool
def scrape_url(url: str) -> str:
    """Scrape any URL and return clean markdown, including Cloudflare-protected sites."""
    try:
        # A timeout keeps the agent from hanging on a stalled request.
        r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, timeout=60)
    except requests.RequestException as exc:
        return f"Scrape failed: {exc}"
    if r.ok:
        return r.json().get("markdown", "")
    return f"Scrape failed: {r.status_code}"

Or as a Document loader:

from typing import Optional

from langchain_core.documents import Document
import requests

class AnybrowseLoader:
    def __init__(self, url: str, api_key: Optional[str] = None):
        self.url = url
        self.api_key = api_key

    def load(self) -> list[Document]:
        headers = {}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        r = requests.post("https://anybrowse.dev/scrape",
                          json={"url": self.url}, headers=headers, timeout=60)
        # Raise on HTTP errors instead of failing later on a missing key.
        r.raise_for_status()
        data = r.json()
        return [Document(page_content=data.get("markdown", ""),
                         metadata={"source": self.url, "title": data.get("title", "")})]

extent analysis

Fix Plan

To resolve the issue of LangChain web loaders failing on Cloudflare-protected sites, we can integrate the anybrowse tool as a BaseTool or BaseLoader. Here are the concrete steps:

  • Option 1: Using anybrowse as a BaseTool
    1. Install the requests library if not already installed: pip install requests
    2. Use the provided scrape_url function as a tool in LangChain
    3. Example usage:

# scrape_url is already a LangChain tool because of the @tool decorator,
# so it can be invoked directly to check that it works.
output = scrape_url.invoke({"url": "https://example.com"})
print(output)

# To let a model call it, bind it to a chat model that supports tool calling.
# llm is a placeholder for a chat model of your choice.
# llm_with_tools = llm.bind_tools([scrape_url])


  • Option 2: Using anybrowse as a BaseLoader
    1. Install the requests library if not already installed: pip install requests
    2. Use the provided AnybrowseLoader class as a document loader in LangChain
    3. Example usage:

loader = AnybrowseLoader("https://example.com")
docs = loader.load()
for doc in docs:
    print(doc.page_content)

Verification

To verify that the fix worked, test the scrape_url function or the AnybrowseLoader class with a Cloudflare-protected URL and check if the returned markdown is correct.
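
One way to make "check if the returned markdown is correct" concrete is a small heuristic check on the scrape result. The sketch below is an assumption, not part of the issue: the function name looks_like_real_content, the 200-character threshold, and the marker strings (typical phrases from Cloudflare interstitial and block pages) are all illustrative, and the marker list is not exhaustive.

```python
# Hypothetical helper: names, threshold, and markers are illustrative assumptions.
CHALLENGE_MARKERS = (
    "just a moment...",
    "attention required! | cloudflare",
    "checking your browser",
)


def looks_like_real_content(markdown: str, min_chars: int = 200) -> bool:
    """Return True if a scrape result is plausibly page content.

    Rejects results that are empty or very short, and results that contain
    phrases commonly seen on Cloudflare challenge or block pages.
    """
    text = markdown.strip()
    if len(text) < min_chars:
        return False
    lowered = text.lower()
    return not any(marker in lowered for marker in CHALLENGE_MARKERS)
```

A check like this could be run on the string returned by scrape_url or on doc.page_content from the loader. Because it is a plain substring match, a genuine article that quotes one of these phrases would be rejected too, so it is best treated as a smoke test rather than proof.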

Extra Tips

  • Make sure to handle errors and exceptions properly when using the anybrowse tool or loader.
  • Consider implementing a retry mechanism to handle temporary failures.
  • Be aware of the usage limits and costs associated with the anybrowse API.
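
The retry tip above could be sketched as a small wrapper with exponential backoff. This is a minimal illustration under stated assumptions: call_with_retry and its parameters are hypothetical names, not part of LangChain or anybrowse, and the delays are arbitrary defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: tuple = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff on the given exceptions.

    Hypothetical helper: waits base_delay, then 2x, 4x, ... between attempts
    and re-raises the last error once the attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

With the loader from the issue, this might be used as call_with_retry(AnybrowseLoader(url).load, retry_on=(requests.RequestException,)), so that only network and HTTP errors are retried. Since each retry is another API call, the attempt count should be kept low with usage limits and costs in mind.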
