LlamaIndex: Feature: anybrowse web reader for Cloudflare-protected sites

GitHub stats
run-llama/llama_index#21104 · Fetched 2026-04-08 01:08:10
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Background

LlamaIndex web loaders fail on Cloudflare-protected sites. SimpleWebPageReader gets a 403 on ~60% of high-value URLs (news sites, LinkedIn, e-commerce, government pages).
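
To make the failure mode concrete, here is a minimal sketch of how a loader might recognise that a 403 came from Cloudflare rather than from the origin site. This is a heuristic, not something from the issue: the function name `looks_like_cloudflare_block` is hypothetical, and the check relies on response headers Cloudflare commonly sets (`Server: cloudflare`, `CF-RAY`, and `cf-mitigated: challenge`), which may not be present on every blocked response.

```python
from typing import Mapping


def looks_like_cloudflare_block(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: does this response look like a Cloudflare block or challenge?"""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.lower() for k, v in headers.items()}

    # Assumption: challenge pages are flagged with `cf-mitigated: challenge`.
    if h.get("cf-mitigated") == "challenge":
        return True

    # Otherwise require both a blocking status and a sign that Cloudflare served it.
    served_by_cloudflare = h.get("server", "") == "cloudflare" or "cf-ray" in h
    return status in (403, 503) and served_by_cloudflare
```

A loader could call this on a failed response to decide whether routing the URL through a browser-based service is worth trying, instead of treating every 403 the same way.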

Proposal: anybrowse integration

anybrowse handles this via real residential Chrome. It could fit as a BaseReader:

from typing import Optional

import requests
from llama_index.core import Document

def anybrowse_reader(url: str, api_key: Optional[str] = None) -> Document:
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The API key is optional; when given it is sent as a Bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, headers=headers, timeout=30)
    r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
    data = r.json()
    return Document(
        text=data["markdown"],
        metadata={"url": url, "title": data.get("title", "")}
    )

Or via MCP:

{
  "mcpServers": {
    "anybrowse": {
      "type": "streamable-http",
      "url": "https://anybrowse.dev/mcp"
    }
  }
}

Happy to contribute an AnybrowseReader to llama-hub if this seems useful.

extent analysis

Fix Plan

To resolve the issue of LlamaIndex web loaders failing on Cloudflare-protected sites, we will integrate anybrowse into the system. This can be achieved by creating an AnybrowseReader that utilizes the anybrowse API to fetch and parse web pages.

Step-by-Step Solution

  1. Install required libraries: Ensure you have requests and llama-index-core (which provides llama_index.core) installed. You can install them via pip:

pip install requests llama-index-core

  2. Implement AnybrowseReader: Create a new file anybrowse_reader.py and add the following code:

```python
from typing import Optional

import requests
from llama_index.core import Document


class AnybrowseReader:
    """Fetch a page through the anybrowse scrape endpoint and wrap it in a Document."""

    def __init__(self, api_key: Optional[str] = None):
        self.api_key = api_key  # optional; sent as a Bearer token when set

    def read(self, url: str) -> Document:
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        r = requests.post(
            "https://anybrowse.dev/scrape",
            json={"url": url},
            headers=headers,
            timeout=30,  # avoid hanging indefinitely on a slow scrape
        )
        r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error body
        data = r.json()
        return Document(
            text=data["markdown"],
            metadata={"url": url, "title": data.get("title", "")},
        )
```

  3. Configure AnybrowseReader: Initialize the AnybrowseReader with your anybrowse API key (if you have one). You can then use this reader to fetch documents from Cloudflare-protected sites.

Verification

To verify that the fix worked, you can test the AnybrowseReader with a Cloudflare-protected URL:

reader = AnybrowseReader(api_key="YOUR_API_KEY")
document = reader.read("https://example.com/cloudflare-protected-page")
print(document.text)

Replace "YOUR_API_KEY" with your actual anybrowse API key.
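
The check above needs network access and possibly an API key. As a complement, here is a minimal offline sketch that validates a response payload before it is turned into a Document. It assumes the response shape used in the issue's snippet (a JSON object with a `markdown` field and an optional `title`); the helper name `fields_from_payload` is hypothetical and not part of anybrowse or LlamaIndex.

```python
from typing import Any, Dict, Mapping, Tuple


def fields_from_payload(url: str, data: Mapping[str, Any]) -> Tuple[str, Dict[str, str]]:
    """Return (text, metadata) for a Document, or raise if the payload is unusable."""
    markdown = data.get("markdown")
    # Assumption: a successful scrape always carries non-empty markdown text.
    if not isinstance(markdown, str) or not markdown.strip():
        raise ValueError(f"no markdown content returned for {url}")
    title = data.get("title") or ""
    return markdown, {"url": url, "title": str(title)}


# Usage with a hand-written payload, so no request is made:
text, metadata = fields_from_payload(
    "https://example.com/cloudflare-protected-page",
    {"markdown": "# Example", "title": "Example"},
)
print(metadata["title"], len(text))
```

Inside the reader, the same helper could be fed `r.json()` and its result passed on as `Document(text=text, metadata=metadata)`, which turns a missing `markdown` key into a clear error rather than a bare KeyError.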

Extra Tips

  • Make sure to handle rate limits and errors properly when using the anybrowse API.
  • Consider implementing a fallback mechanism in case the anybrowse API is unavailable.
  • Refer to the anybrowse documentation for more information on usage and pricing.
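
The first two tips can be sketched as a retry wrapper with exponential backoff plus an ordered list of fallbacks. Everything here is an assumption for illustration: the `FetchError` type, the `Fetcher` interface (a callable that takes a URL and returns page text), and the set of retryable status codes are hypothetical, since the issue does not document how the anybrowse API signals rate limits.

```python
import time
from typing import Callable, List, Optional, Sequence


class FetchError(Exception):
    """Raised by a fetcher; `status` carries the HTTP status code if known."""

    def __init__(self, message: str, status: Optional[int] = None):
        super().__init__(message)
        self.status = status


Fetcher = Callable[[str], str]  # hypothetical interface: URL in, page text out

# Assumption: these statuses signal rate limiting or a transient outage.
RETRYABLE = {429, 502, 503, 504}


def fetch_with_retry(
    fetch: Fetcher,
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch`, retrying retryable failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except FetchError as exc:
            is_last = attempt == attempts - 1
            if exc.status not in RETRYABLE or is_last:
                raise
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")


def fetch_with_fallback(fetchers: Sequence[Fetcher], url: str, **retry_kwargs) -> str:
    """Try each fetcher in order and return the first successful result."""
    errors: List[FetchError] = []
    for fetch in fetchers:
        try:
            return fetch_with_retry(fetch, url, **retry_kwargs)
        except FetchError as exc:
            errors.append(exc)
    raise FetchError(f"all {len(fetchers)} fetchers failed for {url}: {errors}")
```

In practice the first fetcher would wrap the anybrowse call and translate HTTP errors into `FetchError`, and a later one could wrap a plain HTTP loader. Passing `sleep` as a parameter keeps the backoff testable without real waiting.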
