llamaIndex - 💡(How to fix) Fix Feature: anybrowse web reader for Cloudflare-protected sites [1 participants]

llamaIndex · 2026-03-21 01:24:13

GitHub stats

run-llama/llama_index#21104•Fetched 2026-04-08 01:08:10

Author

kc23go

Participants

kc23go

Background

LlamaIndex web loaders fail on Cloudflare-protected sites. SimpleWebPageReader gets a 403 on ~60% of high-value URLs (news sites, LinkedIn, e-commerce, government pages).

Proposal: anybrowse integration

anybrowse handles this by fetching pages with real residential Chrome. It could fit as a BaseReader:

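from typing import Optional

import requests
from llama_index.core import Document

def anybrowse_reader(url: str, api_key: Optional[str] = None) -> Document:
    """Fetch a page through anybrowse and wrap its Markdown in a Document."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # The key is optional; the free tier works without one.
        headers["Authorization"] = f"Bearer {api_key}"
    # The 30-second timeout is an arbitrary choice so a stalled request cannot hang forever.
    r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, headers=headers, timeout=30)
    r.raise_for_status()  # surface HTTP errors before parsing the body
    data = r.json()
    return Document(
        text=data["markdown"],
        metadata={"url": url, "title": data.get("title", "")}
    )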

Or via MCP:

{
  "mcpServers": {
    "anybrowse": {
      "type": "streamable-http",
      "url": "https://anybrowse.dev/mcp"
    }
  }
}

Free: 10/day, no key
Paid: $5 for 3,000 docs (never expire)
Docs: https://anybrowse.dev/docs

Happy to contribute an AnybrowseReader to llama-hub if this seems useful.

extent analysis

Fix Plan

To work around LlamaIndex web loaders failing on Cloudflare-protected sites, wrap the anybrowse API in an AnybrowseReader that posts a URL to the scrape endpoint and returns the resulting Markdown as a Document.

Step-by-Step Solution

1. **Install required libraries**: Ensure you have requests installed (LlamaIndex itself is assumed to be installed already). You can install it via pip:

pip install requests

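2. **Implement AnybrowseReader**: Create a new file `anybrowse_reader.py` and add the following code:

```python
from typing import Optional

import requests
from llama_index.core import Document

class AnybrowseReader:
    """Thin wrapper around the anybrowse scrape endpoint."""

    def __init__(self, api_key: Optional[str] = None):
        # The key is optional; the free tier works without one.
        self.api_key = api_key

    def read(self, url: str) -> Document:
        """Fetch one URL and return its Markdown as a Document."""
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        # The 30-second timeout is an arbitrary choice so a stalled request cannot hang forever.
        r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, headers=headers, timeout=30)
        r.raise_for_status()  # surface HTTP errors before parsing the body
        data = r.json()
        return Document(
            text=data["markdown"],
            metadata={"url": url, "title": data.get("title", "")}
        )
```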

3. **Configure AnybrowseReader**: Initialize the AnybrowseReader with your anybrowse API key (if you have one). You can then use this reader to fetch documents from Cloudflare-protected sites.
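The reader indexes `data["markdown"]` directly, so a response without that field would surface as a bare `KeyError`. A minimal sketch of validating the payload first is shown here; the helper name `parse_scrape_payload` is hypothetical, and the only response fields assumed are the `markdown` and `title` keys used in the issue.

```python
from typing import Any, Dict, Tuple


def parse_scrape_payload(payload: Dict[str, Any], url: str) -> Tuple[str, Dict[str, str]]:
    """Return (text, metadata) from a scrape response, or raise ValueError.

    Hypothetical helper: only the "markdown" and "title" keys from the
    issue are assumed; any other response shape is treated as an error.
    """
    markdown = payload.get("markdown")
    if not isinstance(markdown, str) or not markdown.strip():
        # Report which keys did arrive, to make debugging easier.
        raise ValueError(f"no markdown returned for {url}; got keys {sorted(payload)}")
    metadata = {"url": url, "title": str(payload.get("title", ""))}
    return markdown, metadata


text, metadata = parse_scrape_payload(
    {"markdown": "# Hi", "title": "Example"}, "https://example.com"
)
```

The returned pair can be passed straight to `Document(text=text, metadata=metadata)` inside the reader.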

Verification

To verify that the fix worked, you can test the AnybrowseReader with a Cloudflare-protected URL:

reader = AnybrowseReader(api_key="YOUR_API_KEY")
document = reader.read("https://example.com/cloudflare-protected-page")
print(document.text)

Replace "YOUR_API_KEY" with your actual anybrowse API key.
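Because the free tier is limited to 10 requests per day, it may help to check the request-building logic offline before calling the live service. The sketch below injects the HTTP call as a parameter so a stub can replace `requests.post`. `FakeResponse`, `scrape`, and `fake_post` are hypothetical names, and the function returns a plain dict rather than a `Document` to stay dependency-free.

```python
from typing import Any, Callable, Dict, Optional

SCRAPE_URL = "https://anybrowse.dev/scrape"  # endpoint taken from the issue


class FakeResponse:
    """Hypothetical stand-in for the object returned by requests.post."""

    def __init__(self, payload: Dict[str, Any], status_code: int = 200) -> None:
        self._payload = payload
        self.status_code = status_code

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    def json(self) -> Dict[str, Any]:
        return self._payload


def scrape(url: str, post: Callable[..., Any], api_key: Optional[str] = None) -> Dict[str, Any]:
    """Same request logic as the reader, with the HTTP call injected."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    response = post(SCRAPE_URL, json={"url": url}, headers=headers)
    response.raise_for_status()
    data = response.json()
    return {"text": data["markdown"], "metadata": {"url": url, "title": data.get("title", "")}}


sent = []  # records every call the stub receives


def fake_post(endpoint, json, headers):
    sent.append((endpoint, json, headers))
    return FakeResponse({"markdown": "# Hello", "title": "Example"})


result = scrape("https://example.com", fake_post, api_key="test-key")
```

Passing `requests.post` in place of `fake_post` would send the real request, since both accept the `json` and `headers` keyword arguments.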

Extra Tips

Make sure to handle rate limits and errors properly when using the anybrowse API.
Consider implementing a fallback mechanism in case the anybrowse API is unavailable.
Refer to the anybrowse documentation for more information on usage and pricing.
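The first two tips can be sketched as follows, under stated assumptions. How anybrowse signals a rate limit is not described in the issue, so `RateLimitError` is a hypothetical exception that a fetcher would raise itself (for example on an HTTP 429 status). `fetch_with_retry` and `fetch_with_fallback` are illustrative names, not part of any library.

```python
import time
from typing import Callable, List, Sequence


class RateLimitError(Exception):
    """Hypothetical: raised by a fetcher when the service reports a rate limit."""


def fetch_with_retry(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a rate-limited fetch with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x, ... base_delay
    raise AssertionError("unreachable")


def fetch_with_fallback(url: str, fetchers: Sequence[Callable[[str], str]]) -> str:
    """Try each fetcher in order and return the first successful result."""
    errors: List[Exception] = []
    for fetch in fetchers:
        try:
            return fetch(url)
        except Exception as exc:  # any failure moves on to the next fetcher
            errors.append(exc)
    raise RuntimeError(f"all {len(errors)} fetchers failed for {url}")
```

One possible arrangement is to list a plain loader first and the anybrowse-backed fetcher second, so the paid quota is spent only when the direct request fails.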
