llamaIndex - ✅(Solved) Fix Add CRW web scraper reader integration [1 pull requests, 3 comments, 2 participants]

us · 2026-03-26T14:18:28Z

# PR #21177: feat: Add CRW web scraper reader integration

- Repository: run-llama/llama_index
- Author: rainbowgore
- State: closed | merged: False
- Link: https://github.com/run-llama/llama_index/pull/21177

## Description (problem / solution / changelog)

# Description

CRW web scraper reader added to the `llama-index-readers-web` package.

CRW is an open-source, self-hosted, Firecrawl-compatible web scraper written in Rust. It exposes a local REST API with no cloud account or API key required by default (optional Bearer token supported for authenticated deployments and the managed cloud at fastcrw.com).

Three modes are supported:

- **scrape**: single page to one Document (POST /v1/scrape)
- **crawl**: async BFS crawl with polling, returns one Document per page (POST /v1/crawl + GET /v1/crawl/{id})
- **map**: link discovery, returns one Document per URL (POST /v1/map)

Mode can be set on the constructor or overridden per-call via `load_data(url, mode=...)`. Both `CrwWebReader` and `CrwReader` (alias matching the issue proposal) are exported.

No new dependencies required (`requests` is already a dependency of `llama-index-readers-web`).

Fixes #21167

## New Package?

Did I fill in the `tool.llamahub` section in the `pyproject.toml` and provide a detailed README.md for my new integration or package?

- [ ] Yes
- [x] No

## Version Bump?

Did I bump the version in the `pyproject.toml` file of the package I am updating? (Except for the `llama-index-core` package)

- [x] Yes
- [ ] No

## Type of Change

- [x] New feature (non-breaking change which adds functionality)

## How Has This Been Tested?

- [x] I added new unit tests to cover this change

22 unit tests covering all three modes (scrape, crawl, map), error handling, mode override on `load_data`, polling, timeout, and the `CrwReader` alias.

Additionally verified against a live self-hosted CRW server with scrape returning real content, crawl returning 99 documents, and map returning 0 (target site blocks link discovery, expected behavior).

## Suggested Checklist:

- [x] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added Google Colab support for the newly added notebooks.
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I ran `uv run make format; uv run make lint` to appease the lint gods

## Changed files

- `llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/crw/README.md` (added, +24/-0)
- `llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/crw/__init__.py` (added, +19/-0)
- `llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py` (modified, +3/-0)
- `llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/crw_web/__init__.py` (added, +3/-0)
- `llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/crw_web/base.py` (added, +185/-0)
- `llama-index-integrations/readers/llama-index-readers-web/pyproject.toml` (modified, +3/-1)
- `llama-index-integrations/readers/llama-index-readers-web/tests/test_crw_web_reader.py` (added, +265/-0)

## Feature Request

Add a CRW reader integration (`llama-index-readers-crw`) for web scraping, crawling, and site mapping.

### What is CRW?

[CRW](https://github.com/us/crw) is an open-source, Firecrawl-compatible web scraper built in Rust. Single binary, ~6 MB idle RAM, built-in MCP server.

### Proposed Integration

A `CrwReader` extending `BaseReader` with three modes:

- **scrape**: Single URL → Documents (markdown)
- **crawl**: BFS crawl → multiple Documents
- **map**: URL discovery → Documents with URLs

### API Endpoints

- `POST /v1/scrape`: single page
- `POST /v1/crawl` + `GET /v1/crawl/{id}`: async crawl with polling
- `POST /v1/map`: sitemap/link discovery

### Example Usage

```python
from llama_index.readers.crw import CrwReader

reader = CrwReader()  # defaults to localhost:3000
docs = reader.load_data(url="https://example.com", mode="scrape")
```

I have a working implementation ready and can submit a PR once this is approved.

- GitHub: https://github.com/us/crw
- API docs: https://us.github.io/crw/rest-api
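The PR description mentions an optional Bearer token for authenticated deployments. A minimal sketch of how request headers might be built for that case follows; the function name `build_headers` and the parameter name `api_key` are hypothetical and are not taken from the PR.

```python
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Build HTTP headers for a CRW request (hypothetical helper, not from the PR)."""
    headers = {"Content-Type": "application/json"}
    # Only attach the Authorization header when a token is configured,
    # since the default self-hosted setup needs no API key.
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return headers
```

With no token the headers carry only the content type, which matches the default self-hosted setup described above.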


GitHub stats

run-llama/llama_index#21167•Fetched 2026-04-08 01:36:24

Participants

rainbowgore

Timeline (top)

commented ×3, cross-referenced ×1, mentioned ×1, subscribed ×1

extent analysis

Fix Plan

To integrate the CRW reader, implement the CrwReader class as a client for the REST API endpoints that the CRW server exposes.

Step-by-Step Solution

Implement the CrwReader class extending BaseReader:

from llama_index.core.readers.base import BaseReader


class CrwReader(BaseReader):
    def __init__(self, host: str = "localhost", port: int = 3000):
        # Address of the self-hosted CRW server (defaults to localhost:3000)
        self.host = host
        self.port = port

    def load_data(self, url: str, mode: str):
        # Dispatch to the helper that matches the requested mode
        if mode == "scrape":
            return self._scrape(url)
        elif mode == "crawl":
            return self._crawl(url)
        elif mode == "map":
            return self._map(url)
        else:
            raise ValueError(f"Invalid mode: {mode!r} (expected 'scrape', 'crawl', or 'map')")
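The PR description says the mode can be set on the constructor or overridden per call. A minimal sketch of that resolution, paired with the endpoint paths listed in the issue, might look like the following; `MODE_ENDPOINTS` and `resolve_endpoint` are hypothetical names used only for illustration.

```python
from typing import Optional

# Endpoint paths as listed in the issue; the crawl status endpoint
# (GET /v1/crawl/{id}) is handled separately during polling.
MODE_ENDPOINTS = {
    "scrape": "/v1/scrape",
    "crawl": "/v1/crawl",
    "map": "/v1/map",
}


def resolve_endpoint(base_url: str, default_mode: str, mode: Optional[str] = None) -> str:
    """Return the URL to POST to, letting a per-call mode override the default."""
    effective_mode = mode if mode is not None else default_mode
    if effective_mode not in MODE_ENDPOINTS:
        raise ValueError(f"Invalid mode: {effective_mode!r}")
    return base_url.rstrip("/") + MODE_ENDPOINTS[effective_mode]
```

A lookup table keeps the mode validation and the URL construction in one place, so adding a mode later would touch a single dictionary.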

Implement the _scrape, _crawl, and _map methods:

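import time

import requests

class CrwReader(BaseReader):
    # ...

    def _scrape(self, url: str):
        response = requests.post(f"http://{self.host}:{self.port}/v1/scrape", json={"url": url})
        response.raise_for_status()  # surface HTTP errors instead of parsing an error body
        return response.json()

    def _crawl(self, url: str, poll_interval: float = 2.0, timeout: float = 300.0):
        response = requests.post(f"http://{self.host}:{self.port}/v1/crawl", json={"url": url})
        response.raise_for_status()
        crawl_id = response.json()["id"]
        # Poll for crawl completion, sleeping between requests and giving up after `timeout` seconds
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            response = requests.get(f"http://{self.host}:{self.port}/v1/crawl/{crawl_id}")
            response.raise_for_status()
            body = response.json()
            if body["status"] == "completed":
                # The "documents" field name is assumed here; check the CRW API docs
                return body["documents"]
            time.sleep(poll_interval)
        raise TimeoutError(f"Crawl {crawl_id} did not complete within {timeout} seconds")

    def _map(self, url: str):
        response = requests.post(f"http://{self.host}:{self.port}/v1/map", json={"url": url})
        response.raise_for_status()
        return response.json()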
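The crawl mode is described as asynchronous with polling, and the PR's tests cover polling and timeout. One way to make that logic testable without a network is to factor it into a helper that takes the status fetcher, the sleep function, and the clock as parameters. This is a sketch under those assumptions; `poll_until_completed` and its parameters are hypothetical names, and only the "completed" status value comes from the plan above.

```python
import time
from typing import Any, Callable, Dict


def poll_until_completed(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status repeatedly until it reports status == "completed".

    Raises TimeoutError if the deadline passes before completion.
    """
    deadline = clock() + timeout
    while True:
        body = fetch_status()
        if body.get("status") == "completed":
            return body
        if clock() >= deadline:
            raise TimeoutError(f"crawl did not complete within {timeout} seconds")
        # Wait before the next request so the server is not hammered
        sleep(poll_interval)
```

Injecting `sleep` and `clock` lets a unit test run instantly with fake values instead of waiting on real time.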

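The CRW server already exposes POST /v1/scrape, POST /v1/crawl, GET /v1/crawl/{id}, and POST /v1/map, so the reader itself does not create them. For local testing without a CRW server, a stub application with the same routes can stand in:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class UrlRequest(BaseModel):
    # JSON body sent by the reader: {"url": "..."}
    url: str


@app.post("/v1/scrape")
def scrape(request: UrlRequest):
    # Implement scrape logic
    pass


@app.post("/v1/crawl")
def crawl(request: UrlRequest):
    # Implement crawl logic
    pass


@app.get("/v1/crawl/{crawl_id}")
def get_crawl(crawl_id: str):
    # Implement logic to retrieve crawl status and documents
    pass


@app.post("/v1/map")
def map_urls(request: UrlRequest):
    # Implement map logic (named map_urls to avoid shadowing the built-in map)
    pass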

Verification

To verify the fix, test the CrwReader class and its API endpoints using the example usage provided:

from llama_index.readers.crw import Crw
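Without a running CRW server, the request shape can also be checked against a throwaway stub built from the standard library. The sketch below is an illustration only: the handler, its echo-style response body, and the helper names are all hypothetical and do not reflect CRW's real response schema.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubCrwHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a CRW server; the response body is invented."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        if self.path == "/v1/scrape":
            status, body = 200, {"echo_url": payload.get("url")}
        else:
            status, body = 404, {"error": "not found"}
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response, bypassing any proxy settings."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def run_stub_check() -> dict:
    """Start the stub on a free port, send one scrape request, and shut down."""
    server = HTTPServer(("127.0.0.1", 0), StubCrwHandler)  # port 0 picks a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return post_json(f"http://127.0.0.1:{port}/v1/scrape", {"url": "https://example.com"})
    finally:
        server.shutdown()
        server.server_close()
```

This confirms only that the client sends a JSON body with a `url` field to the expected path; verifying real content still requires a live CRW server, as the PR author did.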

#api #configuration error #environment variable #network issue #logging issue
