LlamaIndex - ✅ (Solved) Add CRW web scraper reader integration [1 pull request, 3 comments, 2 participants]

GitHub stats
run-llama/llama_index#21167 · Fetched 2026-04-08 01:36:24
Comments: 3 · Participants: 2 · Timeline: 6 · Reactions: 0
Timeline (top): commented ×3, cross-referenced ×1, mentioned ×1, subscribed ×1

PR fix notes

PR #21177: feat: Add CRW web scraper reader integration

Description (problem / solution / changelog)

Description

CRW web scraper reader added to the llama-index-readers-web package. CRW is an open-source, self-hosted, Firecrawl-compatible web scraper written in Rust. It exposes a local REST API with no cloud account or API key required by default (optional Bearer token supported for authenticated deployments and the managed cloud at fastcrw.com).
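The optional Bearer token described above could be handled roughly as follows. This is a minimal sketch, not the code from the PR: the helper name build_headers and its api_key parameter are hypothetical, chosen only to illustrate that the Authorization header is added when a token is supplied and omitted otherwise.

```python
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, adding a Bearer token only when one is given.

    Hypothetical helper: the name and parameter are assumptions for illustration.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Authenticated deployments expect "Authorization: Bearer <token>"
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Local, unauthenticated server: no Authorization header is sent
print(build_headers())
# Authenticated deployment: the token is attached
print(build_headers("my-token"))
```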

Three modes are supported:

  • scrape: single page to one Document (POST /v1/scrape)
  • crawl: async BFS crawl with polling, returns one Document per page (POST /v1/crawl + GET /v1/crawl/{id})
  • map: link discovery, returns one Document per URL (POST /v1/map)

Mode can be set on the constructor or overridden per-call via load_data(url, mode=...).
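One way to picture the constructor default and the per-call override is the sketch below. It is an assumption-laden illustration rather than the PR's implementation: the class name ModeResolver, the resolve method, and the "scrape" default are all hypothetical.

```python
from typing import Optional

VALID_MODES = ("scrape", "crawl", "map")


class ModeResolver:
    """Hypothetical stand-in showing how a per-call mode can override a default."""

    def __init__(self, mode: str = "scrape") -> None:
        # The "scrape" default is an assumption made for this sketch
        self.mode = self._validate(mode)

    @staticmethod
    def _validate(mode: str) -> str:
        if mode not in VALID_MODES:
            raise ValueError(f"Invalid mode {mode!r}; expected one of {VALID_MODES}")
        return mode

    def resolve(self, mode: Optional[str] = None) -> str:
        # A mode passed per call wins; otherwise fall back to the constructor value
        return self._validate(mode) if mode is not None else self.mode


resolver = ModeResolver(mode="crawl")
print(resolver.resolve())       # uses the constructor mode
print(resolver.resolve("map"))  # per-call override
```

Validating in both places means an invalid mode fails early, whether it was set on the constructor or passed to a single call.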

Both CrwWebReader and CrwReader (alias matching the issue proposal) are exported.

No new dependencies required (requests is already a dependency of llama-index-readers-web).

Fixes #21167

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

  • I added new unit tests to cover this change

22 unit tests covering all three modes (scrape, crawl, map), error handling, mode override on load_data, polling, timeout, and the CrwReader alias. Additionally verified against a live self-hosted CRW server with scrape returning real content, crawl returning 99 documents, and map returning 0 (target site blocks link discovery, expected behavior).
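The polling and timeout behaviour mentioned above can be tested without a live server if the status fetch, the sleep, and the clock are injectable. The sketch below shows that idea only; the function name poll_until_done, its parameters, and the "completed" and "failed" status strings are assumptions, not details taken from the PR's test suite.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval: float = 1.0,
    timeout: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch_status until it reports completion, failure, or the timeout passes."""
    deadline = clock() + timeout
    while True:
        payload = fetch_status()
        status = payload.get("status")
        if status == "completed":  # assumed status value
            return payload
        if status == "failed":  # assumed status value
            raise RuntimeError("crawl failed")
        if clock() >= deadline:
            raise TimeoutError(f"crawl did not complete within {timeout} seconds")
        sleep(interval)


# A fake status source: one in-progress response, then completion
responses = iter([{"status": "scraping"}, {"status": "completed"}])
result = poll_until_done(lambda: next(responses), sleep=lambda _seconds: None)
print(result["status"])
```

Passing a no-op sleep keeps the test fast, and a fake clock lets the timeout branch be exercised deterministically.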

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added Google Colab support for the newly added notebooks.
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

Changed files

  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/crw/README.md (added, +24/-0)
  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/crw/__init__.py (added, +19/-0)
  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/__init__.py (modified, +3/-0)
  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/crw_web/__init__.py (added, +3/-0)
  • llama-index-integrations/readers/llama-index-readers-web/llama_index/readers/web/crw_web/base.py (added, +185/-0)
  • llama-index-integrations/readers/llama-index-readers-web/pyproject.toml (modified, +3/-1)
  • llama-index-integrations/readers/llama-index-readers-web/tests/test_crw_web_reader.py (added, +265/-0)

Code Example

from llama_index.readers.crw import CrwReader

reader = CrwReader()  # defaults to localhost:3000
docs = reader.load_data(url="https://example.com", mode="scrape")

Feature Request

Add a CRW reader integration (llama-index-readers-crw) for web scraping, crawling, and site mapping.

What is CRW?

CRW is an open-source, Firecrawl-compatible web scraper built in Rust. Single binary, ~6 MB idle RAM, built-in MCP server.

Proposed Integration

A CrwReader extending BaseReader with three modes:

  • scrape: Single URL → Documents (markdown)
  • crawl: BFS crawl → multiple Documents
  • map: URL discovery → Documents with URLs

API Endpoints

  • POST /v1/scrape — single page
  • POST /v1/crawl + GET /v1/crawl/{id} — async crawl with polling
  • POST /v1/map — sitemap/link discovery
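The endpoint list above maps naturally onto a small lookup table. The following is a sketch under assumptions: the names ENDPOINTS, endpoint_for, and crawl_status_url are hypothetical, and the http scheme on the localhost:3000 default is assumed.

```python
from typing import Dict, Tuple

# Mode -> (HTTP method, path), taken from the endpoint list in the issue
ENDPOINTS: Dict[str, Tuple[str, str]] = {
    "scrape": ("POST", "/v1/scrape"),
    "crawl": ("POST", "/v1/crawl"),
    "map": ("POST", "/v1/map"),
}

DEFAULT_BASE_URL = "http://localhost:3000"  # scheme is an assumption


def endpoint_for(mode: str, base_url: str = DEFAULT_BASE_URL) -> Tuple[str, str]:
    """Return the HTTP method and full URL used to start a request in this mode."""
    method, path = ENDPOINTS[mode]
    return method, base_url.rstrip("/") + path


def crawl_status_url(crawl_id: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Return the URL polled with GET while an async crawl is running."""
    return f"{base_url.rstrip('/')}/v1/crawl/{crawl_id}"


print(endpoint_for("scrape"))
print(crawl_status_url("abc123"))
```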

Example Usage

from llama_index.readers.crw import CrwReader

reader = CrwReader()  # defaults to localhost:3000
docs = reader.load_data(url="https://example.com", mode="scrape")

I have a working implementation ready and can submit a PR once this is approved.

extent analysis

Fix Plan

To integrate the CRW reader, implement a CrwReader class that acts as a client for the REST endpoints exposed by a running CRW server.

Step-by-Step Solution

  • Implement the CrwReader class extending BaseReader:
from llama_index.core.readers.base import BaseReader

class CrwReader(BaseReader):
    """Reader that talks to a CRW server (defaults to localhost:3000)."""

    VALID_MODES = ("scrape", "crawl", "map")

    def __init__(self, host: str = "localhost", port: int = 3000):
        self.host = host
        self.port = port

    def load_data(self, url: str, mode: str):
        # Dispatch to the helper that matches the requested mode
        if mode == "scrape":
            return self._scrape(url)
        elif mode == "crawl":
            return self._crawl(url)
        elif mode == "map":
            return self._map(url)
        else:
            raise ValueError(
                f"Invalid mode {mode!r}; expected one of {self.VALID_MODES}"
            )
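  • Implement the _scrape, _crawl, and _map methods. The response field names ("id", "status", "documents") and the "failed" status are assumed here, so check them against the CRW API before relying on them:
import time

import requests

class CrwReader(BaseReader):
    # ...

    def _post(self, path: str, url: str) -> dict:
        # Shared helper: POST the target URL as JSON and fail loudly on HTTP errors
        response = requests.post(
            f"http://{self.host}:{self.port}{path}", json={"url": url}, timeout=30
        )
        response.raise_for_status()
        return response.json()

    def _scrape(self, url: str):
        return self._post("/v1/scrape", url)

    def _crawl(self, url: str, poll_interval: float = 2.0, max_wait: float = 300.0):
        crawl_id = self._post("/v1/crawl", url)["id"]
        status_url = f"http://{self.host}:{self.port}/v1/crawl/{crawl_id}"
        deadline = time.monotonic() + max_wait
        # Poll until the crawl completes, fails, or the deadline passes
        while time.monotonic() < deadline:
            response = requests.get(status_url, timeout=30)
            response.raise_for_status()
            result = response.json()
            if result["status"] == "completed":
                return result["documents"]
            if result["status"] == "failed":
                raise RuntimeError(f"Crawl {crawl_id} failed")
            # Sleep between polls instead of busy-looping against the server
            time.sleep(poll_interval)
        raise TimeoutError(f"Crawl {crawl_id} did not complete within {max_wait} seconds")

    def _map(self, url: str):
        return self._post("/v1/map", url)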
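  • The POST /v1/scrape, POST /v1/crawl, GET /v1/crawl/{id}, and POST /v1/map endpoints are served by the CRW server itself, so the reader does not need to create them. A stub like the following is only useful as a local stand-in when testing the reader without a CRW instance:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class UrlRequest(BaseModel):
    # Matches the {"url": ...} JSON body that the reader sends
    url: str

@app.post("/v1/scrape")
def scrape(request: UrlRequest):
    # Replace with a canned scrape response for tests
    raise HTTPException(status_code=501, detail="scrape stub not implemented")

@app.post("/v1/crawl")
def crawl(request: UrlRequest):
    # Replace with a canned response that returns a crawl id
    raise HTTPException(status_code=501, detail="crawl stub not implemented")

@app.get("/v1/crawl/{crawl_id}")
def get_crawl(crawl_id: str):
    # Replace with a canned crawl status and result payload
    raise HTTPException(status_code=501, detail="crawl status stub not implemented")

@app.post("/v1/map")
def map_links(request: UrlRequest):
    # Named map_links to avoid shadowing the built-in map()
    raise HTTPException(status_code=501, detail="map stub not implemented")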

Verification

To verify the fix, test the CrwReader class and its API endpoints using the example usage provided:

from llama_index.readers.crw import Crw
