# Deepcrawl: Open-Source Firecrawl Alternative ## Overview **Repository:** [github.com/lumpinif/deepcrawl](https://github.com/lumpinif/deepcrawl) **Documentation:** [deepcrawl.dev/docs](https://deepcrawl.dev/docs) **License:** MIT (100% open source) **Author:** Felix Lu (@felixLu) **Stars:** 318+ | **Forks:** 35 Deepcrawl is a 100% free, open-source alternative to Firecrawl - an agents-oriented website data context extraction platform designed to make any website data AI-ready. --- ## What It Does Deepcrawl extracts: - **Cleaned markdown** of page content - - **Hierarchical links tree** (agent-favored navigation structure) - - **Page metadata** that LLMs can digest with minimal token cost The goal is to reduce context switching and hallucination when feeding web content to AI agents. --- ## Technical Architecture ### Core Stack | Component | Technology | Purpose | |-----------|------------|---------| | **API Workers** | Cloudflare Workers | Edge-first fetching, normalization, link graph generation | | **Dashboard** | Next.js 16 App Router | UI for crawls, API playground, logs, key management | | **Routing/Middleware** | Hono | Lightweight edge-first framework | | **RPC Layer** | oRPC | Contract-first RPC generating OpenAPI + SDK from single source | | **Auth** | better-auth | Enterprise-grade session and API key management | | **Schemas** | Zod v4 | Type safety across workers, SDKs, and docs | | **Database** | Drizzle ORM | Supports both Neon PostgreSQL and Cloudflare D1 SQLite | | **Caching** | Upstash Redis + Cloudflare KV | Layered caching for performance/cost balance | | **UI** | shadcn/ui + Tailwind CSS v4 | Accessible component system | | **State Management** | nuqs | URL state serialization for shareable configurations | | **Data Fetching** | React Query | Server-side prefetching and cache sync | | **Build** | Turborepo (monorepo) | Shared packages, database modules, scripts | | **Bundler** | tsdown (Rolldown-based) | Rust-powered bundling | | **Linting** | Biome | Unified formatter and linter | ### Language Breakdown - TypeScript: 94.5% - - MDX: 4.8% - - Other: 0.7% --- ## Core API Endpoints ### 1. `getMarkdown` (GET /read) Fast endpoint returning prompt-ready markdown. Ideal for caching snippets or feeding LLM prompts. ```bash curl \ -H "Authorization: Bearer $DEEPCRAWL_API_KEY" \ "https://api.deepcrawl.dev/read?url=https://example.com" ``` ### 2. `readUrl` (POST /read) Full operation with metadata, cleaned HTML, markdown, robots, sitemap data, and metrics in one payload. ### 3. `extractLinks` (POST /links) Configurable crawl endpoint that builds agent-navigable links tree, filters external/media links. ### 4. `getLinks` (GET /links) Quick fetch of extracted links for a page. --- ## SDK Usage ```typescript import { DeepcrawlApp } from 'deepcrawl'; const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY as string, }); // Simple markdown extraction const markdown = await deepcrawl.getMarkdown('https://example.com'); // With options const markdown = await deepcrawl.getMarkdown('https://example.com', { cacheOptions: { expirationTtl: 3600 }, cleaningProcessor: 'cheerio-reader', // or 'html-rewriter' }); ``` --- ## Total Cost of Ownership (TCO) Analysis ### Deepcrawl Self-Hosted Costs | Component | Free Tier | Paid Tier | |-----------|-----------|-----------| | **Software** | $0 (MIT license) | $0 | | **Cloudflare Workers** | $0/mo (100k req/day) | $5/mo + $0.30/M requests | | **Cloudflare KV** | Included | Included | | **Cloudflare R2** | Zero egress | Zero egress | | **Vercel (Dashboard)** | $0/mo (Hobby) | $20/mo (Pro) | | **Database (D1/Neon)** | Free tier available | Variable | **Typical TCO: $0/month** for most use cases ### Firecrawl Comparison (Commercial) | Plan | Monthly Cost | Credits | Cost/1K pages | |------|-------------|---------|---------------| | Free | $0 (one-time) | 500 | N/A | | Hobby | $16/mo | 3,000 | $5.33 | | Standard | $83/mo | 100,000 | $0.83 | | Growth | $333/mo | 500,000 | $0.67 | ### Cost Savings Summary | Usage Level | Firecrawl Cost | Deepcrawl Cost | Savings | |-------------|---------------|----------------|---------| | 3,000 pages/mo | $16/mo | $0 | $192/year | | 100,000 pages/mo | $83/mo | $0-5/mo | $936-996/year | | 500,000 pages/mo | $333/mo | $5-20/mo | $3,756-3,936/year | --- ## Key Features & Advantages ### Performance - **5-10x faster** than alternatives on standard HTML parsing - No headless browser overhead for simple content - Edge-native V8 Workers return responses in milliseconds - Dynamic cache controls reduce redundant crawls ### AI Optimization - **Markdown-first output** removes ads, scripts, boilerplate - **Token-efficient formatting** cuts prompt costs - **Link tree intelligence** maps site topology for agent navigation ### Developer Experience - Fully typed TypeScript SDK - Zod schemas plug directly into AI frameworks (ai-sdk, LangChain) - OpenAPI and oRPC endpoints work across any runtime - Consistent REST API across curl, Python, serverless functions ### Infrastructure - Cloudflare Workers run globally, minimizing latency - CDN-backed responses for consistent worldwide performance - Automatic retries for flaky sites --- ## Deployment Options ### Option 1: Use Hosted API (Free) - Sign up at deepcrawl.dev - Get API key from dashboard - Use SDK or REST API ### Option 2: Self-Host (Full Control) ```bash # Clone repo git clone https://github.com/lumpinif/deepcrawl.git # Deploy dashboard to Vercel # Deploy workers to Cloudflare # Configure database (D1 or Neon PostgreSQL) ``` Full self-hosting guide: [deepcrawl.dev/docs/reference/self-hosting](https://deepcrawl.dev/docs/reference/self-hosting) --- ## Integration Examples ### With Vercel AI SDK ```typescript import { DeepcrawlApp } from 'deepcrawl'; import { generateText } from 'ai'; const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY }); const markdown = await deepcrawl.getMarkdown('https://example.com'); const result = await generateText({ model: yourModel, prompt: `Summarize this page:\n\n${markdown}`, }); ``` ### With LangChain ```typescript const markdown = await deepcrawl.getMarkdown(url); const docs = [new Document({ pageContent: markdown })]; // Use with your LangChain pipeline ``` --- ## Limitations & Considerations **Active Development Warning:** > DO NOT USE DEEPCRAWL IN PRODUCTION RIGHT NOW AS IT IS SUBJECT TO CHANGE AND STILL UNDER RAPID DEVELOPMENT. USE WITH YOUR OWN RISK! ### Current Limitations - APIs and SDKs subject to change - Many planned features still in development - Early-stage stability - Community support (no enterprise SLA) ### When to Choose Firecrawl Instead - Need production SLA guarantees - Require enterprise support - Don't want to manage infrastructure - Need features not yet in Deepcrawl --- ## Quick Start ```bash # Install SDK npm install deepcrawl # Set API key export DEEPCRAWL_API_KEY="dc_sk_live_your_key" ``` ```typescript import { DeepcrawlApp } from 'deepcrawl'; const dc = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY }); // Get markdown from any URL const content = await dc.getMarkdown('https://news.ycombinator.com'); console.log(content); ``` --- ## Resources - **GitHub:** https://github.com/lumpinif/deepcrawl - **Documentation:** https://deepcrawl.dev/docs - **API Reference:** https://api.deepcrawl.dev/docs - **Online Playground:** https://deepcrawl.dev/app - **Author Twitter:** [@felixlu1018](https://twitter.com/felixlu1018) --- ## Conclusion Deepcrawl offers a compelling open-source alternative to Firecrawl with: **$0 TCO** for most use cases (vs $16-333/mo for Firecrawl) **Full source code access** and self-hosting capability **Modern tech stack** (Next.js 16, Cloudflare Workers, TypeScript) **AI-optimized output** designed for LLM consumption **Active development** with 318+ stars and growing community The main trade-off is stability - it's explicitly marked as not production-ready. For side projects, development, or cost-sensitive applications, it's an excellent choice. For mission-critical production workloads requiring SLA guarantees, evaluate carefully or wait for v1.0. --- *Research compiled: December 2024*