# Deepcrawl: Open-Source Firecrawl Alternative

## Overview

**Repository:** [github.com/lumpinif/deepcrawl](https://github.com/lumpinif/deepcrawl)
**Documentation:** [deepcrawl.dev/docs](https://deepcrawl.dev/docs)
**License:** MIT (100% open source)
**Author:** Felix Lu (@felixLu)
**Stars:** 318+ | **Forks:** 35

Deepcrawl is a 100% free, open-source alternative to Firecrawl - an agents-oriented website data context extraction platform designed to make any website data AI-ready.

---

## What It Does

Deepcrawl extracts:
- **Cleaned markdown** of page content
- - **Hierarchical links tree** (agent-favored navigation structure)
- - **Page metadata** that LLMs can digest with minimal token cost

The goal is to reduce context switching and hallucination when feeding web content to AI agents.

---

## Technical Architecture

### Core Stack

| Component | Technology | Purpose |
|-----------|------------|---------|
| **API Workers** | Cloudflare Workers | Edge-first fetching, normalization, link graph generation |
| **Dashboard** | Next.js 16 App Router | UI for crawls, API playground, logs, key management |
| **Routing/Middleware** | Hono | Lightweight edge-first framework |
| **RPC Layer** | oRPC | Contract-first RPC generating OpenAPI + SDK from single source |
| **Auth** | better-auth | Enterprise-grade session and API key management |
| **Schemas** | Zod v4 | Type safety across workers, SDKs, and docs |
| **Database** | Drizzle ORM | Supports both Neon PostgreSQL and Cloudflare D1 SQLite |
| **Caching** | Upstash Redis + Cloudflare KV | Layered caching for performance/cost balance |
| **UI** | shadcn/ui + Tailwind CSS v4 | Accessible component system |
| **State Management** | nuqs | URL state serialization for shareable configurations |
| **Data Fetching** | React Query | Server-side prefetching and cache sync |
| **Build** | Turborepo (monorepo) | Shared packages, database modules, scripts |
| **Bundler** | tsdown (Rolldown-based) | Rust-powered bundling |
| **Linting** | Biome | Unified formatter and linter |

### Language Breakdown
- TypeScript: 94.5%
- - MDX: 4.8%
- - Other: 0.7%

---

## Core API Endpoints

### 1. `getMarkdown` (GET /read)
Fast endpoint returning prompt-ready markdown. Ideal for caching snippets or feeding LLM prompts.

```bash
curl \
  -H "Authorization: Bearer $DEEPCRAWL_API_KEY" \
    "https://api.deepcrawl.dev/read?url=https://example.com"
    ```
    
    ### 2. `readUrl` (POST /read)
    Full operation with metadata, cleaned HTML, markdown, robots, sitemap data, and metrics in one payload.
    
    ### 3. `extractLinks` (POST /links)
    Configurable crawl endpoint that builds agent-navigable links tree, filters external/media links.
    
    ### 4. `getLinks` (GET /links)
    Quick fetch of extracted links for a page.
    
    ---
    
    ## SDK Usage
    
    ```typescript
    import { DeepcrawlApp } from 'deepcrawl';
    
    const deepcrawl = new DeepcrawlApp({
      apiKey: process.env.DEEPCRAWL_API_KEY as string,
      });
      
      // Simple markdown extraction
      const markdown = await deepcrawl.getMarkdown('https://example.com');
      
      // With options
      const markdown = await deepcrawl.getMarkdown('https://example.com', {
        cacheOptions: { expirationTtl: 3600 },
          cleaningProcessor: 'cheerio-reader', // or 'html-rewriter'
          });
          ```
          
          ---
          
          ## Total Cost of Ownership (TCO) Analysis
          
          ### Deepcrawl Self-Hosted Costs
          
          | Component | Free Tier | Paid Tier |
          |-----------|-----------|-----------|
          | **Software** | $0 (MIT license) | $0 |
          | **Cloudflare Workers** | $0/mo (100k req/day) | $5/mo + $0.30/M requests |
          | **Cloudflare KV** | Included | Included |
          | **Cloudflare R2** | Zero egress | Zero egress |
          | **Vercel (Dashboard)** | $0/mo (Hobby) | $20/mo (Pro) |
          | **Database (D1/Neon)** | Free tier available | Variable |
          
          **Typical TCO: $0/month** for most use cases
          
          ### Firecrawl Comparison (Commercial)
          
          | Plan | Monthly Cost | Credits | Cost/1K pages |
          |------|-------------|---------|---------------|
          | Free | $0 (one-time) | 500 | N/A |
          | Hobby | $16/mo | 3,000 | $5.33 |
          | Standard | $83/mo | 100,000 | $0.83 |
          | Growth | $333/mo | 500,000 | $0.67 |
          
          ### Cost Savings Summary
          
          | Usage Level | Firecrawl Cost | Deepcrawl Cost | Savings |
          |-------------|---------------|----------------|---------|
          | 3,000 pages/mo | $16/mo | $0 | $192/year |
          | 100,000 pages/mo | $83/mo | $0-5/mo | $936-996/year |
          | 500,000 pages/mo | $333/mo | $5-20/mo | $3,756-3,936/year |
          
          ---
          
          ## Key Features & Advantages
          
          ### Performance
          - **5-10x faster** than alternatives on standard HTML parsing
          - No headless browser overhead for simple content
          - Edge-native V8 Workers return responses in milliseconds
          - Dynamic cache controls reduce redundant crawls
          
          ### AI Optimization
          - **Markdown-first output** removes ads, scripts, boilerplate
          - **Token-efficient formatting** cuts prompt costs
          - **Link tree intelligence** maps site topology for agent navigation
          
          ### Developer Experience
          - Fully typed TypeScript SDK
          - Zod schemas plug directly into AI frameworks (ai-sdk, LangChain)
          - OpenAPI and oRPC endpoints work across any runtime
          - Consistent REST API across curl, Python, serverless functions
          
          ### Infrastructure
          - Cloudflare Workers run globally, minimizing latency
          - CDN-backed responses for consistent worldwide performance
          - Automatic retries for flaky sites
          
          ---
          
          ## Deployment Options
          
          ### Option 1: Use Hosted API (Free)
          - Sign up at deepcrawl.dev
          - Get API key from dashboard
          - Use SDK or REST API
          
          ### Option 2: Self-Host (Full Control)
          ```bash
          # Clone repo
          git clone https://github.com/lumpinif/deepcrawl.git
          
          # Deploy dashboard to Vercel
          # Deploy workers to Cloudflare
          # Configure database (D1 or Neon PostgreSQL)
          ```
          
          Full self-hosting guide: [deepcrawl.dev/docs/reference/self-hosting](https://deepcrawl.dev/docs/reference/self-hosting)
          
          ---
          
          ## Integration Examples
          
          ### With Vercel AI SDK
          ```typescript
          import { DeepcrawlApp } from 'deepcrawl';
          import { generateText } from 'ai';
          
          const deepcrawl = new DeepcrawlApp({ apiKey: process.env.DEEPCRAWL_API_KEY });
          
          const markdown = await deepcrawl.getMarkdown('https://example.com');
          const result = await generateText({
            model: yourModel,
              prompt: `Summarize this page:\n\n${markdown}`,
              });
              ```
              
              ### With LangChain
              ```typescript
              const markdown = await deepcrawl.getMarkdown(url);
              const docs = [new Document({ pageContent: markdown })];
              // Use with your LangChain pipeline
              ```
              
              ---
              
              ## Limitations & Considerations
              
               **Active Development Warning:**
               > DO NOT USE DEEPCRAWL IN PRODUCTION RIGHT NOW AS IT IS SUBJECT TO CHANGE AND STILL UNDER RAPID DEVELOPMENT. USE WITH YOUR OWN RISK!
               
               ### Current Limitations
               - APIs and SDKs subject to change
               - Many planned features still in development
               - Early-stage stability
               - Community support (no enterprise SLA)
               
               ### When to Choose Firecrawl Instead
               - Need production SLA guarantees
               - Require enterprise support
               - Don't want to manage infrastructure
               - Need features not yet in Deepcrawl
               
               ---
               
               ## Quick Start
               
               ```bash
               # Install SDK
               npm install deepcrawl
               
               # Set API key
               export DEEPCRAWL_API_KEY="dc_sk_live_your_key"
               ```
               
               ```typescript
               import { DeepcrawlApp } from 'deepcrawl';
               
               const dc = new DeepcrawlApp({
                 apiKey: process.env.DEEPCRAWL_API_KEY
                 });
                 
                 // Get markdown from any URL
                 const content = await dc.getMarkdown('https://news.ycombinator.com');
                 console.log(content);
                 ```
                 
                 ---
                 
                 ## Resources
                 
                 - **GitHub:** https://github.com/lumpinif/deepcrawl
                 - **Documentation:** https://deepcrawl.dev/docs
                 - **API Reference:** https://api.deepcrawl.dev/docs
                 - **Online Playground:** https://deepcrawl.dev/app
                 - **Author Twitter:** [@felixlu1018](https://twitter.com/felixlu1018)
                 
                 ---
                 
                 ## Conclusion
                 
                 Deepcrawl offers a compelling open-source alternative to Firecrawl with:
                 
                  **$0 TCO** for most use cases (vs $16-333/mo for Firecrawl)
                   **Full source code access** and self-hosting capability
                    **Modern tech stack** (Next.js 16, Cloudflare Workers, TypeScript)
                     **AI-optimized output** designed for LLM consumption
                      **Active development** with 318+ stars and growing community
                      
                      The main trade-off is stability - it's explicitly marked as not production-ready. For side projects, development, or cost-sensitive applications, it's an excellent choice. For mission-critical production workloads requiring SLA guarantees, evaluate carefully or wait for v1.0.
                      
                      ---
                      
                      *Research compiled: December 2024*