# Reverse-Engineering Cursor's Fast Code Search Algorithm

*Ahmed Yousri · March 2026*

---

In March 2026, Cursor published a blog post titled ["Fast regex search"](https://cursor.com/blog/fast-regex-search) by Vicent Marti. It described how they'd built an indexed code search system that could find a pattern across an entire codebase in 13 milliseconds. Their claim: sparse n-gram inverted indexes, two files on disk, a single binary search on a memory-mapped lookup table.

I read it and thought: I can build that.

What followed was a deep dive into research papers from the 1990s, operating system internals, memory-mapped I/O, and SIMD byte scanning. The result is [ayg](https://github.com/hemeda3/aygrep). On a 2-vCPU GitHub Actions runner with 7.8 GB RAM, it searches 79,225 Linux kernel files in 5.8ms hot for a selective query — and 460ms for a broad one matching 49,481 files. On the same runner, ripgrep takes 747ms and 1,563ms respectively.

The worst case: a query matching 62% of all files, cold cache, cheapest CI hardware. ayg: 12.4 seconds. ripgrep: 13.7 seconds. A 1.1× speedup. Barely worth the index.

The best case on the same hardware, same conditions: a selective query matching 13 files. ayg: 155ms. ripgrep: 13.1 seconds. An 85× speedup.

Every number in this post is sourced from either a [public CI run](https://github.com/hemeda3/aygrep/actions) or documented local measurements with stated cache conditions.
---

## Table of Contents

1. [The Problem: Why Grep Is Slow on Large Codebases](#1-the-problem)
2. [The Research: What Cursor Told Us (And What They Didn't)](#2-the-research)
3. [Phase 1: The Trigram Index Baseline](#3-phase-1)
4. [The Warm Cache Lie](#4-warm-cache-lie)
5. [Phase 2: Killing the Subprocess](#5-phase-2)
6. [Phase 3: Probe Before You Fetch](#6-phase-3)
7. [Phase 4: Content-in-Index](#7-phase-4)
8. [Phase 5: Sparse N-grams — The Algorithm Gap](#8-phase-5)
9. [Phase 6: The Library Audit That Mostly Didn't Matter](#9-phase-6)
10. [Phase 7: Zero-Copy Scanning via mmap](#10-phase-7)
11. [Phase 8: The Adaptive Architecture](#11-phase-8)
12. [The Benchmarks](#12-benchmarks)
13. [Where ayg Loses](#13-where-ayg-loses)
14. [What Cursor Described vs What We Built](#14-cursor-comparison)
15. [Lessons Learned](#15-lessons)

---
## 1. The Problem: Why Grep Is Slow on Large Codebases {#1-the-problem}

Chromium is one of the largest actively-developed open-source C++ projects. Some numbers:

- **436,610 source files** (after filtering binaries)
- **6.7 GB** of source code
- Spread across ~45,000 directories

When an AI coding agent needs to find something — say, every file that references `MAX_FILE_SIZE` — it runs `rg "MAX_FILE_SIZE"`. ripgrep is the fastest grep tool in existence. It uses SIMD-accelerated byte scanning, respects `.gitignore`, parallelizes across cores, and has been optimized by Andrew Gallant for years.

On an M3 Max MacBook Pro with Chromium:

```
$ time rg -c "MAX_FILE_SIZE" chromium/
15 files matched
real 16.567s
```

On a 2-vCPU GitHub Actions runner with the Linux kernel (79,225 files, 1.2 GB):

```
$ time rg -c "PM_RESUME"
13 files matched
real 13.129s ← cold cache
real 0.747s ← hot cache
```

The pattern `MAX_FILE_SIZE` appears in exactly 15 files out of 436,610. That's a 0.003% hit rate. The other 99.997% of I/O was wasted.

Can we figure out *which* files to read without reading all of them? That's what an inverted index does. That's what Cursor built.
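The idea is small enough to sketch. Below is a toy, in-memory trigram index in Rust; ayg's real index is hash-keyed and lives on disk, so treat this as an illustration of the mechanics rather than the implementation: map every trigram to the set of files containing it, then intersect posting lists at query time.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy in-memory trigram inverted index. The on-disk version described
// later is binary-searched, but the mechanics are identical.
fn build_index(docs: &[&[u8]]) -> BTreeMap<[u8; 3], BTreeSet<u32>> {
    let mut index: BTreeMap<[u8; 3], BTreeSet<u32>> = BTreeMap::new();
    for (fid, doc) in docs.iter().enumerate() {
        for win in doc.windows(3) {
            index.entry([win[0], win[1], win[2]]).or_default().insert(fid as u32);
        }
    }
    index
}

// Candidate files for a literal query: intersect the posting lists of
// every trigram in the query. Only these files need to be scanned.
fn candidates(index: &BTreeMap<[u8; 3], BTreeSet<u32>>, query: &[u8]) -> BTreeSet<u32> {
    let mut result: Option<BTreeSet<u32>> = None;
    for win in query.windows(3) {
        let list = index.get(&[win[0], win[1], win[2]]).cloned().unwrap_or_default();
        result = Some(match result {
            None => list,
            Some(acc) => acc.intersection(&list).cloned().collect(),
        });
    }
    result.unwrap_or_default()
}

fn main() {
    let docs: [&[u8]; 3] = [b"define MAX_FILE_SIZE 1024", b"hello world", b"MAX_RETRIES"];
    let index = build_index(&docs);
    // every trigram of the pattern must appear in a candidate file
    assert_eq!(candidates(&index, b"MAX_FILE_SIZE"), BTreeSet::from([0u32]));
}
```

Any file containing the full pattern must contain all of its trigrams, so the intersection can produce false positives but never false negatives; the candidates are then scanned to confirm.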
---

## 2. The Research: What Cursor Told Us (And What They Didn't) {#2-the-research}

### What the blog post says

Cursor's post walks through the evolution of code search indexing:

1. **Trigrams** (fixed 3-byte subsequences) — functional, but posting lists are large
2. **Suffix arrays** — exact matching, but memory-intensive
3. **Bloom filter masks** — augment trigrams with next-byte masks, but they "saturate too quickly"
4. **Sparse n-grams** — their actual solution

The key innovation is a **frequency-weighted boundary detector**. A 256×256 table maps every byte pair to a weight. Rare byte pairs get high weights. N-gram boundaries are placed at local maxima. This produces variable-length n-grams that are longer and more selective than fixed trigrams.
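The boundary rule itself fits in a few lines. The weight function below is a made-up stand-in, not Cursor's trained 256×256 table; only the local-maximum splitting rule reflects the description above.

```rust
// Illustrative boundary detector. The weight function is a toy stand-in
// (a real implementation uses a 256x256 table derived from corpus
// byte-pair frequencies); the part that matters is splitting at local
// maxima of the pair-weight sequence.
fn pair_weight(a: u8, b: u8) -> u32 {
    // pretend pairs touching '_' are rare, so they score high
    if a == b'_' || b == b'_' { 10 } else { 1 }
}

fn boundaries(text: &[u8]) -> Vec<usize> {
    // w[i] is the weight of the byte pair (text[i], text[i+1])
    let w: Vec<u32> = text.windows(2).map(|p| pair_weight(p[0], p[1])).collect();
    // place a boundary at every interior local maximum
    // (>= on both sides also admits flat runs)
    (1..w.len().saturating_sub(1))
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}

fn main() {
    // boundaries cluster around the rare '_' separators
    assert_eq!(boundaries(b"max_file_size"), vec![2, 3, 5, 7, 8, 10]);
}
```

Rare pairs attract boundaries, so identifiers split at their most distinctive positions, and the resulting n-grams are longer exactly where the text is most selective.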
### What the blog post doesn't say

- **How they scan candidate files.** "Doing this on the client is trivial" implies an embedded SIMD scanner, not a subprocess.
- **They store only hashes, not full n-grams.** 3-byte (24-bit) hash keys. Collisions merge posting lists — safe, just less selective.
- **They mmap only the lookup table.** Postings are read via `pread()`. The editor process stays lean.
- **No medium or broad query performance.** The 13ms number is for a selective query on an M2 MacBook.

### The papers

- **Zobel, Moffat, Sacks-Davis (1993)** — inverted index fundamentals
- **Russ Cox (2012)** — trigram indexing for regex search (Google Code Search)
- **ClickHouse** — sparse n-gram text indexing for database columns

---
## 3. Phase 1: The Trigram Index Baseline {#3-phase-1}

Extract every overlapping 3-byte sequence from every file. Build an inverted index mapping trigram → file IDs. Write it to disk.

Two files:

- `lookup.bin`: a sorted array of `[trigram: 3B, pad: 1B, offset: 8B, count: 4B]` = 16 bytes per entry. Binary search finds any trigram.
- `postings.bin`: concatenated delta+varint encoded posting lists.
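As a sketch of how delta + varint encoding keeps posting lists small (the exact byte layout of ayg's `postings.bin` may differ; this is the standard LEB128-style scheme the description names):

```rust
// Encode a sorted posting list as gaps (deltas), each written as a
// little-endian varint with a continuation bit in the high bit.
fn encode(postings: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &fid in postings {
        let mut delta = fid - prev; // gaps stay small for dense lists
        prev = fid;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit: more bytes follow
        }
    }
    out
}

// Inverse: accumulate 7-bit groups, then undo the delta.
fn decode(bytes: &[u8]) -> Vec<u32> {
    let (mut out, mut prev, mut acc, mut shift) = (Vec::new(), 0u32, 0u32, 0);
    for &b in bytes {
        acc |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            prev += acc;
            out.push(prev);
            acc = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}

fn main() {
    let postings = vec![3u32, 7, 1000, 70_000];
    let bytes = encode(&postings);
    assert_eq!(bytes.len(), 7);           // 16 bytes of raw u32s shrink to 7
    assert_eq!(decode(&bytes), postings); // lossless round-trip
}
```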
For the scan phase, I spawned `rg -c pattern file1 file2 file3...` as a subprocess. This was a mistake.

### Baseline results (M3 Max, Chromium, warm cache, development build)

| Query | Candidates | Total time | vs ripgrep 16.6s |
|-------|-----------|-----------|-----------------|
| MAX_FILE_SIZE | ~200 | 11.3ms | ~1,500× |
| gpu::Mailbox | ~245 | 9.1ms | ~1,800× |
| NOTREACHED | ~8,300 | 225ms | ~74× |
| std::unique_ptr | ~43,400 | 2,125ms | ~8× |

These numbers had a problem I hadn't noticed yet.
---

## 4. The Warm Cache Lie {#4-warm-cache-lie}

My benchmark ran each query 3 times and took the median. All warm. After building the index, the OS page cache held every source file in memory — because the build phase had just read them all.

After a 38 GB memory flood + `purge` on macOS:

```
MAX_FILE_SIZE truly cold: 25ms (was 11.3ms "warm")
```

The 11.3ms was a warm-cache number compared against ripgrep's cold-cache 16.6 seconds.

On Linux, `echo 3 > /proc/sys/vm/drop_caches` actually works. This is why we later tested on GCP and GitHub Actions with proper cache clearing.

**Lesson: Always measure cold cache. Always document cache state.**

---
## 5. Phase 2: Killing the Subprocess {#5-phase-2}

Each `Command::new("rg")` does a `fork()` + `exec()`. On macOS, that costs 2–5 milliseconds. For a query with 24 candidates that should take 0.1ms to scan, the subprocess overhead was 50× the actual work.

Replace it with in-process SIMD scanning using the `memchr` crate:

```rust
use memchr::memmem;

// Build the searcher once per query; `pattern_bytes` is the literal
// query and `data` holds a candidate file's contents.
let finder = memmem::Finder::new(pattern_bytes);
let count = finder.find_iter(&data).count();
```

`memmem::Finder` uses SIMD (NEON on ARM, SSE2/AVX2 on x86) to scan at 8+ GB/s.

| Query | Before (subprocess) | After (memchr) | Improvement |
|-------|----:|----:|----:|
| MAX_FILE_SIZE | 11.3ms | 6.5ms | 1.7× |
| kMaxBufferSize | 8.6ms | 1.9ms | 4.5× |
| gpu::Mailbox | 9.1ms | 1.5ms | 6.1× |

This single change mattered more than every "clever" optimization that followed.
---

## 6. Phase 3: Probe Before You Fetch {#6-phase-3}

The lookup table is mmap'd. A binary search costs ~5μs. The table also contains the posting list *count*, so we can see how selective a trigram is without reading its posting list.

Split the query into probe (free — mmap'd) and fetch (expensive — pread from disk):

```rust
// Sketch: probe all trigrams first (free, just binary searches in the
// mmap'd table), then fetch posting lists rarest-first and stop early.
probes.sort_by_key(|&(_, _, count)| count);

let mut candidates = fetch(probes[0]);
for probe in &probes[1..] {
    if candidates.len() <= 50 { break; } // already selective enough
    candidates = intersect(&candidates, &fetch(probe));
}
```
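A runnable version of the same loop, with toy in-memory posting lists standing in for the fetch step (the early-stop threshold is a parameter here; `narrow` and `intersect` are names of mine, not ayg's):

```rust
// Merge-intersect two sorted posting lists.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    out
}

// Intersect rarest-first, stopping once the candidate set is small.
fn narrow(mut lists: Vec<Vec<u32>>, stop_at: usize) -> Vec<u32> {
    lists.sort_by_key(|l| l.len());  // cheapest (rarest) list first
    let mut cand = lists.remove(0);
    for list in lists {
        if cand.len() <= stop_at { break; } // early termination
        cand = intersect(&cand, &list);
    }
    cand
}

fn main() {
    let lists = vec![vec![1u32, 2, 3, 4, 5], vec![2, 3, 5], vec![3, 5, 7]];
    assert_eq!(narrow(lists.clone(), 0), vec![3, 5]);
    // with a lenient threshold, the rarest list alone is good enough
    assert_eq!(narrow(lists, 10), vec![2, 3, 5]);
}
```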
Also: `MADV_WILLNEED` to prefault the lookup table on startup. This shaved ~4ms off the first cold query.
---

## 7. Phase 4: Content-in-Index {#7-phase-4}

For 24 candidate files, `fs::read()` does `open()` → `stat()` → `malloc()` → `read()` → `close()`. That's ~4 syscalls per file.

Stream all file contents into a single `content.bin` during index build. At query time, mmap it and access candidates as slices:

```rust
fn file_content(&self, fid: u32) -> &[u8] {
    let (offset, length) = self.content_offsets[fid as usize];
    &self.content_mmap[offset as usize .. offset as usize + length as usize]
}
```

Zero allocation. Zero syscalls. The index grew from 831 MB to 3,123 MB.

The build streams content to disk during Pass 1. Peak memory: ~50 MB. No OOM on 8 GB machines.
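The streaming write is simple to sketch. The names and exact on-disk layout below are assumptions (raw bytes concatenated, with a separate offset table indexed by file ID); the point is that nothing beyond the current file is ever buffered:

```rust
use std::io::Write;

// Append one file's bytes to the content blob during the build pass,
// recording where it lives so query time can slice it back out.
fn append_file(
    out: &mut impl Write,
    offsets: &mut Vec<(u64, u64)>,
    pos: &mut u64,
    bytes: &[u8],
) -> std::io::Result<u32> {
    let fid = offsets.len() as u32;           // file IDs are assigned in order
    offsets.push((*pos, bytes.len() as u64)); // (offset, length) in content.bin
    out.write_all(bytes)?;                    // stream: no whole-corpus buffering
    *pos += bytes.len() as u64;
    Ok(fid)
}

fn main() -> std::io::Result<()> {
    let (mut blob, mut offsets, mut pos) = (Vec::new(), Vec::new(), 0u64);
    append_file(&mut blob, &mut offsets, &mut pos, b"hello")?;
    append_file(&mut blob, &mut offsets, &mut pos, b"world!")?;
    assert_eq!(offsets, vec![(0, 5), (5, 6)]);
    // query-time access is then a bounds-checked slice, no syscalls
    let (off, len) = offsets[1];
    assert_eq!(&blob[off as usize..(off + len) as usize], b"world!");
    Ok(())
}
```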
---

## 8. Phase 5: Sparse N-grams — The Algorithm Gap {#8-phase-5}

### The bug

The first implementation had a boundary alignment bug. `build_covering` (query-time extraction) used edge boundaries that didn't match the index-time extraction context.

### The fix

Use only **interior boundaries** at query time — positions where both neighbors exist and the weight is a local maximum:

```rust
// `w` is the byte-pair weight sequence for the query string
fn find_interior_boundaries(w: &[u8]) -> Vec<usize> {
    (1..w.len().saturating_sub(1))
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}
```

For `build_all` (index time), use ALL boundaries. For `build_covering` (query time), use only interior boundaries. This guarantees every covering n-gram exists in the index.
### Two-tier query

Short, common queries (`NOTREACHED`) have few interior peaks. Fall back to raw trigrams scored by rarity:

```rust
let keys = if covering_ranges.len() >= 3 {
    sparse_ngram_path(covering_ranges) // selective
} else {
    raw_trigram_fallback(pattern) // always works
};
```

| Query | Trigram candidates | Sparse candidates |
|-------|------------------:|------------------:|
| MAX_FILE_SIZE | ~200 | **24** |
| gpu::Mailbox | ~350 | **143** |
| NOTREACHED | ~8,300 | ~8,300 (fallback) |

Sparse n-grams can't help NOTREACHED — it genuinely appears in 8,300+ files.
---

## 9. Phase 6: The Library Audit That Mostly Didn't Matter {#9-phase-6}

| Change | Suspected cost | Actual query impact |
|--------|---------------|-------------------|
| Replace `HashSet` with bitvec | SipHash overhead | **Unmeasurable** |
| Replace `HashMap` with FxHashMap | SipHash overhead | **Build only** |
| Replace `walkdir` with `git ls-files` | 8s directory walk | **Build: 7.5s saved** |
| Replace `rayon` with `std::thread::scope` | Thread pool overhead | **Unmeasurable** |

Three of the four changes had zero measurable impact on query latency.

**Lesson: Profile before optimizing.**
---

## 10. Phase 7: Zero-Copy Scanning via mmap {#10-phase-7}

Replace per-file pread (one syscall each) with direct mmap slice access. Sort candidates by content offset for sequential access.

| Query | pread scan | mmap scan | Improvement |
|-------|----------:|---------:|------------:|
| MAX_FILE_SIZE | 1.6ms | **0.1ms** | 16× |
| NOTREACHED | 29ms | **8.7ms** | 3.3× |
| std::unique_ptr | 93ms | **22.2ms** | 4.2× |

These are M3 Max warm-cache measurements (development build). Verified final numbers are in the benchmarks section below.

---
## 11. Phase 8: The Adaptive Architecture {#11-phase-8}

Not everyone has 36 GB of RAM. Detect available memory at startup:

| Mode | When | Strategy |
|------|------|----------|
| **ContentMmap** | content.bin exists, RAM > 2× content | Zero-copy mmap |
| **ContentPread** | content.bin exists, RAM tight | One-alloc pread |
| **Filesystem** | No content.bin or low RAM | Read from repo |

The `--no-content` flag skips content.bin during build. The index shrinks from 3.1 GB to 831 MB. On machines with <8 GB available RAM, the build does this automatically.
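The mode choice above can be sketched as a single function. Names and signatures here are mine, not ayg's; the thresholds follow the table (mmap only when available RAM exceeds twice the content size):

```rust
#[derive(Debug, PartialEq)]
enum ScanMode { ContentMmap, ContentPread, Filesystem }

// Pick a scan strategy from what exists on disk and how much RAM is free.
fn choose_mode(has_content_bin: bool, avail_ram: u64, content_size: u64) -> ScanMode {
    if !has_content_bin {
        ScanMode::Filesystem        // fall back to reading the repo
    } else if avail_ram > 2 * content_size {
        ScanMode::ContentMmap       // zero-copy slices out of content.bin
    } else {
        ScanMode::ContentPread      // one allocation + pread per candidate
    }
}

fn main() {
    const GB: u64 = 1 << 30;
    assert_eq!(choose_mode(true, 36 * GB, 3 * GB), ScanMode::ContentMmap);
    assert_eq!(choose_mode(true, 4 * GB, 3 * GB), ScanMode::ContentPread);
    assert_eq!(choose_mode(false, 36 * GB, 0), ScanMode::Filesystem);
}
```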
---

## 12. The Benchmarks {#12-benchmarks}

### Methodology

**Cold cache (Linux):** `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` before each query.

**Cold cache (macOS):** `dd if=/dev/zero of=/tmp/mf bs=1m count=38000 && rm /tmp/mf && purge`.

**Timing:** ayg uses internal `Instant::now()` around the search core (excludes process startup and index loading). ripgrep is measured via Python `time.perf_counter()` around `subprocess.run()`. This favors ayg by ~50ms (startup cost excluded). For a fair CLI wall-clock comparison, see the customer-facing section.
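For reference, the internal timing amounts to wrapping only the search closure. This is a generic sketch, not ayg's actual harness:

```rust
use std::time::{Duration, Instant};

// Time only the closure: anything outside it (process startup, index
// load) is deliberately excluded, as the methodology above notes.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}

fn main() {
    let (n, elapsed) = timed(|| (0..1_000u32).sum::<u32>());
    assert_eq!(n, 499_500);
    assert!(elapsed < Duration::from_secs(1));
}
```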
**Correctness:** All match counts verified identical to ripgrep's.

### Hardware

| Machine | CPU | vCPU | RAM | Storage |
|---------|-----|-----:|----:|---------|
| **GitHub Actions** | AMD EPYC 7763 | 2 | 7.8 GB | Azure SSD |
| **M3 Max MacBook Pro** | Apple M3 Max @ 4.05 GHz | 14 | 36 GB | Apple NVMe (~7 GB/s) |
| **GCP e2-standard-2** | AMD EPYC 7B12 @ 2.25 GHz | 2 | 8 GB | pd-ssd (~400 MB/s) |
---

### Linux kernel (79,225 files) — GitHub Actions, no content.bin

Public CI verification: [benchmark run](https://github.com/hemeda3/aygrep/actions). Queries are ordered worst speedup first.

| Query | ayg cold | ayg hot | rg cold | rg hot | Cold × | Hot × | Candidates | Files |
|-------|--------:|-------:|-------:|------:|-------:|------:|-----------:|------:|
| `Copyright` | **12,412ms** | **461ms** | 13,697ms | 1,563ms | **1.1×** | **3.4×** | 49,507 | 49,481 |
| `struct device` | **6,406ms** | **221ms** | 13,253ms | 918ms | **2.1×** | **4.2×** | 21,185 | 11,408 |
| `mutex_lock` | **2,119ms** | **72ms** | 13,203ms | 875ms | **6.2×** | **12×** | 5,617 | 5,472 |
| `EXPORT_SYMBOL_GPL` | **1,814ms** | **62ms** | 13,112ms | 891ms | **7.2×** | **14×** | 5,694 | 3,130 |
| `PM_RESUME` | **155ms** | **5.8ms** | 13,129ms | 747ms | **85×** | **129×** | 235 | 13 |

| | Build | ayg total | rg total | Speedup |
|-|------:|----------:|---------:|--------:|
| **Cold** | 29.46s | 22,906ms | 66,394ms | **2.9×** |
| **Hot** | 29.46s | 822ms | 4,940ms | **6.0×** |

The `Copyright` query matches 49,481 of 79,225 files (62%). At cold cache, ayg barely wins. Without content.bin, scanning 49,507 candidate files from the filesystem is almost as slow as scanning all 79,225.
---

### Linux kernel — M3 Max, content.bin, cold (38 GB flood + purge)

| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **74.3ms** | 3,783ms | **51×** | 49,482 |
| `EXPORT_SYMBOL_GPL` | **15.8ms** | 3,549ms | **225×** | 3,129 |
| `PM_RESUME` | **6.2ms** | 3,585ms | **578×** | 13 |

With content.bin on fast local NVMe, even the broad query is 51× faster.

---

### Linux kernel — M3 Max, content.bin, hot

| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **96.7ms** | 2,550ms | **26×** | 49,482 |
| `struct device` | **75.6ms** | 2,693ms | **36×** | 11,408 |
| `mutex_lock` | **45.2ms** | 2,577ms | **57×** | 5,471 |
| `EXPORT_SYMBOL_GPL` | **30.1ms** | 2,578ms | **86×** | 3,129 |
| `PM_RESUME` | **7.0ms** | 2,761ms | **394×** | 13 |

| | ayg total | rg total | Speedup |
|-|----------:|---------:|--------:|
| **Hot** | 254.6ms | 13,159ms | **52×** |
---

### Chromium (436,610 files) — M3 Max, content.bin, cold

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `#include "base/` | **109.4ms** | ~16,500ms | **~151×** | 78,979 |
| `NOTREACHED` | **36.2ms** | 18,440ms | **509×** | 8,310 |
| `MAX_FILE_SIZE` | **1.0ms** | 16,567ms | **16,567×** | 24 |

---

### Chromium — M3 Max, no content.bin, cold (filesystem scan)

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **4,223ms** | 11,684ms | **2.8×** | 43,371 |
| `NOTREACHED` | **1,648ms** | 18,440ms | **11×** | 8,310 |
| `gpu::Mailbox` | **412ms** | 16,797ms | **41×** | 419 |
| `kMaxBufferSize` | **60ms** | 17,041ms | **284×** | 140 |
| `MAX_FILE_SIZE` | **8.0ms** | 16,567ms | **2,071×** | 24 |

Without content.bin, broad queries (43K candidates) approach ripgrep's speed because 43K individual `open/read/close` syscall chains become the bottleneck.
---

### Chromium — GCP e2-standard-2 (8 GB, 2 vCPU), no content.bin, cold

The absolute worst case: slowest CPU, smallest RAM, network-attached disk, no content file, cold cache.

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **39,332ms** | 264,000ms | **6.7×** | 43,376 |
| `NOTREACHED` | **10,487ms** | 223,000ms | **21×** | 8,313 |
| `gpu::Mailbox` | **667ms** | ~209,000ms | **~313×** | 419 |
| `MAX_FILE_SIZE` | **160ms** | 209,000ms | **1,306×** | 24 |

ripgrep goes from 16s (local NVMe) to 209s (network SSD) — 13× slower. ayg goes from 8ms to 160ms — 20× slower. Both degrade, but ayg degrades less because it reads 24 files instead of 436,000.

---

### Customer-facing wall-clock (Homebrew binary, M3 Max, Chromium)

The numbers above use internal `Instant::now()`, which excludes ~50ms of process startup and index loading. This is what a human running `ayg search` from the terminal actually sees:

| Scenario | ayg CLI wall | ayg internal | rg wall | Speedup (wall) |
|----------|------------:|------------:|--------:|--------:|
| Warm | **59–64ms** | 0.5–1.5ms | 29.24s | **~460×** |
| Cold | **0.25s** | 45.4ms | 48.20s | **~193×** |

The ~460× wall-clock speedup is the honest end-to-end comparison. The sub-millisecond internal time is what an AI agent with a persistent process sees after the first query (index already loaded).
---

## 13. Where ayg Loses {#13-where-ayg-loses}

ayg does not always win. Showing where it loses matters more than showing where it wins.

**1. Broad queries without content.bin, warm cache:**

M3 Max, Chromium, `std::unique_ptr` (43,371 candidates), no content.bin, warm:

| Tool | Time |
|------|-----:|
| ripgrep | 11,684ms |
| **ayg** | **16,785ms** |

ayg is **0.7×** — slower than ripgrep. 43K individual `open/read/close` syscall chains are slower than ripgrep's optimized parallel SIMD scan of all files sequentially. With content.bin, the same query takes 22.2ms.

**2. Any query where >50% of files match:**

`Copyright` matches 49,481 of 79,225 Linux kernel files. On CI cold cache: ayg 12.4s vs ripgrep 13.7s = 1.1×. The index provides almost no selectivity.

**3. Small corpora, warm cache:**

On the Linux kernel (1.2 GB) with 32 GB RAM and a warm cache, ripgrep scans everything from the page cache in ~2.5s. The SIMD scan at 8+ GB/s has near-zero overhead. ayg's index lookup + scan is faster (254ms), but ripgrep's 2.5s is already fast enough that the index build (22s) takes 9 queries to amortize.

**4. Regex:**

ayg handles literals only. ripgrep handles `\p{Greek}`, Unicode word boundaries, and every regex feature the `regex` crate supports — plus lookarounds via its PCRE2 engine (`-P`).

### When ayg wins

- **The query is selective.** Agent patterns like `kMaxBufferSize` and `gpu::Mailbox` match 10–400 files. The index eliminates 99.9%+ of I/O.
- **The corpus doesn't fit in RAM.** Chromium (6.7 GB) on 8 GB: ripgrep takes 3.5 minutes. ayg takes 160ms.
- **The storage is slow.** Network-attached SSD: ripgrep 209s → ayg 160ms.
- **Repeated searches.** AI agents search 50–200 times per session. The build amortizes after 2–7 queries depending on hardware.
---

## 14. What Cursor Described vs What We Built {#14-cursor-comparison}

| Aspect | Cursor's blog | ayg |
|--------|--------------|-----|
| N-gram algorithm | Sparse n-grams with frequency boundaries | Same: 256×256 byte-pair table, interior boundaries for covering |
| Index format | Two files: lookup (mmap) + postings (pread) | Same: 16-byte sorted entries, delta+varint posting lists |
| Lookup | "mmap this table, and only this table" | Same |
| Content scanning | "trivial" (filesystem) | Adaptive: mmap / pread / filesystem based on RAM |
| 8 GB machines | Not discussed | Streaming build, auto-skip content.bin |
| Query fallback | Not discussed | Two-tier: sparse path + raw trigram fallback |
| Regex | Implied but not detailed | Not implemented |
| Git-based versioning | Yes | Not implemented |

Cursor's claimed 13ms on an M2 is consistent with our measurements. On M3 Max, cold, with content.bin: 1.0ms for MAX_FILE_SIZE. Cursor's number likely includes editor IPC overhead.
---

## 15. Lessons Learned {#15-lessons}

### What moved the needle

| Rank | Change | Impact | Category |
|-----:|--------|--------|----------|
| 1 | mmap content file (zero-copy scan) | NOTREACHED scan: 29ms → 8.7ms | Systems |
| 2 | Kill subprocess (embedded memchr) | Query floor: 5ms → 0.1ms | Systems |
| 3 | Content-in-index (single file) | Enabled #1; eliminated per-file syscalls | Architecture |
| 4 | Sparse n-grams | MAX_FILE_SIZE: 200 → 24 candidates | Algorithm |
| 5 | Probe-before-fetch + early termination | Saved 2–3 cold pread calls | Systems |
| 6 | MADV_WILLNEED prefault | Cold first query: −4ms | OS hints |
| 7 | git ls-files (replace walkdir) | Build: 8s → 0.2s | Build |
| 8 | FxHash, bitvec, galloping intersect | Unmeasurable | Noise |

Three of the top five are "boring" systems changes. The algorithm change (sparse n-grams) matters for selective queries but is irrelevant for broad ones, where there are no false positives to eliminate.

### What didn't matter

- **FxHash vs SipHash:** Not on the query path.
- **rayon vs std::thread::scope:** Identical performance.
- **Galloping intersection:** Marginal. Only helps when list sizes differ by 64×+.

### What I got wrong

1. **Believed `purge` cleared the cache on macOS.** It doesn't fully on Apple Silicon. A 38 GB memory flood was needed.
2. **Abandoned sparse n-grams when the first implementation had a bug.** The bug was boundary alignment. Fixing it took 30 minutes once diagnosed.
3. **Assumed library overhead was the bottleneck.** Spent hours replacing HashSet, HashMap, rayon. Net query improvement: zero.
4. **The M3 Max warm content table in the first version of this post had wrong speedup calculations.** It listed 126,978× for MAX_FILE_SIZE, but 16,567ms ÷ 0.2ms = 82,835×. The raw times were correct; the division was wrong across the entire table. Fixed after a reader checked the math.
---

## Reproducing

```bash
git clone https://github.com/hemeda3/aygrep.git
cd aygrep && cargo install --path .
./scripts/benchmark-full.sh linux     # ~5 min, clones kernel
./scripts/benchmark-full.sh chromium  # ~30 min, clones Chromium
```

CI runs the Linux benchmark on every push and publishes results to [hemeda3.github.io/aygrep](https://hemeda3.github.io/aygrep/).
---

## Acknowledgments

Developed with AI assistance (Claude) for code generation, optimization iteration, and benchmarking. The author directed all architectural decisions and validated results on real hardware.

- **Vicent Marti** and the Cursor team for the blog post
- **Andrew Gallant (BurntSushi)** for ripgrep, memchr, and the regex crate
- **Russ Cox** for trigram indexing
- The **zoekt** team for the content-in-index architecture

---

*Ahmed Yousri · [github.com/hemeda3](https://github.com/hemeda3) · March 2026*
Author
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Reverse-Engineering Cursor's fast Code Search algorithm
Ahmed Yousri · March 2026
In March 2026, Cursor published a blog post titled "Fast regex search" by Vicent Marti. It described how they'd built an indexed code search system that could find a pattern across an entire codebase in 13 milliseconds. Their claim: sparse n-gram inverted indexes, two files on disk, a single binary search on a memory-mapped lookup table.
I read it and thought: I can build that.
What followed was a deep dive into research papers from the 1990s, operating system internals, memory-mapped I/O, and SIMD byte scanning. The result is ayg. On a 2-vCPU GitHub Actions runner with 7.8 GB RAM, it searches 79,225 Linux kernel files in 5.8ms hot for a selective query — and 460ms for a broad one matching 49,481 files. On the same runner, ripgrep takes 747ms and 1,563ms respectively.
The worst case: a query matching 62% of all files, cold cache, cheapest CI hardware. ayg: 12.4 seconds. ripgrep: 13.7 seconds. A 1.1× speedup. Barely worth the index.
The best case on the same hardware, same conditions: a selective query matching 13 files. ayg: 155ms. ripgrep: 13.1 seconds. An 85× speedup.
Every number in this post is sourced from either a public CI run or documented local measurements with stated cache conditions.
Table of Contents
1. The Problem: Why Grep Is Slow on Large Codebases {#1-the-problem}
Chromium is one of the largest actively-developed open-source C++ projects. Some numbers:
When an AI coding agent needs to find something — say, every file that references
MAX_FILE_SIZE— it runsrg "MAX_FILE_SIZE". ripgrep is the fastest grep tool in existence. It uses SIMD-accelerated byte scanning, respects.gitignore, parallelizes across cores, and has been optimized by Andrew Gallant for years.On an M3 Max MacBook Pro with Chromium:
On a 2-vCPU GitHub Actions runner with the Linux kernel (79,225 files, 1.2 GB):
The pattern
MAX_FILE_SIZEappears in exactly 15 files out of 436,610. That's a 0.003% hit rate. The other 99.997% of I/O was wasted.Can we figure out which files to read without reading all of them? That's what an inverted index does. That's what Cursor built.
2. The Research: What Cursor Told Us (And What They Didn't) {#2-the-research}
What the blog post says
Cursor's post walks through the evolution of code search indexing:
The key innovation is a frequency-weighted boundary detector. A 256×256 table maps every byte pair to a weight. Rare byte pairs get high weights. N-gram boundaries are placed at local maxima. This produces variable-length n-grams that are longer and more selective than fixed trigrams.
What the blog post doesn't say
pread(). The editor process stays lean.The papers
3. Phase 1: The Trigram Index Baseline {#3-phase-1}
Extract every overlapping 3-byte sequence from every file. Build an inverted index mapping trigram → file IDs. Write to disk.
Two files:
lookup.bin: sorted array of[trigram: 3B, pad: 1B, offset: 8B, count: 4B]= 16 bytes per entry. Binary search to find any trigram.postings.bin: concatenated delta+varint encoded posting lists.For the scan phase, I spawned
rg -c pattern file1 file2 file3...as a subprocess. This was a mistake.Baseline results (M3 Max, Chromium, warm cache, development build)
These numbers had a problem I hadn't noticed yet.
4. The Warm Cache Lie {#4-warm-cache-lie}
My benchmark ran each query 3 times and took the median. All warm. After building the index, the OS page cache held every source file in memory — because the build phase had just read them all.
After a 38 GB memory flood +
purgeon macOS:The 11.3ms was a warm-cache number compared against ripgrep's cold-cache 16.6 seconds.
On Linux,
echo 3 > /proc/sys/vm/drop_cachesactually works. This is why we later tested on GCP and GitHub Actions with proper cache clearing.Lesson: Always measure cold cache. Always document cache state.
5. Phase 2: Killing the Subprocess {#5-phase-2}
Each
Command::new("rg")does afork()+exec(). On macOS, that costs 2–5 milliseconds. For a query with 24 candidates that should take 0.1ms to scan, the subprocess overhead was 50× the actual work.Replace with in-process SIMD scanning using the
memchrcrate:memmem::Finderuses SIMD (NEON on ARM, SSE2/AVX2 on x86) to scan at 8+ GB/s.This single change mattered more than every "clever" optimization that followed.
6. Phase 3: Probe Before You Fetch {#6-phase-3}
The lookup table is mmap'd. Binary search costs ~5μs. The table also contains the posting list count. We can see how selective a trigram is without reading its posting list.
Split into probe (free — mmap'd) and fetch (expensive — pread from disk):
Also:
MADV_WILLNEEDto prefault the lookup table on startup. Shaved ~4ms off the first cold query.7. Phase 4: Content-in-Index {#7-phase-4}
For 24 candidate files,
fs::read()doesopen()→stat()→malloc()→read()→close(). That's ~4 syscalls per file.Stream all file contents into a single
content.binduring index build. At query time, mmap it and access candidates as slices:Zero allocation. Zero syscall. Index grew from 831 MB to 3,123 MB.
The build streams content to disk during Pass 1. Peak memory: ~50 MB. No OOM on 8 GB machines.
8. Phase 5: Sparse N-grams {#8-phase-5}
The bug
The first implementation had a boundary alignment bug.
build_covering(query-time extraction) used edge boundaries that didn't match the index-time extraction context.The fix
Use only interior boundaries at query time — positions where both neighbors exist and the weight is a local maximum:
For
build_all(index time), use ALL boundaries. Forbuild_covering(query time), use only interior boundaries. This guarantees every covering n-gram exists in the index.Two-tier query
Short common queries (
NOTREACHED) have few interior peaks. Fallback to raw trigrams scored by rarity:Sparse n-grams can't help NOTREACHED — it genuinely appears in 8,300+ files.
9. Phase 6: The Library Audit That Mostly Didn't Matter {#9-phase-6}
HashSetwith bitvecHashMapwith FxHashMapwalkdirwithgit ls-filesrayonwithstd::thread::scopeThree of four changes had zero measurable impact on query latency.
Lesson: Profile before optimizing.
10. Phase 7: Zero-Copy Scanning via mmap {#10-phase-7}
Replace per-file pread (one syscall each) with direct mmap slice access. Sort candidates by content offset for sequential access.
These are M3 Max warm cache measurements (development). Verified final numbers are in the benchmarks section below.
11. Phase 8: The Adaptive Architecture {#11-phase-8}
Not everyone has 36 GB of RAM. Detect available memory at startup:
The
--no-contentflag skips content.bin during build. Index shrinks from 3.1 GB to 831 MB. On machines with <8 GB available RAM, the build does this automatically.12. The Benchmarks {#12-benchmarks}
Methodology
**Cold cache (Linux):** `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` before each query.

**Cold cache (macOS):** `dd if=/dev/zero of=/tmp/mf bs=1m count=38000 && rm /tmp/mf && purge`.

**Timing:** ayg uses internal `Instant::now()` around the search core (excludes process startup and index loading). ripgrep is measured via Python `time.perf_counter()` around `subprocess.run()`. This favors ayg by ~50ms (startup cost excluded). For a fair CLI wall-clock comparison, see the customer-facing section.

**Correctness:** all match counts verified identical to ripgrep's.
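The two timing modes can be sketched as follows (a trivial `true` invocation stands in for the real ripgrep command line):

```rust
use std::process::Command;
use std::time::Instant;

// Sketch of the two timing modes described above: internal timing wraps
// only the search core; wall-clock timing wraps an entire child process,
// startup included. `true` stands in for the ripgrep invocation.
fn main() {
    // Internal: what ayg's Instant::now() numbers measure.
    let t = Instant::now();
    let _hits: usize = 0; // search_core(&index, query) would run here
    let internal = t.elapsed();

    // End-to-end: mirrors the Python perf_counter() subprocess harness.
    let t = Instant::now();
    let status = Command::new("true").status().expect("failed to spawn");
    let wall = t.elapsed();

    assert!(status.success());
    println!("internal: {:?}, subprocess wall-clock: {:?}", internal, wall);
}
```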
### Hardware
### Linux kernel (79,225 files) — GitHub Actions, no content.bin
Public CI verification: benchmark run. Queries ordered worst speedup first.
Queries: `Copyright`, `struct device`, `mutex_lock`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

The `Copyright` query matches 49,481 out of 79,225 files (62%). At cold cache, ayg barely wins. Without content.bin, scanning 49,507 candidate files from the filesystem is almost as slow as scanning all 79,225.

### Linux kernel — M3 Max, content.bin, cold (38 GB flood + purge)
Queries: `Copyright`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

With content.bin on fast local NVMe, even the broad query is 51× faster.
### Linux kernel — M3 Max, content.bin, hot

Queries: `Copyright`, `struct device`, `mutex_lock`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

### Chromium (436,610 files) — M3 Max, content.bin, cold
Queries: `#include "base/`, `NOTREACHED`, `MAX_FILE_SIZE`.

### Chromium — M3 Max, no content.bin, cold (filesystem scan)
Queries: `std::unique_ptr`, `NOTREACHED`, `gpu::Mailbox`, `kMaxBufferSize`, `MAX_FILE_SIZE`.

Without content.bin, broad queries (43K candidates) approach ripgrep speed because 43K individual `open`/`read`/`close` syscall chains become the bottleneck.

### Chromium — GCP e2-standard-2 (8 GB, 2 vCPU), no content.bin, cold
The absolute worst case: slowest CPU, smallest RAM, network-attached disk, no content file, cold cache.
Queries: `std::unique_ptr`, `NOTREACHED`, `gpu::Mailbox`, `MAX_FILE_SIZE`.

ripgrep goes from 16s (local NVMe) to 209s (network SSD) — 13× slower. ayg goes from 8ms to 160ms — 20× slower. Both degrade, but ayg degrades less because it reads 24 files instead of 436,000.
### Customer-facing wall-clock (Homebrew binary, M3 Max, Chromium)
The numbers above use internal `Instant::now()`, which excludes ~50ms of process startup and index loading; the wall-clock numbers here are what a human running `ayg search` from the terminal actually sees. The ~460× wall-clock speedup is the honest end-to-end comparison. The sub-millisecond internal time is what an AI agent with a persistent process sees after the first query (index already loaded).
## 13. Where ayg Loses {#13-where-ayg-loses}
ayg does not always win. Showing where it loses matters more than showing where it wins.
**1. Broad queries without content.bin, warm cache.** M3 Max, Chromium, `std::unique_ptr` (43,371 candidates), no content.bin, warm: ayg is 0.7× — slower than ripgrep. 43K individual `open`/`read`/`close` syscalls are slower than ripgrep's optimized parallel SIMD scan of all files in sequence. With content.bin, the same query takes 22.2ms.

**2. Any query where >50% of files match.** `Copyright` matches 49,481 out of 79,225 Linux kernel files. On CI cold cache: ayg 12.4s vs ripgrep 13.7s = 1.1×. The index provides almost no selectivity.

**3. Small corpora, warm cache.**
On the Linux kernel (1.2 GB) with 32 GB RAM and warm cache, ripgrep scans everything from page cache in ~2.5s. The SIMD scan at 8+ GB/s has near-zero overhead. ayg's index lookup + scan is faster (254ms), but ripgrep's 2.5s is already fast enough that the index build (22s) takes 9 queries to amortize.
**4. Regex.** ayg handles literals only. ripgrep handles `\p{Greek}`, Unicode word boundaries, every regex feature the `regex` crate supports, and lookarounds via `--pcre2`.

### When ayg wins
Selective queries like `kMaxBufferSize` and `gpu::Mailbox` match 10–400 files. The index eliminates 99.9%+ of the I/O.

## 14. What Cursor Described vs What We Built {#14-cursor-comparison}
Cursor's claimed 13ms on M2 is consistent with our measurements. On M3 Max cold with content.bin: 1.0ms for MAX_FILE_SIZE. Cursor's number likely includes editor IPC overhead.
## 15. Lessons Learned {#15-lessons}
### What moved the needle
Three of the top five are "boring" systems changes. The algorithm change (sparse n-grams) matters for selective queries but is irrelevant for broad ones where there are zero false positives.
### What didn't matter
### What I got wrong
- Believed `purge` cleared the cache on macOS. It doesn't fully on Apple Silicon. Needed a 38 GB memory flood.
- Abandoned sparse n-grams when the first implementation had a bug. The bug was boundary alignment. Fixing it took 30 minutes once diagnosed.
- Assumed library overhead was the bottleneck. Spent hours replacing `HashSet`, `HashMap`, and `rayon`. Net query improvement: zero.
- The M3 Max warm content.bin table in the first version of this post had wrong speedup calculations. It listed 126,978× for `MAX_FILE_SIZE`, but 16,567ms ÷ 0.2ms = 82,835×. The raw times were correct; the division was wrong across the entire table. Fixed after a reader checked the math.
## Reproducing
CI runs the Linux benchmark on every push and publishes results to hemeda3.github.io/aygrep.
## Acknowledgments
Developed with AI assistance (Claude) for code generation, optimization iteration, and benchmarking. The author directed all architectural decisions and validated results on real hardware.
Ahmed Yousri · github.com/hemeda3 · March 2026