# Reverse-Engineering Cursor's Fast Code Search Algorithm
*Ahmed Yousri · March 2026*
---
In March 2026, Cursor published a blog post titled ["Fast regex search"](https://cursor.com/blog/fast-regex-search) by Vicent Marti. It described how they'd built an indexed code search system that could find a pattern across an entire codebase in 13 milliseconds. Their claim: sparse n-gram inverted indexes, two files on disk, a single binary search on a memory-mapped lookup table.
I read it and thought: I can build that.
What followed was a deep dive into research papers from the 1990s, operating system internals, memory-mapped I/O, and SIMD byte scanning. The result is [ayg](https://github.com/hemeda3/aygrep). On a 2-vCPU GitHub Actions runner with 7.8 GB RAM, it searches 79,225 Linux kernel files in 5.8ms hot for a selective query — and 460ms for a broad one matching 49,481 files. On the same runner, ripgrep takes 747ms and 1,563ms respectively.
The worst case: a query matching 62% of all files, cold cache, cheapest CI hardware. ayg: 12.4 seconds. ripgrep: 13.7 seconds. A 1.1× speedup. Barely worth the index.
The best case on the same hardware, same conditions: a selective query matching 13 files. ayg: 155ms. ripgrep: 13.1 seconds. An 85× speedup.
Every number in this post is sourced from either a [public CI run](https://github.com/hemeda3/aygrep/actions) or documented local measurements with stated cache conditions.
---
## Table of Contents
1. [The Problem: Why Grep Is Slow on Large Codebases](#1-the-problem)
2. [The Research: What Cursor Told Us (And What They Didn't)](#2-the-research)
3. [Phase 1: The Trigram Index Baseline](#3-phase-1)
4. [The Warm Cache Lie](#4-warm-cache-lie)
5. [Phase 2: Killing the Subprocess](#5-phase-2)
6. [Phase 3: Probe Before You Fetch](#6-phase-3)
7. [Phase 4: Content-in-Index](#7-phase-4)
8. [Phase 5: Sparse N-grams — The Algorithm Gap](#8-phase-5)
9. [Phase 6: The Library Audit That Mostly Didn't Matter](#9-phase-6)
10. [Phase 7: Zero-Copy Scanning via mmap](#10-phase-7)
11. [Phase 8: The Adaptive Architecture](#11-phase-8)
12. [The Benchmarks](#12-benchmarks)
13. [Where ayg Loses](#13-where-ayg-loses)
14. [What Cursor Described vs What We Built](#14-cursor-comparison)
15. [Lessons Learned](#15-lessons)
---
## 1. The Problem: Why Grep Is Slow on Large Codebases {#1-the-problem}
Chromium is one of the largest actively-developed open-source C++ projects. Some numbers:
- **436,610 source files** (after filtering binaries)
- **6.7 GB** of source code
- Spread across ~45,000 directories
When an AI coding agent needs to find something — say, every file that references `MAX_FILE_SIZE` — it runs `rg "MAX_FILE_SIZE"`. ripgrep is the fastest grep tool in existence. It uses SIMD-accelerated byte scanning, respects `.gitignore`, parallelizes across cores, and has been optimized by Andrew Gallant for years.
On an M3 Max MacBook Pro with Chromium:
```
$ time rg -c "MAX_FILE_SIZE" chromium/
15 files matched
real 16.567s
```
On a 2-vCPU GitHub Actions runner with the Linux kernel (79,225 files, 1.2 GB):
```
$ time rg -c "PM_RESUME"
13 files matched
real 13.129s ← cold cache
real 0.747s ← hot cache
```
The pattern `MAX_FILE_SIZE` appears in exactly 15 files out of 436,610. That's a 0.003% hit rate. The other 99.997% of I/O was wasted.
Can we figure out *which* files to read without reading all of them? That's what an inverted index does. That's what Cursor built.
---
## 2. The Research: What Cursor Told Us (And What They Didn't) {#2-the-research}
### What the blog post says
Cursor's post walks through the evolution of code search indexing:
1. **Trigrams** (fixed 3-byte subsequences) — functional but posting lists are large
2. **Suffix arrays** — exact matching but memory-intensive
3. **Bloom filter masks** — augment trigrams with next-byte masks, but "saturate too quickly"
4. **Sparse n-grams** — their actual solution
The key innovation is a **frequency-weighted boundary detector**. A 256×256 table maps every byte pair to a weight. Rare byte pairs get high weights. N-gram boundaries are placed at local maxima. This produces variable-length n-grams that are longer and more selective than fixed trigrams.
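The post gives no code, but the mechanism is simple enough to sketch. In the sketch below, `PAIR_WEIGHTS` is a zeroed stand-in for the learned 256×256 table (in a real index, rare byte pairs would carry high weights), and the exact slicing convention for the resulting n-grams is left out because the post doesn't specify it:
```rust
/// Zeroed stand-in for the learned 256×256 byte-pair weight table.
static PAIR_WEIGHTS: [[u8; 256]; 256] = [[0; 256]; 256];

/// Place n-gram boundaries where the byte-pair weight is a local maximum.
/// The variable-length n-grams are the spans between consecutive boundaries.
fn boundaries(text: &[u8]) -> Vec<usize> {
    // w[i] is the weight of the byte pair (text[i], text[i + 1]).
    let w: Vec<u8> = text
        .windows(2)
        .map(|p| PAIR_WEIGHTS[p[0] as usize][p[1] as usize])
        .collect();
    (1..w.len().saturating_sub(1))
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}
```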
### What the blog post doesn't say
- **How they scan candidate files.** "Doing this on the client is trivial" implies an embedded SIMD scanner, not a subprocess.
- **They store only hashes, not full n-grams.** 3-byte (24-bit) hash keys. Collisions merge posting lists — safe, just less selective (sketched after this list).
- **They mmap only the lookup table.** Postings are read via `pread()`. The editor process stays lean.
- **No medium- or broad-query numbers.** The 13ms figure is for a single selective query on an M2 MacBook.
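The hashing detail is worth a sketch because it explains why collisions are safe. FNV-1a here is an arbitrary stand-in (the post doesn't name the hash); a collision only merges two posting lists, adding false-positive candidates that the scan phase filters out:
```rust
/// Hash a variable-length n-gram down to a 3-byte (24-bit) key.
/// FNV-1a is an illustrative choice, not necessarily Cursor's.
fn ngram_key(ngram: &[u8]) -> [u8; 3] {
    let mut h: u32 = 0x811c_9dc5;
    for &b in ngram {
        h ^= u32::from(b);
        h = h.wrapping_mul(0x0100_0193);
    }
    let h = h & 0x00ff_ffff; // truncate to 24 bits
    [(h >> 16) as u8, (h >> 8) as u8, h as u8]
}
```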
### The papers
- **Zobel, Moffat, Sacks-Davis (1993)** — inverted index fundamentals
- **Russ Cox (2012)** — trigram indexing for regex search (Google Code Search)
- **ClickHouse** — sparse n-gram text indexing for database columns
---
## 3. Phase 1: The Trigram Index Baseline {#3-phase-1}
Extract every overlapping 3-byte sequence from every file. Build an inverted index mapping trigram → file IDs. Write to disk.
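A minimal in-memory sketch of that structure (the real build streams straight to the on-disk format described next):
```rust
use std::collections::HashMap;

/// Simplified in-memory inverted index: trigram -> sorted file IDs.
fn build_index(files: &[(u32, Vec<u8>)]) -> HashMap<[u8; 3], Vec<u32>> {
    let mut index: HashMap<[u8; 3], Vec<u32>> = HashMap::new();
    for (file_id, content) in files {
        for tri in content.windows(3) {
            let ids = index.entry([tri[0], tri[1], tri[2]]).or_default();
            // Record each file at most once per posting list.
            if ids.last() != Some(file_id) {
                ids.push(*file_id);
            }
        }
    }
    index
}
```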
Two files:
- `lookup.bin`: sorted array of `[trigram: 3B, pad: 1B, offset: 8B, count: 4B]` = 16 bytes per entry. Binary search to find any trigram (sketched below).
- `postings.bin`: concatenated delta+varint encoded posting lists.
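A sketch of the read side under that layout. Little-endian fields and a LEB128-style varint are assumptions; the description above doesn't pin down either:
```rust
/// One 16-byte lookup entry: [trigram: 3B][pad: 1B][offset: 8B][count: 4B].
fn parse_entry(e: &[u8; 16]) -> ([u8; 3], u64, u32) {
    let trigram = [e[0], e[1], e[2]];
    let offset = u64::from_le_bytes(e[4..12].try_into().unwrap());
    let count = u32::from_le_bytes(e[12..16].try_into().unwrap());
    (trigram, offset, count)
}

/// Binary search the mmap'd lookup table for one trigram's entry.
fn find_trigram(table: &[u8], tri: [u8; 3]) -> Option<(u64, u32)> {
    let (mut lo, mut hi) = (0, table.len() / 16);
    while lo < hi {
        let mid = (lo + hi) / 2;
        let entry: &[u8; 16] = table[mid * 16..mid * 16 + 16].try_into().unwrap();
        let (t, offset, count) = parse_entry(entry);
        match t.cmp(&tri) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => return Some((offset, count)),
        }
    }
    None
}

/// Decode a delta+varint posting list: each varint is the gap from the
/// previous file ID (the first is the gap from zero).
fn decode_postings(mut bytes: &[u8], count: u32) -> Vec<u32> {
    let (mut ids, mut prev) = (Vec::with_capacity(count as usize), 0u32);
    for _ in 0..count {
        let (mut delta, mut shift) = (0u32, 0u32);
        loop {
            let b = bytes[0];
            bytes = &bytes[1..];
            delta |= u32::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        prev += delta;
        ids.push(prev);
    }
    ids
}
```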
For the scan phase, I spawned `rg -c pattern file1 file2 file3...` as a subprocess. This was a mistake.
### Baseline results (M3 Max, Chromium, warm cache, development build)
| Query | Candidates | Total time | vs ripgrep 16.6s |
|-------|-----------|-----------|-----------------|
| MAX_FILE_SIZE | ~200 | 11.3ms | ~1,500× |
| gpu::Mailbox | ~245 | 9.1ms | ~1,800× |
| NOTREACHED | ~8,300 | 225ms | ~74× |
| std::unique_ptr | ~43,400 | 2,125ms | ~8× |
These numbers had a problem I hadn't noticed yet.
---
## 4. The Warm Cache Lie {#4-warm-cache-lie}
My benchmark ran each query 3 times and took the median. All warm. After building the index, the OS page cache held every source file in memory — because the build phase had just read them all.
After a 38 GB memory flood + `purge` on macOS:
```
MAX_FILE_SIZE truly cold: 25ms (was 11.3ms "warm")
```
The 11.3ms was a warm-cache number compared against ripgrep's cold-cache 16.6 seconds.
On Linux, `echo 3 > /proc/sys/vm/drop_caches` actually works. This is why we later tested on GCP and GitHub Actions with proper cache clearing.
**Lesson: Always measure cold cache. Always document cache state.**
---
## 5. Phase 2: Killing the Subprocess {#5-phase-2}
Each `Command::new("rg")` does a `fork()` + `exec()`. On macOS, that costs 2–5 milliseconds. For a query with 24 candidates that should take 0.1ms to scan, the subprocess overhead was 50× the actual work.
Replace with in-process SIMD scanning using the `memchr` crate:
```rust
use memchr::memmem;

// `pattern_bytes` is the query; `data` is one candidate file's contents.
// The Finder is built once and reused across all candidate files.
let finder = memmem::Finder::new(pattern_bytes);
let count = finder.find_iter(&data).count();
```
`memmem::Finder` uses SIMD (NEON on ARM, SSE2/AVX2 on x86) to scan at 8+ GB/s.
| Query | Before (subprocess) | After (memchr) | Improvement |
|-------|----:|----:|----:|
| MAX_FILE_SIZE | 11.3ms | 6.5ms | 1.7× |
| kMaxBufferSize | 8.6ms | 1.9ms | 4.5× |
| gpu::Mailbox | 9.1ms | 1.5ms | 6.1× |
This single change mattered more than every "clever" optimization that followed.
---
## 6. Phase 3: Probe Before You Fetch {#6-phase-3}
The lookup table is mmap'd. Binary search costs ~5μs. The table also contains the posting list *count*. We can see how selective a trigram is without reading its posting list.
Split into probe (free — mmap'd) and fetch (expensive — pread from disk):
```rust
// Probe every trigram's lookup entry (mmap'd — effectively free) and
// sort ascending by posting-list count, rarest first.
probes.sort_by_key(|&(_, _, count)| count);

// Fetch the shortest posting list first; intersect with the next
// lists only while the candidate set is still large.
let mut candidates = fetch(&probes[0]);
for probe in &probes[1..] {
    if candidates.len() <= 50 {
        break; // selective enough — skip the remaining preads
    }
    candidates = intersect(&candidates, &fetch(probe));
}
```
Also: `MADV_WILLNEED` to prefault the lookup table on startup. Shaved ~4ms off the first cold query.
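With the `memmap2` crate, that hint is a single call on the mapping. A minimal sketch:
```rust
use memmap2::{Advice, Mmap};
use std::fs::File;

/// Map lookup.bin and ask the kernel to prefault it, so the first
/// query's binary search doesn't stall on page faults.
fn open_lookup(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: the index file is never mutated while mapped.
    let map = unsafe { Mmap::map(&file)? };
    map.advise(Advice::WillNeed)?; // wraps madvise(MADV_WILLNEED)
    Ok(map)
}
```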
---
## 7. Phase 4: Content-in-Index {#7-phase-4}
For 24 candidate files, `fs::read()` does `open()` → `stat()` → `malloc()` → `read()` → `close()`. That's four syscalls plus an allocation per file.
Stream all file contents into a single `content.bin` during index build. At query time, mmap it and access candidates as slices:
```rust
/// Zero-copy: the returned slice points directly into the mmap'd content.bin.
fn file_content(&self, fid: u32) -> &[u8] {
let (offset, length) = self.content_offsets[fid as usize];
&self.content_mmap[offset as usize .. offset as usize + length as usize]
}
```
Zero allocation. Zero syscall. Index grew from 831 MB to 3,123 MB.
The build streams content to disk during Pass 1. Peak memory: ~50 MB. No OOM on 8 GB machines.
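A minimal sketch of that streaming write (function and variable names are illustrative):
```rust
use std::fs::File;
use std::io::{self, BufWriter, Read, Write};

/// Append every file's bytes to content.bin, recording (offset, length)
/// per file ID. Only one file's contents are in memory at a time.
fn stream_content(paths: &[String], out: File) -> io::Result<Vec<(u64, u32)>> {
    let mut writer = BufWriter::new(out);
    let mut offsets = Vec::with_capacity(paths.len());
    let mut offset = 0u64;
    let mut buf = Vec::new();
    for path in paths {
        buf.clear();
        File::open(path)?.read_to_end(&mut buf)?;
        writer.write_all(&buf)?;
        offsets.push((offset, buf.len() as u32));
        offset += buf.len() as u64;
    }
    writer.flush()?;
    Ok(offsets)
}
```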
---
## 8. Phase 5: Sparse N-grams — The Algorithm Gap {#8-phase-5}
### The bug
The first implementation had a boundary alignment bug. `build_covering` (query-time extraction) used edge boundaries that didn't match the index-time extraction context.
### The fix
Use only **interior boundaries** at query time — positions where both neighbors exist and the weight is a local maximum:
```rust
/// `w` holds the byte-pair weights for the query string. Interior
/// boundaries are positions with a neighbor on each side whose weight
/// is a local maximum.
fn find_interior_boundaries(w: &[u8]) -> Vec<usize> {
    if w.len() < 3 {
        return Vec::new(); // too short to have an interior position
    }
    (1..w.len() - 1)
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}
```
For `build_all` (index time), use ALL boundaries. For `build_covering` (query time), use only interior boundaries. This guarantees every covering n-gram exists in the index.
### Two-tier query
Short, common queries (`NOTREACHED`) have few interior peaks. Fall back to raw trigrams scored by rarity:
```rust
let keys = if covering_ranges.len() >= 3 {
sparse_ngram_path(covering_ranges) // selective
} else {
raw_trigram_fallback(pattern) // always works
};
```
| Query | Trigram candidates | Sparse candidates |
|-------|------------------:|------------------:|
| MAX_FILE_SIZE | ~200 | **24** |
| gpu::Mailbox | ~350 | **143** |
| NOTREACHED | ~8,300 | ~8,300 (fallback) |
Sparse n-grams can't help NOTREACHED — it genuinely appears in 8,300+ files.
---
## 9. Phase 6: The Library Audit That Mostly Didn't Matter {#9-phase-6}
| Change | Suspected cost | Actual query impact |
|--------|---------------|-------------------|
| Replace `HashSet` with bitvec | SipHash overhead | **Unmeasurable** |
| Replace `HashMap` with FxHashMap | SipHash overhead | **Build only** |
| Replace `walkdir` with `git ls-files` | 8s directory walk | **Build: 7.5s saved** |
| Replace `rayon` with `std::thread::scope` | Thread pool overhead | **Unmeasurable** |
Three of four changes had zero measurable impact on query latency.
**Lesson: Profile before optimizing.**
---
## 10. Phase 7: Zero-Copy Scanning via mmap {#10-phase-7}
Replace per-file pread (one syscall each) with direct mmap slice access. Sort candidates by content offset for sequential access.
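A sketch of that scan, reusing the Phase 4 (offset, length) layout and the Phase 2 `memmem` finder (names illustrative):
```rust
use memchr::memmem;

/// Scan candidates straight out of the content mmap, in offset order,
/// so the kernel sees one mostly-sequential access pattern.
fn scan_candidates(
    content: &[u8],         // the mmap'd content.bin
    offsets: &[(u64, u32)], // (offset, length) per file ID
    mut candidates: Vec<u32>,
    pattern: &[u8],
) -> Vec<u32> {
    candidates.sort_by_key(|&fid| offsets[fid as usize].0);
    let finder = memmem::Finder::new(pattern);
    candidates
        .into_iter()
        .filter(|&fid| {
            let (off, len) = offsets[fid as usize];
            finder
                .find(&content[off as usize..off as usize + len as usize])
                .is_some()
        })
        .collect()
}
```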
| Query | pread scan | mmap scan | Improvement |
|-------|----------:|---------:|------------:|
| MAX_FILE_SIZE | 1.6ms | **0.1ms** | 16× |
| NOTREACHED | 29ms | **8.7ms** | 3.3× |
| std::unique_ptr | 93ms | **22.2ms** | 4.2× |
These are M3 Max warm cache measurements (development). Verified final numbers are in the benchmarks section below.
---
## 11. Phase 8: The Adaptive Architecture {#11-phase-8}
Not everyone has 36 GB of RAM. Detect available memory at startup:
| Mode | When | Strategy |
|------|------|----------|
| **ContentMmap** | content.bin exists, RAM > 2× content | Zero-copy mmap |
| **ContentPread** | content.bin exists, RAM tight | One-alloc pread |
| **Filesystem** | No content.bin or low RAM | Read from repo |
The `--no-content` flag skips content.bin during build. Index shrinks from 3.1 GB to 831 MB. On machines with <8 GB available RAM, the build does this automatically.
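A sketch of the selection logic, with illustrative names and the thresholds from the table (how available RAM is detected is elided):
```rust
enum ScanMode {
    ContentMmap,  // zero-copy slices out of content.bin
    ContentPread, // one allocation per candidate, read via pread
    Filesystem,   // no content.bin: read candidates from the repo
}

/// Pick a scan strategy from what exists on disk and how much RAM is free.
fn pick_mode(content_bin_size: Option<u64>, available_ram: u64) -> ScanMode {
    match content_bin_size {
        Some(size) if available_ram > 2 * size => ScanMode::ContentMmap,
        Some(_) => ScanMode::ContentPread,
        None => ScanMode::Filesystem,
    }
}
```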
---
## 12. The Benchmarks {#12-benchmarks}
### Methodology
**Cold cache (Linux):** `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` before each query.
**Cold cache (macOS):** `dd if=/dev/zero of=/tmp/mf bs=1m count=38000 && rm /tmp/mf && purge`.
**Timing:** ayg uses internal `Instant::now()` around the search core (excludes process startup and index loading). ripgrep measured via Python `time.perf_counter()` around `subprocess.run()`. This favors ayg by ~50ms (startup cost excluded). For fair CLI wall-clock comparison, see the customer-facing section.
**Correctness:** All match counts verified identical to ripgrep.
### Hardware
| Machine | CPU | vCPU | RAM | Storage |
|---------|-----|-----:|----:|---------|
| **GitHub Actions** | AMD EPYC 7763 | 2 | 7.8 GB | Azure SSD |
| **M3 Max MacBook Pro** | Apple M3 Max @ 4.05 GHz | 14 | 36 GB | Apple NVMe (~7 GB/s) |
| **GCP e2-standard-2** | AMD EPYC 7B12 @ 2.25 GHz | 2 | 8 GB | pd-ssd (~400 MB/s) |
---
### Linux kernel (79,225 files) — GitHub Actions, no content.bin
Public CI verification: [benchmark run](https://github.com/hemeda3/aygrep/actions). Queries ordered worst speedup first.
| Query | ayg cold | ayg hot | rg cold | rg hot | Cold × | Hot × | Candidates | Files |
|-------|--------:|-------:|-------:|------:|-------:|------:|-----------:|------:|
| `Copyright` | **12,412ms** | **461ms** | 13,697ms | 1,563ms | **1.1×** | **3.4×** | 49,507 | 49,481 |
| `struct device` | **6,406ms** | **221ms** | 13,253ms | 918ms | **2.1×** | **4.2×** | 21,185 | 11,408 |
| `mutex_lock` | **2,119ms** | **72ms** | 13,203ms | 875ms | **6.2×** | **12×** | 5,617 | 5,472 |
| `EXPORT_SYMBOL_GPL` | **1,814ms** | **62ms** | 13,112ms | 891ms | **7.2×** | **14×** | 5,694 | 3,130 |
| `PM_RESUME` | **155ms** | **5.8ms** | 13,129ms | 747ms | **85×** | **129×** | 235 | 13 |
| | Build | ayg total | rg total | Speedup |
|-|------:|----------:|---------:|--------:|
| **Cold** | 29.46s | 22,906ms | 66,394ms | **2.9×** |
| **Hot** | 29.46s | 822ms | 4,940ms | **6.0×** |
The `Copyright` query matches 49,481 out of 79,225 files (62%). At cold cache, ayg barely wins. Without content.bin, scanning 49,507 candidate files from the filesystem is almost as slow as scanning all 79,225.
---
### Linux kernel — M3 Max, content.bin, cold (38 GB flood + purge)
| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **74.3ms** | 3,783ms | **51×** | 49,482 |
| `EXPORT_SYMBOL_GPL` | **15.8ms** | 3,549ms | **225×** | 3,129 |
| `PM_RESUME` | **6.2ms** | 3,585ms | **578×** | 13 |
With content.bin on fast local NVMe, even the broad query is 51× faster.
---
### Linux kernel — M3 Max, content.bin, hot
| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **96.7ms** | 2,550ms | **26×** | 49,482 |
| `struct device` | **75.6ms** | 2,693ms | **36×** | 11,408 |
| `mutex_lock` | **45.2ms** | 2,577ms | **57×** | 5,471 |
| `EXPORT_SYMBOL_GPL` | **30.1ms** | 2,578ms | **86×** | 3,129 |
| `PM_RESUME` | **7.0ms** | 2,761ms | **394×** | 13 |
| | ayg total | rg total | Speedup |
|-|----------:|---------:|--------:|
| **Hot** | 254.6ms | 13,159ms | **52×** |
---
### Chromium (436,610 files) — M3 Max, content.bin, cold
| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `#include "base/` | **109.4ms** | ~16,500ms | **~151×** | 78,979 |
| `NOTREACHED` | **36.2ms** | 18,440ms | **509×** | 8,310 |
| `MAX_FILE_SIZE` | **1.0ms** | 16,567ms | **16,567×** | 24 |
---
### Chromium — M3 Max, no content.bin, cold (filesystem scan)
| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **4,223ms** | 11,684ms | **2.8×** | 43,371 |
| `NOTREACHED` | **1,648ms** | 18,440ms | **11×** | 8,310 |
| `gpu::Mailbox` | **412ms** | 16,797ms | **41×** | 419 |
| `kMaxBufferSize` | **60ms** | 17,041ms | **284×** | 140 |
| `MAX_FILE_SIZE` | **8.0ms** | 16,567ms | **2,071×** | 24 |
Without content.bin, broad queries (43K candidates) approach ripgrep speed because 43K individual `open/read/close` syscall chains become the bottleneck.
---
### Chromium — GCP e2-standard-2 (8 GB, 2 vCPU), no content.bin, cold
The absolute worst case: slowest CPU, smallest RAM, network-attached disk, no content file, cold cache.
| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **39,332ms** | 264,000ms | **6.7×** | 43,376 |
| `NOTREACHED` | **10,487ms** | 223,000ms | **21×** | 8,313 |
| `gpu::Mailbox` | **667ms** | ~209,000ms | **~313×** | 419 |
| `MAX_FILE_SIZE` | **160ms** | 209,000ms | **1,306×** | 24 |
ripgrep goes from 16s (local NVMe) to 209s (network SSD) — 13× slower. ayg goes from 8ms to 160ms — 20× slower. Both degrade, but ayg degrades less because it reads 24 files instead of 436,000.
---
### Customer-facing wall-clock (Homebrew binary, M3 Max, Chromium)
The numbers above use internal `Instant::now()` which excludes ~50ms of process startup and index loading. This is what a human running `ayg search` from the terminal actually sees:
| Scenario | ayg CLI wall | ayg internal | rg wall | Speedup (wall) |
|----------|------------:|------------:|--------:|--------:|
| Warm | **59–64ms** | 0.5–1.5ms | 29.24s | **~460×** |
| Cold | **0.25s** | 45.4ms | 48.20s | **~193×** |
The ~460× wall-clock speedup is the honest end-to-end comparison. The sub-millisecond internal time is what an AI agent with a persistent process sees after the first query (index already loaded).
---
## 13. Where ayg Loses {#13-where-ayg-loses}
ayg does not always win. Showing where it loses matters more than showing where it wins.
**1. Broad queries without content.bin, warm cache:**
M3 Max, Chromium, `std::unique_ptr` (43,371 candidates), no content.bin, warm:
| Tool | Time |
|------|-----:|
| ripgrep | 11,684ms |
| **ayg** | **16,785ms** |
ayg is **0.7×** — slower than ripgrep. 43K individual `open/read/close` syscalls are slower than ripgrep's optimized parallel SIMD scan of all files sequentially. With content.bin, the same query takes 22.2ms.
**2. Any query where >50% of files match:**
`Copyright` matches 49,481 out of 79,225 Linux kernel files. On CI cold cache: ayg 12.4s vs ripgrep 13.7s = 1.1×. The index provides almost no selectivity.
**3. Small corpora, warm cache:**
On the Linux kernel (1.2 GB) with 32 GB RAM and warm cache, ripgrep scans everything from page cache in ~2.5s. The SIMD scan at 8+ GB/s has near-zero overhead. ayg's index lookup + scan is faster (254ms), but ripgrep's 2.5s is already fast enough that the index build (22s) takes 9 queries to amortize.
**4. Regex:**
ayg handles literals only. ripgrep handles `\p{Greek}`, Unicode word boundaries, character classes, and every regex feature the `regex` crate supports.
### When ayg wins
- **The query is selective.** Agent patterns like `kMaxBufferSize`, `gpu::Mailbox` match 10–400 files. The index eliminates 99.9%+ of I/O.
- **The corpus doesn't fit in RAM.** Chromium (6.7 GB) on 8 GB: ripgrep takes 3.5 minutes. ayg takes 160ms.
- **The storage is slow.** Network-attached SSD: ripgrep 209s → ayg 160ms.
- **Repeated searches.** AI agents search 50–200 times per session. Build amortizes after 2–7 queries depending on hardware.
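For concreteness on that last point, using the CI cold numbers above: ayg's five queries total 22.9s against ripgrep's 66.4s, a saving of roughly 8.7s per query, so the 29.5s build pays for itself after three to four cold queries.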
---
## 14. What Cursor Described vs What We Built {#14-cursor-comparison}
| Aspect | Cursor's blog | ayg |
|--------|--------------|-----|
| N-gram algorithm | Sparse n-grams with frequency boundaries | Same: 256×256 byte-pair table, interior boundaries for covering |
| Index format | Two files: lookup (mmap) + postings (pread) | Same: 16-byte sorted entries, delta+varint posting lists |
| Lookup | "mmap this table, and only this table" | Same |
| Content scanning | "trivial" (filesystem) | Adaptive: mmap / pread / filesystem based on RAM |
| 8 GB machines | Not discussed | Streaming build, auto-skip content.bin |
| Query fallback | Not discussed | Two-tier: sparse path + raw trigram fallback |
| Regex | Implied but not detailed | Not implemented |
| Git-based versioning | Yes | Not implemented |
Cursor's claimed 13ms on M2 is consistent with our measurements. On M3 Max cold with content.bin: 1.0ms for MAX_FILE_SIZE. Cursor's number likely includes editor IPC overhead.
---
## 15. Lessons Learned {#15-lessons}
### What moved the needle
| Rank | Change | Impact | Category |
|-----:|--------|--------|----------|
| 1 | mmap content file (zero-copy scan) | NOTREACHED scan: 29ms → 8.7ms | Systems |
| 2 | Kill subprocess (embedded memchr) | Query floor: 5ms → 0.1ms | Systems |
| 3 | Content-in-index (single file) | Enabled #1; eliminated per-file syscalls | Architecture |
| 4 | Sparse n-grams | MAX_FILE_SIZE: 200 → 24 candidates | Algorithm |
| 5 | Probe-before-fetch + early termination | Saved 2–3 cold pread calls | Systems |
| 6 | MADV_WILLNEED prefault | Cold first query: -4ms | OS hints |
| 7 | git ls-files (replace walkdir) | Build: 8s → 0.2s | Build |
| 8 | FxHash, bitvec, galloping intersect | Unmeasurable | Noise |
Three of the top five are "boring" systems changes. The algorithm change (sparse n-grams) matters for selective queries but is irrelevant for broad ones, where the candidates are genuine matches rather than false positives to be eliminated.
### What didn't matter
- **FxHash vs SipHash:** Not on query path.
- **rayon vs std::thread::scope:** Identical performance.
- **Galloping intersection:** Marginal. Only helps when list sizes differ 64×+.
### What I got wrong
1. **Believed `purge` cleared the cache on macOS.** It doesn't fully clear it on Apple Silicon. Needed a 38 GB memory flood.
2. **Abandoned sparse n-grams when the first implementation had a bug.** The bug was boundary alignment. Fixing it took 30 minutes once diagnosed.
3. **Assumed library overhead was the bottleneck.** Spent hours replacing HashSet, HashMap, rayon. Net query improvement: zero.
4. **The M3 Max warm content table in the first version of this post had wrong speedup calculations.** Listed 126,978× for MAX_FILE_SIZE but 16,567ms ÷ 0.2ms = 82,835×. The raw times were correct; the division was wrong across the entire table. Fixed after a reader checked the math.
---
## Reproducing
```bash
git clone https://github.com/hemeda3/aygrep.git
cd aygrep && cargo install --path .
./scripts/benchmark-full.sh linux # ~5 min, clones kernel
./scripts/benchmark-full.sh chromium # ~30 min, clones Chromium
```
CI runs the Linux benchmark on every push and publishes results to [hemeda3.github.io/aygrep](https://hemeda3.github.io/aygrep/).
---
## Acknowledgments
Developed with AI assistance (Claude) for code generation, optimization iteration, and benchmarking. The author directed all architectural decisions and validated results on real hardware.
- **Vicent Marti** and the Cursor team for the blog post
- **Andrew Gallant (BurntSushi)** for ripgrep, memchr, and the regex crate
- **Russ Cox** for trigram indexing
- The **zoekt** team for the content-in-index architecture
---
*Ahmed Yousri · [github.com/hemeda3](https://github.com/hemeda3) · March 2026*