# Reverse-Engineering Cursor's Fast Code Search Algorithm

*Ahmed Yousri · March 2026*

---

In March 2026, Cursor published a blog post titled ["Fast regex search"](https://cursor.com/blog/fast-regex-search) by Vicent Marti. It described how they'd built an indexed code search system that could find a pattern across an entire codebase in 13 milliseconds. Their claim: sparse n-gram inverted indexes, two files on disk, a single binary search on a memory-mapped lookup table.

I read it and thought: I can build that.

What followed was a deep dive into research papers from the 1990s, operating system internals, memory-mapped I/O, and SIMD byte scanning. The result is [ayg](https://github.com/hemeda3/aygrep). On a 2-vCPU GitHub Actions runner with 7.8 GB RAM, it searches 79,225 Linux kernel files in 5.8ms hot for a selective query — and 460ms for a broad one matching 49,481 files. On the same runner, ripgrep takes 747ms and 1,563ms respectively.

The worst case: a query matching 62% of all files, cold cache, cheapest CI hardware. ayg: 12.4 seconds. ripgrep: 13.7 seconds. A 1.1× speedup. Barely worth the index.

The best case on the same hardware, same conditions: a selective query matching 13 files. ayg: 155ms. ripgrep: 13.1 seconds. An 85× speedup.

Every number in this post is sourced from either a [public CI run](https://github.com/hemeda3/aygrep/actions) or documented local measurements with stated cache conditions.
---

## Table of Contents

1. [The Problem: Why Grep Is Slow on Large Codebases](#1-the-problem)
2. [The Research: What Cursor Told Us (And What They Didn't)](#2-the-research)
3. [Phase 1: The Trigram Index Baseline](#3-phase-1)
4. [The Warm Cache Lie](#4-warm-cache-lie)
5. [Phase 2: Killing the Subprocess](#5-phase-2)
6. [Phase 3: Probe Before You Fetch](#6-phase-3)
7. [Phase 4: Content-in-Index](#7-phase-4)
8. [Phase 5: Sparse N-grams — The Algorithm Gap](#8-phase-5)
9. [Phase 6: The Library Audit That Mostly Didn't Matter](#9-phase-6)
10. [Phase 7: Zero-Copy Scanning via mmap](#10-phase-7)
11. [Phase 8: The Adaptive Architecture](#11-phase-8)
12. [The Benchmarks](#12-benchmarks)
13. [Where ayg Loses](#13-where-ayg-loses)
14. [What Cursor Described vs What We Built](#14-cursor-comparison)
15. [Lessons Learned](#15-lessons)

---
## 1. The Problem: Why Grep Is Slow on Large Codebases {#1-the-problem}

Chromium is one of the largest actively-developed open-source C++ projects. Some numbers:

- **436,610 source files** (after filtering binaries)
- **6.7 GB** of source code
- Spread across ~45,000 directories

When an AI coding agent needs to find something — say, every file that references `MAX_FILE_SIZE` — it runs `rg "MAX_FILE_SIZE"`. ripgrep is the fastest grep tool in existence. It uses SIMD-accelerated byte scanning, respects `.gitignore`, parallelizes across cores, and has been optimized by Andrew Gallant for years.

On an M3 Max MacBook Pro with Chromium:

```
$ time rg -c "MAX_FILE_SIZE" chromium/
15 files matched
real 16.567s
```

On a 2-vCPU GitHub Actions runner with the Linux kernel (79,225 files, 1.2 GB):

```
$ time rg -c "PM_RESUME"
13 files matched
real 13.129s ← cold cache
real 0.747s ← hot cache
```

The pattern `MAX_FILE_SIZE` appears in exactly 15 files out of 436,610. That's a 0.003% hit rate. The other 99.997% of I/O was wasted.

Can we figure out *which* files to read without reading all of them? That's what an inverted index does. That's what Cursor built.
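The idea is small enough to sketch. Below is a toy, in-memory trigram index in Rust; ayg's real index is hash-keyed and lives on disk, so treat this as an illustration of the mechanics rather than the implementation: map every trigram to the set of files containing it, then intersect posting lists at query time.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Toy in-memory trigram inverted index. The on-disk version described
// later is binary-searched, but the mechanics are identical.
fn build_index(docs: &[&[u8]]) -> BTreeMap<[u8; 3], BTreeSet<u32>> {
    let mut index: BTreeMap<[u8; 3], BTreeSet<u32>> = BTreeMap::new();
    for (fid, doc) in docs.iter().enumerate() {
        for win in doc.windows(3) {
            index.entry([win[0], win[1], win[2]]).or_default().insert(fid as u32);
        }
    }
    index
}

// Candidate files for a literal query: intersect the posting lists of
// every trigram in the query. Only these files need to be scanned.
fn candidates(index: &BTreeMap<[u8; 3], BTreeSet<u32>>, query: &[u8]) -> BTreeSet<u32> {
    let mut result: Option<BTreeSet<u32>> = None;
    for win in query.windows(3) {
        let list = index.get(&[win[0], win[1], win[2]]).cloned().unwrap_or_default();
        result = Some(match result {
            None => list,
            Some(acc) => acc.intersection(&list).cloned().collect(),
        });
    }
    result.unwrap_or_default()
}

fn main() {
    let docs: [&[u8]; 3] = [b"define MAX_FILE_SIZE 1024", b"hello world", b"MAX_RETRIES"];
    let index = build_index(&docs);
    // every trigram of the pattern must appear in a candidate file
    assert_eq!(candidates(&index, b"MAX_FILE_SIZE"), BTreeSet::from([0u32]));
}
```

Any file containing the full pattern must contain all of its trigrams, so the intersection can produce false positives but never false negatives; the candidates are then scanned to confirm.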
---

## 2. The Research: What Cursor Told Us (And What They Didn't) {#2-the-research}

### What the blog post says

Cursor's post walks through the evolution of code search indexing:

1. **Trigrams** (fixed 3-byte subsequences) — functional, but posting lists are large
2. **Suffix arrays** — exact matching, but memory-intensive
3. **Bloom filter masks** — augment trigrams with next-byte masks, but they "saturate too quickly"
4. **Sparse n-grams** — their actual solution

The key innovation is a **frequency-weighted boundary detector**. A 256×256 table maps every byte pair to a weight. Rare byte pairs get high weights. N-gram boundaries are placed at local maxima. This produces variable-length n-grams that are longer and more selective than fixed trigrams.
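The boundary rule itself fits in a few lines. The weight function below is a made-up stand-in, not Cursor's trained 256×256 table; only the local-maximum splitting rule reflects the description above.

```rust
// Illustrative boundary detector. The weight function is a toy stand-in
// (a real implementation uses a 256x256 table derived from corpus
// byte-pair frequencies); the part that matters is splitting at local
// maxima of the pair-weight sequence.
fn pair_weight(a: u8, b: u8) -> u32 {
    // pretend pairs touching '_' are rare, so they score high
    if a == b'_' || b == b'_' { 10 } else { 1 }
}

fn boundaries(text: &[u8]) -> Vec<usize> {
    // w[i] is the weight of the byte pair (text[i], text[i+1])
    let w: Vec<u32> = text.windows(2).map(|p| pair_weight(p[0], p[1])).collect();
    // place a boundary at every interior local maximum
    // (>= on both sides also admits flat runs)
    (1..w.len().saturating_sub(1))
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}

fn main() {
    // boundaries cluster around the rare '_' separators
    assert_eq!(boundaries(b"max_file_size"), vec![2, 3, 5, 7, 8, 10]);
}
```

Rare pairs attract boundaries, so identifiers split at their most distinctive positions, and the resulting n-grams are longer exactly where the text is most selective.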
### What the blog post doesn't say

- **How they scan candidate files.** "Doing this on the client is trivial" implies an embedded SIMD scanner, not a subprocess.
- **They store only hashes, not full n-grams.** 3-byte (24-bit) hash keys. Collisions merge posting lists — safe, just less selective.
- **They mmap only the lookup table.** Postings are read via `pread()`. The editor process stays lean.
- **No medium or broad query performance.** The 13ms number is for a selective query on an M2 MacBook.

### The papers

- **Zobel, Moffat, Sacks-Davis (1993)** — inverted index fundamentals
- **Russ Cox (2012)** — trigram indexing for regex search (Google Code Search)
- **ClickHouse** — sparse n-gram text indexing for database columns

---
## 3. Phase 1: The Trigram Index Baseline {#3-phase-1}

Extract every overlapping 3-byte sequence from every file. Build an inverted index mapping trigram → file IDs. Write it to disk.

Two files:

- `lookup.bin`: a sorted array of `[trigram: 3B, pad: 1B, offset: 8B, count: 4B]` = 16 bytes per entry. Binary search finds any trigram.
- `postings.bin`: concatenated delta+varint encoded posting lists.
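As a sketch of how delta + varint encoding keeps posting lists small (the exact byte layout of ayg's `postings.bin` may differ; this is the standard LEB128-style scheme the description names):

```rust
// Encode a sorted posting list as gaps (deltas), each written as a
// little-endian varint with a continuation bit in the high bit.
fn encode(postings: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for &fid in postings {
        let mut delta = fid - prev; // gaps stay small for dense lists
        prev = fid;
        loop {
            let byte = (delta & 0x7f) as u8;
            delta >>= 7;
            if delta == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit: more bytes follow
        }
    }
    out
}

// Inverse: accumulate 7-bit groups, then undo the delta.
fn decode(bytes: &[u8]) -> Vec<u32> {
    let (mut out, mut prev, mut acc, mut shift) = (Vec::new(), 0u32, 0u32, 0);
    for &b in bytes {
        acc |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            prev += acc;
            out.push(prev);
            acc = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}

fn main() {
    let postings = vec![3u32, 7, 1000, 70_000];
    let bytes = encode(&postings);
    assert_eq!(bytes.len(), 7);           // 16 bytes of raw u32s shrink to 7
    assert_eq!(decode(&bytes), postings); // lossless round-trip
}
```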
For the scan phase, I spawned `rg -c pattern file1 file2 file3...` as a subprocess. This was a mistake.

### Baseline results (M3 Max, Chromium, warm cache, development build)

| Query | Candidates | Total time | vs ripgrep 16.6s |
|-------|-----------|-----------|-----------------|
| MAX_FILE_SIZE | ~200 | 11.3ms | ~1,500× |
| gpu::Mailbox | ~245 | 9.1ms | ~1,800× |
| NOTREACHED | ~8,300 | 225ms | ~74× |
| std::unique_ptr | ~43,400 | 2,125ms | ~8× |

These numbers had a problem I hadn't noticed yet.
---

## 4. The Warm Cache Lie {#4-warm-cache-lie}

My benchmark ran each query 3 times and took the median. All warm. After building the index, the OS page cache held every source file in memory — because the build phase had just read them all.

After a 38 GB memory flood + `purge` on macOS:

```
MAX_FILE_SIZE truly cold: 25ms (was 11.3ms "warm")
```

The 11.3ms was a warm-cache number compared against ripgrep's cold-cache 16.6 seconds.

On Linux, `echo 3 > /proc/sys/vm/drop_caches` actually works. This is why we later tested on GCP and GitHub Actions with proper cache clearing.

**Lesson: Always measure cold cache. Always document cache state.**

---
## 5. Phase 2: Killing the Subprocess {#5-phase-2}

Each `Command::new("rg")` does a `fork()` + `exec()`. On macOS, that costs 2–5 milliseconds. For a query with 24 candidates that should take 0.1ms to scan, the subprocess overhead was 50× the actual work.

Replace it with in-process SIMD scanning using the `memchr` crate:

```rust
use memchr::memmem;

// Build the searcher once per query; `pattern_bytes` is the literal
// query and `data` holds a candidate file's contents.
let finder = memmem::Finder::new(pattern_bytes);
let count = finder.find_iter(&data).count();
```

`memmem::Finder` uses SIMD (NEON on ARM, SSE2/AVX2 on x86) to scan at 8+ GB/s.

| Query | Before (subprocess) | After (memchr) | Improvement |
|-------|----:|----:|----:|
| MAX_FILE_SIZE | 11.3ms | 6.5ms | 1.7× |
| kMaxBufferSize | 8.6ms | 1.9ms | 4.5× |
| gpu::Mailbox | 9.1ms | 1.5ms | 6.1× |

This single change mattered more than every "clever" optimization that followed.
---

## 6. Phase 3: Probe Before You Fetch {#6-phase-3}

The lookup table is mmap'd. A binary search costs ~5μs. The table also contains the posting list *count*, so we can see how selective a trigram is without reading its posting list.

Split the query into probe (free — mmap'd) and fetch (expensive — pread from disk):

```rust
// Sketch: probe all trigrams first (free, just binary searches in the
// mmap'd table), then fetch posting lists rarest-first and stop early.
probes.sort_by_key(|&(_, _, count)| count);

let mut candidates = fetch(probes[0]);
for probe in &probes[1..] {
    if candidates.len() <= 50 { break; } // already selective enough
    candidates = intersect(&candidates, &fetch(probe));
}
```
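A runnable version of the same loop, with toy in-memory posting lists standing in for the fetch step (the early-stop threshold is a parameter here; `narrow` and `intersect` are names of mine, not ayg's):

```rust
// Merge-intersect two sorted posting lists.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
        }
    }
    out
}

// Intersect rarest-first, stopping once the candidate set is small.
fn narrow(mut lists: Vec<Vec<u32>>, stop_at: usize) -> Vec<u32> {
    lists.sort_by_key(|l| l.len());  // cheapest (rarest) list first
    let mut cand = lists.remove(0);
    for list in lists {
        if cand.len() <= stop_at { break; } // early termination
        cand = intersect(&cand, &list);
    }
    cand
}

fn main() {
    let lists = vec![vec![1u32, 2, 3, 4, 5], vec![2, 3, 5], vec![3, 5, 7]];
    assert_eq!(narrow(lists.clone(), 0), vec![3, 5]);
    // with a lenient threshold, the rarest list alone is good enough
    assert_eq!(narrow(lists, 10), vec![2, 3, 5]);
}
```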
Also: `MADV_WILLNEED` to prefault the lookup table on startup. This shaved ~4ms off the first cold query.
---

## 7. Phase 4: Content-in-Index {#7-phase-4}

For 24 candidate files, `fs::read()` does `open()` → `stat()` → `malloc()` → `read()` → `close()`. That's ~4 syscalls per file.

Stream all file contents into a single `content.bin` during index build. At query time, mmap it and access candidates as slices:

```rust
fn file_content(&self, fid: u32) -> &[u8] {
    let (offset, length) = self.content_offsets[fid as usize];
    &self.content_mmap[offset as usize .. offset as usize + length as usize]
}
```

Zero allocation. Zero syscalls. The index grew from 831 MB to 3,123 MB.

The build streams content to disk during Pass 1. Peak memory: ~50 MB. No OOM on 8 GB machines.
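The streaming write is simple to sketch. The names and exact on-disk layout below are assumptions (raw bytes concatenated, with a separate offset table indexed by file ID); the point is that nothing beyond the current file is ever buffered:

```rust
use std::io::Write;

// Append one file's bytes to the content blob during the build pass,
// recording where it lives so query time can slice it back out.
fn append_file(
    out: &mut impl Write,
    offsets: &mut Vec<(u64, u64)>,
    pos: &mut u64,
    bytes: &[u8],
) -> std::io::Result<u32> {
    let fid = offsets.len() as u32;           // file IDs are assigned in order
    offsets.push((*pos, bytes.len() as u64)); // (offset, length) in content.bin
    out.write_all(bytes)?;                    // stream: no whole-corpus buffering
    *pos += bytes.len() as u64;
    Ok(fid)
}

fn main() -> std::io::Result<()> {
    let (mut blob, mut offsets, mut pos) = (Vec::new(), Vec::new(), 0u64);
    append_file(&mut blob, &mut offsets, &mut pos, b"hello")?;
    append_file(&mut blob, &mut offsets, &mut pos, b"world!")?;
    assert_eq!(offsets, vec![(0, 5), (5, 6)]);
    // query-time access is then a bounds-checked slice, no syscalls
    let (off, len) = offsets[1];
    assert_eq!(&blob[off as usize..(off + len) as usize], b"world!");
    Ok(())
}
```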
---

## 8. Phase 5: Sparse N-grams — The Algorithm Gap {#8-phase-5}

### The bug

The first implementation had a boundary alignment bug. `build_covering` (query-time extraction) used edge boundaries that didn't match the index-time extraction context.

### The fix

Use only **interior boundaries** at query time — positions where both neighbors exist and the weight is a local maximum:

```rust
// `w` is the byte-pair weight sequence for the query string
fn find_interior_boundaries(w: &[u8]) -> Vec<usize> {
    (1..w.len().saturating_sub(1))
        .filter(|&i| w[i] >= w[i - 1] && w[i] >= w[i + 1])
        .collect()
}
```

For `build_all` (index time), use ALL boundaries. For `build_covering` (query time), use only interior boundaries. This guarantees every covering n-gram exists in the index.
### Two-tier query

Short, common queries (`NOTREACHED`) have few interior peaks. Fall back to raw trigrams scored by rarity:

```rust
let keys = if covering_ranges.len() >= 3 {
    sparse_ngram_path(covering_ranges) // selective
} else {
    raw_trigram_fallback(pattern) // always works
};
```

| Query | Trigram candidates | Sparse candidates |
|-------|------------------:|------------------:|
| MAX_FILE_SIZE | ~200 | **24** |
| gpu::Mailbox | ~350 | **143** |
| NOTREACHED | ~8,300 | ~8,300 (fallback) |

Sparse n-grams can't help NOTREACHED — it genuinely appears in 8,300+ files.
---

## 9. Phase 6: The Library Audit That Mostly Didn't Matter {#9-phase-6}

| Change | Suspected cost | Actual query impact |
|--------|---------------|-------------------|
| Replace `HashSet` with bitvec | SipHash overhead | **Unmeasurable** |
| Replace `HashMap` with FxHashMap | SipHash overhead | **Build only** |
| Replace `walkdir` with `git ls-files` | 8s directory walk | **Build: 7.5s saved** |
| Replace `rayon` with `std::thread::scope` | Thread pool overhead | **Unmeasurable** |

Three of the four changes had zero measurable impact on query latency.

**Lesson: Profile before optimizing.**
---

## 10. Phase 7: Zero-Copy Scanning via mmap {#10-phase-7}

Replace per-file pread (one syscall each) with direct mmap slice access. Sort candidates by content offset for sequential access.

| Query | pread scan | mmap scan | Improvement |
|-------|----------:|---------:|------------:|
| MAX_FILE_SIZE | 1.6ms | **0.1ms** | 16× |
| NOTREACHED | 29ms | **8.7ms** | 3.3× |
| std::unique_ptr | 93ms | **22.2ms** | 4.2× |

These are M3 Max warm-cache measurements (development build). Verified final numbers are in the benchmarks section below.

---
## 11. Phase 8: The Adaptive Architecture {#11-phase-8}

Not everyone has 36 GB of RAM. Detect available memory at startup:

| Mode | When | Strategy |
|------|------|----------|
| **ContentMmap** | content.bin exists, RAM > 2× content | Zero-copy mmap |
| **ContentPread** | content.bin exists, RAM tight | One-alloc pread |
| **Filesystem** | No content.bin or low RAM | Read from repo |

The `--no-content` flag skips content.bin during build. The index shrinks from 3.1 GB to 831 MB. On machines with <8 GB available RAM, the build does this automatically.
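The mode choice above can be sketched as a single function. Names and signatures here are mine, not ayg's; the thresholds follow the table (mmap only when available RAM exceeds twice the content size):

```rust
#[derive(Debug, PartialEq)]
enum ScanMode { ContentMmap, ContentPread, Filesystem }

// Pick a scan strategy from what exists on disk and how much RAM is free.
fn choose_mode(has_content_bin: bool, avail_ram: u64, content_size: u64) -> ScanMode {
    if !has_content_bin {
        ScanMode::Filesystem        // fall back to reading the repo
    } else if avail_ram > 2 * content_size {
        ScanMode::ContentMmap       // zero-copy slices out of content.bin
    } else {
        ScanMode::ContentPread      // one allocation + pread per candidate
    }
}

fn main() {
    const GB: u64 = 1 << 30;
    assert_eq!(choose_mode(true, 36 * GB, 3 * GB), ScanMode::ContentMmap);
    assert_eq!(choose_mode(true, 4 * GB, 3 * GB), ScanMode::ContentPread);
    assert_eq!(choose_mode(false, 36 * GB, 0), ScanMode::Filesystem);
}
```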
---

## 12. The Benchmarks {#12-benchmarks}

### Methodology

**Cold cache (Linux):** `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` before each query.

**Cold cache (macOS):** `dd if=/dev/zero of=/tmp/mf bs=1m count=38000 && rm /tmp/mf && purge`.

**Timing:** ayg uses internal `Instant::now()` around the search core (excludes process startup and index loading). ripgrep is measured via Python `time.perf_counter()` around `subprocess.run()`. This favors ayg by ~50ms (startup cost excluded). For a fair CLI wall-clock comparison, see the customer-facing section.
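For reference, the internal timing amounts to wrapping only the search closure. This is a generic sketch, not ayg's actual harness:

```rust
use std::time::{Duration, Instant};

// Time only the closure: anything outside it (process startup, index
// load) is deliberately excluded, as the methodology above notes.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}

fn main() {
    let (n, elapsed) = timed(|| (0..1_000u32).sum::<u32>());
    assert_eq!(n, 499_500);
    assert!(elapsed < Duration::from_secs(1));
}
```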
**Correctness:** All match counts verified identical to ripgrep's.

### Hardware

| Machine | CPU | vCPU | RAM | Storage |
|---------|-----|-----:|----:|---------|
| **GitHub Actions** | AMD EPYC 7763 | 2 | 7.8 GB | Azure SSD |
| **M3 Max MacBook Pro** | Apple M3 Max @ 4.05 GHz | 14 | 36 GB | Apple NVMe (~7 GB/s) |
| **GCP e2-standard-2** | AMD EPYC 7B12 @ 2.25 GHz | 2 | 8 GB | pd-ssd (~400 MB/s) |
---

### Linux kernel (79,225 files) — GitHub Actions, no content.bin

Public CI verification: [benchmark run](https://github.com/hemeda3/aygrep/actions). Queries are ordered worst speedup first.

| Query | ayg cold | ayg hot | rg cold | rg hot | Cold × | Hot × | Candidates | Files |
|-------|--------:|-------:|-------:|------:|-------:|------:|-----------:|------:|
| `Copyright` | **12,412ms** | **461ms** | 13,697ms | 1,563ms | **1.1×** | **3.4×** | 49,507 | 49,481 |
| `struct device` | **6,406ms** | **221ms** | 13,253ms | 918ms | **2.1×** | **4.2×** | 21,185 | 11,408 |
| `mutex_lock` | **2,119ms** | **72ms** | 13,203ms | 875ms | **6.2×** | **12×** | 5,617 | 5,472 |
| `EXPORT_SYMBOL_GPL` | **1,814ms** | **62ms** | 13,112ms | 891ms | **7.2×** | **14×** | 5,694 | 3,130 |
| `PM_RESUME` | **155ms** | **5.8ms** | 13,129ms | 747ms | **85×** | **129×** | 235 | 13 |

| | Build | ayg total | rg total | Speedup |
|-|------:|----------:|---------:|--------:|
| **Cold** | 29.46s | 22,906ms | 66,394ms | **2.9×** |
| **Hot** | 29.46s | 822ms | 4,940ms | **6.0×** |

The `Copyright` query matches 49,481 of 79,225 files (62%). At cold cache, ayg barely wins. Without content.bin, scanning 49,507 candidate files from the filesystem is almost as slow as scanning all 79,225.
---

### Linux kernel — M3 Max, content.bin, cold (38 GB flood + purge)

| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **74.3ms** | 3,783ms | **51×** | 49,482 |
| `EXPORT_SYMBOL_GPL` | **15.8ms** | 3,549ms | **225×** | 3,129 |
| `PM_RESUME` | **6.2ms** | 3,585ms | **578×** | 13 |

With content.bin on fast local NVMe, even the broad query is 51× faster.

---

### Linux kernel — M3 Max, content.bin, hot

| Query | ayg | rg | Speedup | Files |
|-------|----:|---:|--------:|------:|
| `Copyright` | **96.7ms** | 2,550ms | **26×** | 49,482 |
| `struct device` | **75.6ms** | 2,693ms | **36×** | 11,408 |
| `mutex_lock` | **45.2ms** | 2,577ms | **57×** | 5,471 |
| `EXPORT_SYMBOL_GPL` | **30.1ms** | 2,578ms | **86×** | 3,129 |
| `PM_RESUME` | **7.0ms** | 2,761ms | **394×** | 13 |

| | ayg total | rg total | Speedup |
|-|----------:|---------:|--------:|
| **Hot** | 254.6ms | 13,159ms | **52×** |
---

### Chromium (436,610 files) — M3 Max, content.bin, cold

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `#include "base/` | **109.4ms** | ~16,500ms | **~151×** | 78,979 |
| `NOTREACHED` | **36.2ms** | 18,440ms | **509×** | 8,310 |
| `MAX_FILE_SIZE` | **1.0ms** | 16,567ms | **16,567×** | 24 |

---

### Chromium — M3 Max, no content.bin, cold (filesystem scan)

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **4,223ms** | 11,684ms | **2.8×** | 43,371 |
| `NOTREACHED` | **1,648ms** | 18,440ms | **11×** | 8,310 |
| `gpu::Mailbox` | **412ms** | 16,797ms | **41×** | 419 |
| `kMaxBufferSize` | **60ms** | 17,041ms | **284×** | 140 |
| `MAX_FILE_SIZE` | **8.0ms** | 16,567ms | **2,071×** | 24 |

Without content.bin, broad queries (43K candidates) approach ripgrep's speed because 43K individual `open/read/close` syscall chains become the bottleneck.
---

### Chromium — GCP e2-standard-2 (8 GB, 2 vCPU), no content.bin, cold

The absolute worst case: slowest CPU, smallest RAM, network-attached disk, no content file, cold cache.

| Query | ayg | rg | Speedup | Candidates |
|-------|----:|---:|--------:|-----------:|
| `std::unique_ptr` | **39,332ms** | 264,000ms | **6.7×** | 43,376 |
| `NOTREACHED` | **10,487ms** | 223,000ms | **21×** | 8,313 |
| `gpu::Mailbox` | **667ms** | ~209,000ms | **~313×** | 419 |
| `MAX_FILE_SIZE` | **160ms** | 209,000ms | **1,306×** | 24 |

ripgrep goes from 16s (local NVMe) to 209s (network SSD) — 13× slower. ayg goes from 8ms to 160ms — 20× slower. Both degrade, but ayg degrades less because it reads 24 files instead of 436,000.

---

### Customer-facing wall-clock (Homebrew binary, M3 Max, Chromium)

The numbers above use internal `Instant::now()`, which excludes ~50ms of process startup and index loading. This is what a human running `ayg search` from the terminal actually sees:

| Scenario | ayg CLI wall | ayg internal | rg wall | Speedup (wall) |
|----------|------------:|------------:|--------:|--------:|
| Warm | **59–64ms** | 0.5–1.5ms | 29.24s | **~460×** |
| Cold | **0.25s** | 45.4ms | 48.20s | **~193×** |

The ~460× wall-clock speedup is the honest end-to-end comparison. The sub-millisecond internal time is what an AI agent with a persistent process sees after the first query (index already loaded).
---

## 13. Where ayg Loses {#13-where-ayg-loses}

ayg does not always win. Showing where it loses matters more than showing where it wins.

**1. Broad queries without content.bin, warm cache:**

M3 Max, Chromium, `std::unique_ptr` (43,371 candidates), no content.bin, warm:

| Tool | Time |
|------|-----:|
| ripgrep | 11,684ms |
| **ayg** | **16,785ms** |

ayg is **0.7×** — slower than ripgrep. 43K individual `open/read/close` syscall chains are slower than ripgrep's optimized parallel SIMD scan of all files sequentially. With content.bin, the same query takes 22.2ms.

**2. Any query where >50% of files match:**

`Copyright` matches 49,481 of 79,225 Linux kernel files. On CI cold cache: ayg 12.4s vs ripgrep 13.7s = 1.1×. The index provides almost no selectivity.

**3. Small corpora, warm cache:**

On the Linux kernel (1.2 GB) with 32 GB RAM and a warm cache, ripgrep scans everything from the page cache in ~2.5s. The SIMD scan at 8+ GB/s has near-zero overhead. ayg's index lookup + scan is faster (254ms), but ripgrep's 2.5s is already fast enough that the index build (22s) takes 9 queries to amortize.

**4. Regex:**

ayg handles literals only. ripgrep handles `\p{Greek}`, Unicode word boundaries, and every regex feature the `regex` crate supports — plus lookarounds via its PCRE2 engine (`-P`).

### When ayg wins

- **The query is selective.** Agent patterns like `kMaxBufferSize` and `gpu::Mailbox` match 10–400 files. The index eliminates 99.9%+ of I/O.
- **The corpus doesn't fit in RAM.** Chromium (6.7 GB) on 8 GB: ripgrep takes 3.5 minutes. ayg takes 160ms.
- **The storage is slow.** Network-attached SSD: ripgrep 209s → ayg 160ms.
- **Repeated searches.** AI agents search 50–200 times per session. The build amortizes after 2–7 queries depending on hardware.
---

## 14. What Cursor Described vs What We Built {#14-cursor-comparison}

| Aspect | Cursor's blog | ayg |
|--------|--------------|-----|
| N-gram algorithm | Sparse n-grams with frequency boundaries | Same: 256×256 byte-pair table, interior boundaries for covering |
| Index format | Two files: lookup (mmap) + postings (pread) | Same: 16-byte sorted entries, delta+varint posting lists |
| Lookup | "mmap this table, and only this table" | Same |
| Content scanning | "trivial" (filesystem) | Adaptive: mmap / pread / filesystem based on RAM |
| 8 GB machines | Not discussed | Streaming build, auto-skip content.bin |
| Query fallback | Not discussed | Two-tier: sparse path + raw trigram fallback |
| Regex | Implied but not detailed | Not implemented |
| Git-based versioning | Yes | Not implemented |

Cursor's claimed 13ms on an M2 is consistent with our measurements. On M3 Max, cold, with content.bin: 1.0ms for MAX_FILE_SIZE. Cursor's number likely includes editor IPC overhead.
---

## 15. Lessons Learned {#15-lessons}

### What moved the needle

| Rank | Change | Impact | Category |
|-----:|--------|--------|----------|
| 1 | mmap content file (zero-copy scan) | NOTREACHED scan: 29ms → 8.7ms | Systems |
| 2 | Kill subprocess (embedded memchr) | Query floor: 5ms → 0.1ms | Systems |
| 3 | Content-in-index (single file) | Enabled #1; eliminated per-file syscalls | Architecture |
| 4 | Sparse n-grams | MAX_FILE_SIZE: 200 → 24 candidates | Algorithm |
| 5 | Probe-before-fetch + early termination | Saved 2–3 cold pread calls | Systems |
| 6 | MADV_WILLNEED prefault | Cold first query: −4ms | OS hints |
| 7 | git ls-files (replace walkdir) | Build: 8s → 0.2s | Build |
| 8 | FxHash, bitvec, galloping intersect | Unmeasurable | Noise |

Three of the top five are "boring" systems changes. The algorithm change (sparse n-grams) matters for selective queries but is irrelevant for broad ones, where there are no false positives to eliminate.

### What didn't matter

- **FxHash vs SipHash:** Not on the query path.
- **rayon vs std::thread::scope:** Identical performance.
- **Galloping intersection:** Marginal. Only helps when list sizes differ by 64×+.

### What I got wrong

1. **Believed `purge` cleared the cache on macOS.** It doesn't fully on Apple Silicon. A 38 GB memory flood was needed.
2. **Abandoned sparse n-grams when the first implementation had a bug.** The bug was boundary alignment. Fixing it took 30 minutes once diagnosed.
3. **Assumed library overhead was the bottleneck.** Spent hours replacing HashSet, HashMap, rayon. Net query improvement: zero.
4. **The M3 Max warm content table in the first version of this post had wrong speedup calculations.** It listed 126,978× for MAX_FILE_SIZE, but 16,567ms ÷ 0.2ms = 82,835×. The raw times were correct; the division was wrong across the entire table. Fixed after a reader checked the math.
---

## Reproducing

```bash
git clone https://github.com/hemeda3/aygrep.git
cd aygrep && cargo install --path .
./scripts/benchmark-full.sh linux     # ~5 min, clones kernel
./scripts/benchmark-full.sh chromium  # ~30 min, clones Chromium
```

CI runs the Linux benchmark on every push and publishes results to [hemeda3.github.io/aygrep](https://hemeda3.github.io/aygrep/).
---

## Acknowledgments

Developed with AI assistance (Claude) for code generation, optimization iteration, and benchmarking. The author directed all architectural decisions and validated results on real hardware.

- **Vicent Marti** and the Cursor team for the blog post
- **Andrew Gallant (BurntSushi)** for ripgrep, memchr, and the regex crate
- **Russ Cox** for trigram indexing
- The **zoekt** team for the content-in-index architecture

---

*Ahmed Yousri · [github.com/hemeda3](https://github.com/hemeda3) · March 2026*
Author
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Reverse-Engineering Cursor's fast Code Search algorithm
Ahmed Yousri · March 2026
In March 2026, Cursor published a blog post titled "Fast regex search" by Vicent Marti. It described how they'd built an indexed code search system that could find a pattern across an entire codebase in 13 milliseconds. Their claim: sparse n-gram inverted indexes, two files on disk, a single binary search on a memory-mapped lookup table.
I read it and thought: I can build that.
What followed was a deep dive into research papers from the 1990s, operating system internals, memory-mapped I/O, and SIMD byte scanning. The result is ayg. On a 2-vCPU GitHub Actions runner with 7.8 GB RAM, it searches 79,225 Linux kernel files in 5.8ms hot for a selective query — and 460ms for a broad one matching 49,481 files. On the same runner, ripgrep takes 747ms and 1,563ms respectively.
The worst case: a query matching 62% of all files, cold cache, cheapest CI hardware. ayg: 12.4 seconds. ripgrep: 13.7 seconds. A 1.1× speedup. Barely worth the index.
The best case on the same hardware, same conditions: a selective query matching 13 files. ayg: 155ms. ripgrep: 13.1 seconds. An 85× speedup.
Every number in this post is sourced from either a public CI run or documented local measurements with stated cache conditions.
Table of Contents
1. The Problem: Why Grep Is Slow on Large Codebases {#1-the-problem}
Chromium is one of the largest actively-developed open-source C++ projects. Some numbers:
When an AI coding agent needs to find something — say, every file that references
MAX_FILE_SIZE— it runsrg "MAX_FILE_SIZE". ripgrep is the fastest grep tool in existence. It uses SIMD-accelerated byte scanning, respects.gitignore, parallelizes across cores, and has been optimized by Andrew Gallant for years.On an M3 Max MacBook Pro with Chromium:
On a 2-vCPU GitHub Actions runner with the Linux kernel (79,225 files, 1.2 GB):
The pattern
MAX_FILE_SIZEappears in exactly 15 files out of 436,610. That's a 0.003% hit rate. The other 99.997% of I/O was wasted.Can we figure out which files to read without reading all of them? That's what an inverted index does. That's what Cursor built.
2. The Research: What Cursor Told Us (And What They Didn't) {#2-the-research}
What the blog post says
Cursor's post walks through the evolution of code search indexing:
The key innovation is a frequency-weighted boundary detector. A 256×256 table maps every byte pair to a weight. Rare byte pairs get high weights. N-gram boundaries are placed at local maxima. This produces variable-length n-grams that are longer and more selective than fixed trigrams.
What the blog post doesn't say
pread(). The editor process stays lean.The papers
3. Phase 1: The Trigram Index Baseline {#3-phase-1}
Extract every overlapping 3-byte sequence from every file. Build an inverted index mapping trigram → file IDs. Write to disk.
Two files:
lookup.bin: sorted array of[trigram: 3B, pad: 1B, offset: 8B, count: 4B]= 16 bytes per entry. Binary search to find any trigram.postings.bin: concatenated delta+varint encoded posting lists.For the scan phase, I spawned
rg -c pattern file1 file2 file3...as a subprocess. This was a mistake.Baseline results (M3 Max, Chromium, warm cache, development build)
These numbers had a problem I hadn't noticed yet.
4. The Warm Cache Lie {#4-warm-cache-lie}
My benchmark ran each query 3 times and took the median. All warm. After building the index, the OS page cache held every source file in memory — because the build phase had just read them all.
After a 38 GB memory flood +
purgeon macOS:The 11.3ms was a warm-cache number compared against ripgrep's cold-cache 16.6 seconds.
On Linux,
echo 3 > /proc/sys/vm/drop_cachesactually works. This is why we later tested on GCP and GitHub Actions with proper cache clearing.Lesson: Always measure cold cache. Always document cache state.
5. Phase 2: Killing the Subprocess {#5-phase-2}
Each
Command::new("rg")does afork()+exec(). On macOS, that costs 2–5 milliseconds. For a query with 24 candidates that should take 0.1ms to scan, the subprocess overhead was 50× the actual work.Replace with in-process SIMD scanning using the
memchrcrate:memmem::Finderuses SIMD (NEON on ARM, SSE2/AVX2 on x86) to scan at 8+ GB/s.This single change mattered more than every "clever" optimization that followed.
6. Phase 3: Probe Before You Fetch {#6-phase-3}
The lookup table is mmap'd. Binary search costs ~5μs. The table also contains the posting list count. We can see how selective a trigram is without reading its posting list.
Split into probe (free — mmap'd) and fetch (expensive — pread from disk):
Also:
MADV_WILLNEEDto prefault the lookup table on startup. Shaved ~4ms off the first cold query.7. Phase 4: Content-in-Index {#7-phase-4}
For 24 candidate files,
fs::read()doesopen()→stat()→malloc()→read()→close(). That's ~4 syscalls per file.Stream all file contents into a single
content.binduring index build. At query time, mmap it and access candidates as slices:Zero allocation. Zero syscall. Index grew from 831 MB to 3,123 MB.
The build streams content to disk during Pass 1. Peak memory: ~50 MB. No OOM on 8 GB machines.
8. Phase 5: Sparse N-grams {#8-phase-5}
The bug
The first implementation had a boundary alignment bug.
build_covering(query-time extraction) used edge boundaries that didn't match the index-time extraction context.The fix
Use only interior boundaries at query time — positions where both neighbors exist and the weight is a local maximum:
For
build_all(index time), use ALL boundaries. Forbuild_covering(query time), use only interior boundaries. This guarantees every covering n-gram exists in the index.Two-tier query
Short common queries (
NOTREACHED) have few interior peaks. Fallback to raw trigrams scored by rarity:Sparse n-grams can't help NOTREACHED — it genuinely appears in 8,300+ files.
9. Phase 6: The Library Audit That Mostly Didn't Matter {#9-phase-6}
HashSetwith bitvecHashMapwith FxHashMapwalkdirwithgit ls-filesrayonwithstd::thread::scopeThree of four changes had zero measurable impact on query latency.
Lesson: Profile before optimizing.
10. Phase 7: Zero-Copy Scanning via mmap {#10-phase-7}
Replace per-file pread (one syscall each) with direct mmap slice access. Sort candidates by content offset for sequential access.
These are M3 Max warm cache measurements (development). Verified final numbers are in the benchmarks section below.
11. Phase 8: The Adaptive Architecture {#11-phase-8}
Not everyone has 36 GB of RAM. Detect available memory at startup:
The
--no-contentflag skips content.bin during build. Index shrinks from 3.1 GB to 831 MB. On machines with <8 GB available RAM, the build does this automatically.12. The Benchmarks {#12-benchmarks}
Methodology
**Cold cache (Linux):** `sync && echo 3 | sudo tee /proc/sys/vm/drop_caches` before each query.

**Cold cache (macOS):** `dd if=/dev/zero of=/tmp/mf bs=1m count=38000 && rm /tmp/mf && purge`.

**Timing:** ayg uses internal `Instant::now()` around the search core (excludes process startup and index loading). ripgrep is measured via Python `time.perf_counter()` around `subprocess.run()`. This favors ayg by ~50ms (startup cost excluded). For a fair CLI wall-clock comparison, see the customer-facing section.

**Correctness:** all match counts verified identical to ripgrep's.
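The two timing modes can be sketched as follows (a trivial `true` invocation stands in for the real ripgrep command line):

```rust
use std::process::Command;
use std::time::Instant;

// Sketch of the two timing modes described above: internal timing wraps
// only the search core; wall-clock timing wraps an entire child process,
// startup included. `true` stands in for the ripgrep invocation.
fn main() {
    // Internal: what ayg's Instant::now() numbers measure.
    let t = Instant::now();
    let _hits: usize = 0; // search_core(&index, query) would run here
    let internal = t.elapsed();

    // End-to-end: mirrors the Python perf_counter() subprocess harness.
    let t = Instant::now();
    let status = Command::new("true").status().expect("failed to spawn");
    let wall = t.elapsed();

    assert!(status.success());
    println!("internal: {:?}, subprocess wall-clock: {:?}", internal, wall);
}
```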
### Hardware
### Linux kernel (79,225 files) — GitHub Actions, no content.bin
Public CI verification: benchmark run. Queries ordered worst speedup first.
Queries: `Copyright`, `struct device`, `mutex_lock`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

The `Copyright` query matches 49,481 out of 79,225 files (62%). At cold cache, ayg barely wins. Without content.bin, scanning 49,507 candidate files from the filesystem is almost as slow as scanning all 79,225.

### Linux kernel — M3 Max, content.bin, cold (38 GB flood + purge)
Queries: `Copyright`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

With content.bin on fast local NVMe, even the broad query is 51× faster.
### Linux kernel — M3 Max, content.bin, hot

Queries: `Copyright`, `struct device`, `mutex_lock`, `EXPORT_SYMBOL_GPL`, `PM_RESUME`.

### Chromium (436,610 files) — M3 Max, content.bin, cold
Queries: `#include "base/`, `NOTREACHED`, `MAX_FILE_SIZE`.

### Chromium — M3 Max, no content.bin, cold (filesystem scan)
Queries: `std::unique_ptr`, `NOTREACHED`, `gpu::Mailbox`, `kMaxBufferSize`, `MAX_FILE_SIZE`.

Without content.bin, broad queries (43K candidates) approach ripgrep speed because 43K individual `open`/`read`/`close` syscall chains become the bottleneck.

### Chromium — GCP e2-standard-2 (8 GB, 2 vCPU), no content.bin, cold
The absolute worst case: slowest CPU, smallest RAM, network-attached disk, no content file, cold cache.
Queries: `std::unique_ptr`, `NOTREACHED`, `gpu::Mailbox`, `MAX_FILE_SIZE`.

ripgrep goes from 16s (local NVMe) to 209s (network SSD) — 13× slower. ayg goes from 8ms to 160ms — 20× slower. Both degrade, but ayg degrades less because it reads 24 files instead of 436,000.
### Customer-facing wall-clock (Homebrew binary, M3 Max, Chromium)
The numbers above use internal `Instant::now()`, which excludes ~50ms of process startup and index loading; the wall-clock numbers here are what a human running `ayg search` from the terminal actually sees. The ~460× wall-clock speedup is the honest end-to-end comparison. The sub-millisecond internal time is what an AI agent with a persistent process sees after the first query (index already loaded).
## 13. Where ayg Loses {#13-where-ayg-loses}
ayg does not always win. Showing where it loses matters more than showing where it wins.
**1. Broad queries without content.bin, warm cache.** M3 Max, Chromium, `std::unique_ptr` (43,371 candidates), no content.bin, warm: ayg is 0.7× — slower than ripgrep. 43K individual `open`/`read`/`close` syscalls are slower than ripgrep's optimized parallel SIMD scan of all files in sequence. With content.bin, the same query takes 22.2ms.

**2. Any query where >50% of files match.** `Copyright` matches 49,481 out of 79,225 Linux kernel files. On CI cold cache: ayg 12.4s vs ripgrep 13.7s = 1.1×. The index provides almost no selectivity.

**3. Small corpora, warm cache.**
On the Linux kernel (1.2 GB) with 32 GB RAM and warm cache, ripgrep scans everything from page cache in ~2.5s. The SIMD scan at 8+ GB/s has near-zero overhead. ayg's index lookup + scan is faster (254ms), but ripgrep's 2.5s is already fast enough that the index build (22s) takes 9 queries to amortize.
**4. Regex.** ayg handles literals only. ripgrep handles `\p{Greek}`, Unicode word boundaries, every regex feature the `regex` crate supports, and lookarounds via `--pcre2`.

### When ayg wins
Selective queries like `kMaxBufferSize` and `gpu::Mailbox` match 10–400 files. The index eliminates 99.9%+ of the I/O.

## 14. What Cursor Described vs What We Built {#14-cursor-comparison}
Cursor's claimed 13ms on M2 is consistent with our measurements. On M3 Max cold with content.bin: 1.0ms for MAX_FILE_SIZE. Cursor's number likely includes editor IPC overhead.
## 15. Lessons Learned {#15-lessons}
### What moved the needle
Three of the top five are "boring" systems changes. The algorithm change (sparse n-grams) matters for selective queries but is irrelevant for broad ones where there are zero false positives.
### What didn't matter
### What I got wrong
- Believed `purge` cleared the cache on macOS. It doesn't fully on Apple Silicon. Needed a 38 GB memory flood.
- Abandoned sparse n-grams when the first implementation had a bug. The bug was boundary alignment. Fixing it took 30 minutes once diagnosed.
- Assumed library overhead was the bottleneck. Spent hours replacing `HashSet`, `HashMap`, and `rayon`. Net query improvement: zero.
- The M3 Max warm content.bin table in the first version of this post had wrong speedup calculations. It listed 126,978× for `MAX_FILE_SIZE`, but 16,567ms ÷ 0.2ms = 82,835×. The raw times were correct; the division was wrong across the entire table. Fixed after a reader checked the math.
## Reproducing
CI runs the Linux benchmark on every push and publishes results to hemeda3.github.io/aygrep.
## Acknowledgments
Developed with AI assistance (Claude) for code generation, optimization iteration, and benchmarking. The author directed all architectural decisions and validated results on real hardware.
Ahmed Yousri · github.com/hemeda3 · March 2026