Two 1-bit ANN distance estimators (symmetric Charikar-LSH + asymmetric RaBitQ-2024), 32× per-vector code compression, 21× full-index memory compression, 3.13× end-to-end speedup at 100% recall@10 on n = 100,000 clustered D=128 vectors — pure Rust, no C/C++ deps, no SIMD intrinsics yet.
Part of the ruvector vector search library. Branch: `research/nightly/2026-04-23-rabitq` (head: `6c6e04554`) · ADR-154 · PR #370
`crates/ruvector-rabitq/` implements two 1-bit distance estimators behind a single `AnnIndex` trait — swap backends by changing a constructor line:
| Variant | Storage | Estimator | Rerank | Headline (n=100k, post-optimization) |
|---|---|---|---|---|
| `FlatF32Index` | f32 originals | exact L2 | N/A | 100% r@10 · 306 QPS · 50.4 MB |
| `RabitqIndex` | rotation + SoA codes | symmetric (Charikar) | no | 8.1% r@10 · 3,639 QPS · 2.4 MB |
| `RabitqPlusIndex` | rotation + SoA codes + originals | sym + exact rerank | yes | 100% r@10 · 957 QPS · 53.5 MB (rerank×20) |
| `RabitqAsymIndex` | rotation + SoA codes [+ originals] | asymmetric (RaBitQ-2024) | optional | 95.6% r@10 · 26 QPS (scalar, needs SIMD) |
Every row is from one cargo run on the same clustered-Gaussian dataset with the same queries and ground truth.
The v1 numbers (commit `8dbc560d0`) used per-entry `Vec<(usize, BinaryCode)>` storage with each `BinaryCode` heap-allocating its own `Vec<u64>`. The v2 kernel (commit `6c6e04554`) collapses this into a flat struct-of-arrays and replaces the `.cos()` call with a lookup table.
| n=100k, D=128 | v1 QPS | v2 QPS | Δ | memory |
|---|---|---|---|---|
| Flat | 309 | 306 | 1.0× | 50.4 MB |
| RaBitQ sym no rerank | 1,176 | 3,639 | 3.09× | 5.8 → 2.4 MB |
| RaBitQ+ sym rerank×5 | 811 | 2,058 | 2.54× | — |
| RaBitQ+ sym rerank×20 | 544 | 957 | 1.76× | — |
- Struct-of-arrays storage. Three contiguous vecs — `ids: Vec<u32>`, `norms: Vec<f32>`, `packed: Vec<u64>` (n × n_words flat slab). Kills the pointer chase per candidate and halves per-entry memory (8 B vs 40 B).
- cos lookup table. Agreement ∈ [0, D] has at most D+1 distinct values, so `cos(π · (1 − B/D))` is precomputed once at build. One L1-resident load per candidate instead of a ~30 ns `cos` call. At D=128 the LUT is 516 B.
- Aligned-D fast path. When `D % 64 == 0` the last-word mask is `!0u64` and the AND on the tail word gets skipped — at D=128 the inner loop is 2 XORs + 2 POPCNTs + 1 LUT load + 1 FMA per candidate.
- Raw-pointer SoA walk avoids bounds checks on the `packed` slab.
Net effect: 3.13× over exact flat at 100% recall@10 (up from 1.76×), and 21× full-index memory compression (up from 8.7×).
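The optimizations above combine into an inner loop along these lines — a minimal safe-Rust sketch for the aligned D=128 case, with hypothetical names (the crate's real kernel walks the slab via raw pointers to skip bounds checks):

```rust
// Sketch of the symmetric scan kernel (hypothetical names; D = 128, so each
// code is exactly two u64 words and no tail mask is needed).
fn symmetric_scan_d128(
    query_words: [u64; 2],
    query_norm: f32,
    packed: &[u64],       // flat SoA slab: n × 2 words
    norms: &[f32],        // stored ‖x‖ per vector
    cos_lut: &[f32; 129], // cos(π · (1 − b/128)) for b in 0..=128
    out: &mut Vec<f32>,
) {
    for (i, code) in packed.chunks_exact(2).enumerate() {
        // agreement b = D − popcount(q XOR x): 2 XORs + 2 POPCNTs
        let diff = (code[0] ^ query_words[0]).count_ones()
            + (code[1] ^ query_words[1]).count_ones();
        let est_cos = cos_lut[128 - diff as usize]; // 1 LUT load
        let (qn, xn) = (query_norm, norms[i]);
        // est ‖q − x‖² = ‖q‖² + ‖x‖² − 2·‖q‖·‖x‖·est cos(θ)
        out.push(qn * qn + xn * xn - 2.0 * qn * xn * est_cos);
    }
}
```

The LUT index works because identical codes give agreement 128 (est cos = +1) and opposite codes give agreement 0 (est cos = −1), exactly the endpoints of the hyperplane-LSH collision formula.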
Asymmetric path unchanged — its O(D) scalar signed-dot-product dominates; `std::arch` SIMD gather is the named next lever.
Symmetric (Charikar-style) — both query and database are 1-bit.
E[agreement/D] = 1 − θ/π (random-hyperplane collision probability)
est cos(θ) = cos_lut[agreement]
est ⟨q, x⟩ = ‖q‖ · ‖x‖ · est cos(θ)
est ‖q − x‖² = ‖q‖² + ‖x‖² − 2 · est ⟨q, x⟩
This is hyperplane-LSH on a rotated basis, not the tighter estimator from the SIGMOD 2024 RaBitQ paper. The rotation step (Haar-uniform via Gram–Schmidt on a Gaussian matrix) decorrelates dimensions before binarization — naive sign(x_i) without rotation gives ~15–20% recall on structured data; this gets 36–88% pre-rerank and 100% after rerank×20.
Asymmetric (RaBitQ-2024-style) — query stays f32, database stays 1-bit.
b_x ∈ {−1/√D, +1/√D}^D (stored 1-bit code)
⟨q̂_rot, u_x⟩ ≈ (1/√D) · Σᵢ sign(x_rot,i) · q_rot,i
est ⟨q, x⟩ ≈ ‖q‖ · ‖x‖ · ⟨q̂_rot, u_x⟩
Reconstructs the inner product by walking the stored signs against the full-precision rotated query — unbiased on Haar-uniform rotations, tighter variance than the symmetric path. O(D) per candidate vs O(D/64) popcount — 140× slower than symmetric in the scalar baseline, wants SIMD gather to be practical at scale.
Recall@10 at n=100k: 95.6% with rerank×5 vs 87.9% for symmetric rerank×5 — the extra accuracy is real, the QPS gap is engineering.
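A scalar sketch of this estimator, under the formulas above (hypothetical function name and signature; the crate's real code may differ). This is the O(D) per-candidate loop the text says needs SIMD gather:

```rust
// Asymmetric (RaBitQ-2024-style) inner-product estimate: the query stays
// f32, the database vector is only its stored sign bits.
fn asym_inner_product_est(
    q_rot: &[f32],      // full-precision rotated query
    code_words: &[u64], // candidate's packed 1-bit code
    q_norm: f32,
    x_norm: f32,
) -> f32 {
    let d = q_rot.len();
    let inv_sqrt_d = 1.0 / (d as f32).sqrt();
    let mut dot = 0.0f32;
    for i in 0..d {
        // sign(x_rot,i) recovered from the stored bit
        let sign = if (code_words[i / 64] >> (i % 64)) & 1 == 1 { 1.0 } else { -1.0 };
        dot += sign * q_rot[i]; // Σᵢ sign(x_rot,i) · q_rot,i
    }
    // est ⟨q, x⟩ ≈ ‖q‖ · ‖x‖ · (1/√D) · Σᵢ sign(x_rot,i) · q_rot,i
    q_norm * x_norm * inv_sqrt_d * dot
}
```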
Release build, Ryzen-class laptop, single thread, no SIMD intrinsics, D=128, 100 Gaussian clusters (σ=0.6), nq=200. Deterministic seeds; reruns are bit-identical.
| n | variant | r@1 | r@10 | r@100 | QPS | mem/MB | vs flat |
|---|---|---|---|---|---|---|---|
| 100 k | FlatF32 (exact) | 100.0% | 100.0% | 100.0% | 306 | 50.4 | — |
| | RaBitQ sym no rerank | 2.0% | 8.1% | 27.1% | 3,639 | 2.4 | 11.9× |
| | RaBitQ+ sym rerank×20 | 100.0% | 100.0% | 100.0% | 957 | 53.5 | 3.13× |
| | RaBitQ+ sym rerank×5 | 92.0% | 87.9% | 78.1% | 2,058 | 53.5 | 6.73× |
| | RaBitQ-Asym rerank×5 | 99.0% | 95.6% | 87.0% | 26 | 53.5 | 0.08× |
| 50 k | RaBitQ+ sym rerank×5 | 100.0% | 99.9% | 99.8% | 3,238 | 26.8 | 5.35× |
| | RaBitQ+ sym rerank×20 | 100.0% | 100.0% | 100.0% | 1,468 | 26.8 | 2.43× |
| 5 k | RaBitQ+ sym rerank×5 | 100.0% | 100.0% | 85.1% | 9,374 | 2.7 | 1.70× |
| 1 k | FlatF32 | 100.0% | 100.0% | 100.0% | 20,711 | 0.5 | — |
What the numbers actually say:
- Production config = symmetric + rerank×20 at n ≥ 50k. At n=50k it's 2.43× over flat at 100% recall@10; at n=100k it's 3.13×. Speedup grows with n because flat scan is O(n·D) while the 1-bit scan is O(n·D/64).
- rerank×5 is the sweet spot below n=50k. At n=50k rerank×5 wins 5.35× at 99.9% recall@10; the scaling regression (100% at n=5k → 87.9% at n=100k) forces the bump to ×20 at large n. Documented in `BENCHMARK.md`.
- Memory: 32× on codes alone, 21× on the full index at n=100k. `RabitqIndex` at n=100k is 2.4 MB vs Flat's 50.4 MB. The rotation matrix is a one-time 64 KB overhead that amortises to nothing as n grows.
- Asymmetric is slow in the scalar baseline — 26 QPS at n=100k. Needs SIMD gather. Follow-up.
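Back-of-envelope for those compression ratios, assuming the SoA layout described in the perf section (packed bits + f32 norm + u32 id per entry, plus the one-time rotation matrix); the crate's real accounting lives in `memory_bytes()`:

```rust
// Estimated compression vs storing f32 originals, per the SoA layout.
fn compression_ratios(n: u64, d: u64) -> (f64, f64) {
    let flat = (n * d * 4) as f64;          // f32 originals
    let codes = (n * d / 8) as f64;         // 1 bit per dimension
    let soa = (n * (d / 8 + 4 + 4)) as f64; // packed + f32 norm + u32 id
    let rotation = (d * d * 4) as f64;      // 64 KB at D=128, amortised
    (flat / codes, flat / (soa + rotation))
}
// At n = 100,000, D = 128: 32× on codes alone, ≈21× on the full index —
// consistent with the 2.4 MB vs 50.4 MB row in the table above.
```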
Seven integrity issues from the deep review of v1 were fixed, then a perf pass delivered the 2.5–3.1× kernel speedup.
- Padding bug at `D % 64 != 0` — XNOR-popcount was counting the zero padding of the last u64 as "matches". Fixed via `masked_xnor_popcount` with a regression test at D=100 (raw returns 28 matches for opposite vectors; masked returns 0).
- Memory accounting was misleading — `RabitqIndex` stored full f32 originals but omitted them from `memory_bytes()`. Fixed: dropped unused originals; all variants now report honest totals. Test `memory_accounting_is_honest` enforces `RabitqIndex < Flat < RabitqPlusIndex`.
- `partial_cmp().unwrap()` could panic on NaN. Replaced with `f32::total_cmp`.
- `rabitq_recall_at_10_above_70pct` asserted `> 0.20` — renamed to match reality.
- "RaBitQ estimator" was Charikar hyperplane-LSH. Correctly labelled now; the tighter RaBitQ-2024 estimator shipped as `RabitqAsymIndex`.
- "Pure Rust, zero dependencies beyond std" was false. Correctly stated — deps are `rand`, `rand_distr`, `serde`, `thiserror`. No C/C++ deps, no unsafe (beyond the SoA raw-pointer walk in the scan hot loop).
- Bench vs recall datasets were incompatible. Replaced with a unified harness (`src/main.rs`) whose every row comes from the same run.
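The padding fix is worth spelling out. A sketch of the masked kernel (the real `masked_xnor_popcount` lives in the crate; this reconstruction assumes the same zero-padded last-word layout):

```rust
// Count agreeing sign bits across two packed codes of logical length d,
// masking off the zero padding in the final u64 so it can't count as matches.
fn masked_xnor_popcount(a: &[u64], b: &[u64], d: usize) -> u32 {
    let mut matches = 0;
    for (i, (wa, wb)) in a.iter().zip(b).enumerate() {
        let mut agree = !(wa ^ wb); // XNOR: 1 where sign bits agree
        let bits_here = (d - i * 64).min(64);
        if bits_here < 64 {
            agree &= (1u64 << bits_here) - 1; // zero out the padding bits
        }
        matches += agree.count_ones();
    }
    matches
}
```

At D=100 the last word carries 36 real bits and 28 padding zeros; for opposite vectors the unmasked kernel reports those 28 padding positions as spurious matches, while the masked version correctly returns 0.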
Perf pass:
- SoA storage — flat `Vec<u32>` ids, `Vec<f32>` norms, `Vec<u64>` packed (row-major) replace `Vec<(usize, BinaryCode)>`. No pointer chase per candidate; per-entry memory halved.
- cos lookup table — kills the `.cos()` call. 516 B at D=128, L1-resident.
- Aligned-D fast path — `D % 64 == 0` skips the tail-word mask AND; unaligned D still correct via the regression-tested masked path.
Test count: 10 → 20 passing. New coverage: non-aligned D (100), NaN safety, full-pairs orthogonality at D ∈ {64, 128, 256}, asymmetric-vs-symmetric ordering, honest memory accounting, heap top-k ordering.
- Rotate once. Build a D×D Haar-ish orthogonal matrix via Gram–Schmidt on a Gaussian matrix (seeded). 64 KB at D=128, ~3.6 M ops to build. The rotation decorrelates dimensions so 1-bit quantization error spreads evenly.
- Encode database. For each `x`: normalize to the unit sphere (store ‖x‖), rotate `x' = P·x̂`, binarize `bit_i = 1 if x'_i ≥ 0`. Packed into the shared `packed: Vec<u64>` slab — no per-vector allocation.
- Query:
  - Symmetric: same encoder, then a raw-pointer walk through `packed` — 2 XORs + 2 POPCNTs + `cos_lut[agreement]` + FMA per candidate at D=128.
  - Asymmetric: rotate the normalised query to f32 `q_rot`, then walk each candidate's 1-bit code and take a signed dot product against `q_rot` scaled by `1/√D`.
- Rerank (optional). Take the top `rerank_factor · k` candidates from the 1-bit scan, compute exact f32 distance only for those, pick the final top-k. This is how you go from 8% recall (bit-scan only) to 100% (rerank×20).
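The encode step above can be sketched as follows (hypothetical signature; `rotation` is the seeded D×D orthogonal matrix, row-major). Note that the sign of `P·x` is unchanged by the positive scale `1/‖x‖`, so the bits can be computed without an explicit divide:

```rust
// Encode one vector into the shared packed slab; returns the stored ‖x‖.
fn encode_into(x: &[f32], rotation: &[f32], packed: &mut Vec<u64>) -> f32 {
    let d = x.len();
    // 1. norm for the estimator (normalization is folded into the sign test)
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    // 2–3. rotate x' = P·x̂ and binarize bit_i = 1 iff x'_i ≥ 0,
    //      packing straight into the slab — no per-vector allocation
    let n_words = (d + 63) / 64;
    let base = packed.len();
    packed.resize(base + n_words, 0);
    for i in 0..d {
        let xi: f32 = (0..d).map(|j| rotation[i * d + j] * x[j]).sum();
        if xi >= 0.0 {
            packed[base + i / 64] |= 1u64 << (i % 64);
        }
    }
    norm
}
```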
```sh
git clone -b research/nightly/2026-04-23-rabitq https://github.com/ruvnet/ruvector.git
cd ruvector
cargo test -p ruvector-rabitq --release                            # 20 tests pass
cargo run --release -p ruvector-rabitq --bin rabitq-demo           # full table (~20 s)
cargo run --release -p ruvector-rabitq --bin rabitq-demo -- --fast # smoke run (~5 s)
cargo bench -p ruvector-rabitq --bench rabitq_bench                # Criterion kernel + search benches
```

- SIFT1M / GIST1M / DEEP10M — the standard ANN benchmarks from the SIGMOD paper. Current numbers are on clustered Gaussian. Follow-up.
- HNSW integration — ruvector ships HNSW. RaBitQ in production is a distance-kernel plug-in for a graph index; the current flat-scan path is a correctness + throughput harness, not a production shape.
- `std::arch` SIMD gather for the asymmetric path. The scalar baseline is 140× slower than symmetric; SIMD would close most of the gap.
- AVX2/NEON popcount batching for the symmetric path. Scalar POPCNT is already good (3.6k QPS at n=100k); batched SIMD adds another ~2–3×.
- Parallel search via `rayon` — already feature-gated (`parallel`), off by default.
- Repo: github.com/ruvnet/ruvector
- Branch: `research/nightly/2026-04-23-rabitq` (head: `6c6e04554`)
- Draft PR: #370
- ADR-154: `docs/adr/ADR-154-rabitq-rotation-binary-quantization.md`
- Benchmark reproducer: `crates/ruvector-rabitq/BENCHMARK.md`
- Paper: arXiv:2405.12497 (Gao & Long, SIGMOD 2024) — the symmetric path here is a Charikar-based predecessor; the asymmetric path targets the 2024 estimator with full query precision.