Two 1-bit ANN distance estimators (symmetric Charikar-LSH + asymmetric RaBitQ-2024), 32× per-vector code compression, 21× full-index memory compression, 3.13× end-to-end speedup at 100% recall@10 on n = 100,000 clustered D=128 vectors — pure Rust, no C/C++ deps, no SIMD intrinsics yet.
Part of the ruvector vector search library. Branch: `research/nightly/2026-04-23-rabitq` (head: `6c6e04554`) · ADR-154 · PR #370
`crates/ruvector-rabitq/` implements two 1-bit distance estimators behind a single `AnnIndex` trait — swap backends by changing a constructor line:
| Variant | Storage | Estimator | Rerank | Headline (n=100k, post-optimization) |
|---|---|---|---|---|
| `FlatF32Index` | f32 originals | exact L2 | N/A | 100% r@10 · 306 QPS · 50.4 MB |
| `RabitqIndex` | rotation + SoA codes | symmetric (Charikar) | no | 8.1% r@10 · 3,639 QPS · 2.4 MB |
| `RabitqPlusIndex` | rotation + SoA codes + originals | sym + exact rerank | yes | 100% r@10 · 957 QPS · 53.5 MB (rerank×20) |
| `RabitqAsymIndex` | rotation + SoA codes [+ originals] | asymmetric (RaBitQ-2024) | optional | 95.6% r@10 · 26 QPS (scalar, needs SIMD) |
Every row is from one cargo run on the same clustered-Gaussian dataset with the same queries and ground truth.
The v1 numbers (commit `8dbc560d0`) used per-entry `Vec<(usize, BinaryCode)>` storage with each `BinaryCode` heap-allocating its own `Vec<u64>`. The v2 kernel (commit `6c6e04554`) collapses this into a flat struct-of-arrays and replaces the `.cos()` call with a lookup table.
| n=100k, D=128 | v1 QPS | v2 QPS | Δ | memory |
|---|---|---|---|---|
| Flat | 309 | 306 | 1.0× | 50.4 MB |
| RaBitQ sym no rerank | 1,176 | 3,639 | 3.09× | 5.8 → 2.4 MB |
| RaBitQ+ sym rerank×5 | 811 | 2,058 | 2.54× | — |
| RaBitQ+ sym rerank×20 | 544 | 957 | 1.76× | — |
- Struct-of-arrays storage. Three contiguous vecs — `ids: Vec<u32>`, `norms: Vec<f32>`, `packed: Vec<u64>` (n × n_words flat slab). Kills the pointer chase per candidate and halves per-entry memory (8 B vs 40 B).
- cos lookup table. Agreement ∈ [0, D] has at most D+1 distinct values, so `cos(π · (1 − B/D))` is precomputed once at build. One L1-resident load per candidate instead of a ~30 ns `cos` call. At D=128 the LUT is 516 B.
- Aligned-D fast path. When `D % 64 == 0` the last-word mask is `!0u64` and the AND on the tail word gets skipped — at D=128 the inner loop is 2 XORs + 2 POPCNTs + 1 LUT load + 1 FMA per candidate.
- Raw-pointer SoA walk avoids bounds checks on the `packed` slab.
Net effect: 3.13× over exact flat at 100% recall@10 (up from 1.76×), and 21× full-index memory compression (up from 8.7×).
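The optimizations above combine into an inner loop along these lines — a minimal safe-Rust sketch for the aligned D=128 case, with hypothetical names (the crate's real kernel walks the slab via raw pointers to skip bounds checks):

```rust
// Sketch of the symmetric scan kernel (hypothetical names; D = 128, so each
// code is exactly two u64 words and no tail mask is needed).
fn symmetric_scan_d128(
    query_words: [u64; 2],
    query_norm: f32,
    packed: &[u64],       // flat SoA slab: n × 2 words
    norms: &[f32],        // stored ‖x‖ per vector
    cos_lut: &[f32; 129], // cos(π · (1 − b/128)) for b in 0..=128
    out: &mut Vec<f32>,
) {
    for (i, code) in packed.chunks_exact(2).enumerate() {
        // agreement b = D − popcount(q XOR x): 2 XORs + 2 POPCNTs
        let diff = (code[0] ^ query_words[0]).count_ones()
            + (code[1] ^ query_words[1]).count_ones();
        let est_cos = cos_lut[128 - diff as usize]; // 1 LUT load
        let (qn, xn) = (query_norm, norms[i]);
        // est ‖q − x‖² = ‖q‖² + ‖x‖² − 2·‖q‖·‖x‖·est cos(θ)
        out.push(qn * qn + xn * xn - 2.0 * qn * xn * est_cos);
    }
}
```

The LUT index works because identical codes give agreement 128 (est cos = +1) and opposite codes give agreement 0 (est cos = −1), exactly the endpoints of the hyperplane-LSH collision formula.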
Asymmetric path unchanged — its O(D) scalar signed-dot-product dominates; `std::arch` SIMD gather is the named next lever.
Symmetric (Charikar-style) — both query and database are 1-bit.
E[agreement/D] = 1 − θ/π (random-hyperplane collision probability)
est cos(θ) = cos_lut[agreement]
est ⟨q, x⟩ = ‖q‖ · ‖x‖ · est cos(θ)
est ‖q − x‖² = ‖q‖² + ‖x‖² − 2 · est ⟨q, x⟩
This is hyperplane-LSH on a rotated basis, not the tighter estimator from the SIGMOD 2024 RaBitQ paper. The rotation step (Haar-uniform via Gram–Schmidt on a Gaussian matrix) decorrelates dimensions before binarization — naive sign(x_i) without rotation gives ~15–20% recall on structured data; this gets 36–88% pre-rerank and 100% after rerank×20.
Asymmetric (RaBitQ-2024-style) — query stays f32, database stays 1-bit.
b_x ∈ {−1/√D, +1/√D}^D (stored 1-bit code)
⟨q̂_rot, u_x⟩ ≈ (1/√D) · Σᵢ sign(x_rot,i) · q_rot,i
est ⟨q, x⟩ ≈ ‖q‖ · ‖x‖ · ⟨q̂_rot, u_x⟩
Reconstructs the inner product by walking the stored signs against the full-precision rotated query — unbiased on Haar-uniform rotations, tighter variance than the symmetric path. O(D) per candidate vs O(D/64) popcount — 140× slower than symmetric in the scalar baseline, wants SIMD gather to be practical at scale.
Recall@10 at n=100k: 95.6% with rerank×5 vs 87.9% for symmetric rerank×5 — the extra accuracy is real, the QPS gap is engineering.
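A scalar sketch of this estimator, under the formulas above (hypothetical function name and signature; the crate's real code may differ). This is the O(D) per-candidate loop the text says needs SIMD gather:

```rust
// Asymmetric (RaBitQ-2024-style) inner-product estimate: the query stays
// f32, the database vector is only its stored sign bits.
fn asym_inner_product_est(
    q_rot: &[f32],      // full-precision rotated query
    code_words: &[u64], // candidate's packed 1-bit code
    q_norm: f32,
    x_norm: f32,
) -> f32 {
    let d = q_rot.len();
    let inv_sqrt_d = 1.0 / (d as f32).sqrt();
    let mut dot = 0.0f32;
    for i in 0..d {
        // sign(x_rot,i) recovered from the stored bit
        let sign = if (code_words[i / 64] >> (i % 64)) & 1 == 1 { 1.0 } else { -1.0 };
        dot += sign * q_rot[i]; // Σᵢ sign(x_rot,i) · q_rot,i
    }
    // est ⟨q, x⟩ ≈ ‖q‖ · ‖x‖ · (1/√D) · Σᵢ sign(x_rot,i) · q_rot,i
    q_norm * x_norm * inv_sqrt_d * dot
}
```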
Release build, Ryzen-class laptop, single thread, no SIMD intrinsics, D=128, 100 Gaussian clusters (σ=0.6), nq=200. Deterministic seeds; reruns are bit-identical.
| n | variant | r@1 | r@10 | r@100 | QPS | mem/MB | vs flat |
|---|---|---|---|---|---|---|---|
| 100 k | FlatF32 (exact) | 100.0% | 100.0% | 100.0% | 306 | 50.4 | — |
| | RaBitQ sym no rerank | 2.0% | 8.1% | 27.1% | 3,639 | 2.4 | 11.9× |
| | RaBitQ+ sym rerank×20 | 100.0% | 100.0% | 100.0% | 957 | 53.5 | 3.13× |
| | RaBitQ+ sym rerank×5 | 92.0% | 87.9% | 78.1% | 2,058 | 53.5 | 6.73× |
| | RaBitQ-Asym rerank×5 | 99.0% | 95.6% | 87.0% | 26 | 53.5 | 0.08× |
| 50 k | RaBitQ+ sym rerank×5 | 100.0% | 99.9% | 99.8% | 3,238 | 26.8 | 5.35× |
| | RaBitQ+ sym rerank×20 | 100.0% | 100.0% | 100.0% | 1,468 | 26.8 | 2.43× |
| 5 k | RaBitQ+ sym rerank×5 | 100.0% | 100.0% | 85.1% | 9,374 | 2.7 | 1.70× |
| 1 k | FlatF32 | 100.0% | 100.0% | 100.0% | 20,711 | 0.5 | — |
What the numbers actually say:
- Production config = symmetric + rerank×20 at n ≥ 50k. At n=50k it's 2.43× over flat at 100% recall@10; at n=100k it's 3.13×. Speedup grows with n because flat scan is O(n·D) while the 1-bit scan is O(n·D/64).
- rerank×5 is the sweet spot below n=50k. At n=50k rerank×5 wins 5.35× at 99.9% recall@10; the scaling regression (100% at n=5k → 87.9% at n=100k) forces the bump to ×20 at large n. Documented in `BENCHMARK.md`.
- Memory: 32× on codes alone, 21× on the full index at n=100k. `RabitqIndex` at n=100k is 2.4 MB vs Flat's 50.4 MB. The rotation matrix is a one-time 64 KB overhead that amortises to nothing as n grows.
- Asymmetric is slow in the scalar baseline — 26 QPS at n=100k. Needs SIMD gather. Follow-up.
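Back-of-envelope for those compression ratios, assuming the SoA layout described in the perf section (packed bits + f32 norm + u32 id per entry, plus the one-time rotation matrix); the crate's real accounting lives in `memory_bytes()`:

```rust
// Estimated compression vs storing f32 originals, per the SoA layout.
fn compression_ratios(n: u64, d: u64) -> (f64, f64) {
    let flat = (n * d * 4) as f64;          // f32 originals
    let codes = (n * d / 8) as f64;         // 1 bit per dimension
    let soa = (n * (d / 8 + 4 + 4)) as f64; // packed + f32 norm + u32 id
    let rotation = (d * d * 4) as f64;      // 64 KB at D=128, amortised
    (flat / codes, flat / (soa + rotation))
}
// At n = 100,000, D = 128: 32× on codes alone, ≈21× on the full index —
// consistent with the 2.4 MB vs 50.4 MB row in the table above.
```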
Seven integrity issues from the deep review of v1 were fixed, then a perf pass delivered the 2.5–3.1× kernel speedup.
- Padding bug at `D % 64 != 0` — XNOR-popcount was counting the zero padding of the last u64 as "matches". Fixed via `masked_xnor_popcount` with a regression test at D=100 (raw returns 28 matches for opposite vectors; masked returns 0).
- Memory accounting was misleading — `RabitqIndex` stored full f32 originals but omitted them from `memory_bytes()`. Fixed: dropped unused originals; all variants now report honest totals. Test `memory_accounting_is_honest` enforces `RabitqIndex < Flat < RabitqPlusIndex`.
- `partial_cmp().unwrap()` could panic on NaN. Replaced with `f32::total_cmp`.
- `rabitq_recall_at_10_above_70pct` asserted `> 0.20` — renamed to match reality.
- "RaBitQ estimator" was Charikar hyperplane-LSH. Correctly labelled now; the tighter RaBitQ-2024 estimator shipped as `RabitqAsymIndex`.
- "Pure Rust, zero dependencies beyond std" was false. Correctly stated — deps are `rand`, `rand_distr`, `serde`, `thiserror`. No C/C++ deps, no unsafe (beyond the SoA raw-pointer walk in the scan hot loop).
- Bench vs recall datasets were incompatible. Replaced with a unified harness (`src/main.rs`) whose every row comes from the same run.
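The padding fix is worth spelling out. A sketch of the masked kernel (the real `masked_xnor_popcount` lives in the crate; this reconstruction assumes the same zero-padded last-word layout):

```rust
// Count agreeing sign bits across two packed codes of logical length d,
// masking off the zero padding in the final u64 so it can't count as matches.
fn masked_xnor_popcount(a: &[u64], b: &[u64], d: usize) -> u32 {
    let mut matches = 0;
    for (i, (wa, wb)) in a.iter().zip(b).enumerate() {
        let mut agree = !(wa ^ wb); // XNOR: 1 where sign bits agree
        let bits_here = (d - i * 64).min(64);
        if bits_here < 64 {
            agree &= (1u64 << bits_here) - 1; // zero out the padding bits
        }
        matches += agree.count_ones();
    }
    matches
}
```

At D=100 the last word carries 36 real bits and 28 padding zeros; for opposite vectors the unmasked kernel reports those 28 padding positions as spurious matches, while the masked version correctly returns 0.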
Perf pass:
- SoA storage — flat `Vec<u32>` ids, `Vec<f32>` norms, `Vec<u64>` packed (row-major) replace `Vec<(usize, BinaryCode)>`. No pointer chase per candidate; per-entry memory halved.
- cos lookup table — kills the `.cos()` call. 516 B at D=128, L1-resident.
- Aligned-D fast path — `D % 64 == 0` skips the tail-word mask AND; unaligned D still correct via the regression-tested masked path.
Test count: 10 → 20 passing. New coverage: non-aligned D (100), NaN safety, full-pairs orthogonality at D ∈ {64, 128, 256}, asymmetric-vs-symmetric ordering, honest memory accounting, heap top-k ordering.
- Rotate once. Build a D×D Haar-ish orthogonal matrix via Gram–Schmidt on a Gaussian matrix (seeded). 64 KB at D=128, ~3.6 M ops to build. The rotation decorrelates dimensions so 1-bit quantization error spreads evenly.
- Encode database. For each `x`: normalize to the unit sphere (store ‖x‖), rotate `x' = P·x̂`, binarize `bit_i = 1 if x'_i ≥ 0`. Packed into the shared `packed: Vec<u64>` slab — no per-vector allocation.
- Query:
  - Symmetric: same encoder, then a raw-pointer walk through `packed` — 2 XORs + 2 POPCNTs + `cos_lut[agreement]` + FMA per candidate at D=128.
  - Asymmetric: rotate the normalised query to f32 `q_rot`, then walk each candidate's 1-bit code and take a signed dot product against `q_rot` scaled by `1/√D`.
- Rerank (optional). Take the top `rerank_factor · k` candidates from the 1-bit scan, compute exact f32 distance only for those, pick the final top-k. This is how you go from 8% recall (bit-scan only) to 100% (rerank×20).
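The encode step above can be sketched as follows (hypothetical signature; `rotation` is the seeded D×D orthogonal matrix, row-major). Note that the sign of `P·x` is unchanged by the positive scale `1/‖x‖`, so the bits can be computed without an explicit divide:

```rust
// Encode one vector into the shared packed slab; returns the stored ‖x‖.
fn encode_into(x: &[f32], rotation: &[f32], packed: &mut Vec<u64>) -> f32 {
    let d = x.len();
    // 1. norm for the estimator (normalization is folded into the sign test)
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    // 2–3. rotate x' = P·x̂ and binarize bit_i = 1 iff x'_i ≥ 0,
    //      packing straight into the slab — no per-vector allocation
    let n_words = (d + 63) / 64;
    let base = packed.len();
    packed.resize(base + n_words, 0);
    for i in 0..d {
        let xi: f32 = (0..d).map(|j| rotation[i * d + j] * x[j]).sum();
        if xi >= 0.0 {
            packed[base + i / 64] |= 1u64 << (i % 64);
        }
    }
    norm
}
```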
```sh
git clone -b research/nightly/2026-04-23-rabitq https://github.com/ruvnet/ruvector.git
cd ruvector
cargo test -p ruvector-rabitq --release                            # 20 tests pass
cargo run --release -p ruvector-rabitq --bin rabitq-demo           # full table (~20 s)
cargo run --release -p ruvector-rabitq --bin rabitq-demo -- --fast # smoke run (~5 s)
cargo bench -p ruvector-rabitq --bench rabitq_bench                # Criterion kernel + search benches
```

- SIFT1M / GIST1M / DEEP10M — the standard ANN benchmarks from the SIGMOD paper. Current numbers are on clustered Gaussian. Follow-up.
- HNSW integration — ruvector ships HNSW. RaBitQ in production is a distance-kernel plug-in for a graph index; the current flat-scan path is a correctness + throughput harness, not a production shape.
- `std::arch` SIMD gather for the asymmetric path. The scalar baseline is 140× slower than symmetric; SIMD would close most of the gap.
- AVX2/NEON popcount batching for the symmetric path. Scalar POPCNT is already good (3.6k QPS at n=100k); batched SIMD adds another ~2–3×.
- Parallel search via `rayon` — already feature-gated (`parallel`), off by default.
- Repo: github.com/ruvnet/ruvector
- Branch: `research/nightly/2026-04-23-rabitq` (head: `6c6e04554`)
- Draft PR: #370
- ADR-154: `docs/adr/ADR-154-rabitq-rotation-binary-quantization.md`
- Benchmark reproducer: `crates/ruvector-rabitq/BENCHMARK.md`
- Paper: arXiv:2405.12497 (Gao & Long, SIGMOD 2024) — the symmetric path here is a Charikar-based predecessor; the asymmetric path targets the 2024 estimator with full query precision.