@ruvnet
Last active April 24, 2026 17:30

ruvector-rabitq — Rotation-Based 1-Bit Vector Quantization in Rust

Two 1-bit ANN distance estimators (symmetric Charikar-LSH + asymmetric RaBitQ-2024), 32× per-vector code compression, 21× full-index memory compression, 3.13× end-to-end speedup at 100% recall@10 on n = 100,000 clustered D=128 vectors — pure Rust, no C/C++ deps, no SIMD intrinsics yet.

Part of the ruvector vector search library. Branch: research/nightly/2026-04-23-rabitq (head 6c6e04554) · ADR-154 · PR #370


What's in this crate

crates/ruvector-rabitq/ implements two 1-bit distance estimators behind a single AnnIndex trait — swap backends by changing a constructor line:

| Variant | Storage | Estimator | Rerank | Headline (n=100k, post-optimization) |
|---|---|---|---|---|
| FlatF32Index | f32 originals | exact L2 | N/A | 100% r@10 · 306 QPS · 50.4 MB |
| RabitqIndex | rotation + SoA codes | symmetric (Charikar) | no | 8.1% r@10 · 3,639 QPS · 2.4 MB |
| RabitqPlusIndex | rotation + SoA codes + originals | sym + exact rerank | yes | 100% r@10 · 957 QPS · 53.5 MB (rerank×20) |
| RabitqAsymIndex | rotation + SoA codes [+ originals] | asymmetric (RaBitQ-2024) | optional | 95.6% r@10 · 26 QPS (scalar, needs SIMD) |

Every row is from one cargo run on the same clustered-Gaussian dataset with the same queries and ground truth.


The v2 optimization — what changed and what it bought

The v1 numbers (commit 8dbc560d0) used per-entry Vec<(usize, BinaryCode)> storage with each BinaryCode heap-allocating its own Vec<u64>. The v2 kernel (commit 6c6e04554) collapses this into a flat struct-of-arrays and replaces the .cos() call with a lookup table.

| n=100k, D=128 | v1 QPS | v2 QPS | Δ | memory |
|---|---|---|---|---|
| Flat | 309 | 306 | 1.0× | 50.4 MB |
| RaBitQ sym, no rerank | 1,176 | 3,639 | 3.09× | 5.8 → 2.4 MB |
| RaBitQ+ sym, rerank×5 | 811 | 2,058 | 2.54× | |
| RaBitQ+ sym, rerank×20 | 544 | 957 | 1.76× | |
  • Struct-of-arrays storage. Three contiguous vecs — ids: Vec<u32>, norms: Vec<f32>, packed: Vec<u64> (n × n_words flat slab). Kills the pointer chase per candidate and halves per-entry memory (8 B vs 40 B).
  • cos lookup table. Agreement ∈ [0, D] has at most D+1 distinct values, so cos(π · (1 − B/D)) is precomputed once at build. One L1-resident load per candidate instead of a ~30 ns cos call. At D=128 the LUT is 516 B.
  • Aligned-D fast path. When D % 64 == 0 the last-word mask is !0u64 and the AND on the tail word gets skipped — at D=128 the inner loop is 2 XORs + 2 POPCNTs + 1 LUT load + 1 FMA per candidate.
  • Raw-pointer SoA walk avoids bounds checks on the packed slab.

Net effect: 3.13× over exact flat at 100% recall@10 (up from 1.76×), and 21× full-index memory compression (up from 8.7×).

Asymmetric path unchanged — its O(D) scalar signed-dot-product dominates; std::arch SIMD gather is the named next lever.


Two estimators — what each one is, honestly

Symmetric (Charikar-style) — both query and database are 1-bit.

```
E[agreement/D] = 1 − θ/π        (random-hyperplane collision probability)
est cos(θ)     = cos_lut[agreement]
est ⟨q, x⟩     = ‖q‖ · ‖x‖ · est cos(θ)
est ‖q − x‖²   = ‖q‖² + ‖x‖² − 2 · est ⟨q, x⟩
```

This is hyperplane-LSH on a rotated basis, not the tighter estimator from the SIGMOD 2024 RaBitQ paper. The rotation step (Haar-uniform via Gram–Schmidt on a Gaussian matrix) decorrelates dimensions before binarization — naive sign(x_i) without rotation gives ~15–20% recall on structured data; this gets 36–88% pre-rerank and 100% after rerank×20.

Asymmetric (RaBitQ-2024-style) — query stays f32, database stays 1-bit.

```
b_x ∈ {−1/√D, +1/√D}^D                       (stored 1-bit code)
⟨q̂_rot, u_x⟩ ≈ (1/√D) · Σᵢ sign(x_rot,i) · q_rot,i
est ⟨q, x⟩ ≈ ‖q‖ · ‖x‖ · ⟨q̂_rot, u_x⟩
```

Reconstructs the inner product by walking the stored signs against the full-precision rotated query — unbiased on Haar-uniform rotations, tighter variance than the symmetric path. O(D) per candidate vs O(D/64) popcount — 140× slower than symmetric in the scalar baseline, wants SIMD gather to be practical at scale.

Recall@10 at n=100k: 95.6% with rerank×5 vs 87.9% for symmetric rerank×5 — the extra accuracy is real, the QPS gap is engineering.


Honest benchmarks (single-run, reproducible)

Release build, Ryzen-class laptop, single thread, no SIMD intrinsics, D=128, 100 Gaussian clusters (σ=0.6), nq=200. Deterministic seeds; reruns are bit-identical.

| n | variant | r@1 | r@10 | r@100 | QPS | mem/MB | vs flat |
|---|---|---|---|---|---|---|---|
| 100 k | FlatF32 (exact) | 100.0% | 100.0% | 100.0% | 306 | 50.4 | |
| 100 k | RaBitQ sym, no rerank | 2.0% | 8.1% | 27.1% | 3,639 | 2.4 | 11.9× |
| 100 k | RaBitQ+ sym, rerank×20 | 100.0% | 100.0% | 100.0% | 957 | 53.5 | 3.13× |
| 100 k | RaBitQ+ sym, rerank×5 | 92.0% | 87.9% | 78.1% | 2,058 | 53.5 | 6.73× |
| 100 k | RaBitQ-Asym, rerank×5 | 99.0% | 95.6% | 87.0% | 26 | 53.5 | 0.08× |
| 50 k | RaBitQ+ sym, rerank×5 | 100.0% | 99.9% | 99.8% | 3,238 | 26.8 | 5.35× |
| 50 k | RaBitQ+ sym, rerank×20 | 100.0% | 100.0% | 100.0% | 1,468 | 26.8 | 2.43× |
| 5 k | RaBitQ+ sym, rerank×5 | 100.0% | 100.0% | 85.1% | 9,374 | 2.7 | 1.70× |
| 1 k | FlatF32 | 100.0% | 100.0% | 100.0% | 20,711 | 0.5 | |

What the numbers actually say:

  • Production config = symmetric + rerank×20 at n ≥ 50k. At n=50k it's 2.43× over flat at 100% recall@10; at n=100k it's 3.13×. Speedup grows with n because flat scan is O(n·D) while the 1-bit scan is O(n·D/64).
  • rerank×5 is the sweet-spot below n=50k. At n=50k rerank×5 wins 5.35× at 99.9% recall@10; the scaling regression (100% at n=5k → 87.9% at n=100k) forces the bump to ×20 at large n. Documented in BENCHMARK.md.
  • Memory: 32× on codes alone, 21× on the full index at n=100k. RabitqIndex at n=100k is 2.4 MB vs Flat's 50.4 MB. The rotation matrix is a one-time 64 KB overhead that amortises to nothing as n grows.
  • Asymmetric is slow in the scalar baseline — 26 QPS at n=100k. Needs SIMD gather. Follow-up.

Changelog — what v2 fixed (since f2dbb6efb)

Seven integrity issues from the deep review of v1 were fixed, then a perf pass delivered the 2.5–3.1× kernel speedup.

  1. Padding-bug at D % 64 != 0 — XNOR-popcount was counting the zero padding of the last u64 as "matches". Fixed via masked_xnor_popcount with a regression test at D=100 (raw returns 28 matches for opposite vectors; masked returns 0).
  2. Memory accounting was misleading: RabitqIndex stored full f32 originals but omitted them from memory_bytes(). Fixed: dropped unused originals; all variants now report honest totals. Test memory_accounting_is_honest enforces RabitqIndex < Flat < RabitqPlusIndex.
  3. partial_cmp().unwrap() could panic on NaN. Replaced with f32::total_cmp.
  4. rabitq_recall_at_10_above_70pct asserted > 0.20 — renamed to match reality.
  5. "RaBitQ estimator" was Charikar hyperplane-LSH. Correctly labelled now; tighter RaBitQ-2024 estimator shipped as RabitqAsymIndex.
  6. "Pure Rust, zero dependencies beyond std" was false. Correctly stated — deps are rand, rand_distr, serde, thiserror. No C/C++ deps, no unsafe (beyond the SoA raw-pointer walk in the scan hot loop).
  7. Bench vs recall datasets were incompatible. Replaced with a unified harness (src/main.rs) whose every row comes from the same run.

Perf pass:

  1. SoA storage — flat Vec<u32> ids, Vec<f32> norms, Vec<u64> packed (row-major) replace Vec<(usize, BinaryCode)>. No pointer chase per candidate; per-entry memory halved.
  2. cos lookup table — kills the .cos() call. 516 B at D=128, L1-resident.
  3. Aligned-D fast path: D % 64 == 0 skips the tail-word mask AND; unaligned D stays correct via the regression-tested masked path.

Test count: 10 → 20 passing. Non-aligned D (100), NaN safety, full-pairs orthogonality at D ∈ {64, 128, 256}, asymmetric-vs-symmetric ordering, honest memory accounting, heap top-k ordering.


How it works

  1. Rotate once. Build a D×D Haar-ish orthogonal matrix via Gram–Schmidt on a Gaussian matrix (seeded). 64 KB at D=128, ~3.6 M ops to build. The rotation decorrelates dimensions so 1-bit quantization error spreads evenly.
  2. Encode database. For each x: normalize to unit sphere (store ‖x‖), rotate x' = P·x̂, binarize bit_i = 1 if x'_i ≥ 0. Packed into the shared packed: Vec<u64> slab — no per-vector allocation.
  3. Query:
    • Symmetric: same encoder, then raw-pointer walk through packed, 2 XORs + 2 POPCNTs + cos_lut[agreement] + FMA per candidate at D=128.
    • Asymmetric: rotate the normalised query to f32 q_rot, walk each candidate's 1-bit code and do a signed dot product against q_rot scaled by 1/√D.
  4. Rerank (optional). Take top rerank_factor · k candidates from the 1-bit scan, compute exact f32 distance only for those, pick the final top-k. This is how you go from 8% recall (bit-scan only) to 100% (rerank×20).

Reproduce

```
git clone -b research/nightly/2026-04-23-rabitq https://github.com/ruvnet/ruvector.git
cd ruvector
cargo test -p ruvector-rabitq --release                         # 20 tests pass
cargo run --release -p ruvector-rabitq --bin rabitq-demo        # full table (~20 s)
cargo run --release -p ruvector-rabitq --bin rabitq-demo -- --fast  # smoke run (~5 s)
cargo bench -p ruvector-rabitq --bench rabitq_bench              # Criterion kernel + search benches
```

Open items (named, deferred)

  • SIFT1M / GIST1M / DEEP10M — the standard ANN benchmarks from the SIGMOD paper. Current numbers are on clustered Gaussian. Follow-up.
  • HNSW integration — ruvector ships HNSW. RaBitQ in production is a distance-kernel plug-in for a graph index; the current flat-scan path is a correctness + throughput harness, not a production shape.
  • std::arch SIMD gather for the asymmetric path. Scalar baseline is 140× slower than symmetric; SIMD would close most of the gap.
  • AVX2/NEON popcount batching for the symmetric path. Scalar POPCNT is already good (3.6k QPS at n=100k); batched SIMD adds another ~2–3×.
  • Parallel search via rayon — already feature-gated (parallel), off by default.

Links

  • Repo: github.com/ruvnet/ruvector
  • Branch: research/nightly/2026-04-23-rabitq (head: 6c6e04554)
  • Draft PR: #370
  • ADR-154: docs/adr/ADR-154-rabitq-rotation-binary-quantization.md
  • Benchmark reproducer: crates/ruvector-rabitq/BENCHMARK.md
  • Paper: arXiv:2405.12497 (Gao & Long, SIGMOD 2024) — the symmetric path here is a Charikar-based predecessor; the asymmetric path targets the 2024 estimator with full query precision.