A native CLI + shared library that answers two questions fast: "how many tokens is this file?" and "where do I cut it at ≤N tokens without slicing a function in half?"
- single-purpose token-aware chunker for LLM agent tools
- drop-in replacement for calling tiktoken in a subprocess loop
- emits byte offsets [start, end), token counts, and boundary kind
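The per-chunk output can be pictured as a small record; a hypothetical sketch (field and type names here are illustrative, not the actual ABI):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk record: byte offsets are half-open [start, end). */
typedef enum { X8R_CUT_LINE, X8R_CUT_BLANK_LINE, X8R_CUT_FUNC_END } x8r_cut_kind;

typedef struct {
    size_t start;          /* byte offset of first byte in the chunk */
    size_t end;            /* one past the last byte (half-open) */
    uint32_t token_count;  /* BPE tokens covered by [start, end) */
    x8r_cut_kind boundary; /* what kind of boundary the cut landed on */
} x8r_chunk;

/* Half-open intervals tile a buffer with no gaps or overlaps:
 * each chunk's end is the next chunk's start. */
static size_t chunk_len(const x8r_chunk *c) { return c->end - c->start; }
```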
- LLM agents call tokenizers on the hot path (every Read, every context-fit check)
- Python's tiktoken is fine for one-shot use but becomes a bottleneck when scanning a monorepo; nothing else combines BPE tokenization with boundary-aware cuts
- C11, -O3 -march=x86-64-v3, Linux x86_64 only; AVX2 pretokenizer with scalar fallback
- mmap'd input and vocab (cross-process page sharing)
- public C ABI, thin CLI on top (build/x8r, libx8r.so, libx8r.a)
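Mapping input read-only is the standard mmap pattern; a minimal sketch (not the project's actual mmap_io.c interface) of how a file can be mapped so the page cache shares it across processes:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure; *len_out gets the size.
 * Read-only pages of the same file are backed by the shared page cache, so a
 * vocab blob mapped by many processes is resident in memory only once. */
static const void *map_file_ro(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps its own reference to the file */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```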
- cl100k (GPT-4/3.5)
- o200k (GPT-4o, Claude-era stack)
- runtime dispatch by vocab ID, single binary
- pretokenize: regex-equivalent byte/codepoint classifier (letter/digit/space/punct/newline plus upper/lower/mark for o200k CamelCase splits)
- BPE merge: FNV-1a 32-bit, open-addressing hash table, load factor ~0.38
- boundary pick: walks cumulative token array backward through candidate cuts (function end > blank line > line)
- chunk: binary search for largest ≤budget prefix, pick best boundary in tolerance window
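The classifier step above can be sketched as a tiny byte classifier; this is an ASCII-only simplification with illustrative class names, while the real version is table-driven over full Unicode:

```c
#include <stdint.h>

typedef enum { BC_LETTER, BC_DIGIT, BC_NEWLINE, BC_SPACE, BC_PUNCT } byte_class;

/* Classify one ASCII byte into the coarse classes the pretokenizer
 * uses to decide where one pretoken ends and the next begins. */
static byte_class classify_ascii(uint8_t b) {
    if (b == '\n' || b == '\r') return BC_NEWLINE;
    if (b == ' ' || b == '\t') return BC_SPACE;
    if (b >= '0' && b <= '9') return BC_DIGIT;
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')) return BC_LETTER;
    return BC_PUNCT;
}
```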
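The merge loop's vocab lookup hashes byte sequences with FNV-1a 32-bit; the hash itself is standard:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 32-bit: xor each byte into the state, then multiply by the prime. */
static uint32_t fnv1a32(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}
```

With a power-of-two table, the slot is h & (size - 1), and probing at a ~0.38 load factor keeps open-addressing chains short.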
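The last two steps can be sketched against a cumulative token-count array: binary-search the largest byte prefix within budget, then scan backward through a tolerance window preferring stronger boundary kinds. A sketch under assumed types; the real API differs:

```c
#include <stddef.h>

/* cum[i] = token count of byte prefix [0, i); cum has n+1 entries for n bytes.
 * Returns the largest offset end in [0, n] such that cum[end] <= budget. */
static size_t budget_prefix(const size_t *cum, size_t n, size_t budget) {
    size_t lo = 0, hi = n; /* invariant: cum[lo] <= budget (cum[0] == 0) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        if (cum[mid] <= budget) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Boundary strength at each offset: higher wins. */
typedef enum { B_NONE = 0, B_LINE = 1, B_BLANK = 2, B_FUNC_END = 3 } bkind;

/* Walk backward from `limit` at most `window` bytes and return the offset of
 * the strongest boundary (ties go to the cut closest to the limit); fall back
 * to a hard cut at `limit` if the window has no boundary at all. */
static size_t pick_cut(const unsigned char *kinds, size_t limit, size_t window) {
    size_t best = limit;
    int best_kind = B_NONE;
    size_t stop = limit > window ? limit - window : 0;
    for (size_t i = limit; i > stop; i--) {
        if (kinds[i] > best_kind) { best_kind = kinds[i]; best = i; }
    }
    return best;
}
```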
- src/pretok_{scalar,avx2,o200k}.c: pretokenizers
- src/bpe.c: merge loop + vocab lookup
- src/chunk.c: public API, budget search
- src/vocab.c, src/mmap_io.c: io
- vocab/{cl100k,o200k}.bin: flat hashed vocab blobs
- scripts/gen_unicode_tables.py: two-stage Unicode class table generator
- scripts/fuzz_vs_tiktoken.py: differential fuzzer with shrinking
- bench/run.py: cold/warm benchmark harness vs tiktoken
- ~7-16x faster than tiktoken.encode_ordinary on mid-size corpora
- JSON-heavy large file: 13.86ms vs 220.84ms (15.94x)
- UTF-8 prose large: 15.64ms vs 111.97ms (7.16x)
- bit-exact token counts vs tiktoken on all bench corpora, both vocabs
- differential fuzzer: 5 flavors (ascii, utf8_mixed, realcode, edge, bytes) with bisect-shrinker
- currently tracking a Unicode-database edge case: fancy-regex (Rust, used by tiktoken) classifies a handful of unassigned-but-block-allocated codepoints differently from Python's regex module, producing 2-4 disagreements per 10k random inputs; pure ASCII and realistic code paths are clean
- v0.2: cl100k fully working, SIMD pretok, benches green
- v0.3 in progress: o200k added, fuzzing passes on realistic inputs, chasing one Unicode classification corner case
- deferred: threading, tree-sitter strict boundaries, Python/Node bindings