A native CLI + shared library that answers two questions fast: "how many tokens is this file?" and "where do I cut it at ≤N tokens without slicing a function in half?"
- single-purpose token-aware chunker for LLM agent tools
- drop-in replacement for calling tiktoken in a subprocess loop
- emits byte offsets [start, end), token counts, and boundary kind
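The per-chunk output can be pictured as a small record; a hypothetical sketch (field and type names here are illustrative, not the actual ABI):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk record: byte offsets are half-open [start, end). */
typedef enum { X8R_CUT_LINE, X8R_CUT_BLANK_LINE, X8R_CUT_FUNC_END } x8r_cut_kind;

typedef struct {
    size_t start;          /* byte offset of first byte in the chunk */
    size_t end;            /* one past the last byte (half-open) */
    uint32_t token_count;  /* BPE tokens covered by [start, end) */
    x8r_cut_kind boundary; /* what kind of boundary the cut landed on */
} x8r_chunk;

/* Half-open intervals tile a buffer with no gaps or overlaps:
 * each chunk's end is the next chunk's start. */
static size_t chunk_len(const x8r_chunk *c) { return c->end - c->start; }
```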
- LLM agents call tokenizers on the hot path (every Read, every context-fit check)
- Python's tiktoken is fine for one-shot use but becomes a bottleneck when scanning a monorepo; nothing else combines BPE tokenization with boundary-aware cuts
- C11, -O3 -march=x86-64-v3, Linux x86_64 only; AVX2 pretokenizer with scalar fallback
- mmap'd input and vocab (cross-process page sharing)
- public C ABI, thin CLI on top (build/x8r, libx8r.so, libx8r.a)
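Mapping input read-only is the standard mmap pattern; a minimal sketch (not the project's actual mmap_io.c interface) of how a file can be mapped so the page cache shares it across processes:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure; *len_out gets the size.
 * Read-only pages of the same file are backed by the shared page cache, so a
 * vocab blob mapped by many processes is resident in memory only once. */
static const void *map_file_ro(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps its own reference to the file */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```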
- cl100k (GPT-4/3.5)
- o200k (GPT-4o, Claude-era stack)
- runtime dispatch by vocab ID, single binary
- pretokenize: regex-equivalent byte/codepoint classifier (letter/digit/space/punct/newline plus upper/lower/mark for o200k CamelCase splits)
- BPE merge: FNV-1a 32-bit, open-addressing hash table, load factor ~0.38
- boundary pick: walks cumulative token array backward through candidate cuts (function end > blank line > line)
- chunk: binary search for largest ≤budget prefix, pick best boundary in tolerance window
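The classifier step above can be sketched as a tiny byte classifier; this is an ASCII-only simplification with illustrative class names, while the real version is table-driven over full Unicode:

```c
#include <stdint.h>

typedef enum { BC_LETTER, BC_DIGIT, BC_NEWLINE, BC_SPACE, BC_PUNCT } byte_class;

/* Classify one ASCII byte into the coarse classes the pretokenizer
 * uses to decide where one pretoken ends and the next begins. */
static byte_class classify_ascii(uint8_t b) {
    if (b == '\n' || b == '\r') return BC_NEWLINE;
    if (b == ' ' || b == '\t') return BC_SPACE;
    if (b >= '0' && b <= '9') return BC_DIGIT;
    if ((b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')) return BC_LETTER;
    return BC_PUNCT;
}
```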
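The merge loop's vocab lookup hashes byte sequences with FNV-1a 32-bit; the hash itself is standard:

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 32-bit: xor each byte into the state, then multiply by the prime. */
static uint32_t fnv1a32(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;            /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}
```

With a power-of-two table, the slot is h & (size - 1), and probing at a ~0.38 load factor keeps open-addressing chains short.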
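The last two steps can be sketched against a cumulative token-count array: binary-search the largest byte prefix within budget, then scan backward through a tolerance window preferring stronger boundary kinds. A sketch under assumed types; the real API differs:

```c
#include <stddef.h>

/* cum[i] = token count of byte prefix [0, i); cum has n+1 entries for n bytes.
 * Returns the largest offset end in [0, n] such that cum[end] <= budget. */
static size_t budget_prefix(const size_t *cum, size_t n, size_t budget) {
    size_t lo = 0, hi = n; /* invariant: cum[lo] <= budget (cum[0] == 0) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        if (cum[mid] <= budget) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Boundary strength at each offset: higher wins. */
typedef enum { B_NONE = 0, B_LINE = 1, B_BLANK = 2, B_FUNC_END = 3 } bkind;

/* Walk backward from `limit` at most `window` bytes and return the offset of
 * the strongest boundary (ties go to the cut closest to the limit); fall back
 * to a hard cut at `limit` if the window has no boundary at all. */
static size_t pick_cut(const unsigned char *kinds, size_t limit, size_t window) {
    size_t best = limit;
    int best_kind = B_NONE;
    size_t stop = limit > window ? limit - window : 0;
    for (size_t i = limit; i > stop; i--) {
        if (kinds[i] > best_kind) { best_kind = kinds[i]; best = i; }
    }
    return best;
}
```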
- src/pretok_{scalar,avx2,o200k}.c: pretokenizers
- src/bpe.c: merge loop + vocab lookup
- src/chunk.c: public API, budget search
- src/vocab.c, src/mmap_io.c: io
- vocab/{cl100k,o200k}.bin: flat hashed vocab blobs
- scripts/gen_unicode_tables.py: two-stage Unicode class table generator
- scripts/fuzz_vs_tiktoken.py: differential fuzzer with shrinking
- bench/run.py: cold/warm benchmark harness vs tiktoken
- ~7-16x faster than tiktoken.encode_ordinary on mid-size corpora
- JSON-heavy large file: 13.86ms vs 220.84ms (15.94x)
- UTF-8 prose large: 15.64ms vs 111.97ms (7.16x)
- bit-exact token counts vs tiktoken on all bench corpora, both vocabs
- differential fuzzer: 5 flavors (ascii, utf8_mixed, realcode, edge, bytes) with bisect-shrinker
- currently tracking a Unicode-database edge case: fancy-regex (Rust, used by tiktoken) classifies a handful of unassigned-but-block-allocated codepoints differently from Python's regex module, producing 2-4 disagreements per 10k random inputs; pure ASCII and realistic code paths are clean
- v0.2: cl100k fully working, SIMD pretok, benches green
- v0.3 in progress: o200k added, fuzzing passes on realistic inputs, chasing one Unicode classification corner case
- deferred: threading, tree-sitter strict boundaries, Python/Node bindings