x8r overview (preview)

x8r

A native CLI + shared library that answers two questions fast: "how many tokens is this file?" and "where do I cut it at ≤N tokens without slicing a function in half?"

What it is

  • single-purpose token-aware chunker for LLM agent tools
  • drop-in replacement for calling tiktoken in a subprocess loop
  • emits byte offsets [start, end) + token counts + boundary kind
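
The exact record shape is defined by the library's own header, which isn't shown here; as an illustrative sketch of what each emitted chunk carries (type and field names are hypothetical, not the real x8r API):

```c
#include <stddef.h>

/* Illustrative sketch only -- names are hypothetical, not x8r's header. */
typedef enum {
    X8R_CUT_FUNCTION_END,  /* strongest: cut after a function body */
    X8R_CUT_BLANK_LINE,    /* cut at a blank line */
    X8R_CUT_LINE           /* weakest fallback: cut at a line break */
} x8r_boundary_kind;

typedef struct {
    size_t start;            /* byte offset, inclusive */
    size_t end;              /* byte offset, exclusive: [start, end) */
    size_t token_count;      /* BPE tokens inside the range */
    x8r_boundary_kind kind;  /* which boundary class the cut landed on */
} x8r_chunk;
```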

Why

  • LLM agents call tokenizers on the hot path (every Read, every context-fit check)
  • Python tiktoken is fine for one-shot encoding but becomes the bottleneck when scanning a monorepo
  • nothing else combines BPE tokenization with boundary-aware cuts

Stack

  • C11, -O3 -march=x86-64-v3, Linux x86_64 only
  • AVX2 pretokenizer with scalar fallback
  • mmap'd input and vocab (cross-process page sharing; sketch after this list)
  • public C ABI, thin CLI on top (build/x8r, libx8r.so, libx8r.a)
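
The cross-process sharing falls out of ordinary file-backed mappings: a read-only mmap of the vocab blob is served from the kernel page cache, so N concurrent x8r processes share one physical copy. A minimal sketch of that pattern (not the actual src/mmap_io.c):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only. Pages come from the page cache, so every
 * process mapping the same vocab file shares the same physical pages. */
static const void *map_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping keeps the pages alive */
    if (p == MAP_FAILED) return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```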

Vocab support

  • cl100k (GPT-4/3.5)
  • o200k (GPT-4o, Claude-era stack)
  • runtime dispatch by vocab ID, single binary
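
Runtime dispatch by vocab ID is most naturally a small per-vocab ops table; a sketch under that assumption (all identifiers here are hypothetical, not x8r's real symbols):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of single-binary vocab dispatch. */
typedef enum { VOCAB_CL100K = 0, VOCAB_O200K = 1, VOCAB_COUNT } vocab_id;

/* Per-vocab pretokenizer: writes piece boundaries into `cuts`,
 * returns how many it produced. */
typedef size_t (*pretok_fn)(const uint8_t *buf, size_t len,
                            size_t *cuts, size_t max_cuts);

typedef struct {
    const char *blob_path;  /* flat hashed vocab blob on disk */
    pretok_fn   pretok;     /* cl100k vs o200k split rules differ */
} vocab_ops;

/* Pretokenizer bodies live elsewhere; declared here for the table. */
size_t pretok_cl100k(const uint8_t *, size_t, size_t *, size_t);
size_t pretok_o200k(const uint8_t *, size_t, size_t *, size_t);

static const vocab_ops VOCABS[VOCAB_COUNT] = {
    [VOCAB_CL100K] = { "vocab/cl100k.bin", pretok_cl100k },
    [VOCAB_O200K]  = { "vocab/o200k.bin",  pretok_o200k  },
};
```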

Pipeline

  • pretokenize: regex-equivalent byte/codepoint classifier (letter/digit/space/punct/newline plus upper/lower/mark for o200k CamelCase splits)
  • BPE merge: FNV-1a 32-bit, open-addressing hash table, load factor ~0.38 (hash sketch after this list)
  • boundary pick: walks the cumulative token array backward through candidate cuts (function end > blank line > line)
  • chunk: binary search for the largest ≤budget prefix, then pick the best boundary in a tolerance window (both sketched after this list)
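
FNV-1a 32-bit and linearly probed open addressing are both standard; here is a minimal sketch of the vocab lookup under an assumed slot layout (x8r's actual blob format is not shown here):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a 32-bit: the standard offset basis and prime. */
static uint32_t fnv1a32(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical slot layout for an open-addressed vocab table.
 * At ~0.38 load factor, linear probe chains stay short on average. */
typedef struct {
    const uint8_t *key;  /* token bytes (NULL = empty slot) */
    uint32_t       len;
    uint32_t       id;   /* BPE token id */
} slot;

static int lookup(const slot *tab, uint32_t mask,  /* size = mask+1, power of two */
                  const uint8_t *key, size_t len, uint32_t *id_out)
{
    uint32_t i = fnv1a32(key, len) & mask;
    while (tab[i].key) {                     /* probe until an empty slot */
        if (tab[i].len == len && memcmp(tab[i].key, key, len) == 0) {
            *id_out = tab[i].id;
            return 1;
        }
        i = (i + 1) & mask;                  /* linear probe */
    }
    return 0;  /* not in vocab */
}
```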
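
The chunk step composes the last two bullets. A sketch, assuming `cum[i]` is the token count of the byte prefix ending at candidate boundary `i` (with `cum[0] == 0`) and `kind[i]` ranks that boundary's class; none of these names are x8r's real internals:

```c
#include <stddef.h>
#include <stdint.h>

/* Binary search: largest boundary index whose prefix fits the budget.
 * Invariant: cum[lo] <= budget, and cum[hi] would exceed it (cum[n] = +inf). */
static size_t max_prefix(const uint32_t *cum, size_t n, uint32_t budget)
{
    size_t lo = 0, hi = n;
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (cum[mid] <= budget) lo = mid; else hi = mid;
    }
    return lo;
}

/* Walk backward through a tolerance window below the fit point and keep
 * the strongest boundary class seen (function end > blank line > line). */
static size_t pick_boundary(const uint8_t *kind, size_t fit, size_t window)
{
    size_t best = fit;
    size_t stop = fit > window ? fit - window : 0;
    for (size_t i = fit; i > stop; i--)
        if (kind[i] > kind[best]) best = i;
    return best;
}
```

Trading a few tokens of budget for a cleaner cut is the whole point of the window: a slightly shorter chunk that ends at a function boundary beats a maximal one that ends mid-statement.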

Layout

  • src/pretok_{scalar,avx2,o200k}.c: pretokenizers
  • src/bpe.c: merge loop + vocab lookup
  • src/chunk.c: public API, budget search
  • src/vocab.c, src/mmap_io.c: I/O
  • vocab/{cl100k,o200k}.bin: flat hashed vocab blobs
  • scripts/gen_unicode_tables.py: two-stage Unicode class table generator (lookup sketch after this list)
  • scripts/fuzz_vs_tiktoken.py: differential fuzzer with shrinking
  • bench/run.py: cold/warm benchmark harness vs tiktoken
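
Two-stage tables are the classic way to make full-codepoint class lookup cache-friendly: stage one indexes 256-codepoint blocks, deduplicated, and stage two holds the per-codepoint classes. The lookup side would be a couple of loads; a sketch assuming that layout (the generated table names are illustrative):

```c
#include <stdint.h>

/* Hypothetical output of scripts/gen_unicode_tables.py: a two-stage
 * class table covering all 0x110000 codepoints.
 *   stage1[cp >> 8]             -> index of a deduplicated 256-entry block
 *   stage2[block * 256 + low]   -> character class (letter/digit/space/...)
 * Identical blocks (e.g. long runs of unassigned codepoints) collapse to
 * one entry, so both tables stay small enough to sit in cache. */
extern const uint16_t stage1[0x1100];  /* 0x110000 / 256 block indices */
extern const uint8_t  stage2[];        /* num_blocks * 256 classes     */

static inline uint8_t char_class(uint32_t cp)
{
    return stage2[(uint32_t)stage1[cp >> 8] * 256 + (cp & 0xFF)];
}
```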

Performance (o200k, warm, from last bench)

  • ~7-16x faster than tiktoken.encode_ordinary on mid-size corpora
  • JSON-heavy large file: 13.86ms vs 220.84ms (15.94x)
  • UTF-8 prose large: 15.64ms vs 111.97ms (7.16x)

Correctness

  • bit-exact token counts vs tiktoken on all bench corpora, both vocabs
  • differential fuzzer: 5 flavors (ascii, utf8_mixed, realcode, edge, bytes) with bisect-shrinker
  • currently tracking a Unicode-database edge case: fancy-regex (Rust, used by tiktoken) classifies a handful of unassigned-but-block-allocated codepoints differently from Python's regex module, producing 2-4 disagreements per 10k random inputs. Pure ASCII and realistic code paths are clean.

Current status

  • v0.2: cl100k fully working, SIMD pretok, benches green
  • v0.3 in progress: o200k added, fuzzing passes on realistic inputs, chasing one Unicode classification corner case
  • deferred: threading, tree-sitter strict boundaries, Python/Node bindings