@ArthurZucker
Created April 23, 2026 06:25
tokenizers vs tiktoken vs iree vs wordchipper vs bpe — 10-model non-OpenAI BPE leaderboard + Rust Criterion matrix on llama-3 (AMD EPYC 7R13, pinned 8 cores)

Final tokenizer benchmark results

Both the Python and Rust runs were pinned to CPU cores 0–7, with OMP_NUM_THREADS=1 and 8 rayon/tiktoken threads. Each reported number is the fastest iteration per combination, measured after the warmup passes (see the warmup/iters config below).

  • Host: AMD EPYC 7R13 Processor (nproc: 64)
  • Python run timestamp: 2026-04-23T06:10:10.228895+00:00; load avg at start: 2.58 / 2.16 / 2.54
  • Rust run timestamp: 2026-04-23T06:19:48.992276Z; load avg post-bench: 6.11 / 5.61 / 4.01
  • Python config: batches=[1, 32, 128] lengths=[128, 2048, 8192] threads=8 iters=4 warmup=2
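The best-of-iters measurement loop described above can be sketched as follows. This is a minimal stdlib-only sketch, not the actual harness: `best_of` and `mb_per_s` are hypothetical helper names, the backend batch-encode call is stood in by a plain callable, and MB is assumed to mean 10^6 bytes.

```python
import os
import time

def mb_per_s(n_bytes: int, seconds: float) -> float:
    """Throughput in MB/s (assuming MB = 10**6 bytes)."""
    return n_bytes / seconds / 1e6

def best_of(fn, texts, iters=4, warmup=2):
    """Best-of-iters timing: run `warmup` untimed passes, then keep the
    fastest of `iters` timed passes, per the methodology above."""
    for _ in range(warmup):
        fn(texts)
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(texts)
        best = min(best, time.perf_counter() - t0)
    return best

# Pinning the process to cores 0-7 (Linux-only; the shell equivalent
# used for the Rust side is `taskset -c 0-7`):
if hasattr(os, "sched_setaffinity") and (os.cpu_count() or 0) >= 8:
    os.sched_setaffinity(0, range(8))
```

A per-combination result would then be `mb_per_s(total_corpus_bytes, best_of(encode_call, batch))`, with the peak over all batch/length combinations going into the leaderboard.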

Python world — 10-model leaderboard

encode peak throughput (MB/s)

| model | tokenizers | iree | tiktoken | best alt / tokenizers | best alt backend |
|---|---:|---:|---:|---:|---|
| meta-llama/Llama-3.2-1B | 27.1 | 26.7 | 36.1 | 1.33× | tiktoken |
| Qwen/Qwen2.5-7B | 21.0 | 18.6 | | 0.89× | iree |
| Qwen/Qwen3-8B | 20.9 | 22.3 | | 1.07× | iree |
| deepseek-ai/DeepSeek-V3 | 19.6 | 10.6 | | 0.54× | iree |
| zai-org/GLM-4.5-Air | 26.4 | 25.5 | | 0.97× | iree |
| mistralai/Mistral-Nemo-Instruct-2407 | 26.0 | 26.1 | | 1.00× | iree |
| 01-ai/Yi-1.5-9B | 27.6 | 28.2 | | 1.02× | iree |
| bigcode/starcoder2-7b | 21.0 | 15.7 | | 0.75× | iree |
| EleutherAI/gpt-neox-20b | 21.0 | 21.1 | | 1.01× | iree |
| tiiuae/falcon-7b | 14.1 | 9.9 | | 0.70× | iree |

decode peak throughput (Mtok/s)

| model | tokenizers | iree | tiktoken | best alt / tokenizers | best alt backend |
|---|---:|---:|---:|---:|---|
| meta-llama/Llama-3.2-1B | 9.9 | 72.0 | 45.3 | 7.24× | iree |
| Qwen/Qwen2.5-7B | 10.5 | 57.4 | | 5.46× | iree |
| Qwen/Qwen3-8B | 10.5 | 76.6 | | 7.28× | iree |
| deepseek-ai/DeepSeek-V3 | 10.2 | 60.5 | | 5.94× | iree |
| zai-org/GLM-4.5-Air | 10.1 | 74.4 | | 7.35× | iree |
| mistralai/Mistral-Nemo-Instruct-2407 | 10.2 | 77.2 | | 7.54× | iree |
| 01-ai/Yi-1.5-9B | 10.3 | 67.4 | | 6.51× | iree |
| bigcode/starcoder2-7b | 10.3 | 73.1 | | 7.13× | iree |
| EleutherAI/gpt-neox-20b | 10.5 | 71.6 | | 6.80× | iree |
| tiiuae/falcon-7b | 10.2 | 74.8 | | 7.36× | iree |
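The ratio column in the two leaderboards above can be reproduced from the per-backend peaks. In this sketch the column is read as the peak of the best non-tokenizers backend divided by the tokenizers peak, which matches every row; `best_vs_tokenizers` is a hypothetical helper, not part of the published scripts.

```python
def best_vs_tokenizers(peaks: dict) -> tuple:
    """Given per-backend peak throughputs, return (ratio, backend) for the
    best non-`tokenizers` backend relative to the `tokenizers` peak."""
    base = peaks["tokenizers"]
    backend, peak = max(
        ((b, p) for b, p in peaks.items() if b != "tokenizers"),
        key=lambda bp: bp[1],
    )
    return round(peak / base, 2), backend

# Llama-3.2-1B encode row: tiktoken 36.1 vs tokenizers 27.1 -> (1.33, "tiktoken")
ratio, backend = best_vs_tokenizers({"tokenizers": 27.1, "iree": 26.7, "tiktoken": 36.1})
```

Note that this reading also explains rows like Qwen2.5-7B, where the ratio is below 1×: tokenizers itself is fastest there, and the listed backend is merely the best alternative.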

Rust world — Criterion matrix (llama-3 tokenizer)

encode (offsets)

| batch | input_len | mean (ms) | throughput |
|---:|---:|---:|---:|
| 1 | 128 | 0.15 | 4.2 MB/s |
| 1 | 1024 | 1.37 | 3.3 MB/s |
| 1 | 8192 | 9.87 | 3.6 MB/s |
| 8 | 128 | 0.27 | 18.7 MB/s |
| 8 | 1024 | 1.55 | 23.5 MB/s |
| 8 | 8192 | 11.64 | 24.4 MB/s |
| 32 | 128 | 0.78 | 25.8 MB/s |
| 32 | 1024 | 4.95 | 29.4 MB/s |
| 32 | 8192 | 43.74 | 26.0 MB/s |
| 128 | 128 | 2.56 | 31.4 MB/s |
| 128 | 1024 | 18.38 | 31.7 MB/s |
| 128 | 8192 | 167.15 | 27.2 MB/s |

encode (fast)

| batch | input_len | mean (ms) | throughput |
|---:|---:|---:|---:|
| 1 | 128 | 0.15 | 4.3 MB/s |
| 1 | 1024 | 1.19 | 3.8 MB/s |
| 1 | 8192 | 8.78 | 4.0 MB/s |
| 8 | 128 | 0.21 | 23.8 MB/s |
| 8 | 1024 | 1.29 | 28.3 MB/s |
| 8 | 8192 | 9.88 | 28.8 MB/s |
| 32 | 128 | 0.65 | 30.9 MB/s |
| 32 | 1024 | 4.30 | 33.9 MB/s |
| 32 | 8192 | 36.26 | 31.4 MB/s |
| 128 | 128 | 2.16 | 37.2 MB/s |
| 128 | 1024 | 15.21 | 38.3 MB/s |
| 128 | 8192 | 126.52 | 36.0 MB/s |

decode

| batch | input_len | mean (ms) | throughput |
|---:|---:|---:|---:|
| 1 | 128 | 0.01 | 9.5 Mtok/s |
| 1 | 1024 | 0.16 | 6.6 Mtok/s |
| 1 | 8192 | 1.26 | 6.5 Mtok/s |
| 8 | 128 | 0.05 | 20.7 Mtok/s |
| 8 | 1024 | 0.23 | 36.4 Mtok/s |
| 8 | 8192 | 1.53 | 42.8 Mtok/s |
| 32 | 128 | 0.12 | 35.2 Mtok/s |
| 32 | 1024 | 0.70 | 46.8 Mtok/s |
| 32 | 8192 | 5.14 | 51.0 Mtok/s |
| 128 | 128 | 0.35 | 46.6 Mtok/s |
| 128 | 1024 | 2.30 | 56.9 Mtok/s |
| 128 | 8192 | 17.93 | 58.5 Mtok/s |
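For the decode matrix, the throughput column follows directly from the mean latency if the token count per call is taken as batch × input_len; that assumption reproduces the table's numbers exactly. (It does not carry over to the encode matrices, where input_len does not map one-to-one to bytes.) A small sketch:

```python
def mtok_per_s(batch: int, input_len: int, mean_ms: float) -> float:
    """Decode throughput in Mtok/s from a mean batch latency in ms,
    assuming batch * input_len tokens are decoded per call."""
    tokens = batch * input_len
    return round(tokens / (mean_ms / 1e3) / 1e6, 1)

# e.g. batch=128, input_len=8192, 17.93 ms -> 58.5 Mtok/s, as in the table
```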

Python vs Rust overhead (llama-3, batch=128, len=8192)

| phase | Python | Rust | Python / Rust |
|---|---:|---:|---:|
| encode (fast) | 27.1 MB/s | 36.0 MB/s | 75% |
| decode | 9.9 Mtok/s | 58.5 Mtok/s | 17% |

Files

  • python_leaderboard.json — full row-level results per model/backend/combo
  • python_leaderboard.md — model × backend peak table
  • python_leaderboard.log — full interactive log incl. per-model tables
  • rust_matrix.json — criterion estimates parsed into a compact form
  • rust_matrix.md — encode/encode-fast/decode matrices
  • rust_matrix.log — raw criterion output
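rust_matrix.json is described above as "criterion estimates parsed into a compact form". The parser itself is not included in the gist, but the usual reduction could be sketched as below, assuming Criterion's standard estimates.json layout where `mean.point_estimate` is stored in nanoseconds; `mean_ms_from_estimates` is a hypothetical helper name.

```python
import json

def mean_ms_from_estimates(raw: str) -> float:
    """Extract the mean latency in milliseconds from the JSON text of a
    Criterion estimates.json file (point estimates are in nanoseconds)."""
    est = json.loads(raw)
    return round(est["mean"]["point_estimate"] / 1e6, 2)

# e.g. a 17.93 ms mean decode is stored as 17_930_000 ns
```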

Tokenizer benchmark results

  • Timestamp: 2026-04-23T06:10:10.228895+00:00
  • CPU: AMD EPYC 7R13 Processor (nproc: 64)
  • Load avg (1/5/15): 2.58 / 2.16 / 2.54
  • Pinned CPUs: 8 (0..7)
  • Governor: unknown
  • Config: batches=[1, 32, 128] lengths=[128, 2048, 8192] threads=8 iters=4 warmup=2


Rust matrix benchmark results

  • Timestamp: 2026-04-23T06:19:48.992276Z
  • CPU: AMD EPYC 7R13 Processor (nproc: 64)
  • Load avg (at bench time ≈): 6.11 / 5.61 / 4.01
  • Pinned CPUs: 0..7 (8 cores, via taskset -c 0-7)
  • Tokenizer: llama-3
  • Corpus: data/big.txt
  • Harness: criterion (warmup 2s, measure 5s)
