A proof of concept implementation can be found at https://github.com/trifectatechfoundation/rust/tree/labeled-match. Build it with ./x build, and then set up the toolchain. Now cargo +stage1 build should use a compiler with labeled-match available.
git clone https://github.com/trifectatechfoundation/zlib-rs.git
git checkout len-as-match
sh replicate-labeled-match-benchmarks.sh
this runs 4 benchmarks
- baseline: the current zlib-rs main branch approach using tail calls
- loop-plus-match: standard approach using a loop and match; suffers from branch misprediction
- labeled-match-len: the
lenfunction and friends now use labeled match - labeled-match-fast: the
lenand friends, andinflate_fast_helpfunctions now use labeld match
Mostly what we see is that labeled match gives significant speedups for small chunk sizes. For larger chunk sizes, the results are less clear. I believe really the result is net-zero, but we need clearly need to perform some further tuning.
Note in particular how loop-plus-match works well for small and big inputs, but terribly for medium inputs.
Benchmark 1 (69 runs): /tmp/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 72.5ms ± 2.46ms 71.0ms … 91.0ms 7 (10%) 0%
peak_rss 24.1MB ± 77.8KB 23.9MB … 24.1MB 0 ( 0%) 0%
cpu_cycles 294M ± 9.90M 291M … 371M 7 (10%) 0%
instructions 914M ± 274 914M … 914M 0 ( 0%) 0%
cache_references 3.04M ± 519K 2.69M … 6.09M 4 ( 6%) 0%
cache_misses 156K ± 39.1K 126K … 463K 1 ( 1%) 0%
branch_misses 4.09M ± 10.8K 4.08M … 4.17M 5 ( 7%) 0%
Benchmark 2 (71 runs): /tmp/loop-plus-match rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 70.9ms ± 817us 69.9ms … 76.2ms 2 ( 3%) ⚡- 2.1% ± 0.8%
peak_rss 24.1MB ± 58.4KB 24.0MB … 24.1MB 0 ( 0%) + 0.1% ± 0.1%
cpu_cycles 287M ± 2.33M 285M … 305M 3 ( 4%) ⚡- 2.5% ± 0.8%
instructions 792M ± 300 792M … 792M 1 ( 1%) ⚡- 13.4% ± 0.0%
cache_references 2.98M ± 89.8K 2.78M … 3.33M 1 ( 1%) - 1.9% ± 4.0%
cache_misses 154K ± 15.5K 115K … 207K 1 ( 1%) - 1.6% ± 6.3%
branch_misses 4.10M ± 3.85K 4.09M … 4.11M 1 ( 1%) + 0.2% ± 0.1%
Benchmark 3 (79 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 63.9ms ± 1.66ms 62.7ms … 74.9ms 5 ( 6%) ⚡- 11.8% ± 0.9%
peak_rss 24.1MB ± 64.6KB 23.9MB … 24.1MB 18 (23%) + 0.1% ± 0.1%
cpu_cycles 254M ± 6.97M 252M … 307M 9 (11%) ⚡- 13.6% ± 0.9%
instructions 710M ± 259 710M … 710M 0 ( 0%) ⚡- 22.3% ± 0.0%
cache_references 3.05M ± 736K 2.79M … 9.38M 5 ( 6%) + 0.4% ± 6.8%
cache_misses 134K ± 14.6K 111K … 176K 1 ( 1%) ⚡- 14.0% ± 5.9%
branch_misses 4.08M ± 5.27K 4.08M … 4.11M 3 ( 4%) - 0.1% ± 0.1%
Benchmark 4 (81 runs): /tmp/labeled-match-fast rs-chunked 4 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 62.1ms ± 554us 61.4ms … 64.4ms 3 ( 4%) ⚡- 14.3% ± 0.8%
peak_rss 24.1MB ± 63.0KB 23.9MB … 24.1MB 0 ( 0%) + 0.0% ± 0.1%
cpu_cycles 246M ± 1.75M 245M … 257M 6 ( 7%) ⚡- 16.3% ± 0.7%
instructions 689M ± 258 689M … 689M 0 ( 0%) ⚡- 24.6% ± 0.0%
cache_references 3.02M ± 336K 2.77M … 5.78M 4 ( 5%) - 0.6% ± 4.5%
cache_misses 128K ± 16.9K 101K … 186K 5 ( 6%) ⚡- 17.7% ± 6.0%
branch_misses 4.08M ± 5.14K 4.08M … 4.10M 2 ( 2%) - 0.1% ± 0.1%
Benchmark 1 (108 runs): /tmp/uncompress-baseline rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 46.5ms ± 649us 45.3ms … 50.1ms 2 ( 2%) 0%
peak_rss 24.1MB ± 57.0KB 24.0MB … 24.1MB 27 (25%) 0%
cpu_cycles 174M ± 1.71M 173M … 187M 9 ( 8%) 0%
instructions 516M ± 277 516M … 516M 1 ( 1%) 0%
cache_references 3.14M ± 181K 2.93M … 4.33M 3 ( 3%) 0%
cache_misses 45.8K ± 17.2K 29.9K … 93.9K 7 ( 6%) 0%
branch_misses 2.00M ± 2.42K 2.00M … 2.02M 2 ( 2%) 0%
Benchmark 2 (78 runs): /tmp/loop-plus-match rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 64.8ms ± 705us 63.8ms … 69.2ms 1 ( 1%) 💩+ 39.3% ± 0.4%
peak_rss 24.1MB ± 76.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
cpu_cycles 257M ± 2.22M 256M … 273M 9 (12%) 💩+ 47.5% ± 0.3%
instructions 720M ± 385 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0%
cache_references 3.18M ± 92.4K 2.99M … 3.55M 2 ( 3%) + 1.5% ± 1.4%
cache_misses 37.4K ± 17.6K 26.5K … 174K 7 ( 9%) ⚡- 18.4% ± 11.1%
branch_misses 2.00M ± 2.47K 2.00M … 2.01M 3 ( 4%) - 0.2% ± 0.0%
Benchmark 3 (104 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 48.0ms ± 500us 47.2ms … 50.3ms 1 ( 1%) 💩+ 3.2% ± 0.3%
peak_rss 24.1MB ± 59.7KB 24.0MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
cpu_cycles 181M ± 1.18M 180M … 187M 6 ( 6%) 💩+ 3.8% ± 0.2%
instructions 586M ± 354 586M … 586M 4 ( 4%) 💩+ 13.6% ± 0.0%
cache_references 3.25M ± 202K 3.02M … 4.80M 6 ( 6%) 💩+ 3.5% ± 1.6%
cache_misses 48.6K ± 8.60K 31.5K … 83.5K 5 ( 5%) + 6.0% ± 8.0%
branch_misses 2.00M ± 1.77K 2.00M … 2.01M 5 ( 5%) - 0.2% ± 0.0%
Benchmark 4 (110 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 45.5ms ± 927us 44.5ms … 51.4ms 4 ( 4%) ⚡- 2.2% ± 0.5%
peak_rss 24.1MB ± 73.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
cpu_cycles 170M ± 2.85M 168M … 190M 12 (11%) ⚡- 2.7% ± 0.4%
instructions 515M ± 249 515M … 515M 0 ( 0%) - 0.1% ± 0.0%
cache_references 3.27M ± 107K 3.07M … 3.94M 2 ( 2%) 💩+ 4.2% ± 1.3%
cache_misses 109K ± 5.80K 98.7K … 139K 4 ( 4%) 💩+139.0% ± 7.4%
branch_misses 2.00M ± 4.10K 1.99M … 2.03M 4 ( 4%) - 0.2% ± 0.0%
Benchmark 1 (181 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.6ms ± 433us 26.8ms … 31.4ms 6 ( 3%) 0%
peak_rss 24.1MB ± 58.9KB 23.9MB … 24.1MB 45 (25%) 0%
cpu_cycles 90.0M ± 919K 89.5M … 99.7M 15 ( 8%) 0%
instructions 239M ± 322 239M … 239M 3 ( 2%) 0%
cache_references 2.28M ± 53.6K 2.19M … 2.70M 4 ( 2%) 0%
cache_misses 43.5K ± 2.44K 40.3K … 64.5K 5 ( 3%) 0%
branch_misses 1.05M ± 1.70K 1.05M … 1.06M 4 ( 2%) 0%
Benchmark 2 (186 runs): /tmp/loop-plus-match rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.0ms ± 304us 26.3ms … 28.6ms 3 ( 2%) ⚡- 2.5% ± 0.3%
peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
cpu_cycles 87.3M ± 654K 86.8M … 91.7M 19 (10%) ⚡- 3.0% ± 0.2%
instructions 248M ± 266 248M … 248M 2 ( 1%) 💩+ 3.8% ± 0.0%
cache_references 2.25M ± 55.5K 2.17M … 2.75M 5 ( 3%) - 1.2% ± 0.5%
cache_misses 45.4K ± 2.26K 40.6K … 52.2K 1 ( 1%) 💩+ 4.2% ± 1.1%
branch_misses 1.05M ± 1.67K 1.04M … 1.05M 4 ( 2%) - 0.5% ± 0.0%
Benchmark 3 (184 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.2ms ± 817us 26.3ms … 33.7ms 16 ( 9%) ⚡- 1.7% ± 0.5%
peak_rss 24.1MB ± 55.9KB 23.9MB … 24.1MB 39 (21%) + 0.0% ± 0.0%
cpu_cycles 87.4M ± 1.95M 86.4M … 103M 23 (13%) ⚡- 2.9% ± 0.3%
instructions 248M ± 278 248M … 248M 2 ( 1%) 💩+ 3.6% ± 0.0%
cache_references 2.28M ± 112K 2.18M … 2.93M 16 ( 9%) + 0.0% ± 0.8%
cache_misses 47.9K ± 11.1K 40.5K … 168K 15 ( 8%) 💩+ 10.0% ± 3.8%
branch_misses 1.05M ± 2.32K 1.04M … 1.06M 4 ( 2%) - 0.4% ± 0.0%
Benchmark 4 (182 runs): /tmp/labeled-match-fast rs-chunked 16 silesia-small.tar.gz
measurement mean ± σ min … max outliers delta
wall_time 27.5ms ± 252us 26.9ms … 28.5ms 2 ( 1%) - 0.7% ± 0.3%
peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 44 (24%) - 0.0% ± 0.1%
cpu_cycles 89.5M ± 477K 89.1M … 93.9M 18 (10%) - 0.6% ± 0.2%
instructions 253M ± 306 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0%
cache_references 2.27M ± 67.3K 2.19M … 3.00M 3 ( 2%) - 0.5% ± 0.5%
cache_misses 66.5K ± 2.10K 59.8K … 73.6K 2 ( 1%) 💩+ 52.7% ± 1.1%
branch_misses 1.05M ± 1.22K 1.05M … 1.06M 1 ( 1%) - 0.2% ± 0.0%
On a AMD Ryzen 7 3700X 8-Core Processor: