Last active
October 23, 2024 07:54
-
-
Save folkertdev/977183fb706b7693863bd7f358578292 to your computer and use it in GitHub Desktop.
zlib-rs labeled match benchmarks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # zlib-rs labeled match benchmarks | |
| ## build the toolchain | |
| A proof of concept implementation can be found at https://github.com/trifectatechfoundation/rust/tree/labeled-match. Build it with `./x build`, and then set up the [toolchain](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain). Now `cargo +stage1 build` should use a compiler with `labeled-match` available. | |
| ## run the benchmark | |
| ``` | |
| git clone https://github.com/trifectatechfoundation/zlib-rs.git | |
| git checkout len-as-match | |
| sh replicate-labeled-match-benchmarks.sh | |
| ``` | |
| this runs 4 benchmarks | |
| - baseline: the current zlib-rs main branch approach using tail calls | |
| - loop-plus-match: standard approach using a loop and match; suffers from branch misprediction | |
| - labeled-match-len: the `len` function and friends now use labeled match | |
| - labeled-match-fast: the `len` and friends, and `inflate_fast_help` functions now use labeld match | |
| ## results | |
| Mostly what we see is that labeled match gives significant speedups for small chunk sizes. For larger chunk sizes, the results are less clear. I believe really the result is net-zero, but we need clearly need to perform some further tuning. | |
| Note in particular how `loop-plus-match` works well for small _and_ big inputs, but terribly for medium inputs. | |
| ``` | |
| Benchmark 1 (69 runs): /tmp/uncompress-baseline rs-chunked 4 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 72.5ms ± 2.46ms 71.0ms … 91.0ms 7 (10%) 0% | |
| peak_rss 24.1MB ± 77.8KB 23.9MB … 24.1MB 0 ( 0%) 0% | |
| cpu_cycles 294M ± 9.90M 291M … 371M 7 (10%) 0% | |
| instructions 914M ± 274 914M … 914M 0 ( 0%) 0% | |
| cache_references 3.04M ± 519K 2.69M … 6.09M 4 ( 6%) 0% | |
| cache_misses 156K ± 39.1K 126K … 463K 1 ( 1%) 0% | |
| branch_misses 4.09M ± 10.8K 4.08M … 4.17M 5 ( 7%) 0% | |
| Benchmark 2 (71 runs): /tmp/loop-plus-match rs-chunked 4 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 70.9ms ± 817us 69.9ms … 76.2ms 2 ( 3%) ⚡- 2.1% ± 0.8% | |
| peak_rss 24.1MB ± 58.4KB 24.0MB … 24.1MB 0 ( 0%) + 0.1% ± 0.1% | |
| cpu_cycles 287M ± 2.33M 285M … 305M 3 ( 4%) ⚡- 2.5% ± 0.8% | |
| instructions 792M ± 300 792M … 792M 1 ( 1%) ⚡- 13.4% ± 0.0% | |
| cache_references 2.98M ± 89.8K 2.78M … 3.33M 1 ( 1%) - 1.9% ± 4.0% | |
| cache_misses 154K ± 15.5K 115K … 207K 1 ( 1%) - 1.6% ± 6.3% | |
| branch_misses 4.10M ± 3.85K 4.09M … 4.11M 1 ( 1%) + 0.2% ± 0.1% | |
| Benchmark 3 (79 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 63.9ms ± 1.66ms 62.7ms … 74.9ms 5 ( 6%) ⚡- 11.8% ± 0.9% | |
| peak_rss 24.1MB ± 64.6KB 23.9MB … 24.1MB 18 (23%) + 0.1% ± 0.1% | |
| cpu_cycles 254M ± 6.97M 252M … 307M 9 (11%) ⚡- 13.6% ± 0.9% | |
| instructions 710M ± 259 710M … 710M 0 ( 0%) ⚡- 22.3% ± 0.0% | |
| cache_references 3.05M ± 736K 2.79M … 9.38M 5 ( 6%) + 0.4% ± 6.8% | |
| cache_misses 134K ± 14.6K 111K … 176K 1 ( 1%) ⚡- 14.0% ± 5.9% | |
| branch_misses 4.08M ± 5.27K 4.08M … 4.11M 3 ( 4%) - 0.1% ± 0.1% | |
| Benchmark 4 (81 runs): /tmp/labeled-match-fast rs-chunked 4 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 62.1ms ± 554us 61.4ms … 64.4ms 3 ( 4%) ⚡- 14.3% ± 0.8% | |
| peak_rss 24.1MB ± 63.0KB 23.9MB … 24.1MB 0 ( 0%) + 0.0% ± 0.1% | |
| cpu_cycles 246M ± 1.75M 245M … 257M 6 ( 7%) ⚡- 16.3% ± 0.7% | |
| instructions 689M ± 258 689M … 689M 0 ( 0%) ⚡- 24.6% ± 0.0% | |
| cache_references 3.02M ± 336K 2.77M … 5.78M 4 ( 5%) - 0.6% ± 4.5% | |
| cache_misses 128K ± 16.9K 101K … 186K 5 ( 6%) ⚡- 17.7% ± 6.0% | |
| branch_misses 4.08M ± 5.14K 4.08M … 4.10M 2 ( 2%) - 0.1% ± 0.1% | |
| Benchmark 1 (108 runs): /tmp/uncompress-baseline rs-chunked 7 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 46.5ms ± 649us 45.3ms … 50.1ms 2 ( 2%) 0% | |
| peak_rss 24.1MB ± 57.0KB 24.0MB … 24.1MB 27 (25%) 0% | |
| cpu_cycles 174M ± 1.71M 173M … 187M 9 ( 8%) 0% | |
| instructions 516M ± 277 516M … 516M 1 ( 1%) 0% | |
| cache_references 3.14M ± 181K 2.93M … 4.33M 3 ( 3%) 0% | |
| cache_misses 45.8K ± 17.2K 29.9K … 93.9K 7 ( 6%) 0% | |
| branch_misses 2.00M ± 2.42K 2.00M … 2.02M 2 ( 2%) 0% | |
| Benchmark 2 (78 runs): /tmp/loop-plus-match rs-chunked 7 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 64.8ms ± 705us 63.8ms … 69.2ms 1 ( 1%) 💩+ 39.3% ± 0.4% | |
| peak_rss 24.1MB ± 76.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1% | |
| cpu_cycles 257M ± 2.22M 256M … 273M 9 (12%) 💩+ 47.5% ± 0.3% | |
| instructions 720M ± 385 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0% | |
| cache_references 3.18M ± 92.4K 2.99M … 3.55M 2 ( 3%) + 1.5% ± 1.4% | |
| cache_misses 37.4K ± 17.6K 26.5K … 174K 7 ( 9%) ⚡- 18.4% ± 11.1% | |
| branch_misses 2.00M ± 2.47K 2.00M … 2.01M 3 ( 4%) - 0.2% ± 0.0% | |
| Benchmark 3 (104 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 48.0ms ± 500us 47.2ms … 50.3ms 1 ( 1%) 💩+ 3.2% ± 0.3% | |
| peak_rss 24.1MB ± 59.7KB 24.0MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1% | |
| cpu_cycles 181M ± 1.18M 180M … 187M 6 ( 6%) 💩+ 3.8% ± 0.2% | |
| instructions 586M ± 354 586M … 586M 4 ( 4%) 💩+ 13.6% ± 0.0% | |
| cache_references 3.25M ± 202K 3.02M … 4.80M 6 ( 6%) 💩+ 3.5% ± 1.6% | |
| cache_misses 48.6K ± 8.60K 31.5K … 83.5K 5 ( 5%) + 6.0% ± 8.0% | |
| branch_misses 2.00M ± 1.77K 2.00M … 2.01M 5 ( 5%) - 0.2% ± 0.0% | |
| Benchmark 4 (110 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 45.5ms ± 927us 44.5ms … 51.4ms 4 ( 4%) ⚡- 2.2% ± 0.5% | |
| peak_rss 24.1MB ± 73.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1% | |
| cpu_cycles 170M ± 2.85M 168M … 190M 12 (11%) ⚡- 2.7% ± 0.4% | |
| instructions 515M ± 249 515M … 515M 0 ( 0%) - 0.1% ± 0.0% | |
| cache_references 3.27M ± 107K 3.07M … 3.94M 2 ( 2%) 💩+ 4.2% ± 1.3% | |
| cache_misses 109K ± 5.80K 98.7K … 139K 4 ( 4%) 💩+139.0% ± 7.4% | |
| branch_misses 2.00M ± 4.10K 1.99M … 2.03M 4 ( 4%) - 0.2% ± 0.0% | |
| Benchmark 1 (181 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 27.6ms ± 433us 26.8ms … 31.4ms 6 ( 3%) 0% | |
| peak_rss 24.1MB ± 58.9KB 23.9MB … 24.1MB 45 (25%) 0% | |
| cpu_cycles 90.0M ± 919K 89.5M … 99.7M 15 ( 8%) 0% | |
| instructions 239M ± 322 239M … 239M 3 ( 2%) 0% | |
| cache_references 2.28M ± 53.6K 2.19M … 2.70M 4 ( 2%) 0% | |
| cache_misses 43.5K ± 2.44K 40.3K … 64.5K 5 ( 3%) 0% | |
| branch_misses 1.05M ± 1.70K 1.05M … 1.06M 4 ( 2%) 0% | |
| Benchmark 2 (186 runs): /tmp/loop-plus-match rs-chunked 16 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 27.0ms ± 304us 26.3ms … 28.6ms 3 ( 2%) ⚡- 2.5% ± 0.3% | |
| peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1% | |
| cpu_cycles 87.3M ± 654K 86.8M … 91.7M 19 (10%) ⚡- 3.0% ± 0.2% | |
| instructions 248M ± 266 248M … 248M 2 ( 1%) 💩+ 3.8% ± 0.0% | |
| cache_references 2.25M ± 55.5K 2.17M … 2.75M 5 ( 3%) - 1.2% ± 0.5% | |
| cache_misses 45.4K ± 2.26K 40.6K … 52.2K 1 ( 1%) 💩+ 4.2% ± 1.1% | |
| branch_misses 1.05M ± 1.67K 1.04M … 1.05M 4 ( 2%) - 0.5% ± 0.0% | |
| Benchmark 3 (184 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 27.2ms ± 817us 26.3ms … 33.7ms 16 ( 9%) ⚡- 1.7% ± 0.5% | |
| peak_rss 24.1MB ± 55.9KB 23.9MB … 24.1MB 39 (21%) + 0.0% ± 0.0% | |
| cpu_cycles 87.4M ± 1.95M 86.4M … 103M 23 (13%) ⚡- 2.9% ± 0.3% | |
| instructions 248M ± 278 248M … 248M 2 ( 1%) 💩+ 3.6% ± 0.0% | |
| cache_references 2.28M ± 112K 2.18M … 2.93M 16 ( 9%) + 0.0% ± 0.8% | |
| cache_misses 47.9K ± 11.1K 40.5K … 168K 15 ( 8%) 💩+ 10.0% ± 3.8% | |
| branch_misses 1.05M ± 2.32K 1.04M … 1.06M 4 ( 2%) - 0.4% ± 0.0% | |
| Benchmark 4 (182 runs): /tmp/labeled-match-fast rs-chunked 16 silesia-small.tar.gz | |
| measurement mean ± σ min … max outliers delta | |
| wall_time 27.5ms ± 252us 26.9ms … 28.5ms 2 ( 1%) - 0.7% ± 0.3% | |
| peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 44 (24%) - 0.0% ± 0.1% | |
| cpu_cycles 89.5M ± 477K 89.1M … 93.9M 18 (10%) - 0.6% ± 0.2% | |
| instructions 253M ± 306 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0% | |
| cache_references 2.27M ± 67.3K 2.19M … 3.00M 3 ( 2%) - 0.5% ± 0.5% | |
| cache_misses 66.5K ± 2.10K 59.8K … 73.6K 2 ( 1%) 💩+ 52.7% ± 1.1% | |
| branch_misses 1.05M ± 1.22K 1.05M … 1.06M 1 ( 1%) - 0.2% ± 0.0% | |
| ``` |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
On a AMD Ryzen 7 3700X 8-Core Processor: