Skip to content

Instantly share code, notes, and snippets.

@folkertdev
Last active October 23, 2024 07:54
Show Gist options
  • Select an option

  • Save folkertdev/977183fb706b7693863bd7f358578292 to your computer and use it in GitHub Desktop.

Select an option

Save folkertdev/977183fb706b7693863bd7f358578292 to your computer and use it in GitHub Desktop.

Revisions

  1. folkertdev revised this gist Oct 22, 2024. 1 changed file with 92 additions and 90 deletions.
    182 changes: 92 additions & 90 deletions gistfile1.md
    Original file line number Diff line number Diff line change
    @@ -19,119 +19,121 @@ this runs 4 benchmarks
    - labeled-match-len: the `len` function and friends now use labeled match
    - labeled-match-fast: the `len` and friends, and `inflate_fast_help` functions now use labeld match

    The benchmark is run for various chunk sizes (2 to the power 4, 7 and 16), which varies what logic is run: a chunk size of 2^16 spends most time in an inner loop, while 2^4 spends much more time in the state machine logic.

    ## results

    Mostly what we see is that labeled match gives significant speedups for small chunk sizes. For larger chunk sizes, the results are less clear. I believe really the result is net-zero, but we need clearly need to perform some further tuning.

    Note in particular how `loop-plus-match` works well for small _and_ big inputs, but terribly for medium inputs.
    Note in particular how `loop-plus-match` works well for small _and_ big inputs, but terribly for medium inputs. The `labeled-match-fast` change barely seems to do anything versus just `labeled-match-len`.

    ```
    Benchmark 1 (69 runs): /tmp/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 72.5ms ± 2.46ms 71.0ms91.0ms 7 (10%) 0%
    peak_rss 24.1MB ± 77.8KB 23.9MB … 24.1MB 0 ( 0%) 0%
    cpu_cycles 294M ± 9.90M 291M371M 7 (10%) 0%
    instructions 914M ± 274 914M … 914M 0 ( 0%) 0%
    cache_references 3.04M ± 519K 2.69M … 6.09M 4 ( 6%) 0%
    cache_misses 156K ± 39.1K 126K463K 1 ( 1%) 0%
    branch_misses 4.09M ± 10.8K 4.08M … 4.17M 5 ( 7%) 0%
    wall_time 72.7ms ± 4.55ms 71.1ms 108ms 3 ( 4%) 0%
    peak_rss 24.1MB ± 56.3KB 23.9MB … 24.1MB 12 (17%) 0%
    cpu_cycles 294M ± 17.7M 290M434M 6 ( 9%) 0%
    instructions 914M ± 449 914M … 914M 1 ( 1%) 0%
    cache_references 2.99M ± 407K 2.68M … 6.10M 4 ( 6%) 0%
    cache_misses 134K ± 23.8K 101K302K 2 ( 3%) 0%
    branch_misses 4.09M ± 8.56K 4.08M … 4.14M 2 ( 3%) 0%
    Benchmark 2 (71 runs): /tmp/loop-plus-match rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 70.9ms ± 817us 69.9ms … 76.2ms 2 ( 3%) ⚡- 2.1% ± 0.8%
    peak_rss 24.1MB ± 58.4KB 24.0MB … 24.1MB 0 ( 0%) + 0.1% ± 0.1%
    cpu_cycles 287M ± 2.33M 285M … 305M 3 ( 4%) ⚡- 2.5% ± 0.8%
    instructions 792M ± 300 792M … 792M 1 ( 1%) ⚡- 13.4% ± 0.0%
    cache_references 2.98M ± 89.8K 2.78M … 3.33M 1 ( 1%) - 1.9% ± 4.0%
    cache_misses 154K ± 15.5K 115K 207K 1 ( 1%) - 1.6% ± 6.3%
    branch_misses 4.10M ± 3.85K 4.09M … 4.11M 1 ( 1%) + 0.2% ± 0.1%
    Benchmark 3 (79 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz
    wall_time 70.6ms ± 499us 69.9ms … 72.9ms 3 ( 4%) ⚡- 2.8% ± 1.5%
    peak_rss 24.1MB ± 60.2KB 24.0MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 287M ± 1.53M 285M … 294M 5 ( 7%) ⚡- 2.7% ± 1.4%
    instructions 792M ± 336 792M … 792M 0 ( 0%) ⚡- 13.4% ± 0.0%
    cache_references 2.92M ± 81.8K 2.77M … 3.12M 0 ( 0%) - 2.2% ± 3.2%
    cache_misses 113K ± 12.4K 89.9K 169K 3 ( 4%) ⚡- 16.0% ± 4.7%
    branch_misses 4.10M ± 4.76K 4.09M … 4.13M 3 ( 4%) + 0.2% ± 0.1%
    Benchmark 3 (80 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 63.9ms ± 1.66ms 62.7ms … 74.9ms 5 ( 6%) ⚡- 11.8% ± 0.9%
    peak_rss 24.1MB ± 64.6KB 23.9MB … 24.1MB 18 (23%) + 0.1% ± 0.1%
    cpu_cycles 254M ± 6.97M 252M307M 9 (11%) ⚡- 13.6% ± 0.9%
    instructions 710M ± 259 710M710M 0 ( 0%) ⚡- 22.3% ± 0.0%
    cache_references 3.05M ± 736K 2.79M9.38M 5 ( 6%) + 0.4% ± 6.8%
    cache_misses 134K ± 14.6K 111K176K 1 ( 1%) ⚡- 14.0% ± 5.9%
    branch_misses 4.08M ± 5.27K 4.08M … 4.11M 3 ( 4%) - 0.1% ± 0.1%
    wall_time 62.6ms ± 555us 61.7ms … 66.1ms 2 ( 3%) ⚡- 14.0% ± 1.4%
    peak_rss 24.1MB ± 77.9KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 249M ± 1.87M 248M263M 5 ( 6%) ⚡- 15.4% ± 1.3%
    instructions 686M ± 267 686M686M 0 ( 0%) ⚡- 24.9% ± 0.0%
    cache_references 3.01M ± 480K 2.76M7.16M 2 ( 3%) + 0.5% ± 4.8%
    cache_misses 123K ± 6.92K 100K137K 2 ( 3%) ⚡- 8.4% ± 4.1%
    branch_misses 4.08M ± 2.78K 4.08M … 4.09M 4 ( 5%) - 0.1% ± 0.0%
    Benchmark 4 (81 runs): /tmp/labeled-match-fast rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 62.1ms ± 554us 61.4ms … 64.4ms 3 ( 4%) ⚡- 14.3% ± 0.8%
    peak_rss 24.1MB ± 63.0KB 23.9MB … 24.1MB 0 ( 0%) + 0.0% ± 0.1%
    cpu_cycles 246M ± 1.75M 245M … 257M 6 ( 7%) ⚡- 16.3% ± 0.7%
    instructions 689M ± 258 689M … 689M 0 ( 0%) ⚡- 24.6% ± 0.0%
    cache_references 3.02M ± 336K 2.77M5.78M 4 ( 5%) - 0.6% ± 4.5%
    cache_misses 128K ± 16.9K 101K 186K 5 ( 6%) ⚡- 17.7% ± 6.0%
    branch_misses 4.08M ± 5.14K 4.08M … 4.10M 2 ( 2%) - 0.1% ± 0.1%
    wall_time 61.9ms ± 468us 61.3ms … 64.0ms 6 ( 7%) ⚡- 14.8% ± 1.4%
    peak_rss 24.1MB ± 56.9KB 24.0MB … 24.1MB 20 (25%) - 0.0% ± 0.1%
    cpu_cycles 246M ± 1.53M 245M … 255M 12 (15%) ⚡- 16.4% ± 1.3%
    instructions 689M ± 365 689M … 689M 1 ( 1%) ⚡- 24.6% ± 0.0%
    cache_references 2.97M ± 211K 2.79M4.37M 3 ( 4%) - 0.8% ± 3.4%
    cache_misses 91.4K ± 9.22K 72.0K 114K 0 ( 0%) ⚡- 31.8% ± 4.2%
    branch_misses 4.08M ± 4.06K 4.08M … 4.10M 1 ( 1%) - 0.1% ± 0.1%
    Benchmark 1 (108 runs): /tmp/uncompress-baseline rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 46.5ms ± 649us 45.3ms50.1ms 2 ( 2%) 0%
    peak_rss 24.1MB ± 57.0KB 24.0MB … 24.1MB 27 (25%) 0%
    cpu_cycles 174M ± 1.71M 173M … 187M 9 ( 8%) 0%
    instructions 516M ± 277 516M … 516M 1 ( 1%) 0%
    cache_references 3.14M ± 181K 2.93M … 4.33M 3 ( 3%) 0%
    cache_misses 45.8K ± 17.2K 29.9K93.9K 7 ( 6%) 0%
    branch_misses 2.00M ± 2.42K 2.00M … 2.02M 2 ( 2%) 0%
    wall_time 46.3ms ± 1.37ms 45.0ms56.5ms 10 ( 9%) 0%
    peak_rss 24.1MB ± 56.3KB 24.0MB … 24.1MB 26 (24%) 0%
    cpu_cycles 174M ± 4.67M 173M … 214M 10 ( 9%) 0%
    instructions 516M ± 443 516M … 516M 3 ( 3%) 0%
    cache_references 3.16M ± 219K 2.88M … 4.43M 6 ( 6%) 0%
    cache_misses 83.6K ± 11.6K 62.4K 161K 5 ( 5%) 0%
    branch_misses 2.00M ± 5.17K 2.00M … 2.04M 9 ( 8%) 0%
    Benchmark 2 (78 runs): /tmp/loop-plus-match rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 64.8ms ± 705us 63.8ms69.2ms 1 ( 1%) 💩+ 39.3% ± 0.4%
    peak_rss 24.1MB ± 76.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 257M ± 2.22M 256M273M 9 (12%) 💩+ 47.5% ± 0.3%
    instructions 720M ± 385 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0%
    cache_references 3.18M ± 92.4K 2.99M3.55M 2 ( 3%) + 1.5% ± 1.4%
    cache_misses 37.4K ± 17.6K 26.5K 174K 7 ( 9%) ⚡- 18.4% ± 11.1%
    branch_misses 2.00M ± 2.47K 2.00M … 2.01M 3 ( 4%) - 0.2% ± 0.0%
    Benchmark 3 (104 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz
    wall_time 64.5ms ± 1.29ms 63.6ms71.3ms 6 ( 8%) 💩+ 39.3% ± 0.8%
    peak_rss 24.1MB ± 72.6KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 257M ± 3.65M 255M279M 12 (15%) 💩+ 47.4% ± 0.7%
    instructions 720M ± 384 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0%
    cache_references 3.21M ± 182K 2.97M4.32M 3 ( 4%) + 1.8% ± 1.9%
    cache_misses 57.9K ± 8.34K 47.0K98.8K 4 ( 5%) ⚡- 30.7% ± 3.6%
    branch_misses 2.00M ± 2.44K 2.00M … 2.01M 5 ( 6%) - 0.2% ± 0.1%
    Benchmark 3 (111 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 48.0ms ± 500us 47.2ms … 50.3ms 1 ( 1%) 💩+ 3.2% ± 0.3%
    peak_rss 24.1MB ± 59.7KB 24.0MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 181M ± 1.18M 180M187M 6 ( 6%) 💩+ 3.8% ± 0.2%
    instructions 586M ± 354 586M586M 4 ( 4%) 💩+ 13.6% ± 0.0%
    cache_references 3.25M ± 202K 3.02M … 4.80M 6 ( 6%) 💩+ 3.5% ± 1.6%
    cache_misses 48.6K ± 8.60K 31.5K83.5K 5 ( 5%) + 6.0% ± 8.0%
    branch_misses 2.00M ± 1.77K 2.00M … 2.01M 5 ( 5%) - 0.2% ± 0.0%
    Benchmark 4 (110 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz
    wall_time 45.2ms ± 781us 44.2ms … 51.3ms 5 ( 5%) ⚡- 2.5% ± 0.6%
    peak_rss 24.1MB ± 66.5KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 170M ± 3.34M 168M199M 9 ( 8%) ⚡- 2.8% ± 0.6%
    instructions 510M ± 359 510M510M 1 ( 1%) ⚡- 1.0% ± 0.0%
    cache_references 3.21M ± 178K 2.97M … 4.55M 6 ( 5%) + 1.7% ± 1.7%
    cache_misses 31.2K ± 4.33K 25.0K52.0K 6 ( 5%) ⚡- 62.7% ± 2.8%
    branch_misses 1.99M ± 1.35K 1.99M … 2.00M 5 ( 5%) - 0.4% ± 0.0%
    Benchmark 4 (111 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 45.5ms ± 927us 44.5ms51.4ms 4 ( 4%) ⚡- 2.2% ± 0.5%
    peak_rss 24.1MB ± 73.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 170M ± 2.85M 168M … 190M 12 (11%) ⚡- 2.7% ± 0.4%
    instructions 515M ± 249 515M … 515M 0 ( 0%) - 0.1% ± 0.0%
    cache_references 3.27M ± 107K 3.07M … 3.94M 2 ( 2%) 💩+ 4.2% ± 1.3%
    cache_misses 109K ± 5.80K 98.7K 139K 4 ( 4%) 💩+139.0% ± 7.4%
    branch_misses 2.00M ± 4.10K 1.99M … 2.03M 4 ( 4%) - 0.2% ± 0.0%
    Benchmark 1 (181 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz
    wall_time 45.1ms ± 454us 44.3ms46.8ms 5 ( 5%) ⚡- 2.5% ± 0.6%
    peak_rss 24.1MB ± 73.0KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 169M ± 1.27M 168M … 176M 10 ( 9%) ⚡- 3.0% ± 0.5%
    instructions 515M ± 295 515M … 515M 0 ( 0%) - 0.1% ± 0.0%
    cache_references 3.20M ± 81.1K 2.97M … 3.38M 0 ( 0%) + 1.5% ± 1.4%
    cache_misses 47.5K ± 8.09K 36.6K72.1K 1 ( 1%) ⚡- 43.2% ± 3.2%
    branch_misses 2.00M ± 1.86K 1.99M … 2.00M 2 ( 2%) - 0.2% ± 0.1%
    Benchmark 1 (182 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.6ms ± 433us 26.8ms … 31.4ms 6 ( 3%) 0%
    peak_rss 24.1MB ± 58.9KB 23.9MB … 24.1MB 45 (25%) 0%
    cpu_cycles 90.0M ± 919K 89.5M99.7M 15 ( 8%) 0%
    instructions 239M ± 322 239M … 239M 3 ( 2%) 0%
    cache_references 2.28M ± 53.6K 2.19M … 2.70M 4 ( 2%) 0%
    cache_misses 43.5K ± 2.44K 40.3K … 64.5K 5 ( 3%) 0%
    branch_misses 1.05M ± 1.70K 1.05M … 1.06M 4 ( 2%) 0%
    wall_time 27.5ms ± 474us 26.7ms … 31.4ms 7 ( 4%) 0%
    peak_rss 24.1MB ± 48.1KB 24.0MB … 24.1MB 29 (16%) 0%
    cpu_cycles 90.0M ± 1.27M 89.4M 102M 17 ( 9%) 0%
    instructions 239M ± 253 239M … 239M 3 ( 2%) 0%
    cache_references 2.28M ± 57.3K 2.20M … 2.78M 5 ( 3%) 0%
    cache_misses 48.4K ± 2.85K 43.3K … 68.4K 5 ( 3%) 0%
    branch_misses 1.05M ± 1.62K 1.05M … 1.06M 2 ( 1%) 0%
    Benchmark 2 (186 runs): /tmp/loop-plus-match rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.0ms ± 304us 26.3ms … 28.6ms 3 ( 2%) ⚡- 2.5% ± 0.3%
    peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 87.3M ± 654K 86.8M … 91.7M 19 (10%) ⚡- 3.0% ± 0.2%
    instructions 248M ± 266 248M … 248M 2 ( 1%) 💩+ 3.8% ± 0.0%
    cache_references 2.25M ± 55.5K 2.17M … 2.75M 5 ( 3%) - 1.2% ± 0.5%
    cache_misses 45.4K ± 2.26K 40.6K … 52.2K 1 ( 1%) 💩+ 4.2% ± 1.1%
    branch_misses 1.05M ± 1.67K 1.04M … 1.05M 4 ( 2%) - 0.5% ± 0.0%
    Benchmark 3 (184 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz
    wall_time 26.9ms ± 335us 26.3ms … 29.5ms 4 ( 2%) ⚡- 2.4% ± 0.3%
    peak_rss 24.1MB ± 63.3KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
    cpu_cycles 87.2M ± 732K 86.8M … 94.7M 20 (11%) ⚡- 3.1% ± 0.2%
    instructions 248M ± 262 248M … 248M 0 ( 0%) 💩+ 3.8% ± 0.0%
    cache_references 2.26M ± 85.8K 2.18M … 2.97M 5 ( 3%) - 0.8% ± 0.7%
    cache_misses 52.0K ± 2.13K 47.6K … 67.0K 5 ( 3%) 💩+ 7.4% ± 1.1%
    branch_misses 1.05M ± 1.58K 1.04M … 1.05M 2 ( 1%) - 0.5% ± 0.0%
    Benchmark 3 (182 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.2ms ± 817us 26.3ms33.7ms 16 ( 9%) - 1.7% ± 0.5%
    peak_rss 24.1MB ± 55.9KB 23.9MB … 24.1MB 39 (21%) + 0.0% ± 0.0%
    cpu_cycles 87.4M ± 1.95M 86.4M103M 23 (13%) - 2.9% ± 0.3%
    instructions 248M ± 278 248M248M 2 ( 1%) 💩+ 3.6% ± 0.0%
    cache_references 2.28M ± 112K 2.18M … 2.93M 16 ( 9%) + 0.0% ± 0.8%
    cache_misses 47.9K ± 11.1K 40.5K … 168K 15 ( 8%) 💩+ 10.0% ± 3.8%
    branch_misses 1.05M ± 2.32K 1.04M … 1.06M 4 ( 2%) - 0.4% ± 0.0%
    wall_time 27.4ms ± 407us 26.7ms31.6ms 3 ( 2%) - 0.4% ± 0.3%
    peak_rss 24.1MB ± 64.3KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
    cpu_cycles 89.3M ± 985K 88.8M101M 13 ( 7%) - 0.8% ± 0.3%
    instructions 254M ± 326 254M254M 2 ( 1%) 💩+ 6.1% ± 0.0%
    cache_references 2.26M ± 68.6K 2.18M … 2.75M 3 ( 2%) - 0.9% ± 0.6%
    cache_misses 55.1K ± 2.58K 50.5K … 68.0K 4 ( 2%) 💩+ 13.8% ± 1.2%
    branch_misses 1.05M ± 2.09K 1.04M … 1.05M 0 ( 0%) - 0.4% ± 0.0%
    Benchmark 4 (182 runs): /tmp/labeled-match-fast rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.5ms ± 252us 26.9ms28.5ms 2 ( 1%) - 0.7% ± 0.3%
    peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 44 (24%) - 0.0% ± 0.1%
    cpu_cycles 89.5M ± 477K 89.1M … 93.9M 18 (10%) - 0.6% ± 0.2%
    instructions 253M ± 306 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0%
    cache_references 2.27M ± 67.3K 2.19M … 3.00M 3 ( 2%) - 0.5% ± 0.5%
    cache_misses 66.5K ± 2.10K 59.8K73.6K 2 ( 1%) 💩+ 52.7% ± 1.1%
    branch_misses 1.05M ± 1.22K 1.05M … 1.06M 1 ( 1%) - 0.2% ± 0.0%
    wall_time 27.5ms ± 703us 26.8ms33.7ms 6 ( 3%) + 0.0% ± 0.4%
    peak_rss 24.1MB ± 62.8KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.0%
    cpu_cycles 89.8M ± 1.93M 89.1M … 108M 20 (11%) - 0.3% ± 0.4%
    instructions 253M ± 257 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0%
    cache_references 2.26M ± 101K 2.18M … 3.05M 7 ( 4%) - 1.1% ± 0.7%
    cache_misses 50.2K ± 2.61K 45.6K66.8K 5 ( 3%) 💩+ 3.8% ± 1.2%
    branch_misses 1.05M ± 1.14K 1.05M … 1.06M 2 ( 1%) - 0.2% ± 0.0%
    ```
  2. folkertdev renamed this gist Oct 22, 2024. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  3. folkertdev created this gist Oct 22, 2024.
    137 changes: 137 additions & 0 deletions gistfile1.txt
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,137 @@
    # zlib-rs labeled match benchmarks

    ## build the toolchain

    A proof of concept implementation can be found at https://github.com/trifectatechfoundation/rust/tree/labeled-match. Build it with `./x build`, and then set up the [toolchain](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain). Now `cargo +stage1 build` should use a compiler with `labeled-match` available.

    ## run the benchmark

    ```
    git clone https://github.com/trifectatechfoundation/zlib-rs.git
    git checkout len-as-match
    sh replicate-labeled-match-benchmarks.sh
    ```

    this runs 4 benchmarks

    - baseline: the current zlib-rs main branch approach using tail calls
    - loop-plus-match: standard approach using a loop and match; suffers from branch misprediction
    - labeled-match-len: the `len` function and friends now use labeled match
    - labeled-match-fast: the `len` and friends, and `inflate_fast_help` functions now use labeld match

    ## results

    Mostly what we see is that labeled match gives significant speedups for small chunk sizes. For larger chunk sizes, the results are less clear. I believe really the result is net-zero, but we need clearly need to perform some further tuning.

    Note in particular how `loop-plus-match` works well for small _and_ big inputs, but terribly for medium inputs.

    ```
    Benchmark 1 (69 runs): /tmp/uncompress-baseline rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 72.5ms ± 2.46ms 71.0ms … 91.0ms 7 (10%) 0%
    peak_rss 24.1MB ± 77.8KB 23.9MB … 24.1MB 0 ( 0%) 0%
    cpu_cycles 294M ± 9.90M 291M … 371M 7 (10%) 0%
    instructions 914M ± 274 914M … 914M 0 ( 0%) 0%
    cache_references 3.04M ± 519K 2.69M … 6.09M 4 ( 6%) 0%
    cache_misses 156K ± 39.1K 126K … 463K 1 ( 1%) 0%
    branch_misses 4.09M ± 10.8K 4.08M … 4.17M 5 ( 7%) 0%
    Benchmark 2 (71 runs): /tmp/loop-plus-match rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 70.9ms ± 817us 69.9ms … 76.2ms 2 ( 3%) ⚡- 2.1% ± 0.8%
    peak_rss 24.1MB ± 58.4KB 24.0MB … 24.1MB 0 ( 0%) + 0.1% ± 0.1%
    cpu_cycles 287M ± 2.33M 285M … 305M 3 ( 4%) ⚡- 2.5% ± 0.8%
    instructions 792M ± 300 792M … 792M 1 ( 1%) ⚡- 13.4% ± 0.0%
    cache_references 2.98M ± 89.8K 2.78M … 3.33M 1 ( 1%) - 1.9% ± 4.0%
    cache_misses 154K ± 15.5K 115K … 207K 1 ( 1%) - 1.6% ± 6.3%
    branch_misses 4.10M ± 3.85K 4.09M … 4.11M 1 ( 1%) + 0.2% ± 0.1%
    Benchmark 3 (79 runs): /tmp/labeled-match-len rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 63.9ms ± 1.66ms 62.7ms … 74.9ms 5 ( 6%) ⚡- 11.8% ± 0.9%
    peak_rss 24.1MB ± 64.6KB 23.9MB … 24.1MB 18 (23%) + 0.1% ± 0.1%
    cpu_cycles 254M ± 6.97M 252M … 307M 9 (11%) ⚡- 13.6% ± 0.9%
    instructions 710M ± 259 710M … 710M 0 ( 0%) ⚡- 22.3% ± 0.0%
    cache_references 3.05M ± 736K 2.79M … 9.38M 5 ( 6%) + 0.4% ± 6.8%
    cache_misses 134K ± 14.6K 111K … 176K 1 ( 1%) ⚡- 14.0% ± 5.9%
    branch_misses 4.08M ± 5.27K 4.08M … 4.11M 3 ( 4%) - 0.1% ± 0.1%
    Benchmark 4 (81 runs): /tmp/labeled-match-fast rs-chunked 4 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 62.1ms ± 554us 61.4ms … 64.4ms 3 ( 4%) ⚡- 14.3% ± 0.8%
    peak_rss 24.1MB ± 63.0KB 23.9MB … 24.1MB 0 ( 0%) + 0.0% ± 0.1%
    cpu_cycles 246M ± 1.75M 245M … 257M 6 ( 7%) ⚡- 16.3% ± 0.7%
    instructions 689M ± 258 689M … 689M 0 ( 0%) ⚡- 24.6% ± 0.0%
    cache_references 3.02M ± 336K 2.77M … 5.78M 4 ( 5%) - 0.6% ± 4.5%
    cache_misses 128K ± 16.9K 101K … 186K 5 ( 6%) ⚡- 17.7% ± 6.0%
    branch_misses 4.08M ± 5.14K 4.08M … 4.10M 2 ( 2%) - 0.1% ± 0.1%
    Benchmark 1 (108 runs): /tmp/uncompress-baseline rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 46.5ms ± 649us 45.3ms … 50.1ms 2 ( 2%) 0%
    peak_rss 24.1MB ± 57.0KB 24.0MB … 24.1MB 27 (25%) 0%
    cpu_cycles 174M ± 1.71M 173M … 187M 9 ( 8%) 0%
    instructions 516M ± 277 516M … 516M 1 ( 1%) 0%
    cache_references 3.14M ± 181K 2.93M … 4.33M 3 ( 3%) 0%
    cache_misses 45.8K ± 17.2K 29.9K … 93.9K 7 ( 6%) 0%
    branch_misses 2.00M ± 2.42K 2.00M … 2.02M 2 ( 2%) 0%
    Benchmark 2 (78 runs): /tmp/loop-plus-match rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 64.8ms ± 705us 63.8ms … 69.2ms 1 ( 1%) 💩+ 39.3% ± 0.4%
    peak_rss 24.1MB ± 76.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.1% ± 0.1%
    cpu_cycles 257M ± 2.22M 256M … 273M 9 (12%) 💩+ 47.5% ± 0.3%
    instructions 720M ± 385 720M … 720M 1 ( 1%) 💩+ 39.7% ± 0.0%
    cache_references 3.18M ± 92.4K 2.99M … 3.55M 2 ( 3%) + 1.5% ± 1.4%
    cache_misses 37.4K ± 17.6K 26.5K … 174K 7 ( 9%) ⚡- 18.4% ± 11.1%
    branch_misses 2.00M ± 2.47K 2.00M … 2.01M 3 ( 4%) - 0.2% ± 0.0%
    Benchmark 3 (104 runs): /tmp/labeled-match-len rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 48.0ms ± 500us 47.2ms … 50.3ms 1 ( 1%) 💩+ 3.2% ± 0.3%
    peak_rss 24.1MB ± 59.7KB 24.0MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 181M ± 1.18M 180M … 187M 6 ( 6%) 💩+ 3.8% ± 0.2%
    instructions 586M ± 354 586M … 586M 4 ( 4%) 💩+ 13.6% ± 0.0%
    cache_references 3.25M ± 202K 3.02M … 4.80M 6 ( 6%) 💩+ 3.5% ± 1.6%
    cache_misses 48.6K ± 8.60K 31.5K … 83.5K 5 ( 5%) + 6.0% ± 8.0%
    branch_misses 2.00M ± 1.77K 2.00M … 2.01M 5 ( 5%) - 0.2% ± 0.0%
    Benchmark 4 (110 runs): /tmp/labeled-match-fast rs-chunked 7 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 45.5ms ± 927us 44.5ms … 51.4ms 4 ( 4%) ⚡- 2.2% ± 0.5%
    peak_rss 24.1MB ± 73.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 170M ± 2.85M 168M … 190M 12 (11%) ⚡- 2.7% ± 0.4%
    instructions 515M ± 249 515M … 515M 0 ( 0%) - 0.1% ± 0.0%
    cache_references 3.27M ± 107K 3.07M … 3.94M 2 ( 2%) 💩+ 4.2% ± 1.3%
    cache_misses 109K ± 5.80K 98.7K … 139K 4 ( 4%) 💩+139.0% ± 7.4%
    branch_misses 2.00M ± 4.10K 1.99M … 2.03M 4 ( 4%) - 0.2% ± 0.0%
    Benchmark 1 (181 runs): /tmp/uncompress-baseline rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.6ms ± 433us 26.8ms … 31.4ms 6 ( 3%) 0%
    peak_rss 24.1MB ± 58.9KB 23.9MB … 24.1MB 45 (25%) 0%
    cpu_cycles 90.0M ± 919K 89.5M … 99.7M 15 ( 8%) 0%
    instructions 239M ± 322 239M … 239M 3 ( 2%) 0%
    cache_references 2.28M ± 53.6K 2.19M … 2.70M 4 ( 2%) 0%
    cache_misses 43.5K ± 2.44K 40.3K … 64.5K 5 ( 3%) 0%
    branch_misses 1.05M ± 1.70K 1.05M … 1.06M 4 ( 2%) 0%
    Benchmark 2 (186 runs): /tmp/loop-plus-match rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.0ms ± 304us 26.3ms … 28.6ms 3 ( 2%) ⚡- 2.5% ± 0.3%
    peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 0 ( 0%) - 0.0% ± 0.1%
    cpu_cycles 87.3M ± 654K 86.8M … 91.7M 19 (10%) ⚡- 3.0% ± 0.2%
    instructions 248M ± 266 248M … 248M 2 ( 1%) 💩+ 3.8% ± 0.0%
    cache_references 2.25M ± 55.5K 2.17M … 2.75M 5 ( 3%) - 1.2% ± 0.5%
    cache_misses 45.4K ± 2.26K 40.6K … 52.2K 1 ( 1%) 💩+ 4.2% ± 1.1%
    branch_misses 1.05M ± 1.67K 1.04M … 1.05M 4 ( 2%) - 0.5% ± 0.0%
    Benchmark 3 (184 runs): /tmp/labeled-match-len rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.2ms ± 817us 26.3ms … 33.7ms 16 ( 9%) ⚡- 1.7% ± 0.5%
    peak_rss 24.1MB ± 55.9KB 23.9MB … 24.1MB 39 (21%) + 0.0% ± 0.0%
    cpu_cycles 87.4M ± 1.95M 86.4M … 103M 23 (13%) ⚡- 2.9% ± 0.3%
    instructions 248M ± 278 248M … 248M 2 ( 1%) 💩+ 3.6% ± 0.0%
    cache_references 2.28M ± 112K 2.18M … 2.93M 16 ( 9%) + 0.0% ± 0.8%
    cache_misses 47.9K ± 11.1K 40.5K … 168K 15 ( 8%) 💩+ 10.0% ± 3.8%
    branch_misses 1.05M ± 2.32K 1.04M … 1.06M 4 ( 2%) - 0.4% ± 0.0%
    Benchmark 4 (182 runs): /tmp/labeled-match-fast rs-chunked 16 silesia-small.tar.gz
    measurement mean ± σ min … max outliers delta
    wall_time 27.5ms ± 252us 26.9ms … 28.5ms 2 ( 1%) - 0.7% ± 0.3%
    peak_rss 24.1MB ± 64.1KB 23.9MB … 24.1MB 44 (24%) - 0.0% ± 0.1%
    cpu_cycles 89.5M ± 477K 89.1M … 93.9M 18 (10%) - 0.6% ± 0.2%
    instructions 253M ± 306 253M … 253M 1 ( 1%) 💩+ 5.7% ± 0.0%
    cache_references 2.27M ± 67.3K 2.19M … 3.00M 3 ( 2%) - 0.5% ± 0.5%
    cache_misses 66.5K ± 2.10K 59.8K … 73.6K 2 ( 1%) 💩+ 52.7% ± 1.1%
    branch_misses 1.05M ± 1.22K 1.05M … 1.06M 1 ( 1%) - 0.2% ± 0.0%
    ```