/*
Benchmark for an Eigen change. The review comment is reproduced here for
posterity:

Here's a synthetic "benchmark" which I _believe_ shows the difference:
https://gist.github.com/bitonic/2d09df858ba2233b7f472f5f8c0512b4 .

I say that I believe it exhibits the difference because it shows the runtime
differences that I'd expect, with some caveats (see the comments on the number
of instructions below). However, I have not inspected the assembly manually to
check that the code varies in the way I'd expect, which would be a requirement
to ensure that things change in the way we expect. That is a bit more labor
intensive, and while I might do it, I don't have time to do it right now.

The code inline:

```cpp
#include <Eigen/Core>
#include <cstddef>
#include <iostream>

using ArrType = Eigen::Array<float, 169, 1>;

__attribute__((noinline)) static void print_array(const char* name, const ArrType& arr) {
  std::cout << name << ": " << arr << std::endl;
}

__attribute__((noinline)) static void test_packet(float x) {
  ArrType xs(x);
  ArrType ys(0.0f);
  print_array("xs", xs);
  print_array("ys", ys);
  for (size_t i = 0; i < 100000000; i++) {
    if (i % 2 == 0) {
      ys += xs;
    } else {
      ys -= xs;
    }
  }
  print_array("ys", ys);
}

int main() {
  test_packet(5.0f);
  return 0;
}
```

I compile it with

```
% clang++ -std=c++20 -I. -Wall -Werror -mavx2 -O3 test-avx2.cpp -o test-avx2
```

in the `eigen` repo. We just add to and subtract from an array which is 169
elements wide. What I realized is that this change only affects arrays of
static size -- which was the case in the proprietary code this perf
improvement came up in. In fact, I am using the same size I was using in that
code -- 169. We might want to extend it to `Dynamic` (see the stop condition
on the same line for why it does not work with `Dynamic`).
If we compile with the improvement, this is the perf stat:

```
 Performance counter stats for './test-avx2-new':

          2,526.48 msec task-clock                #    1.000 CPUs utilized
                 5      context-switches          #    0.002 K/sec
                 2      cpu-migrations            #    0.001 K/sec
                89      page-faults               #    0.035 K/sec
     9,079,877,582      cycles                    #    3.594 GHz
    12,968,042,398      instructions              #    1.43  insn per cycle
       203,464,968      branches                  #   80.533 M/sec
            91,320      branch-misses             #    0.04% of all branches

       2.527443809 seconds time elapsed

       2.525190000 seconds user
       0.001999000 seconds sys
```

Numbers of note: 2.5 seconds runtime, 13B instructions.

With the old code:

```
          3,704.16 msec task-clock                #    0.999 CPUs utilized
                 8      context-switches          #    0.002 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                86      page-faults               #    0.023 K/sec
    13,027,668,290      cycles                    #    3.517 GHz
    38,871,811,483      instructions              #    2.98  insn per cycle
     2,904,199,848      branches                  #  784.037 M/sec
           139,444      branch-misses             #    0.00% of all branches

       3.706167382 seconds time elapsed

       3.702763000 seconds user
       0.002999000 seconds sys
```

3.7 seconds runtime (i.e., the new code is a ~1.5x speedup), 40B instructions.

I actually do not have a great explanation for the 3x jump in instructions; I
was expecting roughly a 2x jump. Again, I've learned not to make definitive
statements when it comes to micro-benchmarks unless I have checked the
assembly, but I think the above already gives some confidence that the code
does what I think it does.
*/

#include <Eigen/Core>
#include <cstddef>
#include <iostream>

// Fixed-size (169-element) float array; the change under test only applies
// to statically sized arrays.
using ArrType = Eigen::Array<float, 169, 1>;

__attribute__((noinline)) static void print_array(const char* name, const ArrType& arr) {
  std::cout << name << ": " << arr << std::endl;
}

__attribute__((noinline)) static void test_packet(float x) {
  ArrType xs(x);
  ArrType ys(0.0f);
  print_array("xs", xs);
  print_array("ys", ys);
  // Alternately add and subtract the same array, 100M iterations.
  for (size_t i = 0; i < 100000000; i++) {
    if (i % 2 == 0) {
      ys += xs;
    } else {
      ys -= xs;
    }
  }
  print_array("ys", ys);
}

int main() {
  test_packet(5.0f);
  return 0;
}