Hi, this is Tecca. This post is a continuation of the last one, comparing the relative performance of several volume-scaling algorithms on the same machine across AArch64 and x86_64 systems.
In this post I will present the benchmark results for each algorithm mentioned in the last post, and go over some questions related to the algorithms.
We want to measure the performance of each algorithm specifically; nothing else should get in the way, so that the elapsed time reflects only the algorithm itself.
To do that, we need to make a small modification to each file.
#include <time.h>

clock_t start_t, end_t;

start_t = clock();
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
for (x = 0; x < SAMPLES; x++) {
    out[x] = scale_sample(in[x], VOLUME);
}
end_t = clock();

// elapsed CPU time in seconds is (end - start) / CLOCKS_PER_SEC
printf("Time elapsed: %f\n", ((double)(end_t - start_t)) / CLOCKS_PER_SEC);
All the other code goes below, unchanged.
The key point is to wrap only the for loop that performs the scaling of the samples, so that the elapsed time reflects the algorithm itself as accurately as possible.
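Note that clock() measures CPU time consumed by the process rather than wall-clock time, which is fine for this purpose. As a side note, here is a minimal sketch of the same measurement done with clock_gettime(CLOCK_MONOTONIC) instead (assuming a POSIX system; the loop and variables are the same fragment as above):

#include <time.h>

struct timespec ts_start, ts_end;

clock_gettime(CLOCK_MONOTONIC, &ts_start);
// same scaling loop as above
for (x = 0; x < SAMPLES; x++) {
    out[x] = scale_sample(in[x], VOLUME);
}
clock_gettime(CLOCK_MONOTONIC, &ts_end);

// wall-clock seconds between the two readings
printf("Time elapsed: %f\n",
       (ts_end.tv_sec - ts_start.tv_sec) +
       (ts_end.tv_nsec - ts_start.tv_nsec) / 1e9);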
Testing conditions
- Each file was compiled with gcc using:
gcc -g -O2 -fopt-info-vec-all -fno-lto -fno-tree-vectorize -fno-tree-loop-vectorize
- 1,500,000,000 samples were defined for each algorithm to test.
- Each algorithm/file was tested 15 times.
Example: running vol0 on the x86_64 system
Benchmarking results on the x86_64 system
Run | vol0 | vol1 | vol2 |
---|---|---|---|
1 | 2.696961 | 2.655414 | 3.370985 |
2 | 2.666739 | 2.649839 | 3.302379 |
3 | 2.671609 | 2.648382 | 3.30752 |
4 | 2.664414 | 2.653646 | 3.327133 |
5 | 2.660572 | 2.679785 | 3.285885 |
6 | 2.671767 | 2.639657 | 3.302924 |
7 | 2.685051 | 2.633608 | 3.396682 |
8 | 2.674105 | 2.663188 | 3.352611 |
9 | 2.675021 | 2.663653 | 3.331424 |
10 | 2.677584 | 2.651484 | 3.291357 |
11 | 2.655889 | 2.653947 | 3.314665 |
12 | 2.659099 | 2.634871 | 3.287128 |
13 | 2.591898 | 2.66246 | 3.283349 |
14 | 2.645027 | 2.662764 | 3.327623 |
15 | 2.582574 | 2.662238 | 3.298565 |
average | 2.658554 | 2.654329067 | 3.318682 |
median | 2.666739 | 2.653947 | 3.30752 |
On the x86_64 system the vol1 algorithm is the fastest, vol0 comes second, and vol2 is the slowest of the three. This is what I was expecting in the previous post: vol2 pre-calculates all the results and then looks up the answer for each input value, which costs more than computing the value directly, and vol1 beats vol0 by a little by using fixed-point calculation, which avoids the cost of repeatedly casting between integer and floating point.
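For reference, here is a minimal sketch of the two per-sample approaches; the exact vol0/vol1 code may differ in details, but the idea is the same:

#include <stdint.h>

// Naive (vol0-style): convert to floating point, multiply, and convert
// back for every single sample.
static inline int16_t scale_sample_float(int16_t sample, int volume) {
    return (int16_t)((float)sample * volume / 100.0f);
}

// Fixed-point (vol1-style): the volume is converted once into an integer
// factor (here scaled by 256), so each sample needs only an integer
// multiply and a shift -- no int/float conversions inside the loop.
static inline int16_t scale_sample_fixed(int16_t sample, int32_t factor) {
    return (int16_t)((sample * factor) >> 8);
}

// computed once, outside the sample loop:
// int32_t factor = (int32_t)(VOLUME / 100.0 * 256.0); // e.g. 75 -> 192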
We will additionally be testing vol4 and vol5 on the AArch64 system: because they utilize SIMD instructions, these programs are written specifically for AArch64 and cannot be run on x86_64.
Benchmarking results on the AArch64 system
Run | vol0 | vol1 | vol2 | vol4 | vol5 |
---|---|---|---|---|---|
1 | 4.980222 | 4.313869 | 10.567804 | 3.481179 | 4.04965 |
2 | 4.968114 | 4.323427 | 10.560218 | 3.456832 | 4.045124 |
3 | 4.959307 | 4.30239 | 10.59247 | 3.460554 | 4.045296 |
4 | 4.973315 | 4.325246 | 10.590718 | 3.469896 | 4.028074 |
5 | 4.961954 | 4.31275 | 10.585899 | 3.493608 | 4.039965 |
6 | 4.976137 | 4.320063 | 10.532851 | 3.438831 | 4.040304 |
7 | 4.959057 | 4.349743 | 10.642468 | 3.482111 | 4.055758 |
8 | 4.960718 | 4.317437 | 10.534451 | 3.469382 | 4.150955 |
9 | 4.95485 | 4.336698 | 10.548297 | 3.451517 | 4.085376 |
10 | 4.960642 | 4.329521 | 10.552774 | 3.455712 | 4.022335 |
11 | 4.952321 | 4.332141 | 10.550225 | 3.384459 | 4.041784 |
12 | 4.990002 | 4.334904 | 10.572559 | 3.423148 | 4.058293 |
13 | 4.942643 | 4.326342 | 10.545258 | 3.420771 | 4.096189 |
14 | 4.957297 | 4.317604 | 10.548684 | 3.391878 | 4.020974 |
15 | 4.967439 | 4.317819 | 10.558838 | 3.433111 | 4.062904 |
average | 4.964267867 | 4.323996933 | 10.5655676 | 3.4475326 | 4.056198733 |
median | 4.960718 | 4.323427 | 10.558838 | 3.455712 | 4.045296 |
Looking at the results from running vol0, vol1, vol2, vol4, and vol5:
In the previous post, we assumed the algorithms that use SIMD instructions (vol4 and vol5) would perform faster than the others, and the results confirm it: vol4 and vol5 clearly outperform the rest. The two also come very close to each other; I believe this is because inline assembly and compiler intrinsics compile down to nearly the same SIMD instructions and are therefore almost equally fast.
vol1 still performs better than vol0 on AArch64 as well.
The vol2 algorithm, however, became significantly slower than all the others on this machine. A likely explanation is that every sample now requires a load from the pre-calculated table in memory, and the 65536-entry (128 KB) table is too large to stay in the fastest cache levels, so this CPU spends much of its time waiting on those lookups.
QUESTIONS SPECIFIC TO ALGORITHMS
Q: Why is this code needed?
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
printf("Result: %d\n", ttl);
This loop adds up all of the output samples (modulo 1000) and prints the total. Its real purpose is to make sure the scaled results in out[] are actually used: if the output were never read, the compiler would be free to optimize the entire scaling loop away as dead code, and we would be timing nothing. The printed total also acts as a quick checksum for comparing the output of the different implementations.
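As a hypothetical illustration (not code from the lab) of what can happen without such a use of the output:

#include <stdint.h>

#define SAMPLES 1000

static inline int16_t scale_sample(int16_t s) {
    return (int16_t)(s * 3 / 4); // pure computation, no side effects
}

// out[] is written but never read afterward. With optimization enabled
// (gcc -O2), the compiler is free to delete the whole loop as dead
// stores -- and any timing wrapped around it would then measure nothing.
void scale_unused(const int16_t *in) {
    int16_t out[SAMPLES];
    for (int x = 0; x < SAMPLES; x++)
        out[x] = scale_sample(in[x]);
}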
Q: What does this next block do? Why?
ifeq ($(shell echo | gcc -E -dM - | grep -c aarch64),1)
BINARIES:=vol0 vol1 vol2 vol3 vol4 vol5
else
BINARIES:=vol0 vol1 vol2 vol3
endif
This block in the Makefile detects which architecture the build is running on and sets BINARIES accordingly, by asking gcc to dump its predefined macros and counting how many lines mention aarch64.
Since vol4.c and vol5.c are designed so that they can only run on AArch64, it is important to remove them from BINARIES when the user is not on an AArch64 system, to prevent them from being built and causing compile errors.
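You can run the same probe by hand. On an AArch64 machine, gcc's predefined macros should include a line like the following (exact output may vary between gcc versions), which is what the grep -c counts:

$ echo | gcc -E -dM - | grep aarch64
#define __aarch64__ 1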
Q: What is the purpose of the cast to uint16_t in the next line?
precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
The loop variable x presumably iterates over every possible signed 16-bit sample value (-32768 to 32767), but a negative number cannot be used directly as an array index. Casting x to uint16_t reinterprets it as an unsigned 16-bit value: the negative values wrap around into the upper half of the 0-65535 range, so every possible sample gets a valid slot in the 65536-entry precalc table. It also keeps the index the same size (16 bits) regardless of the system.
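Here is a minimal sketch of the whole lookup-table idea (the actual vol2.c may differ in details):

#include <stdint.h>

static int16_t precalc[65536]; // one entry for every possible int16_t value

void build_table(int volume) {
    // Iterate over every possible signed 16-bit sample value.
    for (int x = -32768; x <= 32767; x++) {
        // (uint16_t)x maps -32768..-1 onto 32768..65535, a valid index.
        precalc[(uint16_t)x] = (int16_t)((float)x * volume / 100.0);
    }
}

int16_t scale_sample_lookup(int16_t sample) {
    // The same cast at lookup time finds the matching entry.
    return precalc[(uint16_t)sample];
}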
Q: What's the point of this dummy program? How does it help with benchmarking?
The dummy program (vol3.c) does everything except the scaling algorithm itself: it creates the samples and counts the results, but applies no volume scaling. This makes it a baseline in terms of processing time; subtracting its time from the other programs' times isolates the cost of each algorithm.
Q: What is the purpose of these next few lines?
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES;
These lines set up the int16_t cursor pointers: in_cursor and out_cursor point at the start of the in and out arrays respectively, while limit points one element past the end of the input, so the processing loop knows where to stop.
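A sketch of how these cursors then drive the processing loop (assuming 8 samples per pass, as discussed in the next question):

// process 8 int16_t samples per pass until the input is exhausted
while (in_cursor < limit) {
    // load 8 samples from in_cursor, scale them with SIMD,
    // and store 8 results to out_cursor...
    in_cursor += 8;
    out_cursor += 8;
}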
Q: Are the results usable? Are they accurate?
printf("Result: %d\n", ttl);
vol4 and vol5 print the same result, which is expected since they perform the same computation. The naive implementation prints a different total, but that alone does not necessarily mean the results are inaccurate: the checksum is a sum modulo 1000, so even tiny rounding differences between floating-point scaling and fixed-point/SIMD scaling will change it. The results are usable as a consistency check between implementations that round the same way; small deviations from the float version are expected.
Q: Why is the increment below 8 instead of 16 or some other value?
Q: Why is this line not needed in the inline assembler version of this program?
in_cursor += 8;
out_cursor += 8;
The cursors are incremented by 8 because each SIMD pass processes 8 samples at a time: a 128-bit vector register holds eight 16-bit (int16_t) lanes, so one load/multiply/store sequence handles 8 samples, and the pointers must advance by 8 elements to reach the next batch.
In the inline assembler version this line is not needed because the loads and stores use post-index addressing (e.g. ldr q0, [%[in]], #16), which updates the pointer register automatically as part of the instruction. With the intrinsics written in C, we have to advance the cursors explicitly.
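As an illustration (not necessarily the exact vol4.c code), the inline assembler loop body might look like this; the #16 post-index advances each pointer by 16 bytes (8 samples) as a side effect of the load and store:

// one pass of 8 samples (AArch64); v1.8h already holds the volume factor
__asm__("ldr     q0, [%[in]], #16      \n\t"  // load 8 samples, in += 16 bytes
        "sqdmulh v0.8h, v0.8h, v1.8h   \n\t"  // fixed-point multiply, 8 lanes
        "str     q0, [%[out]], #16     \n\t"  // store 8 results, out += 16 bytes
        : [in] "+r"(in_cursor), [out] "+r"(out_cursor)
        :
        : "v0", "memory");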
Q: What does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
dup
duplicates (broadcasts) the scaling factor from the general-purpose register holding vol_int into all eight 16-bit lanes of the 128-bit vector register v1. Afterwards every lane of v1.8h contains the same volume factor, so a single SIMD multiply can scale 8 samples at once.
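For comparison, the compiler-intrinsics counterpart of this instruction (the approach vol5 takes) looks like this:

#include <arm_neon.h>

// vdupq_n_s16 broadcasts one int16_t value into all eight 16-bit lanes
// of a 128-bit vector -- the intrinsic equivalent of "dup v1.8h, w0".
int16x8_t vol_vector = vdupq_n_s16((int16_t)vol_int); // vol_int as in the program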
Conclusion
In the end, the performance of each algorithm was measured to test the assumptions made in the part-1 post of this lab.
As expected, the algorithms that use SIMD instructions (vol4 and vol5 in this case) outperform the others, since they are best at processing multiple samples simultaneously.