When I first started programming, one of the first things I wanted to do was optimize. No, really. I believe it’s a core tenet of programming: it’s fun, exciting, and makes programming much more than gluing frameworks together.
That said, most programmers struggle with basic optimization ideas. Of course, most of those interested in the subject know the bare basics: algorithmic complexity, timing your code before optimizing, and avoiding memory leaks. However, serious nitty-gritty optimization is no longer a major part of most developers’ lives, thanks to looser technological constraints, limited time, and the constant push for new features, so many lack any real experience with it at all.
Therefore, this article is part of a series that hopes to act as an introduction for those working above the embedded environment, using C++ on an Intel architecture, on Windows, with MSVC.
Starting Code
Two sorting algorithms, Bubble Sort and Exchange Sort, will be compared to find out which one is faster. The code is as follows:
#include <vector>
#include <cstdlib>
#include <cstdint>
void BubbleSort(const uint32_t array_size) {
//Fill Vector
auto vec = std::vector<uint32_t>(array_size);
for (auto& i : vec) {
i = std::rand();
}; //vector
//Sort Vector
for (uint32_t i = 0; i < array_size - 1; ++i) {
for (uint32_t j = 0; j < array_size - i - 1; ++j)
if (vec[j] > vec[j + 1]) {
auto temp = vec[j];
vec[j] = vec[j + 1];
vec[j + 1] = temp;
}; //if neighbors out of order
}; //for every idx
}; //BubbleSort
void ExchangeSort(uint32_t array_size) {
//Fill Vector
auto vec = std::vector<uint32_t>(array_size);
for (auto& i : vec) {
i = std::rand();
}; //vector
//Sort Vector
for (uint_fast32_t i = 0; i < array_size-1; ++i) {
for (uint_fast32_t j = (i + 1); j < array_size; ++j)
{
if (vec[i] < vec[j])
{
auto temp = vec[i];
vec[i] = vec[j];
vec[j] = temp;
};//if i < j
}; //for every next variable
}; //for every idx
}; //ExchangeSort
int main() {
BubbleSort(7000);
ExchangeSort(7000);
}; //main
Assembly
The first thing to know is the layer below the high-level language: assembly. While assembly is not strictly needed to optimize, it’s still important for rough approximations and for extremely niche optimizations. Writing assembly for those niche optimizations won’t be shown here because it’s rarely needed nowadays, and inline assembly isn’t available for x64 targets in MSVC anyhow.
To view the assembly of a program, you must view the disassembly. There are two primary ways to see the disassembly in MSVC. The first is to simply press Ctrl-Alt-D while the debugger is paused at a breakpoint. Keep in mind Ctrl-Alt-D will only show the assembly within that breakpoint’s scope, so place breakpoints appropriately.
The second option is to enable the assembly output in your project settings, as shown in the image below. Keep in mind this option produces far more code than Ctrl-Alt-D because it covers the entire program, which can make it much harder to read.
The main use of reading assembly is approximating which of two functions is faster. Keep in mind one can only approximate to a certain point: judging assembly without running it primarily works for small functions or language features over which the programmer has little control.
Here is the disassembly output using Ctrl-Alt-D for the BubbleSort function:
mov dword ptr [rsp+8],ecx
push rbp
push rdi
sub rsp,238h
lea rbp,[rsp+20h]
lea rdi,[rsp+20h]
mov ecx,56h
mov eax,0CCCCCCCCh
rep stos dword ptr [rdi]
mov ecx,dword ptr [rsp+258h]
mov rax,qword ptr [__security_cookie (07FF6FA562018h)]
xor rax,rbp
mov qword ptr [rbp+200h],rax
lea rcx,[__76EBFBA9_main@cpp (07FF6FA568069h)]
call __CheckForDebuggerJustMyCode (07FF6FA551573h)
mov edx,20h
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::__autoclassinit2 (07FF6FA5516FEh)
lea rcx,[rbp+1E4h]
call std::allocator<unsigned int>::allocator<unsigned int> (07FF6FA5516E0h)
mov ecx,dword ptr [array_size]
mov r8,rax
mov edx,ecx
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::vector<unsigned int,std::allocator<unsigned int> > (07FF6FA5516E5h)
lea rax,[vec]
mov qword ptr [rbp+48h],rax
mov rcx,qword ptr [rbp+48h]
call std::vector<unsigned int,std::allocator<unsigned int> >::_Unchecked_begin (07FF6FA5516B8h)
mov qword ptr [rbp+68h],rax
mov rcx,qword ptr [rbp+48h]
call std::vector<unsigned int,std::allocator<unsigned int> >::_Unchecked_end (07FF6FA551690h)
mov qword ptr [rbp+88h],rax
jmp __$EncStackInitStart+96h (07FF6FA5597C8h)
mov rax,qword ptr [rbp+68h]
add rax,4
mov qword ptr [rbp+68h],rax
mov rax,qword ptr [rbp+88h]
cmp qword ptr [rbp+68h],rax
je __$EncStackInitStart+0BFh (07FF6FA5597F1h)
mov rax,qword ptr [rbp+68h]
mov qword ptr [rbp+0A8h],rax
call qword ptr [__imp_rand (07FF6FA5663F0h)]
mov rcx,qword ptr [rbp+0A8h]
mov dword ptr [rcx],eax
jmp __$EncStackInitStart+8Ah (07FF6FA5597BCh)
mov dword ptr [rbp+0C4h],0
jmp __$EncStackInitStart+0D9h (07FF6FA55980Bh)
mov eax,dword ptr [rbp+0C4h]
inc eax
mov dword ptr [rbp+0C4h],eax
mov eax,dword ptr [array_size]
dec eax
cmp dword ptr [rbp+0C4h],eax
jae __$EncStackInitStart+1C3h (07FF6FA5598F5h)
mov eax,dword ptr [rbp+0C4h]
inc eax
mov dword ptr [rbp+0E4h],eax
jmp __$EncStackInitStart+10Bh (07FF6FA55983Dh)
mov eax,dword ptr [rbp+0E4h]
inc eax
mov dword ptr [rbp+0E4h],eax
mov eax,dword ptr [array_size]
cmp dword ptr [rbp+0E4h],eax
jae __$EncStackInitStart+1BEh (07FF6FA5598F0h)
mov eax,dword ptr [rbp+0C4h]
mov edx,eax
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov qword ptr [rbp+1F8h],rax
mov ecx,dword ptr [rbp+0E4h]
mov edx,ecx
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov eax,dword ptr [rax]
mov rcx,qword ptr [rbp+1F8h]
cmp dword ptr [rcx],eax
jae __$EncStackInitStart+1B9h (07FF6FA5598EBh)
mov eax,dword ptr [rbp+0C4h]
mov edx,eax
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov eax,dword ptr [rax]
mov dword ptr [rbp+104h],eax
mov eax,dword ptr [rbp+0E4h]
mov edx,eax
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov qword ptr [rbp+1F8h],rax
mov ecx,dword ptr [rbp+0C4h]
mov edx,ecx
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov rcx,qword ptr [rbp+1F8h]
mov ecx,dword ptr [rcx]
mov dword ptr [rax],ecx
mov eax,dword ptr [rbp+0E4h]
mov edx,eax
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::operator[] (07FF6FA55162Ch)
mov ecx,dword ptr [rbp+104h]
mov dword ptr [rax],ecx
jmp __$EncStackInitStart+0FDh (07FF6FA55982Fh)
jmp __$EncStackInitStart+0CBh (07FF6FA5597FDh)
lea rcx,[vec]
call std::vector<unsigned int,std::allocator<unsigned int> >::~vector<unsigned int,std::allocator<unsigned int> > (07FF6FA5516F4h)
lea rcx,[rbp-20h]
lea rdx,[string "null pointer cannot point to a @"...+110h (07FF6FA55DEE0h)]
call _RTC_CheckStackVars (07FF6FA5514ABh)
mov rcx,qword ptr [rbp+200h]
xor rcx,rbp
call __security_check_cookie (07FF6FA55129Eh)
lea rsp,[rbp+218h]
pop rdi
pop rbp
ret
The way one approximates is simply by comparing the number of lines between functions, since each line corresponds to one instruction handed to the CPU, disregarding the recurring vector and debugger operations. The vector operations have their own assembly behind each “call,” and the debugger operations are only present because this is a debug build. You can view the vector assembly, if relevant, by stepping into the standard library implementation with the debugger.
Comparing the two functions, Exchange Sort has the smaller line count at 116 vs. 120. However, keep in mind what was stated earlier: the approximation typically only works for small or near-identical functions. Two sorting functions clearly differ in their underlying algorithms even if they’re similar in length. That gives way to the next section: timing.
Timing
Timing, as the old wisdom goes, is the only true way to guarantee your program wins on the speed front. There are only two native things to measure, at least regarding the CPU itself and not peripherals like GPUs: the literal time spent, or the CPU clock cycles used. Keep in mind you should always time the “Release” build of a program. Debug mode inserts extra checks and unoptimized code specifically to make bugs easier to analyze, regardless of performance.
The STL has a standard clock readily available in the chrono header. Of the multiple clock options, high_resolution_clock should be the best choice (on MSVC it is an alias for steady_clock). You can select timing increments from nanoseconds to years.
#include <vector>
#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <chrono>
#include <string>
double bubblesort_time_average;
double exchangesort_time_average;
std::chrono::high_resolution_clock::time_point init_time;
std::chrono::high_resolution_clock::time_point fin_time;
std::vector<uint32_t> vector;
inline void End_Timer() {
fin_time = std::chrono::high_resolution_clock::now();
}; //End_Timer
inline void Start_Timer() {
init_time = std::chrono::high_resolution_clock::now();
}; //Start_Timer
void BubbleSort(const uint32_t array_size) {
//Fill Vector
for (auto& i : vector) {
i = std::rand();
}; //vector
Start_Timer();
//Sort Vector
for (uint32_t i = 0;i<array_size-1;++i) {
for (uint32_t j = 0; j < array_size - i - 1; ++j)
if (vector[j] > vector[j + 1]) {
auto temp = vector[j];
vector[j] = vector[j + 1];
vector[j + 1] = temp;
}; //if neighbors out of order
}; //for every idx
End_Timer();
bubblesort_time_average += std::chrono::duration<double,
std::chrono::seconds::period>(fin_time -
init_time).count();
}; //BubbleSort
void ExchangeSort(uint32_t array_size) {
//Fill Vector
for (auto& i : vector) {
i = std::rand();
}; //vector
Start_Timer();
//Sort Vector
for (uint_fast32_t i = 0; i < array_size-1; ++i) {
for (uint_fast32_t j = (i + 1); j < array_size; ++j)
{
if (vector[i] < vector[j])
{
auto temp = vector[i];
vector[i] = vector[j];
vector[j] = temp;
};//if i < j
}; //for every next variable
}; //for every idx
End_Timer();
exchangesort_time_average += std::chrono::duration<double,
std::chrono::seconds::period>(fin_time -
init_time).count();
}; //ExchangeSort
int main() {
vector = std::vector<uint32_t>(7000);
uint_fast8_t i = 0;
for (; i < UINT8_MAX; ++i) {
BubbleSort(7000);
ExchangeSort(7000);
}; //for loop
exchangesort_time_average /= i;
bubblesort_time_average /= i;
std::cout << "ExchangeSort:" << std::to_string(exchangesort_time_average) << "\n";
std::cout << "BubbleSort:" << std::to_string(bubblesort_time_average) << "\n";
}; //main
Here we add two functions: Start_Timer and End_Timer. The names should be fairly self-explanatory: each one stores the current time in a variable external to the function. After placing the timer calls around the sorting work, we subtract the two timestamps and accumulate the difference in another variable. Then, after the sorting functions are looped many times, the totals are divided by the number of loops and printed to the console. My output gave these numbers: Exchange Sort at 0.048897 seconds and Bubble Sort at 0.034960 seconds.
Clearly Bubble Sort is faster, contrary to the guess made from the assembly approximation. The reason is the one stated previously: the approximation primarily works for small functions that are extremely similar in algorithmic behavior, and these two sorting functions clearly differ.
One could easily end there. However, timing CPU clock cycles is still an important option, especially when digging into the nitty-gritty. In situations where the regular clock is barely enough to see a difference, clock cycles can provide more clarity. The next code snippet shows how.
#include <vector>
#include <cstdlib>
#include <cstdint>
#include <iostream>
#include <chrono>
#include <string>
#include <intrin.h>
uint64_t init_clock_cycles;
uint64_t fin_clock_cycles;
double bubblesort_time_average;
double exchangesort_time_average;
double bubblesort_clock_cycle_average;
double exchangesort_clock_cycle_average;
std::chrono::high_resolution_clock::time_point init_time;
std::chrono::high_resolution_clock::time_point fin_time;
std::vector<uint32_t> vector;
inline void End_Timer() {
fin_clock_cycles = __rdtsc();
fin_time = std::chrono::high_resolution_clock::now();
}; //End_Timer
inline void Start_Timer() {
init_clock_cycles = __rdtsc();
init_time = std::chrono::high_resolution_clock::now();
}; //Start_Timer
void BubbleSort(const uint32_t array_size) {
//Fill Vector
for (auto& i : vector) {
i = std::rand();
}; //vector
Start_Timer();
//Sort Vector
for (uint32_t i = 0;i<array_size-1;++i) {
for (uint32_t j = 0; j < array_size - i - 1; ++j)
if (vector[j] > vector[j + 1]) {
auto temp = vector[j];
vector[j] = vector[j + 1];
vector[j + 1] = temp;
}; //if neighbors out of order
}; //for every idx
End_Timer();
bubblesort_time_average += std::chrono::duration<double,
std::chrono::seconds::period>(fin_time -
init_time).count();
bubblesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //BubbleSort
void ExchangeSort(uint32_t array_size) {
//Fill Vector
for (auto& i : vector) {
i = std::rand();
}; //vector
Start_Timer();
//Sort Vector
for (uint_fast32_t i = 0; i < array_size-1; ++i) {
for (uint_fast32_t j = (i + 1); j < array_size; ++j)
{
if (vector[i] < vector[j])
{
auto temp = vector[i];
vector[i] = vector[j];
vector[j] = temp;
};//if i < j
}; //for every next variable
}; //for every idx
End_Timer();
exchangesort_time_average += std::chrono::duration<double,
std::chrono::seconds::period>(fin_time -
init_time).count();
exchangesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //ExchangeSort
int main() {
vector = std::vector<uint32_t>(7000);
uint_fast8_t i = 0;
for (; i < UINT8_MAX; ++i) {
BubbleSort(7000);
ExchangeSort(7000);
}; //for loop
exchangesort_time_average /= i;
bubblesort_time_average /= i;
exchangesort_clock_cycle_average /= i;
bubblesort_clock_cycle_average /= i;
std::cout << "ExchangeSort:" << std::to_string(exchangesort_time_average) << "\n";
std::cout << "BubbleSort:" << std::to_string(bubblesort_time_average) << "\n";
std::cout << "ExchangeSort cycles:" << std::to_string(exchangesort_clock_cycle_average) << "\n";
std::cout << "BubbleSort cycles:" << std::to_string(bubblesort_clock_cycle_average) << "\n";
}; //main
Just like the previous option, gathering clock cycles works exactly the same way, except now you use the __rdtsc() function. This is a compiler intrinsic: a function the compiler replaces directly with specific CPU instructions, making intrinsics a borderline replacement for inline assembly. It can be found in the intrin.h header.
Removing OS Noise
If you’ve profiled the previous functions multiple times, you may have noticed the results aren’t really consistent. While the difference between attempts is rather small, the averaged number is rarely ever the same twice. To help alleviate this, one has a few options at their disposal.
One option is simply to play with the OS. The OS has a scheduler that manages tasks through a priority queue, selecting based on differing criteria such as whether the program is the one in focus or whether it’s just in the background calculating and brooding its next move. To keep the scheduler from taking CPU time away from your program, you can do two things: close as many other programs as possible, and explicitly ask the OS to divert as many resources as possible to your program. The following program shows the latter; one can assume any competent programmer can do the former.
#include <vector>
#include <list>
#include <cstdlib>
#include <iostream>
#include <chrono>
#include <string>
#include <intrin.h>
#include <Windows.h>
...
int main() {
auto process = GetCurrentProcess();
auto thread = GetCurrentThread();
SetPriorityClass(process, REALTIME_PRIORITY_CLASS);
SetThreadPriority(thread, THREAD_PRIORITY_TIME_CRITICAL);
vector = std::vector<uint32_t>(7000);
uint_fast8_t i = 0;
for (; i < UINT8_MAX; ++i) {
BubbleSort(7000);
ExchangeSort(7000);
}; //for loop
exchangesort_time_average /= i;
bubblesort_time_average /= i;
std::cout << "ExchangeSort:" << std::to_string(exchangesort_time_average) << "\n";
std::cout << "BubbleSort:" << std::to_string(bubblesort_time_average) << "\n";
std::cout << "ExchangeSort:" << std::to_string(exchangesort_clock_cycle_average) << "\n";
std::cout << "BubbleSort:" << std::to_string(bubblesort_clock_cycle_average) << "\n";
}; //main
One may notice that not only the process is being modified; the thread is as well. That’s because, if it wasn’t known, the operating system manages the true CPU threads for programs. Threads created through facilities such as std::thread and std::future are really just requests for the OS scheduler to divert resources as best it can; if a program can’t get a hardware thread to itself, its threads are time-sliced, rapidly switching between tasks. Even though no such facilities are used here, we still need to tell the OS that the program’s main thread should be prioritized. REALTIME_PRIORITY_CLASS and THREAD_PRIORITY_TIME_CRITICAL are the maximum priority settings; note that REALTIME_PRIORITY_CLASS generally requires administrator rights, and without them Windows silently falls back to a lower class. You can view all the settings in the Win32 API documentation.
Profiling with Other Programs
No one should have to profile this way by hand. In fact, entire suites of programs called profilers exist that provide a visual timeline with minimal code intervention. One of the best open-source ones currently out there is Tracy.
To install Tracy, go to its GitHub releases page and download both the source code and the windows-version zip. For this specific task, the source cannot be taken directly from the repo; it can only be taken from the releases.
Run the typical CMake build command directly in the source code directory, then open the SLN in the build folder. After the library has been compiled [use Release mode for the best profiling], link it to the program. The necessary files to include can be found in the “public” folder at the top level of the source code. If you did not compile in Release mode, the debug information format will have to be changed in the project properties, as displayed below.
Tracy works using zone-scope macros, and most other profilers work in a similar vein. A zone scope marks a certain “scope” in a C++ program to be analyzed. Inside the BubbleSort or ExchangeSort function the ZoneScoped macro can be added, as shown below along with the defines and includes needed.
#define TRACY_ENABLE
#include "include/tracy/Tracy.hpp"
...
void BubbleSort(const uint32_t array_size) {
//Fill Vector
for (auto& i : vector) {
i = std::rand();
}; //vector
Start_Timer();
{
ZoneScopedNC("Bubble Sort", tracy::Color::Blue);
//Sort Vector
for (uint32_t i = 0;i<array_size-1;++i) {
for (uint32_t j = 0; j < array_size - i - 1; ++j)
if (vector[j] > vector[j + 1]) {
auto temp = vector[j];
vector[j] = vector[j + 1];
vector[j + 1] = temp;
}; //if neighbors out of order
}; //for every idx
}
End_Timer();
bubblesort_time_average += std::chrono::duration<double,
std::chrono::seconds::period>(fin_time -
init_time).count();
bubblesort_clock_cycle_average += fin_clock_cycles - init_clock_cycles;
}; //BubbleSort
Although anyone can simply use ZoneScoped without arguments, ZoneScopedNC allows adding a name and a color to make the visualization easier to use. Without the N, ZoneScoped will just use the function name, and without the C it produces an identical beige color for every function on the timeline. It’s suggested to use a different color in ExchangeSort. The colors use the X11 scheme, which is well documented on Wikipedia; a hex code is also an option.
To actually see the visualization unzip the windows-version-zip folder and start the tracy-profiler executable inside. Press the connect button then start the program. After a few moments, if you’ve done the previous instructions correctly, something like this should appear:
It doesn’t seem so useful at first, but one can zoom in with the scroll wheel on the main thread to see the timing difference.
Even more useful is that one can click on the function on the timeline. One can then click on statistics to bring up a histogram perfect for seeing timing in progress.
It’s beautiful, but Tracy is a much, much deeper program. One can add messages with TracyMessage() as well as plot values [such as the CPU clock cycles or seconds spent] with TracyPlot. An S can be added to the ZoneScoped macros to capture the call stack directly in Tracy, and Tracy can be run over a network connection to view a profile from another computer. It truly is versatile.
Of course, that completes the section on timing. Timing is extremely simple. The juicier bits come when one starts analyzing memory to figure out how to micromanage allocation, and there’s always using special compiler or CPU-specific functions to really accelerate a program. Tracy is perfectly built for that as well, but that comes in another article.