
Loop Optimization in C++

Testing capabilities of the Visual C++ compiler to optimize loops.

Intro

While writing our blog post on Windows APC, my friend Rbmm and I had a discussion about for loop optimization in C++. I want to expand on it a little bit further in this post, as it may be interesting for C/C++ programmers who want to write efficient code.

Premise

I want to explain right off the bat what I will not be doing here. Even though we will be talking about efficiency, I am not going to run any timing tests. Why not?

Well, there are many reasons for that. To name just a few:

  • It makes sense to run a timing test only for a specific algorithm or function for a particular environment. Otherwise, doing it for some arbitrarily made-up function is a moot point.
  • Modern CPUs and systems have a myriad of features that can influence timing tests from one hardware setup to another. To name just a few that can significantly skew timing results: different cache sizes and types of cache on the CPU, cache-line alignment, parallel execution of instructions, branch prediction, hyperthreading, the amount of RAM in the system and how much of it was free during the test, the type and speed of the system RAM, the physical proximity of the RAM chips to the CPU (NUMA nodes), other threads running in the system, how busy the OS was during the test, and many, many more.
  • Timing tests also greatly depend on other variables that are not easy to control in your test environment. Things like: what was the cache state before the test? Did you run the first test that primed the cache so that things run faster now? Did the built-in antivirus software kick in to check the first run of your test app? And so forth.
  • Moreover, I've seen quite a few timing tests conducted in a separate process built from scratch solely for the purpose of running the test. But this is not representative, because the actual code that needs to be measured will not necessarily run in such an isolated environment.
  • And even if someone manages to run a slew of timing tests and present the averages, the results will be mostly appropriate for that particular system that the tests were conducted on, and thus sharing them in a blog post will not help someone else with a different hardware configuration.

So having said that, what is the goal of my blog post then?

The Goal

My goal is to study the efficiency of C++ source code compiled with the Visual Studio compiler in relative terms, by examining the generated assembly language instructions themselves.

It is a well-known fact that some x86 CPU instructions execute slower than others. For instance:

  • Floating point instructions are generally slower on the CPU than similar integer instructions.
  • An integer division, or the idiv instruction, is much slower than a logical shift shr followed by an arithmetic addition or subtraction. (A short example follows this list.)
  • A conditional branch instruction, e.g. jnz, is generally slower than an instruction without branching, e.g. cmovnz.
  • A context switch to the kernel for a thread is slower than a controlled sysenter instruction (available via many Windows API calls).
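
To illustrate the division bullet above with a concrete example of my own (the function names are made up, not from the project): an optimizing compiler typically turns a division by a constant power of two into a shift, with a small adjustment for the signed case so that the result still rounds toward zero.

C++
// Hypothetical helpers, only to show the typical /O2 codegen; not part of the sample project.
unsigned int QuarterU(unsigned int n)
{
	return n / 4;	// usually compiles to a single shr, with no div instruction
}

int HalfS(int n)
{
	return n / 2;	// usually compiles to a shift plus a small add/sub adjustment, still with no idiv
}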

Test Environment

Having named the types of tests that I will not be doing, let me describe my actual test environment:

  • The Visual C++ optimizing compiler v.19.27.29111 for x64, supplied with Visual Studio Community v.16.7.2.
  • All tests were conducted on the "Get Process Modules" C++ sample code, created originally as a "Console Application" project in Visual Studio 2019, Community edition. The actual loop being tested was a part of the CollectModules function in that project.
  • The C++ project was compiled for the x64 platform, as the Release configuration, with the project's configuration parameters pretty much kept as they were out-of-the-box after the installation of Visual Studio:
    • Configuration Properties > C/C++ > Optimization:
      • Optimization: "Maximum Optimization (Favor Speed) (/O2)"
      • Favor Size or Speed: Favor fast code (/Ot)
      • Enable Fiber-Safe Optimizations: No
      • Whole Program Optimization: Yes (/GL)
    • Configuration Properties > C/C++ > Code Generation:
      • Spectre Mitigations: Disabled
      • Enable Intel JCC Erratum Mitigation:

Test Loops

I have compiled the CollectModules function with the following loops inserted into the part of the code that deals with the output of the results to the console.

Note that although each test sample calls a very slow printf-type function, its presence does not influence the assembly language instructions that the compiler generated within the loop itself.

Each of the following loops performs the same task; they differ only in how the loop is written in C++.

"For" Loop With An Index

This is probably how most people would write the for loop. We declare a variable that counts from 0 up to the number of elements and use it as an index into the Modules array of RTL_PROCESS_MODULE_INFORMATION structs that we are printing:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;

if (!pRPMs->NumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);

for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pRPMs->Modules[m].ImageBase,
		pRPMs->Modules[m].ImageSize,
		pRPMs->Modules[m].Flags,
		pRPMs->Modules[m].InitOrderIndex,
		pRPMs->Modules[m].FullPathName
	);
}

And just for completeness, the RTL_PROCESS_MODULES struct and its elements are declared as follows:

C++
typedef struct RTL_PROCESS_MODULE_INFORMATION {
	HANDLE Section;                 // Not filled in
	PVOID MappedBase;
	PVOID ImageBase;
	ULONG ImageSize;
	ULONG Flags;
	USHORT LoadOrderIndex;
	USHORT InitOrderIndex;
	USHORT LoadCount;
	USHORT OffsetToFileName;
	CHAR  FullPathName[256];
} *PRTL_PROCESS_MODULE_INFORMATION;

typedef struct RTL_PROCESS_MODULES {
	ULONG NumberOfModules;
	RTL_PROCESS_MODULE_INFORMATION Modules[1];
} *PRTL_PROCESS_MODULES;
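
As a quick sanity check (my own addition, not part of the original sample), the constants that appear in the compiled listings below follow directly from these declarations on x64: 128h is sizeof(RTL_PROCESS_MODULE_INFORMATION), and the memory displacements are the field offsets plus 8, the offset of the Modules array inside RTL_PROCESS_MODULES. Assuming the declarations above and <cstddef> for offsetof:

C++
#include <cstddef>	// offsetof

// 128h is the element stride used by the imul/lea instructions in the listings below.
static_assert(sizeof(RTL_PROCESS_MODULE_INFORMATION) == 0x128, "element stride");
// The Modules array starts 8 bytes into RTL_PROCESS_MODULES (4 bytes of padding after NumberOfModules).
static_assert(offsetof(RTL_PROCESS_MODULES, Modules) == 0x8, "array offset");
// Field offsets; e.g. [rdx+18h] below is ImageBase (8 + 10h), [rdx+30h] is FullPathName (8 + 28h).
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, ImageBase) == 0x10, "ImageBase");
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, InitOrderIndex) == 0x22, "InitOrderIndex");
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, FullPathName) == 0x28, "FullPathName");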

Compiled Result

Now let's review the compiled result for that specific loop:

x86-64
	; if (!pRPMs->NumberOfModules)
	mov         edx,dword ptr [rdi]
	test        edx,edx
	jne         @@0
	
	; status = STATUS_PROCEDURE_NOT_FOUND;
	; goto cleanup;
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
	lea         rcx,[rip+0x1e33]			; "64-bit Modules (%u):\n"
	call        wprintf

	; for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
	xor         ebx,ebx
	cmp         dword ptr [rdi],ebx
	jbe         @@2
	nop         dword ptr [rax]				; alignment to 16 bytes

	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
@@1:
	mov         eax,ebx
	imul        rdx,rax,128h
	add         rdx,rdi
	movzx       ecx,word ptr [rdx+2Ah]
	lea         rax,[rdx+30h]
	mov         r9d,dword ptr [rdx+24h]
	mov         r8d,dword ptr [rdx+20h]
	mov         rdx,qword ptr [rdx+18h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e20]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf
	inc         ebx
	cmp         ebx,dword ptr [rdi]
	jb          @@1
@@2:

Let's review what happens there step by step:

x86-64
	mov         edx,dword ptr [rdi]
	test        edx,edx
	jne         @@0
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup

First, the compiler checks whether pRPMs->NumberOfModules is 0 and jumps to @@0 if it is not. Otherwise it sets the status code to 0C000007Ah and jumps to the end of the function at the @@cleanup label. (I don't show that part here, as it's not relevant to our tests.)

x86-64
	lea         rcx,[rip+0x1e33]			; "64-bit Modules (%u):\n"
	call        wprintf

Next, note that the compiler needed to prepare two input parameters for the first call to wprintf: one is the format string, which goes into rcx according to the x64 calling convention, and the second is the pRPMs->NumberOfModules value that we used earlier, which the compiler strategically preserved from the previous comparison. Notice that it chose the edx register for that value, which is not a coincidence: rdx holds the second input parameter for the call to wprintf.

x86-64
	xor         ebx,ebx
	cmp         dword ptr [rdi],ebx
	jbe         @@2

Then, as you can see, the compiler keeps the pointer to the RTL_PROCESS_MODULES struct, or pRPMs, in the rdi register. It also dedicates the rbx register to the constant 0, zeroed with the xor ebx,ebx instruction, which is a quick way to clear a register. Also note that the compiler used the shorter 32-bit form of the xor instruction: writing to a 32-bit register zeroes the high 32 bits of the corresponding 64-bit register, so the entire rbx is affected.

After that, we compare our pRPMs->NumberOfModules to 0 with the cmp dword ptr [rdi],ebx instruction, and if the result is below or equal (for an unsigned comparison), we skip the loop altogether by following the conditional jump with the jbe @@2 instruction. Translated to plain English, we skip the loop if pRPMs->NumberOfModules is less than or equal to 0. (Which by itself is an interesting way of coding it in assembly, as an unsigned integer technically cannot be less than 0. So the logical way to code it for a human being would be with the je or jz instruction instead. But the compiler simply reversed the logic of the C++'s m < pRPMs->NumberOfModules comparison to come up with the jbe instruction, or "Jump if below or equal".) Note that this was the implementation of the first condition of the for statement.

x86-64
	nop         dword ptr [rax]				; alignment to 16 bytes

Now we come to an interesting part. The loop itself. As you can see, the compiler used a dud instruction nop dword ptr [rax] to align the first instruction in the loop on the 16-byte boundary. This is another optimization, needed to align the loop on the cache line to ensure that the CPU has to fetch as few instructions from memory as possible. (The cache line is read by the CPU all at once, and thus it makes sense to align as many instructions as can fit into it.)

x86-64
	mov         eax,ebx
	imul        rdx,rax,128h
	add         rdx,rdi

The loop begins by calculating the offset of the needed element from the beginning of the array, or pRPMs. The index value, our m, lives in the ebx register. The compiler first copies it to rax and multiplies it by the size of the RTL_PROCESS_MODULE_INFORMATION struct, which happens to be 128h bytes; this is done by the imul rdx,rax,128h instruction, with the result stored in rdx. After that, rdi is added to rdx to obtain the base address from which the element's fields are read. (Remember, rdi stores the address of the beginning of the RTL_PROCESS_MODULES struct.)
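
In C++ terms, those three instructions compute the byte address from which the element's fields are read. A sketch of the equivalent arithmetic (the helper name is mine, and it assumes the declarations shown earlier):

C++
// What "mov eax,ebx / imul rdx,rax,128h / add rdx,rdi" produces in rdx is the
// struct base plus m times the element size. The remaining +8 (offset of the
// Modules array) and each field offset are folded into displacements such as
// [rdx+18h] (ImageBase) and [rdx+30h] (FullPathName).
const RTL_PROCESS_MODULE_INFORMATION* ElementAt(const RTL_PROCESS_MODULES* pRPMs, ULONG m)
{
	return (const RTL_PROCESS_MODULE_INFORMATION*)
		((const BYTE*)pRPMs + 0x8 + (SIZE_T)m * sizeof(RTL_PROCESS_MODULE_INFORMATION));	// == &pRPMs->Modules[m]
}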

x86-64
	movzx       ecx,word ptr [rdx+2Ah]
	lea         rax,[rdx+30h]
	mov         r9d,dword ptr [rdx+24h]
	mov         r8d,dword ptr [rdx+20h]
	mov         rdx,qword ptr [rdx+18h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e20]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

Then the compiler begins setting up input parameters for the printf call. For simplicity, let's review it from the call printf instruction backwards. It puts the pointer to the string with the specifiers in the rcx register (i.e. "%p sz=%08X flg=%08X Ord=%02X %s\n"), and then sets up the next 3 parameters in rdx, r8 and r9 registers. The remaining 2 parameters have to be stored on the stack, according to the calling convention, so the compiler first copies them into ecx and rax registers, and then stores them on the stack at the [rsp+20h] and [rsp+28h] locations, respectively.

x86-64
	inc         ebx
	cmp         ebx,dword ptr [rdi]
	jb          @@1

Finally, after the call to printf, the compiler increments the index in ebx and compares it to the number of elements in pRPMs->NumberOfModules with the cmp ebx,dword ptr [rdi] instruction. The loop continues as long as the index value in ebx is below (as an unsigned comparison) the number of elements; this conditional jump is performed with the jb @@1 ("Jump if below") instruction.

Efficiency Issues

Having seen the compiled assembly language code we can notice the following efficiency issues right off the bat:

  1. Multiplication of the index on each iteration. The imul instruction, even though significantly faster on modern CPUs than it used to be, is still not the most efficient way to calculate the offset. I am actually quite surprised that the optimizing compiler in Visual Studio chose this approach: it is a literal translation of the logic in the C++ source code.
    It seems that the compiler kept the in-loop offset calculation (the imul instruction) because we used a member of the pRPMs struct, pRPMs->NumberOfModules, in the for statement itself. That prevented the compiler from applying the optimization, since the member variable could be altered by the function called inside the body of the loop, and because it served as the loop-end condition of the for statement, the compiler could not compute it once before running the loop. (A contrived sketch of this aliasing problem follows this list.)
  2. Use of memory reference for the index comparison. The cmp ebx,dword ptr [rdi] instruction at the end of the loop reads the dword ptr [rdi] memory location on every iteration, which is unnecessary, as that value does not change while the loop runs. It would obviously end up in the CPU cache after the first read, which speeds things up, but relying on the cache alone is not the best option, especially when we could easily keep that value in one of the registers that were still available: rsi, r12, r13, r14, r15, etc.
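
To make the aliasing concern from item 1 concrete, here is a contrived sketch (the types and names are made up, not from the project): because the loop bound lives in memory that the called function could conceivably reach, every call acts as a barrier and forces the compiler to re-read the bound.

C++
// Contrived example: the compiler cannot prove that Touch() does not modify
// c->n (for instance, through the global alias below), so "i < c->n" must
// re-read memory on every iteration instead of being cached in a register.
struct Counters { unsigned int n; };

Counters* g_counters;		// hypothetical global that may alias *c

void Touch();			// opaque call defined in another translation unit

unsigned int SumBelowBound(Counters* c)
{
	unsigned int sum = 0;
	for (unsigned int i = 0; i < c->n; i++)
	{
		Touch();	// could write to *g_counters, i.e. potentially to *c
		sum += i;
	}
	return sum;
}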

So the bottom line is that the compiled C++ code in this case is not the most efficient it could be.

So let's see how we can improve it.

"For" Loop With An Index As A Local Variable

Let's make a small change to the C++ code that we reviewed earlier: instead of referring to the number of elements through its member variable in the RTL_PROCESS_MODULES struct, let's copy it into a scope-local variable:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;

ULONG nNumberOfModules = pRPMs->NumberOfModules;			//Using scope-local variable to store the number of elements in the loop

if (!nNumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);

for (ULONG m = 0; m < nNumberOfModules; m++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pRPMs->Modules[m].ImageBase,
		pRPMs->Modules[m].ImageSize,
		pRPMs->Modules[m].Flags,
		pRPMs->Modules[m].InitOrderIndex,
		pRPMs->Modules[m].FullPathName
	);
}

At first glance, the C++ code above looks like a slightly slower version of the original: it adds one extra assignment, the initialization of the local nNumberOfModules variable.

Compiled Result

Let's see what it now compiles into:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; ULONG nNumberOfModules = pRPMs->NumberOfModules;
	mov         esi,dword ptr [rdi]

	; if (!nNumberOfModules)
	test        esi,esi
	jne         @@0

	; status = STATUS_PROCEDURE_NOT_FOUND;
	; goto cleanup;
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
	mov         edx,esi
	lea         rcx,[rip+0x1e20]					; "64-bit Modules (%u):\n"
	call        wprintf

	; for (ULONG m = 0; m < nNumberOfModules; m++)
	test        esi,esi
	je          @@2
	lea         rbx,[rdi+24h]
	add         rdi,30h
	nop         word ptr [rax+rax]					; alignment to 16 bytes
@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
	movzx       eax,word ptr [rbx+6]
	lea         rcx,[rip+0x1e25]					; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	mov         r9d,dword ptr [rbx]
	mov         r8d,dword ptr [rbx-4]
	mov         rdx,qword ptr [rbx-0Ch]
	mov         qword ptr [rsp+28h],rdi
	mov         dword ptr [rsp+20h],eax
	call        printf
	add         rdi,128h
	lea         rbx,[rbx+128h]
	sub         rsi,1
	jne         @@1
@@2:

As you can see, the assembly code is surprisingly more laconic now.

x86-64
	mov         esi,dword ptr [rdi]

The compiler chose again to use the rdi register to store the pointer to the local RTL_PROCESS_MODULES struct. But now it also caches the pRPMs->NumberOfModules member in the esi register and uses it later throughout the code snippet above. The compiler can do this because we hinted to it with our addition of the following scope-local variable:

C++
ULONG nNumberOfModules = pRPMs->NumberOfModules;

Because it is a scope-local variable whose address is never taken and which is not referenced anywhere else in the code, the compiler can cache it in a register and gain a speed advantage. So the line of code above, which looks quite superfluous in the C++ source, was actually a performance gain.

x86-64
	lea         rbx,[rdi+24h]

	; ....
@@1:
	movzx       eax,word ptr [rbx+6]
	lea         rcx,[rip+0x1e25]					; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	mov         r9d,dword ptr [rbx]
	mov         r8d,dword ptr [rbx-4]
	mov         rdx,qword ptr [rbx-0Ch]
	mov         qword ptr [rsp+28h],rdi
	mov         dword ptr [rsp+20h],eax
	call        printf
	add         rdi,128h
	lea         rbx,[rbx+128h]
	sub         rsi,1
	jne         @@1

Furthermore, if you look at the assembly code for the loop itself (after the @@1 label), the compiler was able to eliminate the multiplication used to calculate the offset of each element of the RTL_PROCESS_MODULE_INFORMATION array. Instead, it simply uses another register, rbx, to keep the pointer to the current element of the array and advances it to the next element at the end of the loop with the lea rbx,[rbx+128h] instruction. Remember, 128h is the size of each RTL_PROCESS_MODULE_INFORMATION struct in bytes.

Note that the lea instruction is faster than the imul instruction that was used in the first compiled sample.
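
Viewed from the C++ side, this strength reduction is roughly equivalent to the following rewrite (my sketch, reusing the declarations from the sample; the next test writes this form out explicitly in the source code):

C++
// Instead of recomputing "base + m * sizeof(element)" on every iteration,
// keep a moving pointer and bump it by one element (128h bytes) per iteration.
const RTL_PROCESS_MODULE_INFORMATION* p = pRPMs->Modules;
for (ULONG m = 0; m < nNumberOfModules; m++, p++)	// p++ advances by 128h bytes
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
		p->ImageBase, p->ImageSize, p->Flags, p->InitOrderIndex, p->FullPathName);
}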

x86-64
	add         rdi,30h
@@1:
	; ...
	mov         qword ptr [rsp+28h],rdi
	; ...
	add         rdi,128h
	; ...
	jne         @@1

Note that the compiler also kept a pointer to the FullPathName member of the current RTL_PROCESS_MODULE_INFORMATION struct in the rdi register. That involved a couple of additional steps: the pointer is first computed before the loop with the add rdi,30h instruction and then advanced inside the loop with the add rdi,128h instruction.

x86-64
	mov         esi,dword ptr [rdi]

	; ...
@@1:
	; ...

	sub         rsi,1
	jne         @@1

And for the loop counter, this time the compiler chose to decrement the previously cached pRPMs->NumberOfModules value in the rsi register and to check it for zero with the jne @@1 ("Jump if not equal", effectively "jump if not zero" here) instruction at the end of the loop. This is a more efficient way of handling the loop counter than before.

Efficiency Issues

As you can see, the assembly language code produced by the compiler this time looks much more efficient, achieved with the addition of just one line of source code that moves the loop bound into a scope-local variable. This may feel counterintuitive in C++ at first, but it made the loop more optimized and thus more efficient.

The reason for such a gain in efficiency is this: the compiler knows that a scope-local variable whose address is never taken cannot be modified from outside the local scope in C/C++, so it is free to optimize the use of that variable. Thus, by moving pRPMs->NumberOfModules into a local variable, we hinted to the compiler that it can keep it in a register, or optimize it in any other way it needs to.

But the assembly code produced still has some efficiency issues:

  1. Too many live values in the loop. We have to keep track not only of the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct in the rbx register, but also of a second pointer in rdi and of the counter in rsi. That is a lot of registers to update for one loop.
  2. Checking the number of modules for non-zero twice. This is not a huge deal breaker, but we checked nNumberOfModules for being 0 and returned an error before the loop, and then it is checked again for the for loop (the test esi,esi / je @@2 pair).

Let's see if we can optimize it a little bit more.

"For" Loop With A Pointer As An Index

In this test, let's try to give the compiler an even stronger hint by iterating with a pointer to the array of RTL_PROCESS_MODULE_INFORMATION structs, stored in our pPMI variable:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;

if (!pRPMs->NumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);

for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pPMI->ImageBase,
		pPMI->ImageSize,
		pPMI->Flags,
		pPMI->InitOrderIndex,
		pPMI->FullPathName
	);
}

Compiled Result

This C++ source code compiled into the following assembly language snippet:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
	mov         edx,dword ptr [rdi]
	lea         rbx,[rdi+8]

	; if (!pRPMs->NumberOfModules)
	test        edx,edx
	jne         @@0

	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
	lea         rcx,[rip+0x1e1a]			; "64-bit Modules (%u):\n"
	call        wprintf

	; for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
	mov         eax,dword ptr [rdi]
	imul        rcx,rax,128h
	lea         rax,[rcx+rbx]
	cmp         rbx,rax
	jae         @@2

	add         rbx,22h
	mov         rax,0DD67C8A60DD67C8Bh
	dec         rcx
	mul         rax,rcx
	mov         rdi,rdx
	shr         rdi,8
	inc         rdi
	nop										; alignment to 16 bytes
@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1ded]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf
	lea         rbx,[rbx+128h]
	sub         rdi,1
	jne         @@1
@@2:

Wow! This looks like a lot more assembly code, doesn't it?

The first part of the code before the for loop has nothing special in it. But the for loop itself compiled into an interesting construct. Let's review it.

x86-64
	mov         edx,dword ptr [rdi]
	lea         rbx,[rdi+8]

As before, the rdi register was originally set to point to the RTL_PROCESS_MODULES struct. Additionally, in this case, the compiler chose to use the rbx register to keep the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct in the array, or our pPMI variable in the C++ code.

x86-64
	mov         eax,dword ptr [rdi]
	imul        rcx,rax,128h
	lea         rax,[rcx+rbx]
	cmp         rbx,rax
	jae         @@2

So the first part of the for loop calculates the pEnd pointer in the rax register, and compares it with rbx, or with the pPMI variable, and then skips the loop altogether if the result is above-or-equal, in the jae @@2 instruction. This part was expected.

x86-64
	mov         rax,0DD67C8A60DD67C8Bh
	dec         rcx
	mul         rax,rcx
	mov         rdi,rdx
	shr         rdi,8
	inc         rdi

What is really unusual is the next chunk of code before the actual loop. The compiler calculates and uses its own internal counter in the rdi register, even though our original for loop did not specify one. It computes that counter up front using optimized division logic. The code is quite interesting, as it does not immediately look like a division at all.

What this assembly code does is equivalent to dividing rcx (which holds the total size of our array of RTL_PROCESS_MODULE_INFORMATION structs in bytes) by 128h, the size of a single struct in that array, and rounding the result up to the next integer. This is done purely to optimize away the division, using the technique known as "division by invariant integers" (multiplication by a precomputed reciprocal).

This is not a blog post on mathematics, but in a nutshell it works like this:
  • Say, we need to compute n / 128h.
  • We calculate the reciprocal of 128h, which is 1 / 128h, and multiply it by n. So our formula becomes (1 / 128h) * n.
  • Since we also need to round the result up to the next integer, we can multiply and divide the expression by a specially chosen constant Z: ((Z / 128h) * n) / Z.
  • Our goal is to pick Z in such a way that the division by Z is cheap for the CPU. Division by a power of two can be done with the right-shift instruction shr; here Z is 2^72, so dividing by it amounts to taking the high 64 bits of the 128-bit product (the rdx register after mul) and shifting them right by 8. The magic constant 0DD67C8A60DD67C8Bh is simply Z / 128h rounded up, i.e. ceil(2^72 / 128h).
  • Lastly, we decrement the original dividend by 1 (the dec rcx instruction) to handle values exactly divisible by 128h, and then increment the result (the inc rdi instruction) to round it up to the next integer.
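
To convince yourself that the arithmetic works out, here is a small standalone check (my own sketch, not part of the sample; it uses the MSVC _umul128 intrinsic to obtain the high 64 bits of the product, which is what ends up in rdx after the mul):

C++
#include <intrin.h>	// _umul128 (MSVC)
#include <cstdio>

int main()
{
	const unsigned long long M = 0xDD67C8A60DD67C8BULL;	// magic constant from the listing, ceil(2^72 / 0x128)
	const unsigned long long d = 0x128;			// sizeof(RTL_PROCESS_MODULE_INFORMATION)

	for (unsigned long long n = 1; n <= 100000; n++)	// n plays the role of NumberOfModules
	{
		unsigned long long bytes = n * d;		// rcx: total array size in bytes
		unsigned long long hi;
		_umul128(bytes - 1, M, &hi);			// "dec rcx" + "mul": hi is what lands in rdx
		unsigned long long q = (hi >> 8) + 1;		// "shr rdi,8" + "inc rdi"
		if (q != n)
			printf("mismatch at n = %llu\n", n);
	}
	printf("check done\n");
	return 0;
}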

x86-64
	add         rbx,22h

Additionally, the compiler chose to advance the rbx register, which stores the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct used later in the loop, forward by 22h bytes with the add rbx,22h instruction. That is an interesting decision. My guess is that the compiler did it to avoid a displacement in one of the subsequent memory reads, as in the movzx ecx,word ptr [rbx] instruction.

x86-64
	sub         rdi,1
	jne         @@1

Also note that, quite interestingly, the compiler chose the sub rdi,1 instruction to decrement the counter instead of dec rdi, as I would have expected. The choice looks unusual at first, because sub rdi,1 translates into the 48 83 EF 01 machine code sequence, while dec rdi translates into 48 FF CF, which is one byte shorter. A likely explanation is that inc and dec update only part of the flags register (they leave CF untouched), which can create a partial flag-register dependency on some CPUs, so optimizing compilers tend to prefer the add/sub forms. (Both instructions take the same number of CPU cycles here, and the longer instruction did not help preserve the alignment of the code that follows either; as a matter of fact, the compiler still had to pad that block with a nop instruction further down the code to maintain the 16-byte alignment of the next loop.)

Efficiency Issues

Unlike my original expectations, this method of changing the for loop did not bring significant improvements over the previous method. Let's name what is lacking here efficiency-wise:

  1. Unnecessary calculation of the number of elements. The compiler computes the number of elements in the RTL_PROCESS_MODULE_INFORMATION array to derive its internal loop counter in rdi. As you can see in the source code, we specifically chose not to use a loop counter, but the compiler introduced one for us internally anyway.
  2. Checking pRPMs->NumberOfModules for non-zero twice. This is again an artifact of using a for loop: the condition is checked before the first iteration even though we already checked it earlier in the if statement.

The question then arises: do we even need a for loop here?

"Do-While" Loop

My friend Rbmm proposed his own version using a do-while loop instead. It may look something like this:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
ULONG nNumberOfModules = pRPMs->NumberOfModules;
if (!nNumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);

RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;

do
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
		pPMI->ImageBase,
		pPMI->ImageSize,
		pPMI->Flags,
		pPMI->InitOrderIndex,
		pPMI->FullPathName
	);
}
while (pPMI++, --nNumberOfModules);

Compiled Result

Note the comma operator in the while condition: pPMI++ is evaluated only for its side effect of advancing the pointer, while the value of the whole expression, and therefore the loop condition, is --nNumberOfModules. After compilation this produces the following assembly code:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; ULONG nNumberOfModules = pRPMs->NumberOfModules;
	mov         edi,dword ptr [rsi]

	; if (!nNumberOfModules)
	test        edi,edi
	jne         @@0

	mov         esi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
	mov         edx,edi
	lea         rcx,[rip+0x1e20]			; "64-bit Modules (%u):\n"
	call        wprintf
	
	; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
	lea         rbx,[rsi+2Ah]
	nop         dword ptr [rax]

@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e0c]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

	; while (pPMI++, --nNumberOfModules);
	lea         rbx,[rbx+128h]
	add         edi,0FFFFFFFFh
	jne         @@1

As you can see, the generated assembly code now closely mirrors our C++ code. It also eliminates the second check of the number of modules that we had in our previous for loops.

x86-64
	lea         rbx,[rsi+2Ah]

Then, before we enter the loop, the compiler sets the pointer to the RTL_PROCESS_MODULE_INFORMATION array in the rbx register, with a slight offset of 2Ah bytes. Such an offset allows the compiler to avoid a displacement in one of the instructions inside the loop, such as movzx ecx,word ptr [rbx].

x86-64
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e0c]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

In this case the loop is much more laconic and does pretty much the minimum required actions to fill in parameters for the printf call.

x86-64
	lea         rbx,[rbx+128h]
	add         edi,0FFFFFFFFh
	jne         @@1

Then, at the end of each iteration, the compiler advances rbx by 128h bytes to get to the next RTL_PROCESS_MODULE_INFORMATION struct, decrements the counter in edi by adding -1 to it with the add edi,0FFFFFFFFh instruction, and continues with the next iteration via the conditional jump jne @@1 ("Jump if not equal").

There is an interesting side note here as well. The compiler again avoided the plain old dec edi: the add edi,0FFFFFFFFh instruction compiles into the machine code 83 C7 FF, while dec edi is only FF CF, and both execute in the same number of CPU cycles, which would seem to give a clear advantage to the one-byte-shorter dec. The likely reason is the same partial flag-register dependency mentioned earlier: dec does not update CF, so the compiler prefers the add form.

Efficiency Issues

This loop seems to be the most optimized version that I could get with the Visual Studio C++ compiler, apart from coding it manually in assembly.

Conclusion

Even though the performance gains achieved in this blog post may not be very significant for an average C++ loop, for some applications adjusting your source code to produce better-performing assembly language code can be a crucial step.

Keep in mind the following:

  • Try to move components of the loop, such as the loop bound, into scope-local variables to help the compiler optimize the loop. (A short code recap follows this list.)
  • The most readable and beautiful C++ source code may not always generate the most efficient and robust assembly language code. Always check generated assembly language code for critical components of your program.
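
To recap the first bullet above in code form (generic, made-up names; this is just the pattern distilled from the tests above, not code from the project):

C++
#include <windows.h>

struct ITEM   { ULONG Value; };
struct HEADER { ULONG Count; ITEM Items[1]; };

void Print(const ITEM* p);	// some non-inlined call, e.g. a printf wrapper

void Dump(const HEADER* pHeader)
{
	// Cache the loop bound in a scope-local variable and walk a pointer,
	// which is roughly what the best-compiled versions above ended up doing.
	ULONG n = pHeader->Count;
	if (!n)
		return;

	const ITEM* p = pHeader->Items;
	do
	{
		Print(p);
	}
	while (p++, --n);
}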

And finally, do not rely on the specific examples given in this blog post to draw your conclusion. The only way to know for sure is to check your code in the debugger.

To view the assembly language code for your compiled source code in Visual Studio, first place a breakpoint at the location in your code that you want to view the generated assembly for. Then start debugging (by pressing the F5 key). When the debugger breaks on your breakpoint, while the cursor is still in your source code window, hit Ctrl+F11 to toggle the disassembly window.
