
Loop Optimization in C++

Testing capabilities of the Visual C++ compiler to optimize loops.

Intro

While writing our blog post on Windows APC, my friend Rbmm and I had a discussion about for loop optimization in C++. I want to expand on it a little bit further in this post, as it may be interesting for C/C++ programmers who want to write efficient code.

Premise

I want to explain right off the bat what I will not be doing here. Even though we will be talking about efficiency, I am not going to run any timing tests. Why not?

Well, there are many reasons for that. To name just a few:

  • It makes sense to run a timing test only for a specific algorithm or function for a particular environment. Otherwise, doing it for some arbitrarily made-up function is a moot point.
  • Modern CPUs and systems have a myriad of features that can influence timing tests from one hardware setup to another. To name just a few that can significantly skew timing results: different cache sizes and types of cache on the CPU, cache-line alignment, parallel execution of instructions, branch prediction, hyperthreading, the amount of RAM in the system and how much of it was free during the test, the type and speed of the system RAM, the physical proximity of the RAM chips to the CPU (NUMA nodes), other threads running in the system, how busy the OS was during the test, and many, many more.
  • Timing tests also greatly depend on other variables that are not easy to control in your test environment. Things like: what was the cache state before the test? Did you run the first test that primed the cache so that things run faster now? Did the built-in antivirus software kick in to check the first run of your test app? And so forth.
  • Moreover, I've seen quite a few timing tests conducted in a separate process built from scratch solely for the purpose of running the test. But this is not representative, because the actual code that needs to be measured will not necessarily run in such an isolated environment.
  • And even if someone manages to run a slew of timing tests and present the averages, the results will be mostly appropriate for that particular system that the tests were conducted on, and thus sharing them in a blog post will not help someone else with a different hardware configuration.

So having said that, what is the goal of my blog post then?

The Goal

My goal is to study the efficiency of C++ source code compiled with the Visual Studio compiler in relative terms, by examining the generated assembly language instructions themselves.

It is a well-known fact that some x86 CPU instructions execute slower than others. For instance:

  • Floating point instructions are generally slower on the CPU than similar integer instructions.
  • An integer division, or the idiv instruction, is much slower than a logical shift shr followed by an arithmetic addition or subtraction. (A short example follows this list.)
  • A conditional branch instruction, e.g. jnz, is generally slower than an instruction without branching, e.g. cmovnz.
  • A context switch to the kernel for a thread is slower than a controlled sysenter instruction (available via many Windows API calls).
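
To illustrate the division bullet above with a concrete example of my own (the function names are made up, not from the project): an optimizing compiler typically turns a division by a constant power of two into a shift, with a small adjustment for the signed case so that the result still rounds toward zero.

C++
// Hypothetical helpers, only to show the typical /O2 codegen; not part of the sample project.
unsigned int QuarterU(unsigned int n)
{
	return n / 4;	// usually compiles to a single shr, with no div instruction
}

int HalfS(int n)
{
	return n / 2;	// usually compiles to a shift plus a small add/sub adjustment, still with no idiv
}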

Test Environment

Having named the types of tests that I will not be doing, let me describe my actual test environment:

  • The Visual C++ optimizing compiler v.19.27.29111 for x64, supplied with Visual Studio Community v.16.7.2.
  • All tests were conducted on the "Get Process Modules" C++ sample code, created originally as a "Console Application" project in Visual Studio 2019, Community edition. The actual loop being tested was a part of the CollectModules function in that project.
  • The C++ project was compiled for the x64 platform, as the Release configuration, with the project's configuration parameters pretty much kept as they were out-of-the-box after the installation of Visual Studio:
    • Configuration Properties > C/C++ > Optimization:
      • Optimization: "Maximum Optimization (Favor Speed) (/O2)"
      • Favor Size or Speed: Favor fast code (/Ot)
      • Enable Fiber-Safe Optimizations: No
      • Whole Program Optimization: Yes (/GL)
    • Configuration Properties > C/C++ > Code Generation:
      • Spectre Mitigations: Disabled
      • Enable Intel JCC Erratum Mitigation:

Test Loops

I have compiled the CollectModules function with the following loops inserted into the part of the code that deals with the output of the results to the console.

Note that although each test sample calls a very slow printf-type function, its presence does not influence the assembly language instructions that the compiler generated within the loop itself.

Each of the following loops performs the same task; they differ only in how the loop is written in C++.

"For" Loop With An Index

This is probably how most people would write the for loop. We declare a variable that counts from 0 up to the number of elements and use it as an index into the Modules array of RTL_PROCESS_MODULE_INFORMATION structs that we are printing:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;

if (!pRPMs->NumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);

for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pRPMs->Modules[m].ImageBase,
		pRPMs->Modules[m].ImageSize,
		pRPMs->Modules[m].Flags,
		pRPMs->Modules[m].InitOrderIndex,
		pRPMs->Modules[m].FullPathName
	);
}

And just for completeness, the RTL_PROCESS_MODULES struct and its elements are declared as follows:

C++
typedef struct RTL_PROCESS_MODULE_INFORMATION {
	HANDLE Section;                 // Not filled in
	PVOID MappedBase;
	PVOID ImageBase;
	ULONG ImageSize;
	ULONG Flags;
	USHORT LoadOrderIndex;
	USHORT InitOrderIndex;
	USHORT LoadCount;
	USHORT OffsetToFileName;
	CHAR  FullPathName[256];
} *PRTL_PROCESS_MODULE_INFORMATION;

typedef struct RTL_PROCESS_MODULES {
	ULONG NumberOfModules;
	RTL_PROCESS_MODULE_INFORMATION Modules[1];
} *PRTL_PROCESS_MODULES;
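
As a quick sanity check (my own addition, not part of the original sample), the constants that appear in the compiled listings below follow directly from these declarations on x64: 128h is sizeof(RTL_PROCESS_MODULE_INFORMATION), and the memory displacements are the field offsets plus 8, the offset of the Modules array inside RTL_PROCESS_MODULES. Assuming the declarations above and <cstddef> for offsetof:

C++
#include <cstddef>	// offsetof

// 128h is the element stride used by the imul/lea instructions in the listings below.
static_assert(sizeof(RTL_PROCESS_MODULE_INFORMATION) == 0x128, "element stride");
// The Modules array starts 8 bytes into RTL_PROCESS_MODULES (4 bytes of padding after NumberOfModules).
static_assert(offsetof(RTL_PROCESS_MODULES, Modules) == 0x8, "array offset");
// Field offsets; e.g. [rdx+18h] below is ImageBase (8 + 10h), [rdx+30h] is FullPathName (8 + 28h).
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, ImageBase) == 0x10, "ImageBase");
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, InitOrderIndex) == 0x22, "InitOrderIndex");
static_assert(offsetof(RTL_PROCESS_MODULE_INFORMATION, FullPathName) == 0x28, "FullPathName");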

Compiled Result

Now let's review the compiled result for that specific loop:

x86-64
	; if (!pRPMs->NumberOfModules)
	mov         edx,dword ptr [rdi]
	test        edx,edx
	jne         @@0
	
	; status = STATUS_PROCEDURE_NOT_FOUND;
	; goto cleanup;
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
	lea         rcx,[rip+0x1e33]			; "64-bit Modules (%u):\n"
	call        wprintf

	; for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
	xor         ebx,ebx
	cmp         dword ptr [rdi],ebx
	jbe         @@2
	nop         dword ptr [rax]				; alignment to 16 bytes

	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
@@1:
	mov         eax,ebx
	imul        rdx,rax,128h
	add         rdx,rdi
	movzx       ecx,word ptr [rdx+2Ah]
	lea         rax,[rdx+30h]
	mov         r9d,dword ptr [rdx+24h]
	mov         r8d,dword ptr [rdx+20h]
	mov         rdx,qword ptr [rdx+18h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e20]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf
	inc         ebx
	cmp         ebx,dword ptr [rdi]
	jb          @@1
@@2:

Let's review what happens there step by step:

x86-64
	mov         edx,dword ptr [rdi]
	test        edx,edx
	jne         @@0
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup

First, the compiler checks whether pRPMs->NumberOfModules is 0 and jumps to @@0 if it is not. Otherwise it sets the status code to 0C000007Ah and jumps to the end of the function at the @@cleanup label. (I don't show that part here, as it's not relevant to our tests.)

x86-64
	lea         rcx,[rip+0x1e33]			; "64-bit Modules (%u):\n"
	call        wprintf

Next, note that the compiler needed to prepare two input parameters for the first call to wprintf: one is the format string, which goes into rcx according to the x64 calling convention, and the second is the pRPMs->NumberOfModules value that we used earlier, which the compiler strategically preserved from the previous comparison. Notice that it chose the edx register for that value, which is not a coincidence: rdx holds the second input parameter for the call to wprintf.

x86-64
	xor         ebx,ebx
	cmp         dword ptr [rdi],ebx
	jbe         @@2

Then, as you can see, the compiler keeps the pointer to the RTL_PROCESS_MODULES struct, or pRPMs, in the rdi register. It also dedicates the rbx register to the constant 0, zeroed with the xor ebx,ebx instruction, which is a quick way to clear a register. Also note that the compiler used the shorter 32-bit form of the xor instruction: writing to a 32-bit register zeroes the high 32 bits of the corresponding 64-bit register, so the entire rbx is affected.

After that, we compare our pRPMs->NumberOfModules to 0 with the cmp dword ptr [rdi],ebx instruction, and if the result is below or equal (for an unsigned comparison), we skip the loop altogether by following the conditional jump with the jbe @@2 instruction. Translated to plain English, we skip the loop if pRPMs->NumberOfModules is less than or equal to 0. (Which by itself is an interesting way of coding it in assembly, as an unsigned integer technically cannot be less than 0. So the logical way to code it for a human being would be with the je or jz instruction instead. But the compiler simply reversed the logic of the C++'s m < pRPMs->NumberOfModules comparison to come up with the jbe instruction, or "Jump if below or equal".) Note that this was the implementation of the first condition of the for statement.

x86-64
	nop         dword ptr [rax]				; alignment to 16 bytes

Now we come to an interesting part. The loop itself. As you can see, the compiler used a dud instruction nop dword ptr [rax] to align the first instruction in the loop on the 16-byte boundary. This is another optimization, needed to align the loop on the cache line to ensure that the CPU has to fetch as few instructions from memory as possible. (The cache line is read by the CPU all at once, and thus it makes sense to align as many instructions as can fit into it.)

x86-64
	mov         eax,ebx
	imul        rdx,rax,128h
	add         rdx,rdi

The loop begins by calculating the offset of the needed element from the beginning of the array, or pRPMs. The index value, our m, lives in the ebx register. The compiler first copies it to rax and multiplies it by the size of the RTL_PROCESS_MODULE_INFORMATION struct, which happens to be 128h bytes; this is done by the imul rdx,rax,128h instruction, with the result stored in rdx. After that, rdi is added to rdx to obtain the base address from which the element's fields are read. (Remember, rdi stores the address of the beginning of the RTL_PROCESS_MODULES struct.)
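
In C++ terms, those three instructions compute the byte address from which the element's fields are read. A sketch of the equivalent arithmetic (the helper name is mine, and it assumes the declarations shown earlier):

C++
// What "mov eax,ebx / imul rdx,rax,128h / add rdx,rdi" produces in rdx is the
// struct base plus m times the element size. The remaining +8 (offset of the
// Modules array) and each field offset are folded into displacements such as
// [rdx+18h] (ImageBase) and [rdx+30h] (FullPathName).
const RTL_PROCESS_MODULE_INFORMATION* ElementAt(const RTL_PROCESS_MODULES* pRPMs, ULONG m)
{
	return (const RTL_PROCESS_MODULE_INFORMATION*)
		((const BYTE*)pRPMs + 0x8 + (SIZE_T)m * sizeof(RTL_PROCESS_MODULE_INFORMATION));	// == &pRPMs->Modules[m]
}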

x86-64
	movzx       ecx,word ptr [rdx+2Ah]
	lea         rax,[rdx+30h]
	mov         r9d,dword ptr [rdx+24h]
	mov         r8d,dword ptr [rdx+20h]
	mov         rdx,qword ptr [rdx+18h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e20]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

Then the compiler begins setting up input parameters for the printf call. For simplicity, let's review it from the call printf instruction backwards. It puts the pointer to the string with the specifiers in the rcx register (i.e. "%p sz=%08X flg=%08X Ord=%02X %s\n"), and then sets up the next 3 parameters in rdx, r8 and r9 registers. The remaining 2 parameters have to be stored on the stack, according to the calling convention, so the compiler first copies them into ecx and rax registers, and then stores them on the stack at the [rsp+20h] and [rsp+28h] locations, respectively.

x86-64
	inc         ebx
	cmp         ebx,dword ptr [rdi]
	jb          @@1

Finally, after the call to printf, the compiler increments the index in ebx and compares it to the number of elements in pRPMs->NumberOfModules with the cmp ebx,dword ptr [rdi] instruction. The loop continues as long as the index value in ebx is below (as an unsigned comparison) the number of elements; this conditional jump is performed with the jb @@1 ("Jump if below") instruction.

Efficiency Issues

Having seen the compiled assembly language code we can notice the following efficiency issues right off the bat:

  1. Multiplication of the index on each iteration. The imul instruction, even though significantly faster on modern CPUs than it used to be, is still not the most efficient way to calculate the offset. I am actually quite surprised that the optimizing compiler in Visual Studio chose this approach: it is a literal translation of the logic in the C++ source code.
    It seems that the compiler kept the in-loop offset calculation (the imul instruction) because we used a member of the pRPMs struct, pRPMs->NumberOfModules, in the for statement itself. That prevented the compiler from applying the optimization, since the member variable could be altered by the function called inside the body of the loop, and because it served as the loop-end condition of the for statement, the compiler could not compute it once before running the loop. (A contrived sketch of this aliasing problem follows this list.)
  2. Use of memory reference for the index comparison. The cmp ebx,dword ptr [rdi] instruction at the end of the loop reads the dword ptr [rdi] memory location on every iteration, which is unnecessary, as that value does not change while the loop runs. It would obviously end up in the CPU cache after the first read, which speeds things up, but relying on the cache alone is not the best option, especially when we could easily keep that value in one of the registers that were still available: rsi, r12, r13, r14, r15, etc.
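
To make the aliasing concern from item 1 concrete, here is a contrived sketch (the types and names are made up, not from the project): because the loop bound lives in memory that the called function could conceivably reach, every call acts as a barrier and forces the compiler to re-read the bound.

C++
// Contrived example: the compiler cannot prove that Touch() does not modify
// c->n (for instance, through the global alias below), so "i < c->n" must
// re-read memory on every iteration instead of being cached in a register.
struct Counters { unsigned int n; };

Counters* g_counters;		// hypothetical global that may alias *c

void Touch();			// opaque call defined in another translation unit

unsigned int SumBelowBound(Counters* c)
{
	unsigned int sum = 0;
	for (unsigned int i = 0; i < c->n; i++)
	{
		Touch();	// could write to *g_counters, i.e. potentially to *c
		sum += i;
	}
	return sum;
}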

So the bottom line is that the compiled C++ code in this case is not the most efficient it could be.

So let's see how we can improve it.

"For" Loop With An Index As A Local Variable

Let's make a small change to the C++ code that we reviewed earlier: instead of referring to the number of elements through its member variable in the RTL_PROCESS_MODULES struct, let's copy it into a scope-local variable:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;

ULONG nNumberOfModules = pRPMs->NumberOfModules;			//Using scope-local variable to store the number of elements in the loop

if (!nNumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);

for (ULONG m = 0; m < nNumberOfModules; m++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pRPMs->Modules[m].ImageBase,
		pRPMs->Modules[m].ImageSize,
		pRPMs->Modules[m].Flags,
		pRPMs->Modules[m].InitOrderIndex,
		pRPMs->Modules[m].FullPathName
	);
}

At first glance, the C++ code above looks like a slightly slower version of the original: it adds one extra assignment, the initialization of the local nNumberOfModules variable.

Compiled Result

Let's see what it now compiles into:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; ULONG nNumberOfModules = pRPMs->NumberOfModules;
	mov         esi,dword ptr [rdi]

	; if (!nNumberOfModules)
	test        esi,esi
	jne         @@0

	; status = STATUS_PROCEDURE_NOT_FOUND;
	; goto cleanup;
	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
	mov         edx,esi
	lea         rcx,[rip+0x1e20]					; "64-bit Modules (%u):\n"
	call        wprintf

	; for (ULONG m = 0; m < nNumberOfModules; m++)
	test        esi,esi
	je          @@2
	lea         rbx,[rdi+24h]
	add         rdi,30h
	nop         word ptr [rax+rax]					; alignment to 16 bytes
@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
	movzx       eax,word ptr [rbx+6]
	lea         rcx,[rip+0x1e25]					; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	mov         r9d,dword ptr [rbx]
	mov         r8d,dword ptr [rbx-4]
	mov         rdx,qword ptr [rbx-0Ch]
	mov         qword ptr [rsp+28h],rdi
	mov         dword ptr [rsp+20h],eax
	call        printf
	add         rdi,128h
	lea         rbx,[rbx+128h]
	sub         rsi,1
	jne         @@1
@@2:

As you can see, the assembly code is surprisingly more laconic now.

x86-64
	mov         esi,dword ptr [rdi]

The compiler chose again to use the rdi register to store the pointer to the local RTL_PROCESS_MODULES struct. But now it also caches the pRPMs->NumberOfModules member in the esi register and uses it later throughout the code snippet above. The compiler can do this because we hinted to it with our addition of the following scope-local variable:

C++
ULONG nNumberOfModules = pRPMs->NumberOfModules;

Because it is a scope-local variable whose address is never taken and which is not referenced anywhere else in the code, the compiler can cache it in a register and gain a speed advantage. So the line of code above, which looks quite superfluous in the C++ source, was actually a performance gain.

x86-64
	lea         rbx,[rdi+24h]

	; ....
@@1:
	movzx       eax,word ptr [rbx+6]
	lea         rcx,[rip+0x1e25]					; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	mov         r9d,dword ptr [rbx]
	mov         r8d,dword ptr [rbx-4]
	mov         rdx,qword ptr [rbx-0Ch]
	mov         qword ptr [rsp+28h],rdi
	mov         dword ptr [rsp+20h],eax
	call        printf
	add         rdi,128h
	lea         rbx,[rbx+128h]
	sub         rsi,1
	jne         @@1

Furthermore, if you look at the assembly code for the loop itself (after the @@1 label), the compiler was able to eliminate the multiplication used to calculate the offset of each element of the RTL_PROCESS_MODULE_INFORMATION array. Instead, it simply uses another register, rbx, to keep the pointer to the current element of the array and advances it to the next element at the end of the loop with the lea rbx,[rbx+128h] instruction. Remember, 128h is the size of each RTL_PROCESS_MODULE_INFORMATION struct in bytes.

Note that the lea instruction is faster than the imul instruction that was used in the first compiled sample.
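
Viewed from the C++ side, this strength reduction is roughly equivalent to the following rewrite (my sketch, reusing the declarations from the sample; the next test writes this form out explicitly in the source code):

C++
// Instead of recomputing "base + m * sizeof(element)" on every iteration,
// keep a moving pointer and bump it by one element (128h bytes) per iteration.
const RTL_PROCESS_MODULE_INFORMATION* p = pRPMs->Modules;
for (ULONG m = 0; m < nNumberOfModules; m++, p++)	// p++ advances by 128h bytes
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
		p->ImageBase, p->ImageSize, p->Flags, p->InitOrderIndex, p->FullPathName);
}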

x86-64
	add         rdi,30h
@@1:
	; ...
	mov         qword ptr [rsp+28h],rdi
	; ...
	add         rdi,128h
	; ...
	jne         @@1

Note that the compiler also kept a pointer to the FullPathName member of the current RTL_PROCESS_MODULE_INFORMATION struct in the rdi register. That involved a couple of additional steps: the pointer is first computed before the loop with the add rdi,30h instruction and then advanced inside the loop with the add rdi,128h instruction.

x86-64
	mov         esi,dword ptr [rdi]

	; ...
@@1:
	; ...

	sub         rsi,1
	jne         @@1

And for the loop counter, this time the compiler chose to decrement the previously cached pRPMs->NumberOfModules value in the rsi register and to check it for zero with the jne @@1 ("Jump if not equal", effectively "jump if not zero" here) instruction at the end of the loop. This is a more efficient way of handling the loop counter than before.

Efficiency Issues

As you can see, the assembly language code produced by the compiler this time looks much more efficient, achieved with the addition of just one line of source code that moves the loop bound into a scope-local variable. This may feel counterintuitive in C++ at first, but it made the loop more optimized and thus more efficient.

The reason for such a gain in efficiency is this: the compiler knows that a scope-local variable whose address is never taken cannot be modified from outside the local scope in C/C++, so it is free to optimize the use of that variable. Thus, by moving pRPMs->NumberOfModules into a local variable, we hinted to the compiler that it can keep it in a register, or optimize it in any other way it needs to.

But the assembly code produced still has some efficiency issues:

  1. Too many live values in the loop. We have to keep track not only of the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct in the rbx register, but also of a second pointer in rdi and of the counter in rsi. That is a lot of registers to update for one loop.
  2. Checking the number of modules for non-zero twice. This is not a huge deal breaker, but we checked nNumberOfModules for being 0 and returned an error before the loop, and then it is checked again for the for loop (the test esi,esi / je @@2 pair).

Let's see if we can optimize it a little bit more.

"For" Loop With A Pointer As An Index

In this test, let's try to give the compiler an even stronger hint by iterating with a pointer to the array of RTL_PROCESS_MODULE_INFORMATION structs, stored in our pPMI variable:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;

if (!pRPMs->NumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);

for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
		,
		pPMI->ImageBase,
		pPMI->ImageSize,
		pPMI->Flags,
		pPMI->InitOrderIndex,
		pPMI->FullPathName
	);
}

Compiled Result

This C++ source code compiled into the following assembly language snippet:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
	mov         edx,dword ptr [rdi]
	lea         rbx,[rdi+8]

	; if (!pRPMs->NumberOfModules)
	test        edx,edx
	jne         @@0

	mov         edi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
	lea         rcx,[rip+0x1e1a]			; "64-bit Modules (%u):\n"
	call        wprintf

	; for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
	mov         eax,dword ptr [rdi]
	imul        rcx,rax,128h
	lea         rax,[rcx+rbx]
	cmp         rbx,rax
	jae         @@2

	add         rbx,22h
	mov         rax,0DD67C8A60DD67C8Bh
	dec         rcx
	mul         rax,rcx
	mov         rdi,rdx
	shr         rdi,8
	inc         rdi
	nop										; alignment to 16 bytes
@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1ded]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf
	lea         rbx,[rbx+128h]
	sub         rdi,1
	jne         @@1
@@2:

Wow! This looks like a lot more assembly code, doesn't it?

The first part of the code before the for loop has nothing special in it. But the for loop itself compiled into an interesting construct. Let's review it.

x86-64
	mov         edx,dword ptr [rdi]
	lea         rbx,[rdi+8]

As before, the rdi register was originally set to point to the RTL_PROCESS_MODULES struct. Additionally, in this case, the compiler chose to use the rbx register to keep the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct in the array, or our pPMI variable in the C++ code.

x86-64
	mov         eax,dword ptr [rdi]
	imul        rcx,rax,128h
	lea         rax,[rcx+rbx]
	cmp         rbx,rax
	jae         @@2

So the first part of the for loop calculates the pEnd pointer in the rax register, and compares it with rbx, or with the pPMI variable, and then skips the loop altogether if the result is above-or-equal, in the jae @@2 instruction. This part was expected.

x86-64
	mov         rax,0DD67C8A60DD67C8Bh
	dec         rcx
	mul         rax,rcx
	mov         rdi,rdx
	shr         rdi,8
	inc         rdi

What is really unusual is the next chunk of code before the actual loop. The compiler calculates and uses its own internal counter in the rdi register, even though our original for loop did not specify one. It computes that counter up front using optimized division logic. The code is quite interesting, as it does not immediately look like a division at all.

What this assembly code does is equivalent to dividing rcx (which holds the total size of our array of RTL_PROCESS_MODULE_INFORMATION structs in bytes) by 128h, the size of a single struct in that array, and rounding the result up to the next integer. This is done purely to optimize away the division, using the technique known as "division by invariant integers" (multiplication by a precomputed reciprocal).

This is not a blog post on mathematics, but in a nutshell it works like this:
  • Say, we need to compute n / 128h.
  • We calculate the reciprocal of 128h, which is 1 / 128h, and multiply it by n. So our formula becomes (1 / 128h) * n.
  • Since we also need to round the result up to the next integer, we can multiply and divide the expression by a specially chosen constant Z: ((Z / 128h) * n) / Z.
  • Our goal is to pick Z in such a way that the division by Z is cheap for the CPU. Division by a power of two can be done with the right-shift instruction shr; here Z is 2^72, so dividing by it amounts to taking the high 64 bits of the 128-bit product (the rdx register after mul) and shifting them right by 8. The magic constant 0DD67C8A60DD67C8Bh is simply Z / 128h rounded up, i.e. ceil(2^72 / 128h).
  • Lastly, we decrement the original dividend by 1 (the dec rcx instruction) to handle values exactly divisible by 128h, and then increment the result (the inc rdi instruction) to round it up to the next integer.
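
To convince yourself that the arithmetic works out, here is a small standalone check (my own sketch, not part of the sample; it uses the MSVC _umul128 intrinsic to obtain the high 64 bits of the product, which is what ends up in rdx after the mul):

C++
#include <intrin.h>	// _umul128 (MSVC)
#include <cstdio>

int main()
{
	const unsigned long long M = 0xDD67C8A60DD67C8BULL;	// magic constant from the listing, ceil(2^72 / 0x128)
	const unsigned long long d = 0x128;			// sizeof(RTL_PROCESS_MODULE_INFORMATION)

	for (unsigned long long n = 1; n <= 100000; n++)	// n plays the role of NumberOfModules
	{
		unsigned long long bytes = n * d;		// rcx: total array size in bytes
		unsigned long long hi;
		_umul128(bytes - 1, M, &hi);			// "dec rcx" + "mul": hi is what lands in rdx
		unsigned long long q = (hi >> 8) + 1;		// "shr rdi,8" + "inc rdi"
		if (q != n)
			printf("mismatch at n = %llu\n", n);
	}
	printf("check done\n");
	return 0;
}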

x86-64
	add         rbx,22h

Additionally, the compiler chose to advance the rbx register, which stores the pointer to the current RTL_PROCESS_MODULE_INFORMATION struct used later in the loop, forward by 22h bytes with the add rbx,22h instruction. That is an interesting decision. My guess is that the compiler did it to avoid a displacement in one of the subsequent memory reads, as in the movzx ecx,word ptr [rbx] instruction.

x86-64
	sub         rdi,1
	jne         @@1

Also note that, quite interestingly, the compiler chose the sub rdi,1 instruction to decrement the counter instead of dec rdi, as I would have expected. The choice looks unusual at first, because sub rdi,1 translates into the 48 83 EF 01 machine code sequence, while dec rdi translates into 48 FF CF, which is one byte shorter. A likely explanation is that inc and dec update only part of the flags register (they leave CF untouched), which can create a partial flag-register dependency on some CPUs, so optimizing compilers tend to prefer the add/sub forms. (Both instructions take the same number of CPU cycles here, and the longer instruction did not help preserve the alignment of the code that follows either; as a matter of fact, the compiler still had to pad that block with a nop instruction further down the code to maintain the 16-byte alignment of the next loop.)

Efficiency Issues

Unlike my original expectations, this method of changing the for loop did not bring significant improvements over the previous method. Let's name what is lacking here efficiency-wise:

  1. Unnecessary calculation of the number of elements. The compiler computes the number of elements in the RTL_PROCESS_MODULE_INFORMATION array to derive its internal loop counter in rdi. As you can see in the source code, we specifically chose not to use a loop counter, but the compiler introduced one for us internally anyway.
  2. Checking pRPMs->NumberOfModules for non-zero twice. This is again an artifact of using a for loop: the condition is checked before the first iteration even though we already checked it earlier in the if statement.

The question then arises: do we even need a for loop here?

"Do-While" Loop

My friend Rbmm proposed his own version using a do-while loop instead. It may look something like this:

C++
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
ULONG nNumberOfModules = pRPMs->NumberOfModules;
if (!nNumberOfModules)
{
	status = STATUS_PROCEDURE_NOT_FOUND;
	goto cleanup;
}

wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);

RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;

do
{
	printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
		pPMI->ImageBase,
		pPMI->ImageSize,
		pPMI->Flags,
		pPMI->InitOrderIndex,
		pPMI->FullPathName
	);
}
while (pPMI++, --nNumberOfModules);

Compiled Result

Note the comma operator in the while condition: pPMI++ is evaluated only for its side effect of advancing the pointer, while the value of the whole expression, and therefore the loop condition, is --nNumberOfModules. After compilation this produces the following assembly code:

x86-64
	; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
	; ULONG nNumberOfModules = pRPMs->NumberOfModules;
	mov         edi,dword ptr [rsi]

	; if (!nNumberOfModules)
	test        edi,edi
	jne         @@0

	mov         esi,dword ptr [rbp-59h]
	mov         ebx,0C000007Ah
	jmp         @@cleanup
@@0:
	; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
	mov         edx,edi
	lea         rcx,[rip+0x1e20]			; "64-bit Modules (%u):\n"
	call        wprintf
	
	; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
	lea         rbx,[rsi+2Ah]
	nop         dword ptr [rax]

@@1:
	; printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e0c]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

	; while (pPMI++, --nNumberOfModules);
	lea         rbx,[rbx+128h]
	add         edi,0FFFFFFFFh
	jne         @@1

As you can see, the generated assembly code now closely mirrors our C++ code. It also eliminates the second check of the number of modules that we had in our previous for loops.

x86-64
	lea         rbx,[rsi+2Ah]

Then, before we enter the loop, the compiler sets the pointer to the RTL_PROCESS_MODULE_INFORMATION array in the rbx register, with a slight offset of 2Ah bytes. Such an offset allows the compiler to avoid a displacement in one of the instructions inside the loop, such as movzx ecx,word ptr [rbx].

x86-64
	movzx       ecx,word ptr [rbx]
	lea         rax,[rbx+6]
	mov         r9d,dword ptr [rbx-6]
	mov         r8d,dword ptr [rbx-0Ah]
	mov         rdx,qword ptr [rbx-12h]
	mov         qword ptr [rsp+28h],rax
	mov         dword ptr [rsp+20h],ecx
	lea         rcx,[rip+0x1e0c]			; "%p sz=%08X flg=%08X Ord=%02X %s\n"
	call        printf

In this case the loop is much more laconic and does pretty much the minimum required actions to fill in parameters for the printf call.

x86-64
	lea         rbx,[rbx+128h]
	add         edi,0FFFFFFFFh
	jne         @@1

Then, at the end of each iteration, the compiler advances rbx by 128h bytes to get to the next RTL_PROCESS_MODULE_INFORMATION struct, decrements the counter in edi by adding -1 to it with the add edi,0FFFFFFFFh instruction, and continues with the next iteration via the conditional jump jne @@1 ("Jump if not equal").

There is an interesting side note here as well. The compiler again avoided the plain old dec edi: the add edi,0FFFFFFFFh instruction compiles into the machine code 83 C7 FF, while dec edi is only FF CF, and both execute in the same number of CPU cycles, which would seem to give a clear advantage to the one-byte-shorter dec. The likely reason is the same partial flag-register dependency mentioned earlier: dec does not update CF, so the compiler prefers the add form.

Efficiency Issues

This loop seems to be the most optimized version that I could get with the Visual Studio C++ compiler, apart from coding it manually in assembly.

Conclusion

Even though the performance gains achieved in this blog post may not be very significant for an average C++ loop, for some applications adjusting your source code to produce better-performing assembly language code can be a crucial step.

Keep in mind the following:

  • Try to move components of the loop, such as the loop bound, into scope-local variables to help the compiler optimize the loop. (A short code recap follows this list.)
  • The most readable and beautiful C++ source code may not always generate the most efficient and robust assembly language code. Always check generated assembly language code for critical components of your program.
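
To recap the first bullet above in code form (generic, made-up names; this is just the pattern distilled from the tests above, not code from the project):

C++
#include <windows.h>

struct ITEM   { ULONG Value; };
struct HEADER { ULONG Count; ITEM Items[1]; };

void Print(const ITEM* p);	// some non-inlined call, e.g. a printf wrapper

void Dump(const HEADER* pHeader)
{
	// Cache the loop bound in a scope-local variable and walk a pointer,
	// which is roughly what the best-compiled versions above ended up doing.
	ULONG n = pHeader->Count;
	if (!n)
		return;

	const ITEM* p = pHeader->Items;
	do
	{
		Print(p);
	}
	while (p++, --n);
}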

And finally, do not rely on the specific examples given in this blog post to draw your conclusion. The only way to know for sure is to check your code in the debugger.

To view the assembly language code for your compiled source code in Visual Studio, first place a breakpoint at the location in your code that you want to view the generated assembly for. Then start debugging (by pressing the F5 key). When the debugger breaks on your breakpoint, while the cursor is still in your source code window, hit Ctrl+F11 to toggle the disassembly window.
