Intro
While writing our blog post on Windows APC,
my friend Rbmm and I had a discussion about the for
loop optimization in C++.
I want to expand it a little bit further in this post, as it may be interesting for C/C++ programmers that want to write efficient code.
Table Of Contents
For an easier navigation to each particular test, please use the following quick links:
- "For" Loop With An Index
- "For" Loop With An Index As A Local Variable
- "For" Loop With A Pointer As An Index
- "Do-While" Loop
Premise
I want to explain right off the bat what I will not be doing here. Even though we will be talking about efficiency I am not going to run any timing tests. Why?
Well, there are many reasons for that. To name just a few:
- It makes sense to run a timing test only for a specific algorithm or function for a particular environment. Otherwise, doing it for some arbitrarily made-up function is a moot point.
- Modern CPUs have a myriad of features that can influence timing tests from one hardware setup to another. To name just a few that can significantly skew timing test results: different cache amounts on the CPU, or types of cache available, cache alignment, parallel processing of instructions, branch prediction, hyperthreading, amount of RAM in the system, how much of it was free during the test, the type and speed of the system RAM, physical proximity of the RAM chips to the CPU (NUMA nodes), other running threads in the system, how busy the OS was during the test, and many many more.
- Timing tests also greatly depend on other variables that are not easy to control in your test environment. Things like: what was the cache state before the test? Did you run the first test that primed the cache so that things run faster now? Did the built-in antivirus software kick in to check the first run of your test app? And so forth.
- Moreover, I've seen quite a few timing tests conducted in a separate process specially built from scratch for the purpose of running such a test. But this is not a good example because an actual piece of code needing the test may not necessarily run in an isolated environment.
- And even if someone manages to run a slew of timing tests and present the averages, the results will be mostly appropriate for that particular system that the tests were conducted on, and thus sharing them in a blog post will not help someone else with a different hardware configuration.
So having said that, what is the goal of my blog post then?
The Goal
My goal is to study the efficiency of the C++ source code compiled with the Visual Studio compiler in relative terms, using assembly language instructions themselves.
It is a well-known fact that some x86 CPU instructions execute slower than others. For instance:
- Floating point instructions are generally slower on the CPU than similar integer instructions.
- An integer division, or
idiv
instruction is much slower than a logical shiftshr
, followed by an arithmetic addition or subtraction instructions. - A conditional branch instruction, e.g.
jnz
, is generally slower than an instruction without branching, e.g.cmovnz
. - A context switch to the kernel for thread is slower than a controlled
sysenter
instruction (available via many Windows API calls.)
Test Environment
Having named the types of tests that I will not be doing, let me describe my actual test environment:
- Visual Studio C++ optimizing compiler v.9.27.29111 for x64, that was supplied with the Visual Studio Community v.16.7.2.
- All tests were conducted on the
"Get Process Modules"
C++ sample code, created originally as a "Console Application" project in Visual Studio 2019, Community edition.
The actual loop being tested was a part of the
CollectModules
function in that project. - The C++ project was compiled for the
x64
platform, as theRelease
configuration, with the configuration parameters for the project pretty much kept as they were out-of-the-box after installation of the Visual Studio: - Configuration Properties > C/C++ > Optimization:
- Optimization: "Maximum Optimization (Favor Speed) (/O2)"
- Favor Size or Speed: Favor fast code (/Ot)
- Enable Fiber-Safe Optimizations: No
- Whole Program Optimization: Yes (/GL)
- Configuration Properties > C/C++ > Code Generation:
- Spectre Mitigations: Disabled
- Enable Intel JCC Erratum Mitigation:
Test Loops
I have compiled the
CollectModules
function with the following loops inserted into the part of the code that deals with the output of the results to the console.
Note that although each test sample deals with a very slow printf
-type function, its presence does not influence assembly language instructions that were
generated by the compiler within the loop itself.
Each of the following loops had a similar function.
"For" Loop With An Index
This is probably how most people would write the for
loop. We specify a variable that loops from 0 to the number of elements, and then use it as
an index in the original array of RTL_PROCESS_MODULES
objects that we are printing:
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
if (!pRPMs->NumberOfModules)
{
status = STATUS_PROCEDURE_NOT_FOUND;
goto cleanup;
}
wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
{
printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
,
pRPMs->Modules[m].ImageBase,
pRPMs->Modules[m].ImageSize,
pRPMs->Modules[m].Flags,
pRPMs->Modules[m].InitOrderIndex,
pRPMs->Modules[m].FullPathName
);
}
And just for completeness, each RTL_PROCESS_MODULES
object is declared as such:
typedef struct RTL_PROCESS_MODULE_INFORMATION {
HANDLE Section; // Not filled in
PVOID MappedBase;
PVOID ImageBase;
ULONG ImageSize;
ULONG Flags;
USHORT LoadOrderIndex;
USHORT InitOrderIndex;
USHORT LoadCount;
USHORT OffsetToFileName;
CHAR FullPathName[256];
} *PRTL_PROCESS_MODULE_INFORMATION;
typedef struct RTL_PROCESS_MODULES {
ULONG NumberOfModules;
RTL_PROCESS_MODULE_INFORMATION Modules[1];
} *PRTL_PROCESS_MODULES;
Compiled Result
Now let's review the compiled result for that specific loop:
; if (!pRPMs->NumberOfModules)
mov edx,dword ptr [rdi]
test edx,edx
jne @@0
; status = STATUS_PROCEDURE_NOT_FOUND;
; goto cleanup;
mov edi,dword ptr [rbp-59h]
mov ebx,0C000007Ah
jmp @@cleanup
@@0:
; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
lea rcx,[rip+0x1e33] ; "64-bit Modules (%u):\n"
call wprintf
; for (ULONG m = 0; m < pRPMs->NumberOfModules; m++)
xor ebx,ebx
cmp dword ptr [rdi],ebx
jbe @@2
nop dword ptr [rax] ; alignment to 16 bytes
; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
@@1:
mov eax,ebx
imul rdx,rax,128h
add rdx,rdi
movzx ecx,word ptr [rdx+2Ah]
lea rax,[rdx+30h]
mov r9d,dword ptr [rdx+24h]
mov r8d,dword ptr [rdx+20h]
mov rdx,qword ptr [rdx+18h]
mov qword ptr [rsp+28h],rax
mov dword ptr [rsp+20h],ecx
lea rcx,[rip+0x1e20] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
call printf
inc ebx
cmp ebx,dword ptr [rdi]
jb @@1
@@2:
Let's review what happens there step by step:
mov edx,dword ptr [rdi]
test edx,edx
jne @@0
mov edi,dword ptr [rbp-59h]
mov ebx,0C000007Ah
jmp @@cleanup
At first compiler checks that pRPMs->NumberOfModules
is not 0 and jumps to @@0
if not. Otherwise it sets the status code to 0C000007Ah
and jumps to the end
of the function at the @@cleanup
label. (I don't show it here as it's not relevant for our tests.)
Then note that the compiler needed to prepare two input parameters for the first call to wprintf
: One is the string itself, that went into rcx
according to the
x64 calling convention, and the second one is the pRPMs->NumberOfModules
that we used earlier. Thus note that the compiler had strategically preserved it from the previous comparison. Notice that the compiler chose to use the rdx
register for that,
which is not just a coincidence. That register is also used for the second input parameter for the call to wprintf
.
Then, as you can see, the compiler had to keep the pointer to the RTL_PROCESS_MODULES
struct, or pRPMs
, in the rdi
register.
It also allocated the rbx
register to contain 0 constant with the xor ebx,ebx
instruction. It's a quick way to zero out a register. Also note that compiler used
a shorter 32-bit version of the xor
instruction, since CPU zeroes out high 32 bits of a 64-bit register in that case, and thus affecting the entire rbx
.
After that, we compare our pRPMs->NumberOfModules
to 0 with the cmp dword ptr [rdi],ebx
instruction, and if the result is below or equal (for an unsigned comparison),
we skip the loop altogether by following the conditional jump with the jbe @@2
instruction. Translated to plain English, we skip the loop if pRPMs->NumberOfModules
is less than or equal to 0. (Which by itself is an interesting way of coding it in assembly, as an unsigned integer technically cannot be less than 0. So the logical way
to code it for a human being would be with the je
or jz
instruction instead. But the compiler simply reversed the logic of the C++'s m < pRPMs->NumberOfModules
comparison to come up with the jbe
instruction, or "Jump if below or equal".)
Note that this was the implementation of the first condition of the for
statement.
Now we come to an interesting part. The loop itself. As you can see, the compiler used a dud instruction nop dword ptr [rax]
to align the first instruction in the loop
on the 16-byte boundary. This is another optimization, needed to align the loop on the cache line to ensure that the CPU has to fetch as few instructions from
memory as possible. (The cache line is read by the CPU all at once, and thus it makes sense to align as many instructions as can fit into it.)
The loop then begins with the calculation of the offset from the beginning of the array, or pRPMs
, to the needed index. It seems like the index value, or our m
,
will be stored into the ebx
register. The compiler first moves it to rax
and multiplies it by the size of the RTL_PROCESS_MODULE_INFORMATION
struct, that
happens to be 128h
bytes. This is done by the imul rdx,rax,128h
instruction. The result of the multiplication is stored in rdx
. After that we add rdi
to rdx
to get the actual offset of our needed index in the pRPMs
array. (Remember rdi
stores the address of the beginning of the
RTL_PROCESS_MODULES
struct array.)
movzx ecx,word ptr [rdx+2Ah]
lea rax,[rdx+30h]
mov r9d,dword ptr [rdx+24h]
mov r8d,dword ptr [rdx+20h]
mov rdx,qword ptr [rdx+18h]
mov qword ptr [rsp+28h],rax
mov dword ptr [rsp+20h],ecx
lea rcx,[rip+0x1e20] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
call printf
Then the compiler begins setting up input parameters for the printf
call. For simplicity, let's review it from the call printf
instruction backwards.
It puts the pointer to the string with the specifiers in the rcx
register
(i.e. "%p sz=%08X flg=%08X Ord=%02X %s\n"
),
and then sets up the next 3 parameters in rdx
, r8
and r9
registers. The remaining 2 parameters have to be stored on the stack, according to the
calling convention,
so the compiler first copies them into ecx
and rax
registers, and then stores them on the stack at the [rsp+20h]
and [rsp+28h]
locations, respectively.
Finally, after the call to printf
, it increments the index in ebx
and compares it to our number of elements in pRPMs->NumberOfModules
with the
cmp ebx,dword ptr [rdi]
instruction. And the loop continues in case the index value in ebx
is below (in unsigned notation) than the number of elements.
This conditional jump is performed with the jb @@1
(or "Jump if below") instruction.
Efficiency Issues
Having seen the compiled assembly language code we can notice the following efficiency issues right off the bat:
- Multiplication of the index on each iteration. The
imul
instruction, even though significantly faster on modern CPUs, is still not the most efficient way to calculate the offset. I am actually quite surprised that the optimizing compiler in Visual Studio chose to go with this approach. This is exactly the literal translation of what the logic in the C++ source code was doing.It seems like the reason for the in-loop calculation of the offset (with the
imul
instruction) was chosen by the compiler because we used one of the members of thepRPMs
struct, or thepRPMs->NumberOfModules
, in thefor
statement itself. It seems like this prevented the compiler from applying optimization, since that member variable could be altered when we called the function inside the body of the loop. And since it was used as the end-loop parameter in thefor
statement, the compiler could not calculate it before running the code of the actual loop. - Use of memory reference for the index comparisons. The
cmp ebx,dword ptr [rdi]
instruction at the end of the loop was reading thedword ptr [rdi]
memory location on each iteration of the loop, which was not necessary as that value would not change during the execution of the loop. It would've been obviously stored in the CPU cache after the first read, which would speed things up, but still, relying on cache alone is not a good way to go, especially if we can easily store that value in one of the registers that we had available for us:rsi
,r12
,r13
,r14
,r15
, etc.
So the bottom line for the compiled C++ code in this case is that it will not be the most efficient way to go.
So let's see how can we improve it?
"For" Loop With An Index As A Local Variable
Let's make a small change to the C++ code that we reviewed earlier: Instead of referring to the counter of elements in the array
by its member variable in the RTL_PROCESS_MODULES
struct, let's move it to a scope-local variable instead:
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
ULONG nNumberOfModules = pRPMs->NumberOfModules; //Using scope-local variable to store the number of elements in the loop
if (!nNumberOfModules)
{
status = STATUS_PROCEDURE_NOT_FOUND;
goto cleanup;
}
wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
for (ULONG m = 0; m < nNumberOfModules; m++)
{
printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
,
pRPMs->Modules[m].ImageBase,
pRPMs->Modules[m].ImageSize,
pRPMs->Modules[m].Flags,
pRPMs->Modules[m].InitOrderIndex,
pRPMs->Modules[m].FullPathName
);
}
At first glance, the C++ code above seems like a slightly slower version of the original that adds one extra assignment in the form of
initialization of the local nNumberOfModules
variable.
Compiled Result
Let's see what it now compiles into:
; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
; ULONG nNumberOfModules = pRPMs->NumberOfModules;
mov esi,dword ptr [rdi]
; if (!nNumberOfModules)
test esi,esi
jne @@0
; status = STATUS_PROCEDURE_NOT_FOUND;
; goto cleanup;
mov edi,dword ptr [rbp-59h]
mov ebx,0C000007Ah
jmp @@cleanup
@@0:
; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
mov edx,esi
lea rcx,[rip+0x1e20] ; "64-bit Modules (%u):\n"
call wprintf
; for (ULONG m = 0; m < nNumberOfModules; m++)
test esi,esi
je @@2
lea rbx,[rdi+24h]
add rdi,30h
nop word ptr [rax+rax] ; alignment to 16 bytes
@@1:
; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
movzx eax,word ptr [rbx+6]
lea rcx,[rip+0x1e25] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
mov r9d,dword ptr [rbx]
mov r8d,dword ptr [rbx-4]
mov rdx,qword ptr [rbx-0Ch]
mov qword ptr [rsp+28h],rdi
mov dword ptr [rsp+20h],eax
call printf
add rdi,128h
lea rbx,[rbx+128h]
sub rsi,1
jne @@1
@@2:
As you can see, the assembly code is surprisingly more laconic now.
The compiler chose again to use the rdi
register to store the pointer to the local
RTL_PROCESS_MODULES
struct. But now it also caches the pRPMs->NumberOfModules
member in the esi
register and uses it later throughout
the code snippet above. The compiler can do this because we hinted to it with our addition of the following scope-local variable:
Because that is a scope-local variable, that is not indirectly referenced anywhere else in the code, this means that a compiler can cache it in a local register and gain the speed advantage. So the addition of the line of code above, that seems quite superfluous in the C++ code, was actually a performance gain.
lea rbx,[rdi+24h]
; ....
@@1:
movzx eax,word ptr [rbx+6]
lea rcx,[rip+0x1e25] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
mov r9d,dword ptr [rbx]
mov r8d,dword ptr [rbx-4]
mov rdx,qword ptr [rbx-0Ch]
mov qword ptr [rsp+28h],rdi
mov dword ptr [rsp+20h],eax
call printf
add rdi,128h
lea rbx,[rbx+128h]
sub rsi,1
jne @@1
Further more, if you look at the assembly code for the loop itself (after the @@1
tag), the compiler was able to eliminate the unnecessary calculation of an offset
of each element in the RTL_PROCESS_MODULE_INFORMATION
array using multiplication, and instead simply used another register rbx
to
keep the pointer to the current element in the array,
and gradually incremented it to the next element at the end of the loop with the use of the lea rbx,[rbx+128h]
instruction. Remember 128h
is the size of each
RTL_PROCESS_MODULE_INFORMATION
struct in bytes.
Note that thelea
instruction is faster than theimul
instruction that was used in the first compiled sample.
Note that compiler had also cached the pointer to the FullPathName
member of the RTL_PROCESS_MODULE_INFORMATION
struct in the array
in the rdi
register.
But that also involved some additional steps, such as first calculating it before the loop with the add rdi,30h
instruction, and then also by incrementing it during
the loop itself with the add rdi,128h
instruction.
And as the loop counter, this time the compiler chose to decrement a previously assigned pRPMs->NumberOfModules
member in the rsi
register, and check it for non-zero
with the jne @@1
("Jump if not zero") instruction at the end of the loop. This is a more efficient way of checking the loop counter
than before.
Efficiency Issues
As you can see, the assembly language code produced by the compiler this time looks much more efficient with the addition of just one line of source code and by moving the loop counter into a scope-local variable. This may be counterintuitive for the C++ language at first, but it made that loop more optimized and thus more efficient.
The reason for such gain in efficiency is this: the compiler knows that a scope-local variable cannot be referenced anywhere outside of the local scope in the C/C++ language,
and thus it can be free to optimize the use of such variable. Thus by moving the pRPMs->NumberOfModules
into a local variable we hinted to the compiler
that it can optimize it in a register, or any other way it needs to.
But the assembly code produced still has some efficiency issues:
- Too many variables in the loop. We have to not only keep track of the pointer to the current
RTL_PROCESS_MODULE_INFORMATION
struct in therbx
register, but also to keep track of another pointer inrdi
, and of the counter inrsi
. That's too many counters for one loop. - Checking
pRPMs->NumberOfModules
for non-0 twice. This is not a huge deal breaker, but we checkedpRPMs->NumberOfModules
for being 0 and returned an error before the loop, but then it is also checked again in thefor
loop.
Let's see if we can optimize it a little bit more.
"For" Loop With A Pointer As An Index
In this test, let's try to hint the compiler even more by using a pointer to an array of RTL_PROCESS_MODULE_INFORMATION
structs in our pPMI
:
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
if (!pRPMs->NumberOfModules)
{
status = STATUS_PROCEDURE_NOT_FOUND;
goto cleanup;
}
wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
{
printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
,
pPMI->ImageBase,
pPMI->ImageSize,
pPMI->Flags,
pPMI->InitOrderIndex,
pPMI->FullPathName
);
}
Compiled Result
This C++ source code compiled into the following assembly language snippet:
; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
mov edx,dword ptr [rdi]
lea rbx,[rdi+8]
; if (!pRPMs->NumberOfModules)
test edx,edx
jne @@0
mov edi,dword ptr [rbp-59h]
mov ebx,0C000007Ah
jmp @@cleanup
@@0:
; wprintf(L"64-bit Modules (%u):\n", pRPMs->NumberOfModules);
lea rcx,[rip+0x1e1a] ; "64-bit Modules (%u):\n"
call wprintf
; for (RTL_PROCESS_MODULE_INFORMATION* pEnd = pPMI + pRPMs->NumberOfModules; pPMI < pEnd; pPMI++)
mov eax,dword ptr [rdi]
imul rcx,rax,128h
lea rax,[rcx+rbx]
cmp rbx,rax
jae @@2
add rbx,22h
mov rax,0DD67C8A60DD67C8Bh
dec rcx
mul rax,rcx
mov rdi,rdx
shr rdi,8
inc rdi
nop ; alignment to 16 bytes
@@1:
; printf("%p sz=%08X flg=%08X Ord=%02X %s\n"
movzx ecx,word ptr [rbx]
lea rax,[rbx+6]
mov r9d,dword ptr [rbx-6]
mov r8d,dword ptr [rbx-0Ah]
mov rdx,qword ptr [rbx-12h]
mov qword ptr [rsp+28h],rax
mov dword ptr [rsp+20h],ecx
lea rcx,[rip+0x1ded] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
call printf
lea rbx,[rbx+128h]
sub rdi,1
jne @@1
@@2:
Wow! This looks like a lot more assembly code, doesn't it?
The first part of the code before the for
loop has nothing special in it. But the for
loop itself compiled into an interesting construct.
Let's review it.
As before, the rdi
register was originally set to point to the RTL_PROCESS_MODULES
struct. Additionally, in this case, the compiler chose
to use the rbx
register to keep the pointer to the current RTL_PROCESS_MODULE_INFORMATION
struct in the array, or our pPMI
variable in the
C++ code.
So the first part of the for
loop calculates the pEnd
pointer in the rax
register, and compares it with rbx
, or with the pPMI
variable, and then skips the loop
altogether if the result is above-or-equal, in the jae @@2
instruction. This part was expected.
What is really unusual is the next chunk of code before the actual loop. The compiler tries to calculate and use its own internal counter in the rdi
register,
even though our original for
loop did not specify it. It does so by computing such counter first using an optimized division logic. That code sample is quite interesting,
as it doesn't immediately imply division.
What this assembly code does is equivalent to dividing rcx
(which is the total size of our array of RTL_PROCESS_MODULE_INFORMATION
structs in bytes)
by 128h
, which is the size of a single struct in that array. It then rounds up any remainder to an integer. This is done only for the purpose of optimizing a division,
using the algorithm called
"Division by invariant integers".
This is not a blog post on mathematics, but in a nutshell it works as such:
- Say, we need to compute
n / 128h
.- We calculate the reciprocal of
128h
, which is1 / 128h
, and multiply it byn
. So our formula becomes(1 / 128h) * n
.- Since we also need to round our result up to the next integer, we can multiply and divide both parts by a special constant
Z
:((Z * 1 / 128h) * n) / Z
- Our final goal is to pick
Z
in such a way that our division could be optimized by a CPU. In this case division by a power-of-2 can be greatly sped up with the use of the right-shift instructionshr
.- Lastly, we decrement our original dividend by 1 (with
dec rcx
instruction) to account for the marginal values (or divisible by128h
), and then increment the result (withinc rdi
instruction) to round it to the next integer.
Additionally, the compiler chose to move the rbx
register storing a pointer to the current RTL_PROCESS_MODULE_INFORMATION
struct, that will be used later in the loop,
by 22h
bytes forward in the add rbx,22h
instruction. That is an interesting decision. My guess is that the compiler did it to avoid using a displacement in one of the
following memory reads, like in this instruction: movzx ecx,word ptr [rbx]
.
Also note, that quite interestingly, the compiler chose to use sub rdi,1
instruction to decrement the counter instead of the dec rdi
, as I would've expected.
This choice in quite unusual because sub rdi,1
instruction translates into 48 83 EF 01
machine code sequence, while dec rdi
into 48 FF CF
, which is one byte shorter.
I can't really explain the decision to use the sub
instruction, other than Microsoft just didn't care about saving one byte in the resulting code and didn't want to adjust
their compiler logic for this specific case. (Note that both instructions in this case are equivalent in the number of CPU cycles that it takes to execute them. Nor
that by using a longer instruction the compiler would preserve the alignment of the further block of code. Matter of fact, it still had to pad that block with the
nop dword ptr [rax]
instruction further down the code to maintain the 16-byte alignment for the next loop that followed.)
Efficiency Issues
Unlike my original expectations, this method of changing the for
loop did not bring significant improvements over the previous method.
Let's name what is lacking here efficiency-wise:
- Unnecessary calculation of number of elements. This is the calculation of the number of elements in the
RTL_PROCESS_MODULE_INFORMATION
array to determine the internal loop counter inrdi
. As you can see in the source code, we specifically chose not use a loop counter, but the compiler used it for us internally anyway. - Checking
pRPMs->NumberOfModules
for non-0 twice. This is again a rudiment of using thefor
loop, that checks it on the first iteration after we checked it earlier in theif
statement.
The question then arises. Do we even need a for
loop here?
"Do-While" Loop
My friend Rbmm proposed his own version of the loop using the
do-while
loop instead. It may look like something like this:
// pRPMs of type PRTL_PROCESS_MODULES
pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
ULONG nNumberOfModules = pRPMs->NumberOfModules;
if (!nNumberOfModules)
{
status = STATUS_PROCEDURE_NOT_FOUND;
goto cleanup;
}
wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
do
{
printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
pPMI->ImageBase,
pPMI->ImageSize,
pPMI->Flags,
pPMI->InitOrderIndex,
pPMI->FullPathName
);
}
while (pPMI++, --nNumberOfModules);
Compiled Result
And after compilation this will produce the following assembly code:
; pRPMs = (PRTL_PROCESS_MODULES)pThisBaseAddr;
; ULONG nNumberOfModules = pRPMs->NumberOfModules;
mov edi,dword ptr [rsi]
; if (!nNumberOfModules)
test edi,edi
jne @@0
mov esi,dword ptr [rbp-59h]
mov ebx,0C000007Ah
jmp @@cleanup
@@0:
; wprintf(L"64-bit Modules (%u):\n", nNumberOfModules);
mov edx,edi
lea rcx,[rip+0x1e20] ; "64-bit Modules (%u):\n"
call wprintf
; RTL_PROCESS_MODULE_INFORMATION* pPMI = pRPMs->Modules;
lea rbx,[rsi+2Ah]
nop dword ptr [rax]
@@1:
; printf("%p sz=%08X flg=%08X Ord=%02X %s\n",
movzx ecx,word ptr [rbx]
lea rax,[rbx+6]
mov r9d,dword ptr [rbx-6]
mov r8d,dword ptr [rbx-0Ah]
mov rdx,qword ptr [rbx-12h]
mov qword ptr [rsp+28h],rax
mov dword ptr [rsp+20h],ecx
lea rcx,[rip+0x1e0c] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
call printf
; while (pPMI++, --nNumberOfModules);
lea rbx,[rbx+128h]
add edi,0FFFFFFFFh
jne @@1
As you can see, generated assembly code now more closely resembles our C++ code. It also eliminates the second check for pRPMs->NumberOfModules
that we had in our previous for
loop.
Then before we enter the loop, the compiler sets the current pointer to the RTL_PROCESS_MODULE_INFORMATION
struct array
in the rbx
register with a slight offset of 2Ah bytes. Such offset will allow compiler to avoid using the displacement in one
of the instructions in the loop itself, such as movzx ecx,word ptr [rbx]
.
movzx ecx,word ptr [rbx]
lea rax,[rbx+6]
mov r9d,dword ptr [rbx-6]
mov r8d,dword ptr [rbx-0Ah]
mov rdx,qword ptr [rbx-12h]
mov qword ptr [rsp+28h],rax
mov dword ptr [rsp+20h],ecx
lea rcx,[rip+0x1e0c] ; "%p sz=%08X flg=%08X Ord=%02X %s\n"
call printf
In this case the loop is much more laconic and does pretty much the minimum required actions to fill in parameters for the printf
call.
Then as a final iteration, the compiler forwards rbx
by 128h bytes to get to the next RTL_PROCESS_MODULE_INFORMATION
struct, and decrements edi
counter by subtracting 1
from it with the add edi,0FFFFFFFFh
instruction, and continues to the next loop iteration via a conditional jump with the jne @@1
("Jump if not-equal") instruction.
There's an interesting side note. It still baffles me why Microsoft compiler chose to useadd edi,0FFFFFFFFh
instruction instead of just plain olddec edi
, because the former one compiles into the following machine code:83 C7 FF
, while the latter one only into:FF CF
. Additionally both instructions execute within an equal number of CPU cycles, which gives a clear advantage to thedec
instruction, as being one byte shorter.
Efficiency Issues
This loop seems to be the most optimized version that I could get with the Visual Studio C++ compiler, apart from coding it manually in assembly.
Conclusion
Even though some gains in performance that were achieved in this blog post were not very significant for an average C++ loop, for some applications it may be a crucial step to adjust your source code to achieve a better performing assembly language code.
Keep in mind the following:
- Try to move components of the loop into scope-local variables to achieve better optimization of the loop.
- The most readable and beautiful C++ source code may not always generate the most efficient and robust assembly language code. Always check generated assembly language code for critical components of your program.
And finally, do not rely on the specific examples given in this blog post to draw your conclusion. The only way to know for sure is to check your code in the debugger.
To view the assembly language code for your compiled source code in Visual Studio, first place a breakpoint at the location of your code that you want to view generated assembly code for. Then start debugging (by pressing F5 key.) When the debugger breaks on your breakpoint, while the cursor is still in your source code window, hit Ctrl+F11 to toggle the disassembly window.