This article contains functions and features that are not documented by the original manufacturer. If you follow the advice in this article, you do so at your own risk. The methods presented in this article may rely on internal implementation and may not work in the future.
Intro
One of my previous clients had an interesting dilemma at hand. Their commercial software was running without any hitches, but the users were complaining about its slow performance. When their own developers tried to profile that software they came up empty-handed, having produced just a long list of stats without any single function standing out from the rest.
I analyzed their source code and found one peculiar implementation. Some previous developer had written his (or her) own version of a critical section (or a synchronization lock.) At the time it probably seemed like a great idea. But after I had analyzed their implementation and written a small POC app to demonstrate its performance, I was ready to present my findings. I will share those with the readers of my blog as well.
Table Of Contents
This is another blog post that needs a table of contents:
- Critical Section
- Kernel Synchronization Objects
- Entering Kernel From a User-Mode
- The Cost of a SYSCALL
- Entering SYSCALL
- Leaving SYSCALL
- Alternative Exit
- KiServiceInternal
- Zw* Kernel Functions Prolog
- Nt* Kernel Functions
- POC To Illustrate The Performance Impact
- Conclusion
Critical Section
In Windows terminology, a critical section is a fast synchronization lock that can be used inside a single process. Or, in other words, it cannot be used across two processes. But why would you limit it to a single process?
You can read more detailed information about critical sections in the MSDN page.
Also, if you remember your computer science class, what is the difference between a critical section and a mutex? Or a semaphore? Or even an event? All of those objects can be used as a synchronization lock. Sure, let's exclude for now the "same process" limitation. But why can't we use a mutex or an event instead of a critical section? What is the point of having a critical section at all? Don't we have several synchronization objects already in that list?
Those are all valid questions.
Luckily C++'s STL makes a distinction internally. For their std::mutex class they use SRW locks in Windows. We'll get back to it in some later blog post though. There's too much to cover here.
To understand the difference, the best way is to look under the hood of those synchronization primitives. Which we will do in this post.
For now though, let me quickly explain the difference.
Critical Section Internals
A critical section in Windows is called a "fast user-mode lock" for a reason. To understand why, let's review how it works. In a very basic approximation it functions using the following principle:
The following is the result of reverse engineering it on Windows 10. Keep in mind that the exact sequence of the instructions that I present below may change from one build of the operating system to another.
- When you enter a critical section (with a call to the EnterCriticalSection API) that function quickly checks if the critical section was previously entered from another thread. And if not, it returns without doing much work.
Let's review what it looks like at the Assembly language level.
The EnterCriticalSection function call is forwarded to the ntdll!RtlEnterCriticalSection function, which begins as follows:
RtlEnterCriticalSection:
sub rsp, 28h
mov rax, gs:30h ; RAX = TEB*
lock btr dword ptr [rcx+8], 0 ; Check & reset bit-0 in RTL_CRITICAL_SECTION::LockCount
mov rax, [rax+48h] ; RAX = thread ID
jnb lbl_cs_entered ; jump if bit-0 was not set
mov [rcx+10h], rax ; RTL_CRITICAL_SECTION::OwningThread = RAX
xor eax, eax
mov dword ptr [rcx+0Ch], 1 ; RTL_CRITICAL_SECTION::RecursionCount = 1
add rsp, 28h
retn
The sequence above is very simple. At first the function checks and resets bit-0 of the LockCount member in the provided CRITICAL_SECTION structure with the atomic btr instruction. If that bit was set (as it will be the first time that critical section is entered), the call simply remembers the ID of the thread that called EnterCriticalSection in OwningThread, sets RecursionCount to 1, and returns.
As you can see, that sequence takes almost no time to execute.
By the way, the CRITICAL_SECTION structure is equivalent to the native RTL_CRITICAL_SECTION, and is declared as such:
typedef struct _RTL_CRITICAL_SECTION {
PRTL_CRITICAL_SECTION_DEBUG DebugInfo; //Used only if the process is being debugged
//
// The following three fields control entering and exiting the critical
// section for the resource
//
/* 8h */ LONG LockCount; // Bit-0 will be reset if CS was entered once
/* 0Ch */ LONG RecursionCount; // Count of how many times CS was entered from the same thread
/* 10h */ HANDLE OwningThread; // Thread ID that first entered the critical section
/* 18h */ HANDLE LockSemaphore;
/* 20h */ ULONG_PTR SpinCount;
} RTL_CRITICAL_SECTION, *PRTL_CRITICAL_SECTION;
In case bit-0 of LockCount was initially reset, this means that the critical section was previously entered. So in that case we switch to another branch at the lbl_cs_entered label:
lbl_cs_entered:
cmp [rcx+10h], rax ; Check if thread ID is the same
jnz lbl_cs_locked ; Jump if not
inc dword ptr [rcx+0Ch] ; Increment RTL_CRITICAL_SECTION::RecursionCount
xor eax, eax
add rsp, 28h
retn
The code above checks if the current thread ID is the same as the thread that initially entered the critical section, and if so, it simply increments the RecursionCount variable of the CRITICAL_SECTION structure. This allows the critical section not to lock the thread in case of a repeated, recursive entering from the same thread.
But, if this is a different thread, this means that our critical section is locked and we need to wait. Thus we jump to the lbl_cs_locked label. There we can see a call to the internal RtlpEnterCriticalSectionContended function that executes further logic.
- Thus if the critical section is already held by another thread, the internal logic in the RtlpEnterCriticalSectionContended function assumes that it will be a transient state, and enters what is known as a spin-lock, in which the thread spins for a very short time while checking the state of the critical section.
The exact implementation of the RtlpEnterCriticalSectionContended function is too much to cover in this blog post. Let's just say that it is considerably more complex than what we saw above.
The interesting part is the spin-lock itself:
xor ecx, ecx ; Set count to 0
nop word ptr [rax+rax+00h] ; Alignment on 16 bytes for the loop that follows
lbl_repeat_spin_lock:
mov eax, [r9] ; EAX = RTL_CRITICAL_SECTION::LockCount
test r13b, al ; See if bit-0 is set in EAX (R13 = 1)
jnz lbl_lock_changed ; If yes, leave spin-lock if CS lock count was changed
cmp ecx, ebx ; Check this spin-lock count
jz lbl_timed_out ; Leave if spin-lock timed out
pause ; CPU instruction to wait
inc ecx ; Increment spin-lock count
jmp lbl_repeat_spin_lock ; Start over
As you can see, the spin-lock is a simple loop that counts to a certain number. That number comes from the spin-count, which you can set by calling the InitializeCriticalSectionAndSpinCount function, or it is set automatically to 2000 if you use the InitializeCriticalSection function to initialize your critical section.
The count for the spin-loop above is initially held in the EBX register and is calculated by multiplying the spin-count for the critical section by 10 and then by dividing it by some predefined constant. In my tests it was 47.
So if the initial spin-count was 2000, the counter of iterations for the loop above becomes 425, or 2000 * 10 / 47.
To be honest, I have no idea where 10 and 47 come from in that formula.
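The arithmetic above can be checked with a few lines of C. This is only a sketch of the formula as described in this post: the function name spin_loop_iterations is made up for illustration, and the divisor is passed in as a parameter because 47 is just the value observed in my tests, not a documented constant.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a Windows API): models the observed formula
   iterations = spin_count * 10 / divisor, using integer division
   like the original code appears to do. */
static uint32_t spin_loop_iterations(uint32_t spin_count, uint32_t divisor)
{
    /* Widen to 64 bits so that spin_count * 10 cannot overflow. */
    return (uint32_t)(((uint64_t)spin_count * 10u) / divisor);
}
```

Calling spin_loop_iterations(2000, 47) yields 425, because 20000 / 47 is 425.53 and the fractional part is discarded.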
Then the CPU spins in the spin-lock loop, periodically checking if the LockCount variable of the CRITICAL_SECTION has changed, or until the loop times out by running out of iterations.
One important aspect to note here is that the spin-loop has the pause instruction that informs the CPU about a pending wait-loop. This helps with the power consumption, optimizes the CPU cache use, and eliminates a possibility of the "expensive" memory order violations (i.e. mis-speculations) when leaving the spin-loop.
The important assumption is that the spin-loop is relatively short. This obviously helps to reduce the waste of CPU cycles for a longer wait, but also removes the need to enter the kernel for a shorter lock duration. That is why a critical section is recommended for locks that get unlocked fairly quickly.
- If the critical section is released by another thread while our thread is spinning in the spin-lock, our thread is released from the spin cycle and enters the critical section.
This happens in the lbl_lock_changed conditional branch of the Assembly code above (after some additional checks.)
Note that the entire spin-lock resides in the user-mode, which is important for performance. This will become apparent later.
- But, in the worst case scenario, if the spin-lock times out and the critical section is still held by another thread, the waiting thread enters the waiting state in the kernel in the lbl_timed_out branch of the Assembly code above.
At this point the critical section implementation stops being efficient and continues on the path to the kernel through the sequence of the following function calls: RtlpWaitOnCriticalSection, RtlpWaitOnAddress, RtlpWaitOnAddressWithTimeout and finally NtWaitForAlertByThreadId that takes it to the kernel for an extended wait.
Such wait state may technically continue indefinitely.
Also note that at this stage the critical section is no longer an efficient synchronization lock because of the wasted CPU cycles at the spin-lock, and thus you should generally avoid using a critical section when the locked portion of the code takes a significantly longer time to run.
- If all goes well and the other thread that was holding the critical section releases it, that thread will release our thread from the kernel wait by calling the RtlpWakeByAddress function, which in turn wakes the waiting thread by calling NtAlertThreadByThreadId.
You can read more about the NtWaitForAlertByThreadId and NtAlertThreadByThreadId functions in my other blog post.
Because of the implementation that I outlined above, a critical section is usually intended for a very transient lock for the processing that doesn't take too long to complete.
Also, because the speed of a critical section is achieved by avoiding entering the kernel (by using the spin-lock loop), its use is limited to a single process only. But why? To be able to check its state while spinning, the spin-lock needs to access some shared memory inside the critical section object, which will not be possible for different processes without entering the kernel because of process isolation in Windows.
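To make the uncontended path easier to follow, here is a minimal portable sketch of it in C11. It is not the Windows implementation: the toy_cs type and both functions are hypothetical names invented for this illustration, the structure layout is simplified, and the contended path (spinning and the kernel wait) is left out entirely. It only mirrors the three outcomes described above: first entry, recursive entry from the same thread, and contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for RTL_CRITICAL_SECTION. */
typedef struct {
    atomic_int       lock_count;      /* bit 0 set means "not locked", as in the listing */
    int              recursion_count; /* only touched by the owning thread */
    atomic_uintptr_t owning_thread;   /* caller-supplied thread ID */
} toy_cs;

static void toy_cs_init(toy_cs *cs)
{
    atomic_init(&cs->lock_count, -1);  /* all bits set, so bit 0 is set: unlocked */
    cs->recursion_count = 0;
    atomic_init(&cs->owning_thread, 0);
}

/* Returns true if the section was entered without waiting.
   Returns false where the real code would go on to spin and,
   eventually, wait in the kernel. */
static bool toy_cs_try_enter_fast(toy_cs *cs, uintptr_t thread_id)
{
    /* Atomically clear bit 0 and look at its previous value,
       similar in spirit to "lock btr dword ptr [rcx+8], 0". */
    int old = atomic_fetch_and(&cs->lock_count, ~1);
    if (old & 1) {
        /* Bit 0 was set: nobody held the lock, so we own it now. */
        atomic_store(&cs->owning_thread, thread_id);
        cs->recursion_count = 1;
        return true;
    }
    if (atomic_load(&cs->owning_thread) == thread_id) {
        /* Recursive entry from the owning thread. */
        cs->recursion_count++;
        return true;
    }
    return false; /* held by another thread */
}
```

Note that no system call is involved anywhere in this path, which is the whole point of the design.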
Kernel Synchronization Objects
On the other hand, a wait on a kernel object is done in the kernel. When you call the WaitForSingleObject or a similar function that waits for a kernel object (such as event, semaphore, Win32 mutex, or others) it enters the kernel without any pre-processing and all the waiting is done in the kernel.
This works fine except for the condition when the kernel object did not require any waiting. Say, if the event was signaled, or a mutex was not acquired. In that case a trip to the kernel and back will not be justified for an intra-process lock.
And that is the main difference between a critical section and a Win32 mutex, event, or a semaphore.
But why does a trip to the kernel and back make a difference?
This post will give you the answer by analyzing the Assembly code that has to run just for a trip to the kernel and to return back.
Entering Kernel From a User-Mode
Let's look at the WaitForSingleObject function. This is the function that can be used when we want to enter a lock. It checks if the lock is acquired, and if not, it acquires it and returns control back. And if the lock was already acquired when we called WaitForSingleObject, it then waits for a provided time interval until the lock is released.
To simplify our reverse engineering work, we can quickly step through the WaitForSingleObject API and notice that internally it calls the native function, called NtWaitForSingleObject, that in itself is just a wrapper for the following syscall:
mov r10, rcx ; Save RCX in R10
mov eax, 4 ; "System service number" of the system call
test byte ptr [7FFE0308h], 1 ; Check a flag in the shared user data page
jnz lbl_alt
syscall ; Initiate a trip to the kernel
retn
lbl_alt:
int 2Eh ; Alternative (slower) route to the kernel
retn
If we step through the Assembly instructions above, and try to step into the syscall instruction, it will execute immediately, as if that system call was just a single monolithic instruction.
But what is really happening there?
The user-mode debugger will not be able to step into the kernel mode through the syscall instruction, like it would otherwise if it was just a local call instruction. So we need to use some clever trick to do it manually.
If you know how Microsoft structured their native ntdll.dll library in Windows NT, you'd remember that most of its functions are mirrored in the kernel, with a slight difference between the Nt and Zw prefixes on function names. Thus, we may assume that there's the NtWaitForSingleObject function in the kernel that does all the work by waiting for the object.
We can verify this by searching the ntoskrnl.exe file in the C:\Windows\System32\ directory for the NtWaitForSingleObject export name with the WinAPI Search tool. Such export indeed exists.
But what happens after the syscall instruction in the kernel and before we get to the NtWaitForSingleObject function?
Let's review it next...
The Cost of a SYSCALL
Let's look at the Assembly code from the Windows kernel-side right after the CPU executes the syscall instruction. There are technically two sides of the coin: entering the kernel, and leaving it. Let's review them one by one.
Entering SYSCALL
As we saw above, the execution enters the kernel mode with the syscall instruction. But what exactly happens there?
syscall Instruction
I won't go too deep into it (you can look it up yourself in the official Intel documentation.) In a nutshell, the syscall instruction performs the following steps:
Further on, I will be assuming a 64-bit CPU mode, and a 64-bit Windows 10 OS.
- RCX register is set to the address of the instruction following the syscall.
- R11 register is set to the value of RFLAGS.
- RFLAGS register is masked with the IA32_FMASK MSR (at address C0000084h). Each bit that is set in the value of IA32_FMASK is cleared in RFLAGS.
MSR stands for "Model Specific Register" - these are Intel's architectural registers that usually convey certain information about the CPU, or allow issuing privileged control commands to the CPU. MSRs are available only in the kernel-mode, or in ring-zero.
The value of IA32_FMASK MSR on my Windows 10 is 00000000`00004700h, which means that the RFLAGS register will have the following flags cleared during a syscall: TF - "Trap Flag", IF - "Interrupt Flag", DF - "Direction Flag", and NT - "Nested Task".
The action above will disable maskable interrupts, among others.
- CS code segment is set as follows (without checking permissions):
  - Selector for CS is set from bits [32-47] of IA32_STAR MSR (at address C0000081h), AND-ed with FFFCh, thus resetting its RPL/CPL to 0.
  - The following CS code segment attributes are set: CS.Base=0, CS.Limit=FFFFFh, CS.Type=1011b ("Nonconforming, execute & read code segment, accessed"), CS.S=1, CS.DPL=0, CS.P=1, CS.L=1 (64-bit long mode), CS.D=0, CS.G=1 ("4-KByte granularity"), CPL=0 (for "ring-zero").
- SS stack segment is set as follows (without checking permissions):
  - Selector for SS is set from bits [32-47] of IA32_STAR MSR (at address C0000081h), plus the value of 8h, thus making the SS segment follow the CS segment.
  - The following SS segment attributes are set: SS.Base=0, SS.Limit=FFFFFh, SS.Type=0011b ("Read/write data segment, accessed"), SS.S=1, SS.DPL=0, SS.P=1, SS.B=1 ("expand-up, 32-bit stack segment"), SS.G=1 ("4-KByte granularity").
- RIP register is set to the value of the IA32_LSTAR MSR (at address C0000082h).
Note that for AMD processors, it is set slightly differently.
By the way, you can probably see why the user-mode portion of the transition to the kernel in ntdll saved the RCX register in R10. The former one is a part of the x64 calling convention for the 1st function parameter, but it is also overwritten by the syscall instruction, and thus needs to be temporarily saved. Microsoft chose the R10 volatile register to do that.
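The register effects listed above can be modeled in a few lines of C. This is a simplified sketch, not an emulator: the cpu_regs and syscall_msrs types and the emulate_syscall function are names invented for this illustration, the hidden segment attributes are not modeled, and any MSR values other than the IA32_FMASK value of 4700h quoted above are up to the caller and should be treated as sample inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, minimal register file for this illustration only. */
typedef struct {
    uint64_t rcx, r11, rip, rflags;
    uint16_t cs, ss;
} cpu_regs;

/* The three MSRs that the syscall instruction consults. */
typedef struct {
    uint64_t star;   /* IA32_STAR,  C0000081h */
    uint64_t lstar;  /* IA32_LSTAR, C0000082h */
    uint64_t fmask;  /* IA32_FMASK, C0000084h */
} syscall_msrs;

/* Applies the architectural effects of a 64-bit syscall to the
   visible registers. next_rip is the address of the instruction
   that follows the syscall. */
static void emulate_syscall(cpu_regs *r, const syscall_msrs *m, uint64_t next_rip)
{
    uint16_t sel = (uint16_t)(m->star >> 32);  /* bits [32-47] of IA32_STAR */

    r->rcx = next_rip;             /* return address for the kernel */
    r->r11 = r->rflags;            /* saved copy of the user-mode RFLAGS */
    r->rflags &= ~m->fmask;        /* every bit set in IA32_FMASK is cleared */
    r->cs = (uint16_t)(sel & 0xFFFC);  /* RPL forced to 0 */
    r->ss = (uint16_t)(sel + 8);   /* SS follows CS */
    r->rip = m->lstar;             /* kernel entry point */
}
```

For example, with RFLAGS equal to 246h (the interrupt flag set) and IA32_FMASK equal to 4700h, the resulting RFLAGS is 46h, which shows the maskable interrupts being disabled on entry.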
This elaborate sequence is actually called a "fast" transition to the kernel. Why? Because a more involved (and older) way of transitioning to the kernel is done via a software interrupt with the int instruction. It involves way more steps than what I outlined above and is much slower.
Still, even though it's called "fast", a syscall instruction is anything but fast.
Beginning of The System Call Handler
Let's see what happens when we get to the kernel right after the syscall instruction.
The RIP register will point to the nt!KiSystemCall64Shadow entry point in the kernel and the execution begins in ring-zero.
Keep in mind that the following is mostly a reverse engineered and undocumented code that may change at any moment. Do not rely on it in your production environment!
I'll try to add some brief comments with descriptions in the Assembly code below:
swapgs ; gs = KPCR + KPRCB
mov qword ptr gs:[9010h], rsp ; KPRCB::UserRspShadow (user stack pointer)
mov rsp, qword ptr gs:[9000h] ; RSP = KPRCB::KernelDirectoryTableBase
bt dword ptr gs:[9018h], 1 ; KPRCB::ShadowFlags
jb lbl_01
mov cr3, rsp ; Meltdown mitigation: Kernel Virtual Address Shadow (KVAS)
lbl_01:
mov rsp, qword ptr gs:[9008h] ; RSP = KPRCB::RspBaseShadow (kernel stack)
push 2Bh ; Dummy SS selector
push qword ptr gs:[9010h] ; Save KPRCB::UserRspShadow (user stack pointer)
push r11 ; Save previous RFLAGS
push 33h ; Dummy 64-bit CS selector (Also used as 'is_syscall'. We set it to 33h here)
push rcx ; Save return address to the user-mode
mov rcx, r10 ; Restore 1st parameter back into RCX
sub rsp, 8
push rbp ; Save previous RBP
sub rsp, 158h ; Reserve space for local variables
lea rbp, [rsp+80h] ; RBP = frame pointer on the stack
mov qword ptr [rbp+0C0h], rbx ; Save nonvolatile registers
mov qword ptr [rbp+0C8h], rdi
mov qword ptr [rbp+0D0h], rsi
test byte ptr [ntkrnlmp!KeSmapEnabled], 0FFh ; SMAP
je lbl_02
test byte ptr [rbp+0F0h], 1
je lbl_02 ; Jump if(is_syscall & 1) == 0
stac ; Set RFLAGS.AC to allow kernel accesses to user-mode pages under SMAP
lbl_02:
mov qword ptr [rbp-50h], rax ; Save system service number
mov qword ptr [rbp-48h], rcx ; Save 1st input parameter
mov qword ptr [rbp-40h], rdx ; Save 2nd input parameter
mov rcx, qword ptr gs:[188h] ; KPRCB::KTHREAD*
mov rcx, qword ptr [rcx+220h] ; KPROCESS*
mov rcx, qword ptr [rcx+9E0h] ; EPROCESS::SecurityDomain
mov qword ptr gs:[270h], rcx ; KPRCB::TrappedSecurityDomain
mov cl, byte ptr gs:[850h] ; KPRCB::BpbRetpolineExitSpecCtrl
mov byte ptr gs:[851h], cl ; KPRCB::BpbTrappedRetpolineExitSpecCtrl
mov cl, byte ptr gs:[278h] ; KPRCB::BpbState
; 1h = BpbCpuIdle
; 2h = BpbFlushRsbOnTrap
; 4h = BpbIbpbOnReturn
; 8h = BpbIbpbOnTrap
; 10h = BpbIbpbOnRetpolineExit
mov byte ptr gs:[852h], cl ; KPRCB::BpbTrappedBpbState:
; 1h = BpbTrappedCpuIdle
; 2h = BpbTrappedFlushRsbOnTrap
; 4h = BpbTrappedIbpbOnReturn
; 8h = BpbTrappedIbpbOnTrap
; 10h = BpbTrappedIbpbOnRetpolineExit
movzx eax, byte ptr gs:[27Bh] ; KPRCB::BpbKernelSpecCtrl
cmp byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
je lbl_03
mov byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
mov ecx, 48h ; IA32_SPEC_CTRL MSR
xor edx, edx
wrmsr
lbl_03:
movzx edx, byte ptr gs:[278h] ; KPRCB::BpbState
test edx, 8 ; 8h = BpbIbpbOnTrap
je lbl_04
mov eax, 1 ; Indirect Branch Prediction Barrier (IBPB)
xor edx, edx
mov ecx, 49h ; IA32_PRED_CMD MSR
wrmsr
jmp lbl_60
lbl_04:
test edx, 2 ; 2h = BpbFlushRsbOnTrap
je lbl_f50
test byte ptr gs:[279h], 4 ; KPRCB::BpbFeatures:
; 1h = BpbClearOnIdle
; 2h = BpbEnabled
; 4h = BpbSmep
jne lbl_f50
; Software implementation of "Indirect Branch Prediction Barrier" feature
call lbl_c40
lbl_c10:
add rsp, 8
call lbl_c41
lbl_c11:
add rsp, 8
call lbl_c10
lbl_c12:
add rsp, 8
call lbl_c11
lbl_c13:
add rsp, 8
call lbl_c12
lbl_c14:
add rsp, 8
call lbl_c13
lbl_c15:
add rsp, 8
call lbl_c14
lbl_c16:
add rsp, 8
call lbl_c15
lbl_c17:
add rsp, 8
call lbl_c16
lbl_c18:
add rsp, 8
call lbl_c17
lbl_c19:
add rsp, 8
call lbl_c18
lbl_c20:
add rsp, 8
call lbl_c19
lbl_c21:
add rsp, 8
call lbl_c20
lbl_c22:
add rsp, 8
call lbl_c21
lbl_c23:
add rsp, 8
call lbl_c22
lbl_c24:
add rsp, 8
call lbl_c23
lbl_c25:
add rsp, 8
call lbl_c24
lbl_c26:
add rsp, 8
call lbl_c25
lbl_c27:
add rsp, 8
call lbl_c26
lbl_c28:
add rsp, 8
call lbl_c27
lbl_c29:
add rsp, 8
call lbl_c28
lbl_c30:
add rsp, 8
call lbl_c29
lbl_c31:
add rsp, 8
call lbl_c30
lbl_c32:
add rsp, 8
call lbl_c31
lbl_c33:
add rsp, 8
call lbl_c32
lbl_c34:
add rsp, 8
call lbl_c33
lbl_c35:
add rsp, 8
call lbl_c34
lbl_c36:
add rsp, 8
call lbl_c35
lbl_c37:
add rsp, 8
call lbl_c36
lbl_c38:
add rsp, 8
call lbl_c37
lbl_c39:
add rsp, 8
call lbl_c38
lbl_c40:
add rsp, 8
call lbl_c39
lbl_c41:
add rsp, 8
lbl_f50:
lfence
lbl_60:
mov byte ptr gs:[853h], 0 ; KPRCB::BpbRetpolineState:
; 1h = BpbRunningNonRetpolineCode
; 2h = BpbIndirectCallsSafe
; 4h = BpbRetpolineEnabled
jmp ntkrnlmp!KiSystemServiceUser
By the way, an interesting aside is the KiSystemCall64Shadow function. If you look through Microsoft public symbols, you will notice the KiSystemCall64 function as well. That is because the new KiSystemCall64Shadow function was added later with the mitigations for the "Meltdown hardware vulnerability" that was discovered in 2018.
If you look at the older KiSystemCall64 entry point, it starts with the following Assembly code. Try to spot the difference:
KiSystemCall64:
swapgs ; gs = KPCR + KPRCB
mov qword ptr gs:[10h], rsp ; KPCR::UserRsp (user stack pointer)
mov rsp, qword ptr gs:[1A8h] ; RSP = KPRCB::RspBase = kernel stack
push 2Bh ; Dummy SS selector
push qword ptr gs:[10h] ; KPCR::UserRsp
push r11 ; Save previous RFLAGS
push 33h ; Dummy 64-bit CS selector (Also used as 'is_syscall'. Set it to 33h)
push rcx ; Save return address to the user-mode
mov rcx, r10 ; Restore 1st parameter back into RCX
sub rsp, 8
push rbp ; Save previous RBP
sub rsp, 158h ; Reserve space for local variables
lea rbp, [rsp+80h] ; RBP = frame pointer on the stack
mov qword ptr [rbp+0C0h], rbx ; Save nonvolatile registers
mov qword ptr [rbp+0C8h], rdi
mov qword ptr [rbp+0D0h], rsi
test byte ptr [ntkrnlmp!KeSmapEnabled], 0FFh ; SMAP
je lbl_02
test byte ptr [rbp+0F0h], 1
je lbl_02 ; Jump if(is_syscall & 1) == 0
stac ; Set RFLAGS.AC to allow kernel accesses to user-mode pages under SMAP
lbl_02:
mov qword ptr [rbp-50h], rax
mov qword ptr [rbp-48h], rcx
mov qword ptr [rbp-40h], rdx
mov rcx, qword ptr gs:[188h] ; KPRCB::KTHREAD*
mov rcx, qword ptr [rcx+220h] ; KPROCESS*
mov rcx, qword ptr [rcx+9E0h] ; EPROCESS::SecurityDomain
mov qword ptr gs:[270h], rcx ; KPRCB::TrappedSecurityDomain
mov cl, byte ptr gs:[850h] ; KPRCB::BpbRetpolineExitSpecCtrl
mov byte ptr gs:[851h], cl ; KPRCB::BpbTrappedRetpolineExitSpecCtrl
mov cl, byte ptr gs:[278h] ; KPRCB::BpbState
; 1h = BpbCpuIdle
; 2h = BpbFlushRsbOnTrap
; 4h = BpbIbpbOnReturn
; 8h = BpbIbpbOnTrap
; 10h = BpbIbpbOnRetpolineExit
mov byte ptr gs:[852h], cl ; KPRCB::BpbTrappedBpbState
; 1h = BpbTrappedCpuIdle
; 2h = BpbTrappedFlushRsbOnTrap
; 4h = BpbTrappedIbpbOnReturn
; 8h = BpbTrappedIbpbOnTrap
; 10h = BpbTrappedIbpbOnRetpolineExit
movzx eax, byte ptr gs:[27Bh] ; KPRCB::BpbKernelSpecCtrl
cmp byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
je lbl_03
mov byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
mov ecx, 48h ; IA32_SPEC_CTRL MSR
xor edx, edx
wrmsr
lbl_03:
movzx edx, byte ptr gs:[278h] ; KPRCB::BpbState
test edx, 8
je lbl_04
mov eax, 1 ; Indirect Branch Prediction Barrier (IBPB)
xor edx, edx
mov ecx, 49h ; IA32_PRED_CMD MSR
wrmsr
jmp lbl_60
lbl_04:
test edx, 2 ; 2h = BpbFlushRsbOnTrap
je lbl_f50
test byte ptr gs:[279h], 4 ; KPRCB::BpbFeatures
; 1h = BpbClearOnIdle
; 2h = BpbEnabled
; 4h = BpbSmep
jne lbl_f50
; Software implementation of "Indirect Branch Prediction Barrier" feature
call lbl_c40
lbl_c10:
add rsp, 8
call lbl_c41
lbl_c11:
add rsp, 8
call lbl_c10
lbl_c12:
add rsp, 8
call lbl_c11
lbl_c13:
add rsp, 8
call lbl_c12
lbl_c14:
add rsp, 8
call lbl_c13
lbl_c15:
add rsp, 8
call lbl_c14
lbl_c16:
add rsp, 8
call lbl_c15
lbl_c17:
add rsp, 8
call lbl_c16
lbl_c18:
add rsp, 8
call lbl_c17
lbl_c19:
add rsp, 8
call lbl_c18
lbl_c20:
add rsp, 8
call lbl_c19
lbl_c21:
add rsp, 8
call lbl_c20
lbl_c22:
add rsp, 8
call lbl_c21
lbl_c23:
add rsp, 8
call lbl_c22
lbl_c24:
add rsp, 8
call lbl_c23
lbl_c25:
add rsp, 8
call lbl_c24
lbl_c26:
add rsp, 8
call lbl_c25
lbl_c27:
add rsp, 8
call lbl_c26
lbl_c28:
add rsp, 8
call lbl_c27
lbl_c29:
add rsp, 8
call lbl_c28
lbl_c30:
add rsp, 8
call lbl_c29
lbl_c31:
add rsp, 8
call lbl_c30
lbl_c32:
add rsp, 8
call lbl_c31
lbl_c33:
add rsp, 8
call lbl_c32
lbl_c34:
add rsp, 8
call lbl_c33
lbl_c35:
add rsp, 8
call lbl_c34
lbl_c36:
add rsp, 8
call lbl_c35
lbl_c37:
add rsp, 8
call lbl_c36
lbl_c38:
add rsp, 8
call lbl_c37
lbl_c39:
add rsp, 8
call lbl_c38
lbl_c40:
add rsp, 8
call lbl_c39
lbl_c41:
add rsp, 8
lbl_f50:
lfence
lbl_60:
mov byte ptr gs:[853h], 0 ; KPRCB::BpbRetpolineState
; 1h = BpbRunningNonRetpolineCode
; 2h = BpbIndirectCallsSafe
; 4h = BpbRetpolineEnabled
jmp ntkrnlmp!KiSystemServiceUser
As you can see, the updated KiSystemCall64Shadow function has the following Meltdown mitigations (if you read the comments in the Assembly code above):
- Kernel Virtual Address Shadow (KVAS) - Microsoft's version of the Kernel Page Table Isolation (KPTI) to provide a separate page table for the kernel code for the virtual address translation.
There's too much going on there, so I'll try to cover just the basics:
- The first thing that happens is that the swapgs instruction exchanges the GS base register with the value of the IA32_KERNEL_GS_BASE MSR (at address C0000102h). It is important to do this first since the GS register is used to hold the base for the "Kernel Processor Control Region" (or KPCR), which is the data structure that holds important information needed for the kernel mode operation. KPCR is specific for each processor core. KPRCB is an extended data struct that follows KPCR in memory.
If the GS register is not set up, or is corrupted, this is a guaranteed way to cause a Blue-Screen-Of-Death, or to hang up the system.
- Then the handler saves the user-mode stack pointer (RSP) in the UserRspShadow variable of the KPRCB.
Note that the structure of KPCR is not well documented and tends to change from one build of the OS to another. Thus the offsets in it are prone to change.
- Then the code briefly reuses the RSP register to set up the Meltdown mitigation by setting the CR3 register (to switch the physical address for the base of the page directory in accordance with the "Kernel Virtual Address Shadow" guidelines), depending on bit 1 of the ShadowFlags of the KPRCB struct.
- After that it sets up the kernel stack by getting its pointer from KPRCB::RspBaseShadow, and saves the nonvolatile registers in the kernel stack.
- Note that the code above pushes to the stack the hardcoded values for the stack segment and for the code segment, as 2Bh and 33h, respectively.
I'm not entirely sure why they decided to do this originally.
Also note that the hardcoded value 33h is also checked and used later in the code. I marked it as is_syscall in the comments. (I will return to it later.)
- Then depending on the global KeSmapEnabled, it checks if SMAP should be enabled and if so sets it up. SMAP stands for "Supervisor-Mode Access Prevention". This is a hardware feature that prevents kernel code from accessing the user-mode address space unless it explicitly intends to. This feature is enabled by the CR4.SMAP control bit, and is later guided by the RFLAGS.AC flag, and the stac instruction.
- Then the code initiates additional security features for the syscall routine:
  - EPROCESS::SecurityDomain is copied into KPRCB::TrappedSecurityDomain from the current thread.
  - Retpoline mitigations are initiated via BpbTrappedRetpolineExitSpecCtrl that is copied from BpbRetpolineExitSpecCtrl of the KPRCB.
  - Additionally BpbState flags are copied into BpbTrappedBpbState of the KPRCB.
  - If the BpbCurrentSpecCtrl member of the KPRCB does not already hold the BpbKernelSpecCtrl value, it is updated and the IA32_SPEC_CTRL MSR is set to that value. That MSR controls "Speculative Execution Side Channel Mitigations" on the hardware level.
  - Then the BpbIbpbOnTrap flag in the KPRCB::BpbState is checked, and if it's on, the code issues an "Indirect Branch Prediction Barrier" (or IBPB) via the IA32_PRED_CMD MSR.
  - And if the BpbFlushRsbOnTrap flag is not set in the KPRCB::BpbState, or if the BpbSmep flag is enabled in KPRCB::BpbFeatures, the code skips the following software emulation of the IBPB.
  - Then notice a convoluted sequence of call instructions that call up the code to a previous add rsp, 8 instruction that restores the stack pointer, only to repeat it again multiple times. This is Microsoft's way to implement the "Indirect Branch Prediction Barrier" in software if the hardware implementation is not available, or if the BpbFlushRsbOnTrap flag is set.
That call sequence seems to overwhelm the hardware pipeline, and as such defeats a possible attack.
But ouch! What a waste of CPU cycles.
I wonder though, why didn't they optimize it to a bunch of call instructions, followed by a single add rsp, n instead of inserting add rsp, 8 after each call instruction. If anyone knows, please leave a comment below.
  - At the end the lfence instruction provides more assurance against the indirect branch prediction attacks.
- Finally a jump to the KiSystemServiceUser label continues on to the shared code with the original KiSystemCall64 function.
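The branching between those mitigations is easier to see as plain C. The sketch below only restates the conditional jumps from the Assembly listing above; the enum names and the choose_trap_action function are made up for this illustration, and the flag values are the ones from my comments in that listing, which come from reverse engineering and may differ between builds.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as they appear in the comments of the listing (assumed). */
enum {
    BPB_FLUSH_RSB_ON_TRAP = 0x02,  /* KPRCB::BpbState */
    BPB_IBPB_ON_TRAP      = 0x08,  /* KPRCB::BpbState */
    BPB_FEATURE_SMEP      = 0x04   /* KPRCB::BpbFeatures */
};

/* Hypothetical names for the three paths through the handler. */
typedef enum {
    TRAP_HW_IBPB,        /* wrmsr to IA32_PRED_CMD, then straight to lbl_60 */
    TRAP_CALL_SEQUENCE,  /* the long call/add rsp sequence, then lfence */
    TRAP_LFENCE_ONLY     /* jump to lbl_f50: just the lfence */
} trap_action;

static trap_action choose_trap_action(uint8_t bpb_state, uint8_t bpb_features)
{
    if (bpb_state & BPB_IBPB_ON_TRAP)
        return TRAP_HW_IBPB;              /* "test edx, 8" was non-zero */
    if (!(bpb_state & BPB_FLUSH_RSB_ON_TRAP))
        return TRAP_LFENCE_ONLY;          /* "test edx, 2" / "je lbl_f50" */
    if (bpb_features & BPB_FEATURE_SMEP)
        return TRAP_LFENCE_ONLY;          /* "jne lbl_f50" */
    return TRAP_CALL_SEQUENCE;
}
```

So the expensive call sequence only runs when BpbFlushRsbOnTrap is set, BpbIbpbOnTrap is not, and BpbSmep is off.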
Let's review what happens next. We're far from being done with the system call handler.
Kernel Stack Layout
To help us navigate over the local variables that are used by the syscall handler that we're reviewing here, I made the following kernel stack diagram. Assuming that the initial value of RSP was 1000h when we set it during the beginning of the handler, the following layout of local variables will be used:
I know that the address of the kernel stack cannot be 1000h. I'm using such a low value so that we don't deal with crazy-long 64-bit numbers here.
1000 - rsp at the time we entered syscall:
FF8 = 2Bh Dummy SS segment
FF0 = original RSP (from user mode) = KPRCB::UserRspShadow
FE8 = original R11 RFLAGS from user mode
FE0 = 33h Dummy CS segment & also 'is_syscall'
FD8 = original RCX return address (to the user mode)
FD0 =
FC8 = original RBP (from user mode)
FC0 = original RSI (from user mode)
FB8 = original RDI (from user mode)
FB0 = original RBX (from user mode)
FA8 = Beginning of the CONTEXT for a trap frame. Also: alt_exit: P1Home
FA0 =
F98 =
F90 =
F88 =
F80 =
F78 =
F70 = 0 (word) MxCsr: 'word_f70'
F68 =
F60 =
F58 =
F50 =
F48 =
F40 =
F38 = scratch: \ original XMM5 (high)
F30 = scratch: \ original XMM5 (lower)
F28 = scratch: \ original XMM4 (high)
F20 = scratch: | original XMM4 (lower)
F18 = scratch: | original XMM3 (high)
F10 = scratch: | original XMM3 (lower)
F08 = scratch: | original XMM2 (high)
F00 = scratch: | PsAltSystemCallDispatch original XMM2 (lower)
EF8 = scratch: | original XMM1 (high)
EF0 = <-- RBP | original XMM1 (lower)
EE8 = scratch: | original XMM0 (high)
EE0 = scratch: | original XMM0 (lower)
ED8 = scratch: /
ED0 = scratch: / original R10 (from user mode) = RCX before syscall
EC8 = scratch: / original R10 (from user mode) = RCX before syscall
EC0 = scratch: original R9 (from user mode)
EB8 = scratch: original R8 (from user mode)
EB0 = scratch: original RDX (from user mode)
EA8 = scratch: original R10 (from user mode) = RCX before syscall
EA0 = scratch: original RAX (from user mode) = system service number
E9F = \
E9E = \ original MXCSR (from user mode)
E9D = /
E9C = /
E9B = 2 set in KiSystemServiceUser
E9A =
E99 =
E98 = alt_exit: PreviousMode
E90 =
E88 =
E80 =
E78 =
E70 = <- RSP
E68 = \
E60 = \
E58 = \
E50 = |
E48 = |
E40 = |
E38 = | Stack for the service function call
E30 = |
E28 = |
E20 = |
E18 = /
E10 = /
E08 = /
E00 = <- (RSP for the service function)
DF8 = return address from the service function
The easiest way to read this is to search this page by a variable name to see where in code it will be used.
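If you want to cross-reference the RBP-relative operands in the listings with the diagram, the arithmetic is simple enough to script. The following is a minimal sketch, assuming the illustrative values from the diagram (RSP = E70h and RBP = RSP + 80h = EF0h). The helper name is hypothetical and is only here for illustration:

```python
# Illustrative values taken from the diagram (not real kernel addresses).
DIAGRAM_RSP = 0xE70
DIAGRAM_RBP = DIAGRAM_RSP + 0x80   # RBP sits 80h above RSP in the diagram

def diagram_address(rbp_offset):
    """Convert an [rbp+offset] operand into its address in the diagram."""
    return DIAGRAM_RBP + rbp_offset

# A few operands that appear in the listings in this post:
print(hex(diagram_address(-0x50)))   # [rbp-50h]:  system service number (RAX)
print(hex(diagram_address(0xE8)))    # [rbp+0E8h]: return address to the user mode
print(hex(diagram_address(0x100)))   # [rbp+100h]: original user-mode RSP
```

For instance, [rbp-50h] lands on the EA0 row, which is where the diagram keeps the original RAX.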
KiSystemServiceUser
Now that all the vulnerability mitigations are covered, the following part of the code begins servicing the user-mode call.
It first loads the KTHREAD::TrapFrame
into the CPU cache with the prefetchw
instruction to improve performance.
A "trap frame" points to the base of the stack and is often used to store the registers of the current thread, among other parameters, when an interrupt occurs, or during a system service call.
It then saves the MXCSR
control and status register (for the SSE registers) in the kernel stack, and loads the kernel copy of MXCSR
from KPRCB::MxCsr
.
It then checks the DISPATCHER_HEADER::DebugActive byte to see if the thread has a debugger attached to it.
mov byte ptr [rbp-55h], 2
mov rbx, qword ptr gs:[188h] ; KPRCB::KTHREAD*
prefetchw [rbx+90h] ; KTRAP_FRAME
stmxcsr dword ptr [rbp-54h]
ldmxcsr dword ptr gs:[180h] ; KPRCB::MxCsr
cmp byte ptr [rbx+3], 0 ; DISPATCHER_HEADER::DebugActive
mov word ptr [rbp+80h], 0 ; word_f70 = 0 (lower CONTEXT::MxCsr)
je lbl_s_01 ; Jump if no debugger attached
The section of the code that executes when a debugger is attached follows. Since during normal operation this block of code is not invoked, I will not dwell on it too much.
An interesting part to note is that when a debugger is attached, a syscall is processed differently. For instance, the KiSaveDebugRegisterState function saves the debugging registers and resets them for the kernel mode. Additionally, the PsAltSystemCallDispatch function may take the execution to an alternate processing routine. (Both are outside of the scope of this blog post.)
The following is a quick snippet of the debugging logic:
test byte ptr [rbx+3], 3 ; DISPATCHER_HEADER::DebugActive
; 1h = ActiveDR7
; 2h = Instrumented
; 4h = Minimal
; 20h = AltSyscall
; 40h = UmsScheduled
; 80h = UmsPrimary
mov qword ptr [rbp-38h], r8 ; Save 3rd input parameter
mov qword ptr [rbp-30h], r9 ; Save 4th input parameter
je lbl_dbg_01
call ntkrnlmp!KiSaveDebugRegisterState
lbl_dbg_01:
test byte ptr [rbx+3], 24h ; AltSyscall + Minimal (DISPATCHER_HEADER::DebugActive)
je lbl_dbg_03
mov qword ptr [rbp-20h], r10
mov qword ptr [rbp-28h], r10
movaps xmmword ptr [rbp-10h], xmm0
movaps xmmword ptr [rbp], xmm1
movaps xmmword ptr [rbp+10h], xmm2
movaps xmmword ptr [rbp+20h], xmm3
movaps xmmword ptr [rbp+30h], xmm4
movaps xmmword ptr [rbp+40h], xmm5
sti ; Enable maskable interrupts
mov rcx, rsp
call ntkrnlmp!PsAltSystemCallDispatch
cmp al, 1
je lbl_dbg_03
mov rax, qword ptr [rbp-50h] ; rax = system service number
jl lbl_dbg_02
mov ecx, 0C000001Ch ; STATUS_INVALID_SYSTEM_SERVICE
xor edx, edx
mov r8, qword ptr [rbp+0E8h] ; Return address to the user-mode
call ntkrnlmp!KiExceptionDispatch
int 3
lbl_dbg_02:
test byte ptr [rbx+3], 4 ; Minimal (DISPATCHER_HEADER::DebugActive)
je KiSystemServiceExit
jmp KiSystemServiceExitPico
lbl_dbg_03:
test byte ptr [rbx+3], 80h ; UmsPrimary (DISPATCHER_HEADER::DebugActive)
je lbl_dbg_04
mov ecx, 0C0000102h ; IA32_KERNEL_GS_BASE
rdmsr
shl rdx, 20h
or rax, rdx
cmp rax, qword ptr [ntkrnlmp!MmUserProbeAddress]
cmovae rax, qword ptr [ntkrnlmp!MmUserProbeAddress]
cmp qword ptr [rbx+0F0h], rax ; TEB*
je lbl_dbg_04
mov rdx, qword ptr [rbx+1F0h] ; UCB
bts dword ptr [rbx+74h], 8
dec word ptr [rbx+1E6h] ; SpecialApcDisable
mov qword ptr [rdx+80h], rax
sti ; Enable maskable interrupts
call ntkrnlmp!KiUmsCallEntry
jmp lbl_dbg_05
lbl_dbg_04:
test byte ptr [rbx+3], 40h ; UmsScheduled (DISPATCHER_HEADER::DebugActive)
je lbl_dbg_05
bts dword ptr [rbx+74h], 10h ; UmsPerformingSyscall
lbl_dbg_05:
mov r8, qword ptr [rbp-38h] ; Restore 3rd input parameter
mov r9, qword ptr [rbp-30h] ; Restore 4th input parameter
The part of the code that deals with the situation when a debugger is attached to a thread is quite interesting, and I may return to it in another blog post.
The code then retrieves the first two input parameters that were passed into the original native function (the one that initiated the syscall that we are processing here.) Remember, in our case it was NtWaitForSingleObject. It then saves the first argument into the KTHREAD struct for the current thread, along with the syscall number.
It also enables maskable interrupts, which were automatically disabled by the syscall instruction when we entered the kernel.
lbl_s_01:
mov rax, qword ptr [rbp-50h] ; system service number
mov rcx, qword ptr [rbp-48h] ; Restore 1st input parameter
mov rdx, qword ptr [rbp-40h] ; Restore 2nd input parameter
sti ; Enable maskable interrupts
mov qword ptr [rbx+88h], rcx ; KTHREAD::FirstArgument
mov dword ptr [rbx+80h], eax ; KTHREAD::SystemCallNumber
nop
Next it remembers the current "trap frame" in the KTHREAD
struct.
KiSystemServiceStart - System Service Number
At this point the service routine begins processing the "system service number" itself, or the EAX
register value that was passed into the syscall
instruction in the user-mode, that defines which service function we're calling.
For that the code splits the system service number (or 4 for the NtWaitForSingleObject call in our case) into two components:
- EAX = lower 12 bits of the system service number.
- EDI = bit 12 of the system service number.
I'll explain a bit later how these are used.
This could be visualized by the following bit breakdown:
1 11
2 1098 7654 3210 - bit numbers
----------------
u iiii iiii iiii - bits
Where:
- i = EAX lower 12 bits, or "syscall index"
- u = EDI bit 12.
All that looks pretty neat in code:
mov qword ptr [rbx+90h], rsp ; KTHREAD::KTRAP_FRAME*
mov edi, eax ; eax = system service number
shr edi, 7
and edi, 20h ; bit 12
and eax, 0FFFh ; lower 12 bits = syscall index
The interesting part about this code block is that this is where most of the Zw* kernel function prologues jump to from the KiServiceInternal shim. We'll get back to it a bit later.
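The same split is easy to try outside of the kernel. Here is a minimal sketch that mirrors the shr/and sequence from the listing. The function name and the second sample value are hypothetical, picked just to show a number with bit 12 set:

```python
def split_service_number(service_number):
    """Mimic the shr/and sequence that splits the system service number."""
    # EDI: shifting right by 7 moves bit 12 into bit 5, so the result is 20h or 0
    table_offset = (service_number >> 7) & 0x20
    # EAX: lower 12 bits, or the "syscall index"
    syscall_index = service_number & 0xFFF
    return table_offset, syscall_index

print(split_service_number(0x0004))  # service number 4, as in our example
print(split_service_number(0x1005))  # hypothetical number with bit 12 set
```

Note that the shift by 7 is what turns bit 12 into the value of 20h, instead of a plain 0 or 1.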
System Service Descriptor Tables
The next block of code deals with the so-called "System Service Descriptor Tables" (or SSDTs.) These tables effectively map a syscall to the address of its kernel service function using the syscall's "system service number" as an index.
An SSDT can be visualized as the following struct:
struct SERVICE_DESCRIPTOR_TABLE
{
SYSTEM_SERVICE_TABLE nt; // for service functions in: ntoskrnl.exe or ntkrnlmp
SYSTEM_SERVICE_TABLE win32k; // for service functions in: win32k.sys or GUI subsystem
SYSTEM_SERVICE_TABLE reserved2;
SYSTEM_SERVICE_TABLE reserved3;
};
Where:
struct SYSTEM_SERVICE_TABLE
{
LONG* ServiceTable; // Array of service function offsets & number of parameters
void* CounterTable;
ULONG ServiceLimit; // Number of elements in 'ServiceTable'
void* ArgumentTable;
};
So the code below picks the SSDT
that it needs, based on the KTHREAD::ThreadFlags
for the user-mode thread that invoked the syscall, and stores it in the R10
register. If it's not a GUI thread, it uses the KeServiceDescriptorTable
. If it's a regular GUI thread, it uses KeServiceDescriptorTableShadow
. Otherwise, for filtered GUI threads, it goes with KeServiceDescriptorTableFilter
.
KiSystemServiceRepeat:
lea r10, [ntkrnlmp!KeServiceDescriptorTable]
lea r11, [ntkrnlmp!KeServiceDescriptorTableShadow]
test dword ptr [rbx+78h], 80h ; GuiThread (KTHREAD::ThreadFlags)
je lbl_s_22 ; Jump if not a GUI-thread
test dword ptr [rbx+78h], 200000h ; RestrictedGuiThread (KTHREAD::ThreadFlags)
je lbl_s_21 ; Jump if not a filtered GUI-thread
lea r11, [ntkrnlmp!KeServiceDescriptorTableFilter]
lbl_s_21:
mov r10, r11
lbl_s_22:
cmp eax, dword ptr [r10+rdi+10h] ; SYSTEM_SERVICE_TABLE::ServiceLimit
jae lbl_check_n_conv_gui ; Jump if syscall index is out of bounds
Finally, it checks the lower 12 bits of the "system service number" (or the syscall index) for an overflow by comparing it to SYSTEM_SERVICE_TABLE::ServiceLimit
of the appropriate SSDT
. Remember, the code above kept it in the EAX
register.
That same code also set EDI
register (or collectively, its bigger 64-bit brother RDI
) to 20h
if the "system service number" for the syscall referred to a GUI (win32k.sys
) subsystem. Thus, the cmp eax, dword ptr [r10+rdi+10h]
instruction will add 20h
offset (in RDI
) to the address of the previously selected SSDT
in R10
. And, as you can see in the SERVICE_DESCRIPTOR_TABLE
struct, win32k
member follows right after the first nt
member, whose sizeof
is exactly 20h
bytes. This is what that magic 20h
was doing in the EDI
register for a GUI syscall in the code above.
Finally, if the syscall index is greater-or-equal to SYSTEM_SERVICE_TABLE::ServiceLimit
, the execution jumps to the lbl_check_n_conv_gui
label. This is somewhat outside of the scope of this post. But I'll explain it briefly without including the Assembly code.
The code at the lbl_check_n_conv_gui
branch will check if EDI
contains the value of 20h
, thus referring to a GUI subsystem. And if it does, this will mean that it's the first invocation of a GUI function and we need to convert our thread to a GUI thread (by calling KiConvertToGuiThread
function), and then repeat all the checks again by jumping back to the KiSystemServiceRepeat
code label above.
If on the other hand EDI
is not 20h
, this means that we were passed an incorrect "system service number" that is out of bounds. In this case the execution simply returns STATUS_INVALID_SYSTEM_SERVICE
back to the user-mode by exiting the syscall.
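To summarize the branching described above, here is a rough model of the table selection and of the bounds check. It is only a sketch: the flag values come from the listing, while the function names, the returned strings and the sample limit are my own placeholders and not anything that exists in the kernel:

```python
GUI_THREAD = 0x80                  # GuiThread bit of KTHREAD::ThreadFlags
RESTRICTED_GUI_THREAD = 0x200000   # RestrictedGuiThread bit of KTHREAD::ThreadFlags
WIN32K_TABLE_OFFSET = 0x20         # value left in EDI for a GUI syscall

def pick_descriptor_table(thread_flags):
    """Name of the SSDT that the handler loads into R10."""
    if not thread_flags & GUI_THREAD:
        return "KeServiceDescriptorTable"
    if thread_flags & RESTRICTED_GUI_THREAD:
        return "KeServiceDescriptorTableFilter"
    return "KeServiceDescriptorTableShadow"

def check_syscall_index(syscall_index, table_offset, service_limit):
    """Outcome of comparing EAX with SYSTEM_SERVICE_TABLE::ServiceLimit."""
    if syscall_index < service_limit:
        return "dispatch"
    if table_offset == WIN32K_TABLE_OFFSET:
        return "KiConvertToGuiThread, then KiSystemServiceRepeat"
    return "STATUS_INVALID_SYSTEM_SERVICE"

# Hypothetical limit of 1D0h entries for the 'nt' table:
print(pick_descriptor_table(0), check_syscall_index(4, 0, 0x1D0))
```

The important part is the order of the checks: the index is compared to the limit first, and only an out-of-bounds index leads to the GUI conversion or to the error status.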
Note that the "GUI thread" moniker exists only for historical reasons. Some time ago in the ancient past Microsoft decided to move most of their GUI code from the user-mode into the kernel (for performance reasons.) This was done after a good portion of the NT kernel infrastructure was already in place. Thus, they had to separate the regular NT-based syscalls (in the ntoskrnl module) from GUI syscalls, whose processing code resided in a totally different module: win32k.sys.
An interesting aside is how a thread becomes a "GUI thread". Obviously there is no way of knowing this until one of the syscalls to a GUI subsystem is made. Microsoft can differentiate their syscalls by using bit 12 of the "system service number" to denote a call to a GUI subsystem. So if the syscall service code notices that bit 12 of the "system service number" is set (by shifting it right by 7 bits and doing an AND with 20h) it invokes the KiConvertToGuiThread function to convert that thread to a GUI thread. This is done only once per thread.
Another question that may come to mind is, what the heck is a "filtered GUI thread", or the one that uses the KeServiceDescriptorTableFilter SSDT?
There's not much official documentation there. But from what I can gather, this is an internal feature that Microsoft uses as a fine-grained filter for their documented PROCESS_MITIGATION_SYSTEM_CALL_DISABLE_POLICY option. If you enable the latter option on a process, that process will not be able to make any calls to the win32k.sys GUI subsystem. And in most cases this could be analogous to using a sledgehammer to fix a dent. So the filtering SSDT allows specifying a detailed list of GUI syscalls that are available for a process to make.
By the way, the only reason why Microsoft is attempting to isolate their win32k.sys subsystem is because of its very poor reputation in regards to security vulnerabilities that were found in it. To put it briefly, win32k.sys contains very buggy code that you don't want potential attackers to abuse in your critical application.
System Service Number To Service Function
Next thing, the code needs to begin preparing for the conversion of the "system service number" to the actual service function pointer.
It first retrieves the base address of the ServiceTable
from the SSDT
and stores it in R10
. It then adds the offset to it to get the address of the actual kernel function to invoke. (It is updated and stored in the R10
register.)
The code gets a specific 32-bit signed value from the SYSTEM_SERVICE_TABLE::ServiceTable array for the offset using the syscall index in EAX (here using the full RAX register), that was obtained earlier.
Note though that the code shifts the 32-bit signed value by 4 bits to the right to get the offset. This tells us that the lower 4 bits of that value are used for something else. We'll see their use in the next code block.
mov r10, qword ptr [r10+rdi] ; R10 = SYSTEM_SERVICE_TABLE::ServiceTable
movsxd r11, dword ptr [r10+rax*4]
mov rax, r11 ; RAX = R11 = Offset to function & number of parameters
sar r11, 4
add r10, r11 ; R10 = address of service function
cmp edi, 20h
jne lbl_no_w32_callout ; Jump if non-GUI syscall
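The decoding of a ServiceTable entry from the listing above can be sketched in a few lines. This is just an illustration: the function names, the table base address and the sample entries are made up, and the only parts taken from the listing are the sign extension (movsxd), the arithmetic shift by 4 (sar) and the addition to the table base:

```python
def to_signed32(raw):
    """Sign-extend a raw 32-bit table entry, like `movsxd` does."""
    raw &= 0xFFFFFFFF
    return raw - 0x100000000 if raw & 0x80000000 else raw

def decode_service_entry(table_base, raw_entry):
    """Return (service function address, value of the lower 4 bits)."""
    entry = to_signed32(raw_entry)
    low_nibble = entry & 0xF   # kept in RAX for the next code block
    offset = entry >> 4        # arithmetic shift, like `sar r11, 4`
    return table_base + offset, low_nibble

BASE = 0x1000000  # hypothetical address of the ServiceTable array
print([hex(v) for v in decode_service_entry(BASE, (0x2345 << 4) | 2)])
print([hex(v) for v in decode_service_entry(BASE, 0xFFFFF001)])
```

Because the entry is signed, a service function can reside either above or below the table itself. The second sample decodes to a negative offset of 100h.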
The following code-block is not relevant for our example as it deals with GDI batching for a GUI thread. Our syscall will skip over it:
mov r11, qword ptr [rbx+0F0h] ; R11 = TEB*
KiSystemServiceGdiTebAccess:
cmp dword ptr [r11+1740h], 0 ; TEB::GdiBatchCount
je lbl_no_w32_callout
mov qword ptr [rbp-50h], rax
mov qword ptr [rbp-48h], rcx
mov qword ptr [rbp-40h], rdx
mov rbx, r8
mov rdi, r9
mov rsi, r10
mov ecx, 7
xor edx, edx
xor r8, r8
xor r9, r9
call ntkrnlmp!PsInvokeWin32Callout
mov rax, qword ptr [rbp-50h]
mov rcx, qword ptr [rbp-48h]
mov rdx, qword ptr [rbp-40h]
mov r8, rbx
mov r9, rdi
mov r10, rsi
nop dword ptr [rax]
lbl_no_w32_callout:
Service Function Input Parameters
Now we get to calculating and setting up the input parameters for the service function that we are invoking.
As you can see from the description above, the R10
register now holds the address of the function that we need to call to service our syscall. But if you remember the x64 calling convention, we can't call it just yet.
Even though we have the first four input parameters for that function in the RCX, RDX, R8 and R9 registers, any further input parameters are passed on the stack. And we don't have those set up yet.
This is what the next chunk of code will do.
But before we get to it, remember that we had the lower 4 bits left out from the function offsets in the SYSTEM_SERVICE_TABLE::ServiceTable array? Well, this is where they come in handy. Those 4 bits represent 16 possible combinations for the number of input parameters for a function that is used to service the syscall.
To be precise, for the x64 calling convention, that we've been dealing with, the number of input parameters stored in the lower 4 bits of the function offsets in the SYSTEM_SERVICE_TABLE::ServiceTable array represents only the input parameters that are "passed on the stack". And thus we have to add 4 to the actual number of possible combinations (since the first 4 parameters are passed in registers: RCX, RDX, R8, R9.)
So if we exclude 0 as no-parameters, then the maximum number of possible input parameters that the 4-bit encoding can describe for a syscall in the 64-bit Windows is 19, or 15 + 4.
Let's see how that 4-bit nibble is used.
If it is 0, then nothing needs to be done and the code jumps to the KiSystemServiceCopyEnd
label.
Otherwise the value of a 4-bit nibble is multiplied by 8 (in the shl eax, 3
instruction) and that value is subtracted from the address of the KiSystemServiceCopyEnd
label.
But why multiply it by 8, you may ask?
If you look at the pairs of the mov instructions in the sequence that starts from the KiSystemServiceCopyStart label, each pair takes 8 bytes in machine code.
And each pair of those mov instructions is used to copy a single input parameter on the stack for the service function. That is why we multiply by 8: to get the code offset from the KiSystemServiceCopyEnd label (backwards) to where we should begin copying the stack.
Lastly, we need to calculate the address of where to copy the input parameters from the user-mode stack (and store it in the rsi
register), as well as the destination address in the kernel stack (and store it in the rdi
register.)
For security reasons, the code needs to make sure that the user-mode caller passed us a stack address in the user-mode address range. For that it checks the rsi register against the MmUserProbeAddress constant (or MM_USER_PROBE_ADDRESS), which is the address of demarcation between the user-mode and the kernel-mode address ranges.
If the caller passed us a user-stack pointer in the kernel address range, the code substitutes it with the MmUserProbeAddress constant itself. (As a side note, this is a strange way to deal with this situation. I'd expect raising an exception.)
Finally, when we get the source and destination stacks (in rsi
and rdi
registers, respectively) and know how many parameters to copy, the optimized code below will execute a jmp r11
instruction, that will redirect the execution to one of the pairs of the mov
instructions in the sequence that starts from the KiSystemServiceCopyStart
label. That sequence of mov
instructions will copy all necessary input parameters from the user-mode stack into the kernel one.
and eax, 0Fh
je KiSystemServiceCopyEnd
shl eax, 3
lea rsp, [rsp-70h] ; Adjust RSP for the service function
lea rdi, [rsp+18h] ; Reserve space for the "shadow stack"
mov rsi, qword ptr [rbp+100h] ; RSI = user-mode RSP value
lea rsi, [rsi+20h] ; "shadow stack" + return address
test byte ptr [rbp+0F0h], 1
je lbl_jmp_r11 ; Jump if(is_syscall & 1) == 0
; Check user-mode RSP for overflow
cmp rsi, qword ptr [ntkrnlmp!MmUserProbeAddress] ; 7fff`ffff0000h
cmovae rsi, qword ptr [ntkrnlmp!MmUserProbeAddress]
nop dword ptr [rax]
lbl_jmp_r11:
lea r11, KiSystemServiceCopyEnd
sub r11, rax
jmp r11
int 3
int 3
int 3
KiSystemServiceCopyStart:
mov rax, qword ptr [rsi+70h]
mov qword ptr [rdi+70h], rax
mov rax, qword ptr [rsi+68h]
mov qword ptr [rdi+68h], rax
mov rax, qword ptr [rsi+60h]
mov qword ptr [rdi+60h], rax
mov rax, qword ptr [rsi+58h]
mov qword ptr [rdi+58h], rax
mov rax, qword ptr [rsi+50h]
mov qword ptr [rdi+50h], rax
mov rax, qword ptr [rsi+48h]
mov qword ptr [rdi+48h], rax
mov rax, qword ptr [rsi+40h]
mov qword ptr [rdi+40h], rax
mov rax, qword ptr [rsi+38h]
mov qword ptr [rdi+38h], rax
mov rax, qword ptr [rsi+30h]
mov qword ptr [rdi+30h], rax
mov rax, qword ptr [rsi+28h]
mov qword ptr [rdi+28h], rax
mov rax, qword ptr [rsi+20h]
mov qword ptr [rdi+20h], rax
mov rax, qword ptr [rsi+18h]
mov qword ptr [rdi+18h], rax
mov rax, qword ptr [rsi+10h]
mov qword ptr [rdi+10h], rax
mov rax, qword ptr [rsi+8]
mov qword ptr [rdi+8], rax
KiSystemServiceCopyEnd:
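The computed jump into the middle of that sequence is easier to follow with a small model. The sketch below assumes what the listing above shows: 14 pairs of mov instructions, 8 bytes of machine code per pair, and offsets that go from 70h down to 8. The function names and the sample stack contents are hypothetical:

```python
PAIR_SIZE = 8           # bytes of machine code per pair of mov instructions
PAIRS_IN_LISTING = 14   # pairs between KiSystemServiceCopyStart and ...CopyEnd

def copy_plan(low_nibble):
    """Return (bytes to back up from KiSystemServiceCopyEnd, offsets copied)."""
    count = low_nibble & 0xF                       # `and eax, 0Fh`
    if count > PAIRS_IN_LISTING:
        raise ValueError("the listing has no mov pair for this count")
    backward_bytes = count * PAIR_SIZE             # `shl eax, 3`
    # Execution falls through the last `count` pairs, highest offset first.
    offsets = [8 * i for i in range(count, 0, -1)]
    return backward_bytes, offsets

def copy_stack_parameters(user_stack, low_nibble):
    """Copy the stack parameters; user_stack maps offsets from RSI to values."""
    _, offsets = copy_plan(low_nibble)
    return {offset: user_stack[offset] for offset in offsets}

# Hypothetical syscall with 6 parameters: 4 in registers + 2 on the stack
print(copy_plan(2))
print(copy_stack_parameters({0x8: 0x1111, 0x10: 0x2222, 0x18: 0x3333}, 2))
```

So for a nibble of 2 the jump lands 16 bytes before the KiSystemServiceCopyEnd label, and only the [rsi+10h] and [rsi+8] pairs get executed.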
Calling The Service Function
And lastly, after a couple of checks for tracing options, we are ready to call our destination service function via the call rax instruction.
Remember that r10 contained our service function pointer, which the code below copies into rax for the call instruction.
I'm not sure though, why they couldn't just use the call r10 instruction instead?
test dword ptr [ntkrnlmp!KiDynamicTraceMask], 1
jne lbl_track_syscall
test dword ptr [ntkrnlmp!PerfGlobalGroupMask+0x8], 40h
jne lbl_perf_info_log_syscall
mov rax, r10
call rax ; Call the service function
In case you are curious what the code pointed to by the lbl_track_syscall and lbl_perf_info_log_syscall labels does:
- lbl_track_syscall - wraps the invocation of the service function for the syscall with the KiTrackSystemCallEntry function. It is used to invoke a trace callback for the system service dispatcher.
- lbl_perf_info_log_syscall - does the same wrapping, but with the PerfInfoLogSysCallEntry function that does performance logging for the system service dispatcher.
The rest of the code sequence in those branches is exactly the same.
After the indirect call
instruction, the execution will follow to the NtWaitForSingleObject
function in our specific case. This may put the calling thread into a waiting state, or return immediately, depending on the state of the kernel object that this function was called with.
Leaving SYSCALL
After the processing of the service function is done, the execution will return from the indirect call
instruction, in the code branch shown above.
But we're far from being done with the syscall. Let's see what happens next.
Return From The Service Function
The return value from running a service function is returned in the rax
register, and thus the remaining part of the syscall code needs to preserve it. The code will do so by placing it on the stack into a temporary address at the [rbp-50h]
offset.
Note that it seems like a kernel service function cannot accept any direct floating point values (via SSE registers) or return one. Although, it can hypothetically accept floating point variables passed by reference.
First, the code increments the syscall counter in KPRCB::KeSystemCalls
. This may be Microsoft's way to keep track of some telemetry, or to calculate the system workload.
Then the code restores 3 of the nonvolatile registers: RBX
, RDI
and RSI
. They have to be preserved, according to the x64 calling convention.
And the R11
register is set to point to the KTHREAD
struct, that describes the current user thread.
Then the code makes sure that the IRQL is at the PASSIVE_LEVEL, that KTHREAD::ApcStateIndex is 0, and that APCs are not disabled, by checking the KernelApcDisable and SpecialApcDisable fields (combined in the KTHREAD::CombinedApcDisable union.) If any of these conditions does not hold, the code executes KiBugCheckDispatch, or a KeBugCheckEx call (that will cause a BSoD.)
The KTHREAD::ApcStateIndex
will not be 0 if the thread is not currently attached to the same process that initiated the syscall. It will be a violation to return that thread to the user-mode in such a state.
The bug-check in this case will have one of the following codes: IRQL_GT_ZERO_AT_SYSTEM_SERVICE for the incorrect IRQL, or APC_INDEX_MISMATCH for the APC index mismatch.
Lastly, the code disables maskable interrupts for the rest of its function.
Maskable interrupts will be enabled back when the kernel code executes the sysretq
instruction.
nop dword ptr [rax]
inc dword ptr gs:[2EB8h] ; KPRCB::KeSystemCalls
KiSystemServiceExit:
mov rbx, qword ptr [rbp+0C0h]
mov rdi, qword ptr [rbp+0C8h]
mov rsi, qword ptr [rbp+0D0h]
mov r11, qword ptr gs:[188h] ; R11 = KPRCB::KTHREAD*
test byte ptr [rbp+0F0h], 1
je lbl_alt_exit ; If (is_syscall & 1) == 0: perform alternative exit from syscall
mov rcx, cr8 ; IRQL (Task Priority Register, TPR)
or cl, byte ptr [r11+24Ah] ; KTHREAD::ApcStateIndex
or ecx, dword ptr [r11+1E4h] ; KTHREAD::CombinedApcDisable
jne lbl_bug_check_1
cli ; Ignore maskable interrupts
Note the conditional jump to the lbl_alt_exit
branch. This is an alternative exit from a syscall handler that we will discuss later.
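The sanity check before the return to the user mode boils down to OR-ing three values and testing the result once. Here is a minimal sketch of that logic. Keep in mind that the function name is hypothetical, and that the order in which the bug-check code is picked is my assumption, since the code at the lbl_bug_check_1 label is not shown in this post:

```python
PASSIVE_LEVEL = 0

def exit_state_check(irql, apc_state_index, combined_apc_disable):
    """Return None if it is safe to return to user mode, else a bug-check name."""
    # The handler ORs all three values together and tests the result once.
    if (irql | apc_state_index | combined_apc_disable) == 0:
        return None
    # Assumption: the IRQL is examined first when picking the bug-check code.
    if irql != PASSIVE_LEVEL:
        return "IRQL_GT_ZERO_AT_SYSTEM_SERVICE"
    return "APC_INDEX_MISMATCH"

print(exit_state_check(0, 0, 0))   # normal case: nothing to report
print(exit_state_check(1, 0, 0))   # returned from the service at APC_LEVEL
print(exit_state_check(0, 1, 0))   # thread is still attached to another process
```

The single combined test keeps the common path very short: three memory or register reads and one conditional jump.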
Processing User-Mode APCs
The next stage is to process all accumulated user-mode APCs.
We already wrote about user-mode and kernel APC in two different blog posts.
The code below spins in a loop checking if SpecialUserApcPending
or UserApcPending
flags of the KTHREAD::KAPC_STATE
object are set to denote the presence of one or more queued APCs. And if so, it invokes KiInitiateUserApc
function to process them one-at-a-time.
Note that before processing an APC, the code raises the IRQL to the APC_LEVEL and enables maskable interrupts. It then disables maskable interrupts after the KiInitiateUserApc call returns, and lowers the IRQL back down to the PASSIVE_LEVEL.
lbl_apc_01:
mov rcx, qword ptr gs:[188h] ; RCX = KPRCB::KTHREAD*
test byte ptr [rcx+0C2h], 3 ; KTHREAD::KAPC_STATE::UserApcPendingAll
je lbl_ex_10 ; Jump if no user APCs
mov qword ptr [rbp-50h], rax ; Save return value from the service function
xor eax, eax
mov qword ptr [rbp-48h], rax
mov qword ptr [rbp-40h], rax
mov qword ptr [rbp-38h], rax
mov qword ptr [rbp-30h], rax
mov qword ptr [rbp-28h], rax
mov qword ptr [rbp-20h], rax
pxor xmm0, xmm0
movaps xmmword ptr [rbp-10h], xmm0
movaps xmmword ptr [rbp], xmm0
movaps xmmword ptr [rbp+10h], xmm0
movaps xmmword ptr [rbp+20h], xmm0
movaps xmmword ptr [rbp+30h], xmm0
movaps xmmword ptr [rbp+40h], xmm0
mov ecx, 1 ; IRQL = APC_LEVEL
mov cr8, rcx
sti ; Enable maskable interrupts
call ntkrnlmp!KiInitiateUserApc ; Initiate a user-mode APC
cli ; Disable maskable interrupts
mov ecx, 0
mov cr8, rcx ; IRQL = PASSIVE_LEVEL
mov rax, qword ptr [rbp-50h] ; Restore return value from the service function
jmp lbl_apc_01
lbl_ex_10:
In a nutshell, APC stands for "Asynchronous Procedure Call". This is Microsoft's way to delay execution of some callback until a later, more appropriate time, in the same thread. APCs are mostly relevant for the kernel-mode, but are also used in the user land. For instance, one can use QueueUserAPC
to queue one.
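The shape of the loop at the lbl_apc_01 label can be modeled as follows. This is a loose sketch and not kernel code: a plain list stands in for the pending APC flags, the callback stands in for KiInitiateUserApc, and the names and the sample return value are hypothetical:

```python
def process_user_apcs(pending_apcs, return_value, initiate_user_apc):
    """Model the loop at lbl_apc_01: deliver queued user APCs one at a time."""
    trace = []
    while pending_apcs:                          # `test byte ptr [rcx+0C2h], 3`
        saved = return_value                     # `mov qword ptr [rbp-50h], rax`
        trace.append("raise IRQL to APC_LEVEL, sti")
        initiate_user_apc(pending_apcs.pop(0))   # stands in for KiInitiateUserApc
        trace.append("cli, lower IRQL to PASSIVE_LEVEL")
        return_value = saved                     # `mov rax, qword ptr [rbp-50h]`
    return return_value, trace

delivered = []
result, trace = process_user_apcs(["apc_a", "apc_b"], 0x102, delivered.append)
print(hex(result), delivered, len(trace))
```

The point to take away is that the return value of the service function survives the loop, no matter how many APCs were delivered.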
Security Mitigations At Exit
The following section of code deals with more vulnerability mitigations during an exit from the syscall
. In this case it implements the "Single Thread Indirect Branch Predictors" (or STIBP) mitigation that deals with protecting user-mode code against the Spectre v2 vulnerability when hyper-threading is enabled.
In a nutshell, this vulnerability "leaks" information from one thread to another when hyper-threading is enabled. Thus, the following mitigation attempts to pair the thread state from one logical CPU core to another using the KiUpdateStibpPairing
function.
test byte ptr gs:[27Eh], 2 ; KPRCB::PairRegister
je lbl_ex_12 ; Jump if pairing state is not stale
mov qword ptr [rbp-50h], rax
xor ecx, ecx
call ntkrnlmp!KiUpdateStibpPairing ; Single Thread Indirect Branch Predictors (STIBP)
mov rax, qword ptr [rbp-50h]
lbl_ex_12:
It then performs some additional processing, depending on the state of the KTHREAD::DISPATCHER_HEADER
struct.
mov rcx, qword ptr gs:[188h] ; RCX = KPRCB::KTHREAD*
test dword ptr [rcx], 8000000h ; DISPATCHER_HEADER::Lock
je lbl_ex_13
mov qword ptr [rbp-50h], rax ; Save return value from the service function
xor eax, eax
mov qword ptr [rbp-48h], rax
mov qword ptr [rbp-40h], rax
mov qword ptr [rbp-38h], rax
mov qword ptr [rbp-30h], rax
mov qword ptr [rbp-28h], rax
mov qword ptr [rbp-20h], rax
pxor xmm0, xmm0
movaps xmmword ptr [rbp-10h], xmm0
movaps xmmword ptr [rbp], xmm0
movaps xmmword ptr [rbp+10h], xmm0
movaps xmmword ptr [rbp+20h], xmm0
movaps xmmword ptr [rbp+30h], xmm0
movaps xmmword ptr [rbp+40h], xmm0
call ntkrnlmp!KiRestoreSetContextState
lbl_ex_13:
mov rcx, qword ptr gs:[188h] ; RCX = KPRCB::KTHREAD*
test dword ptr [rcx], 40010000h ; DISPATCHER_HEADER::Lock
je lbl_ex_14
mov qword ptr [rbp-50h], rax ; Save return value from the service function
test byte ptr [rcx+2], 1 ; CycleProfiling (DISPATCHER_HEADER::ThreadControlFlags)
je lbl_ex_13_1
call ntkrnlmp!KiCopyCounters
mov rcx, qword ptr gs:[188h] ; RCX = KPRCB::KTHREAD*
lbl_ex_13_1:
test byte ptr [rcx+3], 40h ; UmsScheduled (DISPATCHER_HEADER::DebugActive)
je lbl_ex_13_2
lea rsp, [rbp-80h] ; Restore RSP (stack pointer)
xor ecx, ecx
call ntkrnlmp!KiUmsExit
lbl_ex_13_2:
mov rax, qword ptr [rbp-50h] ; Restore return value from the service function
lbl_ex_14:
The code above will not be invoked for a regular processing of a syscall, like our NtWaitForSingleObject
function call.
Instrumentation Callback
Then the code restores the MXCSR
control and status register, and also restores CPU debugging registers in the KiRestoreDebugRegisterState
function, if they were previously saved.
As an interesting aside, the code below performs additional action if KPROCESS::InstrumentationCallback
is not 0. In that case the R10
register is set to the initial return address from the syscall, and the actual return address is set to the value of KPROCESS::InstrumentationCallback
. This effectively allows specifying an alternate return target from a syscall.
ldmxcsr dword ptr [rbp-54h] ; Restore MXCSR
xor r10, r10
cmp word ptr [rbp+80h], 0 ; word_f70 (lower CONTEXT::MxCsr)
je lbl_ex_25
mov qword ptr [rbp-50h], rax ; Save return value from the service function
call ntkrnlmp!KiRestoreDebugRegisterState
mov rax, qword ptr gs:[188h] ; RAX = KPRCB::KTHREAD*
mov rax, qword ptr [rax+0B8h] ; RAX = KTHREAD::KAPC_STATE::KPROCESS
mov rax, qword ptr [rax+3D8h] ; RAX = KPROCESS::InstrumentationCallback
or rax, rax
je lbl_ex_24
cmp word ptr [rbp+0F0h], 33h
jne lbl_ex_24 ; Jump if(is_syscall != 0x33)
mov r10, qword ptr [rbp+0E8h] ; R10 = return address from the syscall
mov qword ptr [rbp+0E8h], rax ; Update return address from the syscall
lbl_ex_24:
mov rax, qword ptr [rbp-50h] ; Restore return value from the service function
lbl_ex_25:
As the multiple conditional checks in the code above show, this is most certainly a debugging feature of the syscall handler.
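The swap of the return address is simple enough to show in isolation. In this sketch the function name and the sample addresses are hypothetical, and the checks are limited to the two conditions from the listing above (a non-zero callback and is_syscall being equal to 33h):

```python
def apply_instrumentation_callback(return_address, callback, is_syscall=0x33):
    """Return (R10, return address) as they are left by the code above."""
    if callback == 0 or is_syscall != 0x33:
        return 0, return_address        # R10 stays zeroed by `xor r10, r10`
    # R10 gets the original return address; the callback becomes the new target.
    return return_address, callback

print([hex(v) for v in apply_instrumentation_callback(0x7FFA12345678, 0)])
print([hex(v) for v in apply_instrumentation_callback(0x7FFA12345678, 0x7FF600001000)])
```

This way the callback routine receives in R10 the address that it has to continue to after it is done.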
More Security Mitigations At Exit
After that the code saves (one more time) the return value from the service function in RAX
at the [rbp-50h]
address on the stack, and then resets the following security mitigations:
- Retpoline mitigation state in the KPRCB::BpbRetpolineState thread variable.
- "Speculative Execution Side Channel Mitigations" via the IA32_SPEC_CTRL MSR.
- "Indirect Branch Prediction Barrier" via the IA32_PRED_CMD MSR.
mov qword ptr [rbp-50h], rax ; Save return value from the service function
mov byte ptr gs:[853h], 0 ; KPRCB::BpbRetpolineState
; 1h = BpbRunningNonRetpolineCode
; 2h = BpbIndirectCallsSafe
; 4h = BpbRetpolineEnabled
movzx eax, byte ptr gs:[27Dh] ; KPRCB::BpbUserSpecCtrl
cmp byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
je lbl_ex_30
mov byte ptr gs:[27Ah], al ; KPRCB::BpbCurrentSpecCtrl
mov ecx, 48h ; IA32_SPEC_CTRL MSR
xor edx, edx
wrmsr
lbl_ex_30:
btr word ptr gs:[278h], 2 ; KPRCB::BpbState & BpbIbpbOnReturn
jae lbl_ex_31
mov eax, 1 ; Indirect Branch Prediction Barrier (IBPB)
xor edx, edx
mov ecx, 49h ; IA32_PRED_CMD MSR
wrmsr
lbl_ex_31:
Note that it is equally important to perform these security mitigations at the exit from the kernel mode as it is when entering into it.
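Both MSR writes in the listing above are conditional, which matters for performance since a wrmsr is not a cheap instruction. The following sketch models just that decision logic. The function name is hypothetical, and the MSR numbers and bit positions are the ones from the listing:

```python
IA32_SPEC_CTRL = 0x48
IA32_PRED_CMD = 0x49
BPB_IBPB_ON_RETURN = 1 << 2   # bit tested and cleared by `btr ..., 2`

def exit_mitigation_writes(current_spec_ctrl, user_spec_ctrl, bpb_state):
    """Return (list of (msr, value) writes, updated KPRCB::BpbState)."""
    writes = []
    if current_spec_ctrl != user_spec_ctrl:
        # Only touch IA32_SPEC_CTRL when the cached value differs.
        writes.append((IA32_SPEC_CTRL, user_spec_ctrl))
    if bpb_state & BPB_IBPB_ON_RETURN:
        bpb_state &= ~BPB_IBPB_ON_RETURN
        writes.append((IA32_PRED_CMD, 1))   # issue the IBPB
    return writes, bpb_state

print(exit_mitigation_writes(0, 0, 0))   # common case: no MSR writes at all
print(exit_mitigation_writes(1, 0, 4))   # both writes are needed
```

In the common case, when nothing has changed, the exit path gets away with two comparisons and no MSR writes.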
sysretq Instruction
Finally the code restores the RAX
, RCX
and R11
registers that will be used in the sysretq
instruction to return to the user mode. It then clears the XMM
registers and recovers the previous values of RBP
and RSP
stack register.
It then executes the swapgs
instruction to swap the GS
base register with the value of the IA32_KERNEL_GS_BASE
MSR, basically to revert what was done when we entered the syscall.
At the end of this long sequence, the sysretq
instruction returns execution back to the user mode. This is done pretty much in reverse order to what syscall
had done originally.
Without going into too many details, the sequence is as follows:
- RFLAGS is set to the value of the R11 register, with the exception of the RF and VM flags that are always set to 0, and other reserved flags that remain unchanged.
- CS code segment selector is set from bits [48-63] of the IA32_STAR MSR (at address C0000081h), plus the value of 10h, and then by OR-ing it with 03h, thus resetting its RPL/CPL to 3.
- Set the following CS code segment attributes: CS.Base=0, CS.Limit=FFFFFh, CS.Type=1011b ("Execute & read code segment, accessed"), CS.S=1, CS.DPL=3, CS.P=1, CS.L=1 (64-bit long mode), CS.D=0, CS.G=1 ("4-KByte granularity"), CPL=3 (for "user-mode").
- SS stack segment selector is set from bits [48-63] of the IA32_STAR MSR (at address C0000081h), plus the value of 8h, thus making the SS segment follow the CS segment. And then by OR-ing it with 03h, thus resetting its RPL to 3.
- Set the following SS segment attributes: SS.Base=0, SS.Limit=FFFFFh, SS.Type=0011b ("Read/write data segment, accessed"), SS.S=1, SS.DPL=3, SS.P=1, SS.B=1 ("expand-up, 32-bit stack segment"), SS.G=1 ("4-KByte granularity").
- RIP is set to the value of the RCX register.
Note that unless KTHREAD::KAPC_STATE::KPROCESS::InstrumentationCallback was set to some non-0 address, the execution will return to the instruction that follows the syscall that had started this sequence. Otherwise it will jump to the address from InstrumentationCallback and r10 will contain the original return address.
mov rax, qword ptr [rbp-50h] ; Restore return value from the service function
mov r8, qword ptr [rbp+100h] ; R8 = original user-mode RSP
mov r9, qword ptr [rbp+0D8h] ; R9 = original user-mode RBP
xor edx, edx
pxor xmm0, xmm0 ; Reset volatile SSE registers to 0
pxor xmm1, xmm1
pxor xmm2, xmm2
pxor xmm3, xmm3
pxor xmm4, xmm4
pxor xmm5, xmm5
mov rcx, qword ptr [rbp+0E8h] ; RCX = return address to user mode
mov r11, qword ptr [rbp+0F8h] ; R11 = original RFLAGS from user mode
test byte ptr [ntkrnlmp!KiKvaShadow], 1
jne KiKernelSysretExit
mov rbp, r9
mov rsp, r8 ; Restore user-mode stack pointer
swapgs
sysretq ; Return to user-mode
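To make the selector arithmetic from the list above concrete, here is a minimal C++ sketch that models what sysretq loads into CS, SS, RFLAGS and RIP. The struct and function names are made up for illustration, and the IA32_STAR value mentioned in the note after the code is an assumption, picked so that the result matches the 33h code segment value that appears elsewhere in this post. The sketch only clears the RF and VM flags and does not model the reserved RFLAGS bits.

```cpp
#include <cassert>
#include <cstdint>

// Register values that a 64-bit sysretq loads, following the steps listed above.
struct SysretState {
    std::uint16_t cs;      // user-mode code segment selector
    std::uint16_t ss;      // user-mode stack segment selector
    std::uint64_t rflags;  // flags restored from R11
    std::uint64_t rip;     // return address restored from RCX
};

constexpr std::uint64_t kFlagRF = 1ull << 16;  // Resume flag
constexpr std::uint64_t kFlagVM = 1ull << 17;  // Virtual-8086 mode flag

// Hypothetical helper: models the selector and flag math only, not the
// hidden segment attributes (Base, Limit, Type and so on).
inline SysretState model_sysretq(std::uint64_t ia32_star,
                                 std::uint64_t rcx,
                                 std::uint64_t r11) {
    // Bits [48-63] of IA32_STAR hold the base selector for sysret.
    const std::uint16_t base = static_cast<std::uint16_t>(ia32_star >> 48);
    SysretState s;
    s.cs = static_cast<std::uint16_t>((base + 0x10) | 3);  // +10h, then RPL = 3
    s.ss = static_cast<std::uint16_t>((base + 0x08) | 3);  // +8h, then RPL = 3
    s.rflags = r11 & ~(kFlagRF | kFlagVM);                 // RF and VM forced to 0
    s.rip = rcx;
    return s;
}
```

With an assumed IA32_STAR value of 0023001000000000h, model_sysretq yields CS = 33h and SS = 2Bh, so SS directly follows CS minus 8, as described in the list.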
System Exit With Meltdown Mitigations
In case we used the "Kernel Virtual Address Shadow" (KVAS) for the Meltdown mitigation, the code above will take the KiKernelSysretExit branch and perform additional actions:
- A check will be made using bit 1 of the KPRCB::ShadowFlags, like it was done in the beginning of the syscall handler. The following steps will be performed only if that bit is not set.
- The base of the page directory in the CR3 register will be set to the value of KPROCESS::UserDirectoryTableBase to switch the virtual address translation to its own table for the user-mode.
  Note how the code uses bit 0 of the address from KPROCESS::UserDirectoryTableBase. If it's 0, the address is loaded into CR3 as-is and KPRCB::ShadowFlags is not modified. Otherwise, the code checks bit 0 of KPRCB::ShadowFlags: if that bit is set, it is cleared; if it is not set, bit 63 is set in the address obtained from KPROCESS::UserDirectoryTableBase.
- If bit 1 of KPRCB::ShadowFlags is not set, the code will also run the verw instruction to verify the selector, stored in KPRCB::VerwSelector, for write access.
  To be honest I'm not entirely sure of the need for that verw instruction in that sequence. verw sets (or resets) the ZF flag in RFLAGS, which is never used anywhere after that instruction.
Then finally, the code sequence below executes the sysretq instruction, as I described above, to return to the user mode.
KiKernelSysretExit:
mov esp, dword ptr gs:[9018h] ; KPRCB::ShadowFlags
bt esp, 1
jb lbl_ex_55
mov rbp, qword ptr gs:[188h] ; KPRCB::KTHREAD*
mov rbp, qword ptr [rbp+220h] ; KPROCESS*
mov rbp, qword ptr [rbp+388h] ; KPROCESS::UserDirectoryTableBase
bt ebp, 0
jae lbl_ex_54
bt esp, 0
jb lbl_ex_53
bts rbp, 3Fh
jmp lbl_ex_54
lbl_ex_53:
and dword ptr gs:[9018h], 0FFFFFFFEh ; KPRCB::ShadowFlags & ~1
lbl_ex_54:
mov cr3, rbp ; Set base of page table
lbl_ex_55:
mov rbp, r9
bt esp, 1
jb lbl_ex_56
verw word ptr gs:[902Ah] ; KPRCB::VerwSelector
lbl_ex_56:
mov rsp, r8 ; Restore user mode stack pointer
swapgs
sysretq ; Return to user mode
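The branching in the KiKernelSysretExit listing above is easier to follow as plain logic. Below is a minimal C++ sketch that models only the decisions made on KPRCB::ShadowFlags and KPROCESS::UserDirectoryTableBase. The struct and function names are hypothetical, and the model is derived purely from the listing in this post, not from any documented behavior.

```cpp
#include <cassert>
#include <cstdint>

// Outcome of the KiKernelSysretExit decision logic, as read from the listing.
struct KvasExit {
    bool          writes_cr3;    // true if "mov cr3, rbp" is executed
    std::uint64_t cr3_value;     // value loaded into CR3 (valid if writes_cr3)
    std::uint32_t shadow_flags;  // KPRCB::ShadowFlags after the sequence
    bool          runs_verw;     // true if the verw instruction is executed
};

// Hypothetical model of the listing: no real registers are touched.
inline KvasExit model_kvas_exit(std::uint32_t shadow_flags,
                                std::uint64_t user_dir_table_base) {
    KvasExit out{false, 0, shadow_flags, false};

    // bt esp, 1 / jb lbl_ex_55: bit 1 set skips both the CR3 switch and verw.
    if (shadow_flags & 2u) {
        return out;
    }

    std::uint64_t cr3 = user_dir_table_base;
    if (cr3 & 1u) {                       // bt ebp, 0
        if (shadow_flags & 1u) {          // bt esp, 0 / jb lbl_ex_53
            out.shadow_flags &= ~1u;      // and dword ptr gs:[9018h], 0FFFFFFFEh
        } else {
            cr3 |= 1ull << 63;            // bts rbp, 3Fh
        }
    }

    out.writes_cr3 = true;                // mov cr3, rbp
    out.cr3_value = cr3;
    out.runs_verw = true;                 // second "bt esp, 1" falls through to verw
    return out;
}
```

For example, calling model_kvas_exit with ShadowFlags equal to 2 reports that neither CR3 nor verw is touched, which corresponds to jumping straight to the final swapgs and sysretq.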
And this will conclude the gargantuan sequence of actions that is performed for each and every call to a kernel function.
Alternative Exit
While reviewing the Assembly code for the syscall handler, you might have noticed a bunch of mysterious checks for the [rbp+0F0h] memory location on the stack. It's obviously some local variable. An astute reader might have even asked themselves, "What does that variable do?"
And this is what I set off to determine.
First off, to make it easier to spot, I labeled it as is_syscall in this post. So use the Ctrl+F keyboard shortcut (or ⌘+F on the Mac) to highlight it in your web browser.
The is_syscall variable begins its life as a hardcoded value 33h on the stack at the very beginning of the processing of a syscall. But it may also come from the Zw* function prolog.
In that frame of reference, the is_syscall variable looks very much like a hardcoded code segment selector, or the CS register.
It is then checked in many places, and if its bit 0 is cleared, the code will completely bypass security checks and mitigations, and it will take a totally different (and super-fast) system exit in the KiSystemServiceExit branch:
lbl_alt_exit:
mov rdx, qword ptr [rbp+0B8h] ; alt_exit: P1Home (or NormalContext for APC)
mov qword ptr [r11+90h], rdx ; KPRCB::KTHREAD::KTRAP_FRAME::P1Home
mov dl, byte ptr [rbp-58h] ; alt_exit: PreviousMode
mov byte ptr [r11+232h], dl ; KPRCB::KTHREAD::PreviousMode
cli ; Disable maskable interrupts
mov rsp, rbp
mov rbp, qword ptr [rbp+0D8h] ; RBP = original RBP from user mode
mov rsp, qword ptr [rsp+100h] ; RSP = original RSP from user mode
sti ; Enable maskable interrupts
ret
As you can see above, the return from the syscall handler is done via the ret instruction instead of a more conventional sysretq. Additionally, this code branch also restores the RBP and RSP registers to their original values (before the syscall handler was invoked.)
Also notice that before returning, the code seems to be setting the NormalContext for the APC in the KTRAP_FRAME::P1Home variable for the current thread to the local variable from the stack that I marked as "alt_exit: P1Home". If you look at my kernel stack diagram, that area points to an unused location on the stack, at least for the syscall handler.
The code also sets the KTHREAD::PreviousMode from the location on the stack that I marked as "alt_exit: PreviousMode" in my stack layout diagram. (The PreviousMode can be later retrieved via the documented ExGetPreviousMode function.)
The bottom line is that if the initial value of the hardcoded is_syscall variable on the kernel stack is set without bit-0, the syscall handler will skip most of the vulnerability mitigations and security checks, and return to the address marked as the offset FD8 in my kernel stack diagram.
My original guess was that the is_syscall variable was used as a sort of an internal debugging constant. That was until I discovered the KiServiceInternal function.
KiServiceInternal
To explain the existence of the alternative exit branch from the KiSystemServiceExit code block in the syscall handler, one needs to look at the KiServiceInternal shim.
Furthermore, if we search the ntoskrnl.exe module for references to the KiServiceInternal symbol (using any static analysis tool, such as Ghidra for instance) we will come up with a myriad of hits. And if we look at some of them, they will all be almost identical.
The reason for that is that the KiServiceInternal code shim is used as a part of the Zw* kernel functions prolog. (We'll review it next.)
For now though, if we look at the code in KiServiceInternal, most of the actions will be already familiar. They follow a very similar pattern to what we saw above at the beginning of the syscall handler.
sub rsp, 8h
push rbp ; Save frame pointer register
sub rsp, 158h
lea rbp, [rsp+80h]
mov qword ptr [rbp+0C0h], rbx ; Saves some nonvolatile registers
mov qword ptr [rbp+0C8h], rdi
mov qword ptr [rbp+0D0h], rsi
sti ; Enable maskable interrupts
mov rbx, qword ptr gs:[188h] ; KPRCB::KTHREAD*
prefetchw [rbx+90h] ; KTRAP_FRAME
movzx edi, byte ptr [rbx+232h] ; KPRCB::KTHREAD::PreviousMode
mov byte ptr [rbp-58h], dil ; alt_exit: PreviousMode
mov byte ptr [rbx+232h], 0 ; KPRCB::KTHREAD::PreviousMode = 0 (KernelMode)
mov r10, qword ptr [rbx+90h] ; KTRAP_FRAME::P1Home
mov qword ptr [rbp+0B8h], r10 ; alt_exit: P1Home
lea r11, [KiSystemServiceStart]
jmp r11 ; Jump to KiSystemServiceStart
First, it saves the RBP register on the stack, makes space for the local variables by subtracting 0x158 from the stack pointer, and then sets RBP to point 0x80 bytes above the new stack pointer.
All this matches exactly my kernel stack layout for the syscall handler.
It then saves the RBX, RDI and RSI nonvolatile registers and enables maskable interrupts. It then preloads the KTHREAD::TrapFrame into the CPU cache (exactly as we saw above.)
Then the code saves the value of the KPRCB::KTHREAD::PreviousMode on the stack in what we marked as the "alt_exit: PreviousMode" offset and resets the PreviousMode to 0, or KernelMode.
Such behavior is exactly what Zw* functions do in the kernel.
Finally, it saves the value of KTRAP_FRAME::P1Home in our offset on the stack that we marked as "alt_exit: P1Home". (This is a part of the NormalContext for the kernel APC data.)
And it jumps to the KiSystemServiceStart branch of the regular syscall handler.
But this still doesn't make sense. This is because the KiServiceInternal shim can't be used alone. It needs to be used in tandem with one of the Zw* function prologs.
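The save, reset and later restore of PreviousMode that KiServiceInternal and the alternative exit perform between them resembles a scope guard. The following C++ sketch models that pattern in user-mode code. FakeThread and PreviousModeGuard are hypothetical stand-ins and have nothing to do with the real KTHREAD structure. Only the KernelMode = 0 and UserMode = 1 values mirror the ones used by Windows.

```cpp
#include <cassert>
#include <cstdint>

enum Mode : std::uint8_t { KernelMode = 0, UserMode = 1 };

// Hypothetical stand-in for the one KTHREAD field that matters here.
struct FakeThread {
    std::uint8_t previous_mode;
};

// Models the pairing of KiServiceInternal (constructor) and the
// alternative exit (destructor) from the listings above.
class PreviousModeGuard {
public:
    explicit PreviousModeGuard(FakeThread& t)
        : thread_(t), saved_(t.previous_mode) {  // "alt_exit: PreviousMode" slot
        thread_.previous_mode = KernelMode;      // mov byte ptr [rbx+232h], 0
    }
    ~PreviousModeGuard() {
        thread_.previous_mode = saved_;          // mov byte ptr [r11+232h], dl
    }
    PreviousModeGuard(const PreviousModeGuard&) = delete;
    PreviousModeGuard& operator=(const PreviousModeGuard&) = delete;

private:
    FakeThread&  thread_;
    std::uint8_t saved_;
};

// Returns the PreviousMode that a service function would observe
// while the guard is active.
inline std::uint8_t mode_seen_by_service(FakeThread& t) {
    PreviousModeGuard guard(t);
    return t.previous_mode;
}
```

A thread that starts with PreviousMode set to UserMode will observe KernelMode inside mode_seen_by_service, and will have UserMode back once the function returns.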
Zw* Kernel Functions Prolog
Finally, to tie it all together and to understand the meaning and use of the KiServiceInternal and of the "alternative exit" code branches, as well as to figure out the use of the local variable that we labeled is_syscall, we need to look at one of the Zw* function prologs. Since we started torturing the NtWaitForSingleObject function, let's look at its kernel counterpart ZwWaitForSingleObject then.
mov rax, rsp
cli ; Disable maskable interrupts
sub rsp, 10h
push rax ; Save RSP
pushfq ; Save RFLAGS
push 10h ; is_syscall = 10h
lea rax, [KiServiceLinkage]
push rax ; Return address from syscall handler
mov eax, 4h ; "System service number"
jmp KiServiceInternal
The actions above are quite simple. The code first disables maskable interrupts (since it will be messing with the stack pointer) and follows the same pattern of filling out the stack as we saw in the beginning of the syscall handler.
Why? Because it will soon be merging with it at the KiSystemServiceStart branch in the KiServiceInternal shim above.
This is how the code above fills out the stack. Compare it to the syscall stack layout. I'll show just the first few bytes for brevity:
1008 = Return address from Zw* function call
1000 =
FF8 =
FF0 = original RSP
FE8 = original RFLAGS
FE0 = 10h Dummy CS segment & also 'is_syscall'
FD8 = KiServiceLinkage Return address from syscall handler
FD0 =
FC8 = original RBP (filled out in KiServiceInternal)
FC0 = original RSI (filled out in KiServiceInternal)
FB8 = original RDI (filled out in KiServiceInternal)
FB0 = original RBX (filled out in KiServiceInternal)
...
As you can see, one of the main differences is that our local variable is_syscall is set to 0x10 with its bit-0 cleared. This means that the main syscall handler logic will skip most of the security and vulnerability checks, as I explained above, and after the service function is executed, the KiSystemServiceExit will take the alternative exit branch.
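The stack layout above can be double-checked by replaying the pushes from the Zw* prolog and from KiServiceInternal on a simulated stack. The C++ sketch below does just that. All names in it are hypothetical, kLinkage is a placeholder standing in for the address of KiServiceLinkage, and the entry stack pointer of 1008h is used only so that the offsets line up with the diagram.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// A toy stack: tracks RSP and remembers every 8-byte slot that was written.
struct FakeStack {
    std::uint64_t rsp;
    std::map<std::uint64_t, std::uint64_t> slots;  // address -> value
    void push(std::uint64_t v) { rsp -= 8; slots[rsp] = v; }
};

struct Frame {
    std::uint64_t rbp;  // frame pointer set up by KiServiceInternal
    FakeStack stack;
};

constexpr std::uint64_t kLinkage = 0xAAAA;  // placeholder for &KiServiceLinkage

// Replays the stack writes from the two listings in this section.
inline Frame model_zw_prolog(std::uint64_t entry_rsp,
                             std::uint64_t rflags,
                             std::uint64_t caller_rbp) {
    FakeStack s{entry_rsp, {}};
    // Zw* prolog
    const std::uint64_t rax = s.rsp;  // mov rax, rsp
    s.rsp -= 0x10;                    // sub rsp, 10h
    s.push(rax);                      // push rax      (original RSP)
    s.push(rflags);                   // pushfq        (original RFLAGS)
    s.push(0x10);                     // push 10h      (is_syscall)
    s.push(kLinkage);                 // push rax      (KiServiceLinkage)
    // KiServiceInternal
    s.rsp -= 8;                       // sub rsp, 8h
    s.push(caller_rbp);               // push rbp
    s.rsp -= 0x158;                   // sub rsp, 158h
    const std::uint64_t rbp = s.rsp + 0x80;  // lea rbp, [rsp+80h]
    return Frame{rbp, s};
}
```

Starting from an entry RSP of 1008h, the value at [rbp+0F0h] comes out as 10h at address FE0, and [rbp+0E8h] holds the KiServiceLinkage placeholder at address FD8, which agrees with the layout shown above.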
Then if you recall the alternative exit branch, the code there will reset the PreviousMode to the old value for the thread taken from KPRCB::KTHREAD::PreviousMode (that is saved by the KiServiceInternal branch); and will reset the NormalContext for kernel APC via the "alt_exit: P1Home" stack variable.
The second difference that you can see in the code above is that the return address from the syscall handler is set to the KiServiceLinkage function. (It is marked with the FD8 offset in my kernel stack layout diagram.)
Then if we look at the KiServiceLinkage function itself, it can't be simpler than this:
Finally, the EAX register is hardcoded to the same "system service number" (or 4 in case of the ZwWaitForSingleObject, as we saw earlier) and the control flow is diverted to the KiServiceInternal shim that in turn will bring it to the KiSystemServiceStart branch in the main syscall handler.
And this concludes the code path that is taken when a Zw* function (such as ZwWaitForSingleObject) is invoked from the kernel. In a sense, those Zw* functions are executed via a similar syscall handler mechanism.
To be honest, I'm not sure why Microsoft chose such a convoluted way to redirect their Zw* functions to the Nt* ones that have the actual implementation. My guess is that it saves on the amount of code. But it definitely doesn't help with the efficiency.
Nt* Kernel Functions
Finally, for completeness, let's review how Nt* kernel functions are structured.
If you remember from the documentation, they are the ones that perform verification of the input parameters that may be passed in from the user-mode. But that is not entirely true. Let's look at the implementation of the NtWaitForSingleObject function to see what it does. It has a fairly simple logic:
; NTSYSAPI NTSTATUS NtWaitForSingleObject(
; HANDLE Handle, ; RCX
; BOOLEAN Alertable, ; RDX
; PLARGE_INTEGER Timeout ; R8
; )
mov qword ptr [rsp+18h], r8 ; Save R8 in shadow stack frame
sub rsp, 38h
movzx r9d, dl ; R9 = RDX
mov qword ptr [rsp+58h], 0
mov rax, qword ptr gs:[188h] ; RAX = KPRCB::KTHREAD*
movzx edx, byte ptr [rax+232h] ; EDX = KPRCB::KTHREAD::PreviousMode
mov rax, qword ptr [rsp+50h] ; RAX = saved R8 (3rd parameter)
test rax, rax
je lbl_ok ; Jump if 3rd parameter is NULL
test dl, dl
je lbl_ok ; Jump if PreviousMode == KernelMode
; Check if the 3rd parameter is a user-mode address
;
mov r8, ntkrnlmp!MmUserProbeAddress ; R8 = 7fff`ffff0000h (or MM_USER_PROBE_ADDRESS)
cmp rax, r8
jb lbl_1 ; Jump if RAX < 0x7FFFFFFF0000
mov rax, r8 ; Otherwise set RAX = 0x7FFFFFFF0000
lbl_1:
mov rax, qword ptr [rax]
mov qword ptr [rsp+58h], rax
lea rax, [rsp+58h]
mov qword ptr [rsp+50h], rax
jmp lbl_ok ; WTF! Really weird quirk of the compiler?
jmp lbl_exit
lbl_ok:
mov qword ptr [rsp+20h], rax
movzx r8d, dl
call ntkrnlmp!ObWaitForSingleObject ; This internal function does the actual work
lbl_exit:
add rsp, 38h
ret
As you can see, the logic inside will bypass verification of input parameters if the PreviousMode is set to 0, or KernelMode. And thus, if an Nt* function is called with the PreviousMode set to 0, it will skip all the checks as well. And this is exactly what a matching Zw* function does anyway.
To be honest, calling an Nt* function with the PreviousMode set to KernelMode is way more efficient than calling a Zw* function. (Obviously, both from the kernel mode.) I think I proved above why.
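The parameter check in the NtWaitForSingleObject listing boils down to a few comparisons. Here is a minimal C++ sketch of that decision. The struct and function names are hypothetical, and the probe address constant is taken from the comment in the listing. The sketch only computes which address would be read; it does not dereference anything.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kUserProbeAddress = 0x7FFFFFFF0000ull;  // from the listing
constexpr std::uint8_t  kKernelMode = 0;

// What the listing does with the Timeout pointer (3rd parameter).
struct TimeoutPlan {
    bool          captured;   // true if the value is copied to a local first
    std::uint64_t read_from;  // address the copy is read from (if captured)
};

// Hypothetical model of the checks before the call to ObWaitForSingleObject.
inline TimeoutPlan plan_timeout_capture(std::uint64_t timeout_ptr,
                                        std::uint8_t previous_mode) {
    // test rax, rax / je lbl_ok  and  test dl, dl / je lbl_ok:
    // a NULL pointer or a KernelMode caller skips the check entirely.
    if (timeout_ptr == 0 || previous_mode == kKernelMode) {
        return TimeoutPlan{false, 0};
    }
    // cmp rax, r8 / jb lbl_1: addresses at or above the probe address
    // are replaced with the probe address itself.
    const std::uint64_t addr =
        timeout_ptr < kUserProbeAddress ? timeout_ptr : kUserProbeAddress;
    return TimeoutPlan{true, addr};
}
```

A kernel-mode caller gets its pointer passed through untouched, while a user-mode caller that supplies a kernel address has the read redirected to the probe address instead.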
POC To Illustrate The Performance Impact
Finally, to illustrate the performance impact of calling a kernel function versus staying in user-mode, or the difference between using a kernel synchronization object versus a critical section, I made a small POC project. It will show the result by displaying the timing for each method:
This project is called "CritSectionVsKernelObject". You can download it from my GitHub.
The code in the POC is pretty straightforward, so I won't spend too much time on it.
It emulates the situation when the read/write access to a global buffer needs to be synchronized from multiple threads, and then runs it for a large number of iterations and times how long it takes to finish. The result is presented on the screen.
We run two tests. First, using the default critical section in Windows. And the second time using our home-made custom implementation of the lock using a Win32 event object, that will require a trip to the kernel every time we want to wait on it, or to check its state.
I put the home-made custom implementation into a class that I named DontUse_MyCritSection, to dissuade anyone from using it.
So having run this POC app, I'm sure no one will have any doubts about which implementation of a synchronization lock is more efficient.
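For readers who want to try a similar measurement without Windows, the shape of such a timing test can be sketched in portable C++. This is not the code from the POC project. The names below are made up, and std::mutex is used only as a stand-in lock, since the Win32 critical section and event types are not portable. Any type with lock() and unlock() members can be plugged in to compare two implementations.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct TimedRun {
    std::uint64_t counter;             // final value of the shared counter
    std::chrono::nanoseconds elapsed;  // wall-clock time for the whole run
};

// Runs `threads` workers that each increment a shared counter
// `iterations` times under `lock`, and times the whole run.
template <typename Lock>
TimedRun time_lock(Lock& lock, unsigned threads, std::uint64_t iterations) {
    std::uint64_t shared = 0;  // stands in for the global buffer
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                std::lock_guard<Lock> guard(lock);  // synchronized access
                ++shared;
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }

    const auto end = std::chrono::steady_clock::now();
    return TimedRun{
        shared,
        std::chrono::duration_cast<std::chrono::nanoseconds>(end - start)};
}
```

Calling time_lock once per lock type with the same thread and iteration counts gives two elapsed values that can be compared directly. The final counter doubles as a sanity check that the lock actually serialized the increments.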
Conclusion
I hope that by showing you this very long sequence of actions that happens during a syscall I was able to convince you that the cost of entering the kernel on Windows is very high.
I used an example of the NtWaitForSingleObject function and walked all the way from the syscall instruction in user-mode, to the actual eponymous kernel function, and out back to the user-mode. That code sequence seemed enormous.
In conclusion, I want to admit that this turned into a beefy blog post that took a long time to complete. But I hope that you will put the effort that I've spent on this research to wise use in your own software development.