
Critical Section vs Kernel Objects

Spinning in user-mode versus entering kernel - the cost of a SYSCALL in Windows.

This article contains functions and features that are not documented by the original manufacturer. By following advice in this article, you're doing so at your own risk. The methods presented in this article may rely on internal implementation and may not work in the future.

Intro

One of my previous clients had an interesting dilemma at hand. Their commercial software was running without any hitches, but the users were complaining about its slow performance. When their own developers tried to profile the software, they came up empty-handed, having produced just a long list of stats with no single function standing out from the rest.

I analyzed their source code and found one peculiar implementation. Some previous developer had written his (or her) own version of a critical section (or a synchronization lock.) At the time it probably seemed like a great idea. But after analyzing their implementation and writing a small POC app to demonstrate its performance, I was ready to present my findings. I will share those with the readers of my blog as well.


Critical Section

In Windows terminology, a critical section is a fast synchronization lock that can be used inside a single process. Or, in other words, it cannot be used across two processes. But why would you limit it to a single process?

You can read more detailed information about critical sections in the MSDN page.

Also, if you remember your computer science class, what is the difference between a critical section and a mutex? Or a semaphore? Or even an event? All of those objects can be used as a synchronization lock. Sure, let's exclude for now the "same process" limitation. But why can't we use a mutex or an event instead of a critical section? What is the point of having a critical section at all? Don't we already have several synchronization objects in that list?

Those are all valid questions.

Luckily, the C++ STL makes this distinction internally: on Windows, its std::mutex class uses SRW locks. We'll get back to those in a later blog post though; there's too much to cover here.

To understand the difference, the best way is to look under the hood of those synchronization primitives. Which we will do in this post.

For now though, let me quickly explain the difference.

Critical Section Internals

A critical section in Windows is called a "fast user-mode lock" for a reason. To understand why, let's review how it works. In a very basic approximation it functions using the following principle:

The following is the result of reverse engineering it on Windows 10. Keep in mind that the exact sequence of the instructions that I present below may change from one build of the operating system to another.
  1. When you enter a critical section (with a call to the EnterCriticalSection API) that function quickly checks if the critical section was previously entered from another thread. And if not, it returns without doing much work.

    Let's review what it looks like at the Assembly language level.

    The EnterCriticalSection function call is forwarded to the ntdll!RtlEnterCriticalSection function, which becomes the following:

    RtlEnterCriticalSection
    	sub     rsp, 28h
    	mov     rax, gs:30h                 ; RAX = TEB*
    	
    	lock btr dword ptr [rcx+8], 0       ; Check & reset bit-0 in RTL_CRITICAL_SECTION::LockCount
    	mov     rax, [rax+48h]              ; RAX = thread ID
    	jnb     lbl_cs_entered              ; jump if bit-0 was not set
    	
    	mov     [rcx+10h], rax              ; RTL_CRITICAL_SECTION::OwningThread = RAX
    	xor     eax, eax
    	mov     dword ptr [rcx+0Ch], 1      ; RTL_CRITICAL_SECTION::RecursionCount = 1
    	
    	add     rsp, 28h
    	retn

    The sequence above is very simple. First the function checks and resets bit-0 of the LockCount in the provided CRITICAL_SECTION structure with the atomic btr instruction. If that bit was set (as it will be the first time the critical section is entered), the call simply remembers the ID of the thread that called EnterCriticalSection in OwningThread, sets RecursionCount to 1, and returns.

    As you can see, that sequence pretty much takes (almost) no time to execute.

    By the way, the CRITICAL_SECTION structure is equivalent to the kernel's RTL_CRITICAL_SECTION, and is declared as such:
    C
    typedef struct _RTL_CRITICAL_SECTION {
    	PRTL_CRITICAL_SECTION_DEBUG DebugInfo;   //Used only if the process is being debugged
    
    	//
    	//  The following three fields control entering and exiting the critical
    	//  section for the resource
    	//
    
    	/*  8h */ LONG LockCount;             // Bit-0 will be reset if CS was entered once
    	/* 0Ch */ LONG RecursionCount;        // Count of how many times CS was entered from the same thread
    	/* 10h */ HANDLE OwningThread;        // Thread ID that first entered the critical section
    	/* 18h */ HANDLE LockSemaphore;
    	/* 20h */ ULONG_PTR SpinCount;
    } RTL_CRITICAL_SECTION, *PRTL_CRITICAL_SECTION;

    In case bit-0 of LockCount was initially clear, the critical section had been entered before. So in that case we switch to another branch at the lbl_cs_entered label:

    x86-64 Assembly
    lbl_cs_entered:
    	
    	cmp     [rcx+10h], rax              ; Check if thread ID is the same
    	jnz     lbl_cs_locked               ; Jump if not
    	
    	inc     dword ptr [rcx+0Ch]         ; Increment RTL_CRITICAL_SECTION::RecursionCount
    	xor     eax, eax
    	add     rsp, 28h
    	retn

    The code above checks if the current thread ID is the same as that of the thread that initially entered the critical section, and if so, it simply increments the RecursionCount variable of the CRITICAL_SECTION structure. This allows a thread to re-enter the critical section recursively without blocking.

    But, if this is a different thread, this means that our critical section is locked and we need to wait. Thus we jump to the lbl_cs_locked label. There we can see a call to the internal RtlpEnterCriticalSectionContended function that executes further logic.

  2. Thus, if the critical section is already held by another thread, the internal logic in the RtlpEnterCriticalSectionContended function assumes that this will be a transient state, and enters what is known as a spin-lock, in which the thread spins for a very short time while checking the state of the critical section.
    The exact implementation of the RtlpEnterCriticalSectionContended function is too much to cover for this blog post. Let's just say that it increases in complexity in comparison to what we saw above.

    The interesting part is the spin-lock itself:

    spin-lock
    	xor     ecx, ecx                   ; Set count to 0
    	nop     word ptr [rax+rax+00h]     ; Alignment on 16 bytes for the loop that follows
    
    lbl_repeat_spin_lock:
    
    	mov     eax, [r9]                  ; EAX = RTL_CRITICAL_SECTION::LockCount
    	test    r13b, al                   ; See if bit-0 is set in EAX (R13 = 1)
    	jnz     lbl_lock_changed           ; If yes, leave spin-lock if CS lock count was changed
    
    	cmp     ecx, ebx                   ; Check this spin-lock count
    	jz      lbl_timed_out              ; Leave if spin-lock timed out
    	pause                              ; CPU instruction to wait
    	inc     ecx                        ; Increment spin-lock count
    
    	jmp     lbl_repeat_spin_lock       ; Start over	

    As you can see, the spin-lock is a simple loop that counts up to a certain number. That number comes from the spin-count, which you can set by calling the InitializeCriticalSectionAndSpinCount function, or which is set automatically to 2000 if you use the InitializeCriticalSection function to initialize your critical section.

    The count for the spin-loop above is initially held in the EBX register and is calculated by multiplying the spin-count for the critical section by 10 and then by dividing it by some predefined constant. In my tests it was 47.

    So if the initial spin-count was 2000, the counter of iterations for the loop above becomes 425, or 2000 * 10 / 47.

    To be honest, I have no idea where 10 and 47 come from in that formula.

    Then the CPU spins in the spin-lock loop, periodically checking if the LockCount variable of the CRITICAL_SECTION has changed, or until the loop times out by running out of iterations.

    One important aspect to note here is that the spin-loop contains the pause instruction, which informs the CPU about a pending wait-loop. This helps with power consumption, optimizes CPU cache use, and eliminates the possibility of "expensive" memory-order violations (i.e. mis-speculations) when leaving the spin-loop.

    The important assumption is that the spin-loop is relatively short. This obviously helps to reduce wasted CPU cycles during a longer wait, but it also removes the need to enter the kernel when the lock is held only briefly. That is why a critical section is recommended for locks that get unlocked fairly quickly.

  3. If the critical section is released by another thread while our thread is spinning in the spin-lock, our thread leaves the spin cycle and enters the critical section.

    This happens in the lbl_lock_changed conditional branch of the Assembly code above (after some additional checks.)

    Note that the entire spin-lock resides in the user-mode, which is important for performance. This will become apparent later.
  4. But, in the worst-case scenario, if the spin-lock times out and the critical section is still held by another thread, the waiting thread enters a wait state in the kernel via the lbl_timed_out branch of the Assembly code above.

    At this point the critical section implementation stops being efficient and continues on the path to the kernel through the following sequence of function calls: RtlpWaitOnCriticalSection, RtlpWaitOnAddress, RtlpWaitOnAddressWithTimeout, and finally NtWaitForAlertByThreadId, which takes it to the kernel for an extended wait.

    Such wait state may technically continue indefinitely.
    Also note that at this stage the critical section is no longer an efficient synchronization lock because of the CPU cycles wasted in the spin-lock, and thus you should generally avoid using a critical section when the locked portion of the code takes significantly longer to run.
  5. If all goes well, the thread that was holding the critical section eventually releases it and wakes our thread from its kernel wait by calling the RtlpWakeByAddress function, which in turn calls NtAlertThreadByThreadId.
    You can read more about the NtWaitForAlertByThreadId and NtAlertThreadByThreadId functions in my other blog post.

Because of the implementation that I outlined above, a critical section is usually intended for a very transient lock for the processing that doesn't take too long to complete.

Also, because the speed of a critical section is achieved by avoiding a trip into the kernel (via the spin-lock loop), its use is limited to a single process. But why? To be able to check its state while spinning, the spin-lock needs to access shared memory inside the critical section object, which is not possible across different processes without entering the kernel, because of process isolation in Windows.

Kernel Synchronization Objects

On the other hand, a wait on a kernel object is done in the kernel. When you call WaitForSingleObject or a similar function that waits for a kernel object (such as an event, a semaphore, a Win32 mutex, and so on), it enters the kernel without any pre-processing, and all the waiting is done there.

This works fine except when the kernel object did not require any waiting at all; say, the event was already signaled, or the mutex was not held. In that case the trip to the kernel and back is not justified for an intra-process lock.

And that is the main difference between a critical section and a Win32 mutex, event, or a semaphore.

But why does a trip to the kernel and back make a difference?

This post will give you the answer by analyzing the Assembly code that has to run just for a trip to the kernel and to return back.

Entering Kernel From a User-Mode

Let's look at the WaitForSingleObject function. This is the function that can be used when we want to enter a lock. It checks if the lock is acquired, and if not, it acquires it and returns control back. And if the lock was already acquired when we called WaitForSingleObject, it then waits for a provided time interval until the lock is released.

To simplify our reverse engineering work, we can quickly step through the WaitForSingleObject API and notice that internally it calls the native function, called NtWaitForSingleObject, that in itself is just a wrapper for the following syscall:

NtWaitForSingleObject (user-mode)
	mov     r10, rcx       ; Save RCX in R10
	mov     eax, 4         ; "System service number" of the system call

	test    byte ptr [7FFE0308h], 1   ; KUSER_SHARED_DATA.SystemCall
	jnz     lbl_alt

	syscall                ; Initiate a trip to the kernel
	retn

lbl_alt:

	int     2Eh            ; Alternative (slower) route to the kernel
	retn

If we step through the Assembly instructions above and try to step into the syscall instruction, the execution completes immediately, as if that system call were just a monolithic, atomic instruction.

But what is really happening there?

The user-mode debugger cannot step into the kernel through the syscall instruction the way it could if it were just a local call instruction.

So we need to use some clever trick to do it manually.

If you know how Microsoft structured their native ntdll.dll library in Windows NT, you'll remember that most of its functions are mirrored in the kernel, with a slight difference between the Nt and Zw prefixes in the function names. Thus, we may assume that there's an NtWaitForSingleObject function in the kernel that does all the work by waiting for the object.

We can verify this by searching the ntoskrnl.exe file in the C:\Windows\System32\ directory for the NtWaitForSingleObject export name with the WinAPI Search tool. Such export indeed exists.

But what happens after the syscall instruction in the kernel and before we get to the NtWaitForSingleObject function?

Let's review it next...

The Cost of a SYSCALL

Let's look at the Assembly code from the Windows kernel-side right after the CPU executes the syscall instruction. There are technically two sides of the coin: entering kernel, and leaving it. Let's review them one by one.

Entering SYSCALL

As we saw above the execution enters the kernel mode with the syscall instruction. But what exactly happens there?

syscall Instruction

I won't go too deep into it (you can look it up yourself in the official Intel documentation.) In a nutshell, the syscall instruction performs the following steps:

Further on, I will be assuming a 64-bit CPU mode, and a 64-bit Windows 10 OS.
  • RCX register is set to the address of the instruction following the syscall.
  • R11 register is set to the value of RFLAGS.
  • RFLAGS register is masked with the IA32_FMASK MSR (at address C0000084h.) Each bit that is set in the value of IA32_FMASK is cleared in RFLAGS.
    MSR stands for "Model Specific Register" - these are Intel's architectural registers that usually convey certain information about the CPU, or allow to issue privileged control commands to the CPU. MSRs are available only in the kernel-mode, or in ring-zero.

    The value of IA32_FMASK MSR on my Windows 10 is 00000000`00004700h, which means that the RFLAGS register will have the following flags cleared during a syscall: TF - "Trap Flag", IF - "Interrupt Flag", DF - "Direction Flag", and NT - "Nested Task".

    The action above will disable maskable interrupts, among others.

  • CS code segment is set as follows (without checking permissions):
    • Selector for CS is set from bits [32-47] of IA32_STAR MSR (at address C0000081h), and then by AND-ing it with FFFCh, thus resetting its RPL/CPL to 0.
    • Set the following CS code segment attributes: CS.Base=0, CS.Limit=FFFFFh, CS.Type=1011b ("Nonconforming, execute & read code segment, accessed"), CS.S=1, CS.DPL=0, CS.P=1, CS.L=1 (64-bit long mode), CS.D=0, CS.G=1 ("4-KByte granularity"), CPL=0 (for "ring-zero").
  • SS stack segment is set as follows (without checking permissions):
    • Selector for SS is set from bits [32-47] of IA32_STAR MSR (at address C0000081h), plus the value of 8h, thus making the SS segment follow the CS segment.
    • Set the following SS segment attributes: SS.Base=0, SS.Limit=FFFFFh, SS.Type=0011b ("Read/write data segment, accessed"), SS.S=1, SS.DPL=0, SS.P=1, SS.B=1 ("expand-up, 32-bit stack segment"), SS.G=1 ("4-KByte granularity").
  • RIP register is set to the value of the IA32_LSTAR MSR (at address C0000082h).
    Note that for AMD processors, it is set slightly differently.
By the way, you can probably see why the user-mode portion of the transition to the kernel in ntdll saved the RCX register in R10: RCX holds the 1st function parameter in the x64 calling convention, but it is also overwritten by the syscall instruction, and thus needs to be temporarily saved. Microsoft chose the volatile R10 register for that.

This elaborate sequence is actually called a "fast" transition to the kernel. Why? Because a more involved (and older) way of transitioning to the kernel is done via a software interrupt with the int instruction. It involves many more steps than what I outlined above and is much slower.

Still, even though it's called "fast", a syscall instruction is anything but fast.

Beginning of The System Call Handler

Let's see what happens when we get to the kernel right after the syscall instruction.

The RIP register will point to the nt!KiSystemCall64Shadow entry point in the kernel and the execution begins in ring-zero.

Keep in mind that the following is mostly a reverse engineered and undocumented code that may change at any moment. Do not rely on it in your production environment!

I'll try to add some brief comments with descriptions in the Assembly code below:

KiSystemCall64Shadow
	swapgs                                     ; gs = KPCR + KPRCB
	mov     qword ptr gs:[9010h], rsp          ; KPRCB::UserRspShadow (user stack pointer)
	mov     rsp, qword ptr gs:[9000h]          ; RSP = KPRCB::KernelDirectoryTableBase

	bt      dword ptr gs:[9018h], 1            ; KPRCB::ShadowFlags
	jb      lbl_01
	mov     cr3, rsp                           ; Meltdown mitigation: Kernel Virtual Address Shadow (KVAS)

lbl_01:

	mov     rsp, qword ptr gs:[9008h]          ; RSP = KPRCB::RspBaseShadow (kernel stack)
	push    2Bh                                ; Dummy SS selector
	push    qword ptr gs:[9010h]               ; Save KPRCB::UserRspShadow (user stack pointer)
	push    r11                                ; Save previous RFLAGS
	push    33h                                ; Dummy 64-bit CS selector (Also used as 'is_syscall'. We set it to 33h here)
	push    rcx                                ; Save return address to the user-mode
	mov     rcx, r10                           ; Restore 1st parameter back into RCX
	sub     rsp, 8
	push    rbp                                ; Save previous RBP
	sub     rsp, 158h                          ; Reserve space for local variables
	lea     rbp, [rsp+80h]                     ; RBP = frame pointer on the stack
	mov     qword ptr [rbp+0C0h], rbx          ; Save nonvolatile registers
	mov     qword ptr [rbp+0C8h], rdi
	mov     qword ptr [rbp+0D0h], rsi

	test    byte ptr [ntkrnlmp!KeSmapEnabled], 0FFh   ; SMAP
	je      lbl_02
	test    byte ptr [rbp+0F0h], 1
	je      lbl_02                             ; Jump if(is_syscall & 1) == 0
	stac                                       ; Enable alignment checking for user-mode data accesses

lbl_02:

	mov     qword ptr [rbp-50h], rax           ; Save system service number
	mov     qword ptr [rbp-48h], rcx           ; Save 1st input parameter
	mov     qword ptr [rbp-40h], rdx           ; Save 2nd input parameter

	mov     rcx, qword ptr gs:[188h]           ; KPRCB::KTHREAD*
	mov     rcx, qword ptr [rcx+220h]          ; KPROCESS*
	mov     rcx, qword ptr [rcx+9E0h]          ; EPROCESS::SecurityDomain

	mov     qword ptr gs:[270h], rcx           ; KPRCB::TrappedSecurityDomain

	mov     cl, byte ptr gs:[850h]             ; KPRCB::BpbRetpolineExitSpecCtrl
	mov     byte ptr gs:[851h], cl             ; KPRCB::BpbTrappedRetpolineExitSpecCtrl

	mov     cl, byte ptr gs:[278h]             ; KPRCB::BpbState
                                               ;   1h = BpbCpuIdle
                                               ;   2h = BpbFlushRsbOnTrap
                                               ;   4h = BpbIbpbOnReturn
                                               ;   8h = BpbIbpbOnTrap
                                               ;  10h = BpbIbpbOnRetpolineExit

	mov     byte ptr gs:[852h], cl             ; KPRCB::BpbTrappedBpbState:
                                               ;   1h = BpbTrappedCpuIdle
                                               ;   2h = BpbTrappedFlushRsbOnTrap
                                               ;   4h = BpbTrappedIbpbOnReturn
                                               ;   8h = BpbTrappedIbpbOnTrap
                                               ;  10h = BpbTrappedIbpbOnRetpolineExit

	movzx   eax, byte ptr gs:[27Bh]            ; KPRCB::BpbKernelSpecCtrl

	cmp     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl
	je      lbl_03

	mov     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl

	mov     ecx, 48h                           ; IA32_SPEC_CTRL MSR
	xor     edx, edx
	wrmsr

lbl_03:

	movzx   edx, byte ptr gs:[278h]            ; KPRCB::BpbState

	test    edx, 8                             ; 8h = BpbTrappedIbpbOnTrap
	je      lbl_04

	mov     eax, 1                             ; Indirect Branch Prediction Barrier (IBPB)
	xor     edx, edx
	mov     ecx, 49h                           ; IA32_PRED_CMD MSR
	wrmsr

	jmp     lbl_60

lbl_04:

	test    edx, 2                             ;  2h = BpbFlushRsbOnTrap
	je      lbl_f50

	test    byte ptr gs:[279h], 4              ; KPRCB::BpbFeatures:
                                               ;  1h = BpbClearOnIdle
                                               ;  2h = BpbEnabled
                                               ;  4h = BpbSmep
	jne     lbl_f50

	; Software implementation of "Indirect Branch Prediction Barrier" feature

	call    lbl_c40

lbl_c10:
	add     rsp, 8
	call    lbl_c41
lbl_c11:
	add     rsp, 8
	call    lbl_c10
lbl_c12:
	add     rsp, 8
	call    lbl_c11
lbl_c13:
	add     rsp, 8
	call    lbl_c12
lbl_c14:
	add     rsp, 8
	call    lbl_c13
lbl_c15:
	add     rsp, 8
	call    lbl_c14
lbl_c16:
	add     rsp, 8
	call    lbl_c15
lbl_c17:
	add     rsp, 8
	call    lbl_c16
lbl_c18:
	add     rsp, 8
	call    lbl_c17
lbl_c19:
	add     rsp, 8
	call    lbl_c18
lbl_c20:
	add     rsp, 8
	call    lbl_c19
lbl_c21:
	add     rsp, 8
	call    lbl_c20
lbl_c22:
	add     rsp, 8
	call    lbl_c21
lbl_c23:
	add     rsp, 8
	call    lbl_c22
lbl_c24:
	add     rsp, 8
	call    lbl_c23
lbl_c25:
	add     rsp, 8
	call    lbl_c24
lbl_c26:
	add     rsp, 8
	call    lbl_c25
lbl_c27:
	add     rsp, 8
	call    lbl_c26
lbl_c28:
	add     rsp, 8
	call    lbl_c27
lbl_c29:
	add     rsp, 8
	call    lbl_c28
lbl_c30:
	add     rsp, 8
	call    lbl_c29
lbl_c31:
	add     rsp, 8
	call    lbl_c30
lbl_c32:
	add     rsp, 8
	call    lbl_c31
lbl_c33:
	add     rsp, 8
	call    lbl_c32
lbl_c34:
	add     rsp, 8
	call    lbl_c33
lbl_c35:
	add     rsp, 8
	call    lbl_c34
lbl_c36:
	add     rsp, 8
	call    lbl_c35
lbl_c37:
	add     rsp, 8
	call    lbl_c36
lbl_c38:
	add     rsp, 8
	call    lbl_c37
lbl_c39:
	add     rsp, 8
	call    lbl_c38
lbl_c40:
	add     rsp, 8
	call    lbl_c39
lbl_c41:
	add     rsp, 8

lbl_f50:

	lfence

lbl_60:

	mov     byte ptr gs:[853h], 0              ; KPRCB::BpbRetpolineState:
                                               ;  1h = BpbRunningNonRetpolineCode
                                               ;  2h = BpbIndirectCallsSafe
                                               ;  4h = BpbRetpolineEnabled
	
	jmp     ntkrnlmp!KiSystemServiceUser

By the way, an interesting aside is the KiSystemCall64Shadow function itself. If you look through the Microsoft public symbols, you will notice the KiSystemCall64 function as well. That is because the newer KiSystemCall64Shadow function was added later, along with the mitigations for the Meltdown hardware vulnerability that was discovered in 2018.

If you look at the older KiSystemCall64 entry point, it starts with the following Assembly code. Try to spot the difference:

KiSystemCall64
	swapgs                                     ; gs = KPCR + KPRCB
	mov     qword ptr gs:[10h], rsp            ; KPCR::UserRsp (user stack pointer)
	mov     rsp, qword ptr gs:[1A8h]           ; RSP = KPRCB::RspBase = kernel stack
	push    2Bh                                ; Dummy SS selector
	push    qword ptr gs:[10h]                 ; KPCR::UserRsp
	push    r11                                ; Save previous RFLAGS
	push    33h                                ; Dummy 64-bit CS selector (Also used as 'is_syscall'. Set it to 33h)
	push    rcx                                ; Save return address to the user-mode
	mov     rcx, r10                           ; Restore 1st parameter back into RCX
	sub     rsp, 8
	push    rbp                                ; Save previous RBP
	sub     rsp, 158h                          ; Reserve space for local variables
	lea     rbp, [rsp+80h]                     ; RBP = frame pointer on the stack
	mov     qword ptr [rbp+0C0h], rbx          ; Save nonvolatile registers
	mov     qword ptr [rbp+0C8h], rdi
	mov     qword ptr [rbp+0D0h], rsi

	test    byte ptr [ntkrnlmp!KeSmapEnabled], 0FFh   ; SMAP
	je      lbl_02
	test    byte ptr [rbp+0F0h], 1
	je      lbl_02                             ; Jump if(is_syscall & 1) == 0
	stac                                       ; Enable alignment checking for user-mode data accesses

lbl_02:

	mov     qword ptr [rbp-50h], rax
	mov     qword ptr [rbp-48h], rcx
	mov     qword ptr [rbp-40h], rdx

	mov     rcx, qword ptr gs:[188h]           ; KPRCB::KTHREAD*
	mov     rcx, qword ptr [rcx+220h]          ; KPROCESS*
	mov     rcx, qword ptr [rcx+9E0h]          ; EPROCESS::SecurityDomain

	mov     qword ptr gs:[270h], rcx           ; KPRCB::TrappedSecurityDomain

	mov     cl, byte ptr gs:[850h]             ; KPRCB::BpbRetpolineExitSpecCtrl
	mov     byte ptr gs:[851h], cl             ; KPRCB::BpbTrappedRetpolineExitSpecCtrl

	mov     cl, byte ptr gs:[278h]             ; KPRCB::BpbState
                                               ;   1h = BpbCpuIdle
                                               ;   2h = BpbFlushRsbOnTrap
                                               ;   4h = BpbIbpbOnReturn
                                               ;   8h = BpbIbpbOnTrap
                                               ;  10h = BpbIbpbOnRetpolineExit

	mov     byte ptr gs:[852h], cl             ; KPRCB::BpbTrappedBpbState
                                               ;   1h = BpbTrappedCpuIdle
                                               ;   2h = BpbTrappedFlushRsbOnTrap
                                               ;   4h = BpbTrappedIbpbOnReturn
                                               ;   8h = BpbTrappedIbpbOnTrap
                                               ;  10h = BpbTrappedIbpbOnRetpolineExit

	movzx   eax, byte ptr gs:[27Bh]            ; KPRCB::BpbKernelSpecCtrl
	cmp     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl
	je      lbl_03

	mov     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl

	mov     ecx, 48h                           ; IA32_SPEC_CTRL MSR
	xor     edx, edx
	wrmsr

lbl_03:

	movzx   edx, byte ptr gs:[278h]            ; KPRCB::BpbState
	test    edx, 8
	je      lbl_04

	mov     eax, 1                             ; Indirect Branch Prediction Barrier (IBPB)
	xor     edx, edx
	mov     ecx, 49h                           ; IA32_PRED_CMD MSR
	wrmsr

	jmp     lbl_60

lbl_04:

	test    edx, 2                             ;  2h = BpbFlushRsbOnTrap
	je      lbl_f50
	test    byte ptr gs:[279h], 4              ; KPRCB::BpbFeatures
                                               ;  1h = BpbClearOnIdle
                                               ;  2h = BpbEnabled
                                               ;  4h = BpbSmep
	jne     lbl_f50

	; Software implementation of "Indirect Branch Prediction Barrier" feature

	call    lbl_c40

lbl_c10:
	add     rsp, 8
	call    lbl_c41
lbl_c11:
	add     rsp, 8
	call    lbl_c10
lbl_c12:
	add     rsp, 8
	call    lbl_c11
lbl_c13:
	add     rsp, 8
	call    lbl_c12
lbl_c14:
	add     rsp, 8
	call    lbl_c13
lbl_c15:
	add     rsp, 8
	call    lbl_c14
lbl_c16:
	add     rsp, 8
	call    lbl_c15
lbl_c17:
	add     rsp, 8
	call    lbl_c16
lbl_c18:
	add     rsp, 8
	call    lbl_c17
lbl_c19:
	add     rsp, 8
	call    lbl_c18
lbl_c20:
	add     rsp, 8
	call    lbl_c19
lbl_c21:
	add     rsp, 8
	call    lbl_c20
lbl_c22:
	add     rsp, 8
	call    lbl_c21
lbl_c23:
	add     rsp, 8
	call    lbl_c22
lbl_c24:
	add     rsp, 8
	call    lbl_c23
lbl_c25:
	add     rsp, 8
	call    lbl_c24
lbl_c26:
	add     rsp, 8
	call    lbl_c25
lbl_c27:
	add     rsp, 8
	call    lbl_c26
lbl_c28:
	add     rsp, 8
	call    lbl_c27
lbl_c29:
	add     rsp, 8
	call    lbl_c28
lbl_c30:
	add     rsp, 8
	call    lbl_c29
lbl_c31:
	add     rsp, 8
	call    lbl_c30
lbl_c32:
	add     rsp, 8
	call    lbl_c31
lbl_c33:
	add     rsp, 8
	call    lbl_c32
lbl_c34:
	add     rsp, 8
	call    lbl_c33
lbl_c35:
	add     rsp, 8
	call    lbl_c34
lbl_c36:
	add     rsp, 8
	call    lbl_c35
lbl_c37:
	add     rsp, 8
	call    lbl_c36
lbl_c38:
	add     rsp, 8
	call    lbl_c37
lbl_c39:
	add     rsp, 8
	call    lbl_c38
lbl_c40:
	add     rsp, 8
	call    lbl_c39
lbl_c41:
	add     rsp, 8

lbl_f50:

	lfence

lbl_60:

	mov     byte ptr gs:[853h], 0              ; KPRCB::BpbRetpolineState
                                               ;  1h = BpbRunningNonRetpolineCode
                                               ;  2h = BpbIndirectCallsSafe
                                               ;  4h = BpbRetpolineEnabled

	jmp     ntkrnlmp!KiSystemServiceUser

As you can see, the updated KiSystemCall64Shadow function adds the Meltdown mitigations: the CR3 switch for the "Kernel Virtual Address Shadow" (KVAS) and the shadow copies of the stack pointers in the KPRCB (see the comments in the Assembly code above).

There's too much going on there, so I'll try to cover just the basics:

  1. The first thing that happens is that the swapgs instruction exchanges the GS base register with the value of the IA32_KERNEL_GS_BASE MSR (at address C0000102h). It is important to do this first since the GS register is used to hold the base for the "Kernel Processor Control Region" (or KPCR), which is the data structure that holds important information needed for the kernel mode operation.
    KPCR is specific for each processor core.

    KPRCB is an extended data struct that follows KPCR in memory.

    If the GS register is not set up, or is corrupted, this is a guaranteed way to cause a Blue Screen of Death, or to hang the system.
  2. Then the handler saves the user-mode stack pointer (RSP) in the UserRspShadow variable of the KPRCB.
    Note that the structure of KPCR is not well documented and tends to change from one build of the OS to another. Thus the offsets in it are prone to change.
  3. Then the code briefly reuses the RSP register to set up the Meltdown mitigation by setting the CR3 register (to switch the physical address of the base of the page directory in accordance with the "Kernel Virtual Address Shadow" guidelines), depending on bit 1 of the ShadowFlags in the KPRCB struct.
  4. After that it sets up the kernel stack by getting its pointer from KPRCB::RspBaseShadow, and saves provided nonvolatile registers in the kernel stack.
  5. Note that the code above pushes to the stack the hardcoded values for the stack segment and for the code segment, as 2Bh and 33h, respectively.
    I'm not entirely sure why they decided to do this originally.

    Also note that the hardcoded value 33h is also checked and used later in the code. I marked it as is_syscall in the comments. (I will return to it later.)

  6. Then, depending on the global KeSmapEnabled flag, it checks if SMAP should be enabled and, if so, sets it up. SMAP stands for "Supervisor-Mode Access Prevention". This is a hardware feature that guards kernel code from accessing user-mode address space without prior knowledge. This feature is set by the CR4.SMAP control bit, and is later guided by the RFLAGS.AC flag and the stac instruction.
  7. Then the code initiates additional security features for the syscall routine:
    • EPROCESS::SecurityDomain is copied into KPRCB::TrappedSecurityDomain from the current thread.
    • Retpoline mitigations are initiated via BpbTrappedRetpolineExitSpecCtrl that is copied from BpbRetpolineExitSpecCtrl of the KPRCB.
    • Additionally BpbState flags are copied into BpbTrappedBpbState of the KPRCB.
    • If BpbKernelSpecCtrl value is not set up in the BpbCurrentSpecCtrl member of the KPRCB, the IA32_SPEC_CTRL MSR is set to its value. That MSR controls "Speculative Execution Side Channel Mitigations" on the hardware level.
    • Then the BpbTrappedIbpbOnTrap flag in the KPRCB::BpbState is checked, and if it's on, the code enables "Indirect Branch Prediction Barrier" (or IBPB) via the IA32_PRED_CMD MSR.
    • And if the BpbFlushRsbOnTrap flag is not set in the KPRCB::BpbState, and BpbSmep flag is enabled in KPRCB::BpbFeatures, the code skips the following software emulation of the IBPB.
    • Then notice a convoluted sequence of call instructions, each of which calls up to the previous add rsp, 8 instruction that restores the stack pointer, repeated multiple times. This is Microsoft's way of implementing the "Indirect Branch Prediction Barrier" in software when the hardware implementation is not available, or when the BpbFlushRsbOnTrap flag is set.

      That call sequence seems to overwhelm the hardware pipeline, and thus defeats a possible attack.

      But ouch! What a waste of CPU cycles.

      I wonder, though, why they didn't optimize it into a bunch of call instructions followed by a single add rsp, n, instead of inserting add rsp, 8 after each call instruction. If anyone knows, please leave a comment below.

  8. At the end, the lfence instruction provides further assurance against indirect branch prediction attacks.
  9. Finally, a jump to the KiSystemServiceUser label continues on to the code shared with the original KiSystemCall64 function.

Let's review what happens next. We're far from being done with the system call handler.

Kernel Stack Layout

To help us navigate the local variables used by the syscall handler that we're reviewing here, I made the following kernel stack diagram. Assuming that the initial value of RSP was 1000h when we set it at the beginning of the handler, the layout of local variables will be as follows:

I know that the address of the kernel stack cannot be 1000h. I'm using such a low value so that we don't have to deal with crazy-long 64-bit numbers here.
Kernel stack layout
1000 - rsp at the time we entered syscall:

FF8 = 2Bh             Dummy SS segment
FF0 = original RSP    (from user mode) = KPRCB::UserRspShadow
FE8 = original R11    RFLAGS from user mode
FE0 = 33h             Dummy CS segment & also 'is_syscall'
FD8 = original RCX    return address (to the user mode)
FD0 = 
FC8 = original RBP    (from user mode)
FC0 = original RSI    (from user mode)
FB8 = original RDI    (from user mode)
FB0 = original RBX    (from user mode)
FA8 = Beginning of the CONTEXT for a trap frame. Also: alt_exit: P1Home
FA0 = 
F98 = 
F90 = 
F88 =
F80 = 
F78 = 
F70 = 0 (word)        MxCsr: 'word_f70'
F68 = 
F60 = 
F58 = 
F50 = 
F48 = 
F40 = 
F38 = scratch:        \                              original XMM5 (high)
F30 = scratch:         \                             original XMM5 (lower)
F28 = scratch:          \                            original XMM4 (high)
F20 = scratch:           |                           original XMM4 (lower)
F18 = scratch:           |                           original XMM3 (high)
F10 = scratch:           |                           original XMM3 (lower)
F08 = scratch:           |                           original XMM2 (high)
F00 = scratch:           | PsAltSystemCallDispatch   original XMM2 (lower)
EF8 = scratch:           |                           original XMM1 (high)
EF0 = <-- RBP            |                           original XMM1 (lower)
EE8 = scratch:           |                           original XMM0 (high)
EE0 = scratch:           |                           original XMM0 (lower)
ED8 = scratch:          /
ED0 = scratch:         /                             original R10     (from user mode) = RCX before syscall
EC8 = scratch:        /                              original R10     (from user mode) = RCX before syscall
EC0 = scratch: original R9     (from user mode)
EB8 = scratch: original R8     (from user mode)
EB0 = scratch: original RDX    (from user mode)
EA8 = scratch: original R10    (from user mode) = RCX before syscall
EA0 = scratch: original RAX    (from user mode) = system service number
    E9F = \
    E9E =  \ original MXCSR   (from user mode)
    E9D =  /
    E9C = /
    E9B = 2         set in KiSystemServiceUser
    E9A = 
    E99 = 
    E98 = alt_exit: PreviousMode
E90 = 
E88 = 
E80 = 
E78 = 
E70 = <- RSP
E68 =   \
E60 =    \
E58 =     \
E50 =      |
E48 =      |
E40 =      |
E38 =      | Stack for the service function call
E30 =      |
E28 =      |
E20 =      |
E18 =     /
E10 =    /
E08 =   /
E00 = <- (RSP for the service function)
DF8 = return address from the service function

The easiest way to read this is to search this page for a variable name to see where it is used in the code.

KiSystemServiceUser

Now that all the vulnerability mitigations are covered, the following part of the code begins servicing the user-mode call.

It first loads the KTHREAD::TrapFrame into the CPU cache with the prefetchw instruction to improve performance.

A "trap frame" points to the base of the stack and is often used to store the registers of the current thread, among other parameters, when an interrupt occurs, or during a system service call.

It then saves the MXCSR control and status register (for the SSE registers) in the kernel stack, and loads the kernel copy of MXCSR from KPRCB::MxCsr.

It then checks DISPATCHER_HEADER::DebugActive byte to see if the thread has a debugger attached to it.

KiSystemServiceUser
	mov     byte ptr [rbp-55h], 2
	mov     rbx, qword ptr gs:[188h]           ; KPRCB::KTHREAD*
	prefetchw [rbx+90h]                        ; KTRAP_FRAME
	stmxcsr dword ptr [rbp-54h]
	ldmxcsr dword ptr gs:[180h]                ; KPRCB::MxCsr
	cmp     byte ptr [rbx+3], 0                ; DISPATCHER_HEADER::DebugActive
	mov     word ptr [rbp+80h], 0              ; word_f70 = 0    (lower CONTEXT::MxCsr)
	je      lbl_s_01                           ; Jump if no debugger attached

The section of the code that executes when a debugger is attached follows. Since during normal operation this block of code is not invoked, I will not dwell on it too much.

An interesting part to note is that when a debugger is attached, a syscall is processed differently. For instance, the KiSaveDebugRegisterState function saves the debugging registers and resets them for kernel mode. Additionally, the PsAltSystemCallDispatch function may take the execution to an alternate processing routine. (Both are outside of the scope of this blog post.)

The following is a quick snippet of the debugging logic:

KiSystemServiceUser (Debugger)
	test    byte ptr [rbx+3], 3                ; DISPATCHER_HEADER::DebugActive
                                               ;  1h = ActiveDR7
                                               ;  2h = Instrumented
                                               ;  4h = Minimal
                                               ; 20h = AltSyscall
                                               ; 40h = UmsScheduled
                                               ; 80h = UmsPrimary

	mov     qword ptr [rbp-38h], r8            ; Save 3rd input parameter
	mov     qword ptr [rbp-30h], r9            ; Save 4th input parameter

	je      lbl_dbg_01
	call    ntkrnlmp!KiSaveDebugRegisterState

lbl_dbg_01:

	test    byte ptr [rbx+3], 24h              ; AltSyscall + Minimal  (DISPATCHER_HEADER::DebugActive)
	je      lbl_dbg_03

	mov     qword ptr [rbp-20h], r10
	mov     qword ptr [rbp-28h], r10
	movaps  xmmword ptr [rbp-10h], xmm0
	movaps  xmmword ptr [rbp], xmm1
	movaps  xmmword ptr [rbp+10h], xmm2
	movaps  xmmword ptr [rbp+20h], xmm3
	movaps  xmmword ptr [rbp+30h], xmm4
	movaps  xmmword ptr [rbp+40h], xmm5
	sti                                        ; Enable maskable interrupts
	mov     rcx, rsp
	call    ntkrnlmp!PsAltSystemCallDispatch

	cmp     al, 1
	je      lbl_dbg_03
	mov     rax, qword ptr [rbp-50h]           ; rax = system service number
	jl      lbl_dbg_02

	mov     ecx, 0C000001Ch                    ; STATUS_INVALID_SYSTEM_SERVICE
	xor     edx, edx
	mov     r8, qword ptr [rbp+0E8h]           ; Return address to the user-mode
	call    ntkrnlmp!KiExceptionDispatch
	int     3

lbl_dbg_02:
	test    byte ptr [rbx+3], 4                ; Minimal  (DISPATCHER_HEADER::DebugActive)
	je      KiSystemServiceExit
	jmp     KiSystemServiceExitPico

lbl_dbg_03:

	test    byte ptr [rbx+3], 80h              ; UmsPrimary  (DISPATCHER_HEADER::DebugActive)
	je      lbl_dbg_04

	mov     ecx, 0C0000102h                    ; IA32_KERNEL_GS_BASE
	rdmsr   
	shl     rdx, 20h
	or      rax, rdx
	cmp     rax, qword ptr [ntkrnlmp!MmUserProbeAddress]
	cmovae  rax, qword ptr [ntkrnlmp!MmUserProbeAddress]
	cmp     qword ptr [rbx+0F0h], rax          ; TEB*
	je      lbl_dbg_04

	mov     rdx, qword ptr [rbx+1F0h]          ; UCB
	bts     dword ptr [rbx+74h], 8
	dec     word ptr [rbx+1E6h]                ; SpecialApcDisable
	mov     qword ptr [rdx+80h], rax
	sti                                        ; Enable maskable interrupts
	call    ntkrnlmp!KiUmsCallEntry

	jmp     lbl_dbg_05

lbl_dbg_04:

	test    byte ptr [rbx+3], 40h              ; UmsScheduled  (DISPATCHER_HEADER::DebugActive)
	je      lbl_dbg_05

	bts     dword ptr [rbx+74h], 10h           ; UmsPerformingSyscall

lbl_dbg_05:

	mov     r8, qword ptr [rbp-38h]            ; Restore 3rd input parameter
	mov     r9, qword ptr [rbp-30h]            ; Restore 4th input parameter

The part of the code that deals with the situation when a debugger is attached to a thread is quite interesting, and I may return to it in another blog post.

The code then retrieves the first two input parameters that were passed into the original native function (the one that initiated the syscall we are processing here.) Remember, in our case it was NtWaitForSingleObject. It then saves the first argument into the KTHREAD struct for the current thread, along with the syscall number.

It then enables maskable interrupts, which were automatically disabled by the syscall instruction when we entered the kernel.

KiSystemServiceUser
lbl_s_01:

	mov     rax, qword ptr [rbp-50h]           ; system service number
	mov     rcx, qword ptr [rbp-48h]           ; Restore 1st input parameter
	mov     rdx, qword ptr [rbp-40h]           ; Restore 2nd input parameter	
	sti                                        ; Enable maskable interrupts	
	mov     qword ptr [rbx+88h], rcx           ; KTHREAD::FirstArgument
	mov     dword ptr [rbx+80h], eax           ; KTHREAD::SystemCallNumber
	nop

Next it remembers the current "trap frame" in the KTHREAD struct.

KiSystemServiceStart - System Service Number

At this point the service routine begins processing the "system service number" itself, i.e. the EAX register value that was passed to the syscall instruction in user mode, which defines which service function we're calling.

For that the code splits the system service number (or 4 for the NtWaitForSingleObject call in our case) into two components:

  • EAX = lower 12 bits of the system service number.
  • EDI = bit 12 of the system service number, scaled to the value 20h.

I'll explain a bit later how these are used.

This could be visualized by the following bit breakdown:
1 11            
2 1098 7654 3210  - bit numbers
----------------
u iiii iiii iiii  - bits

Where:

  • i = EAX lower 12 bits, or "syscall index"
  • u = EDI bit 12.

All that looks pretty neat in code:

KiSystemServiceStart
	mov     qword ptr [rbx+90h], rsp           ; KTHREAD::KTRAP_FRAME*
	mov     edi, eax                           ; eax = system service number
	shr     edi, 7
	and     edi, 20h                           ; bit 12
	and     eax, 0FFFh                         ; lower 12 bits = syscall index

The interesting part about this code block is that this is where most of the Zw* kernel function prologues jump to from the KiServiceInternal shim. We'll get back to it a bit later.
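
In C-like terms, the split performed by KiSystemServiceStart can be sketched as follows (a simplified model with hypothetical helper names, not kernel code):

```cpp
#include <cstdint>

// Sketch of KiSystemServiceStart's split of the system service number.
// The lower 12 bits index into a service table; bit 12, scaled to 20h by
// the shr/and pair, later selects the win32k part of the SSDT.
struct SyscallNumberParts {
    uint32_t index;        // EAX after 'and eax, 0FFFh'
    uint32_t tableOffset;  // EDI after 'shr edi, 7' / 'and edi, 20h'
};

inline SyscallNumberParts SplitSystemServiceNumber(uint32_t number) {
    SyscallNumberParts parts;
    parts.index       = number & 0xFFF;        // lower 12 bits = syscall index
    parts.tableOffset = (number >> 7) & 0x20;  // bit 12 becomes 0 or 20h
    return parts;
}
```

For NtWaitForSingleObject (number 4) this yields index 4 and table offset 0; a number with bit 12 set, such as 1000h, yields index 0 and table offset 20h.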

System Service Descriptor Tables

The next block of code deals with the so-called "System Service Descriptor Tables" (or SSDTs.) These tables effectively map a syscall to the address of its kernel service function using the syscall's "system service number" as an index.

SSDT can be visualized as the following struct:

C++
struct SERVICE_DESCRIPTOR_TABLE
{
	SYSTEM_SERVICE_TABLE nt;                   // for service functions in: ntoskrnl.exe or ntkrnlmp
	SYSTEM_SERVICE_TABLE win32k;               // for service functions in: win32k.sys or GUI subsystem
	SYSTEM_SERVICE_TABLE reserved2;
	SYSTEM_SERVICE_TABLE reserved3;
};

Where:

C++
struct SYSTEM_SERVICE_TABLE
{
	LONG* ServiceTable;         // Array of service function offsets & number of parameters
	void* CounterTable;
	ULONG ServiceLimit;         // Number of elements in 'ServiceTable'
	void* ArgumentTable;
};

So the code below picks the SSDT that it needs, based on the KTHREAD::ThreadFlags for the user-mode thread that invoked the syscall, and stores it in the R10 register. If it's not a GUI thread, it uses the KeServiceDescriptorTable. If it's a regular GUI thread, it uses KeServiceDescriptorTableShadow. Otherwise, for filtered GUI threads, it goes with KeServiceDescriptorTableFilter.
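
Expressed as C++, the table selection looks like this (a sketch with stub table objects; the flag values are the ones used in the assembly):

```cpp
#include <cstdint>

// Stand-ins for the kernel's three descriptor tables (sketch only; the
// real objects are the SERVICE_DESCRIPTOR_TABLE instances shown earlier).
struct ServiceDescriptorTableStub { int id; };

static ServiceDescriptorTableStub KeServiceDescriptorTableStub{0};        // non-GUI
static ServiceDescriptorTableStub KeServiceDescriptorTableShadowStub{1};  // GUI
static ServiceDescriptorTableStub KeServiceDescriptorTableFilterStub{2};  // filtered GUI

// KTHREAD::ThreadFlags bits, as seen in the assembly.
constexpr uint32_t GuiThread           = 0x80;
constexpr uint32_t RestrictedGuiThread = 0x200000;

inline ServiceDescriptorTableStub* PickSsdt(uint32_t threadFlags) {
    if (!(threadFlags & GuiThread))
        return &KeServiceDescriptorTableStub;        // KeServiceDescriptorTable
    if (threadFlags & RestrictedGuiThread)
        return &KeServiceDescriptorTableFilterStub;  // KeServiceDescriptorTableFilter
    return &KeServiceDescriptorTableShadowStub;      // KeServiceDescriptorTableShadow
}
```

Note how the RestrictedGuiThread flag is only ever consulted for a thread that is already a GUI thread, mirroring the order of the two test instructions.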

KiSystemServiceRepeat
KiSystemServiceRepeat:

	lea     r10, [ntkrnlmp!KeServiceDescriptorTable]
	lea     r11, [ntkrnlmp!KeServiceDescriptorTableShadow]

	test    dword ptr [rbx+78h], 80h           ; GuiThread  (KTHREAD::ThreadFlags)
	je      lbl_s_22                           ; Jump if not a GUI-thread
	test    dword ptr [rbx+78h], 200000h       ; RestrictedGuiThread  (KTHREAD::ThreadFlags)
	je      lbl_s_21                           ; Jump if not a filtered GUI-thread
	lea     r11, [ntkrnlmp!KeServiceDescriptorTableFilter]
lbl_s_21:
	mov     r10, r11
lbl_s_22:
	cmp     eax, dword ptr [r10+rdi+10h]       ; SYSTEM_SERVICE_TABLE::ServiceLimit
	jae     lbl_check_n_conv_gui               ; Jump if syscall index is out of bounds

Finally, it checks the lower 12 bits of the "system service number" (or the syscall index) for an overflow by comparing it to SYSTEM_SERVICE_TABLE::ServiceLimit of the appropriate SSDT. Remember, the code above kept it in the EAX register.

That same code also set the EDI register (or rather its bigger 64-bit sibling, RDI) to 20h if the "system service number" referred to the GUI (win32k.sys) subsystem. Thus, the cmp eax, dword ptr [r10+rdi+10h] instruction adds the 20h offset (in RDI) to the address of the previously selected SSDT in R10. And, as you can see in the SERVICE_DESCRIPTOR_TABLE struct, the win32k member follows right after the first nt member, whose sizeof is exactly 20h bytes. This is what that magic 20h was doing in the EDI register for a GUI syscall in the code above.

Finally, if the syscall index is greater-or-equal to SYSTEM_SERVICE_TABLE::ServiceLimit, the execution jumps to the lbl_check_n_conv_gui label. This is somewhat outside of the scope of this post. But I'll explain it briefly without including the Assembly code.

The code at the lbl_check_n_conv_gui branch checks if EDI contains the value 20h, which refers to the GUI subsystem. If it does, it means this is the first invocation of a GUI function and we need to convert our thread to a GUI thread (by calling the KiConvertToGuiThread function), and then repeat all the checks by jumping back to the KiSystemServiceRepeat label above.

If on the other hand EDI is not 20h, this means that we were passed an incorrect "system service number" that is out of bounds. In this case the execution simply returns STATUS_INVALID_SYSTEM_SERVICE back to the user-mode by exiting the syscall.

Note that the "GUI thread" moniker exists only for historical reasons. Some time ago, in the ancient past, Microsoft decided to move most of their GUI code from user mode into the kernel (for performance reasons.) This was done after a good portion of the NT kernel infrastructure was already in place. Thus, they had to separate the GUI syscalls from the regular NT syscalls (handled in the ntoskrnl module), as the GUI syscall processing code resided in a totally different module: win32k.sys.
An interesting aside is how a thread becomes a "GUI thread". Obviously there is no way of knowing this until one of the syscalls to the GUI subsystem is made. Microsoft differentiates these syscalls by using bit 12 of the "system service number" to denote a call to the GUI subsystem. So if the syscall service code notices that bit 12 is set (by ANDing the shifted value with 20h), it invokes the KiConvertToGuiThread function to convert that thread to a GUI thread.

This is done only once per thread.

Another question that may come to mind is, what the heck is the "filtered GUI thread", or the one that uses KeServiceDescriptorTableFilter SSDT?

There's not much official documentation there. But from what I can gather, this is an internal feature that Microsoft uses as a fine-grained counterpart to their documented PROCESS_MITIGATION_SYSTEM_CALL_DISABLE_POLICY option. If you enable the latter option on a process, that process will not be able to make any calls to the win32k.sys GUI subsystem. In most cases that is analogous to using a sledgehammer to fix a dent. So the filtering SSDT allows specifying a detailed list of GUI syscalls that are available for a process to make.

By the way, the only reason why Microsoft is attempting to isolate their win32k.sys subsystem is because of its very poor reputation in regards to security vulnerabilities that were found in it. To put it briefly, win32k.sys contains very buggy code that you don't want potential attackers to abuse in your critical application.

System Service Number To Service Function

Next, the code begins converting the "system service number" into the actual service function pointer.

It first retrieves the base address of the ServiceTable from the SSDT and stores it in R10. It then adds the offset to it to get the address of the actual kernel function to invoke. (It is updated and stored in the R10 register.)

The code gets a specific 32-bit signed value from the SYSTEM_SERVICE_TABLE::ServiceTable array using the syscall index in EAX (or rather the full RAX register) that was obtained earlier.

Note though that the code shifts the 32-bit signed value by 4 bits to the right to get the offset. This tells us that the lower 4 bits of that value are used for something else.

We'll see its use in the next code block.

x86-64
	mov     r10, qword ptr [r10+rdi]           ; R10 = SYSTEM_SERVICE_TABLE::ServiceTable
	movsxd  r11, dword ptr [r10+rax*4]
	mov     rax, r11                           ; RAX = R11 = Offset to function & number of parameters
	sar     r11, 4
	add     r10, r11                           ; R10 = address of service function

	cmp     edi, 20h
	jne     lbl_no_w32_callout                 ; Jump if non-GUI syscall
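
The packed ServiceTable entry decode above can be modeled in C++ like this (a sketch with hypothetical names; the arithmetic shift matters because the offset is signed and may be negative if a service function lives below the table in memory):

```cpp
#include <cstdint>

// Sketch of the movsxd/sar decode. An entry packs a signed byte offset
// (upper 28 bits) and a count of stack-passed parameters (lower 4 bits).
struct DecodedEntry {
    int32_t  offset;     // entry >> 4, arithmetic shift keeps the sign
    uint32_t stackArgs;  // entry & 0xF
};

inline DecodedEntry DecodeServiceEntry(int32_t entry) {
    DecodedEntry d;
    d.offset    = entry >> 4;                        // 'sar r11, 4'
    d.stackArgs = static_cast<uint32_t>(entry) & 0xF;
    return d;
}

// The service function address is then: ServiceTable base + d.offset.
```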

The following code-block is not relevant for our example as it deals with GDI batching for a GUI thread. Our syscall will skip over it:

GDI Batching
	mov     r11, qword ptr [rbx+0F0h]          ; R11 = TEB*

KiSystemServiceGdiTebAccess:

	cmp     dword ptr [r11+1740h], 0           ; TEB::GdiBatchCount
	je      lbl_no_w32_callout

	mov     qword ptr [rbp-50h], rax
	mov     qword ptr [rbp-48h], rcx
	mov     qword ptr [rbp-40h], rdx
	mov     rbx, r8
	mov     rdi, r9
	mov     rsi, r10
	mov     ecx, 7
	xor     edx, edx
	xor     r8, r8
	xor     r9, r9
	call    ntkrnlmp!PsInvokeWin32Callout
	mov     rax, qword ptr [rbp-50h]
	mov     rcx, qword ptr [rbp-48h]
	mov     rdx, qword ptr [rbp-40h]
	mov     r8, rbx
	mov     r9, rdi
	mov     r10, rsi
	nop     dword ptr [rax]

lbl_no_w32_callout:

Service Function Input Parameters

Now we get to the calculation and a setup of the service function parameters that we are invoking.

As you can see from the description above, the R10 register now holds the address of the function that we need to call to service our syscall. But if you remember the x64 calling convention, we can't call it just yet.

Even though we have the first four input parameters for that function in: RCX, RDX, R8 and R9 registers; any further input parameters are passed on the stack. And we don't have those set up yet.

This is what the next chunk of code will do.

But before we get to it, remember the 4 lower bits that were left out of the function offsets in the SYSTEM_SERVICE_TABLE::ServiceTable array? Well, this is where they come in handy. Those 4 bits represent 16 possible combinations for the number of input parameters for the function that services the syscall.

To be precise, for the x64 calling convention that we've been dealing with, the number stored in the lower 4 bits of the function offsets in the SYSTEM_SERVICE_TABLE::ServiceTable array counts only the input parameters that are "passed on the stack". Thus we have to add 4 to get the actual number of possible input parameters (since the first 4 parameters are passed in the registers: RCX, RDX, R8, R9.)

So if we exclude 0 as no-parameters, then the maximum number of input parameters that we can pass into a syscall in 64-bit Windows is 19, or 15 + 4.

Let's see how that 4-bit nibble is used.

If it is 0, then nothing needs to be done and the code jumps to the KiSystemServiceCopyEnd label.

Otherwise the value of a 4-bit nibble is multiplied by 8 (in the shl eax, 3 instruction) and that value is subtracted from the address of the KiSystemServiceCopyEnd label.

But why multiply it by 8, you may ask?

If you look at the pairs of the mov instructions in the sequence that starts from the KiSystemServiceCopyStart label, each pair takes 8 bytes in machine code:

x86-64
48 8B 46 70	       mov   rax, qword ptr [rsi+70h]
48 89 47 70        mov   qword ptr [rdi+70h], rax

And each pair of those mov instructions copies a single input parameter onto the stack for the service function. That is why we multiply by 8: to get the code offset backwards from the KiSystemServiceCopyEnd label to the mov pair where the stack copying should begin.

Lastly, we need to calculate the address of where to copy the input parameters from the user-mode stack (and store it in the rsi register), as well as the destination address in the kernel stack (and store it in the rdi register.)

For security reasons, the code needs to make sure that the user-mode caller passed us a stack address in the user-mode address range. For that it checks the rsi register against the MmUserProbeAddress constant (or MM_USER_PROBE_ADDRESS), which is the address of demarcation between the user-mode and the kernel-mode address ranges.

If the caller passed us a user-stack pointer in the kernel address range, the code substitutes it with the MmUserProbeAddress constant itself. (As a side note, this is a strange way to deal with this situation; I'd expect it to raise an exception.)

Finally, once we have the source and destination stacks (in the rsi and rdi registers, respectively) and know how many parameters to copy, the optimized code below executes a jmp r11 instruction, which redirects execution to one of the pairs of mov instructions in the sequence that starts at the KiSystemServiceCopyStart label. That sequence of mov instructions copies all the necessary input parameters from the user-mode stack into the kernel one.
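
Before looking at the assembly, here is the net effect of that copy sequence as a C++ sketch (hypothetical function name; the computed jmp r11 merely enters the unrolled mov sequence at the right pair):

```cpp
#include <cstdint>

// Sketch of KiSystemServiceCopyStart's effect: copy 'count' stack-passed
// parameters (8 bytes each) from the user stack to the kernel stack.
// Slot 0 is skipped because the offsets start at [rsi+8]/[rdi+8] in the
// assembly; the first four parameters travel in RCX, RDX, R8 and R9.
inline void CopyStackArguments(uint64_t* kernelDst,
                               const uint64_t* userSrc,
                               unsigned count) {
    for (unsigned i = count; i >= 1; --i)   // highest offset first, as the
        kernelDst[i] = userSrc[i];          // unrolled mov sequence does
}
```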

x86-64
	and     eax, 0Fh
	je      KiSystemServiceCopyEnd
	shl     eax, 3	
	lea     rsp, [rsp-70h]                     ; Adjust RSP for the service function
	lea     rdi, [rsp+18h]                     ; Reserve space for the "shadow stack"
	mov     rsi, qword ptr [rbp+100h]          ; RSI = user-mode RSP value
	lea     rsi, [rsi+20h]                     ; "shadow stack" + return address

	test    byte ptr [rbp+0F0h], 1
	je      lbl_jmp_r11	                       ; Jump if(is_syscall & 1) == 0

    ; Check user-mode RSP for overflow

	cmp     rsi, qword ptr [ntkrnlmp!MmUserProbeAddress]   ; 7fff`ffff0000h
	cmovae  rsi, qword ptr [ntkrnlmp!MmUserProbeAddress]
	nop     dword ptr [rax]

lbl_jmp_r11:

	lea     r11, KiSystemServiceCopyEnd
	sub     r11, rax
	jmp     r11

	int     3
	int     3
	int     3

KiSystemServiceCopyStart:

	mov     rax, qword ptr [rsi+70h]
	mov     qword ptr [rdi+70h], rax
	mov     rax, qword ptr [rsi+68h]
	mov     qword ptr [rdi+68h], rax
	mov     rax, qword ptr [rsi+60h]
	mov     qword ptr [rdi+60h], rax
	mov     rax, qword ptr [rsi+58h]
	mov     qword ptr [rdi+58h], rax
	mov     rax, qword ptr [rsi+50h]
	mov     qword ptr [rdi+50h], rax
	mov     rax, qword ptr [rsi+48h]
	mov     qword ptr [rdi+48h], rax
	mov     rax, qword ptr [rsi+40h]
	mov     qword ptr [rdi+40h], rax
	mov     rax, qword ptr [rsi+38h]
	mov     qword ptr [rdi+38h], rax
	mov     rax, qword ptr [rsi+30h]
	mov     qword ptr [rdi+30h], rax
	mov     rax, qword ptr [rsi+28h]
	mov     qword ptr [rdi+28h], rax
	mov     rax, qword ptr [rsi+20h]
	mov     qword ptr [rdi+20h], rax
	mov     rax, qword ptr [rsi+18h]
	mov     qword ptr [rdi+18h], rax
	mov     rax, qword ptr [rsi+10h]
	mov     qword ptr [rdi+10h], rax
	mov     rax, qword ptr [rsi+8]
	mov     qword ptr [rdi+8], rax

KiSystemServiceCopyEnd:

Calling The Service Function

And lastly, after a couple of checks for tracing options, we are ready to call our destination service function via the call rax instruction:

Remember that r10 contained our service function pointer, which the code below copies into rax for the call instruction.

I'm not sure, though, why they couldn't just use a call r10 instruction instead.

x86-64
	test    dword ptr [ntkrnlmp!KiDynamicTraceMask], 1
	jne     lbl_track_syscall
	test    dword ptr [ntkrnlmp!PerfGlobalGroupMask+0x8], 40h
	jne     lbl_perf_info_log_syscall

	mov     rax, r10
	call    rax                                ; Call the service function

In case you are curious what the code at the lbl_track_syscall and lbl_perf_info_log_syscall labels does:
  • lbl_track_syscall - wraps the invocation of the service function for the syscall with the KiTrackSystemCallEntry function. It is used to invoke a trace callback for the system service dispatcher.
  • lbl_perf_info_log_syscall - does the same wrapping, but with the PerfInfoLogSysCallEntry function that does performance logging for the system service dispatcher.

The rest of the code sequence in those branches is exactly the same.

After the indirect call instruction, the execution will follow to the NtWaitForSingleObject function in our specific case. This may put the calling thread into a waiting state, or return immediately, depending on the state of the kernel object that this function was called with.

Leaving SYSCALL

After the processing of the service function is done, the execution will return from the indirect call instruction, in the code branch shown above.

But we're far from being done with the syscall. Let's see what happens next.

Return From The Service Function

The return value from running a service function is returned in the rax register, and thus the remaining part of the syscall code needs to preserve it. The code will do so by placing it on the stack into a temporary address at the [rbp-50h] offset.

Note that it seems like a kernel service function cannot accept any direct floating point values (via SSE registers) or return one. Although, it can hypothetically accept floating point variables passed by reference.

The first thing the code does is increment the syscall counter in KPRCB::KeSystemCalls. This may be Microsoft's way of keeping track of some telemetry, or of calculating the system workload.

Then the code restores 3 of the nonvolatile registers: RBX, RDI and RSI. They have to be preserved, according to the x64 calling convention.

And the R11 register is set to point to the KTHREAD struct, that describes the current user thread.

Then the code makes sure that the IRQL is at PASSIVE_LEVEL, that KTHREAD::ApcStateIndex is 0, and that APCs are not disabled, by checking the KernelApcDisable and SpecialApcDisable flags (combined in the KTHREAD::CombinedApcDisable union.) If any of these conditions doesn't hold, the code executes a KiBugCheckDispatch, or a KeBugCheckEx call (that will cause a BSoD.)

The KTHREAD::ApcStateIndex will not be 0 if the thread is currently attached to a process other than the one that initiated the syscall. It would be a violation to return that thread to user mode in such a state.
The bug-check in this case will have the following codes: IRQL_GT_ZERO_AT_SYSTEM_SERVICE for the incorrect IRQL, or APC_INDEX_MISMATCH for the APC index mismatch.

Lastly, the code disables maskable interrupts for the rest of its function.

Maskable interrupts will be re-enabled when the kernel executes the sysretq instruction.
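
Condensed into C++, that sanity check amounts to OR-ing the three values together and bug-checking on any nonzero result (a sketch with hypothetical names):

```cpp
#include <cstdint>

// Sketch of the cr8/ApcStateIndex/CombinedApcDisable check before the
// handler returns to user mode. Any nonzero bit means a bug check.
// Mirrors: mov rcx, cr8 / or cl, ApcStateIndex / or ecx, CombinedApcDisable
inline bool MustBugCheck(uint8_t irql,
                         uint8_t apcStateIndex,
                         uint32_t combinedApcDisable) {
    return (irql | apcStateIndex | combinedApcDisable) != 0;
}
```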

KiSystemServiceExit
	nop     dword ptr [rax]
	inc     dword ptr gs:[2EB8h]               ; KPRCB::KeSystemCalls

KiSystemServiceExit:

	mov     rbx, qword ptr [rbp+0C0h]
	mov     rdi, qword ptr [rbp+0C8h]
	mov     rsi, qword ptr [rbp+0D0h]

	mov     r11, qword ptr gs:[188h]           ; R11 = KPRCB::KTHREAD*

	test    byte ptr [rbp+0F0h], 1
	je      lbl_alt_exit                       ; If (is_syscall & 1) == 0: perform alternative exit from syscall

	mov     rcx, cr8                           ; IRQL (Task Priority Register, TPR)
	or      cl, byte ptr [r11+24Ah]            ; KTHREAD::ApcStateIndex
	or      ecx, dword ptr [r11+1E4h]          ; KTHREAD::CombinedApcDisable
	jne     lbl_bug_check_1

	cli                                        ; Ignore maskable interrupts

Note the conditional jump to the lbl_alt_exit branch. This is an alternative exit from the syscall handler that we will discuss later.

Processing User-Mode APCs

The next stage is to process all accumulated user-mode APCs.

We already wrote about user-mode and kernel APC in two different blog posts.

The code below spins in a loop, checking if the SpecialUserApcPending or UserApcPending flags of the KTHREAD::KAPC_STATE object are set, denoting the presence of one or more queued APCs. If so, it invokes the KiInitiateUserApc function to process them one at a time.

Note that before processing an APC, the code raises the IRQL to the APC_LEVEL and enables maskable interrupts. It then disables maskable interrupts after the KiInitiateUserApc call returns, and lowers the IRQL back down to the PASSIVE_LEVEL.
User APCs
lbl_apc_01:

	mov     rcx, qword ptr gs:[188h]           ; RCX = KPRCB::KTHREAD*
	test    byte ptr [rcx+0C2h], 3             ; KTHREAD::KAPC_STATE::UserApcPendingAll
	je      lbl_ex_10                          ; Jump if no user APCs

	mov     qword ptr [rbp-50h], rax           ; Save return value from the service function

	xor     eax, eax
	mov     qword ptr [rbp-48h], rax
	mov     qword ptr [rbp-40h], rax
	mov     qword ptr [rbp-38h], rax
	mov     qword ptr [rbp-30h], rax
	mov     qword ptr [rbp-28h], rax
	mov     qword ptr [rbp-20h], rax
	pxor    xmm0, xmm0
	movaps  xmmword ptr [rbp-10h], xmm0
	movaps  xmmword ptr [rbp], xmm0
	movaps  xmmword ptr [rbp+10h], xmm0
	movaps  xmmword ptr [rbp+20h], xmm0
	movaps  xmmword ptr [rbp+30h], xmm0
	movaps  xmmword ptr [rbp+40h], xmm0

	mov     ecx, 1                             ; IRQL = APC_LEVEL
	mov     cr8, rcx

	sti                                        ; Enable maskable interrupts
	call    ntkrnlmp!KiInitiateUserApc         ; Initiate a user-mode APC
	cli                                        ; Disable maskable interrupts

	mov     ecx, 0
	mov     cr8, rcx                           ; IRQL = PASSIVE_LEVEL

	mov     rax, qword ptr [rbp-50h]           ; Restore return value from the service function

	jmp     lbl_apc_01

	lbl_ex_10:

In a nutshell, APC stands for "Asynchronous Procedure Call". It is Microsoft's way of deferring execution of a callback until a later, more appropriate time, in the context of the same thread. APCs are mostly a kernel-mode facility, but they are also exposed to user land. For instance, one can use QueueUserAPC to queue one.
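The shape of that drain loop can be modeled in portable C. Everything below is a hypothetical sketch (apc_queue, bump, and the other names are my own illustrative inventions, not kernel structures): deliver queued callbacks one at a time for as long as the pending flag stays set, just like the lbl_apc_01 loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the user-APC drain loop shown above. A real KTHREAD keeps
 * the pending bits and the APC queue inside KAPC_STATE; here a tiny ring
 * buffer (no overflow handling) stands in for the queue. */
typedef void (*apc_routine)(void *);

struct apc_queue {
    apc_routine routines[8];
    void *args[8];
    int head, tail;       /* head == tail means empty              */
    int user_apc_pending; /* models the UserApcPending flag        */
};

static void apc_queue_push(struct apc_queue *q, apc_routine r, void *arg)
{
    q->routines[q->tail] = r;
    q->args[q->tail] = arg;
    q->tail = (q->tail + 1) % 8;
    q->user_apc_pending = 1;
}

/* Models one KiInitiateUserApc call: deliver exactly one APC,
 * then recompute the pending flag. */
static void initiate_one_apc(struct apc_queue *q)
{
    apc_routine r = q->routines[q->head];
    void *arg = q->args[q->head];
    q->head = (q->head + 1) % 8;
    q->user_apc_pending = (q->head != q->tail);
    r(arg);
}

/* Models the lbl_apc_01 loop: keep going while the pending flag is set. */
static int drain_user_apcs(struct apc_queue *q)
{
    int delivered = 0;
    while (q->user_apc_pending) {  /* test byte ptr [rcx+0C2h], 3 */
        initiate_one_apc(q);       /* call KiInitiateUserApc      */
        delivered++;
    }
    return delivered;
}

/* A sample callback for demonstration. */
static void bump(void *counter) { ++*(int *)counter; }
```

The kernel additionally raises the IRQL and toggles interrupts around each delivery, which this user-mode model deliberately leaves out.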

Security Mitigations At Exit

The following section of code deals with more vulnerability mitigations during an exit from the syscall. In this case it implements the "Single Thread Indirect Branch Predictors" (or STIBP) mitigation that deals with protecting user-mode code against the Spectre v2 vulnerability when hyper-threading is enabled.

In a nutshell, this vulnerability "leaks" branch predictor state from one hyper-thread to its sibling. Thus, the following mitigation attempts to keep the STIBP state of the paired logical processors consistent using the KiUpdateStibpPairing function.
STIBP
	test    byte ptr gs:[27Eh], 2              ; KPRCB::PairRegister
	je      lbl_ex_12                          ; Jump if pairing state is not stale

	mov     qword ptr [rbp-50h], rax
	xor     ecx, ecx
	call    ntkrnlmp!KiUpdateStibpPairing      ; Single Thread Indirect Branch Predictors (STIBP)
	mov     rax, qword ptr [rbp-50h]

lbl_ex_12:

It then performs some additional processing, depending on the state of the KTHREAD::DISPATCHER_HEADER struct.

KiRestoreSetContextState
	mov     rcx, qword ptr gs:[188h]           ; RCX = KPRCB::KTHREAD*
	test    dword ptr [rcx], 8000000h          ; DISPATCHER_HEADER::Lock
	je      lbl_ex_13

	mov     qword ptr [rbp-50h], rax           ; Save return value from the service function

	xor     eax, eax
	mov     qword ptr [rbp-48h], rax
	mov     qword ptr [rbp-40h], rax
	mov     qword ptr [rbp-38h], rax
	mov     qword ptr [rbp-30h], rax
	mov     qword ptr [rbp-28h], rax
	mov     qword ptr [rbp-20h], rax
	pxor    xmm0, xmm0
	movaps  xmmword ptr [rbp-10h], xmm0
	movaps  xmmword ptr [rbp], xmm0
	movaps  xmmword ptr [rbp+10h], xmm0
	movaps  xmmword ptr [rbp+20h], xmm0
	movaps  xmmword ptr [rbp+30h], xmm0
	movaps  xmmword ptr [rbp+40h], xmm0

	call    ntkrnlmp!KiRestoreSetContextState

lbl_ex_13:

	mov     rcx, qword ptr gs:[188h]           ; RCX = KPRCB::KTHREAD*
	test    dword ptr [rcx], 40010000h         ; DISPATCHER_HEADER::Lock
	je      lbl_ex_14

	mov     qword ptr [rbp-50h], rax           ; Save return value from the service function

	test    byte ptr [rcx+2], 1                ; CycleProfiling  (DISPATCHER_HEADER::ThreadControlFlags)
	je      lbl_ex_13_1

	call    ntkrnlmp!KiCopyCounters

	mov     rcx, qword ptr gs:[188h]           ; RCX = KPRCB::KTHREAD*

lbl_ex_13_1:

	test    byte ptr [rcx+3], 40h              ; UmsScheduled  (DISPATCHER_HEADER::DebugActive)
	je      lbl_ex_13_2

	lea     rsp, [rbp-80h]                     ; Restore RSP (stack pointer)
	xor     ecx, ecx
	call    ntkrnlmp!KiUmsExit

lbl_ex_13_2:

	mov     rax, qword ptr [rbp-50h]           ; Restore return value from the service function

lbl_ex_14:

The code above will not be invoked during regular processing of a syscall, like our NtWaitForSingleObject function call.

Instrumentation Callback

Then the code restores the MXCSR control and status register, and also restores CPU debugging registers in the KiRestoreDebugRegisterState function, if they were previously saved.

As an interesting aside, the code below performs an additional action if KPROCESS::InstrumentationCallback is not 0. In that case the R10 register is set to the initial return address from the syscall, and the actual return address is replaced with the value of KPROCESS::InstrumentationCallback. This effectively makes it possible to specify an alternate return target from a syscall.

KiRestoreDebugRegisterState
	ldmxcsr dword ptr [rbp-54h]                ; Restore MXCSR

	xor     r10, r10
	cmp     word ptr [rbp+80h], 0              ; word_f70    (lower CONTEXT::MxCsr)
	je      lbl_ex_25

	mov     qword ptr [rbp-50h], rax           ; Save return value from the service function

	call    ntkrnlmp!KiRestoreDebugRegisterState

	mov     rax, qword ptr gs:[188h]           ; RAX = KPRCB::KTHREAD*
	mov     rax, qword ptr [rax+0B8h]          ; RAX = KTHREAD::KAPC_STATE::KPROCESS
	mov     rax, qword ptr [rax+3D8h]          ; RAX = KPROCESS::InstrumentationCallback

	or      rax, rax
	je      lbl_ex_24
	cmp     word ptr [rbp+0F0h], 33h
	jne     lbl_ex_24                          ; Jump if(is_syscall != 0x33)

	mov     r10, qword ptr [rbp+0E8h]          ; R10 = return address from the syscall

	mov     qword ptr [rbp+0E8h], rax          ; Update return address from the syscall

lbl_ex_24:

	mov     rax, qword ptr [rbp-50h]           ; Restore return value from the service function

lbl_ex_25:

As the multiple conditional checks in the code above show, this is most certainly a debugging feature of the syscall handler.
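The redirection logic itself is small enough to model in plain C. This is a sketch of the checks shown above, not actual kernel code; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the InstrumentationCallback swap shown above.
 * 'frame_cs' stands for the is_syscall / CS value at [rbp+0F0h]. */
struct syscall_frame {
    uint64_t rip; /* return address to user mode  ([rbp+0E8h]) */
    uint64_t r10; /* scratch register handed back to user mode */
};

static void apply_instrumentation_callback(struct syscall_frame *f,
                                           uint64_t callback,
                                           uint16_t frame_cs)
{
    f->r10 = 0;           /* xor r10, r10                              */
    if (callback == 0)    /* no callback registered for the process    */
        return;
    if (frame_cs != 0x33) /* not a genuine user-mode syscall frame     */
        return;
    f->r10 = f->rip;      /* R10 = original return address             */
    f->rip = callback;    /* the sysretq will land in the callback     */
}
```

The callback thus receives the original return address in R10 and can jump back to it once it is done, which is how user-mode instrumentation tools typically use this hook.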

More Security Mitigations At Exit

After that the code saves (one more time) the return value from the service function in RAX at the [rbp-50h] address on the stack, and then resets the following security mitigations:

x86-64
	mov     qword ptr [rbp-50h], rax           ; Save return value from the service function

	mov     byte ptr gs:[853h], 0              ; KPRCB::BpbRetpolineState
                                               ;  1h = BpbRunningNonRetpolineCode
                                               ;  2h = BpbIndirectCallsSafe
                                               ;  4h = BpbRetpolineEnabled

	movzx   eax, byte ptr gs:[27Dh]            ; KPRCB::BpbUserSpecCtrl
	cmp     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl
	je      lbl_ex_30

	mov     byte ptr gs:[27Ah], al             ; KPRCB::BpbCurrentSpecCtrl
	mov     ecx, 48h                           ; IA32_SPEC_CTRL MSR
	xor     edx, edx
	wrmsr

lbl_ex_30:

	btr     word ptr gs:[278h], 2              ; KPRCB::BpbState & BpbIbpbOnReturn
	jae     lbl_ex_31

	mov     eax, 1                             ; Indirect Branch Prediction Barrier (IBPB)
	xor     edx, edx
	mov     ecx, 49h                           ; IA32_PRED_CMD MSR
	wrmsr

lbl_ex_31:

Note that performing these security mitigations on exit from kernel mode is just as important as performing them on entry.

sysretq Instruction

Finally the code restores the RAX, RCX and R11 registers that will be used by the sysretq instruction to return to user mode. It then clears the volatile XMM registers and recovers the previous values of the RBP and RSP stack registers.

It then executes the swapgs instruction to swap the GS base register with the value of the IA32_KERNEL_GS_BASE MSR, basically to revert what was done when we entered the syscall.

At the end of this long sequence, the sysretq instruction returns execution back to the user mode. This is done pretty much in reverse order to what syscall had done originally.

Without going into too many details, the sequence is as follows:

  • RFLAGS is set to the value of the R11 register, with the exception of the RF and VM flags that are always set to 0, and other reserved flags that remain unchanged.
  • CS code segment selector is set from bits [48-63] of IA32_STAR MSR (at address C0000081h), plus the value of 10h, and then OR-ed with 03h, thus setting its RPL/CPL to 3.
  • Set the following CS code segment attributes: CS.Base=0, CS.Limit=FFFFFh, CS.Type=1011b ("Execute & read code segment, accessed"), CS.S=1, CS.DPL=3, CS.P=1, CS.L=1 (64-bit long mode), CS.D=0, CS.G=1 ("4-KByte granularity"), CPL=3 (for "user-mode").
  • SS stack segment selector is set from bits [48-63] of IA32_STAR MSR (at address C0000081h), plus the value of 8h, thus making the SS selector sit 8 bytes below the CS selector in the GDT. It is then OR-ed with 03h, setting its RPL to 3.
  • Set the following SS segment attributes: SS.Base=0, SS.Limit=FFFFFh, SS.Type=0011b ("Read/write data segment, accessed"), SS.S=1, SS.DPL=3, SS.P=1, SS.B=1 ("expand-up, 32-bit stack segment"), SS.G=1 ("4-KByte granularity").
  • RIP is set to the value of the RCX register.
Note that unless KTHREAD::KAPC_STATE::KPROCESS::InstrumentationCallback was set to some non-0 address, execution will return to the instruction that follows the syscall that started this sequence. Otherwise it will jump to the address from InstrumentationCallback, and R10 will contain the original return address.
x86-64
	mov     rax, qword ptr [rbp-50h]           ; Restore return value from the service function
	mov     r8, qword ptr [rbp+100h]           ; R8 = original user-mode RSP
	mov     r9, qword ptr [rbp+0D8h]           ; R9 = original user-mode RBP

	xor     edx, edx
	pxor    xmm0, xmm0                         ; Reset volatile SSE registers to 0
	pxor    xmm1, xmm1
	pxor    xmm2, xmm2
	pxor    xmm3, xmm3
	pxor    xmm4, xmm4
	pxor    xmm5, xmm5

	mov     rcx, qword ptr [rbp+0E8h]          ; RCX = return address to user mode
	mov     r11, qword ptr [rbp+0F8h]          ; R11 = original RFLAGS from user mode

	test    byte ptr [ntkrnlmp!KiKvaShadow], 1
	jne     KiKernelSysretExit

	mov     rbp, r9
	mov     rsp, r8                            ; Restore user-mode stack pointer

	swapgs 
	sysretq                                    ; Return to user-mode
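The selector arithmetic from the sysretq description above can be sanity-checked with a few lines of C. The helper names are my own; the formulas come straight from the bullet list.

```c
#include <assert.h>
#include <stdint.h>

/* Compute the user-mode CS and SS selectors the way sysretq does:
 * both come from bits [48-63] of IA32_STAR, CS at +10h and SS at +8h,
 * with the RPL bits forced to 3 (user mode). */
static uint16_t sysret_cs(uint64_t ia32_star)
{
    return (uint16_t)(((ia32_star >> 48) + 0x10) | 3);
}

static uint16_t sysret_ss(uint64_t ia32_star)
{
    return (uint16_t)(((ia32_star >> 48) + 0x08) | 3);
}
```

With IA32_STAR = 0023001000000000h (the layout commonly reported for Windows x64 — treat that constant as an assumption), these yield the familiar user-mode selectors CS = 33h and SS = 2Bh, which matches the 33h value we keep seeing in the is_syscall slot.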

System Exit With Meltdown Mitigations

In case we used the "Kernel Virtual Address Shadow" (KVAS) for the Meltdown mitigation, the code above will take the KiKernelSysretExit branch and perform additional actions:

  • A check will be made using bit 1 of the KPRCB::ShadowFlags, as was done in the beginning of the syscall handler. The following steps will be performed only if that bit is not set.
  • The base of the page directory in CR3 register will be set to the value of KPROCESS::UserDirectoryTableBase to switch the virtual address translation to its own table for the user-mode.
    Note how the code uses bit 0 of the address from KPROCESS::UserDirectoryTableBase. If that bit is 0, the value is loaded into CR3 as-is. Otherwise the code checks bit 0 of KPRCB::ShadowFlags: if it is set, it gets cleared; if not, bit 63 (the PCID "no-flush" bit) is set in the address before it is loaded into CR3.
  • If bit 1 of KPRCB::ShadowFlags is set, the code will also run the verw instruction on the selector stored in KPRCB::VerwSelector.
    Architecturally, verw only sets (or resets) the ZF flag in RFLAGS, and that flag is never examined after this instruction. However, on CPUs that enumerate the MD_CLEAR capability, verw has a documented side effect of overwriting the CPU's microarchitectural buffers, which makes it a mitigation for MDS-class vulnerabilities on the way back to user mode.

Then finally, the code sequence below executes the sysretq instruction, as I described above, to return to the user mode.

KiKernelSysretExit
KiKernelSysretExit:

	mov     esp, dword ptr gs:[9018h]          ; KPRCB::ShadowFlags
	bt      esp, 1
	jb      lbl_ex_55

	mov     rbp, qword ptr gs:[188h]           ; KPRCB::KTHREAD*
	mov     rbp, qword ptr [rbp+220h]          ; KPROCESS*
	mov     rbp, qword ptr [rbp+388h]          ; KPROCESS::UserDirectoryTableBase

	bt      ebp, 0
	jae     lbl_ex_54
	
	bt      esp, 0
	jb      lbl_ex_53

	bts     rbp, 3Fh
	jmp     lbl_ex_54

lbl_ex_53:

	and     dword ptr gs:[9018h], 0FFFFFFFEh   ; KPRCB::ShadowFlags & ~1

lbl_ex_54:

	mov     cr3, rbp                           ; Set base of page table

lbl_ex_55:

	mov     rbp, r9

	bt      esp, 1
	jb      lbl_ex_56

	verw    word ptr gs:[902Ah]                ; KPRCB::VerwSelector

lbl_ex_56:

	mov     rsp, r8                            ; Restore user mode stack pointer
	swapgs  
	sysretq                                    ; Return to user mode

And this concludes the gargantuan sequence of actions that is performed for each and every call to a kernel function.

Alternative Exit

While reviewing the Assembly code for the syscall handler, you might have noticed a bunch of mysterious checks for the [rbp+0F0h] memory location on the stack. It's obviously some local variable. An astute reader might have even asked themselves, "What does that variable do?"

And this is what I set off to determine.

First off, to make it easier to spot, I labeled it as is_syscall in this post. So use the Ctrl+F keyboard shortcut (or Cmd+F on the Mac) to highlight it in your web browser.

The is_syscall variable begins its life as a hardcoded value 33h on the stack at the very beginning of the processing of a syscall. But it may also come from the Zw* function prolog.

Given where it sits in the trap frame, the is_syscall variable looks very much like a hardcoded code segment selector, i.e. the CS register.

It is then checked in many places, and if its bit 0 is cleared, the code will completely bypass security checks and mitigations, and it will take a totally different (and super-fast) system exit in the KiSystemServiceExit branch:

KiSystemServiceExit (alternative exit)
lbl_alt_exit:

	mov     rdx, qword ptr [rbp+0B8h]          ; alt_exit: P1Home  (or NormalContext for APC)
	mov     qword ptr [r11+90h], rdx           ; KPRCB::KTHREAD::KTRAP_FRAME::P1Home

	mov     dl, byte ptr [rbp-58h]             ; alt_exit: PreviousMode
	mov     byte ptr [r11+232h], dl            ; KPRCB::KTHREAD::PreviousMode

	cli                                        ; Disable maskable interrupts
	mov     rsp, rbp
	mov     rbp, qword ptr [rbp+0D8h]          ; RBP = original RBP from user mode
	mov     rsp, qword ptr [rsp+100h]          ; RSP = original RSP from user mode
	sti                                        ; Enable maskable interrupts

	ret

As you can see above, the return from the syscall handler is done via the ret instruction instead of a more conventional sysretq. Additionally, this code branch also restores the RBP and RSP registers to their original values (before the syscall handler was invoked.)

Also notice that before returning, the code seems to be setting the NormalContext for the APC in the KTRAP_FRAME::P1Home variable for the current thread to the local variable from the stack that I marked as "alt_exit: P1Home". If you look at my kernel stack diagram, that area points to an unused location on the stack, at least for the syscall handler.

The code also sets the KTHREAD::PreviousMode from the location on the stack that I marked as "alt_exit: PreviousMode" in my stack layout diagram. (The PreviousMode can be later retrieved via the documented ExGetPreviousMode function.)

The bottom line is that if the initial value of the hardcoded is_syscall variable on the kernel stack is set with bit 0 clear, the syscall handler will skip most of the vulnerability mitigations and security checks, and return to the address marked as the offset FD8 in my kernel stack diagram.
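That exit-path decision is literally a single bit test, which can be captured in a line of C (a model of the branch, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Models the 'test byte ptr [rbp+0F0h], 1 / je lbl_alt_exit' branch:
 * returns nonzero when the fast (alternative) exit is taken. */
static int takes_alternative_exit(uint16_t is_syscall)
{
    return (is_syscall & 1) == 0;
}
```

With the values from this post, 33h (the user-mode CS selector, whose low bits are the RPL of 3) keeps the full exit with all the mitigations, while the 10h that we will see pushed by the Zw* prolog has bit 0 clear and takes the fast ret exit.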

My original guess was that the is_syscall variable was used as a sort of an internal debugging constant. That was until I discovered the KiServiceInternal function.

KiServiceInternal

To explain the existence of the alternative exit branch from the KiSystemServiceExit code block in the syscall handler one needs to look at the KiServiceInternal shim.

Furthermore, if we search the ntoskrnl.exe module for references to the KiServiceInternal symbol (using any static analysis tool, such as Ghidra for instance), we will come up with a myriad of hits. And if we look at some of them, they will all be almost identical.

The reason is that the KiServiceInternal code shim is used as a part of the Zw* kernel functions' prolog. (We'll review it next.)

For now though, if we look at the code in KiServiceInternal, most of the actions will already be familiar. They follow a very similar pattern to what we saw above at the beginning of the syscall handler.

KiServiceInternal
	sub     rsp, 8h
	push    rbp                                ; Save frame pointer register
	sub     rsp, 158h
	lea     rbp, [rsp+80h]

	mov     qword ptr [rbp+0C0h], rbx          ; Saves some nonvolatile registers
	mov     qword ptr [rbp+0C8h], rdi
	mov     qword ptr [rbp+0D0h], rsi
	sti                                        ; Enable maskable interrupts
	mov     rbx, qword ptr gs:[188h]           ; KPRCB::KTHREAD*

	prefetchw [rbx+90h]                        ; KTRAP_FRAME

	movzx   edi, byte ptr [rbx+232h]           ; KPRCB::KTHREAD::PreviousMode
	mov     byte ptr [rbp-58h], dil            ; alt_exit: PreviousMode
	mov     byte ptr [rbx+232h], 0x0           ; KPRCB::KTHREAD::PreviousMode = 0 (KernelMode)

	mov     r10, qword ptr [rbx+90h]           ; KTRAP_FRAME::P1Home
	mov     qword ptr [rbp+0B8h], r10          ; alt_exit: P1Home

	lea     r11, [KiSystemServiceStart]
	jmp     r11                                ; Jump to KiSystemServiceStart

The first thing it does is save the RBP register on the stack; it then reserves space for the local variables by subtracting 158h from the stack pointer, and sets RBP to point slightly higher.

All this matches exactly my kernel stack layout for the syscall handler.

It then saves the RBX, RDI and RSI nonvolatile registers and enables maskable interrupts. It also preloads the KTHREAD::TrapFrame into the CPU cache (exactly as we saw above.)

Then the code saves the value of the KPRCB::KTHREAD::PreviousMode on the stack in what we marked as "alt_exit: PreviousMode" offset and resets the PreviousMode to 0, or KernelMode.

Such behavior is exactly what Zw* functions do in the kernel.

Finally, it saves the value of KTRAP_FRAME::P1Home in our offset on the stack that we marked as "alt_exit: P1Home". (This is a part of the NormalContext for the kernel APC data.)

And then it jumps to the KiSystemServiceStart branch of the regular syscall handler.

But this still doesn't complete the picture, because the KiServiceInternal shim is never used alone. It needs to work in tandem with one of the Zw* function prologs.

Zw* Kernel Functions Prolog

Finally, to tie it all together and to understand the meaning and use of the KiServiceInternal and of the "alternative exit" code branches, as well as to figure out the use of the local variable that we labeled is_syscall, we need to look at one of the Zw* function prologs. Since we started torturing the NtWaitForSingleObject function, let's look at its kernel counterpart ZwWaitForSingleObject then.

ZwWaitForSingleObject
	mov     rax, rsp
	cli                                        ; Disable maskable interrupts
	sub     rsp, 10h
	push    rax                                ; Save RSP
	pushfq                                     ; Save RFLAGS
	push    10h                                ; is_syscall = 10h
	lea     rax, [KiServiceLinkage]
	push    rax                                ; Return address from syscall handler

	mov     eax, 4h                            ; "System service number"

	jmp     KiServiceInternal

The actions above are quite simple. The code first disables maskable interrupts (since it will be messing with the stack pointer) and follows the same pattern of filling out the stack as we saw in the beginning of the syscall handler.

Why? Because it will soon be merging with it at the KiSystemServiceStart branch in the KiServiceInternal shim above.

This is how the code above fills out the stack. Compare it to the syscall stack layout. I'll show just the first few slots for brevity:

Zw* stack layout
	1008 = Return address from Zw* function call
	1000 =
	FF8  = 
	FF0  = original RSP
	FE8  = original RFLAGS
	FE0  = 10h               Dummy CS segment & also 'is_syscall'
	FD8  = KiServiceLinkage  Return address from syscall handler
	FD0  = 
	FC8  = original RBP      (filled out in KiServiceInternal)
	FC0  = original RSI      (filled out in KiServiceInternal)
	FB8  = original RDI      (filled out in KiServiceInternal)
	FB0  = original RBX      (filled out in KiServiceInternal)
	...

As you can see, one of the main differences is that our local variable is_syscall is set to 10h with its bit 0 clear. This means that the main syscall handler logic will skip most of the security and vulnerability checks, as I explained above, and after the service function has executed, KiSystemServiceExit will take the alternative exit branch.

Then, if you recall the alternative exit branch, the code there will restore the thread's PreviousMode to the old value (that was saved by KiServiceInternal); and it will restore the NormalContext for the kernel APC via the "alt_exit: P1Home" stack variable.

The second difference that you can see in the code above is that the return address from the syscall handler is set to the KiServiceLinkage function. (It is marked with the FD8 offset in my kernel stack layout diagram.)

Then if we look at the KiServiceLinkage function itself, it can't be simpler than this:

KiServiceLinkage
	ret

Finally, the EAX register is hardcoded to the matching "system service number" (4 in the case of ZwWaitForSingleObject, as we saw earlier), and the control flow is diverted to the KiServiceInternal shim, which in turn brings it to the KiSystemServiceStart branch of the main syscall handler.

And this concludes the code path that is taken when a Zw* function (such as ZwWaitForSingleObject) is invoked from the kernel. In a sense, those Zw* functions are executed via a similar syscall handler mechanism.

To be honest, I'm not sure why Microsoft chose such a convoluted way to redirect their Zw* functions to the Nt* ones that contain the actual implementation. My guess is that it saves on the amount of code. But it definitely doesn't help with efficiency.
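The net effect of all this Zw* plumbing can be sketched in portable C. This is purely a model (the names are invented; the real mechanism is the assembly shim above): save PreviousMode, force it to KernelMode for the duration of the service call, and restore it on the alternative exit.

```c
#include <assert.h>

enum mode { KernelMode = 0, UserMode = 1 };

/* Models KTHREAD::PreviousMode for the current thread. */
static enum mode previous_mode = UserMode;

/* Stand-in for the Nt* service function: it just reports what
 * PreviousMode it observed during the call. */
static enum mode nt_service_stub(void)
{
    return previous_mode;
}

/* Models the Zw* prolog + KiServiceInternal + alternative exit:
 * save PreviousMode, force KernelMode, call the service, restore. */
static enum mode zw_service_stub(void)
{
    enum mode saved = previous_mode;   /* "alt_exit: PreviousMode" slot */
    previous_mode = KernelMode;        /* done in KiServiceInternal     */
    enum mode observed = nt_service_stub();
    previous_mode = saved;             /* done at lbl_alt_exit          */
    return observed;
}
```

In other words, the Nt* body always runs believing the caller is the kernel, while the thread's real PreviousMode is preserved across the call.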

Nt* Kernel Functions

Finally, for completeness, let's review how Nt* kernel functions are structured.

If you remember from the documentation, they are the ones that perform verification of the input parameters that may be passed in from the user-mode. But that is not entirely true. Let's look at the implementation of the NtWaitForSingleObject function to see what it does. It has a fairly simple logic:

NtWaitForSingleObject (kernel)
;	NTSYSAPI NTSTATUS NtWaitForSingleObject(
;		HANDLE         Handle,                 ; RCX
;		BOOLEAN        Alertable,              ; RDX
;		PLARGE_INTEGER Timeout                 ; R8
;	)

	mov     qword ptr [rsp+18h], r8            ; Save R8 in shadow stack frame
	sub     rsp, 38h

	movzx   r9d, dl                            ; R9 = RDX

	mov     qword ptr [rsp+58h], 0

	mov     rax, qword ptr gs:[188h]           ; RAX = KPRCB::KTHREAD*
	movzx   edx, byte ptr [rax+232h]           ; EDX = KPRCB::KTHREAD::PreviousMode

	mov     rax, qword ptr [rsp+50h]           ; RAX = saved R8 (3rd parameter)

	test    rax, rax
	je      lbl_ok                             ; Jump if 3rd parameter is NULL
	test    dl, dl
	je      lbl_ok                             ; Jump if PreviousMode == KernelMode

	; Check if the 3rd parameter is a user-mode address
	;
	mov     r8, ntkrnlmp!MmUserProbeAddress    ; R8 = 7fff`ffff0000h (or MM_USER_PROBE_ADDRESS)
	cmp     rax, r8
	jb      lbl_1                              ; Jump if RAX < 0x7FFFFFFF0000

	mov     rax, r8                            ; Otherwise set RAX = 0x7FFFFFFF0000

lbl_1:

	mov     rax, qword ptr [rax]
	mov     qword ptr [rsp+58h], rax
	lea     rax, [rsp+58h]
	mov     qword ptr [rsp+50h], rax

	jmp     lbl_ok                             ; WTF! Really weird quirk of the compiler?
	jmp     lbl_exit

lbl_ok:

	mov     qword ptr [rsp+20h], rax
	movzx   r8d, dl
	call    ntkrnlmp!ObWaitForSingleObject     ; This internal function does the actual work

lbl_exit:

	add     rsp, 38h
	ret

As you can see, the logic inside will bypass verification of input parameters if the PreviousMode is set to 0, or KernelMode. And thus, if an Nt* function is called with the PreviousMode set to 0, it will skip all the checks as well. And this is exactly what a matching Zw* function does anyway.

To be honest, calling an Nt* function with the PreviousMode set to KernelMode is way more efficient than calling a Zw* function. (Obviously, both from the kernel mode.)

I think I proved above why.
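The parameter-capture logic above can be modeled in a few lines of portable C. This is a sketch, not the kernel's code; sanitize_timeout_ptr is an invented name, and the MM_USER_PROBE_ADDRESS constant is the 7FFF'FFFF0000h value from the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define MM_USER_PROBE_ADDRESS 0x00007FFFFFFF0000ULL

enum mode { KernelMode = 0, UserMode = 1 };

/* Models the address sanitization in NtWaitForSingleObject: returns the
 * pointer value the kernel will actually dereference when capturing the
 * Timeout parameter. A user-supplied pointer at or above the probe
 * address is replaced with the probe address itself, which points to an
 * always-invalid page, so the subsequent capture faults safely instead
 * of reading kernel memory. Kernel-mode callers skip the check. */
static uint64_t sanitize_timeout_ptr(uint64_t timeout, enum mode previous_mode)
{
    if (timeout == 0)                     /* NULL: nothing to capture    */
        return 0;
    if (previous_mode == KernelMode)      /* trusted caller: no checks   */
        return timeout;
    if (timeout >= MM_USER_PROBE_ADDRESS) /* kernel-range address coming */
        return MM_USER_PROBE_ADDRESS;     /* from user mode: redirect    */
    return timeout;
}
```

Note how the KernelMode branch is exactly the bypass the post describes: with PreviousMode forced to 0 by the Zw* path, the pointer is used as-is.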

POC To Illustrate The Performance Impact

Finally, to illustrate the performance impact of calling a kernel function versus staying in user-mode, or the difference between using a kernel synchronization object versus a critical section, I made a small POC project. It will show the result by displaying the timing for each method:

CritSectionVsKernelObject test
POC project with the timing results.
This project is called "CritSectionVsKernelObject". You can download it from my GitHub.

The code in the POC is pretty straightforward, so I won't spend too much time on it.

It emulates a situation where read/write access to a global buffer needs to be synchronized across multiple threads, then runs for a large number of iterations and times how long it takes to finish. The result is presented on the screen.

We run two tests: first using the default Windows critical section, and then using our home-made custom lock built on a Win32 event object, which requires a trip to the kernel every time we want to wait on it or check its state.

I put the home-made custom implementation into a class that I named DontUse_MyCritSection, to dissuade anyone from actually using it.

So having run this POC app, I'm sure no one will have any doubts about which implementation of a synchronization lock is more efficient.
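If you cannot run the Windows POC, the gist of why the critical section wins can still be illustrated with portable C11 atomics. The toy_lock below is my own sketch, not the POC's code: it acquires the lock by spinning briefly in user mode, and only the slow path (modeled here with sched_yield, and in the real POC by a wait on a kernel event) would pay the syscall cost dissected in this post.

```c
#include <assert.h>
#include <stdatomic.h>
#include <sched.h>   /* sched_yield(): a stand-in for the kernel wait */

/* A toy spin-then-yield lock. A Windows critical section follows the
 * same idea: spin a bounded number of times entirely in user mode, and
 * only fall back to a kernel wait when the spin budget is exhausted. */
struct toy_lock {
    atomic_flag locked;
    unsigned spin_count;
};

static void toy_lock_init(struct toy_lock *l, unsigned spin_count)
{
    atomic_flag_clear(&l->locked);
    l->spin_count = spin_count;
}

static void toy_lock_acquire(struct toy_lock *l)
{
    for (;;) {
        /* Fast path: no syscall at all when uncontended. */
        for (unsigned i = 0; i <= l->spin_count; i++) {
            if (!atomic_flag_test_and_set_explicit(&l->locked,
                                                   memory_order_acquire))
                return;
        }
        /* Slow path: the real thing would block on a kernel object here
         * and eat the full syscall round trip from this post. */
        sched_yield();
    }
}

static void toy_lock_release(struct toy_lock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The fast path is a single atomic instruction; the slow path is a kernel transition. The POC's timings show how large the gap between those two paths really is.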

Conclusion

I hope that by showing you this very long sequence of actions that happens during a syscall I was able to convince you that the cost of entering the kernel on Windows is very high.

I used the example of the NtWaitForSingleObject function and walked all the way from the syscall instruction in user mode, to the actual eponymous kernel function, and back out to user mode. That code sequence was enormous.

In conclusion, I want to admit that this turned into a beefy blog post that took a long time to complete. But I hope it will let you put the effort I've spent on this research to wise use in your own software development.

Related Articles