Reverse Engineering - Stepping Into a System Call - How to step into a SYSCALL with a debugger using kernel binary patch.

This article contains functions and features that are not documented by the original manufacturer. By following advice in this article, you're doing so at your own risk. The methods presented in this article may rely on internal implementation and may not work in the future.

Intro

Have you ever wondered if it's possible to step into the Intel x86-64 syscall instruction with a Windows debugger?

The short answer is no. The long answer is yes, but with some clever logic. Let's review the latter in this blog post.

This blog post will be mostly covered in a video screencast, so click here if you want to watch that instead.

Preparation

I decided to write this blog post mostly to show how I researched what happens inside a system call for my other blog post. In there I was using the NtWaitForSingleObject user-mode native function. Thus, let's continue with it for this example as well.

NtWaitForSingleObject is the native function in ntdll.dll that the documented WaitForSingleObject Win32 API ends up calling.

The original C++ code that we can work with could be something as simple as this:

C++

HANDLE h = CreateEvent(NULL, FALSE, FALSE, NULL);

WaitForSingleObject(h, INFINITE);

CloseHandle(h);

The WaitForSingleObject call above will invoke the native NtWaitForSingleObject function, which in turn will perform the following (in the Assembly language):

NtWaitForSingleObject

	mov     r10, rcx
	mov     eax, 4         ; "System service number" of the system call

	test    byte ptr [7FFE0308h], 1   ; Flag in the shared user data page
	jnz     lbl_alt

	syscall                ; Initiate a trip to the kernel
	retn

lbl_alt:

	int     2Eh            ; Alternative (slower) route to the kernel
	retn

The syscall instruction above is what we will be attempting to step into. If you try it with your user-mode debugger of choice, it will not be able to do it. It will step over the syscall as if it were a single instruction.
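To make the layout of that stub more concrete, here is a minimal sketch that pulls the system service number out of its first bytes. It assumes the stub starts with exactly the two instructions shown above (`mov r10, rcx` encoded as `4C 8B D1`, then `mov eax, imm32` encoded as `B8` plus four bytes). The helper name `readServiceNumber` is hypothetical and not part of any Windows API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: extracts the system service number from the first
// bytes of a syscall stub, assuming it begins with:
//   4C 8B D1        mov r10, rcx
//   B8 imm32        mov eax, <system service number>
// Returns false if the bytes do not match that pattern (for example, when
// another component has hooked the function).
bool readServiceNumber(const std::uint8_t* stub, std::size_t size,
                       std::uint32_t& serviceNumber)
{
    assert(stub != nullptr);

    if (size < 8)
        return false;

    // mov r10, rcx
    if (stub[0] != 0x4C || stub[1] != 0x8B || stub[2] != 0xD1)
        return false;

    // mov eax, imm32
    if (stub[3] != 0xB8)
        return false;

    // The immediate is stored in little-endian byte order
    serviceNumber = static_cast<std::uint32_t>(stub[4])
                  | (static_cast<std::uint32_t>(stub[5]) << 8)
                  | (static_cast<std::uint32_t>(stub[6]) << 16)
                  | (static_cast<std::uint32_t>(stub[7]) << 24);
    return true;
}
```

On a live system you could feed it the address returned by GetProcAddress for "NtWaitForSingleObject" in ntdll.dll. Treat the result as a hint only, since the stub layout is an internal detail that may change.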

Tools

To be able to do our research we will need the following tools:

A virtual machine (VM) to run the tests in. I'm using VMware Workstation for that, but you can use VirtualBox as well.
A user-mode debugger to run in the VM, and Visual Studio C++ to build the code sample. Follow this blog post for instructions on how to set it up. (Like I said, we will only need a user-mode debugger and the VS C++ compiler.)
WinDbg Next, as a kernel debugger on our host machine. Follow this blog post for details on how to set it up to work with a VM.

We need a VM because we will be doing some kernel-mode debugging, and for that you need a separate OS. Even though you can use another computer, it is much simpler (and faster) to use a virtual machine.

Additionally, a big benefit of using a VM is that you can set up a new version of WinDbg to connect to it over a fast network connection instead of the antiquated COM port connection. If you've ever used COM port debugging, you will know what I mean.

Kernel Side

Once we attach the kernel debugger to our guest OS, we can get to the location in the kernel where the syscall instruction takes us. For that we will be using the IA32_LSTAR MSR (at address C0000082h).

WinDbg has the command to retrieve it:

rdmsr c0000082

This command will give us an address. If we copy-and-paste it into the disassembly window in WinDbg, we will get to the point in the kernel where execution continues after the syscall instruction in user mode:

KiSystemCall64Shadow

; nt!KiSystemCall64Shadow:

fffff805`08a13180 0f01f8               swapgs  
fffff805`08a13183 654889242510900000   mov     qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000   mov     rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt      dword ptr gs:[9018h], 1
fffff805`08a1319f 7203                 jb      ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff80508a131a4)
fffff805`08a131a1 0f22dc               mov     cr3, rsp
fffff805`08a131a4 65488b242508900000   mov     rsp, qword ptr gs:[9008h]
fffff805`08a131ad 6a2b                 push    2Bh
fffff805`08a131af 65ff342510900000     push    qword ptr gs:[9010h]
fffff805`08a131b7 4153                 push    r11
fffff805`08a131b9 6a33                 push    33h
fffff805`08a131bb 51                   push    rcx
fffff805`08a131bc 498bca               mov     rcx, r10
fffff805`08a131bf 4883ec08             sub     rsp, 8
fffff805`08a131c3 55                   push    rbp
.....
.....

So this is technically where we will need to place our kernel breakpoint, but there are some gotchas...

Kernel Breakpoint - Issue 1

We cannot place a kernel mode breakpoint on any of the following instructions:

KiSystemCall64Shadow (no breakpoints)

; nt!KiSystemCall64Shadow:

fffff805`08a13180 0f01f8               swapgs  
fffff805`08a13183 654889242510900000   mov     qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000   mov     rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt      dword ptr gs:[9018h], 1
fffff805`08a1319f 7203                 jb      ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff80508a131a4)
fffff805`08a131a1 0f22dc               mov     cr3, rsp
fffff805`08a131a4 65488b242508900000   mov     rsp, qword ptr gs:[9008h]

Why?

Well, technically we can place a breakpoint there, but once we let the guest OS run, it will either crash or hang.

This happens because at that critical entry point the GS segment register and the kernel stack in the RSP register are not set up yet. And since the kernel debugging engine uses them for its own purposes, breaking at any of those locations will cause a BSOD.

The GS register is used in the 64-bit Windows kernel to store a pointer to the internal KPCR struct. Kernel code cannot run without it.
Kernel code will also crash if the RSP register does not point to a kernel stack. A CPU feature called Supervisor-Mode Access Prevention (SMAP) raises an exception when kernel code tries to read or write user-mode memory, which is what happens when the RSP register still points to a user-mode stack right after the syscall instruction.

We can easily resolve this limitation by placing a breakpoint right after that code block. Say, on any of the following instructions:

KiSystemCall64Shadow (allowed breakpoints)

fffff805`08a131ad 6a2b                 push    2Bh
fffff805`08a131af 65ff342510900000     push    qword ptr gs:[9010h]
fffff805`08a131b7 4153                 push    r11
fffff805`08a131b9 6a33                 push    33h
fffff805`08a131bb 51                   push    rcx
fffff805`08a131bc 498bca               mov     rcx, r10
fffff805`08a131bf 4883ec08             sub     rsp, 8
fffff805`08a131c3 55                   push    rbp
.....
.....

The first push instruction would be a good place for it.

Kernel Breakpoint - Issue 2

The second, more challenging issue is that the KiSystemCall64Shadow service routine serves as the entry point into the kernel for the system calls made by all the threads in all the processes that run in the guest OS. And that is a lot of calls!

If we could hypothetically draw a heat map of the system RAM that denotes the most visited (or executed) locations with a brighter color, the address occupied by the KiSystemCall64Shadow function would glow red hot like a sun.

So how do we isolate the one syscall that we need from all the others?

One would suggest using a conditional breakpoint, which is a good idea in the general case, but in our very busy part of the system a conditional breakpoint will grind the guest OS to a halt.

This happens because every hit of a conditional breakpoint stops the guest OS and hands control to WinDbg on the host, which then evaluates the condition in its own expression engine. That makes it several orders of magnitude slower than the normal code flow in the syscall service routine.

Thus we need to come up with a different way to place a conditional breakpoint. My favorite one is a kernel binary patch in memory. But first we need to prepare our user-mode code for it.

Preparing User-Mode Code

If you look at the declaration of the NtWaitForSingleObject function:

C

NTSTATUS NtWaitForSingleObject(
	[in] HANDLE         Handle,
	[in] BOOLEAN        Alertable,
	[in] PLARGE_INTEGER Timeout
	);

We can use its 3rd parameter, a pointer to a LARGE_INTEGER struct, to pass our specially crafted pointer, say 0x11224455, that will be quite rare in other (general) cases.

We can construct such a pointer by using the VirtualAllocEx function, which allows us to request a specific virtual address in its 2nd parameter. Thus, we can rewrite our initial test code into something like this:

C++

//NtWaitForSingleObject is declared in <winternl.h> and exported by ntdll.dll
HANDLE h = CreateEvent(NULL, FALSE, FALSE, NULL);

LPVOID pBase = VirtualAllocEx(GetCurrentProcess(),
                              (LPVOID)0x11224455,         //Request our special address
                              0x10000,                    //Ask for 64 KB of memory
                              MEM_COMMIT | MEM_RESERVE,   //Make it ready to use
                              PAGE_READWRITE);            //Need it for reading & writing
if(pBase)
{
	//Because VirtualAllocEx will return an address rounded down to the allocation
	//granularity (64 KB), we need to adjust it to our desired value: 0x11224455
	LPVOID pAddr = (LPVOID)((size_t)pBase | 0x4455);

	//Execute the syscall
	NtWaitForSingleObject(h, FALSE, (PLARGE_INTEGER)pAddr);

	//MEM_RELEASE requires the base address that VirtualAllocEx returned
	VirtualFreeEx(GetCurrentProcess(), pBase, 0, MEM_RELEASE);
}
else
{
	//Oops, can't run our test. Restart the OS and try again ...
	wprintf(L"ERROR: %u - VirtualAllocEx 2\n", GetLastError());
}

CloseHandle(h);

There's no guarantee that the system memory manager will oblige and return the address that we want, so don't use this in your production code. But for our purpose it will suffice.

The code above makes little sense for production purposes. We constructed it only to pass our special value 0x11224455 into the kernel.
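The address arithmetic behind that trick can be sketched in isolation. When VirtualAllocEx reserves a new region, it rounds the requested address down to the allocation granularity, so the low bits have to be put back afterwards. The sketch below assumes the usual 64 KB granularity (the real value is reported by GetSystemInfo in dwAllocationGranularity), and the names `roundDown` and `markerFromBase` are hypothetical helpers for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64 KB allocation granularity, the value normally reported in
// SYSTEM_INFO::dwAllocationGranularity on Windows.
constexpr std::uint64_t kAllocationGranularity = 0x10000;

// Rounds an address down to a multiple of `granularity` (a power of two).
// This models what happens to the requested address when a region is reserved.
std::uint64_t roundDown(std::uint64_t address, std::uint64_t granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    return address & ~(granularity - 1);
}

// Rebuilds the wanted marker pointer from the returned base address by
// restoring the low bits that were lost to rounding.
std::uint64_t markerFromBase(std::uint64_t base, std::uint64_t wanted,
                             std::uint64_t granularity)
{
    assert(granularity != 0 && (granularity & (granularity - 1)) == 0);
    return base | (wanted & (granularity - 1));
}
```

For the value used in this post, rounding 0x11224455 down gives a base of 0x11220000, and restoring the low 16 bits (0x4455) produces 0x11224455 again. That pointer still falls inside the committed 64 KB region.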

Making a Kernel Trap

So the only thing left to do is to write a kernel trap that catches the moment when the RAX register is set to 4 and the R8 register is set to 0x11224455 inside the kernel service routine for the syscall.

We check RAX for equality to 4 because that register conveys the "system service number", which, if you remember, was 4 in our case for the NtWaitForSingleObject function call.
And R8 is how the 3rd input parameter is passed into the NtWaitForSingleObject function according to the x64 calling convention for Windows.

The trap itself can be written as such:

x86-64

	cmp     rax, 4h
	jne     lb_continue
	cmp     r8, 11224455h
	jne     lb_continue

	nop                            ; Place breakpoint here

lb_continue:

In there we check for the condition that we outlined above, and if it is met, we provide a nop instruction to place our kernel breakpoint on. This will be analogous to a conditional breakpoint but with much less overhead.

Then we will also need to jump into our trap from the start of the KiSystemCall64Shadow service routine by replacing one of the original instructions:

KiSystemCall64Shadow

fffff805`08a131ad 6a2b                 push    2Bh
fffff805`08a131af 65ff342510900000     push    qword ptr gs:[9010h]

A near jmp instruction with a 32-bit relative offset is 5 bytes long, while the first push 2Bh instruction is only 2 bytes, so we can't use it for that. Let's use the 8-byte push qword ptr gs:[9010h] instruction instead. After the patch, our original KiSystemCall64Shadow service routine becomes this:

KiSystemCall64Shadow (patched)

; nt!KiSystemCall64Shadow:

fffff805`08a13180 0f01f8               swapgs  
fffff805`08a13183 654889242510900000   mov     qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000   mov     rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt      dword ptr gs:[9018h], 1
fffff805`08a1319f 7203                 jb      ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff805`08a131a4)
fffff805`08a131a1 0f22dc               mov     cr3, rsp
fffff805`08a131a4 65488b242508900000   mov     rsp, qword ptr gs:[9008h]
fffff805`08a131ad 6a2b                 push    2Bh

; Our patch: Jump to our trap
fffff805`08a131af e92c020000           jmp     ntkrnlmp!KiSystemCall64Shadow+0x260 (fffff805`08a133e0)
fffff805`08a131b4 90                   nop     
fffff805`08a131b5 90                   nop     
fffff805`08a131b6 90                   nop     

fffff805`08a131b7 4153                 push    r11
fffff805`08a131b9 6a33                 push    33h
fffff805`08a131bb 51                   push    rcx
fffff805`08a131bc 498bca               mov     rcx, r10
fffff805`08a131bf 4883ec08             sub     rsp, 8
fffff805`08a131c3 55                   push    rbp
.....
.....

We can use WinDbg to make the memory patch that I showed above by using the eb command, as such:

eb fffff805`08a131af e9 2C 02 00 00 90 90 90

Note that we can calculate the offset for the jmp instruction using the address of where we will place our trap, shown below.

Also note that we've padded the jmp instruction with 3 nops to match the size of the original push qword ptr gs:[9010h] instruction.
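The offset calculation for the jmp instruction is easy to get wrong by hand, so here is a minimal sketch of it. The displacement of a near jmp (opcode E9) is counted from the address of the next instruction, which is the address of the jmp plus 5. The function name `encodeJmpRel32` is a hypothetical helper, not something that WinDbg provides.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Encodes a 5-byte near jump (E9 + rel32) located at `from` that lands on `to`.
// The displacement is relative to the address of the NEXT instruction (from + 5).
std::array<std::uint8_t, 5> encodeJmpRel32(std::uint64_t from, std::uint64_t to)
{
    const std::int64_t delta = static_cast<std::int64_t>(to - (from + 5));

    // A rel32 jump can only reach targets within +/- 2 GB
    assert(delta >= INT32_MIN && delta <= INT32_MAX);

    const std::uint32_t rel = static_cast<std::uint32_t>(delta);

    // The displacement is stored in little-endian byte order
    std::array<std::uint8_t, 5> bytes = {{
        0xE9,
        static_cast<std::uint8_t>(rel),
        static_cast<std::uint8_t>(rel >> 8),
        static_cast<std::uint8_t>(rel >> 16),
        static_cast<std::uint8_t>(rel >> 24)
    }};
    return bytes;
}
```

With the addresses from this post, a jump placed at fffff805`08a131af that targets fffff805`08a133e0 encodes as e9 2c 02 00 00, which matches the bytes in the eb command. The three 90 bytes that follow are the nop padding.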

Then we can put our trap in the free space somewhere at the end of the KiSystemCall64Shadow function:

Our trap

fffff805`08a133e0 4883f804           cmp     rax, 4
fffff805`08a133e4 750a               jne     ntkrnlmp!KiSystemCall64Shadow+0x270 (fffff805`08a133f0)
fffff805`08a133e6 4981f855442211     cmp     r8, 11224455h
fffff805`08a133ed 7501               jne     ntkrnlmp!KiSystemCall64Shadow+0x270 (fffff805`08a133f0)
fffff805`08a133ef 90                 nop                          ; Place breakpoint here
fffff805`08a133f0 65ff342510900000   push    qword ptr gs:[9010h]
fffff805`08a133f8 e9bafdffff         jmp     ntkrnlmp!KiSystemCall64Shadow+0x37 (fffff805`08a131b7)

We can use the following WinDbg command to write the machine code for our trap:

eb fffff805`08a133e0 48 83 f8 04 75 0a 49 81 f8 55 44 22 11 75 01 90 65 ff 34 25 10 90 00 00 e9 BA FD FF FF

As for the machine code itself, you can use any assembler to generate it. I personally use this one.
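If you prefer not to depend on an assembler, the trap is small enough to put together by hand. The following is a sketch of that under stated assumptions: the service number fits into a signed 8-bit immediate, the marker fits into a positive signed 32-bit immediate (cmp sign-extends it), and the displaced instruction is exactly push qword ptr gs:[9010h]. The names `buildTrap` and `appendLE32` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends a 32-bit value in little-endian byte order.
static void appendLE32(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
}

// Builds the machine code of the trap that will live at `trapAddr` and
// return to `resumeAddr` (the instruction after the patched jmp + nops).
std::vector<std::uint8_t> buildTrap(std::uint64_t trapAddr, std::uint64_t resumeAddr,
                                    std::uint8_t serviceNumber, std::uint32_t marker)
{
    // Both immediates are sign-extended by cmp, so keep them positive
    assert(serviceNumber <= 0x7F);
    assert(marker <= 0x7FFFFFFF);

    std::vector<std::uint8_t> code;

    // cmp rax, imm8
    code.insert(code.end(), {0x48, 0x83, 0xF8, serviceNumber});
    // jne +0Ah: skips cmp r8 (7 bytes), the second jne (2 bytes) and nop (1 byte)
    code.insert(code.end(), {0x75, 0x0A});
    // cmp r8, imm32
    code.insert(code.end(), {0x49, 0x81, 0xF8});
    appendLE32(code, marker);
    // jne +1: skips the nop
    code.insert(code.end(), {0x75, 0x01});
    // nop: the kernel breakpoint goes on this byte
    code.push_back(0x90);
    // push qword ptr gs:[9010h]: the instruction displaced by the patch
    code.insert(code.end(), {0x65, 0xFF, 0x34, 0x25, 0x10, 0x90, 0x00, 0x00});

    // jmp rel32 back into the original routine
    const std::uint64_t jmpAddr = trapAddr + code.size();
    code.push_back(0xE9);
    appendLE32(code, static_cast<std::uint32_t>(resumeAddr - (jmpAddr + 5)));

    return code;
}
```

Calling it with the trap address fffff805`08a133e0, the resume address fffff805`08a131b7, service number 4 and marker 0x11224455 yields the same 29 bytes as in the eb command above. The addresses are specific to one boot of one kernel build, so recompute them for your own session.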

You can manually locate the "free space" at the end of the KiSystemCall64Shadow function using the kernel debugger. Simply look for the padding 00's or CC's at the end of the function body. This padding is usually placed there to align the start of the code that follows.
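The same search can be expressed in code if you dump the function bytes into a buffer first. This is a minimal sketch, assuming the padding is a run of identical 00 or CC bytes. The name `findPadding` is hypothetical, and a match should still be verified in the disassembly, because a run of zeros can also be part of a real instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns the offset of the first run of at least `needed` identical padding
// bytes (all 0x00 or all 0xCC) in `code`, or `size` if there is no such run.
std::size_t findPadding(const std::uint8_t* code, std::size_t size, std::size_t needed)
{
    assert(code != nullptr && needed > 0);

    std::size_t runStart = 0;
    std::size_t runLength = 0;

    for (std::size_t i = 0; i < size; ++i) {
        const bool isPad = (code[i] == 0x00 || code[i] == 0xCC);

        if (isPad && runLength > 0 && code[i] == code[runStart]) {
            ++runLength;            // the current run continues
        } else if (isPad) {
            runStart = i;           // a new run starts here
            runLength = 1;
        } else {
            runLength = 0;          // a regular code byte ends the run
        }

        if (runLength >= needed)
            return runStart;
    }
    return size;
}
```

For the trap in this post, `needed` would be 29 bytes, the length of the machine code written with the eb command.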

Finally, place the breakpoint at the fffff805`08a133ef address (in case of our patch) with the bp command in WinDbg:

bp fffff805`08a133ef

Optionally, you can place a hardware execution breakpoint on that instruction using the ba command:

ba e 1 fffff805`08a133ef

After that let the guest OS run and step into the syscall from the NtWaitForSingleObject user-mode function. This should trigger the breakpoint in the kernel.

Then continue stepping through the kernel code to do your further research.

I recommend disabling the kernel breakpoint in the syscall service handler after it is triggered to prevent repeated invocations. You can do it with the bd command in WinDbg (the bl command lists the breakpoints and their IDs).

Screencast

To recap everything that I've shown above, please watch the following screencast:


Conclusion

There are probably many ways to step into a kernel syscall with a debugger. The one that I showed above worked for me.

In case you know a better way, or want to share yours, please leave a comment below.
