This article contains functions and features that are not documented by the original manufacturer. If you follow the advice in this article, you do so at your own risk. The methods presented here may rely on internal implementation details and may stop working in the future.
Intro
Have you ever wondered if it's possible to step into the Intel x86-64 syscall instruction with a Windows debugger?
The short answer is no. The long answer is yes, but with some clever logic. Let's review the latter in this blog post.
Most of this blog post is also covered in a video screencast, so click here if you want to watch that instead.
Preparation
I decided to write this blog post mostly to show how I researched what happens inside a system call for my other blog post. There I was using the NtWaitForSingleObject user-mode native function, so let's continue with it for this example as well.
NtWaitForSingleObject is the lower-level native call behind the documented WaitForSingleObject Win32 API.
The original C++ code that we can work with could be something as simple as this:
HANDLE h = CreateEvent(NULL, FALSE, FALSE, NULL);
WaitForSingleObject(h, INFINITE);
CloseHandle(h);
The WaitForSingleObject call above will invoke the native NtWaitForSingleObject function, which in turn will perform the following (in assembly language):
mov r10, rcx
mov eax, 4 ; "System service number" of the system call
test byte ptr [7FFE0308h], 1 ; Flag in the shared user data page
jnz lbl_alt
syscall ; Initiate a trip to the kernel
retn
lbl_alt:
int 2Eh ; Alternative (slower) route to the kernel
retn
In this case, the syscall instruction above is what we will be attempting to step into. If you try it with your user-mode debugger of choice, it will not be able to do it: it will step over the syscall as if it were a single atomic instruction.
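To make the stub layout more concrete, here is a minimal C++ sketch that pulls the "system service number" out of the first bytes of such a stub. It assumes the usual encodings of the first two instructions (4C 8B D1 for mov r10, rcx and B8 imm32 for mov eax, imm32). The ReadServiceNumber name is hypothetical and only serves this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: if `code` begins with
//   4C 8B D1        mov r10, rcx
//   B8 xx xx xx xx  mov eax, imm32
// store imm32 (the system service number) in `number` and return true.
inline bool ReadServiceNumber(const std::uint8_t* code, std::size_t size,
                              std::uint32_t& number)
{
    assert(code != nullptr);
    if (size < 8)
        return false;
    // mov r10, rcx
    if (code[0] != 0x4C || code[1] != 0x8B || code[2] != 0xD1)
        return false;
    // mov eax, imm32
    if (code[3] != 0xB8)
        return false;
    // The immediate is stored in little-endian byte order.
    number = static_cast<std::uint32_t>(code[4]) |
             (static_cast<std::uint32_t>(code[5]) << 8) |
             (static_cast<std::uint32_t>(code[6]) << 16) |
             (static_cast<std::uint32_t>(code[7]) << 24);
    return true;
}
```

Feeding it the bytes 4C 8B D1 B8 04 00 00 00 yields 4, the number used by the stub shown above. Service numbers are not guaranteed to stay the same between Windows builds, so treat the value as specific to the system you are debugging.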
Tools
To be able to do our research we will need the following tools:
- Virtual machine (VM) to run the tests in. I'm using VMware Workstation for that, but you can run it in VirtualBox as well.
- User-mode debugger to run in the VM & the Visual Studio C++ to build the code sample. Follow this blog post for instructions on how to set it up. (Like I said, we will only need a user-mode debugger and the VS C++ compiler.)
- WinDbg Next, as a kernel debugger on our host machine. Follow this blog post for details on how to set it up to work with a VM.
We need a VM because we will be doing some kernel-mode debugging, and for that you need a separate OS instance. Even though you can use another computer, it is much simpler (and faster) to use a virtual machine.
Additionally, a big benefit of using a VM is that you can set up a new version of WinDbg to connect to it over a fast network connection instead of the antiquated COM port connection. If you've ever used COM port debugging, you will know what I mean.
Kernel Side
Once we attach the kernel debugger to our guest OS, we can get to the location in the kernel where the syscall instruction takes us. For that we will use the IA32_LSTAR MSR (at address C0000082h).
WinDbg has a command to retrieve it:
rdmsr c0000082
This command will give us an address. If we copy-and-paste it into the disassembly window in WinDbg, we will get to the point in the kernel where execution continues after the syscall instruction in user mode:
; nt!KiSystemCall64Shadow:
fffff805`08a13180 0f01f8 swapgs
fffff805`08a13183 654889242510900000 mov qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000 mov rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt dword ptr gs:[9018h], 1
fffff805`08a1319f 7203 jb ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff80508a131a4)
fffff805`08a131a1 0f22dc mov cr3, rsp
fffff805`08a131a4 65488b242508900000 mov rsp, qword ptr gs:[9008h]
fffff805`08a131ad 6a2b push 2Bh
fffff805`08a131af 65ff342510900000 push qword ptr gs:[9010h]
fffff805`08a131b7 4153 push r11
fffff805`08a131b9 6a33 push 33h
fffff805`08a131bb 51 push rcx
fffff805`08a131bc 498bca mov rcx, r10
fffff805`08a131bf 4883ec08 sub rsp, 8
fffff805`08a131c3 55 push rbp
.....
.....
So this is technically where we will need to place our kernel breakpoint, but there are some gotchas...
Kernel Breakpoint - Issue 1
We cannot place a kernel mode breakpoint on any of the following instructions:
; nt!KiSystemCall64Shadow:
fffff805`08a13180 0f01f8 swapgs
fffff805`08a13183 654889242510900000 mov qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000 mov rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt dword ptr gs:[9018h], 1
fffff805`08a1319f 7203 jb ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff80508a131a4)
fffff805`08a131a1 0f22dc mov cr3, rsp
fffff805`08a131a4 65488b242508900000 mov rsp, qword ptr gs:[9008h]
Why?
Well, technically we can place a breakpoint there, but if we then let the guest OS run, it will either crash or hang.
This happens because at that critical entry point the GS segment register and the kernel stack in the RSP register are not set up yet. Since the kernel debugging engine uses them for its own purposes, breaking at any of those locations will cause a BSOD.
The GS register is used in the 64-bit Windows kernel to store a pointer to the internal KPCR struct. Kernel code cannot run without it.
Kernel code will also crash without an RSP register that points to a kernel stack, because of the CPU feature called "Supervisor-Mode Access Prevention", or SMAP. It raises an exception if kernel code tries to read or write user-mode memory, as would happen while the RSP register still points to a user-mode stack right after the syscall instruction.
We can easily resolve this limitation by placing a breakpoint right after that code block. Say, on any of the following instructions:
fffff805`08a131ad 6a2b push 2Bh
fffff805`08a131af 65ff342510900000 push qword ptr gs:[9010h]
fffff805`08a131b7 4153 push r11
fffff805`08a131b9 6a33 push 33h
fffff805`08a131bb 51 push rcx
fffff805`08a131bc 498bca mov rcx, r10
fffff805`08a131bf 4883ec08 sub rsp, 8
fffff805`08a131c3 55 push rbp
.....
.....
The first push instruction would be a good place for it.
Kernel Breakpoint - Issue 2
The second, more challenging issue is that the KiSystemCall64Shadow service routine serves as the entry point into the kernel for the system calls made by all the threads in all the processes that run in the guest OS. And that is a lot of calls!
If we could hypothetically draw a heat map of the system RAM to denote the most visited (or executed) locations with a brighter color, the address occupied by the KiSystemCall64Shadow function would glow red hot like the sun.
So how do you isolate the one syscall that we need from all the others?
One would suggest using a conditional breakpoint, which would be a good idea in the general case, but in our very busy part of the system a conditional breakpoint will grind the guest OS to a halt.
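This happens because every hit on a conditional breakpoint stops the guest and hands control to WinDbg on the host, which has to evaluate the condition with its expression or script engine before resuming the guest. That makes it several orders of magnitude slower than the normal code flow in the syscall service routine.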
Thus we need to come up with a different way to place a conditional breakpoint. My favorite one is a binary patch of the kernel in memory. But first we need to prepare our user-mode code for it.
Preparing User-Mode Code
Let's look at the declaration of the NtWaitForSingleObject function:
NTSTATUS NtWaitForSingleObject(
[in] HANDLE Handle,
[in] BOOLEAN Alertable,
[in] PLARGE_INTEGER Timeout
);
We can use its 3rd parameter, which is a pointer to a LARGE_INTEGER struct (PLARGE_INTEGER), to pass our specially crafted pointer, say 0x11224455, which will be quite rare in other (general) cases.
We can construct such a pointer by using the VirtualAllocEx function, which allows us to request a specific virtual address in its 2nd parameter. Thus, we can rewrite our initial test code into something like this:
HANDLE h = CreateEvent(NULL, FALSE, FALSE, NULL);
//Reserve & commit memory that covers our special address
LPVOID pBase = VirtualAllocEx(GetCurrentProcess(),
(LPVOID)0x11224455, //Request our special address
0x10000, //Ask for 64 KB of memory
MEM_COMMIT | MEM_RESERVE, //Make it ready to use
PAGE_READWRITE); //Need it for reading & writing
if(pBase)
{
//Because VirtualAllocEx returns an address rounded down to the allocation
//granularity (64 KB), we need to adjust it to our desired value: 0x11224455
LPVOID pAddr = (LPVOID)((size_t)pBase | 0x4455);
//Execute the syscall
NtWaitForSingleObject(h, FALSE, (PLARGE_INTEGER)pAddr);
//MEM_RELEASE requires the base address that VirtualAllocEx returned
VirtualFreeEx(GetCurrentProcess(), pBase, 0, MEM_RELEASE);
}
else
{
//Oops, can't run our test. Restart the OS and try again ...
wprintf(L"ERROR: %u - VirtualAllocEx 2\n", GetLastError());
}
CloseHandle(h);
There's no guarantee that the system memory manager will oblige and return the address that we want, so don't use this in your production code. But for our purpose it will suffice.
The code above does not make much sense for production purposes. We constructed it only to pass our special value 0x11224455 into the kernel.
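The address arithmetic behind that trick can be checked without touching the Windows API. The sketch below assumes a 64 KB allocation granularity (the value GetSystemInfo usually reports on Windows; real code should query it). The function names are hypothetical and not part of any API.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64 KB allocation granularity. Query it with GetSystemInfo()
// in real code instead of hard-coding it.
constexpr std::uint64_t kAllocationGranularity = 0x10000;

// Base address that a reservation requested at `requested` is rounded down to.
constexpr std::uint64_t RoundDownToGranularity(std::uint64_t requested)
{
    return requested & ~(kAllocationGranularity - 1);
}

// Puts the low bits of the requested address back onto the returned base.
constexpr std::uint64_t RestoreRequestedAddress(std::uint64_t base,
                                                std::uint64_t requested)
{
    return base | (requested & (kAllocationGranularity - 1));
}

// True if `address` lies inside a region of `size` bytes starting at `base`.
constexpr bool IsInsideRegion(std::uint64_t address, std::uint64_t base,
                              std::uint64_t size)
{
    return address >= base && address - base < size;
}

// Compile-time checks with the values used in this post.
static_assert(RoundDownToGranularity(0x11224455) == 0x11220000,
              "request is rounded down to a 64 KB boundary");
static_assert(RestoreRequestedAddress(0x11220000, 0x11224455) == 0x11224455,
              "OR-ing the low 16 bits restores the special value");
static_assert(IsInsideRegion(0x11224455, 0x11220000, 0x10000),
              "the special address is backed by the allocated region");
```

The last check is the important one: the crafted pointer still points into committed memory, so the kernel can safely read the timeout value from it.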
Making a Kernel Trap
So the only thing left to do is to write a kernel trap to catch the moment when the RAX register is set to 4 and the R8 register is set to 0x11224455 inside the kernel service routine for the syscall.
We check RAX for equality to 4 because that register conveys the "system service number", which, if you remember, was 4 in our case for the NtWaitForSingleObject function call.
And R8 is how the 3rd input parameter is passed into the NtWaitForSingleObject function according to the x64 calling convention for Windows.
The trap itself can be written as such:
cmp rax, 4h
jne lb_continue
cmp r8, 11224455h
jne lb_continue
nop ; Place breakpoint here
lb_continue:
In there we check for the condition that we outlined above, and if it is met, we fall through to a nop instruction to place our kernel breakpoint on. This is analogous to a conditional breakpoint but with much less overhead.
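For readers who prefer C++ to assembly, the trap's logic can be modeled roughly as follows. This is only an illustration of the two comparisons. The SyscallRegs struct and the function names are hypothetical. One detail worth modeling is that the imm32 operand of a 64-bit cmp is sign-extended, which is harmless for 0x11224455 but would matter for a value with bit 31 set.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the two registers that the trap inspects.
struct SyscallRegs {
    std::uint64_t rax;  // system service number
    std::uint64_t r8;   // 3rd argument of the system call
};

// The imm32 operand of "cmp r/m64, imm32" is sign-extended to 64 bits.
inline std::uint64_t SignExtendImm32(std::uint32_t imm)
{
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(static_cast<std::int32_t>(imm)));
}

// Mirrors the two cmp/jne pairs of the trap.
// Returns true when execution would reach the nop with the breakpoint.
inline bool TrapHits(const SyscallRegs& regs)
{
    if (regs.rax != 4)                              // cmp rax, 4 / jne
        return false;
    if (regs.r8 != SignExtendImm32(0x11224455u))    // cmp r8, 11224455h / jne
        return false;
    return true;
}
```

Every other system call takes one of the two early exits, which costs a couple of instructions instead of a round trip to the debugger.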
Then we will also need to jump into our trap from the start of the KiSystemCall64Shadow service routine by replacing one of the original instructions:
fffff805`08a131ad 6a2b push 2Bh
fffff805`08a131af 65ff342510900000 push qword ptr gs:[9010h]
A near jmp instruction with a 32-bit relative offset is 5 bytes long, so we can't fit it into the first, 2-byte push 2Bh instruction. Let's use the 8-byte push qword ptr gs:[9010h] instruction instead. Our original KiSystemCall64Shadow service routine then becomes this after the patch:
; nt!KiSystemCall64Shadow:
fffff805`08a13180 0f01f8 swapgs
fffff805`08a13183 654889242510900000 mov qword ptr gs:[9010h], rsp
fffff805`08a1318c 65488b242500900000 mov rsp, qword ptr gs:[9000h]
fffff805`08a13195 650fba24251890000001 bt dword ptr gs:[9018h], 1
fffff805`08a1319f 7203 jb ntkrnlmp!KiSystemCall64Shadow+0x24 (fffff805`08a131a4)
fffff805`08a131a1 0f22dc mov cr3, rsp
fffff805`08a131a4 65488b242508900000 mov rsp, qword ptr gs:[9008h]
fffff805`08a131ad 6a2b push 2Bh
; Our patch: Jump to our trap
fffff805`08a131af e92c020000 jmp ntkrnlmp!KiSystemCall64Shadow+0x260 (fffff805`08a133e0)
fffff805`08a131b4 90 nop
fffff805`08a131b5 90 nop
fffff805`08a131b6 90 nop
fffff805`08a131b7 4153 push r11
fffff805`08a131b9 6a33 push 33h
fffff805`08a131bb 51 push rcx
fffff805`08a131bc 498bca mov rcx, r10
fffff805`08a131bf 4883ec08 sub rsp, 8
fffff805`08a131c3 55 push rbp
.....
.....
We can use WinDbg to make the memory patch shown above with the eb command, as such:
eb fffff805`08a131af e9 2C 02 00 00 90 90 90
Note that we can calculate the offset for the jmp instruction using the address where we will place our trap, shown below.
Also note that we've padded the jmp instruction with 3 nops to match the size of the original push qword ptr gs:[9010h] instruction.
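The offset calculation is easy to get wrong by hand, so here is a small C++ sketch of it. The relative offset of a jmp rel32 is counted from the end of the 5-byte instruction, and the rest of the replaced 8-byte instruction is filled with nops. MakeJmpPatch is a hypothetical helper name, and the addresses in the usage note are the ones from my test system, which will differ on yours.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Builds "jmp rel32" placed at address `from` that lands on address `to`,
// padded with nops (90h) to the 8 bytes of the instruction it replaces.
inline std::array<std::uint8_t, 8> MakeJmpPatch(std::uint64_t from,
                                                std::uint64_t to)
{
    // rel32 is measured from the end of the 5-byte jmp instruction.
    const std::int64_t delta = static_cast<std::int64_t>(to - (from + 5));
    assert(delta >= INT32_MIN && delta <= INT32_MAX);  // must fit into rel32

    const std::uint32_t rel = static_cast<std::uint32_t>(delta);

    std::array<std::uint8_t, 8> patch;
    patch.fill(0x90);   // nop padding
    patch[0] = 0xE9;    // opcode of jmp rel32
    for (std::size_t i = 0; i < 4; ++i)
        patch[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));  // little-endian
    return patch;
}
```

Calling MakeJmpPatch(0xfffff80508a131af, 0xfffff80508a133e0) produces e9 2c 02 00 00 90 90 90, the same bytes that are written with the eb command above. The same function also covers backward jumps, where the offset is negative.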
Then we can put our trap in the free space somewhere at the end of the KiSystemCall64Shadow function:
fffff805`08a133e0 4883f804 cmp rax, 4
fffff805`08a133e4 750a jne ntkrnlmp!KiSystemCall64Shadow+0x270 (fffff805`08a133f0)
fffff805`08a133e6 4981f855442211 cmp r8, 11224455h
fffff805`08a133ed 7501 jne ntkrnlmp!KiSystemCall64Shadow+0x270 (fffff805`08a133f0)
fffff805`08a133ef 90 nop ; Place breakpoint here
fffff805`08a133f0 65ff342510900000 push qword ptr gs:[9010h]
fffff805`08a133f8 e9bafdffff jmp ntkrnlmp!KiSystemCall64Shadow+0x37 (fffff805`08a131b7)
We can use the following WinDbg command to write the machine code for our trap:
eb fffff805`08a133e0 48 83 f8 04 75 0a 49 81 f8 55 44 22 11 75 01 90 65 ff 34 25 10 90 00 00 e9 BA FD FF FF
As for the machine code itself, you can use any assembler to generate it. I personally use this one.
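Typing a long eb command by hand invites typos, and a typo in a kernel patch usually ends in a BSOD. As a convenience, here is a minimal C++ sketch that formats such a command from an address and a list of bytes. MakeEbCommand is a hypothetical helper of mine, not a WinDbg feature. It only builds the text that you would paste into the debugger.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Formats a WinDbg "eb" (enter bytes) command for the given kernel address.
// The address is printed in WinDbg's 64-bit style: high and low halves
// separated by a backtick.
inline std::string MakeEbCommand(std::uint64_t address,
                                 const std::vector<std::uint8_t>& bytes)
{
    char buf[32];
    std::snprintf(buf, sizeof(buf), "eb %08x`%08x",
                  static_cast<unsigned>(address >> 32),
                  static_cast<unsigned>(address & 0xFFFFFFFFu));
    std::string cmd = buf;

    // Append every byte as two hex digits separated by spaces.
    for (std::uint8_t b : bytes) {
        std::snprintf(buf, sizeof(buf), " %02x", static_cast<unsigned>(b));
        cmd += buf;
    }
    return cmd;
}
```

Passing the trap address and its 29 bytes of machine code gives back the command shown above, only with lowercase hex digits.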
You can manually locate the "free space" at the end of the KiSystemCall64Shadow function using the kernel debugger. Simply look for the padding 00's or CC's at the end of the function body. This padding is usually placed there by the compiler to align the code that follows.
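If you have dumped the function's bytes out of the debugger, the search for padding can be sketched in C++ like this. FindPadding is a hypothetical helper. It looks for the first run of identical 00 or CC bytes that is long enough for the patch (29 bytes for the trap in this post). Keep in mind that such bytes can also occur inside real instructions, for instance as part of an immediate operand, so always confirm the candidate location in the disassembly before writing to it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offset of the first run of at least `needed` identical padding
// bytes (all 00h or all CCh) in `code`, or -1 if no such run exists.
inline std::ptrdiff_t FindPadding(const std::vector<std::uint8_t>& code,
                                  std::size_t needed)
{
    std::size_t runStart = 0;
    std::size_t runLen = 0;
    std::uint8_t runByte = 0;

    for (std::size_t i = 0; i < code.size(); ++i) {
        const std::uint8_t b = code[i];
        const bool isPad = (b == 0x00 || b == 0xCC);

        if (isPad && runLen > 0 && b == runByte) {
            ++runLen;               // the current run continues
        } else if (isPad) {
            runStart = i;           // a new run begins here
            runLen = 1;
            runByte = b;
        } else {
            runLen = 0;             // a regular code byte ends the run
        }

        if (needed > 0 && runLen >= needed)
            return static_cast<std::ptrdiff_t>(runStart);
    }
    return -1;
}
```

The returned offset is relative to the start of the dumped buffer, so add it to the address where the dump began to get the kernel address for the trap.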
Finally, place the breakpoint at the fffff805`08a133ef address (in case of our patch) with the bp command in WinDbg:
bp fffff805`08a133ef
Optionally, you can place a hardware execution breakpoint on that instruction using the ba command:
ba e 1 fffff805`08a133ef
After that, let the guest OS run and step into the syscall from the NtWaitForSingleObject user-mode function. This should trigger the breakpoint in the kernel.
Then continue stepping through the kernel code to do your further research.
I would recommend disabling the kernel breakpoint in the syscall service handler after it is triggered to prevent repeated invocations. You can look up its ID with the bl command in WinDbg and disable it with the bd command.
Screencast
To recap everything that I've shown above, please watch the following screencast:
Conclusion
There are probably many ways to step into the kernel side of a syscall with a debugger. The one that I showed above has worked for me.
In case you know a better way, or want to share yours, please leave a comment below.