Intro
One can probably write a large book on the subject of reverse engineering, so one blog post is definitely not going to cut it. Still, I will try to give some basic outlines to anyone who wants to learn the art of software reverse engineering.
An Art Form
You probably noticed above that I called reverse engineering an art.
Note that reverse engineers, and software developers in general, are by nature quite a lazy bunch. We don't like to type a lot. One could argue that this is a contradiction: "Your whole existence revolves around typing text, or working with the source code!" This is true, and maybe that is why we don't like typing (although most of it is copy-and-pasting). Either way, this is probably why most reverse engineers abbreviate "reverse engineering" down to just "RE", and this is what I will do throughout this blog post. (And no, in this context RE doesn't stand for "regarding" or "referencing".)
RE is not the same as programming or software development. If you are learning C, or C++, or any other programming language, a teacher usually shows you the syntax of that language and how to write programming constructs with it, like variables, functions, conditional statements, loops, etc. After that you have to follow the strict syntax of the language while writing your code. If you deviate from it, your code will either not compile or will crash.
You may also later learn about data structures and algorithms, or learn how to use this or that library, or some specifics about system APIs for an operating system that you are coding for. All of that will give you more freedom of action, but still in general most software developers follow a somewhat strict set of rules and best practices while creating their apps.
RE, on the other hand, is not an exact science. It does not have a set of rigid rules, and there is no single well-marked path that leads to your goal. RE'ing is more like an art form: on one occasion you can achieve your goal quickly by taking a shortcut, while on another you may spend days going in one direction, only to realize that you got nowhere and have to retrace your steps and start all over again on a totally different path.
Such is normal for a reverse engineer. At times some RE work that seems trivial at first may take days.
What Is Reverse Engineering?
RE'ing in general is trying to look into the internals of something, usually some device or machine, with the goal of understanding how it works. RE'ing is usually required when the manufacturer, or maker, of a device did not provide a comprehensive manual for it. Additionally, a reverse engineer may look into the workings of a device to verify its maker's claims, or to find its secret (or undocumented) capabilities.
This definition of RE'ing is my own. If you want to check what Wikipedia has to say about it, see its article on reverse engineering.
I also need to point out that there're several branches of RE'ing. Just to name a few:
- Software RE - is what we will concentrate on in this blog post - RE'ing of software components without having access to their source code. It is usually done for the following purposes:
  - General attempts to understand how software works, especially if it's undocumented, or poorly documented.
  - To gain a better knowledge of the internal workings of this or that system API for programming purposes.
  - Verification that software is benign, or confirmation that software does what it claims "on the box".
  - Ability to modify, or "patch", software without having its source code.
  - Need to neuter malicious or unwanted software (so-called "malware research").
  - Search for software vulnerabilities (so-called "vulnerability research").
  - "Bug bounty hunting", or discovery of software bugs and vulnerabilities for a reward.
  - To learn how compilers work.
  - To be able to "hook", or modify, existing software for the purpose of software virtualization or sandboxing.
  - Espionage (industrial or state-sponsored).
  - To circumvent software licensing checks, or DRM, or to write so-called "software cracks" and "cheats".
I probably forgot a few use-cases for RE'ing in my list above. But you get the idea, right?
You can probably notice that RE'ing can be used for both benign and malicious purposes. In that sense, RE'ing is definitely a grayscale set of skills.
- Hardware RE - for purposes similar to the ones I outlined above, someone may be RE'ing microchips, circuit boards, industrial machines and other mechanical devices. This topic will not be covered in this blog post.
- Medical RE - a recent newcomer on the block is medical or genetic RE'ing. This includes RE'ing the constituent components of this or that drug or treatment, as well as RE'ing the molecular and genetic code of such compounds. I obviously have no idea what I am talking about in that field of RE'ing, thus I will not be covering it in this post either.
- RE in "Life" - this is more of a vernacular term for being able to "hack" your way through everyday things: stuff like using Pepsi to clean your commode, or similar clever uses of products for applications they weren't intended for. This subject will not be covered in this blog either.
As I mentioned above, this blog post will only cover software RE'ing.
Although there are many ways to apply software RE'ing, I will mostly concentrate on RE'ing of native and managed (.NET) binaries.
Experience ... LOTS of Experience
What matters in RE'ing is experience, and lots of it.
Quite often you may be working on something and spend days until you fully tackle it. In the process you may be acquiring some important understanding of how some components work internally, which may save you hours if you have to deal with a similar problem again in the future.
This is akin to trying to find some location in a city that you have never been to, without having a working GPS. Once you navigate through countless wrong turns and ask people for directions and finally find it, the next time you have to go there, it will take much less time to get there. And, the more you go there, the more efficient you will be at cutting your travel time.
RE'ing works in a very similar way.
Knowledge of the Computer/Device Architecture
It will be very difficult to fix a car engine without understanding what is under the hood. In the same vein, reverse engineers need to know the basic workings of the device they are RE'ing to be proficient at their job. The better you understand how your device works, the better reverse engineer you will be.
In case of a computer, smartphone, or any other computing device with a microchip in it, you need to understand the following concepts:
- CPU, or Central Processing Unit: Most of your work will be done with it. You will also need a grasp of how multi-core systems work.
- GPU, or Graphics Processing Unit: Unless you're working with video and GUI in general, you don't need in-depth understanding of GPUs, but a basic knowledge of what they are and how they work will be very handy.
- CPU Cache and how it is used by the CPU and for computing in general. This is not only helpful for your RE work, but for writing efficient programs as well.
- RAM, or Random-Access Memory: You will be working with it a lot.
- Data Bus and the general ways in which the CPU communicates with RAM and other peripherals.
- (Disk) Drive and persistent data storage in general, including how SSDs and other newer drives work.
- Peripherals such as sound card, network card, etc. Especially understanding how these parts communicate and operate along with other components.
In general, in-depth understanding or expertise in one or more fields of computer hardware will never hurt your RE work.
Knowledge of the CPU Architecture
Like I pointed out above, the CPU (or multiple CPUs) is something that you will be dealing with a lot, especially if you are doing native RE'ing.
Thus knowing at least one of the following CPU architectures really well is very important for a reverse engineer:
- Intel: x86-64 (or x86) - this is probably the most popular CPU architecture on desktops and laptops today. CPUs from Intel and AMD implement it and are used in a wide variety of computers. Almost all Windows systems run on it, as well as many Linux machines and older Macs.
  Note that even though the older (32-bit) x86 architecture is still around, and is still very important if you want to RE Windows software, I would dedicate more attention to the newer x86-64 architecture.
  The only exception to this statement is RE'ing malware, much of which still targets the 32-bit x86 architecture.
- ARM: Armv8 (Armv7) - at the time of this writing, the ARM architecture is gaining momentum and is very popular. Note that the version of the ARM chips will inevitably change in the future. For now, the AArch64 and AArch32 flavors (or execution states, in ARM's terminology) are what I would start with. AArch64 is currently used in many Android mobile devices. Apple also based their Apple silicon (currently M1 and M2 chips) on it. It is used extensively in their latest iPhone, iPad and Mac devices.
  It is also quite important to understand the main differences between the RISC and CISC CPU architectures, why RISC designs tend to be more power efficient, and why they are quickly gaining ground on CISC designs among mobile device manufacturers.
  Note that although the older Armv7 version of the ARM architecture is still around and is widely used in embedded devices, AArch64 is quickly replacing it and is definitely the future. So you may want to dedicate more of your learning resources to it.
Word of caution: Do not try to learn more than one CPU architecture at once. These are enormous fields of study. You may branch out from one architecture to another once you master it really well. But don't do it at an early stage.
There are many more CPU architectures than what I listed above. The ones listed are the most popular ones.
Knowledge of the OS Architecture
Software doesn't live in a vacuum. It runs in an Operating System (OS). By that I mean that it calls system functions, or APIs, of that OS, and abides by the rules and binary file structures of that OS. Thus knowing the internals of at least one of the following OSes (depending on your personal preference, or on demand in the job market) is quite crucial for any software reverse engineer:
- Windows - this is by far the most popular desktop OS out there. It is used on a vast range of hardware, from desktops to laptops to other personal devices.
  If you choose Windows, I would recommend also learning the following:
  - Win32/WinAPI - these are lower-level user-mode system functions that most code running on Windows eventually calls somewhere down its call stack. Thus understanding how these lower-level APIs work is essential for a reverse engineer on Windows.
  - COM - or Component Object Model, is a binary interface that is used internally by many Windows components.
  - .NET - this is a framework that Microsoft introduced on top of Win32. Binaries compiled for .NET are called managed and require a slightly different approach to RE'ing. C# is the main programming language that targets it.
  - Calling Conventions - these are the rules for how functions receive their arguments and return their results at the Assembly level. They differ between compilers and CPU architectures, as well as between the OSes that a binary will run on.
  - Kernel Primitives - anything from exception handling, to types of handles, to process isolation, to virtual memory, to ways of IPC, to synchronization locks used by the operating system. This list is quite extensive. I won't name everything here.
  The list above is far from exhaustive. In general, the more you know about the OS internals the better it is for your RE work.
- Linux - is probably the second most popular OS. It is open source and is very popular among computer enthusiasts.
  If you choose Linux, I would look into learning the following:
  - syscalls - Linux programs talk to the kernel by invoking system calls. These are documented in many online resources.
  - Calling Conventions - on Linux, functions are invoked slightly differently at the Assembly level. This is partially because software that runs on Linux is compiled with a different set of compilers than on Windows.
  - Bash - is a shell and scripting language that can help you achieve anything from automation to a quick lookup of different OS parameters and settings.
  - Console - is very important on Linux. Many of the tools are available only from the command line.
  I am not a big expert on Linux RE'ing, thus my knowledge is somewhat limited.
- macOS - is Apple's OS for their laptops and desktops. It is mostly closed-source and documentation for some of its internals may not be freely available. But this doesn't make it less attractive for a reverse engineer.
  If you work on macOS, I'd concentrate on the following:
  - syscalls - the macOS kernel (XNU) combines the Mach kernel with components derived from FreeBSD. Like Linux, it is a Unix-like system, so many of its system functions will look familiar if you know that OS.
  - Calling Conventions - these are also similar to Linux, and are important for a reverse engineer.
  - Frameworks - many of the Apple apps use this or that framework and API set. To name just a few: Cocoa, Carbon, AppKit, Core Data, XNU/IOKit, etc.
  - Bash - the same as for Linux, this scripting language will be very handy for a reverse engineer on macOS.
  - Terminal - command line tools are quite an integral part of advanced operations in macOS.
  Some of the macOS kernel is open source, which makes RE'ing it somewhat easier.
- iOS/watchOS/tvOS - these are younger cousins of the macOS that I described above. So pretty much the same bullet points apply here as well. The difference is that RE'ing iOS and others is way more difficult due to hardware security restrictions imposed by Apple.
- Android - is another very popular mobile OS, provided by Google. I am not an expert in it, and thus I cannot give you any advice.
The same word of caution applies to learning OS internals. Do not try to jump from one OS to another. These are quite extensive fields of study, and it is very hard to become proficient in the internals of more than one OS at the same time.
Your choice of which OS to pick should be your personal preference. I am sure you are familiar with some of them by just being a user.
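To make the "Calling Conventions" bullets above a bit more concrete, here is a minimal Python sketch. It lists where the first few integer or pointer arguments of a function are passed on x86-64 under the two most common conventions: the Microsoft x64 convention used on Windows, and the System V AMD64 convention used on Linux and macOS. The names `INT_ARG_REGISTERS` and `arg_location` are my own, made up for illustration, and the sketch ignores floating-point arguments (which go into XMM registers) and other details such as large structures.

```python
# Registers used for the first integer/pointer arguments on x86-64.
# Anything past these registers is passed on the stack.
# (Simplified: floating-point arguments are not covered here.)
INT_ARG_REGISTERS = {
    "microsoft_x64": ["rcx", "rdx", "r8", "r9"],
    "system_v_amd64": ["rdi", "rsi", "rdx", "rcx", "r8", "r9"],
}


def arg_location(convention: str, index: int) -> str:
    """Return where integer argument number `index` (0-based) is passed."""
    registers = INT_ARG_REGISTERS[convention]
    if index < len(registers):
        return registers[index]
    return "stack"


if __name__ == "__main__":
    for convention in INT_ARG_REGISTERS:
        locations = [arg_location(convention, i) for i in range(7)]
        print(f"{convention}: {', '.join(locations)}")
```

So if you see RCX and RDX being loaded right before a call in a Windows binary, those are most likely its first two arguments. The same function compiled for Linux would usually load RDI and RSI instead.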
Understanding of the Basic Computing Principles
A good grasp of the following principles and concepts is quite important for your RE work, so make sure to brush up on these:
- Multithreading - most of the programs that you will encounter these days run in a multithreaded environment. Thus knowing how to handle it is very important for a reverse engineer.
- Synchronization/Locks - these are important primitives that are involved in multithreading.
- Process Isolation - most operating systems these days support security isolation of the running programs. Understanding how it's done is important for a reverse engineer.
- Virtual Memory - is another concept used for process isolation and is an important thing to grasp for your RE work.
- Exceptions/Interrupts - these are important concepts of how hardware (mostly the CPU) handles errors in the running code and processes input from peripheral devices.
- CPU Privilege Levels - is another way running code can be securely isolated between different security levels in the operating system. This concept is vital for a reverse engineer.
- Kernel/User Mode - finally, it is very important for a reverse engineer to understand the fundamental differences between user-mode and kernel-mode code.
The list above is just something that I was able to think of. I'm sure I've missed some of the computing concepts. If so, remind me in the comments below.
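As a small illustration of the multithreading and synchronization items above, here is a Python sketch in which several threads increment one shared counter under a lock. The `Counter` class and the `run_workers` function are hypothetical names that I made up for this example. It only shows the general pattern and is not how any particular OS implements its locks.

```python
import threading


class Counter:
    """A shared counter whose increments are protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Read-modify-write is not atomic, so only one thread
        # is allowed to do it at a time.
        with self._lock:
            self.value += 1


def run_workers(thread_count: int, increments_per_thread: int) -> int:
    """Start several threads that bump the same counter; return the total."""
    counter = Counter()

    def worker() -> None:
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


if __name__ == "__main__":
    print(run_workers(thread_count=4, increments_per_thread=10_000))
```

When you RE native code, the same idea typically shows up as a call that acquires a lock, a short critical section, and a matching call that releases it. Recognizing that shape quickly is part of the experience I talked about earlier.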
Reverse Engineering Tools
You can't do any of the RE work without tools. These include debuggers, disassemblers, decompilers, hex viewers and editors, as well as any other system tools and utilities that can assist in your work.
This is a subject of its own, thus I wrote a separate blog post to demonstrate tools that I use for my RE work and also to show how to download and set them up.
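To give a feel for what the simplest of these tools does, here is a minimal hex viewer sketched in Python. It is only an illustration under my own assumptions (the `hex_dump` name and the three-column layout are made up for this example), and it is in no way a replacement for a real hex editor.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format `data` as offset, hex bytes and printable ASCII, one row per line."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(
            chr(byte) if 32 <= byte < 127 else "." for byte in chunk
        )
        lines.append(f"{offset:08x}  {hex_part:<{hex_width}}  {ascii_part}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(hex_dump(b"MZ\x90\x00\x03\x00\x00\x00Hello, RE!"))
```

The left column is the offset into the data, the middle one shows the raw bytes, and the right one shows the same bytes as text, which is often how you spot strings and file signatures at a glance.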
Being a Programmer/Software Developer
Being a programmer or a software developer is always a plus. Some reverse engineers argue that being a programmer is not essential. This may be true, but it is always beneficial. Plus, I am somewhat biased since I am a software developer myself.
My argument for needing some programming skills to be better at RE'ing goes like this: by knowing how to create software, and having hands-on experience doing it, you can better see the internals of someone else's code and thus spend less effort and time RE'ing it.
What programming languages do I recommend for a reverse engineer?
With some exceptions, I'd definitely try to learn compiled languages, such as:
- C - nothing teaches you better about computers and their architecture than the C programming language. It is also one of the oldest languages in the list that I will definitely recommend learning for a reverse engineer.
To take it even further, I'd recommend learning C as your first programming language.
- C++ - is a younger cousin of the C language that has been vastly expanded over the years. Knowing how to write code in C++ will prepare a reverse engineer to face the multitude of software that was written in it.
- Python - is not really a compiled language, but it is handy for automation in your RE work. It is also widely used by the RE community these days.
Python, strictly speaking, is not important for the act of RE itself.
- C# - if you intend to RE managed (or .NET) software on Windows.
- Java - it is another managed language. I put it here because it is widely used for development on Android, as well as for embedded devices.
- Go, Rust, etc. - any other compiled languages, especially newer ones like Rust, could be handy for RE'ing binaries written in those languages.
- Objective-C, Swift - if you are RE'ing in the Apple ecosystem, these languages will be as important as C and C++.
Another word of caution: Please do not attempt to learn more than one programming language at once. Take your time if that subject interests you.
Having said that, my next point is that knowing the Assembly language is also quite vital.
Low-Level Assembly Language
Even though some of the latest debuggers come with built-in decompilers, or tools that convert native machine code that runs on a CPU into a pseudo-language such as C, I would strongly recommend that you avoid relying solely on decompilers.
The reason is that the available decompilers may produce confusing and erroneous code, which may throw you off in your RE work. Moreover, some commercial decompilers are very expensive, and IMHO are not worth the money.
Since the native binaries that you may be faced with consist of machine code that runs directly on a specific CPU, learning the Assembly language for that CPU is quite vital for a reverse engineer.
The Assembly language to choose will depend on the CPU architecture that you pick.
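To show what "machine code" versus "Assembly" means in practice, here is a toy Python sketch that translates a handful of one-byte x86-64 instructions into their mnemonics. Please treat it as an illustration only. Real x86 instructions have variable length, with prefixes and operand bytes, so this sketch would misread almost any real program, and you should use a proper disassembler for actual work. The function names `decode_byte` and `disassemble` are my own.

```python
# Register numbering used by the one-byte PUSH/POP encodings in 64-bit mode.
REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

# A few instructions that are encoded as a single fixed byte.
FIXED_OPCODES = {0x90: "nop", 0xC3: "ret", 0xC9: "leave", 0xCC: "int3"}


def decode_byte(opcode: int) -> str:
    """Translate one opcode byte into an Assembly mnemonic, if we know it."""
    if opcode in FIXED_OPCODES:
        return FIXED_OPCODES[opcode]
    if 0x50 <= opcode <= 0x57:  # PUSH r64 is 0x50 + register number
        return f"push {REGISTERS[opcode - 0x50]}"
    if 0x58 <= opcode <= 0x5F:  # POP r64 is 0x58 + register number
        return f"pop {REGISTERS[opcode - 0x58]}"
    return f"db 0x{opcode:02x}"  # unknown to this toy: show the raw byte


def disassemble(code: bytes) -> list[str]:
    """Decode a byte string that consists only of one-byte instructions."""
    return [decode_byte(byte) for byte in code]


if __name__ == "__main__":
    # The three bytes 55 5D C3 stand for: push rbp / pop rbp / ret
    for line in disassemble(bytes([0x55, 0x5D, 0xC3])):
        print(line)
```

The point is that Assembly is just a human-readable spelling of the bytes that the CPU executes, and that mapping is different for every CPU architecture.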
Knowledge of Compilers
By that I mean knowing how compilers build binary executables. This is especially true for malware research, because malware and unwanted software in general often abuse operating system loaders by corrupting, or manipulating binary files.
Compilers are somewhat specific for an OS:
- MSVC - is the Microsoft Visual C++ compiler that is used primarily to compile native C or C++ binaries on Windows.
- GCC - this C/C++ compiler is used primarily on Linux, although recent versions of Microsoft Visual Studio can work with it as well.
- Clang - is a C/C++ compiler that is used by default in Xcode to compile macOS/iOS software, but it may be used on other OSes as well.
A description of compiler internals is beyond the scope of this blog post and can probably warrant a thick book of its own.
In general, knowledge of the internals of a compiled binary is always a plus for a reverse engineer, as well as for a software developer.
Binary Formats
And speaking of binary files, understanding the structure and binary formats of files that contain executable code is important for a reverse engineer. Namely:
- Portable Executable (PE) - this binary format is used primarily on Windows. Any files with the .exe, .dll, .sys and other binary extensions contain data in that format.
- Executable and Linkable Format (ELF) - is a binary file format for executables, primarily on Linux.
- Mach-O - this binary format is used internally by executables on macOS/iOS and other Apple operating systems.
Explanation of each binary format will probably take a long blog post of its own, and thus I will not do it here.
Make sure to match the binary format to the operating system that you intend to do your RE work on.
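As a first taste of these formats, each of them can be recognized by the "magic" bytes at the very beginning of a file. Below is a minimal Python sketch that does just that. The `identify_format` function and the strings it returns are my own naming, and the sketch deliberately skips things like Mach-O universal ("fat") binaries. It only checks the signatures and does not validate the rest of the file.

```python
import struct

# Mach-O magic numbers as they appear in the first four bytes of a file.
MACHO_MAGICS = {
    bytes.fromhex("feedface"): "Mach-O (32-bit, big-endian)",
    bytes.fromhex("cefaedfe"): "Mach-O (32-bit, little-endian)",
    bytes.fromhex("feedfacf"): "Mach-O (64-bit, big-endian)",
    bytes.fromhex("cffaedfe"): "Mach-O (64-bit, little-endian)",
}


def identify_format(data: bytes) -> str:
    """Guess the executable format from the leading bytes of a file."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:4] in MACHO_MAGICS:
        return MACHO_MAGICS[data[:4]]
    if data[:2] == b"MZ" and len(data) >= 0x40:
        # The DOS header stores the offset of the PE header at 0x3C.
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
            return "PE"
        return "MZ (DOS executable without a PE header)"
    return "unknown"


if __name__ == "__main__":
    # Build a tiny fake PE image in memory: DOS header, then "PE\0\0" at 0x40.
    fake_pe = bytearray(0x44)
    fake_pe[0:2] = b"MZ"
    struct.pack_into("<I", fake_pe, 0x3C, 0x40)
    fake_pe[0x40:0x44] = b"PE\x00\x00"
    print(identify_format(bytes(fake_pe)))
    print(identify_format(b"\x7fELF" + bytes(12)))
```

To try it on a real file, read the first few kilobytes of the file in binary mode and pass them to `identify_format`. Note that the PE check follows one pointer inside the file, which is a small preview of how these formats are laid out: headers that point to other headers.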
Conclusion
In spite of the fact that the list of requirements to become a successful reverse engineer is quite daunting, don't let that faze you. If you start with one thing and slowly move forward, you will be surprised by how quickly you reach a good level of proficiency.
This by the way applies not only to the field of reverse engineering.
The main advice is - don't give up! Move forward. The journey is not as hard as it looks in the beginning.
Happy RE'ing!