In this tutorial I'll teach you how to debug ARM assembly language on macOS with the lldb debugger. Ultimately, my goal with these tutorials is to teach advanced debugging techniques that apply to both offensive and defensive information security. These tutorials are designed for novices with very little programming experience.
Let's get started by writing a simple C program, saved as hello_world.c:
#include <stdio.h>

int main(){
    printf("Hello, World !\n");
    return 1;
}
As I've said before, C is the Latin of programming languages: it is a very verbose language that has influenced innumerable others. It gives the programmer a great deal of control over memory, and because it is a compiled language it runs very fast. Because of these traits, C became the language in which most operating systems and many higher-level programming languages are written.
C is such a low-level language that in order to print something to STDOUT we must include the header for the Standard Input/Output library. That is exactly what #include <stdio.h> at the top of our program does: it gives us access to the printf function.
Every C program starts running in a main function. Languages like Ruby, Python and JavaScript abstract away the declaration and use of a main function, but their interpreters are themselves compiled programs, so if you look closely at the stack traces for those languages you'll see that one is there in the background. Functions run and evaluate to a value, and we must specify the type of that value so that the compiler knows how much memory to set aside for it.
In this short example we return an integer, which is a whole number that takes up 4 bytes of memory on this platform. A byte is a collection of 8 bits. A bit is a single unit of memory that can be in one of two states, 1 or 0. This means that an integer is actually composed of 32 1's or 0's residing within memory. By convention a successful run of a program is denoted by returning 0, while any nonzero value signals an error. Our example returns 1 anyway, which technically reports a failure to the shell but does not affect anything we do in the debugger.
Computers are smart, but they aren't smart enough to read and run a C program directly. C is just a convenient language for humans to write in. A compiler is what transforms our C code into a series of binary instructions that the computer can "understand" and execute.
macOS provides the gcc command once the Xcode Command Line Tools are installed; on macOS it is actually a front end for Apple's clang compiler. It can do a lot: if you run gcc --help you'll see all of the different flags that the tool comes with. We're only compiling a relatively simple program, and we can do so with this command:
gcc -g -o hello_world hello_world.c
The -o flag allows us to name our executable; in this case it will be called hello_world. The -g flag tells our compiler to generate debugging symbols, which are stored in a .dSYM directory. If you poke around and run ls in the directory where you compiled your program, you'll notice that we now have a hello_world executable and a hello_world.dSYM directory. The -o flag is optional: if we chose to omit it, our binary would be given the default name a.out, and correspondingly an a.out.dSYM directory would be created.
Let's run our executable by typing ./hello_world into the terminal.
You didn't come here just to write "Hello, World!" programs. It's time to dive deep into the magical world of computation by debugging our executable with lldb, the native debugger on macOS. The debugger will bring us into a shell-like environment where we can run, pause and modify our program in real time. You can start your debugger with lldb hello_world.
You're now in the debugger's shell. If you ever get lost you can run help to see a list of commands. If you want to know more about a specific command, say the breakpoint command, you would type help breakpoint. Lastly, commands have sub-commands, and if you wanted to learn more about the breakpoint set command you'd type help breakpoint set.
Now that we're in our debugger shell we can view our original C code by running
list main
which should produce the following:
File: /Users/corery/c_projects/hello_world.c
1 #include <stdio.h>
2
3 int main(){
4 printf("Hello, World !\n");
5 return 1;
6 }
7
Let's pause our program right before the printf function executes by setting a breakpoint on line 4.
(lldb) breakpoint set -l 4
Breakpoint 1: where = hello_world`main + 24 at hello_world.c:4:2, address = 0x0000000100003f88
Line 4 is located at address 0x0000000100003f88 in the program's memory. This number is written in hexadecimal, a base 16 numeral system, as opposed to the base 10 system that you're used to. The digits of base 10 and base 16 are compared below:
base 10 | base 16
________|___________
0 | 0
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
10 | A
11 | B
12 | C
13 | D
14 | E
15 | F
Typically when debugging you don't need to convert hexadecimal numbers to base 10 to understand what's going on; that would be way too much work. You just need to understand where memory addresses sit in relation to one another, that is, which addresses are higher and which are lower.
Now that we have some understanding of base 16 and we've set a breakpoint, it's time to run our program with the aptly named run command. Notice how the program stops just before the printf call. You should see the following:
(lldb) run
Process 3181 launched: '/Users/corery/c_projects/hello_world' (arm64)
Process 3181 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x0000000100003f88 hello_world`main at hello_world.c:4:2
1 #include <stdio.h>
2
3 int main(){
-> 4 printf("Hello, World !\n");
5 return 1;
6 }
7
Target 0: (hello_world) stopped.
Our debugger conveniently shows both the memory address of the paused program's code and the actual line of C at which we're stopped. The memory of an executable like the one we compiled is commonly described as four segments. From lowest to highest addresses they are typically the code, data, heap, and stack. We'll dive into each of these segments in more detail later; for now all you need to understand is that the code segment stores the actual instructions of the executable, and as you can see our program is paused at the instruction located at 0x0000000100003f88.
While our program takes a break from running we can finally dive into the magical world of ARM assembly. Let's disassemble our program.
(lldb) disass
hello_world`main:
0x100003f70 <+0>: sub sp, sp, #0x20
0x100003f74 <+4>: stp x29, x30, [sp, #0x10]
0x100003f78 <+8>: add x29, sp, #0x10
0x100003f7c <+12>: stur wzr, [x29, #-0x4]
0x100003f80 <+16>: adrp x0, 0
0x100003f84 <+20>: add x0, x0, #0xfa8 ; "Hello, World !\n"
-> 0x100003f88 <+24>: bl 0x100003f9c ; symbol stub for: printf
0x100003f8c <+28>: mov w0, #0x1
0x100003f90 <+32>: ldp x29, x30, [sp, #0x10]
0x100003f94 <+36>: add sp, sp, #0x20
0x100003f98 <+40>: ret
That's a lot to take in. As you can see, our 2 line C function produced 11 lines of assembly; imagine how much assembly a 100 line C program would produce. I used to think of C as a low-level programming language, mostly because I was used to programming in Ruby, but after realizing how verbose and complex assembly language is I came to appreciate the level of abstraction and utility of C.
You came here to understand ARM assembly, so let's break down the first line of code.
0x100003f70 <+0>: sub sp, sp, #0x20
The 0x100003f70 all the way to the left is the memory address of the instruction. The instruction stored at 0x100003f70 is shown all the way to the right, i.e. sub sp, sp, #0x20. Like any language, ARM assembly has a grammar or syntax that its instructions must follow. In this case the format is <operation> <destination>, <source1>, <source2>. Here that means: subtract 0x20 (32 in base 10) from sp and store the result back in sp. Note that not all instructions make use of the second source operand, which is referred to as the flexible operand. Flexible operands are one of the features that distinguish ARM assembly from x86_64 assembly.
Before diving into what every line of this assembly means we need to understand two concepts: registers and operations. A processor register is a hardware variable; it is where the CPU holds the data it is working on as the program executes. Registers are not stored in RAM. They are a small, fixed set of very fast storage locations built into the CPU itself, so adding more RAM does not give you more registers. We'll go over registers in the next part of this series.