Introduction
My solution to the 5th reverse engineering challenge from challenges.re. The assembly code to understand is the following:
f:
cmp rcx, rsi
ja .L10
sub rsi, rcx
add rsi, 1
mov r11, rsi
je .L10
test rcx, rcx
jne .L16
mov rax, rdi
ret
.L10:
xor eax, eax
ret
.L16:
push rbx
xor r10d, r10d
mov r9d, 1
.L4:
lea rax, [rdi+r10]
xor esi, esi
xor r8d, r8d
.L8:
movzx ebx, BYTE PTR [rdx+rsi]
cmp BYTE PTR [rax+rsi], bl
cmovne r8d, r9d
add rsi, 1
cmp rsi, rcx
jne .L8
test r8d, r8d
je .L12
add r10, 1
cmp r10, r11
jne .L4
xor eax, eax
.L12:
pop rbx
ret
A little bit longer than the previous one. Let's start analyzing it.
Analysis
Inferring f signature
As in our previous challenges we have a function f
, now this function has more jumps than our previous challenges. Looking at our first lines
cmp rcx, rsi
ja .L10
sub rsi, rcx
add rsi, 1
mov r11, rsi
je .L10
test rcx, rcx
jne .L16
mov rax, rdi
ret
We can infer the following, first it's expected that one of the parameters to be smaller than the other one. More precisely rcx
should be less than rsi
. Also the value in rcx
shouldn't be equal to rsi
. In both of these case the program will jump to the end of execution and will return a zeroed eax
register. For that reason we can assume that rcx
and rsi
are parameters passed in our function, and both of them numbers. Then we have
// ...
if (b >= a)
return NULL;
Now one thing particular interesting here, is that after this check we perform two instructions which equivalent in C would be something like this:
int d = a - b + 1;
Which give us a clue that maybe we are dealing here with a counting of elements in an array or string. This is quite similar, if we want to count the amount of elements between an index i and j of an array, including both. For example, the elements between index i = 0, and j = 3, would be 3 - 0 + 1 = 4. Also it's also possible, that if we are dealing with strings instead of arrays(which in C is just an array of char
type), what we perform with this operation can be seen as the difference in size between two stings. Let's keep this thought in mind. It's also save to assume that we are receiving in our arguments both of these arrays or strings. The main reason for this assumption, is that there's no malloc
call, so the program is not allocating memory in the heap for any of these 2 arrays or strings. Now we need to decide which one of these two options should be. Let's explore a bit more before deciding that.
Next instructions, give us more clues about what could be happening here, for example:
test rcx, rcx
jne .L16
mov rax, rdi
ret
Which basically it's checking if one of our parameter it's 0 we can return the value in register rdi
. Interesting, what's in rdi
? Knowing this, we can infer the return type of f
. Making a search in the program you can notice that the way, rdi
it's used in the program, gave us the impression that rdi
it's first of all a parameter also passed into f
and a string. The reason for been a string, is specially this instruction:
.L4:
lea rax, [rdi + r10]
;; ... after .L8 tag
cmp BYTE PTR [rax + rsi], bl
Which can be interpreted as please load the address of rdi into rax register, and later we will perform a comparison of its character. From this we have that rdi
contains our first string. With this we can infer the most important part, the signature of f
. Which should be as follows:
char *f(char *str1, int str1_size, char *str2, int str2_size)
Putting all previously explained into into C code, at the moment we have:
char *f(char *str1, int str1_size, char *str2, int str2_size)
{
if (str2_size >= str1_size)
return NULL;
int diff = str1_size - str2_size + 1;
if (str2_size != 0) {
// something goes here...
}
return str1;
}
Nice this is a great progress. One key note to take here is that the signature of the function is one of the first thing you should try to figure out.
Inferring main logic
Now here comes the part that will give us the whole part of the puzzle. At the moment we can describe what we have in a formal language as: the function f receives two strings, if the second one it's longer than the first one we return a NULL response. In case the the second one has length 0, we return the first string. Late on, we will give a better formal description, now we need to figure out the main logic.
Ok, so the main logic happens under .L16 tag. First thing to notice here, is that we initialize a counter on register r10d
, and we use that counter to iterate over the string in register rdi
, which is our first string. The instructions that gave us this clue are the following:
.L16:
push rbx ;; push into stack to preserve its value, given that later we modify rbx register
xor r10d, r10d
mov r9d, 1
.L4:
;; .... more down in the code after .L8 tag
add r10, 1
cmp r10, r11
jne .L4
Yes! We can identify here a loop which stop condition it's that the counter in r10
register get the value of r11
. What is it in the r11
register? Looking back, we can remember that in r11
we stored the result of str1_size - str2_size + 1
. Then we can infer here, the following C code:
// previous code
int diff = str1_size - str2_size + 1;
if (str2_size != 0) {
int i = 0;
for (i = 0; i < diff; i++) {
// something goes here...
}
}
In the same way we can identify that there's also a second loop, which stop condition it's that the counter reach str2_size
. These can be seen in instructions:
.L4:
lea rax, [rdi + r10]
xor esi, esi
xor r8d, r8d
.L8:
;; instructions goes here...
add rsi, 1
cmp rsi, rcx
jne .L8
This an be written in C, in this way:
int diff = str1_size - str2_size + 1;
if (str2_size != 0) {
int i = 0, j = 0;
for (i = 0; i < diff; i++) {
int flag =0;
for (j = 0; j < str2_size; j++) {
if (str1[j] != str2[j]) {
flag = 1;
}
}
if (flag == 0) {
return str1;
}
str1++;
}
}
Here I added to subsequent logic, to resume a bit. The instructions
movzx ebx, BYTE PTR [rdx+rsi]
cmp BYTE PTR [rax+rsi], bl
cmovne r8d, r9d
;; ...
test r8d, r8d
je .L12
;; ...
.L12:
pop rbx
ret
Gave us the clue for the code inside the loops. The whole code in C, of function f
would be the following:
char *f(char *str1, int str1_size, char *str2, int str2_size)
{
// cmp rcx, rsi
// ja .L10
// and later on...
// je .L10
if (str2_size >= str1_size)
// .L10:
// xor eax, eax
// ret
return NULL;
// sub rsi, rcx
// add rsi, 1
// mov r11, rsi
int diff = str1_size - str2_size + 1;
if (str2_size != 0) {
int i = 0, j = 0;
for (i = 0; i < diff; i++) {
int flag =0;
// movzx ebx, BYTE PTR [rdx+rsi]
// cmp BYTE PTR [rax+rsi], bl
// cmovne r8d, r9d
// add rsi, 1
// cmp rsi, rcx
// jne .L8
for (j = 0; j < str2_size; j++) {
if (str1[j] != str2[j]) {
flag = 1;
}
}
if (flag == 0) {
return str1;
}
str1++;
}
return NULL;
}
return str1;
}
Formal description
The idea behind these challenges is not actually to translate all the code in C, but to been able to describe what this assembly code does with words. In a way that you can use that description as a function documentation for example. Then a proper description for this function would be:
Formal description or documentation:
Given two strings and their length, f checks if the second string its a substring of the first one. In case is not we will receive a NULL value, otherwise we receive the pointer to the position where the substring begins
Final notes
Advices:
- It's always better to have well defined the signature of the function, in case of any error in tht part your whole interpretation of the code will be messed up.
- Have a good understanding of what's the difference between
lea
andmov
instructions. In my particular case, I always mess up with these two simple instructions. You can checkout this answer in SO.
Top comments (0)