Gealber Morales

Posted on Sep 17, 2023 • Originally published at gealber.com

Challenge RE # 5

#reverseengineering #assembly #c #challenge

Introduction

My solution to the 5th reverse engineering challenge from challenges.re. The assembly code to understand is the following:


f:
    cmp rcx, rsi
    ja  .L10
    sub rsi, rcx
    add rsi, 1
    mov r11, rsi
    je  .L10
    test    rcx, rcx
    jne .L16
    mov rax, rdi
    ret
.L10:
    xor eax, eax
    ret
.L16:
    push    rbx
    xor r10d, r10d
    mov r9d, 1
.L4:
    lea rax, [rdi+r10]
    xor esi, esi
    xor r8d, r8d
.L8:
    movzx   ebx, BYTE PTR [rdx+rsi]
    cmp BYTE PTR [rax+rsi], bl
    cmovne  r8d, r9d
    add rsi, 1
    cmp rsi, rcx
    jne .L8
    test    r8d, r8d
    je  .L12
    add r10, 1
    cmp r10, r11
    jne .L4
    xor eax, eax
.L12:
    pop rbx
    ret

A little bit longer than the previous one. Let's start analyzing it.

Analysis

Inferring f signature

As in our previous challenges we have a function f, now this function has more jumps than our previous challenges. Looking at our first lines

    cmp rcx, rsi
    ja  .L10
    sub rsi, rcx
    add rsi, 1
    mov r11, rsi
    je  .L10
    test    rcx, rcx
    jne .L16
    mov rax, rdi
    ret

We can infer the following, first it's expected that one of the parameters to be smaller than the other one. More precisely rcx should be less than rsi. Also the value in rcx shouldn't be equal to rsi. In both of these case the program will jump to the end of execution and will return a zeroed eax register. For that reason we can assume that rcx and rsi are parameters passed in our function, and both of them numbers. Then we have

    // ... 
    if (b >= a)
        return NULL;

Now one thing particular interesting here, is that after this check we perform two instructions which equivalent in C would be something like this:

int d = a - b + 1;

Which give us a clue that maybe we are dealing here with a counting of elements in an array or string. This is quite similar, if we want to count the amount of elements between an index i and j of an array, including both. For example, the elements between index i = 0, and j = 3, would be 3 - 0 + 1 = 4. Also it's also possible, that if we are dealing with strings instead of arrays(which in C is just an array of char type), what we perform with this operation can be seen as the difference in size between two stings. Let's keep this thought in mind. It's also save to assume that we are receiving in our arguments both of these arrays or strings. The main reason for this assumption, is that there's no malloc call, so the program is not allocating memory in the heap for any of these 2 arrays or strings. Now we need to decide which one of these two options should be. Let's explore a bit more before deciding that.

Next instructions, give us more clues about what could be happening here, for example:

test rcx, rcx
jne .L16
mov rax, rdi
ret

Which basically it's checking if one of our parameter it's 0 we can return the value in register rdi. Interesting, what's in rdi? Knowing this, we can infer the return type of f. Making a search in the program you can notice that the way, rdi it's used in the program, gave us the impression that rdi it's first of all a parameter also passed into f and a string. The reason for been a string, is specially this instruction:

.L4:
lea rax, [rdi + r10]
;; ... after .L8 tag
cmp BYTE PTR [rax + rsi], bl

Which can be interpreted as please load the address of rdi into rax register, and later we will perform a comparison of its character. From this we have that rdi contains our first string. With this we can infer the most important part, the signature of f. Which should be as follows:

char *f(char *str1, int str1_size, char *str2, int str2_size)

Putting all previously explained into into C code, at the moment we have:

char *f(char *str1, int str1_size, char *str2, int str2_size)
{
    if (str2_size >= str1_size)
        return NULL;

    int diff = str1_size - str2_size + 1;

    if (str2_size != 0) {
        // something goes here...
    }

    return  str1;
}

Nice this is a great progress. One key note to take here is that the signature of the function is one of the first thing you should try to figure out.

Inferring main logic

Now here comes the part that will give us the whole part of the puzzle. At the moment we can describe what we have in a formal language as: the function f receives two strings, if the second one it's longer than the first one we return a NULL response. In case the the second one has length 0, we return the first string. Late on, we will give a better formal description, now we need to figure out the main logic.

Ok, so the main logic happens under .L16 tag. First thing to notice here, is that we initialize a counter on register r10d, and we use that counter to iterate over the string in register rdi, which is our first string. The instructions that gave us this clue are the following:

.L16:
    push rbx ;; push into stack to preserve its value, given that later we modify rbx register
    xor r10d, r10d
    mov r9d, 1
.L4:
    ;; .... more down in the code after .L8 tag
    add r10, 1
    cmp r10, r11
    jne .L4

Yes! We can identify here a loop which stop condition it's that the counter in r10 register get the value of r11. What is it in the r11 register? Looking back, we can remember that in r11 we stored the result of str1_size - str2_size + 1. Then we can infer here, the following C code:

    // previous code

    int diff = str1_size - str2_size + 1;

    if (str2_size != 0) {
        int i = 0;
        for (i = 0; i < diff; i++) {
            // something goes here...
        }
    }

In the same way we can identify that there's also a second loop, which stop condition it's that the counter reach str2_size. These can be seen in instructions:

.L4:
    lea rax, [rdi + r10]
    xor esi, esi
    xor r8d, r8d
.L8:
    ;; instructions goes here...
    add rsi, 1
    cmp rsi, rcx
    jne .L8

This an be written in C, in this way:


    int diff = str1_size - str2_size + 1;

    if (str2_size != 0) {
        int i = 0, j = 0;
        for (i = 0; i < diff; i++) {
            int flag =0;
            for (j = 0; j < str2_size; j++) {
                if (str1[j] != str2[j]) {
                    flag = 1;
                }
            }

            if (flag == 0) {
                return str1;
            }

            str1++;
        }
    }

Here I added to subsequent logic, to resume a bit. The instructions

    movzx   ebx, BYTE PTR [rdx+rsi]
    cmp BYTE PTR [rax+rsi], bl
    cmovne  r8d, r9d
    ;; ...
    test r8d, r8d
    je .L12
    ;; ...
.L12:
    pop rbx
    ret

Gave us the clue for the code inside the loops. The whole code in C, of function f would be the following:

char *f(char *str1, int str1_size, char *str2, int str2_size)
{

    // cmp  rcx, rsi
    // ja   .L10
    // and later on...
    // je   .L10
    if (str2_size >= str1_size)
        // .L10:
        //     xor  eax, eax
        //     ret
        return NULL;

    // sub  rsi, rcx
    // add  rsi, 1
    // mov  r11, rsi
    int diff = str1_size - str2_size + 1;

    if (str2_size != 0) {
        int i = 0, j = 0;
        for (i = 0; i < diff; i++) {
            int flag =0;
            // movzx    ebx, BYTE PTR [rdx+rsi]
            // cmp  BYTE PTR [rax+rsi], bl
            // cmovne   r8d, r9d
            // add  rsi, 1
            // cmp  rsi, rcx
            // jne  .L8
            for (j = 0; j < str2_size; j++) {
                if (str1[j] != str2[j]) {
                    flag = 1;
                }
            }

            if (flag == 0) {
                return str1;
            }

            str1++;
        }

        return NULL;
    }

    return  str1;
}

Formal description

The idea behind these challenges is not actually to translate all the code in C, but to been able to describe what this assembly code does with words. In a way that you can use that description as a function documentation for example. Then a proper description for this function would be:

Formal description or documentation:

Given two strings and their length, f checks if the second string its a substring of the first one. In case is not we will receive a NULL value, otherwise we receive the pointer to the position where the substring begins

Final notes

Advices:

It's always better to have well defined the signature of the function, in case of any error in tht part your whole interpretation of the code will be messed up.
Have a good understanding of what's the difference between lea and mov instructions. In my particular case, I always mess up with these two simple instructions. You can checkout this answer in SO.

DEV Community

Challenge RE # 5

Introduction

Analysis

Inferring f signature

Inferring main logic

Formal description

Final notes

Top comments (0)

Read next

Implement Circuit Breaker using Polly in .Net Core 8

Day 01 , what did learn

Custom GPTs: Revolutionizing Tech Integration

Mastering Parentheses in Emacs: Essential Commands