There are many different kinds of optimization when it comes to assembly code.
The most popular is, of course, speed optimization, which aims for the fastest possible code, often using MMX, SSE, or AVX instructions to process as much data as possible at once.
But there is one particular area of assembly programming that focuses on size optimization. I have used this knowledge many times in my software reverse engineering projects, either to modify compiled binaries where only a limited amount of space was available for the patched code, or to develop shellcode for 0-day exploits, where the size of the shellcode is again limited.
Programmers who write in assembly tend to think that their code is already optimized to the maximum (after all, it's assembly!), but as I've found out, there are many tricks that can be used to achieve even better results in terms of minimizing code size.
Zeroing of CPU registers
mov eax,0 ; 5 bytes -> B8 00 00 00 00
xor eax,eax ; 2 bytes -> 33 C0
sub eax,eax ; 2 bytes -> 2B C0
and eax,0 ; 3 bytes -> 83 E0 00
As it turns out, even the simplest operation can take up to 5 bytes, but if we use xor
instruction instead, the same operation will take 2 bytes in the resulting program code. The value 0
is often used as a base parameter for WinAPI functions.
Example with the standard version of code
push offset szSansSerif ; lpFace ; 5 bytes
push 0 ; pitch and family ; 2 bytes
push 0 ; output quality ; 2 bytes
push 0 ; clipping precision ; 2 bytes
push 0 ; output precision ; 2 bytes
push 1 ; char set identifier ; 2 bytes
push 0 ; strikeout attribute flag ; 2 bytes
push 1 ; underline attribute flag ; 2 bytes
push 0 ; italic attribute flag ; 2 bytes
push 400 ; font weight(normal) ; 5 bytes
push 0 ; base-line orientation angle ; 2 bytes
push 0 ; angle of escapement ; 2 bytes
push 0 ; logical average character ; 2 bytes
push 0Dh ; logical height of font ; 2 bytes
call CreateFontA
The instructions that push the parameters for the CreateFontA call take 34 bytes in total in this case.
Size optimized version
sub eax,eax ; 2 bytes
push offset szSansSerif ; lpFace ; 5 bytes
push eax ; pitch and family ; 1 byte
push eax ; output quality ; 1 byte
push eax ; clipping precision ; 1 byte
push eax ; output precision ; 1 byte
push 1 ; char set identifier ; 2 bytes
push eax ; strikeout attribute flag ; 1 byte
push 1 ; underline attribute flag ; 2 bytes
push eax ; italic attribute flag ; 1 byte
push 400 ; font weight(normal) ; 5 bytes
push eax ; base-line orientation angle ; 1 byte
push eax ; angle of escapement ; 1 byte
push eax ; logical average character ; 1 byte
push 0Dh ; logical height of font ; 2 bytes
call CreateFontA
This time it takes 27 bytes, a small gain over the previous version, but sometimes these few bytes can be useful for something else.
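The two totals above can be re-checked with a small tally. This is only a sketch in Python: the `SIZES` table and the `total` helper are names made up for this illustration, and the sizes assume 32-bit code with the usual encodings (68 id, 6A ib, 50, 2B C0).

```python
# Encoded sizes (in bytes) of the 32-bit instructions used in the two listings.
SIZES = {
    "push imm32": 5,    # 68 id  (push offset szSansSerif, push 400)
    "push imm8": 2,     # 6A ib  (push 0, push 1, push 0Dh)
    "push eax": 1,      # 50
    "sub eax,eax": 2,   # 2B C0
}

def total(instrs):
    """Sum the encoded sizes of a list of instruction kinds."""
    return sum(SIZES[i] for i in instrs)

# CreateFontA takes 14 arguments: 2 need imm32, 3 are non-zero imm8, 9 are zero.
standard = ["push imm32"] * 2 + ["push imm8"] * 12
optimized = (["sub eax,eax"] + ["push imm32"] * 2
             + ["push imm8"] * 3 + ["push eax"] * 9)

print(total(standard), total(optimized))   # 34 27
```

Each zero argument saves one byte, and the 2-byte sub eax,eax is paid once, so with nine zero arguments the net gain is 7 bytes.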
Passing series of the same values
If we need to pass the same value to a function several times in a row, it's usually done like this:
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
push 0 ; 2 bytes
================================
= 14 bytes
Or more size optimized, like this:
sub eax,eax ; 2 bytes
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
push eax ; 1 byte
===============================
= 9 bytes
But it can be further size optimized using a simple loop:
sub eax,eax ; 2 bytes
push 7 ; 2 bytes
pop ecx ; 1 byte
@save_args:
push eax ; 1 byte
loop @save_args ; 2 bytes
================================
= 8 bytes
I haven't seen this type of size optimization in code generated by either GCC or LLVM (with size optimizations enabled), so it's a trick strictly reserved for hand-optimized assembly code.
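To see when the loop actually pays off, the three variants can be compared for an arbitrary number of pushes. The following Python sketch uses hypothetical helper names and assumes the count fits the short push imm8 form (1 to 127):

```python
def size_push_imm8(n):
    """n x 'push 0' (6A 00)."""
    return 2 * n

def size_push_reg(n):
    """'sub eax,eax' followed by n x 'push eax'."""
    return 2 + n

def size_loop(n):
    """sub eax,eax / push n / pop ecx / push eax / loop: constant size."""
    if not 1 <= n <= 127:
        raise ValueError("count must fit the short push imm8 form")
    return 2 + 2 + 1 + 1 + 2

for n in (3, 6, 7, 8):
    print(n, size_push_imm8(n), size_push_reg(n), size_loop(n))
```

Under these assumptions the loop ties with the push eax version at 6 values and wins from 7 upward. Keep in mind that the loop version also overwrites ecx, which ends up as 0.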
Zeroing EDX register
If we intend to zero the edx register, we normally do so with e.g. xor edx,edx but it can be done in even less space by using the cdq instruction (it stands for Convert Doubleword to Quadword).
The cdq instruction fills the edx register with the sign bit of the eax register (the sign bit is the most significant bit of the register value, so in this case it's bit 31).
So if we know that eax holds e.g. 1, then executing cdq will reset edx to zero.
If you are not sure about the content of the eax register (for example, after a function call), you shouldn't use it, because it can lead to errors:
eax=80000001h = 10000000000000000000000000000001b
                ^ most significant bit of the EAX register is set to 1
Here, executing cdq fills edx with the sign bit of eax, which is 1, so edx will contain 0FFFFFFFFh.
The cdq instruction takes only one byte.
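The behaviour of cdq described above can be modelled in a few lines. This is a Python sketch of the semantics only, and the function name `cdq` is just a label chosen for the illustration:

```python
def cdq(eax):
    """Model of CDQ: return the EDX value produced for a given 32-bit EAX."""
    eax &= 0xFFFFFFFF
    sign = (eax >> 31) & 1            # bit 31 is the sign bit
    return 0xFFFFFFFF if sign else 0

print(hex(cdq(1)))            # 0x0
print(hex(cdq(0x80000001)))   # 0xffffffff
```

In other words, cdq is a safe replacement for xor edx,edx only when eax is known to be below 80000000h.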
Transferring 32-bit values from 0-255 range to the CPU registers
mov eax,7Fh ; 5 bytes -> B8 7F 00 00 00
sub eax,eax ; 4 bytes -> 2B C0
mov al,7Fh  ;            B0 7F
push 7Fh    ; 3 bytes -> 6A 7F
pop eax     ;            58
It is often necessary to load a value from the 0-255 range into a 32-bit register. We can do it like this:
mov eax,4 ; B8 04 00 00 00
This instruction takes 5 bytes, because the value 4 is treated as a full 32-bit immediate that needs 4 bytes to encode. The most size-optimized solution is to push this value onto the stack and pop it back into the CPU register:
push 4 ; 6A 04
pop eax ; 58
This time it takes only 3 bytes. Even though it takes up more lines in the source code, it takes up fewer bytes on disk!
It should be mentioned that the assembler will emit the short form of the push instruction if the value is in the 0-127 range (the operand is encoded as a signed 8-bit integer).
If you want to use the short form of push for byte values with the high bit set (80h-0FFh), you need to either write the value as a negative number:
push -127
or use a helper macro:
pushb macro byteval
db 06Ah,byteval
endm
pushb 080h ; store the value 128 (80h)
pop eax
After these instructions are completed, eax will hold the value 0FFFFFF80h (-80h). But why not 00000080h?
The numbers in the range 128-255 in the short version of the push instruction are treated as negative numbers (they are sign-extended).
The sign bit from the short encoded integer value is then copied to the upper bits of the CPU register:
00000000 00000000 00000000 10000000 = 00000080h
^integer sign bit
11111111 11111111 11111111 10000000 = FFFFFF80h
^signed integer
There is another trick to make the code a little shorter when you want to load a value in the range 128-255 as a full 32-bit value, without sign extension:
Standard way:
mov eax,255 ; 5 bytes
Size optimized way:
xor eax,eax ; 2 bytes
mov al,255 ; 2 bytes
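The difference between the two short sequences in this section comes down to sign extension versus zero extension. Here is a minimal Python model of both, with function names invented for this sketch:

```python
def push_imm8_then_pop(byteval):
    """Model of 'push imm8 / pop eax': the byte is sign-extended to 32 bits."""
    byteval &= 0xFF
    if byteval & 0x80:                 # high bit set -> treated as negative
        return 0xFFFFFF00 | byteval
    return byteval

def xor_then_mov_al(byteval):
    """Model of 'xor eax,eax / mov al,imm8': the upper 24 bits stay zero."""
    return byteval & 0xFF

print(hex(push_imm8_then_pop(0x7F)))   # 0x7f
print(hex(push_imm8_then_pop(0x80)))   # 0xffffff80
print(hex(xor_then_mov_al(0xFF)))      # 0xff
```

So the 3-byte push/pop pair is the right choice for values 0-127 (or when the sign-extended result is what you want), and the 4-byte xor/mov al pair covers 128-255.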
The use of error codes returned by functions
This is another of the tricks often overlooked by HLL compilers.
Functions by definition return some values. In the case of WinAPI functions, the returned value is always stored in the eax register.
Depending on the function, the returned value can differ: it could be 0, -1, a file handle, etc.
For example, the CreateFileA function returns -1 in the eax register when the file we wanted to open cannot be accessed.
But another WinAPI function, like CreateIcon, returns 0 in eax if there is an error.
Once we have checked the MSDN documentation, we can use those values to our advantage:
push ...
call LoadBitmapA
The documentation for the LoadBitmapA function says it returns the handle to the bitmap on success and 0 on error.
push ...
call LoadBitmapA
cmp eax,0 ; 83 F8 00
jz @error
The cmp eax,0 instruction takes 3 bytes. Can't we do better? Of course we can, by using logical operations like or or test:
call LoadBitmapA
or eax,eax ; 0B C0
jz @error
or:
call LoadBitmapA
test eax,eax ; 85 C0
jz @error
Both the or and test instructions set the CPU zero flag if the eax register holds 0, which gives us the same result as the cmp eax,0 instruction but with 1 byte less in the output code.
We can optimize it even further by using the xchg instruction:
call LoadBitmapA
xchg eax,ecx ; 1 byte
jecxz @error ; jecxz instruction takes 2 bytes (the same as jxx short range branches)
The jecxz instruction jumps to the provided label if the ecx register is set to 0.
But there is a catch! The instruction is a short conditional branch only: its target must lie within -128 to +127 bytes of the end of the instruction in the compiled code.
So if your destination, in our case the @error label, is further away than that, you will get an error message from the assembler.
Some assemblers, like the old-school TASM, will automatically translate a jecxz whose destination is out of short range to:
call LoadBitmapA
xchg eax,ecx
jecxz @dummy
jmp @next
@dummy:
jmp @error
@next:
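Whether a label is reachable by a short branch such as jecxz can be checked with simple arithmetic. The Python sketch below is an illustration with made-up addresses and a hypothetical helper name. It assumes the usual x86 rule that the 8-bit displacement is relative to the address of the next instruction:

```python
def fits_short_jump(instr_addr, target_addr, instr_len=2):
    """True if target_addr is reachable by a 2-byte short branch (rel8)."""
    rel = target_addr - (instr_addr + instr_len)   # relative to next instruction
    return -128 <= rel <= 127

base = 0x401000                                  # hypothetical address of jecxz
print(fits_short_jump(base, base + 2 + 127))     # True
print(fits_short_jump(base, base + 2 + 128))     # False
print(fits_short_jump(base, base + 2 - 128))     # True
```

When the check fails, the 3-instruction trampoline shown above (or reordering the code so the error handler sits closer) is needed.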
Many WinAPI functions return -1 (0FFFFFFFFh) on error. How can we check for it? The simplest way is of course:
call CreateFileA
cmp eax,-1 ; 83 F8 FF
je @error
We can get the same result using more size-optimized code:
call CreateFileA
inc eax ; if -1 was returned, the inc instruction will set the EAX register to 0
je @error ; and we can detect it with a conditional JE/JZ instruction
dec eax ; if there wasn't an error, restore the originally returned value
In this case, the resulting code will be 1 byte smaller than the one using cmp eax,-1.
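The inc/dec trick relies on 32-bit wraparound: only 0FFFFFFFFh becomes 0 after adding 1. A short Python model of the sequence, with helper names that are assumptions of this sketch, makes that explicit:

```python
MASK = 0xFFFFFFFF

def inc32(value):
    """Model of 'inc eax': returns (new_value, zero_flag)."""
    result = (value + 1) & MASK
    return result, result == 0

def check_handle(eax):
    """Mirror of: inc eax / je @error / dec eax."""
    eax, zf = inc32(eax)
    if zf:
        return "error"
    return (eax - 1) & MASK    # dec eax restores the original value

print(check_handle(0xFFFFFFFF))    # error
print(hex(check_handle(0x1234)))   # 0x1234
```

Any other return value, including 0, passes through unchanged.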
Exchanging CPU registers values
Say you have a value of 4 stored in the eax register and a value of 98 stored in the edx register. How do we exchange those two registers?
We can do it like this:
push eax
push edx
pop eax
pop edx
This takes 4 bytes. We can use a temporary register like this:
mov ebx,eax
mov eax,edx
mov edx,ebx
But this one is even bigger with 6 bytes.
Or we can use this one clever trick using the logical xor instruction:
xor edx,eax
xor eax,edx
xor edx,eax
Still 6 bytes in the output code. But there is one overlooked instruction, rarely used by HLL compilers anymore.
It's called xchg (from eXCHanGe); its size is just 1 byte in the output code and it does just what we need:
xchg eax,edx ; 92h
This is 1 byte in size, but:
xchg edx,esi ; 87h 0D6h
takes 2 bytes. The xchg instruction is encoded in 1 byte only if one of the exchanged registers is eax. Otherwise it's encoded as 2 bytes.
You will find that many other instructions are smaller if you use the eax register, e.g.:
add edi,400000h ; 6 bytes -> 81 C7 00 00 40 00
add eax,400000h ; 5 bytes -> 05 00 00 40 00
So it's the same add instruction, but if eax is used, the output code is 1 byte smaller. Keep that in mind.
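For readers who don't find the XOR exchange from this section obvious, here is a quick Python model using the example values 4 and 98. The function name and the size table are only part of this sketch, and the byte counts are the ones quoted in the text:

```python
def xor_swap(eax, edx):
    """Model of: xor edx,eax / xor eax,edx / xor edx,eax."""
    edx ^= eax
    eax ^= edx
    edx ^= eax
    return eax, edx

print(xor_swap(4, 98))   # (98, 4)

# Sizes of the four exchange variants discussed above, in bytes.
EXCHANGE_SIZES = {
    "push/push/pop/pop": 4,
    "mov via temporary register": 6,
    "triple xor": 6,
    "xchg eax,edx": 1,
}
print(min(EXCHANGE_SIZES, key=EXCHANGE_SIZES.get))   # xchg eax,edx
```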
CPU string instructions
There is a separate set of string instructions in x86 CPUs. They address memory through the esi and edi registers only.
Some of those instructions are rarely used by modern compilers, but they have one advantage for us: the size of the output code.
Let's look at this example. We have a simple loop and after each iteration, we increase the value of the esi pointer by 4.
_loop_label:
...
...
...
add esi,4
loop _loop_label
Easy & simple. But the:
add esi,4 ; 83 C6 04
instruction takes 3 bytes. We can use the string instruction lodsd to make our code shorter; it advances esi in the same way (as a side effect, it also loads eax from memory):
lodsd ; AD = add esi,4
lodsw ; 66 AD = add esi,2
lodsb ; AC = add esi,1
There are 3 variants of this instruction, operating on 32 bit, 16 bit and 8 bit values:
lodsd ; mov eax,dword ptr[esi]
; add esi,4
lodsw ; mov ax,word ptr[esi]
; add esi,2
lodsb ; mov al,byte ptr[esi]
; inc esi
So the optimized loop could look like this:
_loop_label:
...
...
...
lodsd ; mov eax,dword ptr[esi]
; add esi,4
loop _loop_label
So we can use it as a short version of the add esi,4 instruction. Just keep in mind that it reads memory at the address in the esi register (so esi cannot hold an arbitrary value, it must be a valid pointer to some data) and that it overwrites the eax register.
If you need to preserve the value of the eax register, you can do it like this (although at 3 bytes this is no shorter than add esi,4):
_loop_label:
...
push eax
lodsd
pop eax
loop _loop_label
There is also the scasX instruction. It compares the value in the eax register with the value pointed to by the edi register and then increases (if the direction flag DF is 0; clear it with the cld instruction) or decreases (if the direction flag DF is 1; set it with the std instruction) the value of the edi register. It also comes in 3 variants for 32-bit, 16-bit and 8-bit comparisons. In order to use it, you need to make sure the edi register points to a valid data buffer, so again it cannot be an arbitrary value, because that would end with an exception (access violation). Also keep in mind that, being a comparison, it modifies the CPU flags.
So if one of the registers you want to increase is edi, instead of this:
add edi,4 ; 83 C7 04
it's better to use:
scasd ; AF
scasw ; 66 AF
scasb ; AE
and it works like this:
scasd ; cmp eax,dword ptr[edi]
; add edi,4
scasw ; cmp ax,word ptr[edi]
; add edi,2
scasb ; cmp al,byte ptr[edi]
; inc edi
The CPU direction flag decides whether the value of the edi register is increased or decreased:
std ; set DF (Direction Flag), 1 byte
scasd ; cmp eax,dword ptr[edi]
; sub edi,4
Keep in mind that the direction flag (DF) is always clear when the application starts, at least for Windows PE executables, and it's also expected to be clear across any WinAPI function calls.
So if you ever set it with the std instruction, make sure to reset it afterward with cld, otherwise you might end up with hard-to-find bugs in other parts of the application or in OS components.
std ; set DF (Direction Flag), 1 byte
lodsd ; mov eax,dword ptr[esi]
; sub esi,4
...
...
cld ; restore DF to its expected default state
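The pointer movement and side effects of lodsd and scasd can be summarized with a small model. This Python sketch treats memory as a byte buffer and registers as plain integers. The function names and the sample data are assumptions made for the illustration:

```python
import struct

def lodsd(memory, esi, df=0):
    """Model of LODSD: returns (eax, new_esi)."""
    (eax,) = struct.unpack_from("<I", memory, esi)   # little-endian dword
    esi += -4 if df else 4                           # DF=1 moves backwards
    return eax, esi

def scasd(memory, edi, eax, df=0):
    """Model of SCASD: returns (zero_flag, new_edi)."""
    (value,) = struct.unpack_from("<I", memory, edi)
    edi += -4 if df else 4
    return eax == value, edi

data = struct.pack("<3I", 0x11111111, 0x22222222, 0x33333333)

eax, esi = lodsd(data, 0)               # DF clear: forward
print(hex(eax), esi)                    # 0x11111111 4
eax, esi = lodsd(data, 8, df=1)         # DF set: backward
print(hex(eax), esi)                    # 0x33333333 4
print(scasd(data, 4, 0x22222222))       # (True, 8)
```

The model shows why both instructions need a valid pointer: the memory read always happens, even when only the pointer adjustment is wanted.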
The final word
It may seem that all this size optimization doesn't make sense nowadays, but it may come in handy when, for example, you write shellcode or need to patch compiled code using as few instructions as possible and the available space is very modest.
If you want to learn more, you can read my free articles about programming (assembler, C/C++), malware analysis, and reverse engineering.