Aitor

Posted on Nov 24, 2020

How to create raw bytecode in python

#bytecode #python #tutorial #programming

Introduction

Well, I originally started writing this on github.io, but then a group of virtual gangsters beat me up and made me realize there was no point in using that when this fantastic website exists. So here it is: as the title says, this is a tutorial on raw Python bytecode. I hope you enjoy it (because there is a second part...).

Prerequisites

Basic knowledge of Python
Know what a bytes object is
Know the concept of a stack

What is Python?

Python is a multi-paradigm, interpreted programming language. It supports polymorphism, object-oriented programming (OOP) and imperative programming.

How does it work?

Python, as already mentioned, is an interpreted language. This means your code passes through an interpreter that connects what you write with what the computer is going to do. Python does not generate machine code the way a C or C++ program does. Instead it works more or less like Java: the source code is first translated into bytecode, and a virtual machine then interprets that bytecode. The default implementation is CPython, which is responsible for executing the bytecode on your computer. Here we are not going to use compilers to machine code; we are going to deal with language implementations, basically interpreters that interpret (forgive the redundancy) the written code after translating it into bytecode. There is a wide variety of these, e.g. IronPython (a C# implementation), Jython (a Java implementation) and MicroPython (a C implementation optimized to run on microcontrollers).
Here is a schematic of how Python works and the steps that the interpreter takes to run the code that you wrote.
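
The two steps described above (translate the source into bytecode, then let the virtual machine run it) can be observed from Python itself. This is only a minimal sketch: the source string and the "<example>" filename are made up for illustration, while compile() and exec() are the standard built-ins.

```python
# Step 1: translate source text into a code object (this is where bytecode is produced).
source = "result = 2 + 3"
code_obj = compile(source, "<example>", "exec")  # "<example>" is a placeholder filename

# The raw bytecode lives in co_code as a bytes object.
raw_bytecode = code_obj.co_code
print(type(raw_bytecode), len(raw_bytecode))

# Step 2: hand the code object to the interpreter, which executes the bytecode.
namespace = {}
exec(code_obj, namespace)
print(namespace["result"])
```

The exact bytes in co_code depend on the Python version, which is why the sketch prints only their type and length.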

How to create USABLE bytecode

Well, we have two things to work with. First, stripped bytecode, that is, raw bytes representing opcodes and their arguments. Second, CodeType, a data type in Python that lets us wrap those bytes into bytecode that is SUITABLE AND USABLE. Also, to assemble you have to know how to disassemble, so we are going to use the dis module, which is used to disassemble functions, files and code.

import dis

def sum(x, y):
    return x + y

dis.dis(sum)

The output of that snippet of code is as follows:

1. 4   0 LOAD_FAST    0 (x)
2.     2 LOAD_FAST    1 (y)
3.     4 BINARY_ADD
4.     6 RETURN_VALUE

As we can see, all of that is bytecode. Now for the explanation.

As you may have noticed, I numbered the lines in the output to make this explanation easier.
Each instruction in Python has a specific opcode (operation code). In this case there are three: LOAD_FAST, BINARY_ADD and RETURN_VALUE. Here is what each one does.

LOAD_FAST: Pushes a local variable onto the top of the stack (TOS, Top Of Stack).
BINARY_ADD: Pops the two values at the top of the stack, adds them, and pushes the result back onto the stack.
RETURN_VALUE: Returns the value that is at TOS to the caller.
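
To make the stack behaviour of these three opcodes concrete, here is a toy interpreter. It is not how CPython is implemented; it is just a sketch that uses a Python list as the stack and (opcode name, argument) pairs as instructions. The names run and program are made up for this example.

```python
def run(instructions, local_vars):
    """Evaluate (opname, argument) pairs using a plain list as the stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            # Push the local variable stored at index `arg`.
            stack.append(local_vars[arg])
        elif opname == "BINARY_ADD":
            # Pop two values, add them, push the single result.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            # Hand the value at the top of the stack back to the caller.
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")

# The same four instructions that dis printed for x + y.
program = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
toy_result = run(program, [213, 3])
print(toy_result)  # 216
```

Note that the stack never holds more than two values at once, which is where the stacksize of 2 used later comes from.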

Well, now that we've explained the opcodes, we can get an idea of how our code works internally, but there are still doubts, annoying but necessary doubts, like these: "What is the 4 on the left side, at the beginning of the first line?", "What are the numbers to the left of the opcodes?", "Why does a 0 appear to the right of LOAD_FAST? And the 1?", "Wouldn't we want to load x and y to add them, instead of 0 and 1?".

Well, I will answer in order.

The 4 is the source line number that the disassembled instructions come from.
The numbers to the left of the opcodes are the byte offsets of each instruction within the bytecode.
The 0 and the 1 are indexes. The local variable names of the code are stored in an array (co_varnames), and the instruction stores the index instead of the name. The dis module shows which variable the index refers to in parentheses (hence the 0 (x) and 1 (y)).
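
These answers can be checked directly on a function object. The sketch below uses dis.get_instructions, which yields the same information dis.dis prints, one record per instruction. The function name add is just a stand-in for the sum function above, and the opcode names you see may differ on newer Python versions.

```python
import dis

def add(x, y):
    return x + y

# The array that LOAD_FAST indexes into: index 0 is x, index 1 is y.
print(add.__code__.co_varnames)

# The line where the function starts in its source file.
print(add.__code__.co_firstlineno)

# offset = position in the bytecode, arg = raw index, argrepr = what dis shows in parentheses.
for instr in dis.get_instructions(add):
    print(instr.offset, instr.opname, instr.arg, instr.argrepr)
```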

How do we re-create our function to make it bytecode?

Well, the first thing we do is import CodeType and FunctionType (to turn the code object into a function) from the types module (https://docs.python.org/3/library/types.html#module-types).

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

After this, we are going to create our code object:

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

# This will be explained later, these are flags
CO_OPTIMIZED = 0x0001
CO_NEWLOCALS = 0x0002
CO_NOFREE = 0x0040

# CodeType signature as of Python 3.7 (the version used in this post);
# later versions take different arguments.
my_code = CodeType(
    2,  # argcount
    0,  # kwonlyargcount
    2,  # nlocals
    2,  # stacksize
    (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE),  # flags
    bytes([124, 0, 124, 1, 23, 0, 83, 0]),  # codestring
    (0,),  # constants
    (),  # names (globals and attributes used by the bytecode)
    ('x', 'y',),  # variable names (varnames)
    'blog_no_name',  # filename
    'crafted_sum',  # name (code name / function)
    9,  # firstlineno (first line where this code appears)
    b'',  # lnotab
    (),  # freevars
    (),  # cellvars
)

_sum = FunctionType(my_code, {})
result = _sum(213, 3)
print(result)

# Expected output
# 216

Well, well... Many new things appear. Let's go through these arguments right now.

CodeType: argcount, kwonlyargcount, nlocals, stacksize, flags, codestring, constants, names, varnames, filename, name, firstlineno, lnotab, freevars, cellvars

Argument	Description
argcount	Number of positional arguments (in this case 2, x and y)
kwonlyargcount	Number of keyword-only arguments
nlocals	Number of local variables (in this case 2, x and y)
stacksize	Maximum number of stack slots the code needs (in this case 2, because x and y are both on the stack right before BINARY_ADD)
flags	Bit flags that determine some properties of the bytecode; you can use this reference as a guide. We are going to delve into flags in a more advanced tutorial.
codestring	A bytes object containing the instruction sequence; here 124 means LOAD_FAST, 23 means BINARY_ADD and 83 means RETURN_VALUE
constants	A tuple with the values of the constants used by the code (such as integers, strings, None, False, True ...)
names	A tuple containing the names the bytecode refers to by index (global variables, attributes, built-in functions ...)
varnames	A tuple containing the names of the local variables, arguments first
filename	This string represents the name of the file; when the code does not come from a file it can be any string
name	Name of the code object or function
firstlineno	The first source line of the code; relevant if the code comes from a file, otherwise it can be any integer
lnotab	A mapping between bytecode offsets and source line numbers; if you are not interested in line information, you can use `b''`
freevars	Names of variables captured from an enclosing scope; they are used in closures, and I will explain them in an advanced tutorial
cellvars	Names of local variables that are referenced by nested functions (closures)
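
Every argument in the table has a matching co_* attribute on a real code object, so a good way to find sensible values is to read them from a function the compiler already built. The sketch below does that for a stand-in function named add (my name, not part of the original example). lnotab is left out because newer Python versions deprecate the co_lnotab attribute.

```python
import inspect

def add(x, y):
    return x + y

code = add.__code__

# Map each CodeType argument name from the table to the attribute that stores it.
fields = {
    "argcount": code.co_argcount,
    "kwonlyargcount": code.co_kwonlyargcount,
    "nlocals": code.co_nlocals,
    "stacksize": code.co_stacksize,
    "flags": code.co_flags,
    "codestring": code.co_code,
    "constants": code.co_consts,
    "names": code.co_names,
    "varnames": code.co_varnames,
    "filename": code.co_filename,
    "name": code.co_name,
    "firstlineno": code.co_firstlineno,
    "freevars": code.co_freevars,
    "cellvars": code.co_cellvars,
}
for label, value in fields.items():
    print(f"{label:15} {value!r}")

# flags is a bit field: test individual flags with the & operator.
has_optimized = bool(code.co_flags & inspect.CO_OPTIMIZED)
has_newlocals = bool(code.co_flags & inspect.CO_NEWLOCALS)
print(has_optimized, has_newlocals)
```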

Two last things to note before moving on to FunctionType. The first is that the number that follows each opcode (e.g. the 0 in [124, 0, ...]) is its argument. The second is that bytecode can vary from version to version. To find out what the codestring looks like in your version, you can use the following snippet:

def sum(x, y):
    return x + y

print(sum.__code__.co_code)

# Expected output in Python 3.7.9 (the version I use)
# b'|\x00|\x01\x17\x00S\x00'
# Bytes that map to printable ASCII are shown as characters, which is why 124 appears as | (chr(124) is '|').
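
Since the numbers change between versions, it is safer to look them up than to hard-code them. The dis module exposes two tables for this: dis.opmap (name to number) and dis.opname (number to name). A small sketch, where opcode_numbers is just a name chosen for this example; on some newer versions BINARY_ADD no longer exists, so the lookup uses .get() and may print None for it.

```python
import dis
import sys

print("Python", sys.version_info[:2])

# Look up the numeric opcode for each name in the running interpreter.
opcode_numbers = {
    name: dis.opmap.get(name)  # None if this version has no such opcode
    for name in ("LOAD_FAST", "BINARY_ADD", "RETURN_VALUE")
}
for name, number in opcode_numbers.items():
    print(name, number)

# The reverse direction: turn a number back into its name.
load_fast = opcode_numbers["LOAD_FAST"]
print(dis.opname[load_fast])
```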

"Crafting" the function

We are going to use FunctionType now.
FunctionType: code, globals, name, argdefs, closure

Argument	Description
code	Code object (that is, a CodeType instance)
globals	A dictionary containing the globals in the form `{"Name": Value}`; that way, Name becomes an identifier and is accessed from the code as if it were a global variable
name (Optional)	Overrides the name stored in the code object
argdefs (Optional)	A tuple that specifies the values of the default arguments
closure (Optional)	A tuple that supplies the bindings for the freevars
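
Here is a short sketch of the first four arguments in action. To keep it independent of the Python version, it reuses the code object of an ordinary function instead of hand-written bytes. The names add_offset, offset and rebuilt_add are made up for this example.

```python
from types import FunctionType

def add_offset(x, y):
    # "offset" is not a local, so it is looked up in the function's globals.
    return x + y + offset

rebuilt = FunctionType(
    add_offset.__code__,  # code: the code object to wrap
    {"offset": 10},       # globals: makes the name "offset" available to the code
    "rebuilt_add",        # name: overrides the name taken from the code object
    (5,),                 # argdefs: default value for the last argument, y
)

print(rebuilt.__name__)  # rebuilt_add
print(rebuilt(1))        # 1 + 5 + 10 = 16
print(rebuilt(1, 2))     # 1 + 2 + 10 = 13
```

The closure argument is left out here because the code object has no freevars.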

Well, once this is clear, all that is left is to create a FunctionType from our code object (my_code) and call it.

import dis
from types import CodeType, FunctionType

def sum(x, y):
    return x + y

# This I will explain later, they are flags
CO_OPTIMIZED = 0x0001
CO_NEWLOCALS = 0x0002
CO_NOFREE = 0x0040

# CodeType signature as of Python 3.7 (the version used in this post);
# later versions take different arguments.
my_code = CodeType(
    2,  # argcount
    0,  # kwonlyargcount
    2,  # nlocals
    2,  # stacksize
    (CO_OPTIMIZED | CO_NEWLOCALS | CO_NOFREE),  # flags
    bytes([124, 0, 124, 1, 23, 0, 83, 0]),  # codestring
    (0,),  # constants
    (),  # names (globals and attributes used by the bytecode)
    ('x', 'y',),  # variable names (varnames)
    'blog_no_name',  # filename
    'crafted_sum',  # name (code / function name)
    9,  # firstlineno (first line where this code appears)
    b'',  # lnotab
    (),  # freevars
    (),  # cellvars
)

_sum = FunctionType(my_code, {})
result = _sum(213, 3)
print(result)

# Expected output
# 216
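
The listing above follows the CodeType signature of Python 3.7. From Python 3.8 on, the constructor takes an extra posonlyargcount argument, and later versions changed it further, so the positional call will fail there. If you are on 3.8 or newer, one way around this (a sketch, not the method used in this post) is to copy an existing code object with its replace() method and override only the fields you care about. The names add, crafted_code and crafted are chosen for this example.

```python
from types import FunctionType

def add(x, y):
    return x + y

# replace() (Python 3.8+) returns a copy of the code object with the given fields changed.
crafted_code = add.__code__.replace(
    co_name="crafted_sum",
    co_filename="blog_no_name",
)

# Wrap the code object in a function, exactly as with a hand-built CodeType.
crafted = FunctionType(crafted_code, {})
print(crafted(213, 3))  # 216
```

The bytecode itself is left untouched here; co_code can be overridden the same way, but then the bytes have to match the opcodes of the running version.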

That is all for now. Later I will upload another tutorial explaining closures.
