I wanted to try making my own compiler and runtime environment, so I gave it a shot. It doesn’t cover all the syntax or opcodes yet, but I’m satisfied as long as it can run basic functionality. I’ll make some improvements if I feel like it.
Demo
contract TestContract {
uint256 a = 1 + 2 * 3 + 4;
function increment() returns (uint256) {
a = a + 1;
return a;
}
function compare(uint256 b) returns (bool) {
return a < b;
}
}
When you compile the code, you can get the following bytecode.
7f00000000000000000000000000000000000000000000000000000000000000807f0000000000000000000000000000000000000000000000000000000000000040527f00000000000000000000000000000000000000000000000000000000000000047f00000000000000000000000000000000000000000000000000000000000000037f00000000000000000000000000000000000000000000000000000000000000027f00000000000000000000000000000000000000000000000000000000000000017f000000000000000000000000000000000000000000000000000000000000000b7f0000000000000000000000000000000000000000000000000000000000000000557f000000000000000000000000000000000000000000000000000000000000032f807f00000000000000000000000000000000000000000000000000000000000001515f395ff37f00000000000000000000000000000000000000000000000000000000000000807f0000000000000000000000000000000000000000000000000000000000000040527f00000000000000000000000000000000000000000000000000000000000000e07f0000000000000000000000000000000000000000000000000000000000000000351c7f00000000000000000000000000000000000000000000000000000000d09de08a147f0000000000000000000000000000000000000000000000000000000000000196577f00000000000000000000000000000000000000000000000000000000000000e07f0000000000000000000000000000000000000000000000000000000000000000351c7f00000000000000000000000000000000000000000000000000000000c4b8c257147f0000000000000000000000000000000000000000000000000000000000000283577f00000000000000000000000000000000000000000000000000000000000000007f0000000000000000000000000000000000000000000000000000000000000000fd5b7f00000000000000000000000000000000000000000000000000000000000000017f000000000000000000000000000000000000000000000000000000000000000054017f0000000000000000000000000000000000000000000000000000000000000000557f0000000000000000000000000000000000000000000000000000000000000000547f0000000000000000000000000000000000000000000000000000000000000080527f00000000000000000000000000000000000000000000000000000000000000207f0000000000000000000000000000000000000000000000000000000000000080f35b7f0000000000000000000000000000000000000000000000000000000000000004357f000000000000000000000000000000000000000000000000000000000000000054107f0000000000000000000000000000000000000000000000000000000000000080527f00000000000000000000000000000000000000000000000000000000000000207f0000000000000000000000000000000000000000000000000000000000000080f3
demo is here. (I don't know why, but I can embed gif file on the article)
https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8l5vmjmwwkp2iq85s7gd.gif
In the demo above, the following actions are performed:
- Run the EVM.
- Deploy the compiled bytecode (returns the contract code).
- Execute
compare
with argumentb
as 12 (the initial value ofa
is 11, so it returnstrue
, which is 1). - Execute
increment
(returns 12 in hexadecimal). - Execute
compare
again with argumentb
as 12 (since 12 == 12, it returnsfalse
, which is 0).
BTW, if you make a mistake in the return type as follows:
contract TestContract {
uint256 a = 1 + 2 * 3 + 4;
function compare(uint256 b) returns (uint256) {
return a < b; // Compilation error occurs here because the return type is mismatched
}
}
The error message looks like:
EVM
The EVM is a stack-based virtual machine.
Memory
Pre-allocated Memory
While inspecting compiled bytecode, I noticed that mstore
often takes 0x40
, so I looked into it.
Depending on the version, this might differ, but here is the reference I checked:
https://docs.soliditylang.org/en/v0.8.27/internals/layout_in_memory.html
The first 128 bytes of memory have specific purposes:
- 0x00 - 0x3f (64 bytes): Scratch space for hashing methods.
- 0x40 - 0x5f (32 bytes): Current memory size (referred to as the free memory pointer).
0x60 - 0x7f (32 bytes): Zero slot.
The scratch space is reserved for temporary use in hashing methods. If the allocated space is insufficient, it will use free memory.
The free memory pointer specifies the next available memory address. This means that the first
mstore
operation must store data at0x80
since the first 128 bytes (up to0x7f
) are already reserved. The typical first instruction in a contract is tomstore
0x80
at address0x40
.The zero slot is used as the default value for dynamic arrays. It remains filled with
0x00
and can be directly copied for initialization purposes.
Memory Allocation
Memory is allocated in 32-byte chunks.
Since the first 128 bytes are pre-allocated, the first writable memory address is 0x80
.
Memory expands dynamically when needed. For example, if a 30-byte string is replaced with a 40-byte string, memory will automatically expand from 0x80-0x9f
to 0x80-0xbf
.
However, memory expansion incurs additional gas costs, so it should be managed carefully. Also, memory does not get freed automatically, even when it is no longer in use.
https://learnevm.com/chapters/evm/memory#memory-expansion
If the next memory space is already occupied, new memory is allocated at the end of the current memory.
Storage
Slots
Unlike memory, storage is pre-allocated with 2^256-1
slots, each 32 bytes in size.
For fixed-size data types of 32 bytes or less, data is stored sequentially. For dynamically sized types, the following rules apply:
- The slot stores the length of the data.
- The actual data is stored in the slot calculated by
keccak256(slot)
.
For example, if slot 3
is used:
- Slot
(3)
stores the data length. - The actual data is stored at
keccak256(3)
, e.g.,55b2c6...
. If the data exceeds 32 bytes, subsequent slots after55b2c6...
are used. Arrays follow a similar pattern.
My implementation works only for strings of 32 bytes or less at the moment.
https://docs.alchemy.com/docs/smart-contract-storage-layout
For multidimensional arrays or mappings, slot numbers and key values are factored into the calculations.
https://docs.soliditylang.org/en/latest/internals/layout_in_storage.html
Currently, the implementation maps slots and values using Rust's HashMap
, but in practice, a Merkle Patricia Trie is used.
Function Definitions
Functions are identified using a function selector, derived by taking the first 4 bytes of the keccak256
hash of the function signature.
https://docs.soliditylang.org/en/latest/abi-spec.html
For example, for a function like function mint(uint256)
,
keccak256(mint(uint256)) = 0xa0712d680358d64f694230b7cc0b277c73686bdf768385d55cd7c547d0ffd30e
.
The function selector is 0xa0712d68
.
When a transaction is sent, the first 4 bytes of the calldata
contain the function selector. The contract compares this selector to the stored selector and jumps to the corresponding function if they match.
If you compile a simple contract in Remix and extract the opcodes for the function call, you will see something like this:
- CALLDATALOAD
- PUSH1 0xe0
- SHR
- DUP1
- PUSH4 0x6057361d
- EQ
- PUSH1 0x2a
- JUMPI
Explanation:
-
CALLDATALOAD
,PUSH1 0xe0
, andSHR
: Extract the first 4 bytes of the calldata (the function selector). -
PUSH4 0x6057361d
,EQ
: Compare the extracted function selector with the stored selector. - If they match,
PUSH1 0x2a
andJUMPI
jump to the function (0x2a is the program counter in the bytecode).
Calldata
The first 4 bytes of the calldata represent the function selector. The arguments are stored in 32-byte chunks following the selector.
For simplicity, this assumes all arguments fit within 32 bytes, so the number of arguments multiplied by 32 bytes follows the function selector.
Solidity Compiler
Syntax
Unfortunately, there are no public
or private
keywords in my implementation at the moment.
In fact, there are no if
statements or for
loops either. Additionally, functions cannot call other functions within them.
Types
Not all types are supported at the moment. Currently, only string
, uint256
, and bool
are available.
While string
is defined, it only works if it is 256 bits or smaller.
Constant Folding
Constant folding is a process where constant expressions are precomputed at compile time instead of being evaluated at runtime on the EVM.
For example, in the statement uint256 a = 1 + 2;
, instead of executing 1 + 2
on the EVM, the compiler calculates it during compilation and outputs uint256 a = 3;
.
This is straightforward for simple cases, but it gets tricky when the expression includes unknown values, such as function arguments.
For instance, in uint256 a = 1 + arg + 2;
(where arg
is a function argument), the ideal approach would be to compute uint256 a = 3 + arg;
at compile time and execute only 3 + arg
on the EVM. However, due to complexity, if an unknown value is encountered, the entire expression is left for the EVM to calculate.
Deployment Code
In the EVM, there are two types of bytecode: deployment bytecode and deployed contract bytecode.
The deployed contract bytecode is included within the deployment bytecode and is eventually deployed using the codecopy
opcode (technically, it is deployed at the point of RETURN
).
When compiled with solc
, the bytecode contains additional code before and after the contract code. In toythereum
, the deployment process follows these steps:
- Temporarily store all contract code without compiling functions until the entire contract code is scanned.
- Compile non-function code (e.g., initializing storage variables).
- Add deployment-specific instructions such as
codecopy
(use placeholders for contract code length as it is unknown at this stage). - Compile the functions (this becomes the contract code).
- Calculate the length of the compiled functions and insert the length into the placeholders in the deployment instructions.
As a result, the final bytecode consists of two parts:
- The first part handles initialization (e.g., storage setup) and deployment.
- The second part contains the actual contract bytecode.
Top comments (0)