Now that the Semantic Analyzer has verified that the code is actually correct and created syntax trees, it’s time to talk about IR generation.
What is IR?
After the Semantic Analyzer, the next step is to turn the trees that it validated and annotated with type information into a representation that is much closer to what the machine will actually execute. This representation is called an intermediate representation (IR). The Xojo compiler, when building for 64-bit or ARM, uses LLVM IR.
The resulting LLVM IR describes the actual control flow of the program and every single thing that will be in the final program. In addition to the user-visible code that gets executed and the implicit conversions that are now in the trees, it also needs to contain all of the hidden, behind-the-scenes work, such as reference counting calls and introspection metadata.
LLVM IR is actually higher level and more abstract than the actual assembly that it’ll end up generating. For example, unlike assembly language, it’s entirely strongly typed and the Xojo compiler has to be very precise in what it generates (which is a good thing!).
IR Code Generation
To get started, here is the abstract syntax tree that was previously created by the Semantic Analyzer:
To generate IR, the compiler walks the tree above depth first until it reaches the leaf nodes. This brings us to the BinaryOperator* for the multiplication on the lower right side of the tree.
The LLVM IR to multiply those two values (mul) looks like this:
%1 = mul i32 2, 4
Now it works backwards through the tree. So the next item is the implicit cast, which has to cast the value that was calculated in the previous command:
%2 = sitofp i32 %1 to double
The sitofp IR command means “Signed Integer to Floating Point”.
Continuing up the tree, the binary operator is next. The compiler can now combine the left-hand side value with the right-hand side value that was just computed. This is the IR to add the values:
%3 = fadd double 3.14, %2
The fadd IR command means “floating point add”.
And continuing up the tree, the implicit cast is next:
%4 = fptosi double %3 to i32
The fptosi IR command means “floating point to signed integer”.
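To make the four instructions so far concrete, here is a small Python sketch that performs the same arithmetic one step at a time. It is only an illustration, not anything the Xojo compiler runs: the function name `trace_expression` is made up for this example, and Python's `float()` and `int()` stand in for `sitofp` and `fptosi` (like `fptosi`, `int()` drops the fractional part and rounds toward zero).

```python
def trace_expression():
    """Mirror the IR instructions one at a time (illustrative only)."""
    product = 2 * 4             # %1 = mul i32 2, 4
    as_double = float(product)  # %2 = sitofp i32 %1 to double
    total = 3.14 + as_double    # %3 = fadd double 3.14, %2
    truncated = int(total)      # %4 = fptosi double %3 to i32 (rounds toward zero)
    return product, as_double, total, truncated


product, as_double, total, truncated = trace_expression()
print(product, as_double, total, truncated)
```

The last value is 11: the fractional part of the sum is discarded by the final conversion, which is what will end up being stored.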
Lastly, we reach the actual assignment (store) with IR that looks like this:
store i32 %4, i32* @sum
Here is the complete IR that gets generated:
%1 = mul i32 2, 4
%2 = sitofp i32 %1 to double
%3 = fadd double 3.14, %2
%4 = fptosi double %3 to i32
store i32 %4, i32* @sum
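The depth-first walk described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Xojo compiler's implementation: the node classes (`Literal`, `Cast`, `BinaryOp`, `Assign`), the `generate_ir` function, and the tree shape are all hypothetical, reconstructed from the IR listed above. Each node is visited only after its children, literals are inlined as operands, and every other node emits one instruction into a fresh numbered register.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Literal:
    value: Union[int, float]
    type: str  # "i32" or "double"


@dataclass
class Cast:
    operand: "Expr"
    to_type: str


@dataclass
class BinaryOp:
    op: str  # "+" or "*"
    left: "Expr"
    right: "Expr"
    type: str


@dataclass
class Assign:
    target: str
    value: "Expr"
    type: str


Expr = Union[Literal, Cast, BinaryOp]

# Opcode lookup tables (only the cases this toy example needs).
CASTS = {("i32", "double"): "sitofp", ("double", "i32"): "fptosi"}
BINOPS = {("*", "i32"): "mul", ("+", "i32"): "add",
          ("*", "double"): "fmul", ("+", "double"): "fadd"}


def generate_ir(assign: Assign) -> List[str]:
    """Emit IR-like text for one assignment using a post-order walk."""
    lines: List[str] = []
    counter = 0

    def fresh() -> str:
        nonlocal counter
        counter += 1
        return f"%{counter}"

    def visit(node: Expr) -> Tuple[str, str]:
        # Returns (operand text, type); children are visited first.
        if isinstance(node, Literal):
            return str(node.value), node.type  # constants are inlined
        if isinstance(node, Cast):
            value, from_type = visit(node.operand)
            reg = fresh()
            opcode = CASTS[(from_type, node.to_type)]
            lines.append(f"{reg} = {opcode} {from_type} {value} to {node.to_type}")
            return reg, node.to_type
        if isinstance(node, BinaryOp):
            lhs, _ = visit(node.left)
            rhs, _ = visit(node.right)
            reg = fresh()
            opcode = BINOPS[(node.op, node.type)]
            lines.append(f"{reg} = {opcode} {node.type} {lhs}, {rhs}")
            return reg, node.type
        raise TypeError(f"unexpected node: {node!r}")

    value, value_type = visit(assign.value)
    lines.append(f"store {value_type} {value}, {assign.type}* @{assign.target}")
    return lines


# Assumed tree for the example: sum = (i32)(3.14 + (double)(2 * 4))
tree = Assign(
    target="sum",
    type="i32",
    value=Cast(
        to_type="i32",
        operand=BinaryOp(
            op="+",
            type="double",
            left=Literal(3.14, "double"),
            right=Cast(
                to_type="double",
                operand=BinaryOp(op="*", type="i32",
                                 left=Literal(2, "i32"),
                                 right=Literal(4, "i32")),
            ),
        ),
    ),
)

for line in generate_ir(tree):
    print(line)
```

Running the sketch on this tree prints the same five lines as the listing above. Note that the left-hand literal 3.14 is visited before the multiplication but emits nothing by itself, which is why `mul` still comes out first.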
Reading through this IR, you should now understand why no one wants to write code by hand at such a low level.
This is the last part of the compiler that is considered to be part of the front end. The rest of the compiler components belong to the back end.
Top comments (2)
Great post, thanks.
I've used LLVM a while ago to create a toy language, it was really fun.
Out of pure curiosity, is there any alternative to LLVM? When it's not the right tool for the job - language construction?
Maybe the GNU tools? But I'm not familiar enough with them to say.