DEV Community

JayZho
JayZho

Posted on • Edited on

A poor introduction to compilers.(still updating)

What is a compiler

The job of a compiler is to turn source code that's written by humans:

int i = 0;
i = 10 + 20;
Enter fullscreen mode Exit fullscreen mode

into machine code(a bunch of 0s and 1s) that's understood by computers in order to run the program: (say, 8-bit instructions)

01110010
01010111
...
Enter fullscreen mode Exit fullscreen mode

Usually, the compilation process involves converting the source code into intermediate code called Assembly Code first, then further turning the assembly code into machine code.
The second step is relatively straight forward as it's mostly just a 1-to-1 replacement from assembly instruction to its binary representation.
This blog is largely about how we go from source code to assembly code.

From Source Code to Assembly

The big picture of this process is as shown below:
Image description


Step1: Lexical Analysis

-- turn the source code into a bunch of tokens.

In order to discuss lexical analysis, we should first have a look at what a programming language formally is.

Very much like natural languages(e.g. English), a programming language is just a set of strings(called "tokens"), where each string/token is a sequence of symbols allowed in this programming language.
For example, in English, the set of allowed symbols are roughly the 26 alphabets plus the standard English punctuation. Therefore,

How are you?

is a valid English sentence, while

H@w are you?

isn't, since "@" isn't a valid symbol defined in the language.

A programming language normally consists of 5 kinds of tokens:

  1. Identifiers: variable names, function names are both identifiers.
  2. Keywords: reserved words such as "if", "while" and "return".
  3. Operators: +, -, <, >,
  4. Separators: brackets, comma and semicolon.
  5. Literals: source representation of a value

The job of this lexical analysis step is to scan through the source code and produce a set of tokens. For example, for a program that only has one line of code:

int a = 0;
Enter fullscreen mode Exit fullscreen mode

The list of tokens produced after lexical analysis would be something like:

[
  Keyword("int"),
  Identifier("a"),
  Operator("="),
  Literal("0"),
  Separator(";")
]
Enter fullscreen mode Exit fullscreen mode

Step2: Syntax Analysis

-- given a list of tokens, build a tree to represent the structure of this program.

In order for a sentence to make sense, it must be both **syntactically **and **semantically **correct. The syntax of a program deals with the program structure, while the semantics of a program deals with the program logic.
For example, the following sentence would be considered syntactically incorrect in general:

Tom likes.

since the verb "like" is missing its object.
A sentence could also be syntactically correct while being semantically incorrect. The sentence

Tom sees sounds.

obviously doesn't make sense, but it's syntactically correct as it consists of the subject, the verb and the object of a sentence.

The syntax rules of a language are called "Grammar", which defines what the structure of a language could be.

Top comments (0)