Code is on my GitHub.
Background
I've decided to take another stab at this interpreter. I'll be using Thorsten Ball's book Writing an Interpreter in Go as my primary source, with extra reading material from Bob Nystrom's Crafting Interpreters. I find that Thorsten's book gets me productive faster, while Bob's material is deeper and helps clarify and illuminate what I don't understand.
Thorsten's book defines a language called Monkey that's similar in syntax to JavaScript: curly braces, integer math, no whitespace sensitivity, recursion, the whole nine yards.
For this project I'll be using the OCaml language. OCaml sits at an intersection of academia and large production-scale software; it's used at Jane Street, Docker, Bloomberg, Facebook, and other companies. I plan on redoing this project in Haskell eventually - I missed some Haskell syntax and conventions while writing this chapter. I find inline function composition and application more readable with Haskell's dot notation and $ application than with pipelines.
-- Haskell
next_char . read_char $ lexer
(* OCaml *)
lexer |> read_char |> next_char
Requirements
- Dune 1.11.x
- opam 2.0.3
- OCaml 4.07.1
- Alcotest 0.8.5 (testing framework)
- Fmt 0.8.8 (pretty printer for testing)
Use opam to set up OCaml and install those dependencies. I chose a newer version of Dune, and I went with Alcotest as it seemed relatively popular on the OCaml Discord. It's also quite readable and pleasant to configure.
Project structure
Take a look at this commit for some files you can copy.
ROOT
- dune-project
- monkey.install
- monkey.opam
- src/
- dune
- test/
- dune
The dune files will let you link and build your code and run the tests. We'll update these as we move along, but for now this will get you started.
Defining our first tokens
The first step for Thorsten's interpreter is defining a subset of our tokens. In lexical analysis, a token is a simple data structure that represents a small piece of syntax in our programming language. A token could be a SEMICOLON, a PLUS_SIGN, an IDENTIFIER("x"), an INTEGER(5), or some other representation.
We're going to define our tokens in src/token.ml. Create that file and let's add some code. We're going to create a module and add some basic tokens.
module Token = struct
type token_type =
| ILLEGAL
| EOF
(* Identifiers and literals *)
| IDENT of string
| INT of int
(* Operators *)
| ASSIGN
| PLUS
(* -- Delimiters *)
| COMMA
| SEMICOLON
| LPAREN
| RPAREN
| LBRACE
| RBRACE
(* -- Keywords *)
| FUNCTION
| LET
end
We're also going to add a token_to_string
function so we can print out some tokens for our tests, as well as keep our compiler happy when it inevitably barks at us for unused code.
let token_to_string = function
| ILLEGAL -> "ILLEGAL"
| EOF -> "EOF"
| IDENT a -> "IDENT " ^ a
| INT a -> "INT " ^ string_of_int a
| ASSIGN -> "ASSIGN"
| PLUS -> "PLUS"
| COMMA -> "COMMA"
| SEMICOLON -> "SEMICOLON"
| LPAREN -> "LPAREN"
| RPAREN -> "RPAREN"
| LBRACE -> "LBRACE"
| RBRACE -> "RBRACE"
| FUNCTION -> "FUNCTION"
| LET -> "LET"
Unlike Thorsten's Go implementation, we don't need to keep a list of constants. Rather, we can use OCaml's algebraic data types to represent tokens. In fact, OCaml is commonly used in compilers and similar dev tools for exactly this reason.
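To make the ADT point concrete, here's a small sketch (assuming the constructors and token_to_string above are in scope, e.g. inside token.ml) of how a constructor carries its payload and how pattern matching pulls it back out:
(* Pattern matching recovers a constructor's payload directly. *)
let describe = function
| IDENT name -> "identifier " ^ name
| INT value -> "integer " ^ string_of_int value
| other -> token_to_string other
(* describe (IDENT "x") evaluates to "identifier x" *)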
Thorsten then moves on to creating some tests in order to build his NextChar function. We're going to add similar tests.
We'll need to set up a small module export. Create src/monkey.ml
and add the following code.
module Token = Token
module Lexer = Lexer
Create a file test/test.ml
and let's add a bit of boilerplate.
open Monkey
include Lexer
include Token
let token_testable = Alcotest.testable Token.pretty_print (=)
let test_lexer_delimiters () =
Alcotest.(check (list token_testable))
"same token types" [
Token.ASSIGN
; Token.PLUS
; Token.LPAREN
; Token.RPAREN
; Token.LBRACE
; Token.RBRACE
; Token.COMMA
; Token.SEMICOLON
; Token.EOF
]
(Lexer.generate_tokens "=+(){},;")
let () =
Alcotest.run "Lexer"
[
( "list-delimiters",
[ Alcotest.test_case "first case" `Slow test_lexer_delimiters ] );
]
See that Token.pretty_print function? We'll need to define a pretty-printer that formats our Token. Let's go back into src/token.ml.
let pretty_print ppf tok = Fmt.pf ppf "Token %s" (token_to_string tok)
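If you want to see what that formatter produces, a quick hypothetical check (not part of the project) can render a token to a string with Format.asprintf:
(* Renders the token through the pretty-printer; prints "Token SEMICOLON". *)
let () = print_endline (Format.asprintf "%a" Token.pretty_print Token.SEMICOLON)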
We can look at Alcotest's source code to dig into what this code does. Effectively, for any type a, Alcotest.testable takes a function that formats an a and an equality function that takes two a values and returns a boolean. You can look at the implementations of int32 or string to get a hint of how that works. Our token_testable is a testable for Token.token_type, our defined set of token tags. If we were to run dune runtest we'd get an error because we haven't yet implemented our lexer. So let's do that.
Setting up the Lexer
Let's create src/lexer.ml.
(* lexer *)
module Lexer = struct
include Token
type lexer = {
input : string;
position : int;
read_position : int;
ch : char;
}
let null_byte = '\x00'
let new_lexer input_string =
{
input = input_string;
position = 0;
read_position = 0;
ch = null_byte;
}
end
I chose to use a null_byte instead of an OCaml option type because, in my opinion, an option represents a potential gap, whereas we know we won't have gaps in our source code. We'll simply read the string until we run out of string. The end of our source code isn't undefined; it's a known value.
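For contrast, the option-based record I decided against might look roughly like this (a sketch, not code we'll use):
(* Hypothetical alternative: None would mark "past the end of input". *)
type lexer_with_option = {
input : string;
position : int;
read_position : int;
ch : char option;
}
Every match on ch would then need a Some/None case, which is extra noise when the only "gap" we ever hit is the end of the input.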
We need to define our read_char
function. In a lexer, reading a character is simply incrementing our pointer in our lexer and looking at the next character in the input string.
let read_char lexer =
let read_to_end = lexer.read_position >= String.length(lexer.input) in
let new_ch = if read_to_end then null_byte else String.get lexer.input lexer.read_position
in {lexer with position = lexer.read_position; read_position = lexer.read_position + 1; ch = new_ch}
OCaml is immutable by default, so we can't just adjust our lexer. Or rather, we can, but why do that when we can avoid mutable state? We check whether we're at the end of the string. If we are, the character is a null byte; otherwise we get the next character in the string and update our lexer. To initialize our lexer, we can use read_char. We'll update new_lexer to look like this:
let new_lexer input_string =
let lexer = {
input = input_string;
position = 0;
read_position = 0;
ch = null_byte;
} in
read_char lexer
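As a quick sanity check, a fresh lexer now has its first character loaded and its read pointer one step ahead (the values follow directly from the record updates above):
(* After new_lexer "=+": ch = '=', position = 0, read_position = 1 *)
let l = new_lexer "=+"
let () = assert (l.ch = '=' && l.position = 0 && l.read_position = 1)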
Awesome! Our test code is still calling Lexer.generate_tokens
so we'll need to continue working on this.
let next_char lexer = match lexer.ch with
| '=' -> (read_char lexer, Token.ASSIGN)
| ';' -> (read_char lexer, Token.SEMICOLON)
| '(' -> (read_char lexer, Token.LPAREN)
| ')' -> (read_char lexer, Token.RPAREN)
| ',' -> (read_char lexer, Token.COMMA)
| '+' -> (read_char lexer, Token.PLUS)
| '{' -> (read_char lexer, Token.LBRACE)
| '}' -> (read_char lexer, Token.RBRACE)
| '\x00' -> (lexer, Token.EOF)
| _ -> failwith "unmatched character"
let generate_tokens input_string =
let lexer = new_lexer input_string in
let rec gen lxr tokens =
match next_char lxr with
| (_, Token.EOF) -> List.rev_append tokens [Token.EOF]
| (l, tok) -> gen l (tok :: tokens)
in gen lexer []
This code matches on our known inputs and returns a tuple of an updated lexer and the token it just read. That tuple lets us collect tokens and keep calling next_char inside our generate_tokens function until we run out of input.
generate_tokens defines a recursive helper gen that calls next_char on a lexer. For any non-EOF token, it adds the token to the list of seen tokens and calls gen again on the new lexer. Once it finds an EOF token, it rev_appends the collected tokens onto the EOF.
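Tracing a small input through generate_tokens makes the flow easier to see (a sketch; the expected list follows directly from the definitions above):
(* "=;" yields the two matched tokens, terminated by EOF. *)
let tokens = Lexer.generate_tokens "=;"
(* tokens = [Token.ASSIGN; Token.SEMICOLON; Token.EOF] *)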
Side note: rev_append
reverses the first list and concatenates it to the second. Because we're using the OCaml cons
operator in the non-EOF case, we're actually constructing a list in reverse order so we need to flip it at the end. The reason for this is that cons
is constant time, whereas append
is linear time proportional to the first list (the operation has to traverse the entire list before adding the elements of the second list). So it's easier to reverse at the end (linear time once) instead of appending (linear time for every token).
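A one-line example illustrates the behaviour (token values chosen arbitrarily):
(* The accumulated list is in reverse order; rev_append flips it while appending. *)
let example = List.rev_append [ Token.PLUS; Token.ASSIGN ] [ Token.EOF ]
(* example = [Token.ASSIGN; Token.PLUS; Token.EOF] *)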
If we run dune runtest
we get passing tests!
Up next
The next chapter will take care of adding tokens for real source code.