DEV Community

Jason Barr
Jason Barr

Posted on • Edited on

Create Your Own Programming Language 2: Primitives

In the last post we added compiling and evaluating numbers to the Wanda Programming Language. Now we're going to add the other primitive types: strings, booleans, keywords, and nil.

If you haven't seen the previous post, where we scaffold the compiler and add numbers to the language, you can find it here.

As a reminder, strings are delimited with double quotes, and keywords are valid identifiers that start with a colon:

"hello"
:this-is-a-keyword
true
nil
Enter fullscreen mode Exit fullscreen mode

We're also going to make it so you can pretty-print the AST in the REPL instead of evaluating the expression. This can help with debugging later. If you're not getting the results you expect from your Wanda program, you can print the AST and make sure that what the compiler's getting is the same thing you expect it to get.

High Level Overview of The Changes

Most of the work in this article is going to be done in the lexer. Tokenizing strings turns out to be a lot of work, for one thing.

We'll also make some changes to the reader, the parser, the emitter, and the printer as well as adding a function to print the AST. We'll also make some changes to the CLI so you can start the REPL in AST printing mode if you want.

We'll also add a dedicated file for running the CLI so you can add a wanda command to your terminal if you wish.

It may not seem like it takes as much work to get other primitives up and running like it did for numbers, but that's because there was a lot of scaffolding we had to do for the compiler as a whole just to make numbers work. Some of the articles will be like that: they'll feel like we didn't do a whole lot, but now we have at least one new language feature to show for it. Others will take a lot of work and not feel like much has changed at the end of the day. That's just how it goes sometimes!

The Lexer

The first thing we'll add to the lexer is new token types for the new primitives, in src/lexer/TokenTypes.js:

export const TokenTypes = {
  Number: "Number", // still there!
  String: "String",
  Boolean: "Boolean",
  Keyword: "Keyword",
  Nil: "Nil",
};
Enter fullscreen mode Exit fullscreen mode

We're also going to add new character matchers for double-quotes, colons, valid identifier starting characters, and valid identifier characters. Plus string matchers for booleans and nil. In src/lexer/utils.js:

// after isPlus...
export const isDoubleQuote = (ch) => /"/.test(ch);
export const isColon = (ch) => /:/.test(ch);
export const isSymbolStart = (ch) => /[=<>%:|?\\/*\p{L}_$!+-]/u.test(ch);
export const isSymbolChar = (ch) =>
  /[:=@~<>%:&|?\\/^*&#'\p{L}\p{N}_$!+-]/u.test(ch);

// after isNumber...
export const isBoolean = (str) => /true|false/.test(str);
export const isNil = (str) => /nil/.test(str);
Enter fullscreen mode Exit fullscreen mode

Now in the Lexer class we'll add new if/else clauses to the main lexer loop for the new primitives in src/lexer/Lexer.js:

// after the case for else if (isDigit(ch)) {
// ...
      } else if (isDoubleQuote(ch)) {
        tokens.push(this.readString());
      } else if (isColon(ch)) {
        tokens.push(this.readKeyword());
      } else if (isSymbolStart(ch)) {
        tokens.push(this.readSymbol());
      } else {
        // continue with error for unrecognized syntax
Enter fullscreen mode Exit fullscreen mode

This means we'll need methods for readString, readKeyword, and readSymbol. If you want to have the same as what I've got, I'm putting the methods in alphabetical order.

First, readString since it's going to end up being the most complicated one:

  readString() {
    let { pos, line, col, file } = this.input;
    const srcloc = SrcLoc.new(pos, line, col, file);
    let str = this.input.next(); // collect opening double-quote

    str += this.readEscaped();
    return Token.new(TokenTypes.String, str, srcloc);
  }
Enter fullscreen mode Exit fullscreen mode

No, it doesn't look too complicated... yet... but now we need to create a method called readEscaped.

readEscaped will take the string in, one character at a time. If it encounters a backslash, it recognizes it as the start of an escape sequence so it looks ahead and uses the readEscapeSequence method to read in the escaped character. If it hits an unexpected \n in a string literal it throws an error, since Wanda strings can only be single line. Also, if it encounters a backtick as part of the string it escapes it for the emitter, since we're going to be using backtick strings in the compiled output. This eliminates some issues we might otherwise have had with newline characters in double-quoted string literals in the output JavaScript.

Note that reading the escape characters in the lexer is a design decision. We could just as well have done them in the reader or parser. Since handling them now is the simplest way to do it, we'll do that.

Here's the readEscaped method:

  readEscaped() {
    let str = "";
    let escaped = false;
    let ended = false;

    while (!this.input.eof()) {
      let ch = this.input.next();

      if (escaped) {
        str += this.readEscapeSequence(ch);
        escaped = false;
      } else if (ch === "\\") {
        escaped = true;
      } else if (isDoubleQuote(ch)) {
        ended = true;
        str += ch;
        break;
      } else if (ch === "\n") {
        throw new Exception(
          "Unexpected newline in nonterminated single-line string literal"
        );
      } else if (ch === "`") {
        str += "\\`";
      } else {
        str += ch;
      }
    }

    if (!ended && this.input.eof()) {
      throw new Exception(
        "Expected double quote to close string literal; got EOF"
      );
    }

    return str;
  }
Enter fullscreen mode Exit fullscreen mode

and readEscapeSequence:

  readEscapeSequence(c) {
    let str = "";
    let seq = "";

    if (c === "n") {
      str += "\n";
    } else if (c === "b") {
      str += "\b";
    } else if (c === "f") {
      str += "\f";
    } else if (c === "r") {
      str += "\r";
    } else if (c === "t") {
      str += "\t";
    } else if (c === "v") {
      str += "\v";
    } else if (c === "0") {
      str += "\0";
    } else if (c === "'") {
      str += "'";
    } else if (c === '"') {
      str += '"';
    } else if (c === "\\") {
      str += "\\";
    } else if (c === "u" || c === "U") {
      // is Unicode escape sequence
      seq += this.input.readWhile(isHexDigit);
      str += String.fromCodePoint(parseInt(seq, 16));
    }

    return str;
  }
Enter fullscreen mode Exit fullscreen mode

Note that this will read Unicode escape characters in the format \uHHHH with any number of hex numbers necessary to complete a code point.

That's lexing strings. Keywords are considerably simpler:

// between readEscapeSequence and readNumber
  readKeyword() {
    let { pos, line, col, file } = this.input;
    const srcloc = SrcLoc.new(pos, line, col, file);
    const kw = this.input.next() + this.input.readWhile(isSymbolChar);

    return Token.new(TokenTypes.Keyword, kw, srcloc);
  }
Enter fullscreen mode Exit fullscreen mode

We don't need to distinguish here between valid identifier starting characters and valid identifier body characters, since every valid identifier starting character is also a valid identifier body character. We'll use the same trick in the readSymbol method:

// below readString
  readSymbol() {
    let { pos, line, col, file } = this.input;
    const srcloc = SrcLoc.new(pos, line, col, file);
    const sym = this.input.readWhile(isSymbolChar);

    if (isBoolean(sym)) {
      return Token.new(TokenTypes.Boolean, sym, srcloc);
    } else if (isNil(sym)) {
      return Token.new(TokenTypes.Nil, sym, srcloc);
    }

    // Throw for now, since we haven't implemented symbols yet
    throw new SyntaxException(sym, srcloc);
  }
Enter fullscreen mode Exit fullscreen mode

Note that we're just using the readSymbol method to read primitive literals for booleans and nil for now. Soon we'll use it to read identifiers as well, at which point we'll remove the exception at the end of the method.

The Reader

The changes to the reader are pretty simple: we just add cases to the readAtom function to handle the new token types. Just like with numbers, the reader handles the other primitive tokens themselves as the forms for those values.

In src/reader/read.js, here's the new readAtom function:

const readAtom = (reader) => {
  const tok = reader.peek();

  switch (tok.type) {
    case TokenTypes.Number:
      reader.skip();
      return tok;
    case TokenTypes.String:
      reader.skip();
      return tok;
    case TokenTypes.Boolean:
      reader.skip();
      return tok;
    case TokenTypes.Keyword:
      reader.skip();
      return tok;
    case TokenTypes.Nil:
      reader.skip();
      return tok;
    default:
      throw new SyntaxException(tok.value, tok.srcloc);
  }
};
Enter fullscreen mode Exit fullscreen mode

The AST and Parser

First, we add new node types to the ASTTypes enum in src/parser/ast.js:

export const ASTTypes = {
  Program: "Program",
  NumberLiteral: "NumberLiteral",
  StringLiteral: "StringLiteral",
  BooleanLiteral: "BooleanLiteral",
  KeywordLiteral: "KeywordLiteral",
  NilLiteral: "NilLiteral",
};
Enter fullscreen mode Exit fullscreen mode

Next, we add new constructor methods to the AST object for the new primitives:

export const AST = {
  // Program and NumberLiteral constructors
StringLiteral(token) {
    return {
      type: ASTTypes.StringLiteral,
      value: token.value,
      srcloc: token.srcloc,
    };
  },
  BooleanLiteral(token) {
    return {
      type: ASTTypes.BooleanLiteral,
      value: token.value,
      srcloc: token.srcloc,
    };
  },
  KeywordLiteral(token) {
    return {
      type: ASTTypes.KeywordLiteral,
      value: token.value,
      srcloc: token.srcloc,
    };
  },
  NilLiteral(token) {
    return {
      type: ASTTypes.NilLiteral,
      value: token.value,
      srcloc: token.srcloc,
    };
  },
}
Enter fullscreen mode Exit fullscreen mode

Now in src/parser/parse.js we make changes to parsePrimitive that are very similar to what we did in the reader:

const parsePrimitive = (reader) => {
  const token = reader.peek();

  switch (token.type) {
    case TokenTypes.Number:
      reader.skip();
      return AST.NumberLiteral(token);
    case TokenTypes.String:
      reader.skip();
      return AST.StringLiteral(token);
    case TokenTypes.Boolean:
      reader.skip();
      return AST.BooleanLiteral(token);
    case TokenTypes.Keyword:
      reader.skip();
      return AST.KeywordLiteral(token);
    case TokenTypes.Nil:
      reader.skip();
      return AST.NilLiteral(token);
    default:
      throw new SyntaxException(token.value, token.srcloc);
  }
};
Enter fullscreen mode Exit fullscreen mode

The Emitter

Now we need to emit code for the new primitive forms. In src/emitter/Emitter.js we'll need to add emitString, emitBoolean, emitKeyword, and emitNil methods to the Emitter class. We'll also add cases for each primitive node to the switch in the emit method.

Here's the new emit method:

  emit(node = this.program) {
    switch (node.type) {
      case ASTTypes.Program:
        return this.emitProgram(node);
      case ASTTypes.NumberLiteral:
        return this.emitNumber(node);
      case ASTTypes.StringLiteral:
        return this.emitString(node);
      case ASTTypes.BooleanLiteral:
        return this.emitBoolean(node);
      case ASTTypes.KeywordLiteral:
        return this.emitKeyword(node);
      case ASTTypes.NilLiteral:
        return this.emitNil(node);
      default:
        throw new SyntaxException(node.type, node.srcloc);
    }
  }
Enter fullscreen mode Exit fullscreen mode

The emitBoolean method is about as simple as it can get:

  emitBoolean(node) {
    return node.value;
  }
Enter fullscreen mode Exit fullscreen mode

Since keywords are JavaScript symbols under the hood, we'll need to use the Symbol.for constructor in emitting keywords (Note that we're emitting a string containing the "Symbol.for" call, not the constructor call itself):

  emitKeyword(node) {
    return `Symbol.for("${node.value}")`;
  }
Enter fullscreen mode Exit fullscreen mode

The emitNil method just returns the string "null":

  emitNil(node) {
    return "null";
  }
Enter fullscreen mode Exit fullscreen mode

Finally, emitString slices off the opening and closing double quotes and wraps the value in backticks to keep certain escape sequences from causing weird problems when evaluating the result:

  emitString(node) {
    return "`" + node.value.slice(1, -1) + "`";
  }
Enter fullscreen mode Exit fullscreen mode

The Printer

Now that we have some new values besides numbers that will be output by the EVAL function in the CLI, we'll need to handle them in the printer. Here's the new printString function in src/printer/printString.js:

export const printString = (value, withQuotes) => {
  switch (typeof value) {
    case "number":
      return String(value);
    case "string":
      return withQuotes ? `"${value}"` : value;
    case "symbol":
      return value.description;
    case "boolean":
      return String(value);
    case "undefined":
      return "nil";
    case "object":
      if (value === null) {
        return "nil";
      }
    default:
      throw new Exception(`Invalid print value ${value}`);
  }
};
Enter fullscreen mode Exit fullscreen mode

Note that since both null and undefined are represented as nil in Wanda we have to handle both cases in the printer. Also, currently null is the only "object" type value that will be produced by EVAL for now.

We're also going to implement an AST printer so we can use it for debugging purposes if needed. Create a new file printAST.js in src/printer and import the ASTTypes enum:

import { ASTTypes } from "../parser/ast.js";
Enter fullscreen mode Exit fullscreen mode

We're going to do the AST printer as a class that implements the Visitor pattern then export a functional interface to it, just like the code emitter. First, the constructors:

class ASTPrinter {
  constructor(program) {
    this.program = program;
  }

  static new(program) {
    return new ASTPrinter(program);
  }
}
Enter fullscreen mode Exit fullscreen mode

We'll need a print method to dispatch on the AST node type, and individual methods to handle those nodes. Since all our nodes except Program are primitives, we can actually handle all of those with a single method, printPrimitive:

  printPrimitive(node, indent) {
    return `${" ".repeat(indent)}${node.type}: ${
      node.type === "NilLiteral" ? "nil" : node.value
    }`;
  }
Enter fullscreen mode Exit fullscreen mode

Note the indent property, which we'll eventually use to show how nodes are nested inside other nodes once we have compound forms that aren't just primitives.

We also need a method to handle the Program node:

  printProgram(node, indent) {
    let pStr = "";

    for (let n of node.body) {
      pStr += this.print(n, indent);
      +"\n";
    }

    return pStr;
  }
Enter fullscreen mode Exit fullscreen mode

Finally, the print method to dispatch based on the node type:

  print(node = this.program, indent = 0) {
    switch (node.type) {
      case ASTTypes.Program:
        return this.printProgram(node, indent);
      case ASTTypes.NumberLiteral:
      case ASTTypes.StringLiteral:
      case ASTTypes.BooleanLiteral:
      case ASTTypes.KeywordLiteral:
      case ASTTypes.NilLiteral:
        return this.printPrimitive(node, indent);
    }
  }
Enter fullscreen mode Exit fullscreen mode

Finally, at the bottom of printAST.js, let's export a function to use the ASTPrinter:

export const printAST = (ast) => ASTPrinter.new(ast).print();
Enter fullscreen mode Exit fullscreen mode

Improving the CLI

We're going to add 2 AST pretty print functions to the CLI. We're also going to add the ability to invoke the CLI with a print command and option so the REPL will print the AST given to it as a program. This will allow us to debug things later when our programs get more complicated.

First, here's src/cli/pprint.js:

import { tokenize } from "../lexer/tokenize.js";
import { read } from "../reader/read.js";
import { expand } from "../expander/expand.js";
import { parse } from "../parser/parse.js";
import { desugar } from "../desugarer/desugar.js";
import { printAST } from "../printer/printAST.js";

export const pprintDesugaredAST = (input, file = "stdin") =>
  printAST(desugar(parse(expand(read(tokenize(input, file))))));

export const pprintAST = (input, file = "stdin") =>
  printAST(parse(expand(read(tokenize(input, file)))));
Enter fullscreen mode Exit fullscreen mode

We're going to add a helper function that will let us throw an Exception from an expression position in the compiler's code, in src/shared/fail.js:

import { Exception } from "./exceptions.js";

export const fail = (msg, exn = Exception) => {
  throw new exn(msg);
};
Enter fullscreen mode Exit fullscreen mode

Now we'll make some changes in src/cli/repl.js:

import readlineSync from "readline-sync";
import { EVAL } from "./eval.js";
import { pprintAST, pprintDesugaredAST } from "./pprint.js";
import { print } from "../printer/print.js";
import { fail } from "../shared/fail.js";

const read = (prompt) => readlineSync.question(prompt);

export const repl = (mode) => {
  const proc =
    mode === "repl"
      ? EVAL
      : mode === "printDesugared"
      ? pprintDesugaredAST
      : mode === "printAST"
      ? pprintAST
      : fail("Invalid REPL mode specified");
  let prompt = "wanda> ";

  while (true) {
    try {
      const input = read(prompt);

      if (input === ":quit") {
        process.exit(0);
      }

      let result = proc(input);

      print(result, mode === "repl");
    } catch (e) {
      console.error(e.message);
    }
  }
};
Enter fullscreen mode Exit fullscreen mode

Note that now if the user enters the keyword :quit at the prompt it will exit the REPL.

Now in src/cli/run.js we're going to create an actual run function that will set the mode for the REPL instead of just importing and calling the repl function:

import { Exception } from "../shared/exceptions.js";
import { repl } from "./repl.js";

export const run = () => {
  let mode = "";
  switch (process.argv[2]) {
    case "print":
      if (process.argv[3] === "-d") {
        mode = "printDesugared";
        break;
      } else if (process.argv[3] === "-a") {
        mode = "printAST";
        break;
      }
    case undefined:
    case "-i":
    case "repl":
      mode = "repl";
      break;
    default:
      throw new Exception("Invalid command specified");
  }
  repl(mode);
};

Enter fullscreen mode Exit fullscreen mode

Now we have a print command we can use when we execute the program, which is further used with either -a or -d options to print either the base AST or its desugared version. Right now there's no difference between the two, but that will change!

Creating The wanda Command

Now we need a way to call the run function. If you follow these instructions you'll be able to start the Wanda REPL just using the command wanda.

First, in the project root, open the package.json file. Somewhere in the toplevel properties (I've got it just below the "name" property), add:

{
// ...
  "bin": "bin/wanda.js",
// ...
}
Enter fullscreen mode Exit fullscreen mode

Now in the project root create the bin directory (not src/bin) and add the file wanda.js to it. Here are its contents:

#!/usr/bin/env node

import { run } from "../src/cli/run.js";

run();
Enter fullscreen mode Exit fullscreen mode

If the first line looks strange to you, it's called a shebang. It's a convention from UNIX shell scripts. It lets the system know to use Node.js to process the file, just like if you were writing a Bash script it would let the system know to use Bash to process the file.

After saving bin/wanda.js, make sure you're in the project root in your terminal and enter the command npm link. This will tell NPM to link your wanda command as a toplevel system command. Now whatever code is executed in bin/wanda.js will run when you enter the command wanda.

You can try it yourself: enter wanda into your terminal and hit ENTER.

Now you can enter values of any of the primitive types, including numbers, strings, booleans, keywords, and nil.

Want to see the AST printer in action? Run the command wanda print -a and see it work!

Conclusion

That wraps it up for today! We've now implemented all the primitive types in Wanda. Next time we'll add call expressions and change the data structure output by the reader to be more like that of a conventional Lisp language: using cons cells (linked lists) to represent the forms. This won't seem like a revolutionary change at first, but as it turns out this will enable us to do some pretty neat things in the future.

To see an exact snapshot of the state of the Wanda compiler and CLI as of the end of this article, check out the tag on GitHub. Note that the repo also includes JSDoc comments for the compiler as well as tests.

I'll see you next time!

Top comments (0)