DEV Community

Cover image for A simple recursive descent parser
Andrea Griffini
Andrea Griffini

Posted on • Edited on

A simple recursive descent parser

I am going to show you the code for a very simple recursive descent parser that will be able to handle the four operations (+, -, * and /), parenthesis, numbers and variables.

I will use Javascript, but without any of the higher level stuff (e.g. regular expressions or eval), so the code could indeed be implemented almost identical in many other languages like C++, Python, Java or Go.

Function signature

The function will accept a string and will return either an error or a function: this resulting function will evaluate the result of the expression given a dictionary with current values for the variables.

An example use will be:

// parse a string into an expression
let ee = parse("x*x + y*y");

// Compute the expression value
console.log(ee({x: 12, y: 3}));
Enter fullscreen mode Exit fullscreen mode

The output will be in this case 12*12 + 3*3 = 144 + 9 = 153

Character reading

First of all I’ll need a variable to keep current position in the input string s; I will use i.

I am also creating an utility function to call in case of an error, throwing an exception with the error message and the position in which the error happened during the parsing.

function parse(s) {
    let i = 0,
        err = (msg) => {
            let e = Error("Error: " + msg);
            e.source = s; e.pos = i;
            throw e;
        };
Enter fullscreen mode Exit fullscreen mode

Whitespace skipping

The parser will ignore whitespace in most positions, so a utility function that will skip over all whitespace (if there is any) will be needed in a few places:

let skipsp = () => {
        while (" \r\t\n".indexOf(s[i]) >= 0) i++;
    };
Enter fullscreen mode Exit fullscreen mode

Numeric and alphabetic characters

Another few utilities to keep the code cleaner are about digits and alphabetic characters needed for numbers and identifiers:

let num = (c) => c >= "0" && c <= "9",
    alpha = (c) => ((c >= "a" && c <= "z") ||
                    (c >= "A" && c <= "Z") ||
                    c === "_"),
    alphanum = (c) => alpha(c) || num(c);
Enter fullscreen mode Exit fullscreen mode

The recursive parser

Now that all needed utility functions have been defined we can move to the main part of the parser. Parsing is a naturally recursive problem once you allow parenthesis… more specifically there are “level-0” expressions that are just numbers or variables, “level-1” that are multiplications or divisions and “level-2” that are additions and subtractions. Recursion is present because at “level-0” you can also add a parenthesized expression, like in x + y*(4 + z).

Instead of making different functions for each level I will write just a single function accepting a “level” parameter, being 0 for numbers, 1 for mu/div, 2 for add/sub.

Indeed it’s even simpler if all the binary operators are placed in a table, with the priority level and the math function to compute the result, this way adding new operators won’t need more code:

let binops = {
        "*": { level: 1, f: (x, y) => x * y},
        "/": { level: 1, f: (x, y) => x / y},
        "+": { level: 2, f: (x, y) => x + y},
        "-": { level: 2, f: (x, y) => x - y},
    };
Enter fullscreen mode Exit fullscreen mode

The main structure of the parser is:

if (level === 0) {
    skipsp(); // skip over spaces
    // check for numbers, variables,
    // unary minus or parenthesized sub-expressions
} else {
    let x = parse(level-1);
    while (/*... there is a proper level operand ...*/) {
        let y = parse(level - 1);
        // combine x and y in (x op y)
        // and place the combination in x
    }
    return x;
}
Enter fullscreen mode Exit fullscreen mode

Checking for numbers

Here I will just store where the number starts and keep skipping over characters until the number ends; to get the numeric value I’ll just Javascript unary + operator on the substring. In C++ for example I could use atof to compute the numeric value.

if (num(s[i])) {
    let i0 = i; // mark where the number starts
    while (num(s[i])) i++; // skip over all digits
    if (s[i] === ".") { // decimal point?
        i++; // skip over it
        while (num(s[i])) i++; // skip over all digits
    }
    if (s[i] === "e" || s[i] === "E") { // exponent?
        i++; // skip over E char
        // skip over optional sign
        if (s[i] === "+" || s[i] === "-") i++;
        if (!num(s[i])) err("Exponent digits expected");
        while (num(s[i])) i++; // skip over all digits
    }
    // The number is over; get the numeric value
    let value = +s.slice(i0, i);
    // Return the resulting function
    return (vars) => value; // vars ignored in this case!
}
Enter fullscreen mode Exit fullscreen mode

Checking for variables

Here I will collect the variable name and return a lookup function

if (alpha(s[i])) {
    let i0 = i; // mark where the name starts
    // skip over all alphanumeric
    while (alphanum(s[i])) i++;
    let varname = s.slice(i0, i);
    // value looked up in passed vars object
    return (vars) => vars[varname];
}
Enter fullscreen mode Exit fullscreen mode

Unary minus

The unary minus is simple; we just skip and wrap the sub-expr result in a sign-changed function

if (s[i] === "-") {
    i++; // skip over the minus sign
    // Note: we parse at level 0!
    // For example "-x+y" means "(-x)+y" not "-(x+y)"
    let v = parse(0);
    return (vars) => -v(vars);
}
Enter fullscreen mode Exit fullscreen mode

Parenthesized sub-expression

if (s[i] === "(") {
    i++; // skip over the open parenthesis
    // Note: we parse at level 2 here
    let v = parse(2);
    skipsp();
    if (s[i] !== ")") err("')' expected");
    i++; // skip over closed parenthesis
    return v;
}
Enter fullscreen mode Exit fullscreen mode

That’s all Folks! (for level 0)

In case we’re at level 0 and none of the previous cases was found then we’ve an error!
This can happen for example parsing "x + 2*" as after the * an expression is expected but none is present and the string just ends. Another possibility for this error to be found is for example "1 + 2*/7" because after the * there is an operator instead of an expression.

err("Expression expected");
Enter fullscreen mode Exit fullscreen mode

Parsing intermediate levels

For the intermediate levels (1/2) of the parser the code is

let x = parse(level - 1);
for(;;) { // endless loop - we'll quit with `break`
    skipsp();
    let op = binops[s[i]];
    // check if it's an operator and if the level is ok
    if (op === undefined || op.level !== level) break;
    i++; // skip over the operator
    let a = x, // left hand
        b = parse(level - 1), // right hand
        f = op.f; // result computation function
    // Combine the two sub-expressions in the result
    x = (vars) => f(a(vars), b(vars));
}
return x;
Enter fullscreen mode Exit fullscreen mode

Main function

The recursive “core” of the parser is complete, we just need to call it and check that no junk is present at end of the expression:

let r = parse(2);
skipsp();
if (i < s.length) err("Extra characters at end of expression");
return r;
Enter fullscreen mode Exit fullscreen mode

Full code

That’s all that is needed; here is the full code for easier copy-n-paste if you want to hack on it:

function parse(s) {
    let i = 0,
        err = (msg) => {
            let e = Error("Error: " + msg);
            e.source = s; e.pos = i;
            throw e;
        },
        skipsp = () => { while (" \r\t\n".indexOf(s[i]) >= 0) i++; },
        num = (c) => c >= "0" && c <= "9",
        alpha = (c) => (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_",
        alphanum = (c) => alpha(c) || num(c),
        binops = {
            "*": { level: 1, f: (x, y) => x * y},
            "/": { level: 1, f: (x, y) => x / y},
            "+": { level: 2, f: (x, y) => x + y},
            "-": { level: 2, f: (x, y) => x - y},
        },
        parse = (level) => {
            if (level === 0) {
                skipsp(); // skip over spaces
                if (num(s[i])) {
                    let i0 = i; // mark where the number starts
                    while (num(s[i])) i++; // skip over all digits
                    if (s[i] === ".") { // decimal point?
                        i++; // skip over it
                        while (num(s[i])) i++; // skip over all digits
                    }
                    if (s[i] === "e" || s[i] === "E") { // exponent?
                        i++; // skip over E char
                        if (s[i] === "+" || s[i] === "-") i++; // skip over optional sign
                        if (!num(s[i])) err("Exponent digits expected");
                        while (num(s[i])) i++; // skip over all digits
                    }
                    // The number is over; get the numeric value of this part of the input
                    let value = +s.slice(i0, i);
                    // Return the result: a function accepting variables and returning expr value
                    return (vars) => value; // vars ignored in this case
                }
                if (alpha(s[i])) {
                    let i0 = i; // mark where the name starts
                    while (alphanum(s[i])) i++; // skip over all alphanumeric
                    let varname = s.slice(i0, i);
                    return (vars) => vars[varname]; // value looked up in passed vars object
                }
                if (s[i] === "-") {
                    i++; // skip over the minus sign
                    let v = parse(0); // Note: we parse at level 0: -x+y means (-x)+y
                    return (vars) => -v(vars);
                }
                if (s[i] === "(") {
                    i++; // skip over the open parenthesis
                    let v = parse(2); // Note: we parse at level 2 here
                    skipsp();
                    if (s[i] !== ")") err("')' expected");
                    i++; // skip over closed parenthesis
                    return v;
                }
                err("Expression expected");
            } else {
                let x = parse(level - 1);
                for(;;) { // endless loop - we'll quit with a break statement
                    skipsp();
                    let op = binops[s[i]];
                    // check if it's an operator and if the level is the correct one
                    if (op === undefined || op.level !== level) break;
                    // There is an operator and the precedence level is correct
                    i++; // skip over the operator
                    let a = x, // left hand
                        b = parse(level - 1), // right hand
                        f = op.f; // result computation function
                    // Combine the two sub-expressions in the result
                    x = (vars) => f(a(vars), b(vars));
                }
                return x;
            }
        };

    let r = parse(2);
    skipsp();
    if (i < s.length) err("Extra characters at end of expression");
    return r;
}

let ee = parse("x*x + y*y");
console.log(ee({x: 12, y: 3}));
Enter fullscreen mode Exit fullscreen mode

Some more serious test

Ok... 153 tests fine, but the code is far from trivial, so probably more testing is needed. In the following testing code I wrote an expression, the vars to be passed and the expected result, including some error cases (in which I check the expected error position).

The result has been computed by hand, and not using the function result itself (that wouldn't have been a serious functional test, at most a non-regression test) and indeed helped me finding an error... an error in my manual computation of the expected result! :-D

// Testing
let test = (src, vars, res) => {
    let v = undefined;
    try {
        let expr = parse(src);
        return res === expr(vars);
    } catch(err) {
        return typeof res === "string" && res.indexOf("!") === err.pos;
    }
}

[
    {src: "1", vars:{}, res: 1},
    {src: "2+3", vars:{}, res: 5},
    {src: "2-3", vars:{}, res: -1},
    {src: "2*3", vars:{}, res: 6},
    {src: "5/2", vars:{}, res: 2.5},
    {src: "2 +3", vars:{}, res: 5},
    {src: "2- 3", vars:{}, res: -1},
    {src: " 2 * 3", vars:{}, res: 6},
    {src: "5/2 ", vars:{}, res: 2.5},
    {src: "1+2*3", vars:{}, res: 7},
    {src: "1+5/2", vars:{}, res: 3.5},
    {src: "7-3-1", vars:{}, res: 3},
    {src: "8/2/2", vars:{}, res: 2},
    {src: "8/-2/2", vars:{}, res: -2},
    {src: "-8/-2/2", vars:{}, res: 2},
    {src: "--------1", vars:{}, res: 1},
    {src: "1+2*(3+4)", vars:{}, res: 15},
    {src: "1+2*(3+4)", vars:{}, res: 15},
    {src: "x", vars:{x:2}, res: 2},
    {src: "x+1", vars:{x:2}, res: 3},
    {src: "x*x", vars:{x:2}, res: 4},
    {src: "x * (x+1)", vars:{x:2}, res: 6},
    {src: "x * x + 1", vars:{x:2}, res: 5},
    {src: "(x-1) * (x+1)", vars:{x:2}, res: 3},
    {src: "(x-1) * (x+1)", vars:{x:3}, res: 8},
    {src: "(x-1) * (x2+1)", vars:{x:3, x2:9}, res: 20},
    {src: "(x-1) * (x 2+1)", vars:{x:3, x2:9}, res: "(x-1) * (x !"},
    {src: "(x-1) * ", vars:{x:3, x2:9}, res: "(x-1) * !"},
    {src: " * (x-1)", vars:{x:3}, res: " !"},
    {src: "(x-1", vars:{x:3}, res: "(x-1!"},
].forEach(t => {
    console.log(JSON.stringify(t.src), " ==> ", test(t.src, t.vars, t.res) ? "ok" : "** FAIL **");
});
Enter fullscreen mode Exit fullscreen mode

Conclusion

Parsers are often considered a "difficult" topic, but they don't have to be. For simple languages, the implementation is indeed straightforward.

For complex and irregular languages, things become more complicated. However, it's worth considering whether you really want to work with such complex and irregular languages. They are not only harder for computers to parse but, more importantly, also challenging for humans to use. Is it truly a good idea to venture into that territory?

Take C++, for example; its extremely complex grammar is something you may not want to deal with, both as a parser implementer and as a user.

While this code is just a toy example, it's less than 100 lines long and doesn't rely on any libraries or high-level functions.

Adding more features such as tabled unary operators, right-associativity support, postfix operators, or constructs like array item access x[y] or function calls x(y, z) isn't very difficult to do. Even incorporating constructs like if-then-else or while doesn't require much more complexity.

In my opinion, an interesting programming exercise would be to read and study this code and then attempt to write your own version from scratch, possibly in another language (e.g., Python).

Top comments (0)