I am going to show you the code for a very simple recursive descent parser that will be able to handle the four operations (+
, -
, *
and /
), parenthesis, numbers and variables.
I will use Javascript
, but without any of the higher level stuff (e.g. regular expressions or eval
), so the code could indeed be implemented almost identical in many other languages like C++, Python, Java or Go.
Function signature
The function will accept a string and will return either an error or a function: this resulting function will evaluate the result of the expression given a dictionary with current values for the variables.
An example use will be:
// parse a string into an expression
let ee = parse("x*x + y*y");
// Compute the expression value
console.log(ee({x: 12, y: 3}));
The output will be in this case 12*12 + 3*3 = 144 + 9 = 153
Character reading
First of all I’ll need a variable to keep current position in the input string s
; I will use i
.
I am also creating an utility function to call in case of an error, throwing an exception with the error message and the position in which the error happened during the parsing.
function parse(s) {
let i = 0,
err = (msg) => {
let e = Error("Error: " + msg);
e.source = s; e.pos = i;
throw e;
};
Whitespace skipping
The parser will ignore whitespace in most positions, so a utility function that will skip over all whitespace (if there is any) will be needed in a few places:
let skipsp = () => {
while (" \r\t\n".indexOf(s[i]) >= 0) i++;
};
Numeric and alphabetic characters
Another few utilities to keep the code cleaner are about digits and alphabetic characters needed for numbers and identifiers:
let num = (c) => c >= "0" && c <= "9",
alpha = (c) => ((c >= "a" && c <= "z") ||
(c >= "A" && c <= "Z") ||
c === "_"),
alphanum = (c) => alpha(c) || num(c);
The recursive parser
Now that all needed utility functions have been defined we can move to the main part of the parser. Parsing is a naturally recursive problem once you allow parenthesis… more specifically there are “level-0” expressions that are just numbers or variables, “level-1” that are multiplications or divisions and “level-2” that are additions and subtractions. Recursion is present because at “level-0” you can also add a parenthesized expression, like in x + y*(4 + z)
.
Instead of making different functions for each level I will write just a single function accepting a “level” parameter, being 0 for numbers, 1 for mu/div, 2 for add/sub.
Indeed it’s even simpler if all the binary operators are placed in a table, with the priority level and the math function to compute the result, this way adding new operators won’t need more code:
let binops = {
"*": { level: 1, f: (x, y) => x * y},
"/": { level: 1, f: (x, y) => x / y},
"+": { level: 2, f: (x, y) => x + y},
"-": { level: 2, f: (x, y) => x - y},
};
The main structure of the parser is:
if (level === 0) {
skipsp(); // skip over spaces
// check for numbers, variables,
// unary minus or parenthesized sub-expressions
} else {
let x = parse(level-1);
while (/*... there is a proper level operand ...*/) {
let y = parse(level - 1);
// combine x and y in (x op y)
// and place the combination in x
}
return x;
}
Checking for numbers
Here I will just store where the number starts and keep skipping over characters until the number ends; to get the numeric value I’ll just Javascript unary +
operator on the substring. In C++ for example I could use atof
to compute the numeric value.
if (num(s[i])) {
let i0 = i; // mark where the number starts
while (num(s[i])) i++; // skip over all digits
if (s[i] === ".") { // decimal point?
i++; // skip over it
while (num(s[i])) i++; // skip over all digits
}
if (s[i] === "e" || s[i] === "E") { // exponent?
i++; // skip over E char
// skip over optional sign
if (s[i] === "+" || s[i] === "-") i++;
if (!num(s[i])) err("Exponent digits expected");
while (num(s[i])) i++; // skip over all digits
}
// The number is over; get the numeric value
let value = +s.slice(i0, i);
// Return the resulting function
return (vars) => value; // vars ignored in this case!
}
Checking for variables
Here I will collect the variable name and return a lookup function
if (alpha(s[i])) {
let i0 = i; // mark where the name starts
// skip over all alphanumeric
while (alphanum(s[i])) i++;
let varname = s.slice(i0, i);
// value looked up in passed vars object
return (vars) => vars[varname];
}
Unary minus
The unary minus is simple; we just skip and wrap the sub-expr result in a sign-changed function
if (s[i] === "-") {
i++; // skip over the minus sign
// Note: we parse at level 0!
// For example "-x+y" means "(-x)+y" not "-(x+y)"
let v = parse(0);
return (vars) => -v(vars);
}
Parenthesized sub-expression
if (s[i] === "(") {
i++; // skip over the open parenthesis
// Note: we parse at level 2 here
let v = parse(2);
skipsp();
if (s[i] !== ")") err("')' expected");
i++; // skip over closed parenthesis
return v;
}
That’s all Folks! (for level 0)
In case we’re at level 0 and none of the previous cases was found then we’ve an error!
This can happen for example parsing "x + 2*"
as after the *
an expression is expected but none is present and the string just ends. Another possibility for this error to be found is for example "1 + 2*/7"
because after the *
there is an operator instead of an expression.
err("Expression expected");
Parsing intermediate levels
For the intermediate levels (1/2) of the parser the code is
let x = parse(level - 1);
for(;;) { // endless loop - we'll quit with `break`
skipsp();
let op = binops[s[i]];
// check if it's an operator and if the level is ok
if (op === undefined || op.level !== level) break;
i++; // skip over the operator
let a = x, // left hand
b = parse(level - 1), // right hand
f = op.f; // result computation function
// Combine the two sub-expressions in the result
x = (vars) => f(a(vars), b(vars));
}
return x;
Main function
The recursive “core” of the parser is complete, we just need to call it and check that no junk is present at end of the expression:
let r = parse(2);
skipsp();
if (i < s.length) err("Extra characters at end of expression");
return r;
Full code
That’s all that is needed; here is the full code for easier copy-n-paste if you want to hack on it:
function parse(s) {
let i = 0,
err = (msg) => {
let e = Error("Error: " + msg);
e.source = s; e.pos = i;
throw e;
},
skipsp = () => { while (" \r\t\n".indexOf(s[i]) >= 0) i++; },
num = (c) => c >= "0" && c <= "9",
alpha = (c) => (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_",
alphanum = (c) => alpha(c) || num(c),
binops = {
"*": { level: 1, f: (x, y) => x * y},
"/": { level: 1, f: (x, y) => x / y},
"+": { level: 2, f: (x, y) => x + y},
"-": { level: 2, f: (x, y) => x - y},
},
parse = (level) => {
if (level === 0) {
skipsp(); // skip over spaces
if (num(s[i])) {
let i0 = i; // mark where the number starts
while (num(s[i])) i++; // skip over all digits
if (s[i] === ".") { // decimal point?
i++; // skip over it
while (num(s[i])) i++; // skip over all digits
}
if (s[i] === "e" || s[i] === "E") { // exponent?
i++; // skip over E char
if (s[i] === "+" || s[i] === "-") i++; // skip over optional sign
if (!num(s[i])) err("Exponent digits expected");
while (num(s[i])) i++; // skip over all digits
}
// The number is over; get the numeric value of this part of the input
let value = +s.slice(i0, i);
// Return the result: a function accepting variables and returning expr value
return (vars) => value; // vars ignored in this case
}
if (alpha(s[i])) {
let i0 = i; // mark where the name starts
while (alphanum(s[i])) i++; // skip over all alphanumeric
let varname = s.slice(i0, i);
return (vars) => vars[varname]; // value looked up in passed vars object
}
if (s[i] === "-") {
i++; // skip over the minus sign
let v = parse(0); // Note: we parse at level 0: -x+y means (-x)+y
return (vars) => -v(vars);
}
if (s[i] === "(") {
i++; // skip over the open parenthesis
let v = parse(2); // Note: we parse at level 2 here
skipsp();
if (s[i] !== ")") err("')' expected");
i++; // skip over closed parenthesis
return v;
}
err("Expression expected");
} else {
let x = parse(level - 1);
for(;;) { // endless loop - we'll quit with a break statement
skipsp();
let op = binops[s[i]];
// check if it's an operator and if the level is the correct one
if (op === undefined || op.level !== level) break;
// There is an operator and the precedence level is correct
i++; // skip over the operator
let a = x, // left hand
b = parse(level - 1), // right hand
f = op.f; // result computation function
// Combine the two sub-expressions in the result
x = (vars) => f(a(vars), b(vars));
}
return x;
}
};
let r = parse(2);
skipsp();
if (i < s.length) err("Extra characters at end of expression");
return r;
}
let ee = parse("x*x + y*y");
console.log(ee({x: 12, y: 3}));
Some more serious test
Ok... 153 tests fine, but the code is far from trivial, so probably more testing is needed. In the following testing code I wrote an expression, the vars to be passed and the expected result, including some error cases (in which I check the expected error position).
The result has been computed by hand, and not using the function result itself (that wouldn't have been a serious functional test, at most a non-regression test) and indeed helped me finding an error... an error in my manual computation of the expected result! :-D
// Testing
let test = (src, vars, res) => {
let v = undefined;
try {
let expr = parse(src);
return res === expr(vars);
} catch(err) {
return typeof res === "string" && res.indexOf("!") === err.pos;
}
}
[
{src: "1", vars:{}, res: 1},
{src: "2+3", vars:{}, res: 5},
{src: "2-3", vars:{}, res: -1},
{src: "2*3", vars:{}, res: 6},
{src: "5/2", vars:{}, res: 2.5},
{src: "2 +3", vars:{}, res: 5},
{src: "2- 3", vars:{}, res: -1},
{src: " 2 * 3", vars:{}, res: 6},
{src: "5/2 ", vars:{}, res: 2.5},
{src: "1+2*3", vars:{}, res: 7},
{src: "1+5/2", vars:{}, res: 3.5},
{src: "7-3-1", vars:{}, res: 3},
{src: "8/2/2", vars:{}, res: 2},
{src: "8/-2/2", vars:{}, res: -2},
{src: "-8/-2/2", vars:{}, res: 2},
{src: "--------1", vars:{}, res: 1},
{src: "1+2*(3+4)", vars:{}, res: 15},
{src: "1+2*(3+4)", vars:{}, res: 15},
{src: "x", vars:{x:2}, res: 2},
{src: "x+1", vars:{x:2}, res: 3},
{src: "x*x", vars:{x:2}, res: 4},
{src: "x * (x+1)", vars:{x:2}, res: 6},
{src: "x * x + 1", vars:{x:2}, res: 5},
{src: "(x-1) * (x+1)", vars:{x:2}, res: 3},
{src: "(x-1) * (x+1)", vars:{x:3}, res: 8},
{src: "(x-1) * (x2+1)", vars:{x:3, x2:9}, res: 20},
{src: "(x-1) * (x 2+1)", vars:{x:3, x2:9}, res: "(x-1) * (x !"},
{src: "(x-1) * ", vars:{x:3, x2:9}, res: "(x-1) * !"},
{src: " * (x-1)", vars:{x:3}, res: " !"},
{src: "(x-1", vars:{x:3}, res: "(x-1!"},
].forEach(t => {
console.log(JSON.stringify(t.src), " ==> ", test(t.src, t.vars, t.res) ? "ok" : "** FAIL **");
});
Conclusion
Parsers are often considered a "difficult" topic, but they don't have to be. For simple languages, the implementation is indeed straightforward.
For complex and irregular languages, things become more complicated. However, it's worth considering whether you really want to work with such complex and irregular languages. They are not only harder for computers to parse but, more importantly, also challenging for humans to use. Is it truly a good idea to venture into that territory?
Take C++, for example; its extremely complex grammar is something you may not want to deal with, both as a parser implementer and as a user.
While this code is just a toy example, it's less than 100 lines long and doesn't rely on any libraries or high-level functions.
Adding more features such as tabled unary operators, right-associativity support, postfix operators, or constructs like array item access x[y]
or function calls x(y, z)
isn't very difficult to do. Even incorporating constructs like if-then-else
or while
doesn't require much more complexity.
In my opinion, an interesting programming exercise would be to read and study this code and then attempt to write your own version from scratch, possibly in another language (e.g., Python).
Top comments (0)