This article provides an overview of how to use the Chevrotain library to write a small expression parser.
This article will not go into much detail about what a parser is, and is intended for an intermediate level audience.
If you want more detail, let me know in the comments and I will modify this article accordingly.
A little bit of context
I'm working on a Headless CMS project, which is based on a JSON data schema and generates a GraphQL API. To make filtering through the API a bit easier, I need to be able to handle it via a simple custom grammar.
I usually use ANTLR, which is probably one of the best parser generators.
But this time, I wanted to try something new, and after some research, I came across a library called Chevrotain.
Chevrotain is not a parser generator; instead, it takes advantage of JavaScript directly to describe the lexer and the grammar in code.
The target
The goal is to be able to filter the elements of our query using a very simple language that must meet the following criteria:
- Filter fields via matching operators
age lt 20
fruit not in ['apple', 'banana']
email eq 'xxxx@xxxx.xxx'
- Use multiple criteria via the AND and OR operators
group eq 'admin' and active eq 1
- Prioritize operators with parentheses
(amount lte 100 and date gt dt{'2020-01-01'}) or byPass eq 1
- Order the results by fields
order by age desc name asc
- Skip some records
skip 5
- Take a limited number of records
take 2
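Putting it all together, a complete query combining all of these clauses might look like this (a made-up example that follows the grammar we will build below):
(age gte 18 and group eq 'admin') or byPass eq 1 order by age desc name asc skip 5 take 2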
The Lexer
First, we need to write a lexer in order to split the input into tokens. The tokens are then used by the parsing rules to create the target AST. An AST, or Abstract Syntax Tree, is the final result of the parsing stage.
A token can represent a static keyword as well as a dynamic value, such as a number, a string, or an identifier (variables, method names, and so on).
So we need to define all the tokens first to tell Chevrotain how to understand the input text and prepare it to be parsed.
CreateToken
With Chevrotain, token creation is relatively simple.
First we import the createToken function
const createToken = chevrotain.createToken;
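This assumes the chevrotain object is already in scope. In the playground it is available as a global; in a Node project you would load it first, for example:
// CommonJS, assuming the chevrotain package is installed
const chevrotain = require("chevrotain");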
Then we define the tokens
const Identifier = createToken({name: "Identifier" , pattern: /[a-zA-Z_][\w\d_]*/});
As you can see, to define a token, you specify a name, and a pattern. The name is the unique identifier of the token, and the pattern is a regular expression used by the scanner to recognize the token.
It is also possible to resolve recognition ambiguities by specifying an alternative token (longer_alt) that should be used instead when it matches a longer portion of the input.
For example, an Integer and a Float cause a recognition ambiguity: by default, the integer part of a Float would be interpreted as an Integer.
This can be handled as follows:
const Float = createToken({name: "Float" , pattern: /\d+\.\d+/});
const Integer = createToken({name: "Integer" , pattern: /\d+/, longer_alt: Float});
Now an Integer will be recognized as an Integer only if it is not a Float.
After defining all your tokens, you must now group them together to create an instance of the lexer.
const allTokens = [OrderBy, WhiteSpace, Asc, Desc, Take, Skip, NotInOp, InOp, AndOp, OrOp, GteOp, GtOp, LteOp, LtOp, NotEqOp, EqOp, LParen, RParen, LBraket, RBraket, Comma, Float, Integer, Dt, Identifier, LCurly, RCurly, String];
const FilterLexer = new Lexer(allTokens);
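At this point the lexer can already split an input string into tokens. A quick check (assuming all the tokens referenced in allTokens have been defined, as in the full source at the end of the article):
const lexingResult = FilterLexer.tokenize("age lt 20");
// lexingResult.tokens holds the recognized tokens, lexingResult.errors any lexing errors
console.log(lexingResult.tokens.map((t) => t.tokenType.name));
// -> [ 'Identifier', 'LtOp', 'Integer' ]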
The Grammar
Let's see what the grammar should look like.
At the top level, we have the expressions rule. It is composed of one andOrExp rule, optionally followed by an orderBy rule, a skip rule and a take rule.
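Sketched in a rough grammar notation (the same style used further below for the left-recursive example), that gives:
expressions:
    andOrExp (orderBy)? (skip)? (take)?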
What are grammar rules?
When working with parsers, it is good to understand a few prerequisites.
To write a grammar, you will need to work with two kinds of building blocks. The source to be parsed is decomposed into nodes, and these nodes fall into two categories: terminal and non-terminal nodes.
In the image above, you can see the non-terminal nodes, which are in square boxes, and the terminal ones in rounded boxes.
A terminal node is a final one: a value, a keyword, or any token you have defined.
A non-terminal node is a rule, within which parsing can continue.
In summary, when we have to process the LBraket node, we do not go further: this node simply has the value [.
On the other hand, for the next node, atomicExp, we continue the processing before being able to evaluate its final value.
As you can see, we cannot determine the expression's value directly, since it can be of several types. That's why it is a non-terminal node.
From theory to implementation
Let's start by analyzing the rule we want to write.
The first element is the andOrExp rule, and it is mandatory.
The three others are all optional but processed sequentially.
Let's start by creating the Rule itself.
const $ = this;
// This is an empty rule
$.RULE("expressions", () => {
});
Now we can consume the first rule as a subrule of the current one. This tells Chevrotain how to understand the rule.
$.RULE("expressions", () => {
$.SUBRULE($.andOrExp);
});
Handle optional rules
Now we need to set the first optional rule.
$.RULE("expressions", () => {
$.SUBRULE($.andOrExp);
$.OPTION(() => { $.SUBRULE($.orderBy); })
});
And the others. Note that when the same DSL method (here OPTION) is used more than once inside a rule, each additional occurrence needs a numeric suffix (OPTION2, OPTION3, ...) so Chevrotain can tell the occurrences apart.
$.RULE("expressions", () => {
$.SUBRULE($.andOrExp);
$.OPTION(() => { $.SUBRULE($.orderBy); })
$.OPTION2(() => { $.SUBRULE($.skip); })
$.OPTION3(() => { $.SUBRULE($.take); })
});
Yes, we did it. We have just declared the rule :-)
Handle alternative rules
Let's see the andOrExp rule.
This rule is an interesting one because it is structurally complex without being complicated. And that's the point: keeping things simple in order to build something complex.
expression is a mandatory rule. AndOp and OrOp are alternatives to each other, and the whole operator-plus-expression group after the first expression is optional and can be repeated several times.
So let's see how to handle that.
$.RULE("andOrExp", () => {
$.SUBRULE($.expression, { LABEL: "lhs" });
});
Here we can use a subrule to start with. Note the use of the LABEL option. This will be necessary for the implementation of the visitor.
Then we can declare Alternatives by using the OR function. AndOp and OrOp are Tokens not rules, so we use the CONSUME method instead of SUBRULE.
$.OR([
{ALT: () => { $.CONSUME(AndOp); }},
{ALT: () => { $.CONSUME(OrOp); }}
]);
This sequence can be declared multiple times, so we need to encapsulate it as follows.
$.MANY(() => {
$.OR([
{ALT: () => { $.CONSUME(AndOp); }},
{ALT: () => { $.CONSUME(OrOp); }}
]);
});
And now the full rule
$.RULE("andOrExp", () => {
$.SUBRULE($.expression, { LABEL: "lhs" });
$.MANY(() => {
$.OR([
{ALT: () => { $.CONSUME(AndOp); }},
{ALT: () => { $.CONSUME(OrOp); }}
]);
$.SUBRULE2($.expression,{LABEL: "rhs" });
});
})
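As a quick preview of why the LABEL option matters (the visitor itself is the topic of the next article), each label becomes a key on the ctx object passed to the corresponding visitor method. A sketch, assuming a visitor like the one at the end of this article:
andOrExp(ctx) {
  // thanks to the labels, the sub-expressions arrive as ctx.lhs and ctx.rhs
  // instead of a single ctx.expression array
  const left = this.visit(ctx.lhs);
  const rights = (ctx.rhs || []).map((rhs) => this.visit(rhs));
  // ctx.AndOp and ctx.OrOp hold the consumed operator tokens
  return { left, rights }; // sketch only, the real combination logic comes in the next article
}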
Left recursive approach versus chained approach
As I mentioned earlier, I'm more used to ANTLR, which has the particularity of being left recursive.
So the naive approach to adding the andOrExp rule with parentheses could have looked like this:
andOrExp:
    expression ((AndOp | OrOp) expression)* |
    LParen andOrExp RParen
But Chevrotain is not left recursive, so we have to adapt the grammar by splitting it across three rules.
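Concretely, in the full grammar at the end of the article, this ends up split as follows (same rough notation as above): andOrExp handles the chained operators, expression lists the alternatives, and parentAndOrExp re-enters andOrExp between parentheses.
andOrExp:
    expression ((AndOp | OrOp) expression)*
expression:
    compareRule | inExp | notInExp | parentAndOrExp
parentAndOrExp:
    LParen andOrExp RParen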
Now we have achieved the same result 😄
And the sample
(billAmount lte 200 and billAmount gte 100) or startDate eq dt{'2020-01-01'}
order by name asc age desc
skip 100 take 20
will be converted into a relatively indigestible syntax tree...
Conclusion
In the next article we will see how to define the corresponding visitor to explore and transform the AST into something more useful, and also how to implement a derived visitor to generate MongoDB filtering from this parser.
If you want to play with this sample, open the Chevrotain playground.
Then paste in the following source:
(function FilterCst() {
"use strict";
/**
* An example of implementing a filter expression parser with separated grammar and semantics (actions).
* This separation makes it easier to maintain the grammar and reuse it in different use cases.
*
* This is accomplished by using the automatic CST (Concrete Syntax Tree) output capabilities
* of chevrotain.
*
* See further details here:
* https://github.com/SAP/chevrotain/blob/master/docs/concrete_syntax_tree.md
*/
const createToken = chevrotain.createToken;
const tokenMatcher = chevrotain.tokenMatcher;
const Lexer = chevrotain.Lexer;
const CstParser = chevrotain.CstParser;
const Identifier = createToken({name: "Identifier" , pattern: /[a-zA-Z_][\w\d_]*/});
const LParen = createToken({name: "LParen" , pattern: /\(/});
const RParen = createToken({name: "RParen" , pattern: /\)/});
const Float = createToken({name: "Float" , pattern: /\d+\.\d+/});
const Integer = createToken({name: "Integer" , pattern: /\d+/, longer_alt: Float});
const String = createToken({name: "String" , pattern: /'.*?'/});
const Comma = createToken({name: "Comma" , pattern: /,/});
const LCurly = createToken({name: "LCurly" , pattern: /\{/});
const RCurly = createToken({name: "RCurly" , pattern: /\}/});
const LBraket = createToken({name: "LBraket" , pattern: /\[/});
const RBraket = createToken({name: "RBraket" , pattern: /\]/});
const Dt = createToken({name: "Dt" , pattern: /dt/, longer_alt: Identifier});
const EqOp = createToken({name: "EqOp" , pattern: /eq/, longer_alt: Identifier});
const NotEqOp = createToken({name: "NotEqOp" , pattern: /!eq/, longer_alt: Identifier});
const LtOp = createToken({name: "LtOp" , pattern: /lt/, longer_alt: Identifier});
const LteOp = createToken({name: "LteOp" , pattern: /lte/, longer_alt: Identifier});
const GtOp = createToken({name: "GtOp" , pattern: /gt/, longer_alt: Identifier});
const GteOp = createToken({name: "GteOp" , pattern: /gte/, longer_alt: Identifier});
const AndOp = createToken({name: "AndOp" , pattern: /and/, longer_alt: Identifier});
const OrOp = createToken({name: "OrOp" , pattern: /or/, longer_alt: Identifier});
const InOp = createToken({name: "InOp" , pattern: /in/, longer_alt: Identifier});
const NotInOp = createToken({name: "NotInOp" , pattern: /!in/, longer_alt: Identifier});
const OrderBy = createToken({name: "OrderBy" , pattern: /order\s+by/, longer_alt: Identifier});
const Asc = createToken({name: "Asc" , pattern: /asc/, longer_alt: Identifier});
const Desc = createToken({name: "Desc" , pattern: /desc/, longer_alt: Identifier});
const Take = createToken({name: "Take" , pattern: /take/, longer_alt: Identifier});
const Skip = createToken({name: "Skip" , pattern: /skip/, longer_alt: Identifier});
// marking WhiteSpace as 'SKIPPED' makes the lexer skip it.
const WhiteSpace = createToken({
name: "WhiteSpace",
pattern: /\s+/,
group: Lexer.SKIPPED
});
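// Note: the order of allTokens matters, keyword tokens (OrderBy, AndOp, Asc, ...)
// must be listed before Identifier so the lexer tries them first.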
const allTokens = [OrderBy, WhiteSpace, Asc, Desc, Take, Skip, NotInOp, InOp, AndOp, OrOp, GteOp, GtOp, LteOp, LtOp, NotEqOp, EqOp, LParen, RParen, LBraket, RBraket, Comma, Float, Integer, Dt, Identifier, LCurly, RCurly, String];
const FilterLexer = new Lexer(allTokens);
// ----------------- parser -----------------
// Note that this is a Pure grammar, it only describes the grammar
// Not any actions (semantics) to perform during parsing.
class FilterPure extends CstParser {
constructor() {
super(allTokens);
const $ = this;
$.RULE("expressions", () => {
$.SUBRULE($.andOrExp);
$.OPTION(() => { $.SUBRULE($.orderBy); })
$.OPTION2(() => { $.SUBRULE($.skip); })
$.OPTION3(() => { $.SUBRULE($.take); })
});
$.RULE("expression", () => {
$.OR([
{ ALT:() => { $.SUBRULE($.compareRule) }},
{ ALT:() => { $.SUBRULE($.inExp) }},
{ ALT:() => { $.SUBRULE($.notInExp) }},
{ ALT:() => { $.SUBRULE($.parentAndOrExp)}}
])
})
$.RULE("take", () => {
$.CONSUME(Take);
$.CONSUME(Integer);
})
$.RULE("skip", () => {
$.CONSUME(Skip);
$.CONSUME(Integer);
})
$.RULE("orderBy", () => {
$.CONSUME(OrderBy);
$.AT_LEAST_ONE(() => {
$.CONSUME(Identifier);
$.OR([
{ALT: () => {$.CONSUME(Asc)}},
{ALT: () => {$.CONSUME(Desc)}},
]);
})
})
$.RULE('array', () => {
$.CONSUME(LBraket);
$.AT_LEAST_ONE_SEP({
SEP: Comma,
DEF: () => {
$.SUBRULE($.atomicExp);
}
})
$.CONSUME(RBraket);
})
$.RULE("inExp", () => {
$.CONSUME(Identifier);
$.CONSUME(InOp);
$.SUBRULE($.array);
})
$.RULE("notInExp", () => {
$.CONSUME(Identifier);
$.CONSUME(NotInOp);
$.SUBRULE($.array);
})
$.RULE("andOrExp", () => {
$.SUBRULE($.expression, { LABEL: "lhs" });
$.MANY(() => {
$.OR([
{ALT: () => { $.CONSUME(AndOp); }},
{ALT: () => { $.CONSUME(OrOp); }}
]);
$.SUBRULE2($.expression,{LABEL: "rhs" });
});
})
$.RULE("parentAndOrExp", () => {
$.CONSUME(LParen);
$.SUBRULE($.andOrExp);
$.CONSUME(RParen);
})
$.RULE("compareRule", () => {
$.CONSUME(Identifier);
$.OR([
{ ALT:() => { $.CONSUME(EqOp) }},
{ ALT:() => { $.CONSUME(NotEqOp) }},
{ ALT:() => { $.CONSUME(GtOp) }},
{ ALT:() => { $.CONSUME(GteOp) }},
{ ALT:() => { $.CONSUME(LtOp) }},
{ ALT:() => { $.CONSUME(LteOp) }},
]);
$.SUBRULE($.atomicExp);
});
$.RULE("atomicExp", () => {
$.OR([
{ ALT:() => { $.CONSUME(Integer) }},
{ ALT:() => { $.CONSUME(Float) }},
{ ALT:() => { $.CONSUME(String) }},
{ ALT:() => { $.SUBRULE($.dateExp) }},
]);
});
$.RULE("dateExp", () => {
$.CONSUME(Dt);
$.CONSUME(LCurly);
$.CONSUME(String);
$.CONSUME(RCurly);
});
// very important to call this after all the rules have been defined.
// otherwise the parser may not work correctly as it will lack information
// derived during the self analysis phase.
this.performSelfAnalysis();
}
}
// wrapping it all together
// reuse the same parser instance.
const parser = new FilterPure([]);
// ----------------- Interpreter -----------------
const BaseCstVisitor = parser.getBaseCstVisitorConstructor()
class FilterInterpreter extends BaseCstVisitor {
constructor() {
super()
// This helper will detect any missing or redundant methods on this visitor
this.validateVisitor()
}
expression(ctx) {
// placeholder, the real implementation comes in the next article
}
atomicExp(ctx) {
if("dateExp" in ctx) {
return this.visit(ctx.dateExp);
}
if ("Integer" in ctx) {
return Number(ctx.Integer[0].image);
}
if ("Float" in ctx) {
return Number(ctx.Float[0].image);
}
return ctx.String[0].image.slice(1, ctx.String[0].image.length - 1)
}
dateExp(ctx) {
return new Date(ctx.String[0].image.slice(1, ctx.String[0].image.length - 1));
}
compareRule(ctx) {
}
expressions(ctx) {
return ctx
}
andOrExp(ctx) {}
array(ctx) {}
inExp(ctx) {}
notInExp(ctx){}
parentAndOrExp(ctx){}
orderBy(ctx){}
take(ctx){}
skip(ctx){}
}
// for the playground to work the returned object must contain these fields
return {
lexer: FilterLexer,
parser: FilterPure,
visitor: FilterInterpreter,
defaultRule: "expressions"
};
}())
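Outside the playground, you would wire everything together roughly like this (a sketch, assuming the same FilterLexer, parser and FilterInterpreter are in scope rather than wrapped in the IIFE above):
const lexResult = FilterLexer.tokenize("age lt 20 order by age desc skip 5 take 2");
parser.input = lexResult.tokens;   // setting .input also resets the parser state
const cst = parser.expressions();  // invoke the top-level rule
console.log(parser.errors);        // should be an empty array
console.log(new FilterInterpreter().visit(cst)); // for now the visitor mostly returns the raw ctx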