A popular PEG-based parsing is nice, easy and beautiful. All the way until you want to define binary expressions with precedence and associativity. Then it's quickly getting ugly.
expr = { [expr1] "*" [expr] => ... }
/ { [expr1] "/" [expr] => ... }
/ [expr1];
expr1 = { [atom] "+" [expr1] => ... }
/ { [atom] "-" [expr1] => ... }
/ [atom];
... and so on
Of course, there are ways of handling left-recursive grammars with PEG (Packrat, in particular), see PEPM08 slides for a quick introduction into one of the most practical methods of doing so. Is it sufficient? Not quite, grammars are still clumsy, and good luck expressing associativity in any meaningful way.
The grammar above is right-associative, and the easiest way to turn it left-associative is to annotate the AST you're constructing and transform it later, without touching the grammar itself. And it's an obviously ugly way of doing things.
Just as an illustration of how bad this accidental right-associativity is: an expression x-1-1
will be parsed as x-(1-1)
, which is clearly not what we want.
Now, there is one little trick that PEG implementation vendors don't want you to know about. It's called Pratt parsing.
It's very easy to understand and even easier to implement. The whole algorithm can be described as follows:
The prefix
parser:
- Try parsing an
atom
- If successful, store as
Left
, otherwise return the failure - Try parsing a
binary
operator, one at a time (e.g., "*", "/", "+", "-"...) - If all attempts failed, return
Left
, otherwise check precedence for the successful operator; if it's less than the current precedence, apply correspondinginfix
parser, otherwise returnLeft
An infix
parser:
- Try parsing right side with a current precedence (using a
prefix
parser), if right-associative, use current precedence minus one. - If successful, return the current result (apply an AST constructor, whatever),
otherwise, backtrack to the beginning of the current
prefix
entry.
Pity it all only makes sense for the binary expressions with precedence and associativity, and pretty much useless for the rest of your grammar. Luckily, this algorithm does not tell us anything about how we're parsing atoms and binary operators. Looks like an opportunity!
Turns out, Pratt algorithm is very easy to embed into pretty much any kind of Packrat or any other PEG implementation - you can still do all the memoisation you want, and the whole Pratt is just another parsing order operator inside your PEG.
The grammar above now turns into something much nicer:
binary expr =
(200) [expr] "*" [expr] => ...
| (200) [expr] "/" [expr] => ...
| (100) [expr] "+" [expr] => ...
| (100) [expr] "-" [expr] => ...
| [atom]
;
atom = { "(" [expr] ")" => ... }
/ ... all the other trivial entries ...
Here, 200
is the precedence of the *
and /
operations, and 100
is a precedence of +
and -
. We can have as many precedence levels as we want, suitable for any language imaginable.
We don't care any more about all the consequences of having a left-recursive grammar, and we get a default left associativity for free. Our infamous x-1-1
is not translated into x-0
any more, with no extra effort!
This approach is successfully used in my optimising PEG parsing system which is a central part of the PFront language, and I highly recommend anyone implementing a PEG parser to do the same. It'll make your life easier, I promise.
Top comments (0)