In this article, we’ll dig rapidly through the code of the parser, which will produce the AST we discussed in the previous article in output.
This will hopefully be the last piece about the parser, and then we’ll move on to the engine and the juicy part, the backtracking system.
So, the parser is contained in the Pyrser class, as we already saw in Part 3, which has the method parse, that takes as input a regular expression in the form of a string, and gives us the corresponding AST in output.
Pyrser structure
Pyrser
structure.
We already spoke about the various functions and what is their purpose in previous articles.
Now we will look closer to the code of some of them. I will not explain what each line does, this isn’t the purpose of this series of articles, but I will rather express some final thoughts after.
If you’re not super interested in the code, feel free to skip reading through it.
Parse Regex Sequence
parse_re_seq
code.
Parse Group
parse_group
code.
Parse Element
parse_el
code.
Final Words
In conclusion, the parser was not as hard to code as hard to debug.
When I first coded it I did it quickly and without having a perfectly clear idea in mind of what I wanted in output.
In fact, the parse_re
function is indeed a leftover of an error:
- in the beginning, the
^
and$
symbols were meant to be “absolute”, meaning that once typed, they would have been applied to every single regex in the sequence - This meant that a regex such as
^a|b$
returned false on test stringcb
, while it should return true because thatb
must be at the beginning isn’t specified in the second regex, but this original implementation maintained the “match start” from the other one - This happened because at the beginning I thought that should be the “right” behavior, and I created the RE node of the AST just for this purpose.
I later realized that this behavior was confusing and didn’t make any sense, so I had to figure out a way to change it, possibly maintaining the RE node, because the engine was already coded and using it.
In the end, parse_re
remained, completely emptied of any behavior, useless, for the sake of compatibility with the — already working — engine.
Apart from this, the parser and the AST does a really good job overall, and coding a TDRDP parser isn’t so hard once you figured out a good grammar, which is no joke for non-trivial projects.
Top comments (0)