It is finally time to write about the engine of the regex engine we’re building.
As we already saw, the parser produces an AST as a result of the parsing of a regular expression.
Now, the engine has to check whether a given string matches or not a regex.
To do so, the engine uses the AST representation of the regex, which extracts the fundamental features/elements of the regular expression, and tries to “find” if the test string has them.
The Engine
The engine is again a Python class, having a fundamental method match, which is a closure containing the engine and all the needed auxiliary methods.
Match signature:
def match(self, re: str, string: str)
Match Semantic
The semantic we will give to the match method is the following:
- it returns on the first occurrence of the regex in the string
- if there is no match, it tries again shifting the input string one character to the right (that is, moving the string one character to the right) until it reaches the end of the string.
That is, if the input string is abac
and the regex is c
, on the first attempt abac
won’t match c
, so match tries again, but shifting abac
one character to the right this time, so tries to match bac
with c
, resulting in another fail, so it tries again with ac
and c
(failing), and, lastly, it tries c
with c
, resulting in a match, thus returning true.
The
match
function so far.
How To Match
match_group
is the most important method of the engine.
It works by iterating through an AST node’s children, and by calling itself recursively when a GroupNode
is found or, with some differences, when an OrNode
is found.
If an OrNode
is found, it checks if the left children (the one in position 0) matches, if it doesn’t, it tries the right one.
When a leaf node is found (ElementNode
, …) it tries to match it.
After every match, the str_i
(the index of the string) is increased of the matched characters.
The string index is set to position 0 (or the first position to consider more in general) at the beginning, and is kind of a “global variable”, but local to the match function.
The mechanism of the recursive calls makes it easy to manage the “point in the tree” and “point in the string” considered. It also allows you to think of each group as a separate entity, making it easier to follow mentally the code execution.
Obviously, this isn’t the whole engine, because as stated in the first article, we will need a backtracking system to assure to explore all the possible paths to match the string.
A little problem
In fact, the semantic of leaf nodes and groups is to match them to the most characters possible by default, and while this is convenient most of the time, there are some edge cases in which this won’t lead to a correct answer.
Let’s examine this example:
If you have a regex a+ab
and a test string aab
:
- the resulting AST will consist of a
GroupNode
, with three children, all leaves - a first
a
to be matched [1, +∞[ times - a second
a
to be matched exactly one time - a final
b
to be matched exactly one time.
During the execution of our match function as presented, this is what will happen:
-
match_group
is called and start iterating on its children - the first
a
element is matched as much as possible, so two timesaa
- then the engine tries to match the second
a
element, but it fails as now the only left character is ab
- false would be returned, but we all know the correct answer is true.
This is because of the semantics we gave to our match function. What we would like the system to do is to “try another way”:
- since the first element could accept 1 or more occurrences of the letter
a
, and it matched two of them, we would like to be able to tell it “hey, please try by matching it one time less” - in this case, the first
a
would be matched one time, leaving enough room to fit the seconda
element, and then also theb
element, our third and last - this process worked, and so the answer would be true, as it should be.
This process is called backtracking and will be the topic of the next article, for now, let’s ignore the problem.
Conclusion
Overall, we covered pretty much everything we needed to cover to explain the behavior of the match method.
The skeleton of our code will look something like this
match
function’s structure.
In the next article, we will cover the interesting topic of backtracking and will complete our regex engine.
Top comments (1)
What regex engine does Python use? dua to protect husband from another woman