In which I explore a totally overkill solution to the problem that isn't Regex: State Machines!
Things mentioned in this post: state machines, list comprehension, pytransitions, string library
The Challenge Part 1
Link to challenge on Advent of Code 2020 website
The challenge is presented as a password parsing problem, you are given a list of strings that have a specific format, and asked to apply some validation rules to them.
The example given was:
1-3 a: abcde
1-3 b: cdefg
2-9 c: ccccccccc
The format is:
- number (low range)
- a dash
- number (high range)
- space
- a letter (the letter in question)
- colon and space
- a string to validate
The validation rule is: The string to validate must contain the letter in question between low
and high
times (inclusive).
There are two parts to this challenge: the parsing, which is the act of taking some input and extracting the useful "bits" of it (spoiler alert: Regex); and the validation which takes those "bits" and applies some rule to them to figure out whether it is considered valid or not. Of these two parts, it is the parsing that is the more complex part.
Reading the file
As usual, the data comes in as one "entry" per line in a text file. I'll dispense with the buildup since I already covered a lot about the file reading in Day 01, and skip straight to the good stuff:
data = open("input.txt").readlines()
Result:
['1-2 x: xpxc\n',
'1-5 b: bwlbbbbcq\n',
'3-5 v: qvjjdhvl\n',
'9-12 t: ttfjvvtgxtctrntnhtt\n',
...
That's it. Nothing else. While it is python to use the context manager with open() as fp:
to open files (for the reasons I mentioned previously), for such a short exercise, I'm throwing caution to the winds in favour of convenience.
Since we want a simple array of strings, the readlines()
function is nice and straightforward to use, though it does end up with some extra new-lines at the end of strings (the \n
characters in the output above), those will get ignored by the code later, so I won't bother removing them.
The naïve solution
I'll be up front and say that this question screams Regex. Any time you have this kind of parsing problem where the data is well-formatted, but not an existing machine-readable format (e.g. XML, JSON, etc.), and not delimited in a regular way (e.g. CSVs), Regex might be the answer.
But, for your entertainment, I'm going to dedicate this post to going in any direction that isn't Regex (part 2 will cover the Regex solution which is substantially shorter).
So, if we aren't allowed Regex, how would we do it? Actually it's not that uncommon a task. When we work with low-level systems (code that needs to run on very little resources, or run very quickly), there is often a need to parse structured data.
Most of the approaches for this kind of data involve read incoming string one character at a time, and take the most appropriate action for each character depending on what it is and what came before. Let's break out one of the many entries of data, and take a look at it:
entry = data[3]
print(entry)
for char in entry:
print(char)
Output
9-12 t: ttfjvvtgxtctrntnhtt
9
-
1
2
t
:
t
t
f
j
v
v
t
g
x
t
c
t
r
n
t
n
h
t
t
This is the kind of data we're dealing with. Note that python is quite happy letting you iterate one character at a time over a string using the all-handy for..in
.
So if I were the machine doing this what would I be doing? maybe something like this:
- I see a 9. That's a number, great, since I'm looking for the first "low" number, I'll capture that
- I see a dash. Ok, that wasn't another number, so the "low" number must be done, let's move onto looking for the "high" number
- I see a 1. That's a number, great, I'll capture that for the "high" number
- I see a 2. That's another number, put it together with the previous one
- I see a space. Ok, this means we're done with the "high" number, moving onto looking for the letter
- I see a "t". Ok, this is my character etc.
As you can see, because the numbers sometimes span multiple characters, we can't simply break it down into nice, predictable characters spans, we have to read each character to decide where it should go and whether we're done with scanning for a value.
In other words, "what state we are in" matters (where state is probably one of "reading the low number", "reading the high number", etc.)
With that in mind we can re-state our algorithm to read more like this:
- When I am in the "reading the low number" state. If I see a number, add this to the low string. If I see a dash character, move onto the "reading the high number" state
- When I am in the "reading the high number" state. If I see a number, add this to the high string. If I see a space character, move onto the "reading the letter" state And so on.
Now that we establish that we need the idea of "state", we can write code that look something like this:
low_str = ""
high_str = ""
letter = ""
password = ""
state = "low"
for char in entry:
print(f"State: {state}, {char}")
if state == "low":
if char in "0123456789":
low_str += char
elif char == "-":
state = "high"
elif state == "high":
if char in "0123456789":
high_str += char
elif char == " ":
state = "letter"
elif state == "letter":
if char in "abcdefghijklmnopqrstuvwxyz":
letter += char
elif char == ":":
state = "password"
elif state == "password":
if char in "abcdefghijklmnopqrstuvwxyz":
password += char
print(f"Low: {low_str}, High: {high_str}, Letter: {letter}, Password: {password}")
Output
State: low, 9
State: low, -
State: high, 1
State: high, 2
State: high,
State: letter, t
State: letter, :
State: password,
State: password, t
State: password, t
State: password, f
State: password, j
State: password, v
State: password, v
State: password, t
State: password, g
State: password, x
State: password, t
State: password, c
State: password, t
State: password, r
State: password, n
State: password, t
State: password, n
State: password, h
State: password, t
State: password, t
State: password,
Low: 9, High: 12, Letter: t, Password: ttfjvvtgxtctrntnhtt
Bit of a mouthful. But, this is an important construct - a "State Machine". State Machines are hugely important in software, as they can be used as the building blocks of complex behaviour based on a set of rules. They're used in everything from what mode the LED panel is on your microwave, to governing the behaviour of robots, to decoding the data coming over your wifi right now (unsurprisingly, just like we are decoding this line of password validation exercise with a state machine, you can also decode network protocols with something similar).
Tidying up the state machine
State machines are so important that I'm not going to move onto Regex. The code above is annoying and fiddly to write, we can tidy it up a bit.
Firstly, those big constants can be removed since python's own string
library already defines them for us:
if char in "abcdefghijklmnopqrstuvwxyz":
Can become:
if char in string.ascii_lowercase:
(and likewise string.digits
)
Secondly, this code is annoyign to maintain because these number checks and rules are all jumbled together in quite a dense bit of code. We can do better to keep it clean and use a bunch of convenience functions to re-use:
def is_number(char):
return char in string.digits
def is_letter(char):
return char in string.ascii_lowercase
def is_separator(char):
return char in " -:"
We can also add some functions to help add letters onto our low_str
and high_str
variables:
low_str = ""
high_str = ""
def ingest_low(char):
low_str += char
def ingest_high(char):
high_str += char
But this isn't great practice to do it exactly like this, because low_str
as defined now are actually global variables that are being mutated inside these functions. That's bad form (a whole topic in itself). We should stick all of the above in a class to keep things encapsulated:
class PasswordParser:
def __init__(self):
self.low_str = ""
self.high_str = ""
self.letter = ""
self.password = ""
def ingest_low(self, char):
self.low_str += char
def ingest_high(self, char):
self.high_str += char
def ingest_letter(self, char):
self.letter += char
def ingest_password(self, char):
self.password += char
def is_number(self, char):
return char in string.digits
def is_letter(self, char):
return char in string.ascii_lowercase
def is_separator(self, char):
return char in " :-"
@property
def low(self):
return int(self.low_str) if self.low_str else 0
@property
def high(self):
return int(self.high_str) if self.high_str else 0
Notice at the end there, I've added some getters as a convenience to convert those strings into ints for convenience.
The state machine code can now look like this:
parser = PasswordParser()
state = "low"
for char in entry:
print(f"State: {state}, {char}")
if state == "low":
if parser.is_number(char):
parser.ingest_low(char)
elif parser.is_separator(char):
state = "high"
elif state == "high":
if parser.is_number(char):
parser.ingest_high(char)
elif parser.is_separator(char):
state = "letter"
elif state == "letter":
if parser.is_letter(char):
parser.ingest_letter(char)
elif parser.is_separator(char):
state = "password"
elif state == "password":
if parser.is_letter(char):
parser.ingest_password(char)
print(f"Low: {parser.low}, High: {parser.high}, Letter: {parser.letter}, Password: {parser.password}")
To be honest, it's not a whole lot neater or more readable, we're sort of half-way through tidying it up. Notice how each state contains two if
checks, one that decides whether to pull in data, and another one to decide whether to switch state.
We can therefore generalize this, and build our own StateMachine class into which we can register the checking function, and the transition function for each state, rather than have to write out a whole bunch of if
statements.
It's a "solved problem"
Here's the thing. Going back to what I said last post, while it's fun and educational to write code, a lot of the problems we encounter in the real-world aren't new, and there are probably libraries that already exist to solve the problem. State Machines are definitely a "solved problem". They're used everywhere!
In Python's ecosystem of libraries, there's a nice one called transitions (which I've used in my robotics work). transitions
(as the name suggests) lets you define your state machines more declaratively in terms of transitions between states, the rules that govern those transitions, and the triggers that cause these transitions to happen.
Being declarative means you can think about state machines in terms of these states and transitions and rules, rather than the nitty gritty of implementation.
Here's what the above code looks like when implemented using transitions
:
from transitions import Machine
parser = PasswordParser()
machine = Machine(parser, ["low", "high", "letter", "password"], initial="low")
machine.add_transition("consume", "low", "low", conditions=["is_number"], after=["ingest_low"])
machine.add_transition("consume", "low", "high", conditions=["is_separator"])
machine.add_transition("consume", "high", "high", conditions=["is_number"], after=["ingest_high"])
machine.add_transition("consume", "high", "letter", conditions=["is_separator"])
machine.add_transition("consume", "letter", "letter", conditions=["is_letter"], after=["ingest_letter"])
machine.add_transition("consume", "letter", "password", conditions=["is_separator"])
machine.add_transition("consume", "password", "password", conditions=["is_letter"], after=["ingest_password"])
As can be seen, instead of being made up of a whole bunch of little if
statements, the state machine is now made up of seven transition rules.
machine.add_transition("consume", "low", "low", conditions=["is_number"], after=["ingest_low"])
Take this line for example. This defines a transition rule from the "low" state back into the "low" state if the condition is "is_number". This is necessary because when we're in the "low" state, receiving a number should keep us in this state since we might need to look for another number. This code also specifies that "ingest_low" is the function to run (with the incoming data) after this transition happens, that function helps collect our numbers together. The first argument "consume" is the trigger that causes this transition rule to attempt to run.
If we were to draw a state transition diagram for the above, it would look something like this:
With all the setup done, the state machine's actual execution is simple:
for char in entry:
print(f"State: {parser.state}, {char}")
parser.consume(char)
print(f"Low: {parser.low}, High: {parser.high}, Letter: {parser.letter}, Password: {parser.password}")
I won't reproduce the output as it's identical to last time.
For each character, we tell the state machine to fire the "consume" trigger, which actually for this machine triggers all the transitions (whether the actual transition happens depends on the transition's states and rules). Internally the machine will go through a similar set of logic to what we defined before, but we didn't have to write a single if
character, and the code is a bit cleaner to look at since it declares "what" needs to be done, and not "how" it needs to be done (the hallmark of "declarative programming")
So there we go. A state-machine based parser. As much fun as state machines are, they are however entirely overkill for this problem, since Regex solves it all very quickly. I'll go over that solution in my next post.
Top comments (1)
Thank you for such interesting and step-by-step solution! It was really helpful for me as a beginner.